I know that I need to initialize a SparkContext to create resilient distributed datasets (RDDs) in PySpark. However, different sources give different code for doing so. To resolve this once and for all, what is the correct code?
1) Code from Tutorials Point: https://www.tutorialspoint.com/pyspark/pyspark_sparkcontext.htm
from pyspark import SparkContext
sc = SparkContext("local", "First App")
2) Code from Apache: https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#resilient-distributed-datasets-rdds
from pyspark import SparkContext, SparkConf
Then, further down the page, there is:
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
These are just two examples. I can list more, but the main problem for me is the lack of uniformity for something so simple and basic. Please help and clarify.
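Both snippets do the same thing. The two positional arguments in SparkContext("local", "First App") are just shorthand for setting the master and the app name on a SparkConf, so the following sketch (the master and app name values are only examples) is interchangeable with either form:

from pyspark import SparkContext, SparkConf

# Shorthand form: positional master and app name
# sc = SparkContext("local", "First App")

# Explicit form: the same two settings placed on a SparkConf
conf = SparkConf().setMaster("local").setAppName("First App")
sc = SparkContext(conf=conf)

Pick whichever reads better to you; there is no functional difference between the two.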
In local[N]:
- N is the maximum number of cores that can be used on the node at any point in time. This mode uses your local host's resources.
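For example, a minimal sketch (the core count of 4 is arbitrary) that runs everything on the local machine with at most four cores:

from pyspark import SparkContext, SparkConf

# Run locally, using at most 4 cores on this machine;
# local[*] would instead use every available core
conf = SparkConf().setMaster("local[4]").setAppName("Local Example")
sc = SparkContext(conf=conf)

# Default number of partitions for operations like parallelize()
print(sc.defaultParallelism)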
In cluster mode (when you specify a master node IP) you can set --executor-cores N, which means that each executor can run at most N tasks at the same time.
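A sketch of the cluster case (the master URL and core count below are placeholders, not values from the question); the same setting can also be passed as --executor-cores 4 on the spark-submit command line:

from pyspark import SparkContext, SparkConf

# Point at a standalone cluster master (placeholder host/port) and
# allow each executor to run up to 4 tasks concurrently
conf = (SparkConf()
        .setMaster("spark://master-host:7077")
        .setAppName("Cluster Example")
        .set("spark.executor.cores", "4"))
sc = SparkContext(conf=conf)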
As for the app name: when you don't specify one, it may be left blank or Spark may generate a name for you. I tried to look at the source code for setAppName() but couldn't find much detail.
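One way to see what name actually ends up being used is to read it back from the running context; a small sketch, assuming a local master:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setMaster("local").setAppName("Named App")
sc = SparkContext(conf=conf)

# Read back the effective application name from the context and its config
print(sc.appName)                          # "Named App"
print(sc.getConf().get("spark.app.name"))  # same value, via the config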