How does Spark prepare executors on Hadoop YARN?

St.Antario

I'm trying to understand the details of how Spark prepares the executors. To do this, I debugged org.apache.spark.executor.CoarseGrainedExecutorBackend and invoked

Thread.currentThread().getContextClassLoader.getResource("")

It points to the following directory:

/hadoop/yarn/local/usercache/_MY_USER_NAME_/appcache/application_1507907717252_15771/container_1507907717252_15771_01_000002/

Looking at the directory I found the following files:

default_container_executor_session.sh
default_container_executor.sh
launch_container.sh
__spark_conf__
__spark_libs__

The question is: who delivers these files to each executor and then runs CoarseGrainedExecutorBackend with the appropriate classpath? What are the scripts? Are they all auto-generated by YARN?

I looked at org.apache.spark.deploy.SparkSubmit, but I didn't find anything useful there.

Jacek Laskowski

Ouch... you're asking for quite a lot of detail on how Spark communicates with cluster managers while requesting resources. Let me give you some information. Keep asking if you want more...


You are using Hadoop YARN as the cluster manager for your Spark applications. Let's focus on this particular cluster manager only (there are others that Spark supports, like Apache Mesos, Spark Standalone, DC/OS and soon Kubernetes, each with its own way of dealing with Spark deployments).

By default, when you submit a Spark application using spark-submit, the application (i.e. the SparkContext it uses, actually) requests three YARN containers: one for the application's ApplicationMaster, which knows how to talk to YARN and requests two more containers for two Spark executors.
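For the record (a minimal sketch, not Spark's own code): if you don't set spark.executor.instances explicitly, Spark on YARN falls back to two executors, which gives exactly the 1 + 2 containers described above. The application name below is made up.

import org.apache.spark.sql.SparkSession

// A sketch only: "yarn-containers-demo" is a made-up name, and master("yarn")
// assumes spark-submit / HADOOP_CONF_DIR are configured for your cluster.
val spark = SparkSession.builder()
  .appName("yarn-containers-demo")
  .master("yarn")
  // If this is omitted, Spark on YARN defaults to 2 executors,
  // plus the 1 container for the ApplicationMaster.
  .config("spark.executor.instances", "2")
  .getOrCreate()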

You could review the official YARN documentation, Apache Hadoop YARN and Hadoop: Writing YARN Applications, to dig deeper into the YARN internals.

When the Spark application is submitted, Spark's ApplicationMaster is submitted to YARN using the YARN "protocol", which requires that the request for the very first YARN container (container 0) use a ContainerLaunchContext that holds all the necessary launch details (see Client.createContainerLaunchContext).
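To give you a feel for it, here's a rough sketch against the plain YARN API, not what Client.createContainerLaunchContext literally does; the classpath and command below are illustrative only.

import java.util.Collections
import org.apache.hadoop.yarn.api.records.{ContainerLaunchContext, LocalResource}
import org.apache.hadoop.yarn.util.Records

// The launch context for container 0 (the ApplicationMaster).
val amContainer = Records.newRecord(classOf[ContainerLaunchContext])

// Files YARN should localize into the container's working directory,
// e.g. __spark_conf__ and __spark_libs__ (population of the map elided here).
val localResources = new java.util.HashMap[String, LocalResource]()
amContainer.setLocalResources(localResources)

// Environment for the container, e.g. a classpath pointing at the localized files.
amContainer.setEnvironment(
  Collections.singletonMap("CLASSPATH", "./__spark_conf__:./__spark_libs__/*"))

// The command that ends up in the container's launch_container.sh.
amContainer.setCommands(Collections.singletonList(
  "$JAVA_HOME/bin/java -server org.apache.spark.deploy.yarn.ApplicationMaster " +
    "1><LOG_DIR>/stdout 2><LOG_DIR>/stderr"))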


who delivers the files to each executor

That's how YARN gets told how to launch the ApplicationMaster for the Spark application. While fulfilling the request for the ApplicationMaster container, YARN downloads the necessary files, which you found in the container's working directory.

That's internal to how any application works on YARN and has (almost) nothing to do with Spark.

The code responsible for this communication is in Spark's Client, especially Client.submitApplication.
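Here is a stripped-down sketch of that YARN-side protocol, using the plain YARN client API rather than Spark's actual Client (the application name is made up, and amContainer is the launch context from the previous sketch):

import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Talk to the ResourceManager.
val yarnClient = YarnClient.createYarnClient()
yarnClient.init(new YarnConfiguration())
yarnClient.start()

// Ask for a new application id and fill in the submission context.
val newApp = yarnClient.createApplication()
val appContext = newApp.getApplicationSubmissionContext
appContext.setApplicationName("spark-on-yarn-demo")
appContext.setAMContainerSpec(amContainer) // launch context for container 0

// Submit; YARN then localizes the resources and starts the ApplicationMaster.
val appId = yarnClient.submitApplication(appContext)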


and then just runs CoarseGrainedExecutorBackend with the appropriate classpath.

Quoting Mastering Apache Spark 2 gitbook:

CoarseGrainedExecutorBackend is a standalone application that is started in a resource container when (...) Spark on YARN’s ExecutorRunnable is started.

ExecutorRunnable is started when Spark on YARN's YarnAllocator schedules it in allocated YARN resource containers.
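Roughly, the idea looks like the sketch below (not ExecutorRunnable's exact code; the JVM options and arguments are illustrative): for each container that YarnAllocator hands out, a launch context is built whose command starts CoarseGrainedExecutorBackend, and the NodeManager is asked to run it.

import java.util.Collections
import org.apache.hadoop.yarn.api.records.{Container, ContainerLaunchContext}
import org.apache.hadoop.yarn.client.api.NMClient
import org.apache.hadoop.yarn.util.Records

def launchExecutor(nmClient: NMClient, container: Container): Unit = {
  val ctx = Records.newRecord(classOf[ContainerLaunchContext])
  // The command that lands in the executor container's launch_container.sh;
  // the real one also sets memory, the driver URL, executor id, cores, etc.
  ctx.setCommands(Collections.singletonList(
    "$JAVA_HOME/bin/java -Xmx2g " +
      "org.apache.spark.executor.CoarseGrainedExecutorBackend " +
      "--driver-url <driverUrl> --executor-id <id> --cores 2 " +
      "1><LOG_DIR>/stdout 2><LOG_DIR>/stderr"))
  // Ask the NodeManager that owns the container to start it.
  nmClient.startContainer(container, ctx)
}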


What are the scripts? Are they all YARN-autogenerated?

Kind of.

Some are prepared by Spark as part of the Spark application submission, while others are YARN-specific.

Enable the DEBUG logging level in your Spark application and you'll see the file transfer.
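For example (a sketch assuming Spark 2.x with its bundled log4j, and the yarn-client deploy mode so the upload happens in the driver JVM; editing conf/log4j.properties works in either deploy mode):

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

// Raise logging for the YARN Client before the SparkContext is created, so the
// upload of __spark_conf__ and __spark_libs__ shows up in the driver logs.
Logger.getLogger("org.apache.spark.deploy.yarn.Client").setLevel(Level.DEBUG)

val spark = SparkSession.builder()
  .appName("debug-yarn-localization") // made-up name
  .master("yarn")
  .getOrCreate()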


You can find more information in the official Spark documentation's Running Spark on YARN and in my Mastering Apache Spark 2 gitbook.

