An Analysis of the spark-submit Job Submission Process

https://blog.csdn.net/u013332124/article/details/91456422

1. The spark-submit script

The spark-submit script itself is very short:

# If SPARK_HOME is not set, source find-spark-home to locate it
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0
# Pass all arguments straight through to spark-class
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

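The last line hands off to spark-class. This is not unique to spark-submit: almost every Spark service is ultimately started through spark-class. The spark-class script is not long either: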

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

. "${SPARK_HOME}"/bin/load-spark-env.sh

# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ "$(command -v java)" ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

# Find Spark jars.
if [ -d "${SPARK_HOME}/jars" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

if [ ! -d "$SPARK_JARS_DIR" ] && [ -z "$SPARK_TESTING$SPARK_SQL_TESTING" ]; then
  echo "Failed to find Spark jars directory ($SPARK_JARS_DIR)." 1>&2
  echo "You need to build Spark with the target \"package\" before running this program." 1>&2
  exit 1
else
  LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
fi

# Add the launcher build dir to the classpath if requested.
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi

# For tests
if [[ -n "$SPARK_TESTING" ]]; then
  unset YARN_CONF_DIR
  unset HADOOP_CONF_DIR
fi

# Invoke the launcher Main class to build the final command
build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

# Turn off posix mode since it does not allow process substitution
set +o posix
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}

# Certain JVM failures result in errors being printed to stdout (instead of stderr), which causes
# the code that parses the output of the launcher to get confused. In those cases, check if the
# exit code is an integer, and if it's not, handle it as a special error case.
if ! [[ $LAUNCHER_EXIT_CODE =~ ^[0-9]+$ ]]; then
  echo "${CMD[@]}" | head -n-1 1>&2
  exit 1
fi

if [ $LAUNCHER_EXIT_CODE != 0 ]; then
  exit $LAUNCHER_EXIT_CODE
fi

CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"

spark-class passes its arguments to the org.apache.spark.launcher.Main class, reads back the new command that Main prints, and then runs that command with exec.
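The hand-off between Main and the script relies on a small convention: every argument of the generated command is printed followed by a NUL byte, and build_command appends the launcher's exit code as one last NUL-terminated field. The sketch below reproduces that parsing loop in isolation. Note that fake_launcher, the /opt/spark path and the sample arguments are made up for illustration; they stand in for the real JVM call.

```bash
#!/usr/bin/env bash
# Stand-in for org.apache.spark.launcher.Main (hypothetical): prints each
# argument of the final command terminated by a NUL byte.
fake_launcher() {
  printf '%s\0' java -cp '/opt/spark/jars/*' org.apache.spark.deploy.SparkSubmit "$@"
}

# Same shape as build_command in spark-class: run the launcher, then append
# its exit code as one more NUL-terminated field.
build_command() {
  fake_launcher "$@"
  printf '%d\0' $?
}

# Read the NUL-delimited fields into an array. Arguments that contain spaces
# survive intact because NUL is the only separator.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command --class Foo "my app.jar")

LAST=$(( ${#CMD[@]} - 1 ))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}   # the last field is the exit code
CMD=("${CMD[@]:0:$LAST}")          # everything before it is the command

printf 'exit code: %s\n' "$LAUNCHER_EXIT_CODE"
printf 'arg: [%s]\n' "${CMD[@]}"
```

Using NUL as the separator is what lets an argument such as "my app.jar" come back as a single array element instead of being split on the space.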

For example, suppose we run the following spark-submit command:

spark-submit --queue up --deploy-mode cluster --master yarn --class org.apache.spark.examples.SparkPi /www/harbinger-spark/examples/jars/spark-examples_2.11-2.1.0.jar 10

After Main has processed it, it becomes the following command:

/www/jdk1.8.0_51/bin/java -cp /www/harbinger-spark/conf/:/www/harbinger-spark/jars/*:/www/harbinger-hadoop/etc/hadoop/ -Xmx52m org.apache.spark.deploy.SparkSubmit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --queue up /www/harbinger-spark/examples/jars/spark-examples_2.11-2.1.0.jar 10

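So we end up back where we started: the SparkSubmit class is still invoked through a plain java command.

Why, then, does Spark not run SparkSubmit directly, instead of taking this detour through Main to build a command and then running it?

2. What the Main class does

For spark-submit, the command is built mainly by SparkSubmitCommandBuilder#buildSparkSubmitCommand(). Here is the source: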

  private List<String> buildSparkSubmitCommand(Map<String, String> env)
      throws IOException, IllegalArgumentException {
    // Load the configuration, including values from the properties file
    Map<String, String> config = getEffectiveConfig();
    boolean isClientMode = isClientMode(config);
    // Pick up the user-specified driver classpath (client mode only)
    String extraClassPath = isClientMode ? config.get(SparkLauncher.DRIVER_EXTRA_CLASSPATH) : null;

    List<String> cmd = buildJavaCommand(extraClassPath);
    // Take Thrift Server as daemon
    if (isThriftServer(mainClass)) {
      addOptionString(cmd, System.getenv("SPARK_DAEMON_JAVA_OPTS"));
    }
    addOptionString(cmd, System.getenv("SPARK_SUBMIT_OPTS"));

    // We don't want the client to specify Xmx. These have to be set by their corresponding
    // memory flag --driver-memory or configuration entry spark.driver.memory
    String driverExtraJavaOptions = config.get(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS);
    if (!isEmpty(driverExtraJavaOptions) && driverExtraJavaOptions.contains("Xmx")) {
      String msg = String.format("Not allowed to specify max heap(Xmx) memory settings through " +
                   "java options (was %s). Use the corresponding --driver-memory or " +
                   "spark.driver.memory configuration instead.", driverExtraJavaOptions);
      throw new IllegalArgumentException(msg);
    }

    if (isClientMode) {
      String tsMemory =
        isThriftServer(mainClass) ? System.getenv("SPARK_DAEMON_MEMORY") : null;
      String memory = firstNonEmpty(tsMemory, config.get(SparkLauncher.DRIVER_MEMORY),
        System.getenv("SPARK_DRIVER_MEMORY"), System.getenv("SPARK_MEM"), DEFAULT_MEM);
      cmd.add("-Xmx" + memory);
      addOptionString(cmd, driverExtraJavaOptions);
      mergeEnvPathList(env, getLibPathEnvName(),
        config.get(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH));
    }

    cmd.add("org.apache.spark.deploy.SparkSubmit");
    cmd.addAll(buildSparkSubmitArgs());
    return cmd;
  }

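What it does is essentially read various configuration sources and append arguments to the command accordingly. In other words, it post-processes the command.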
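The driver heap size in the client-mode branch is a good illustration of this kind of processing: the first non-empty candidate wins. Below is a minimal shell sketch of that lookup order. The function names are our own, the Thrift Server branch is left out, and the "1g" fallback stands in for DEFAULT_MEM (an assumption about its value).

```bash
# Print the first non-empty argument, in the spirit of firstNonEmpty.
first_non_empty() {
  local candidate
  for candidate in "$@"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# $1 is the value of spark.driver.memory from the configuration (may be empty).
# Lookup order: configuration, SPARK_DRIVER_MEMORY, SPARK_MEM, then the default.
resolve_driver_memory() {
  local conf_driver_memory="$1"
  first_non_empty "$conf_driver_memory" \
    "${SPARK_DRIVER_MEMORY:-}" "${SPARK_MEM:-}" "1g"   # "1g": assumed default
}

# The result is what would follow -Xmx on the generated command line.
echo "-Xmx$(resolve_driver_memory "")"
```

For instance, with SPARK_DRIVER_MEMORY=2g exported and no spark.driver.memory configured, the sketch resolves to 2g; a configured spark.driver.memory takes precedence over both environment variables.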

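Appending arguments could also be done directly in shell, but this step has to read configuration files, which is awkward in shell. Other services go through Main as well, so the shared logic can be factored out in one place. The Main class therefore exists mainly to process and transform commands.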

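For some Spark services, service-level parameters such as the heap size are set by having the Main class read the relevant environment variables. For example, for SparkHistoryServer, Main reads the SPARK_HISTORY_OPTS environment variable and adds its value to the command that starts the server. Other services work similarly. These environment variables can be set in conf/spark-env.sh, which "${SPARK_HOME}"/bin/load-spark-env.sh loads; spark-class sources that script before calling Main.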
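As a concrete illustration, a conf/spark-env.sh fragment for the history server might look like the sketch below. The values are examples only, not recommendations.

```bash
# conf/spark-env.sh (sketch): sourced via load-spark-env.sh before the
# launcher runs, so the launcher sees these variables in its environment.

# Extra JVM options applied only to the history server (illustrative value).
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080"

# Heap size for Spark daemons such as the history server (illustrative value).
export SPARK_DAEMON_MEMORY="2g"
```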

3. How the SparkSubmit class submits a job

SparkSubmit's job is to submit the application for execution. Here we look at submission in YARN mode.

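The overall flow is fairly easy to follow. Spark collects the ApplicationMaster's launch context (its launch command, resource files, environment variables, and so on), connects to YARN, and submits the ApplicationMaster through yarnClient. It then keeps polling YARN for the application's state until the application finishes.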
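The final polling step can be pictured with a small loop. This is only a sketch of the idea, not Spark's code: next_app_state is a hypothetical stub that replays a canned list of states, where a real client would ask the ResourceManager for an application report. The state names are YARN application states.

```bash
# Canned sequence of states standing in for real YARN reports (hypothetical).
STATES="ACCEPTED RUNNING RUNNING FINISHED"

# Hypothetical stub: takes the next state off STATES and stores it in APP_STATE.
next_app_state() {
  APP_STATE="${STATES%% *}"
  case "$STATES" in
    *" "*) STATES="${STATES#* }" ;;   # drop the state we just consumed
  esac
}

# Poll until the application reaches a terminal state.
wait_for_app() {
  while :; do
    next_app_state
    echo "state: $APP_STATE"
    case "$APP_STATE" in
      FINISHED|FAILED|KILLED) return 0 ;;
    esac
    sleep "${POLL_INTERVAL:-1}"       # wait before asking again
  done
}

# Usage: POLL_INTERVAL=0 wait_for_app
```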

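Because the whole process involves a lot of code, we will only look at a few key points.

How the connection to the ResourceManager is established

In YARN mode, Spark reads the configuration files in the directory named by the HADOOP_CONF_DIR or YARN_CONF_DIR environment variable. If neither variable is set, spark-submit fails with: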

Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
    at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:256)
    at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:233)
    at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:110)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

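Spark mainly needs three files from that directory: core-site.xml, yarn-site.xml, and hdfs-site.xml.

core-site.xml holds Hadoop's core configuration. yarn-site.xml is read mainly to obtain the ResourceManager address, after which the connection can be established over RPC. hdfs-site.xml is needed mainly for uploading the required resource files to HDFS.

So running spark-submit does not actually require a full Hadoop installation. It is enough to put these three configuration files in place and point HADOOP_CONF_DIR or YARN_CONF_DIR at their directory.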
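On a client machine without Hadoop installed, a quick sanity check before exporting the variable could look like the following sketch. check_hadoop_conf_dir is a helper of our own, not something Spark ships, and /etc/hadoop-client/conf is an assumed location.

```bash
# Verify that a directory holds the three client-side Hadoop config files.
# Returns 0 when all are present, 1 otherwise (hypothetical helper).
check_hadoop_conf_dir() {
  local dir="$1" f missing=0
  for f in core-site.xml yarn-site.xml hdfs-site.xml; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (the path is an assumption):
# if check_hadoop_conf_dir /etc/hadoop-client/conf; then
#   export HADOOP_CONF_DIR=/etc/hadoop-client/conf
# fi
```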

The code that submits applications to YARN lives in the resource-managers/yarn directory of the Spark source tree. When building with Maven, it is only packaged if you pass -Pyarn.

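Priority of Spark job configuration

Spark offers three ways to set a parameter. From lowest to highest priority, they are:

Setting it in the spark-defaults.conf file
Passing it as an argument to spark-submit
Setting it directly in code through SparkConf methods

For example, if spark-defaults.conf sets spark.executor.cores = 1 but spark-submit is run with --executor-cores 2, the executors actually get 2 cores; the value in spark-defaults.conf is overridden.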
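The override in this example can be mimicked with a few lines of shell. This is a toy model of the first two priority levels, not Spark's resolution code: lookup_default and effective_executor_cores are hypothetical helpers, and the parser only understands whitespace-separated "key value" lines.

```bash
# Look up a key in a spark-defaults.conf-style file. Simplified: only
# whitespace-separated "key value" lines are handled, and the last match wins.
lookup_default() {
  awk -v key="$2" '$1 == key { v = $2 } END { if (v != "") print v }' "$1"
}

# $1: path to the defaults file
# $2: value given on the spark-submit command line (empty if not given)
effective_executor_cores() {
  local defaults_file="$1" submit_value="$2" value
  value="$(lookup_default "$defaults_file" spark.executor.cores)"
  if [ -n "$submit_value" ]; then
    value="$submit_value"   # the command-line value overrides the file
  fi
  printf '%s\n' "$value"
}
```

With a file containing "spark.executor.cores 1", calling effective_executor_cores with a submit value of 2 yields 2, and calling it with an empty submit value falls back to 1.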

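In some cases, however, only spark-defaults.conf or the spark-submit arguments take effect, and setting the value in code does nothing. For example, in client mode spark.driver.extraClassPath must be applied at the moment the driver JVM starts, so setting it through SparkConf has no effect.

Another case: suppose spark-submit sets the application name to "a" while SparkConf in the code sets it to "b". The YARN web UI still shows the name "a"; it is not overridden. In SparkHistoryServer, however, the application appears as "b". The reason is that when Spark submits the application to YARN the driver has not started yet, so the spark.app.name it sees is still the "a" from spark-submit. Once the application actually runs, spark.app.name becomes "b".

So although most settings follow the priority order above, when a setting does not seem to take effect you still need to analyze the specific case.

To locate spark-defaults.conf, Spark first checks the SPARK_CONF_DIR environment variable and reads spark-defaults.conf from that directory. If SPARK_CONF_DIR is not set, it reads the file from the SPARK_HOME/conf directory. If SPARK_HOME is not set either, an error is raised.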
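The lookup order just described can be written down as a short function. This is a sketch of the rule as stated here rather than Spark's own code, and find_spark_defaults is a name of our choosing.

```bash
# Print the path where spark-defaults.conf is expected, following the order:
# SPARK_CONF_DIR first, then SPARK_HOME/conf, otherwise fail.
find_spark_defaults() {
  local conf_dir
  if [ -n "${SPARK_CONF_DIR:-}" ]; then
    conf_dir="$SPARK_CONF_DIR"
  elif [ -n "${SPARK_HOME:-}" ]; then
    conf_dir="$SPARK_HOME/conf"
  else
    echo "neither SPARK_CONF_DIR nor SPARK_HOME is set" >&2
    return 1
  fi
  printf '%s\n' "$conf_dir/spark-defaults.conf"
}
```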

How client mode really runs

The application submission context that Spark sends includes a command parameter, which tells YARN how to start the ApplicationMaster. In cluster mode, the class launched as the ApplicationMaster is org.apache.spark.deploy.yarn.ApplicationMaster, whereas in client mode it is org.apache.spark.deploy.yarn.ExecutorLauncher.

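In fact, ExecutorLauncher's main method simply calls ApplicationMaster's main method. Then, in ApplicationMaster#run(), if the application is in client mode, it connects to the Driver running on the client machine. From then on, it requests or releases Container resources according to the Driver's commands (that is, its RPC requests).

I used to think that in client mode the Driver is the ApplicationMaster, with the ApplicationMaster simply running on the client machine. That is not the case. In client mode the Driver runs on the client, while the ApplicationMaster still runs in a YARN Container; it is just that this ApplicationMaster is only responsible for resource scheduling.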