Spark Basics [RDD Dependencies -- Source Code Analysis]

Table of Contents

  • I RDD Dependencies
    • 1 RDD Lineage
    • 2 RDD Dependencies
    • 3 RDD Stage Division
    • 4 RDD Task Division

I RDD Dependencies

1 RDD Lineage

The relationship between two adjacent RDDs is called a dependency; a chain of multiple consecutive dependencies is called lineage.

RDDs only support coarse-grained transformations, i.e. a single operation applied to a large set of records. The sequence of Lineage steps used to create an RDD is recorded so that lost partitions can be recovered. An RDD's Lineage records its metadata and the transformations applied to it; when part of the RDD's partition data is lost, this information is used to recompute and restore the lost partitions.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {

  val conf = new SparkConf().setMaster("local[*]").setAppName("WordCount")

  val sc = new SparkContext(conf)

  val lines: RDD[String] = sc.textFile("data/word.txt")
  println(lines.toDebugString)
  println("******************")
  /**
   * (2) data/word.txt MapPartitionsRDD[1] at textFile at Spark01_WordCount_Dep.scala:13 []
   * |  data/word.txt HadoopRDD[0] at textFile at Spark01_WordCount_Dep.scala:13 []
   */

  val words: RDD[String] = lines.flatMap(_.split(" "))
  println(words.toDebugString)
  println("******************")
  /**
   * (2) MapPartitionsRDD[2] at flatMap at Spark01_WordCount_Dep.scala:17 []
   * |  data/word.txt MapPartitionsRDD[1] at textFile at Spark01_WordCount_Dep.scala:13 []
   * |  data/word.txt HadoopRDD[0] at textFile at Spark01_WordCount_Dep.scala:13 []
   */

  val wordToOne: RDD[(String, Int)] = words.map((_,1))
  println(wordToOne.toDebugString)
  println("******************")
  /**
   * (2) MapPartitionsRDD[3] at map at Spark01_WordCount_Dep.scala:21 []
   * |  MapPartitionsRDD[2] at flatMap at Spark01_WordCount_Dep.scala:17 []
   * |  data/word.txt MapPartitionsRDD[1] at textFile at Spark01_WordCount_Dep.scala:13 []
   * |  data/word.txt HadoopRDD[0] at textFile at Spark01_WordCount_Dep.scala:13 []
   */

  val wordCount: RDD[(String, Int)] = wordToOne.reduceByKey(_ + _)
  println(wordCount.toDebugString)  // "+-" marks a shuffle (data is written to disk); "(2)" is the number of partitions
  println("******************")
  /**
   * (2) ShuffledRDD[4] at reduceByKey at Spark01_WordCount_Dep.scala:25 []
   * +-(2) MapPartitionsRDD[3] at map at Spark01_WordCount_Dep.scala:21 []
   * |  MapPartitionsRDD[2] at flatMap at Spark01_WordCount_Dep.scala:17 []
   * |  data/word.txt MapPartitionsRDD[1] at textFile at Spark01_WordCount_Dep.scala:13 []
   * |  data/word.txt HadoopRDD[0] at textFile at Spark01_WordCount_Dep.scala:13 []
   */

  wordCount.collect().foreach(println)

  sc.stop()
}

2 RDD Dependencies

A dependency is simply the relationship between two adjacent RDDs.

RDD dependencies fall into two main categories:

  • Narrow dependency (OneToOneDependency): the data of one partition of the upstream (parent) RDD is used exclusively by a single partition of the downstream (child) RDD; the case where the data of several upstream partitions is consumed exclusively by one downstream partition is also a narrow dependency. Analogy: an only child.
  • Wide dependency (ShuffleDependency): the data of one partition of the upstream (parent) RDD is shared by multiple partitions of the downstream (child) RDD.

Because a shuffle scatters partition data and regroups it, shuffle operations create wide dependencies. Analogy: a family with two or three children. The simplified source excerpt after this list shows how Spark defines these dependency types.
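These classes live in Spark core's Dependency.scala (package org.apache.spark). The excerpt below is simplified -- annotations, ClassTag bounds, and the serializer/aggregator/shuffle-handle plumbing are omitted -- and is only meant to show the shape of the API:

// Simplified excerpt of org.apache.spark.Dependency.scala (details omitted)
abstract class Dependency[T] extends Serializable {
  def rdd: RDD[T]  // the parent (upstream) RDD
}

// Narrow dependency: each child partition depends on a bounded set of parent partitions
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  // which parent partitions a given child partition reads from
  def getParents(partitionId: Int): Seq[Int]
  override def rdd: RDD[T] = _rdd
}

// One-to-one: child partition i reads exactly parent partition i
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}

// Wide (shuffle) dependency: the parent's key-value data is repartitioned
// by a Partitioner, so a shuffle is required
class ShuffleDependency[K, V, C](
    _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner
    /* serializer, aggregator, mapSideCombine, ... omitted */)
  extends Dependency[Product2[K, V]] {
  override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]
}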

def main(args: Array[String]): Unit = {

  val conf = new SparkConf().setMaster("local[*]").setAppName("WordCount")

  val sc = new SparkContext(conf)

  val lines: RDD[String] = sc.textFile("data/word.txt")
  println(lines.dependencies)
  println("******************")
  // List(org.apache.spark.OneToOneDependency@xxxxxxxx)

  sc.stop()
}
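To make the narrow/wide distinction visible programmatically, a small helper such as the sketch below (written for illustration here, not part of the original listing) can pattern-match on an RDD's dependencies:

import org.apache.spark.{OneToOneDependency, ShuffleDependency}
import org.apache.spark.rdd.RDD

// Sketch: report whether each dependency of an RDD is narrow (one-to-one) or wide (shuffle)
def describeDependencies(rdd: RDD[_]): Unit = {
  rdd.dependencies.foreach {
    case _: ShuffleDependency[_, _, _] => println(s"RDD ${rdd.id}: wide (shuffle) dependency")
    case _: OneToOneDependency[_]      => println(s"RDD ${rdd.id}: narrow (one-to-one) dependency")
    case other                         => println(s"RDD ${rdd.id}: ${other.getClass.getSimpleName}")
  }
}

Applied to the WordCount RDDs above, describeDependencies(wordToOne) reports a narrow (one-to-one) dependency, while describeDependencies(wordCount) reports the wide (shuffle) dependency introduced by reduceByKey.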
