Spark-Core (Accumulators)

Accumulators

How it works

Accumulators are used to aggregate variable values from the Executors back to the Driver. For a variable defined in the Driver program, every Task running on the Executors gets its own fresh copy of that variable; after each task has updated its copy, the copies are sent back to the Driver and merged.

val rdd = sparkContext.makeRDD(List(1, 2, 3, 4, 5))
// Declare the accumulator on the Driver
val sum = sparkContext.longAccumulator("sum")
rdd.foreach(
  num => {
    // Update the accumulator on the Executor side
    sum.add(num)
  }
)
// Read the merged value back on the Driver
println("sum = " + sum.value)

Implementing WordCount with a custom accumulator

Create the custom accumulator

import org.apache.spark.util.AccumulatorV2

import scala.collection.mutable

// Custom accumulator: IN type is String (a word),
// OUT type is a mutable Map of word -> count
class WordCountAccumulator extends AccumulatorV2[String, mutable.Map[String, Long]] {

  var map: mutable.Map[String, Long] = mutable.Map()

  // The accumulator is "zero" when no word has been counted yet
  override def isZero: Boolean = map.isEmpty

  override def copy(): AccumulatorV2[String, mutable.Map[String, Long]] = new WordCountAccumulator

  override def reset(): Unit = map.clear()

  // Called on the Executors: count one occurrence of the word
  override def add(word: String): Unit = {
    map(word) = map.getOrElse(word, 0L) + 1L
  }

  // Called on the Driver: fold the other task's map into this one
  override def merge(other: AccumulatorV2[String, mutable.Map[String, Long]]): Unit = {
    val map1 = map
    val map2 = other.value
    map = map1.foldLeft(map2)(
      (innerMap, kv) => {
        innerMap(kv._1) = innerMap.getOrElse(kv._1, 0L) + kv._2
        innerMap
      }
    )
  }

  override def value: mutable.Map[String, Long] = map
}

Use the custom accumulator:

val rdd = sparkContext.makeRDD(
  List("spark", "scala", "spark hadoop", "hadoop")
)
// Create the accumulator and register it with the SparkContext
val acc = new WordCountAccumulator
sparkContext.register(acc)

rdd.flatMap(_.split(" ")).foreach(
  word => acc.add(word)
)
println(acc.value)
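For this input the accumulator ends up holding the counts spark -> 2, hadoop -> 2, scala -> 1 (the printed map order may vary).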

Broadcast Variables

How it works

Broadcast variables are used to distribute large objects efficiently. A large read-only value is sent to every worker node once so that one or more Spark operations can use it. For example, if your application needs to ship a large read-only lookup table to all nodes, broadcast variables are a natural fit. Without them, when the same variable is used in multiple parallel operations, Spark sends a separate copy of it with every task.

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object demo01 {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("broadcast")
    val sparkContext = new SparkContext(sparkConf)

    val rdd1 = sparkContext.makeRDD(List(("a", 1), ("b", 2), ("c", 3), ("d", 4)), 4)
    val list = List(("a", 4), ("b", 5), ("c", 6), ("d", 7))
    // Broadcast the read-only list once instead of shipping it with every task
    val broadcast: Broadcast[List[(String, Int)]] = sparkContext.broadcast(list)
    val resultRDD: RDD[(String, (Int, Int))] = rdd1.map {
      case (key, num) => {
        var num2 = 0
        // Look up the matching value in the broadcast list
        for ((k, v) <- broadcast.value) {
          if (k == key) {
            num2 = v
          }
        }
        (key, (num, num2))
      }
    }
    resultRDD.collect().foreach(println)
    sparkContext.stop()
  }
}
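collect() brings the four joined pairs back to the Driver, so the program prints (a,(1,4)), (b,(2,5)), (c,(3,6)) and (d,(4,7)).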
