Spark RDD basic transformation operations -- the distinct operation

4. The distinct operation

Deduplicate the user data in the RDD by name.

scala>  val rddData = sc.parallelize(Array("Alice","Nick","Alice","Kotlin","Catalina","Catalina"),3)
rddData: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[9] at parallelize at <console>:42

scala> rddData.collect
res4: Array[String] = Array(Alice, Nick, Alice, Kotlin, Catalina, Catalina)

scala> val rddData2 = rddData.distinct
rddData2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at distinct at <console>:44

scala> rddData2.collect
res5: Array[String] = Array(Kotlin, Alice, Catalina, Nick)

Notes:
sc.parallelize(Array("Alice","Nick","Alice","Kotlin","Catalina","Catalina"),3): creates an RDD of names and splits it into 3 partitions.
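The partition count can be verified in the same spark-shell session with getNumPartitions (the res label shown here may differ in your session):

scala> rddData.getNumPartitions
res6: Int = 3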

rddData.distinct: removes duplicate elements. (Under the hood, distinct is a wrapper around the map and reduceByKey methods.)
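As a sketch of that idea (assuming the same spark-shell session; rddData3 is a name chosen here for illustration), the equivalent map/reduceByKey pipeline maps each element to a key-value pair, keeps one value per key, then drops the keys:

scala> val rddData3 = rddData.map(x => (x, null)).reduceByKey((x, _) => x).map(_._1)

scala> rddData3.collect

The result contains the same four distinct names, though their order may differ from rddData2.collect, because reduceByKey reshuffles the data across partitions.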
