【Spark SQL】Solving the column-selection problem after a full join of two DataFrames

I've recently been using Spark SQL to process data at work, and ran into a problem when full joining two tables. There isn't much material online about Spark SQL full joins, but after some Googling I found the root of the issue. I'm writing it down here both for others and for my own review.

The data from work is confidential, so I'll use two toy DataFrames instead.

1. First, create the DataFrames:

// toDF needs the implicits that spark-shell imports automatically;
// in a standalone program, add: import spark.implicits._
val left = Seq((0, "zero"), (1, "one")).toDF("id", "left")
val right = Seq((0, "zero"), (2, "two"), (3, "three")).toDF("id", "right")

2. Full join the two DataFrames:

scala> left.join(right, Seq("id"), "fullouter").show
+---+----+-----+
| id|left|right|
+---+----+-----+
|  1| one| null|
|  3|null|three|
|  2|null|  two|
|  0|zero| zero|
+---+----+-----+

Note: the Seq here is a parameter of the join method; it holds the common key column(s) that the two tables are joined on.

3. Run explain on this query:

scala> left.join(right, Seq("id"), "fullouter").explain
== Physical Plan ==
*Project [coalesce(id#50, id#60) AS id#85, left#51, right#61]
+- SortMergeJoin [id#50], [id#60], FullOuter
   :- *Sort [id#50 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#50, 200)
   :     +- LocalTableScan [id#50, left#51]
   +- *Sort [id#60 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(id#60, 200)
         +- LocalTableScan [id#60, right#61]
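The SortMergeJoin node in the plan sorts both inputs on the join key and walks them in lockstep. As a rough illustration only (not Spark's actual implementation, and simplified to unique keys), a full-outer sort-merge join can be sketched in plain Scala:

```scala
// Minimal full-outer sort-merge join on unique integer keys.
// Simplified sketch of what SortMergeJoin/FullOuter does conceptually.
def fullOuterMergeJoin(
    left: Seq[(Int, String)],
    right: Seq[(Int, String)]): List[(Int, Option[String], Option[String])] = {
  val l = left.sortBy(_._1)
  val r = right.sortBy(_._1)
  val out = scala.collection.mutable.ListBuffer.empty[(Int, Option[String], Option[String])]
  var i = 0; var j = 0
  while (i < l.length && j < r.length) {
    if (l(i)._1 == r(j)._1)      { out += ((l(i)._1, Some(l(i)._2), Some(r(j)._2))); i += 1; j += 1 }
    else if (l(i)._1 < r(j)._1)  { out += ((l(i)._1, Some(l(i)._2), None)); i += 1 }  // only on the left
    else                         { out += ((r(j)._1, None, Some(r(j)._2))); j += 1 }  // only on the right
  }
  // drain whichever side still has rows
  while (i < l.length) { out += ((l(i)._1, Some(l(i)._2), None)); i += 1 }
  while (j < r.length) { out += ((r(j)._1, None, Some(r(j)._2))); j += 1 }
  out.toList
}

val joined = fullOuterMergeJoin(
  Seq((0, "zero"), (1, "one")),
  Seq((0, "zero"), (2, "two"), (3, "three")))
// → (0,zero,zero), (1,one,None), (2,None,two), (3,None,three), matching the table above
```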

As the plan shows, the Project [coalesce(id#50, id#60) AS id#85, left#51, right#61] already applies coalesce to the common key "id", so when selecting columns from the joined result there is no need to check yourself which side's key is null.
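What that coalesce does per row can be mimicked in plain Scala (no Spark needed): for every key present on either side, the output key is taken from whichever side has it, and only the missing side's payload is null:

```scala
// Plain-Scala sketch of Project [coalesce(id#50, id#60), left, right]:
// the key column itself is the coalesced value, never null.
val leftRows  = Map(0 -> "zero", 1 -> "one")
val rightRows = Map(0 -> "zero", 2 -> "two", 3 -> "three")
val rows = (leftRows.keySet ++ rightRows.keySet).toList.sorted.map { id =>
  (id, leftRows.get(id).orNull, rightRows.get(id).orNull)
}
```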

For example:

dataframe1: id, name, age
dataframe2: id, name, value

val result = dataframe1.as("d1").join(dataframe2.as("d2"), Seq("id","name"), "full")
    .selectExpr("id","name","d1.age","d2.value")

Here the common keys, id and name, must not be qualified with an alias; writing them as below is wrong. This is different from Hive SQL.

Writing it this way throws an error:

val result = dataframe1.as("d1").join(dataframe2.as("d2"), Seq("id","name"), "full")
    .selectExpr("coalesce(d1.id,d2.id)","coalesce(d1.name,d2.name)","d1.age","d2.value")
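The result of the working form above can be modeled in plain Scala (with hypothetical sample rows; the real column types come from your DataFrames): the pair (id, name) is the already-coalesced key, so it carries no d1/d2 side, while the non-key columns do:

```scala
// Plain-Scala model of selectExpr("id","name","d1.age","d2.value")
// after the two-key full outer join; sample rows are hypothetical.
case class D1(id: Int, name: String, age: Int)
case class D2(id: Int, name: String, value: String)
val d1 = Seq(D1(1, "a", 20), D1(2, "b", 30))
val d2 = Seq(D2(1, "a", "x"), D2(3, "c", "y"))
// all (id, name) keys seen on either side = the coalesced key column pair
val keys = (d1.map(r => (r.id, r.name)) ++ d2.map(r => (r.id, r.name))).distinct
val result = keys.map { case (id, name) =>
  (id, name,
    d1.find(r => r.id == id && r.name == name).map(_.age),    // d1.age
    d2.find(r => r.id == id && r.name == name).map(_.value))  // d2.value
}
```

In Spark itself, the fix is simply the working form shown earlier: select the bare id and name, and qualify only the non-shared columns with d1/d2.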
