Processing Structured Data with Spark SQL
Likewise, Spark SQL's structured-data processing splits into two broad groups of APIs: transformations and actions.
Transformation API list

| Operation | Description |
| --- | --- |
| select | Equivalent to the SELECT clause in traditional SQL |
| selectExpr | select with SQL expressions |
| filter / where | Filter rows by a condition |
| distinct / dropDuplicates | Remove duplicate rows |
| sort / orderBy | Sort the result set |
| limit | Limit the number of rows in the output |
| union | Combine two result sets |
| withColumn | Add a new column to the output |
| withColumnRenamed | Rename an existing output column |
| drop | Drop one or more output columns |
| sample | Take a random sample of the result set |
| randomSplit | Split the result set into several parts |
| join | Join two result sets (see the sketch after this table) |
| groupBy | Group the result set for aggregation (see the sketch after this table) |
| describe | Show summary statistics of the result set |
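join and groupBy appear in the table but do not get a walkthrough of their own below, so here is a minimal sketch. The two DataFrames and the column names (Name, DeptId, Salary, DeptName) are made up for illustration:

val spark = SparkSession.builder().config(conf).getOrCreate();
import org.apache.spark.sql.functions.avg
val emp = spark.createDataFrame(List(("Jason",1,1000),("Tom",2,2000),("Mike",1,1500))).toDF("Name","DeptId","Salary");
val dept = spark.createDataFrame(List((1,"Dev"),(2,"DBA"))).toDF("DeptId","DeptName");
// inner join on the shared DeptId column, then average salary per department
emp.join(dept, "DeptId").groupBy("DeptName").agg(avg("Salary").as("AvgSalary")).show();

This should print one row per department (Dev: 1250.0, DBA: 2000.0).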
The examples below work with the output columns. There are five common ways to refer to a column:

| Form | Example |
| --- | --- |
| "" (plain string) | "columnName" |
| col | col("columnName") |
| column | column("columnName") |
| $ | $"columnName" |
| ' (tick mark) | 'columnName |

The first three forms are used in the example that follows; the $ and tick-mark forms are sketched right after it.
Column reference forms
val spark = SparkSession.builder().config(conf).getOrCreate();
import org.apache.spark.sql.functions.{col, column}
val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
df.select("Name").show();
df.select(col("Name")).show();
df.select(column("Name")).show();
Output:
+-----+
| Name|
+-----+
|Jason|
| Tom|
+-----+
+-----+
| Name|
+-----+
|Jason|
| Tom|
+-----+
+-----+
| Name|
+-----+
|Jason|
| Tom|
+-----+
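A minimal sketch of the $ and tick-mark forms, which both rely on import spark.implicits._, against the same DataFrame:

val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
// $"Name" and 'Name both resolve to the Name column through the implicits
df.select($"Name").show();
df.select('Name).show();

Both calls print the same Name column as the three forms above.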
select(columns)
val spark = SparkSession.builder().config(conf).getOrCreate();
val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
df.select("Name","age").show();
Output:
+-----+---+
| Name|age|
+-----+---+
|Jason| 34|
| Tom| 20|
+-----+---+
selectExpr(expressions)
val spark = SparkSession.builder().config(conf).getOrCreate();
val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
df.selectExpr("*","(2019-age) as Birth_Year").show();
Output:
+-----+---+----------+
| Name|age|Birth_Year|
+-----+---+----------+
|Jason| 34| 1985|
| Tom| 20| 1999|
+-----+---+----------+
Calling the built-in SQL count and distinct functions
selectExpr(expressions)
val spark = SparkSession.builder().config(conf).getOrCreate();
val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
df.selectExpr("count(distinct(Name)) as count").show();
Output:
+-----+
|count|
+-----+
| 2|
+-----+
filter(condition), where(condition)
val spark = SparkSession.builder().config(conf).getOrCreate();
val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
df.filter("age < 30").show();
Output:
+----+---+
|Name|age|
+----+---+
| Tom| 20|
+----+---+
val spark = SparkSession.builder().config(conf).getOrCreate();
val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
df.filter("age == 34 and Name == 'Jason'").show();
Output:
+-----+---+
| Name|age|
+-----+---+
|Jason| 34|
+-----+---+
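Filter conditions can also be written as Column expressions instead of SQL strings; a minimal sketch of the same query using the col and $ forms introduced earlier:

val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
import org.apache.spark.sql.functions.col
val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
// === is the Column equality operator; && combines predicates
df.filter(col("age") === 34 && $"Name" === "Jason").show();

This prints the same single Jason row as the string version above.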
distinct, dropDuplicates
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
val df = spark.sparkContext.parallelize(List(("Jason","DBA"),("Jason","Dev"),("Jason","BigData"))).toDF("Name","Job")
df.select("Name").distinct().show();
df.select("Job").distinct().show();
df.select("Name").dropDuplicates().show();
df.select("Job").dropDuplicates().show();
Output (the two dropDuplicates calls return the same result sets as the distinct calls, so only two result sets are shown):
+-----+
| Name|
+-----+
|Jason|
+-----+
+-------+
| Job|
+-------+
| Dev|
|BigData|
| DBA|
+-------+
sort(columns), orderBy(columns)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
import org.apache.spark.sql.functions.col
val df = spark.sparkContext.parallelize(List((2,"Jason"),(1,"Tom"),(5,"James"))).toDF("ID","Name");
df.sort("ID").show();
df.orderBy(col("ID").desc).show();
Output:
+---+-----+
| ID| Name|
+---+-----+
| 1| Tom|
| 2|Jason|
| 5|James|
+---+-----+
+---+-----+
| ID| Name|
+---+-----+
| 5|James|
| 2|Jason|
| 1| Tom|
+---+-----+
limit(n)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
import org.apache.spark.sql.functions.col
val df = spark.sparkContext.parallelize(List((1,"Jason"),(3,"Mike"),(4,"Tom"))).toDF("ID","Name");
df.orderBy(col("ID").desc).limit(1).show();
Output:
+---+----+
| ID|Name|
+---+----+
| 4| Tom|
+---+----+
union(otherDataFrame)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
import org.apache.spark.sql.functions.col
val df1 = spark.sparkContext.parallelize(List((1,"Jason"),(3,"Tom"))).toDF("ID","Name");
val df2 = spark.sparkContext.parallelize(List((2,"Mike"),(5,"James"))).toDF("ID","Name");
df1.union(df2).sort(col("ID").desc).show();
Output:
+---+-----+
| ID| Name|
+---+-----+
| 5|James|
| 3| Tom|
| 2| Mike|
| 1|Jason|
+---+-----+
withColumn(colName, column)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
import org.apache.spark.sql.functions.col
val df1 = spark.sparkContext.parallelize(List((1,"Jason"),(3,"Tom"))).toDF("ID","Name");
val df2 = spark.sparkContext.parallelize(List((2,"Mike"),(5,"James"))).toDF("ID","Name");
df1.union(df2).sort(col("ID").desc).withColumn("New Name", col("ID") + col("ID")).show();
Output:
+---+-----+--------+
| ID| Name|New Name|
+---+-----+--------+
| 5|James| 10|
| 3| Tom| 6|
| 2| Mike| 4|
| 1|Jason| 2|
+---+-----+--------+
withColumnRenamed(existingColName, newColName)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
val df1 = spark.sparkContext.parallelize(List(("one",1),("Two",2),("Three",3))).toDF("Num","Number");
df1.withColumnRenamed("Num","Number1").withColumnRenamed("Number","Num1").show();
Output:
+-------+----+
|Number1|Num1|
+-------+----+
| one| 1|
| Two| 2|
| Three| 3|
+-------+----+
drop(columnName1, columnName2)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._;
import org.apache.spark.sql.functions.col;
val df = spark.sparkContext.parallelize(List(("Jason","DBA"),("Jason","BigData"),("Jason","Dev"))).toDF("Name","Job");
df.drop("Name").sort(col("Job").desc).show();
Output:
+-------+
| Job|
+-------+
| Dev|
| DBA|
|BigData|
+-------+
sample(fraction), sample(fraction, seed), sample(withReplacement, fraction, seed)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._;
val df = spark.sparkContext.parallelize(List(("Jason","DBA"),("Jason","BigData"),("Jason","Dev"))).toDF("Name","Job");
df.sample(0.2).show();
Output (sampling is random, so the rows returned vary from run to run):
+-----+---+
| Name|Job|
+-----+---+
|Jason|Dev|
+-----+---+
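The seeded and with-replacement variants work the same way; a minimal sketch against the same DataFrame (the seed 42 is chosen arbitrarily):

df.sample(0.5, 42L).show();        // same seed over the same data gives a repeatable sample
df.sample(true, 0.5, 42L).show();  // sample with replacement, so a row may appear more than once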
randomSplit(weights)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._;
val df = spark.sparkContext.parallelize(List(1,2,3,4,5,6,7,8,9,10)).toDF("Num");
val splitDf = df.randomSplit(Array(0.5,0.3,0.2));
println("Total count:"+df.count);
println("split 0 count:"+splitDf(0).count);
println("split 1 count:"+splitDf(1).count);
println("split 2 count:"+splitDf(2).count);
Output (the split sizes only approximate the weights and vary from run to run):
Total count:10
split 0 count:6
split 1 count:2
split 2 count:2
describe(columnNames)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._;
val df = spark.sparkContext.parallelize(List(1,2,3,4,5,6,7,8,9,10)).toDF("Num");
df.describe("Num").show();
Output:
+-------+------------------+
|summary| Num|
+-------+------------------+
| count| 10|
| mean| 5.5|
| stddev|3.0276503540974917|
| min| 1|
| max| 10|
+-------+------------------+
Action API list

| Operation | Description |
| --- | --- |
| show() | Display the contents of the DataFrame, 20 rows by default |
| show(numRows) | Display the first numRows rows of the DataFrame |
| show(truncate) | If a String column is longer than 20 characters, truncate it when truncate is true |
| show(numRows, truncate) | Display numRows rows, truncating String columns longer than 20 characters |
| head(), first(), head(n), take(n) | Return the first row / first n rows |
| takeAsList(n) | Return the first n rows as a Java List; asking for too many rows can cause an out-of-memory error |
| collect, collectAsList | Return all rows as an Array or a Java List; watch the data volume |
| count | Return the number of rows in the DataFrame |

Only show(truncate) has a walkthrough below; the remaining actions are sketched right after this table.
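A minimal sketch of first/take/collect/count, reusing the small DataFrame from the limit example above:

val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._;
val df = spark.sparkContext.parallelize(List((1,"Jason"),(3,"Mike"),(4,"Tom"))).toDF("ID","Name");
println(df.first());                 // first Row
println(df.take(2).mkString(", "));  // first 2 rows as an Array[Row]
println(df.collect().length);        // every row pulled back to the driver
println(df.count());                 // row count computed on the executors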
show(truncate)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._;
val df = spark.sparkContext.parallelize(List(("abcdefghijklmn123123132"))).toDF("Num");
df.show(true);
Output:
+--------------------+
| Num|
+--------------------+
|abcdefghijklmn123...|
+--------------------+
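Passing false instead shows the full, untruncated value; a minimal sketch with the same DataFrame:

df.show(false);

This should print the complete 23-character string in a single, wider column.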