SparkSQL: Processing Structured Data

Like the RDD API, Spark SQL's API for structured data also falls into two broad categories: transformations and actions.
Transformation API list

Operation: Description
select: corresponds to the SELECT clause of traditional SQL
selectExpr: SELECT with SQL expressions
filter / where: filters rows with a WHERE-style condition
distinct / dropDuplicates: removes duplicate rows
sort / orderBy: sorts the result set
limit: limits the number of rows in the output
union: combines two result sets
withColumn: adds a new column to the output
withColumnRenamed: renames an existing column in the output
drop: drops one or more columns from the output
sample: returns a random sample of the result set
randomSplit: splits the result set into several DataFrames
join: joins two result sets (a sketch follows the describe example below)
groupBy: groups and aggregates the result set (a sketch follows the describe example below)
describe: shows summary statistics of the result set
The examples below show how to reference output columns. There are generally five ways:

Style: Example
"" (plain string): "columnName"
col: col("columnName")
column: column("columnName")
$: $"columnName"
' (tick mark): 'columnName

The first three forms are shown in the next example; the $ and tick-mark forms are sketched after its output.

Column reference forms
// conf is assumed to be an existing SparkConf; functions._ provides col and column
// (the later examples assume these same imports)
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().config(conf).getOrCreate()
val df = spark.createDataFrame(List(("Jason", 34), ("Tom", 20))).toDF("Name", "age")
df.select("Name").show()
df.select(col("Name")).show()
df.select(column("Name")).show()
Output:
+-----+
| Name|
+-----+
|Jason|
|  Tom|
+-----+

+-----+
| Name|
+-----+
|Jason|
|  Tom|
+-----+

+-----+
| Name|
+-----+
|Jason|
|  Tom|
+-----+
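The remaining two reference styles from the table above, $ and the tick mark, need the SparkSession's implicit conversions. A minimal sketch, reusing spark and df from the previous example:

// $"..." and 'col both come from the session's implicit conversions
import spark.implicits._

df.select($"Name").show()
df.select('Name).show()   // tick-mark (Symbol) form; the symbol literal syntax is deprecated in Scala 2.13+

Both calls print the same Name column as above.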
select(columns)
 val spark = SparkSession.builder().config(conf).getOrCreate();
 val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
 df.select("Name","age").show();
Output:
+-----+---+
| Name|age|
+-----+---+
|Jason| 34|
|  Tom| 20|
+-----+---+
selectExpr(expressions)
val spark = SparkSession.builder().config(conf).getOrCreate();
val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
df.selectExpr("*","(2019 - age) as Birth_Year").show();
Output:
+-----+---+----------+
| Name|age|Birth_Year|
+-----+---+----------+
|Jason| 34|      1985|
|  Tom| 20|      1999|
+-----+---+----------+
Calling the built-in SQL functions count and distinct
selectExpr(expressions)
val spark = SparkSession.builder().config(conf).getOrCreate();
val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
df.selectExpr("count(distinct(Name)) as count").show();
Output:
+-----+
|count|
+-----+
|    2|
+-----+
filter(condition), where(condition)
val spark = SparkSession.builder().config(conf).getOrCreate();
val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
df.filter("age < 30").show();
Output:
+----+---+
|Name|age|
+----+---+
| Tom| 20|
+----+---+
val spark = SparkSession.builder().config(conf).getOrCreate();
val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
df.filter("age == 34 and Name == 'Jason'").show();
Output:
+-----+---+
| Name|age|
+-----+---+
|Jason| 34|
+-----+---+
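The same condition can also be expressed with Column objects instead of a SQL string; the Column operators are === for equality and && for logical AND. A minimal sketch, reusing df from above:

import org.apache.spark.sql.functions.col

// Equivalent to df.filter("age == 34 and Name == 'Jason'")
df.filter(col("age") === 34 && col("Name") === "Jason").show()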
distinct, dropDuplicates
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
 val df = spark.sparkContext.parallelize(List(("Jason","DBA"),("Jason","Dev"),("Jason","BigData"))).toDF("Name","Job")
 df.select("Name").distinct().show();
 df.select("Job").distinct().show();
 df.select("Name").dropDuplicates().show();
 df.select("Job").dropDuplicates().show();
Output (distinct on Name, then distinct on Job; the two dropDuplicates calls produce the same tables and are omitted):
+-----+
| Name|
+-----+
|Jason|
+-----+
+-------+
|    Job|
+-------+
|    Dev|
|BigData|
|    DBA|
+-------+
sort(columns), orderBy(columns)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
val df = spark.sparkContext.parallelize(List((2,"Jason"),(1,"Tom"),(5,"James"))).toDF("ID","Name");
df.sort("ID").show();
df.orderBy(col("ID").desc).show();
Output:
+---+-----+
| ID| Name|
+---+-----+
|  1|  Tom|
|  2|Jason|
|  5|James|
+---+-----+
+---+-----+
| ID| Name|
+---+-----+
|  5|James|
|  2|Jason|
|  1|  Tom|
+---+-----+
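sort and orderBy also accept multiple columns, each with its own direction. A minimal sketch, reusing df from above:

// Sort by ID descending, breaking ties by Name ascending
df.orderBy(col("ID").desc, col("Name").asc).show()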
limit(n)
 val spark = SparkSession.builder().config(conf).getOrCreate();
 import spark.implicits._
 val df = spark.sparkContext.parallelize(List((1,"Jason"),(3,"Mike"),(4,"Tom"))).toDF("ID","Name");
 df.orderBy(col("ID").desc).limit(1).show();
Output:
+---+----+
| ID|Name|
+---+----+
|  4| Tom|
+---+----+
union(otherDataFrame)
 val spark = SparkSession.builder().config(conf).getOrCreate();
 import spark.implicits._
 val df1 = spark.sparkContext.parallelize(List((1,"Jason"),(3,"Tom"))).toDF("ID","Name");
 val df2 = spark.sparkContext.parallelize(List((2,"Mike"),(5,"James"))).toDF("ID","Name");
 df1.union(df2).sort(col("ID").desc).show();
Output:
+---+-----+
| ID| Name|
+---+-----+
|  5|James|
|  3|  Tom|
|  2| Mike|
|  1|Jason|
+---+-----+
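Note that union matches columns by position, not by name. If the second DataFrame has the same columns in a different order, unionByName (available since Spark 2.3) is the safer choice; a minimal sketch with a hypothetical df3:

// Same columns as df1, but in the opposite order
val df3 = spark.sparkContext.parallelize(List(("Mike", 2), ("James", 5))).toDF("Name", "ID")

// union(df3) would match these columns by position and mix them up; unionByName matches by name
df1.unionByName(df3).sort(col("ID").desc).show()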
withColumn(colName, column)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
val df1 = spark.sparkContext.parallelize(List((1,"Jason"),(3,"Tom"))).toDF("ID","Name");
val df2 = spark.sparkContext.parallelize(List((2,"Mike"),(5,"James"))).toDF("ID","Name");
df1.union(df2).sort(col("ID").desc).withColumn("New Name", col("ID") + col("ID")).show()
Output:
+---+-----+--------+
| ID| Name|New Name|
+---+-----+--------+
|  5|James|      10|
|  3|  Tom|       6|
|  2| Mike|       4|
|  1|Jason|       2|
+---+-----+--------+
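Besides deriving a column from existing ones, withColumn can add a constant column via lit, and reusing an existing column name overwrites that column. A minimal sketch, reusing df1 from above:

import org.apache.spark.sql.functions.lit

df1.withColumn("Country", lit("US"))   // add a constant column
   .withColumn("ID", col("ID") * 2)    // overwrite the existing ID column
   .show()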
withColumnRenamed(existingColName, newColName)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
val df1 = spark.sparkContext.parallelize(List(("one",1),("Two",2),("Three",3))).toDF("Num","Number");
df1.withColumnRenamed("Num","Number1").withColumnRenamed("Number","Num1")
      .show();
Output:
+-------+----+
|Number1|Num1|
+-------+----+
|    one|   1|
|    Two|   2|
|  Three|   3|
+-------+----+
drop(columnName1, columnName2)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._;
val df = spark.sparkContext.parallelize(List(("Jason","DBA"),("Jason","BigData"),("Jason","Dev"))).toDF("Name","Job");
df.drop("Name").sort(col("Job").desc).show();
Output:
+-------+
|    Job|
+-------+
|    Dev|
|    DBA|
|BigData|
+-------+
sample(fraction), sample(fraction, seed), sample(withReplacement, fraction, seed)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._;
val df = spark.sparkContext.parallelize(List(("Jason","DBA"),("Jason","BigData"),("Jason","Dev"))).toDF("Name","Job");
df.sample(0.2).show();
Output (sampling is random, so the sampled rows may differ between runs):
+-----+---+
| Name|Job|
+-----+---+
|Jason|Dev|
+-----+---+
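The full overload takes withReplacement, fraction, and seed in that order; fixing the seed makes the sample reproducible across runs. A minimal sketch, reusing df from above:

// Sample roughly half of the rows, without replacement, with a fixed seed
df.sample(withReplacement = false, fraction = 0.5, seed = 42).show()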
randomSplit(weights)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._;
val df = spark.sparkContext.parallelize(List(1,2,3,4,5,6,7,8,9,10)).toDF("Num");
val splitDf = df.randomSplit(Array(0.5,0.3,0.2));
println("Total count:"+df.count);
println("split 0 count:"+splitDf(0).count);
println("split 1 count:"+splitDf(1).count);
println("split 2 count:"+splitDf(2).count);
Output (the split sizes are approximate and vary between runs):
Total count:10
split 0 count:6
split 1 count:2
split 2 count:2
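randomSplit also takes an optional seed, which makes the split reproducible; the weights are normalized if they do not sum to 1. A minimal sketch, reusing df from above:

// Reproducible 80/20 split
val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42)
println("train count:" + train.count + ", test count:" + test.count)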
describe(columnNames)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._;
val df = spark.sparkContext.parallelize(List(1,2,3,4,5,6,7,8,9,10)).toDF("Num");
df.describe("Num").show();
Output:
+-------+------------------+
|summary|               Num|
+-------+------------------+
|  count|                10|
|   mean|               5.5|
| stddev|3.0276503540974917|
|    min|                 1|
|    max|                10|
+-------+------------------+
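join and groupBy appear in the transformation list above but are not demonstrated. A minimal sketch with hypothetical employee and department data:

import org.apache.spark.sql.functions.{avg, count}

val emp = spark.sparkContext.parallelize(
  List(("Jason", "IT", 34), ("Tom", "IT", 20), ("Mike", "HR", 30))).toDF("Name", "Dept", "age")
val dept = spark.sparkContext.parallelize(
  List(("IT", "Building A"), ("HR", "Building B"))).toDF("Dept", "Location")

// Inner join on the shared Dept column, then aggregate per department
emp.join(dept, Seq("Dept"))
  .groupBy("Dept", "Location")
  .agg(count("Name").as("headcount"), avg("age").as("avg_age"))
  .show()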
Action API list

Operation: Description
show(): displays the contents of the DataFrame, 20 rows by default
show(numRows): displays the first numRows rows of the DataFrame
show(truncate): when truncate is true, String columns longer than 20 characters are truncated
show(numRows, truncate): displays the first numRows rows, truncating String columns longer than 20 characters when truncate is true
head(), first(), head(n), take(n): return the first row(s) (a combined sketch follows the show(truncate) example below)
takeAsList(n): returns the first n rows as a Java List; requesting a large amount of data can cause out-of-memory errors
collect, collectAsList: return all rows as an array or a Java List; be careful with the size of the data
count: returns the number of rows in the DataFrame
show(truncate)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._;
val df = spark.sparkContext.parallelize(List(("abcdefghijklmn123123132"))).toDF("Num");
df.show(true);
Output:
+--------------------+
|                 Num|
+--------------------+
|abcdefghijklmn123...|
+--------------------+
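The remaining actions from the list above (head, first, take, collect, count) are not shown; a minimal sketch on a small hypothetical DataFrame:

val people = spark.createDataFrame(List(("Jason", 34), ("Tom", 20), ("Mike", 30))).toDF("Name", "age")

val firstRow = people.head()    // the first Row
val twoRows  = people.take(2)   // Array with the first 2 Rows
val allRows  = people.collect() // Array with all Rows; beware of large results
val total    = people.count()   // number of rows as a Long

println(s"first: $firstRow, take: ${twoRows.length}, collect: ${allRows.length}, count: $total")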
