Architecture diagram: the SparkSQL component within the overall Spark stack

DataFrame
A DataFrame is conceptually similar to a table in a traditional relational database: each record is represented by a Row object.
Like the RDD API, the DataFrame API is split into two kinds of operations: transformations and actions.
A DataFrame can be created by reading from Hive or from other databases.
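To make the split between transformations and actions concrete, here is a minimal sketch; the SparkSession setup, the sample rows and the column names are only assumptions for illustration:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("jason", 33), ("Tom", 50)).toDF("Name", "Age")

// Transformations (select, filter, ...) are lazy: they only build a logical plan.
val adults = df.select("Name", "Age").filter($"Age" >= 18)

// Actions (show, count, ...) trigger the actual execution of that plan.
adults.show()
println(adults.count())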
How do you create a DataFrame?
Example 1: Creating a DataFrame from an RDD
Add the spark-sql dependency to the project's pom.xml:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.4.0</version>
</dependency>
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val conf = new SparkConf().setAppName("CreateDataFrameFromRDD").setMaster("local[*]"); // SparkConf for a local run; adjust as needed
val sc = new SparkContext(conf);
val spark = SparkSession.builder().config(conf).getOrCreate();
sc.setLogLevel("ERROR");
// Build an RDD of Row objects and pair it with an explicit schema.
val strRDD = sc.parallelize(List(Row("jason",33),Row("Tom",50)));
val schema = StructType(List(
  StructField("Name",StringType,true),
  StructField("Age",IntegerType,true)
));
val df = spark.createDataFrame(strRDD,schema);
df.printSchema();
df.show();
Output:
root
|-- Name: string (nullable = true)
|-- Age: integer (nullable = true)
+-----+---+
| Name|Age|
+-----+---+
|jason| 33|
| Tom| 50|
+-----+---+
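As an alternative sketch, reusing the spark and sc values created above, the schema can also be inferred by reflection from a case class instead of being declared by hand with StructType:
import spark.implicits._

case class Person(Name: String, Age: Int)

// With spark.implicits._ in scope, an RDD of case-class instances converts directly to a DataFrame.
val personRDD = sc.parallelize(List(Person("jason", 33), Person("Tom", 50)))
val personDF = personRDD.toDF()
personDF.printSchema()
personDF.show()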
Example 2: Creating a DataFrame from a Range of Numbers
val spark = SparkSession.builder().config(conf).getOrCreate();
spark.range(1,100).toDF("Num").show();
Output:
+---+
|Num|
+---+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
| 20|
+---+
only showing top 20 rows
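Once renamed with toDF, the generated range behaves like any other DataFrame, so further columns can be derived from it; a small sketch (the Square column is purely illustrative):
import org.apache.spark.sql.functions.col

// Derive a second column from the generated range.
spark.range(1, 100).toDF("Num")
  .withColumn("Square", col("Num") * col("Num"))
  .show(5)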
Example 3: Creating a DataFrame from a Collection of Tuples
val spark = SparkSession.builder().config(conf).getOrCreate();
val person = Seq(("Jason","DBA"),("Chen","Dev"))
val df = spark.createDataFrame(person).toDF("Name","jobs");
df.printSchema();
df.show();
Output:
root
|-- Name: string (nullable = true)
|-- jobs: string (nullable = true)
+-----+----+
| Name|jobs|
+-----+----+
|Jason| DBA|
| Chen| Dev|
+-----+----+
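Equivalently, a sketch that relies on spark.implicits._: the Seq of tuples can be converted directly with toDF, naming the columns in the same call:
import spark.implicits._

// toDF on a local collection of tuples; column names are supplied inline.
val df2 = Seq(("Jason","DBA"),("Chen","Dev")).toDF("Name","jobs")
df2.show()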
The examples below show how to create DataFrames from external data sources.
Example 1: Creating a DataFrame from a Text File
val spark = SparkSession.builder().config(conf).getOrCreate();
val df = spark.read.text("file:///c://README.txt");
df.show(300);
Output:
+--------------------+
| value|
+--------------------+
| TCPDF - README|
|=================...|
| |
|I WISH TO IMPROVE...|
|PLEASE MAKE A DON...|
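Each line of the text file arrives as one row in a single string column named value. As a small follow-up sketch that reuses the df just created, a word count over that column only needs split, explode and groupBy:
import org.apache.spark.sql.functions.{col, explode, lower, split}

// Split each line into words, flatten the arrays, then count occurrences.
df.select(explode(split(lower(col("value")), "\\s+")).as("word"))
  .groupBy("word")
  .count()
  .orderBy(col("count").desc)
  .show(10)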
Example 2: Creating a DataFrame by Reading a CSV File
val spark = SparkSession.builder().config(conf).getOrCreate();
val df = spark.read.option("header",true).csv("file:///c://samle1.csv");
df.show();
Output:
+---+-----------+----+
| |ID_PROVINCE|NAME|
+---+-----------+----+
| 1| 91|泰安|
| 2| 91|济宁|
| 3| 91|临沂|
| 4| 77|孝感|
| 5| 77|黄冈|
| 6| 69|娄底|
| 7| 69|益阳|
| 8| 69|怀化|
| 9| 69|永州|
| 10| 69|邵阳|
| 11| 91|莱芜|
| 12| 77|鄂州|
| 13| 77|随州|
| 14| 74|安庆|
| 15| 74|蚌埠|
| 16| 74|亳州|
| 17| 74|池州|
| 18| 74|滁州|
| 19| 74|阜阳|
| 20| 74|合肥|
+---+-----------+----+
only showing top 20 rows
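Because no schema was supplied, every column above is read as a string. A variant sketch lets Spark infer the column types instead, at the cost of one extra pass over the file:
// inferSchema makes Spark scan the file once more to guess column types.
val inferred = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv("file:///c://samle1.csv")
inferred.printSchema()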
Example 3: Creating a DataFrame by Reading a TSV File
val spark = SparkSession.builder().config(conf).getOrCreate();
val schema = StructType(List(
  StructField("ID",IntegerType,true),
  StructField("PID",IntegerType,true),
  StructField("省份",StringType,true)
));
// Note: the option name below is misspelled; it should be "header". Spark ignores the
// unknown "head" option, so the header line is parsed as data; its values cannot be cast
// to the Integer columns, which is why the first row of the output is all null.
val df = spark.read.option("head",true).option("sep","\t").schema(schema).csv("file:///c://sample2.tsv");
df.show();
Output:
+----+----+----+
| ID| PID|省份|
+----+----+----+
|null|null|null|
| 1| 91|泰安|
| 2| 91|济宁|
| 3| 91|临沂|
| 4| 77|孝感|
| 5| 77|黄冈|
| 6| 69|娄底|
| 7| 69|益阳|
| 8| 69|怀化|
| 9| 69|永州|
| 10| 69|邵阳|
| 11| 91|莱芜|
| 12| 77|鄂州|
| 13| 77|随州|
| 14| 74|安庆|
| 15| 74|蚌埠|
| 16| 74|亳州|
| 17| 74|池州|
| 18| 74|滁州|
| 19| 74|阜阳|
+----+----+----+
only showing top 20 rows
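For comparison, a corrected sketch with the option spelled "header" and the same schema: the first line is then consumed as column names and the row of nulls disappears:
val fixed = spark.read
  .option("header", true)
  .option("sep", "\t")
  .schema(schema)
  .csv("file:///c://sample2.tsv")
fixed.show()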
Example 4: Creating a DataFrame by Reading a JSON File
val spark = SparkSession.builder().config(conf).getOrCreate();
val df = spark.read.json("file:///c://notification.json");
df.printSchema();
df.show();
Output:
root
|-- _class: string (nullable = true)
|-- _id: struct (nullable = true)
| |-- $oid: string (nullable = true)
|-- expireTime: struct (nullable = true)
| |-- $date: string (nullable = true)
|-- insertDate: struct (nullable = true)
| |-- $date: string (nullable = true)
|-- notificationId: string (nullable = true)
|-- notificationResult: struct (nullable = true)
| |-- _id: string (nullable = true)
| |-- idApp: string (nullable = true)
| |-- idNotification: string (nullable = true)
| |-- status: string (nullable = true)
| |-- statusDesc: string (nullable = true)
|-- notificationStatus: string (nullable = true)
|-- sendTime: struct (nullable = true)
| |-- $numberLong: string (nullable = true)
|-- status: string (nullable = true)
+--------------------+--------------------+--------------------+--------------------+--------------+--------------------+------------------+---------------+------------+
| _class| _id| expireTime| insertDate|notificationId| notificationResult|notificationStatus| sendTime| status|
+--------------------+--------------------+--------------------+--------------------+--------------+--------------------+------------------+---------------+------------+
|cn.homecredit.ngw...|[5c3d50f6e4b0a0b3...|[2019-02-13T16:00...|[2019-01-15T03:18...| 245884248|[181374678, 2, 24...| 0|[1547522294576]|Wait4Persist|
|cn.homecredit.ngw...|[5c3d50f6e4b0a0b3...|[2019-02-13T16:00...|[2019-01-15T03:18...| 245884249|[181374679, 2, 24...| 0|[1547522294886]|Wait4Persist|
|cn.homecredit.ngw...|[5c3d7a43e4b0673d...|[2019-02-13T16:00...|[2019-01-15T06:14...| 245884250|[181374680, 2, 24...| 0|[1547532867318]|Wait4Persist|
|cn.homecredit.ngw...|[5c3da976e4b043d0...|[2019-02-13T16:00...|[2019-01-15T09:35...| 245884251|[181374681, 2, 24...| 0|[1547544950160]|Wait4Persist|
|cn.homecredit.ngw...|[5c3da976e4b043d0...|[2019-02-13T16:00...|[2019-01-15T09:35...| 245884253|[181374683, 2, 24...| 0|[1547544950176]|Wait4Persist|
|cn.homecredit.ngw...|[5c3da976e4b043d0...|[2019-02-13T16:00...|[2019-01-15T09:35...| 245884254|[181374682, 2, 24...| 0|[1547544950175]|Wait4Persist|
|cn.homecredit.ngw...|[5c3da976e4b043d0...|[2019-02-13T16:00...|[2019-01-15T09:35...| 245884255|[181374684, 2, 24...| 0|[1547544950181]|Wait4Persist|
|cn.homecredit.ngw...|[5c3da976e4b043d0...|[2019-02-13T16:00...|[2019-01-15T09:35...| 245884263|[181374693, 2, 24...| 0|[1547544950259]|Wait4Persist|
|cn.homecredit.ngw...|[5c3da976e4b043d0...|[2019-02-13T16:00...|[2019-01-15T09:35...| 245884265|[181374694, 2, 24...| 0|[1547544950261]|Wait4Persist|
|cn.homecredit.ngw...|[5c3da976e4b043d0...|[2019-02-13T16:00...|[2019-01-15T09:35...| 245884266|[181374695, 2, 24...| 0|[1547544950262]|Wait4Persist|
+--------------------+--------------------+--------------------+--------------------+--------------+--------------------+------------------+---------------+------------+
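Nested struct fields in the JSON document can be addressed with dot paths in select; a small sketch using columns from the schema printed above:
// Pull a few top-level and nested fields out of the nested JSON structure.
df.select("notificationId", "notificationStatus", "notificationResult.status")
  .show(5, false)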
Example 5: Creating a DataFrame from a JDBC Source
Add the MySQL JDBC driver dependency to the project:
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>6.0.2</version>
</dependency>
val spark = SparkSession.builder().config(conf).getOrCreate();
val str_url = "jdbc:mysql://localhost:3306/miracleops";
// Connector/J 6.x renamed the driver class to com.mysql.cj.jdbc.Driver; the older
// com.mysql.jdbc.Driver name still loads but logs a deprecation warning.
val table_df = spark.read.format("jdbc").option("driver","com.mysql.jdbc.Driver").option("url",str_url)
  .option("dbtable","user").option("user","root").option("password","root").load();
table_df.printSchema();
table_df.select("email","username").show();
Output:
root
|-- id: integer (nullable = true)
|-- password: string (nullable = true)
|-- last_login: timestamp (nullable = true)
|-- email: string (nullable = true)
|-- username: string (nullable = true)
|-- wechat: string (nullable = true)
|-- avatar: string (nullable = true)
|-- job_title: string (nullable = true)
|-- reg_time: timestamp (nullable = true)
|-- is_active: boolean (nullable = true)
|-- is_admin: boolean (nullable = true)
|-- is_staff: boolean (nullable = true)
|-- group_id: integer (nullable = true)
+-------------------+------------+
| email| username|
+-------------------+------------+
|Jason.chen@sina.com| Jason|
|chenxu_0209@163.com|jason.chenTJ|
+-------------------+------------+
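Rather than loading the whole table, a query can also be pushed down to MySQL by passing a subquery (it must be aliased) as the dbtable option; a sketch reusing the same connection settings:
// Push the projection and filter down to MySQL instead of pulling the full table.
val active_df = spark.read.format("jdbc")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("url", str_url)
  .option("dbtable", "(select email, username from user where is_active = 1) as t")
  .option("user", "root")
  .option("password", "root")
  .load()
active_df.show()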