Big Data Spark SQL: imooc.com Log Analysis
http://coding.imooc.com/class/112.html
5-2-A Using SQLContext 27:05
Notes
1. Program source code (IntelliJ IDEA 2017)
----------------------------------------------------------------------------------------
package com.imooc.spark

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SQLContextAPP {

  def main(args: Array[String]): Unit = {
    val path = args(0)

    // 1. Create the contexts
    val sparkConf = new SparkConf()
    // Leave appName/master unset here so they can be supplied via spark-submit
    // sparkConf.setAppName("SQLContextApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)

    // 2. Business logic: load the JSON file and inspect it
    val people = sqlContext.read.format("json").load(path)
    people.printSchema()
    people.show()

    // 3. Release resources
    sc.stop()
  }
}
----------------------------------------------------------------------------------------
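Side note: in Spark 2.x, SQLContext is kept mainly for backward compatibility; SparkSession is the unified entry point. A minimal equivalent sketch (the object name SparkSessionApp is my own):
----------------------------------------------------------------------------------------
package com.imooc.spark

import org.apache.spark.sql.SparkSession

object SparkSessionApp {

  def main(args: Array[String]): Unit = {
    val path = args(0)

    // SparkSession subsumes SQLContext/HiveContext in Spark 2.x
    val spark = SparkSession.builder().getOrCreate()

    val people = spark.read.format("json").load(path)
    people.printSchema()
    people.show()

    spark.stop()
  }
}
----------------------------------------------------------------------------------------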
2. Build the jar and upload it
In the IDEA sidebar: Maven Projects --> Lifecycle --> package --> run
If this is unclear, see the screenshot below. [screenshot: IDEA Maven Projects panel]
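If Maven is on the PATH, the same packaging step can also be run from the project root (a standard Maven invocation; -DskipTests just skips unit tests):
mvn clean package -DskipTests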

3. Run
Reference: http://spark.apache.org/docs/latest/submitting-applications.html
./bin/spark-submit \
--name SQLContextApp \
--class com.imooc.spark.SQLContextAPP \
--master local[2] \
/usr/local/src/app/spark-2.2.0-bin-2.6.0-cdh5.12.0/sql-1.0-SNAPSHOT.jar \
/usr/local/src/app/spark-2.2.0-bin-2.6.0-cdh5.12.0/examples/src/main/resources/people.json
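Everything listed after the application jar is handed to main() as program arguments, so the people.json path above arrives as args(0). For reference, the people.json shipped with Spark holds three records, matching the show() output below:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}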
4. Run output
spark-submit runs in local mode by default, so nothing else needs to be configured here.
[root@ecs-1e68 spark-2.2.0-bin-2.6.0-cdh5.12.0]# ls
bin data jars licenses python RELEASE spark-warehouse yarn
conf examples LICENSE NOTICE README.md sbin sql-1.0-SNAPSHOT.jar
[root@ecs-1e68 spark-2.2.0-bin-2.6.0-cdh5.12.0]# ./bin/spark-submit \
> --name SQLContextApp \
> --class com.imooc.spark.SQLContextAPP \
> --master local[2] \
> /usr/local/src/app/spark-2.2.0-bin-2.6.0-cdh5.12.0/sql-1.0-SNAPSHOT.jar \
> /usr/local/src/app/spark-2.2.0-bin-2.6.0-cdh5.12.0/examples/src/main/resources/people.json
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/07/23 15:21:27 INFO SparkContext: Running Spark version 2.2.0
17/07/23 15:21:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/23 15:21:27 WARN Utils: Your hostname, ecs-1e68 resolves to a loopback address: 127.0.0.1; using 192.168.1.203 instead (on interface eth0)
17/07/23 15:21:27 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/07/23 15:21:27 INFO SparkContext: Submitted application: SQLContextApp
17/07/23 15:21:27 INFO SecurityManager: Changing view acls to: root
17/07/23 15:21:27 INFO SecurityManager: Changing modify acls to: root
17/07/23 15:21:27 INFO SecurityManager: Changing view acls groups to:
17/07/23 15:21:27 INFO SecurityManager: Changing modify acls groups to:
17/07/23 15:21:27 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
17/07/23 15:21:28 INFO Utils: Successfully started service 'sparkDriver' on port 7453.
17/07/23 15:21:28 INFO SparkEnv: Registering MapOutputTracker
17/07/23 15:21:28 INFO SparkEnv: Registering BlockManagerMaster
17/07/23 15:21:28 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/07/23 15:21:28 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/07/23 15:21:28 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-92a35468-d406-4818-b22c-cb58736343c5
17/07/23 15:21:28 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/07/23 15:21:28 INFO SparkEnv: Registering OutputCommitCoordinator
17/07/23 15:21:28 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/07/23 15:21:28 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.1.203:4040
17/07/23 15:21:28 INFO SparkContext: Added JAR file:/usr/local/src/app/spark-2.2.0-bin-2.6.0-cdh5.12.0/sql-1.0-SNAPSHOT.jar at spark://192.168.1.203:7453/jars/sql-1.0-SNAPSHOT.jar with timestamp 1500794488628
17/07/23 15:21:28 INFO Executor: Starting executor ID driver on host localhost
17/07/23 15:21:28 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50869.
17/07/23 15:21:28 INFO NettyBlockTransferService: Server created on 192.168.1.203:50869
17/07/23 15:21:28 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/07/23 15:21:28 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.1.203, 50869, None)
17/07/23 15:21:28 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.203:50869 with 366.3 MB RAM, BlockManagerId(driver, 192.168.1.203, 50869, None)
17/07/23 15:21:28 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.203, 50869, None)
17/07/23 15:21:28 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.203, 50869, None)
17/07/23 15:21:28 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/usr/local/src/app/spark-2.2.0-bin-2.6.0-cdh5.12.0/spark-warehouse/').
17/07/23 15:21:28 INFO SharedState: Warehouse path is 'file:/usr/local/src/app/spark-2.2.0-bin-2.6.0-cdh5.12.0/spark-warehouse/'.
17/07/23 15:21:29 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
17/07/23 15:21:31 INFO FileSourceStrategy: Pruning directories with:
17/07/23 15:21:31 INFO FileSourceStrategy: Post-Scan Filters:
17/07/23 15:21:31 INFO FileSourceStrategy: Output Data Schema: struct<age: bigint, name: string>
17/07/23 15:21:31 INFO FileSourceScanExec: Pushed Filters:
17/07/23 15:21:32 INFO CodeGenerator: Code generated in 176.839409 ms
17/07/23 15:21:32 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 271.9 KB, free 366.0 MB)
17/07/23 15:21:32 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.7 KB, free 366.0 MB)
17/07/23 15:21:32 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.203:50869 (size: 24.7 KB, free: 366.3 MB)
17/07/23 15:21:32 INFO SparkContext: Created broadcast 0 from load at SQLContextAPP.scala:15
17/07/23 15:21:32 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
17/07/23 15:21:32 INFO SparkContext: Starting job: load at SQLContextAPP.scala:15
17/07/23 15:21:32 INFO DAGScheduler: Got job 0 (load at SQLContextAPP.scala:15) with 1 output partitions
17/07/23 15:21:32 INFO DAGScheduler: Final stage: ResultStage 0 (load at SQLContextAPP.scala:15)
17/07/23 15:21:32 INFO DAGScheduler: Parents of final stage: List()
17/07/23 15:21:32 INFO DAGScheduler: Missing parents: List()
17/07/23 15:21:32 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at load at SQLContextAPP.scala:15), which has no missing parents
17/07/23 15:21:32 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 9.5 KB, free 366.0 MB)
17/07/23 15:21:32 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 5.3 KB, free 366.0 MB)
17/07/23 15:21:32 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.203:50869 (size: 5.3 KB, free: 366.3 MB)
17/07/23 15:21:32 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/07/23 15:21:32 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at load at SQLContextAPP.scala:15) (first 15 tasks are for partitions Vector(0))
17/07/23 15:21:32 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/07/23 15:21:33 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 5334 bytes)
17/07/23 15:21:33 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/07/23 15:21:33 INFO Executor: Fetching spark://192.168.1.203:7453/jars/sql-1.0-SNAPSHOT.jar with timestamp 1500794488628
17/07/23 15:21:33 INFO TransportClientFactory: Successfully created connection to /192.168.1.203:7453 after 42 ms (0 ms spent in bootstraps)
17/07/23 15:21:33 INFO Utils: Fetching spark://192.168.1.203:7453/jars/sql-1.0-SNAPSHOT.jar to /tmp/spark-54a01d6b-263f-4d83-8fe2-5de3c9beeba6/userFiles-4decc4ce-1cad-45d4-a83c-1a54a7ba4aac/fetchFileTemp3418872028557488320.tmp
17/07/23 15:21:33 INFO Executor: Adding file:/tmp/spark-54a01d6b-263f-4d83-8fe2-5de3c9beeba6/userFiles-4decc4ce-1cad-45d4-a83c-1a54a7ba4aac/sql-1.0-SNAPSHOT.jar to class loader
17/07/23 15:21:33 INFO FileScanRDD: Reading File path: file:///usr/local/src/app/spark-2.2.0-bin-2.6.0-cdh5.12.0/examples/src/main/resources/people.json, range: 0-73, partition values: [empty row]
17/07/23 15:21:33 INFO CodeGenerator: Code generated in 14.79336 ms
17/07/23 15:21:33 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1953 bytes result sent to driver
17/07/23 15:21:33 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 330 ms on localhost (executor driver) (1/1)
17/07/23 15:21:33 INFO DAGScheduler: ResultStage 0 (load at SQLContextAPP.scala:15) finished in 0.356 s
17/07/23 15:21:33 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/07/23 15:21:33 INFO DAGScheduler: Job 0 finished: load at SQLContextAPP.scala:15, took 0.464410 s
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
17/07/23 15:21:33 INFO FileSourceStrategy: Pruning directories with:
17/07/23 15:21:33 INFO FileSourceStrategy: Post-Scan Filters:
17/07/23 15:21:33 INFO FileSourceStrategy: Output Data Schema: struct<age: bigint, name: string>
17/07/23 15:21:33 INFO FileSourceScanExec: Pushed Filters:
17/07/23 15:21:33 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 271.9 KB, free 365.7 MB)
17/07/23 15:21:33 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 24.7 KB, free 365.7 MB)
17/07/23 15:21:33 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.203:50869 (size: 24.7 KB, free: 366.2 MB)
17/07/23 15:21:33 INFO SparkContext: Created broadcast 2 from show at SQLContextAPP.scala:17
17/07/23 15:21:33 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
17/07/23 15:21:33 INFO SparkContext: Starting job: show at SQLContextAPP.scala:17
17/07/23 15:21:33 INFO DAGScheduler: Got job 1 (show at SQLContextAPP.scala:17) with 1 output partitions
17/07/23 15:21:33 INFO DAGScheduler: Final stage: ResultStage 1 (show at SQLContextAPP.scala:17)
17/07/23 15:21:33 INFO DAGScheduler: Parents of final stage: List()
17/07/23 15:21:33 INFO DAGScheduler: Missing parents: List()
17/07/23 15:21:33 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[6] at show at SQLContextAPP.scala:17), which has no missing parents
17/07/23 15:21:33 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 8.9 KB, free 365.7 MB)
17/07/23 15:21:33 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 5.1 KB, free 365.7 MB)
17/07/23 15:21:33 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.1.203:50869 (size: 5.1 KB, free: 366.2 MB)
17/07/23 15:21:33 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
17/07/23 15:21:33 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[6] at show at SQLContextAPP.scala:17) (first 15 tasks are for partitions Vector(0))
17/07/23 15:21:33 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
17/07/23 15:21:33 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, PROCESS_LOCAL, 5334 bytes)
17/07/23 15:21:33 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
17/07/23 15:21:33 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.1.203:50869 in memory (size: 5.3 KB, free: 366.2 MB)
17/07/23 15:21:33 INFO FileScanRDD: Reading File path: file:///usr/local/src/app/spark-2.2.0-bin-2.6.0-cdh5.12.0/examples/src/main/resources/people.json, range: 0-73, partition values: [empty row]
17/07/23 15:21:33 INFO CodeGenerator: Code generated in 15.083661 ms
17/07/23 15:21:33 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1255 bytes result sent to driver
17/07/23 15:21:33 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 68 ms on localhost (executor driver) (1/1)
17/07/23 15:21:33 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/07/23 15:21:33 INFO DAGScheduler: ResultStage 1 (show at SQLContextAPP.scala:17) finished in 0.072 s
17/07/23 15:21:33 INFO DAGScheduler: Job 1 finished: show at SQLContextAPP.scala:17, took 0.115689 s
17/07/23 15:21:33 INFO CodeGenerator: Code generated in 17.454719 ms
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
17/07/23 15:21:33 INFO SparkUI: Stopped Spark web UI at http://192.168.1.203:4040
17/07/23 15:21:33 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/07/23 15:21:33 INFO MemoryStore: MemoryStore cleared
17/07/23 15:21:33 INFO BlockManager: BlockManager stopped
17/07/23 15:21:33 INFO BlockManagerMaster: BlockManagerMaster stopped
17/07/23 15:21:33 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/07/23 15:21:33 INFO SparkContext: Successfully stopped SparkContext
17/07/23 15:21:33 INFO ShutdownHookManager: Shutdown hook called
17/07/23 15:21:33 INFO ShutdownHookManager: Deleting directory /tmp/spark-54a01d6b-263f-4d83-8fe2-5de3c9beeba6
[root@ecs-1e68 spark-2.2.0-bin-2.6.0-cdh5.12.0]#
5. Running via a shell script is more convenient
[root@ecs-1e68 spark-2.2.0-bin-2.6.0-cdh5.12.0]# vim sqlContext.sh
/usr/local/src/app/spark-2.2.0-bin-2.6.0-cdh5.12.0/bin/spark-submit \
--name SQLContextApp \
--class com.imooc.spark.SQLContextAPP \
--master local[2] \
/usr/local/src/app/spark-2.2.0-bin-2.6.0-cdh5.12.0/sql-1.0-SNAPSHOT.jar \
/usr/local/src/app/spark-2.2.0-bin-2.6.0-cdh5.12.0/examples/src/main/resources/people.json
[root@ecs-1e68 spark-2.2.0-bin-2.6.0-cdh5.12.0]# ./sqlContext.sh
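A slightly more general variant (my own sketch) takes the JSON path as an argument instead of hard-coding it:
/usr/local/src/app/spark-2.2.0-bin-2.6.0-cdh5.12.0/bin/spark-submit \
--name SQLContextApp \
--class com.imooc.spark.SQLContextAPP \
--master local[2] \
/usr/local/src/app/spark-2.2.0-bin-2.6.0-cdh5.12.0/sql-1.0-SNAPSHOT.jar \
"$1"
Usage: ./sqlContext.sh /usr/local/src/app/spark-2.2.0-bin-2.6.0-cdh5.12.0/examples/src/main/resources/people.json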
Supplement: make the script executable before running it:
chmod 0744 sqlContext.sh
HiveContextApp: the same pattern, but with HiveContext to read the Hive table "emp"
----------------------------------------------------------------------------------------
package com.imooc.spark

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveContextApp {

  def main(args: Array[String]): Unit = {
    // 1. Create the contexts
    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)
    val hiveContext = new HiveContext(sc)

    // 2. Business logic: read the Hive table "emp" and display it
    hiveContext.table("emp").show()

    // 3. Release resources
    sc.stop()
  }
}
----------------------------------------------------------------------------------------
hiveContext.sh (--jars ships the MySQL JDBC driver that the MySQL-backed Hive metastore connection requires):
/usr/local/src/app/spark-2.2.0-bin-2.6.0-cdh5.12.0/bin/spark-submit \
--name HiveContextApp \
--class com.imooc.spark.HiveContextApp \
--master local[2] \
--jars /usr/local/src/mysql-connector-java-5.1.42-bin.jar \
/usr/local/src/app/spark-2.2.0-bin-2.6.0-cdh5.12.0/sql-1.0-SNAPSHOT.jar
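For comparison, the Spark 2.x way of doing this is SparkSession with Hive support enabled; a minimal sketch (the object name SparkSessionHiveApp is my own, and it assumes a reachable Hive metastore):
----------------------------------------------------------------------------------------
package com.imooc.spark

import org.apache.spark.sql.SparkSession

object SparkSessionHiveApp {

  def main(args: Array[String]): Unit = {
    // enableHiveSupport() wires the session to the Hive metastore
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Same query as HiveContextApp above
    spark.table("emp").show()

    spark.stop()
  }
}
----------------------------------------------------------------------------------------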