1. For Spark to take over an existing Hive deployment, copy hive-site.xml into Spark's conf/ directory.
[root@hadoop151 conf]# cp /opt/module/hive/conf/hive-site.xml /opt/module/spark/conf/
[root@hadoop151 conf]# pwd
/opt/module/spark/conf
[root@hadoop151 conf]# cat hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoop151:3306/metastore?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>147258</value>
<description>password to use against metastore database</description>
</property>
</configuration>
2. Copy the MySQL JDBC driver into Spark's jars/ directory.
[root@hadoop151 mysql-connector-java-5.1.27]# cp mysql-connector-java-5.1.27-bin.jar /opt/module/spark/jars/
3. Run spark-shell; the built-in spark session can then query Hive directly.
scala> spark.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default| score| false|
+--------+---------+-----------+
scala> spark.sql("select * from score").show
+----+----------+-----+
| uid|subject_id|score|
+----+----------+-----+
|1001| 01| 90|
|1001| 02| 90|
|1001| 03| 90|
|1002| 01| 85|
|1002| 02| 85|
|1002| 03| 70|
|1003| 01| 70|
|1003| 02| 70|
|1003| 03| 85|
+----+----------+-----+
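The same Hive table can also be queried from spark-shell through the DataFrame API instead of raw SQL. A minimal sketch, assuming the score table shown above; the filter and ordering are only illustrative, not part of the original example:

// spark is the SparkSession that spark-shell provides; the implicits enable the $"col" syntax
import spark.implicits._

val scoreDF = spark.table("score")

// Equivalent of: select * from score where score >= 85 order by uid, subject_id
scoreDF.where($"score" >= 85)
  .orderBy($"uid", $"subject_id")
  .show()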
4. Run spark-sql
The Spark SQL CLI is a convenient way to run the Hive metastore service locally and execute queries from the command line. Start it from the Spark directory with bin/spark-sql; SQL statements can then be executed directly, much like in the Hive shell.
spark-sql> show tables;
20/08/15 20:54:01 INFO HiveMetaStore: 0: get_database: global_temp
20/08/15 20:54:01 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: global_temp
20/08/15 20:54:01 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
20/08/15 20:54:01 INFO HiveMetaStore: 0: get_database: default
20/08/15 20:54:01 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
20/08/15 20:54:01 INFO HiveMetaStore: 0: get_database: default
20/08/15 20:54:01 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
20/08/15 20:54:01 INFO HiveMetaStore: 0: get_tables: db=default pat=*
20/08/15 20:54:01 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_tables: db=default pat=*
20/08/15 20:54:01 INFO CodeGenerator: Code generated in 153.182765 ms
default score false
Time taken: 1.671 seconds, Fetched 1 row(s)
20/08/15 20:54:01 INFO SparkSQLCLIDriver: Time taken: 1.671 seconds, Fetched 1 row(s)
spark-sql> select * from score;
20/08/15 20:55:15 INFO HiveMetaStore: 0: get_table : db=default tbl=score
20/08/15 20:55:15 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=score
20/08/15 20:55:16 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 282.0 KB, free 366.0 MB)
20/08/15 20:55:16 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.1 KB, free 366.0 MB)
20/08/15 20:55:16 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop151:58260 (size: 24.1 KB, free: 366.3 MB)
20/08/15 20:55:16 INFO SparkContext: Created broadcast 0 from
20/08/15 20:55:16 INFO FileInputFormat: Total input paths to process : 1
20/08/15 20:55:16 INFO SparkContext: Starting job: processCmd at CliDriver.java:376
20/08/15 20:55:16 INFO DAGScheduler: Got job 0 (processCmd at CliDriver.java:376) with 2 output partitions
20/08/15 20:55:16 INFO DAGScheduler: Final stage: ResultStage 0 (processCmd at CliDriver.java:376)
20/08/15 20:55:16 INFO DAGScheduler: Parents of final stage: List()
20/08/15 20:55:16 INFO DAGScheduler: Missing parents: List()
20/08/15 20:55:16 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[4] at processCmd at CliDriver.java:376), which has no missing parents
20/08/15 20:55:16 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 8.3 KB, free 366.0 MB)
20/08/15 20:55:16 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.5 KB, free 366.0 MB)
20/08/15 20:55:16 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on hadoop151:58260 (size: 4.5 KB, free: 366.3 MB)
20/08/15 20:55:16 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1163
20/08/15 20:55:16 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[4] at processCmd at CliDriver.java:376) (first 15 tasks are for partitions Vector(0, 1))
20/08/15 20:55:16 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
20/08/15 20:55:16 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 172.16.47.153, executor 1, partition 0, ANY, 7920 bytes)
20/08/15 20:55:16 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 172.16.47.151, executor 0, partition 1, ANY, 7920 bytes)
20/08/15 20:55:16 INFO ContextCleaner: Cleaned accumulator 0
20/08/15 20:55:16 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.16.47.153:39358 (size: 4.5 KB, free: 366.3 MB)
20/08/15 20:55:16 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.16.47.151:55633 (size: 4.5 KB, free: 366.3 MB)
20/08/15 20:55:17 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.16.47.153:39358 (size: 24.1 KB, free: 366.3 MB)
20/08/15 20:55:17 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.16.47.151:55633 (size: 24.1 KB, free: 366.3 MB)
20/08/15 20:55:18 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 2268 ms on 172.16.47.151 (executor 0) (1/2)
20/08/15 20:55:19 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2611 ms on 172.16.47.153 (executor 1) (2/2)
20/08/15 20:55:19 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
20/08/15 20:55:19 INFO DAGScheduler: ResultStage 0 (processCmd at CliDriver.java:376) finished in 2.674 s
20/08/15 20:55:19 INFO DAGScheduler: Job 0 finished: processCmd at CliDriver.java:376, took 2.748378 s
1001 01 90
1001 02 90
1001 03 90
1002 01 85
1002 02 85
1002 03 70
1003 01 70
1003 02 70
1003 03 85
Time taken: 3.561 seconds, Fetched 9 row(s)
20/08/15 20:55:19 INFO SparkSQLCLIDriver: Time taken: 3.561 seconds, Fetched 9 row(s)
The same Hive access also works from application code (for example, from an IDE).
1. Add the dependencies
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>sparkstudy</artifactId>
<packaging>pom</packaging>
<version>1.0-SNAPSHOT</version>
<modules>
<module>spark</module>
</modules>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>2.4.5</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>2.4.5</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.12</artifactId>
<version>2.4.5</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>2.4.5</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.27</version>
</dependency>
</dependencies>
<build>
<plugins>
<!-- 该插件用于将Scala代码编译成class文件 -->
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<!-- 声明绑定到maven的compile阶段 -->
<goals>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.0.0</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
2. Copy the hive-site.xml file into the project's resources directory.
3. Write the code
package spark.sparksql.hive

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * Spark SQL on Hive: read a Hive table through a SparkSession with Hive support enabled.
 */
object SparkSqlOnHive {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("spark sql on hive")
      .setMaster("local[*]")
      .set("spark.driver.host", "localhost")

    // enableHiveSupport() makes the session use the Hive metastore configured in hive-site.xml
    val spark = SparkSession
      .builder()
      .enableHiveSupport()
      .config(conf)
      .getOrCreate()

    spark.sql("select * from score").show()

    spark.stop()
  }
}
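With a Hive-enabled session, results can also be aggregated and written back to Hive. A minimal sketch of lines that could be added to the main method above; the aggregation and the output table name score_avg are assumptions for illustration, not part of the original example:

import org.apache.spark.sql.functions.avg

// Average score per uid, computed with the DataFrame API on the Hive table
val avgDF = spark.table("score")
  .groupBy("uid")
  .agg(avg("score").as("avg_score"))

// Persist the result back to Hive (score_avg is an assumed table name)
avgDF.write.mode("overwrite").saveAsTable("score_avg")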