SparkSQL (1): Integrating SparkSQL with Hive

I. How SparkSQL Gets Hive Metadata

SparkSQL needs the metadata of Hive tables. If a Hive metastore service is configured (i.e. hive.metastore.uris is set), SparkSQL fetches the metadata directly from that service; otherwise it has to read it from the MySQL database that backs the metastore.
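Either way, the entry point on the Spark side is a Hive-enabled SparkSession. A minimal sketch, assuming Spark 2.x built with Hive support and hive-site.xml on the classpath (the app name is a placeholder):

import org.apache.spark.sql.SparkSession

// With hive-site.xml visible to Spark, this session resolves table metadata
// through the metastore service if one is configured, or straight from the
// MySQL-backed metastore otherwise.
val spark = SparkSession.builder()
  .appName("SparkSQLHiveDemo")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()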

II. Configuring the Hive Integration

1. Copy or symlink Hive's hive-site.xml into Spark's conf directory

cd /opt/modules/spark-2.1.0-bin-2.7.3/conf/

ln -s /opt/modules/hive-1.2.1/conf/hive-site.xml

2. The Hive configuration file hive-site.xml


			
		
<configuration>

  <property>
    <name>hive.exec.mode.local.auto</name>
    <value>true</value>
    <description>Let Hive determine whether to run in local mode automatically</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://bigdata.ibeifeng.com:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>

  <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
    <description>Whether to include the current database in the Hive prompt.</description>
  </property>

  <property>
    <name>hive.cli.print.header</name>
    <value>true</value>
    <description>Whether to print the names of the columns in query output.</description>
  </property>

  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://bigdata.ibeifeng.com:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>

  <property>
    <name>hive.exec.max.dynamic.partitions.pernode</name>
    <value>100</value>
    <description>Maximum number of dynamic partitions allowed to be created in each mapper/reducer node.</description>
  </property>

  <property>
    <name>hive.exec.max.dynamic.partitions</name>
    <value>1000</value>
    <description>Maximum number of dynamic partitions allowed to be created in total.</description>
  </property>

  <property>
    <name>hive.exec.dynamic.partition</name>
    <value>true</value>
    <description>Whether or not to allow dynamic partitions in DML/DDL.</description>
  </property>

  <property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
    <description>In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions.</description>
  </property>

  <property>
    <name>hive.support.sql11.reserved.keywords</name>
    <value>false</value>
  </property>

  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>bigdata.ibeifeng.com</value>
  </property>

  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>bigdata.ibeifeng.com</value>
  </property>

</configuration>
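The dynamic-partition settings above come into play when SparkSQL writes into partitioned Hive tables. A minimal sketch, assuming a hypothetical table emp_part partitioned by deptno, run from a Hive-enabled session as in the example in Section I:

// Hypothetical partitioned table for illustration.
spark.sql("create table if not exists emp_part (empno int, ename string, sal double) partitioned by (deptno int)")

// Dynamic partitioning: the deptno values produced by the SELECT decide which
// partitions get written; this requires hive.exec.dynamic.partition=true and
// hive.exec.dynamic.partition.mode=nonstrict from the configuration above.
spark.sql("insert overwrite table emp_part partition (deptno) select empno, ename, sal, deptno from default.emp")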

3. Choose the integration mode according to the value of hive.metastore.uris in hive-site.xml

(1) If the parameter is not set (the default)

Add the JDBC driver jar of the metastore database (MySQL here) to Spark's classpath, e.g. via the --driver-class-path option of spark-shell/spark-sql; that alone completes the Spark-Hive integration, because Spark then reads the metadata from MySQL directly.

(2) If the address of a metastore service is given (non-empty value; this is the more common setup)

    (a) Start the Hive metastore service (from the Hive home directory /opt/modules/hive-1.2.1):

bin/hive --service metastore &

    (b) With hive-site.xml already in Spark's conf directory, Spark connects to the running metastore service and the integration is complete, as verified by the tests in Section III below.

III. Testing the SparkSQL-Hive Integration

1. Create tables in Hive and load test data

[Employee table]
create table emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)
row format delimited fields terminated by '\t';


load data local inpath '/opt/datas/emp.txt' into table emp;
To reload the data and overwrite the existing contents, use overwrite:

load data local inpath '/opt/datas/emp.txt'  overwrite into table emp;


[Department table]
create table dept(
deptno int,
dname string,
loc string
)
row format delimited fields terminated by '\t';

load data local inpath '/opt/datas/dept.txt' into table dept;

2. Test the integration in two ways

(1) Testing with spark-sql

  -> Start it:

 ./bin/spark-sql

  -> Run a few statements:

spark-sql (default)> select * from default.emp;
spark-sql (default)> select * from default.emp a join default.dept b on a.deptno = b.deptno; 
spark-sql (default)> explain select * from default.emp a join default.dept b on a.deptno = b.deptno;

(2) Testing with spark-shell

  -> Start it:

./bin/spark-shell

  -> Run a few statements:

scala> spark.sqlContext
scala> spark.sqlContext.sql("select * from default.emp a join default.dept b on a.deptno = b.deptno").show()

Result:
+-----+------+---------+----+----------+------+------+------+------+----------+--------+
|empno| ename|      job| mgr|  hiredate|   sal|  comm|deptno|deptno|     dname|     loc|
+-----+------+---------+----+----------+------+------+------+------+----------+--------+
| 7369| SMITH|    CLERK|7902|1980-12-17| 800.0|  null|    20|    20|  RESEARCH|  DALLAS|
| 7499| ALLEN| SALESMAN|7698| 1981-2-20|1600.0| 300.0|    30|    30|     SALES| CHICAGO|
| 7521|  WARD| SALESMAN|7698| 1981-2-22|1250.0| 500.0|    30|    30|     SALES| CHICAGO|
| 7566| JONES|  MANAGER|7839|  1981-4-2|2975.0|  null|    20|    20|  RESEARCH|  DALLAS|
| 7654|MARTIN| SALESMAN|7698| 1981-9-28|1250.0|1400.0|    30|    30|     SALES| CHICAGO|
| 7698| BLAKE|  MANAGER|7839|  1981-5-1|2850.0|  null|    30|    30|     SALES| CHICAGO|
| 7782| CLARK|  MANAGER|7839|  1981-6-9|2450.0|  null|    10|    10|ACCOUNTING|NEW YORK|
| 7788| SCOTT|  ANALYST|7566| 1987-4-19|3000.0|  null|    20|    20|  RESEARCH|  DALLAS|
| 7839|  KING|PRESIDENT|null|1981-11-17|5000.0|  null|    10|    10|ACCOUNTING|NEW YORK|
| 7844|TURNER| SALESMAN|7698|  1981-9-8|1500.0|   0.0|    30|    30|     SALES| CHICAGO|
| 7876| ADAMS|    CLERK|7788| 1987-5-23|1100.0|  null|    20|    20|  RESEARCH|  DALLAS|
| 7900| JAMES|    CLERK|7698| 1981-12-3| 950.0|  null|    30|    30|     SALES| CHICAGO|
| 7902|  FORD|  ANALYST|7566| 1981-12-3|3000.0|  null|    20|    20|  RESEARCH|  DALLAS|
| 7934|MILLER|    CLERK|7782| 1982-1-23|1300.0|  null|    10|    10|ACCOUNTING|NEW YORK|
+-----+------+---------+----+----------+------+------+------+------+----------+--------+

IV. Implementing It in Code

1. For a full programmatic walk-through, see this blog post: https://blog.csdn.net/u010886217/article/details/82916520. Note that when a client outside the cluster's network accesses the servers, HDFS must be configured so that DataNodes are addressed by hostname (typically dfs.client.use.datanode.hostname=true on the client side); otherwise the data cannot be reached. A sketch of the programmatic route follows below.
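A minimal standalone sketch, assuming the metastore service from Section II is running; the app name and master are placeholders, and the explicit hive.metastore.uris setting is only needed when hive-site.xml is not on the application's classpath:

import org.apache.spark.sql.SparkSession

object SparkSQLHiveApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSQLHiveApp")
      .master("local[2]")
      // Point the session at the metastore service configured in hive-site.xml.
      .config("hive.metastore.uris", "thrift://bigdata.ibeifeng.com:9083")
      .enableHiveSupport()
      .getOrCreate()

    // Same join as in the spark-shell test above.
    spark.sql("select * from default.emp a join default.dept b on a.deptno = b.deptno").show()

    spark.stop()
  }
}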

 
