Goals:
1. Connect to MySQL via JDBC, read a MySQL table, and load it as a DataFrame
2. Run DSL and SQL queries against the DataFrame
3. Perform a join query across two tables
4. Save a DataFrame back to MySQL as a new table
Spark ships with a reference example at:
/examples/src/.../sql/SQLDataSourceExample.scala
JAR package:
The JDBC driver is mysql-connector-java-5.1.47.jar.
You can drop it into Spark's jars directory (the approach used here), or pass it to spark-submit at launch time.
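For the spark-submit route, the driver jar can be attached with --jars; a sketch (the paths and application jar name are placeholders):

spark-submit \
  --class mysql \
  --jars /path/to/mysql-connector-java-5.1.47.jar \
  my-app.jar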
Preparing the MySQL data:
[root@hadoop01 ~]# mysql -u root -p
mysql> create database MyDB;
mysql> use MyDB;
The first table, StudentInfo:
mysql> create table StudentInfo(ID int(20),Name varchar(20),Gender char(1),birthday date);
mysql> insert into StudentInfo values(1,"A","F","1996-09-12"),(2,"b","M","1995-12-23"),(3,"C","M","1996-10-29"),(4,"D","M","1995-02-25"),(5,"E","F","1997-06-06");
mysql> select * from StudentInfo;
+------+------+--------+------------+
| ID | Name | Gender | birthday |
+------+------+--------+------------+
| 1 | A | F | 1996-09-12 |
| 2 | b | M | 1995-12-23 |
| 3 | C | M | 1996-10-29 |
| 4 | D | M | 1995-02-25 |
| 5 | E | F | 1997-06-06 |
+------+------+--------+------------+
5 rows in set (0.00 sec)
The second table, Score:
mysql> create table Score(ID int(20),Name varchar(20),score float(10));
mysql> insert into Score values(1,"A",91),(2,"B",87),(5,"E",88),(9,"H",89),(10,"P",97);
mysql> select * from Score;
+------+------+-------+
| ID | Name | score |
+------+------+-------+
| 1 | A | 91 |
| 2 | B | 87 |
| 5 | E | 88 |
| 9 | H | 89 |
| 10 | P | 97 |
+------+------+-------+
5 rows in set (0.03 sec)
Working with the MySQL table:
Steps:
1. Create a SparkSession
2. Connect via JDBC and load the table
3. Query it in the DSL style
import org.apache.spark.sql.SparkSession

object mysql {
  def main(args: Array[String]): Unit = {
    // Since Spark 2.0, SparkSession replaces SQLContext and HiveContext
    val spark = SparkSession.builder()
      .appName("mysql_test")
      .master("local")
      .getOrCreate()
    // Import the implicit conversions; toDF, the $"col" column syntax, etc. rely on this
    import spark.implicits._
    // Connect via JDBC with the options below and load the table
    val mysqlDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://127.0.0.1:3306/MyDB") // JDBC connection URL
      .option("driver", "com.mysql.jdbc.Driver")         // driver class
      .option("dbtable", "StudentInfo")                  // student info table
      .option("user", "root")                            // username
      .option("password", "root")                        // password
      .load()
    // Sanity check
    mysqlDF.show()
    // A more involved DSL query
    mysqlDF.select("Name", "Gender", "ID")
      .where("Gender = 'M'")
      .filter("ID > 1")
      .sort($"ID".desc)
      .limit(3)
      .show()
  }
}
Results:
mysqlDF.show() prints the full StudentInfo table as loaded from MySQL. The chained DSL query returns the male students with ID > 1, sorted by ID descending and capped at 3 rows: (D, M, 4), (C, M, 3), (b, M, 2).
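Loading an entire table is not the only option: the dbtable option also accepts a parenthesized subquery, so filtering can be pushed down to MySQL itself. A sketch under the same connection settings (the alias t is arbitrary; MySQL requires one for a derived table):

// hypothetical variation: let MySQL do the filtering before Spark sees the data
val maleDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://127.0.0.1:3306/MyDB")
  .option("driver", "com.mysql.jdbc.Driver")
  // a subquery in parentheses works anywhere a table name does
  .option("dbtable", "(select * from StudentInfo where Gender = 'M') as t")
  .option("user", "root")
  .option("password", "root")
  .load()
maleDF.show()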
SQL statements:
To query with SQL, the DataFrame must first be registered as a temporary view (scoreDF below is the Score table, loaded the same way as StudentInfo):
scoreDF.createOrReplaceTempView("scoreTable")
spark.sql("select * from scoreTable").show
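Once the view exists, any standard SQL works against it; a couple of hypothetical queries over the Score columns shown above:

// filter and order, in pure SQL
spark.sql("select Name, score from scoreTable where score > 88 order by score desc").show()
// aggregate over the whole table
spark.sql("select avg(score) as avg_score from scoreTable").show()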
Join queries:
On inner vs. outer joins: https://blog.csdn.net/coding_hello/article/details/75452436
This is straightforward: load a second table exactly as above, then join the two.
Since mysqlDF is a poor name once there are two tables, it is renamed studentDF below.
Basic form: DF1.join(DF2, "colName").show()
Because studentDF and scoreDF share two columns, wrap the column names in a Seq:
studentDF.join(scoreDF, Seq("ID", "Name")).select("*").show
Note that Spark compares strings case-sensitively, so the row (2, b) in StudentInfo will not match (2, B) in Score, even though MySQL's default collation would treat them as equal.
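The three-argument overload of join selects the join type; a sketch that keeps unmatched students ("inner", "right_outer", "full_outer", etc. are the other accepted values):

// keep every student, with nulls where no Score row matches
studentDF.join(scoreDF, Seq("ID", "Name"), "left_outer").show()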
Full code and result:
import org.apache.spark.sql.SparkSession

object mysql {
  def main(args: Array[String]): Unit = {
    // Since Spark 2.0, SparkSession replaces SQLContext and HiveContext
    val spark = SparkSession.builder()
      .appName("mysql_test")
      .master("local")
      .getOrCreate()
    // Import the implicit conversions
    import spark.implicits._
    // Connect via JDBC and load both tables
    val studentDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://127.0.0.1:3306/MyDB") // JDBC connection URL
      .option("driver", "com.mysql.jdbc.Driver")         // driver class
      .option("dbtable", "StudentInfo")                  // student info table
      .option("user", "root")                            // username
      .option("password", "root")                        // password
      .load()
    val scoreDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://127.0.0.1:3306/MyDB")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", "Score")                        // score table
      .option("user", "root")
      .option("password", "root")
      .load()
    // Join on the two shared columns
    studentDF.join(scoreDF, Seq("ID", "Name")).select("Name", "score").show
  }
}
Saving a DataFrame as a table in MySQL:
import java.util.Properties
// a Properties object holds the connection credentials
val prop = new Properties()
prop.put("user", "root")
prop.put("password", "root")
// write scoreDF out as a new table named score_1
scoreDF.write.jdbc("jdbc:mysql://127.0.0.1:3306/MyDB", "score_1", prop)
Check in MySQL:
mysql> select * from score_1;
+------+------+-------+
| ID | Name | score |
+------+------+-------+
| 1 | A | 91 |
| 2 | B | 87 |
| 5 | E | 88 |
| 9 | H | 89 |
| 10 | P | 97 |
+------+------+-------+
5 rows in set (0.00 sec)
You can also write with the same format-based API used for loading (note this snippet's example URL is for PostgreSQL; for MySQL, use the URL and driver shown earlier):
jdbcDF.write
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .save()
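One caveat: by default the writer fails if the target table already exists. The save mode controls this; a sketch reusing the prop object from above:

import org.apache.spark.sql.SaveMode

// append rows to score_1 instead of erroring because the table exists;
// SaveMode.Overwrite, ErrorIfExists, and Ignore are the other options
scoreDF.write
  .mode(SaveMode.Append)
  .jdbc("jdbc:mysql://127.0.0.1:3306/MyDB", "score_1", prop)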