Spark SQL provides powerful facilities for connecting to MySQL: reading data, writing data, and running SQL queries against it. The sections below walk through each interaction, with complete code examples.
Install the MySQL JDBC driver
Download mysql-connector-java and add the JAR file to Spark's classpath.
Start a SparkSession
When creating the SparkSession, point the config parameter at the driver JAR:

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("SparkMySQLIntegration") \
    .config("spark.jars", "/path/to/mysql-connector-java-*.jar") \
    .getOrCreate()
```
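If copying JARs around is inconvenient, Spark can instead resolve the driver from Maven Central at startup via spark.jars.packages. A minimal sketch; the coordinates and version below are an assumption, match them to your MySQL server:

```python
from pyspark.sql import SparkSession

# Assumed coordinates: let Spark download the driver from Maven Central.
spark = SparkSession.builder \
    .appName("SparkMySQLIntegration") \
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.33") \
    .getOrCreate()
```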
Read data from MySQL
Use the spark.read.jdbc() method to read a MySQL table:

```python
# MySQL connection settings
url = "jdbc:mysql://localhost:3306/mydatabase"  # JDBC connection URL
properties = {
    "user": "username",
    "password": "password",
    "driver": "com.mysql.cj.jdbc.Driver"  # MySQL driver class
}
# Read the whole table
df = spark.read.jdbc(url=url, table="mytable", properties=properties)

# Push a filter down as a subquery (recommended: less data transferred)
query = "(SELECT * FROM mytable WHERE category = 'electronics') AS subquery"
df = spark.read.jdbc(url=url, table=query, properties=properties)

# Inspect the result
df.show()
df.printSchema()
```
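For large tables, a single-connection read becomes the bottleneck. spark.read.jdbc() also accepts partitioning arguments that split the scan across parallel connections; a sketch, assuming mytable has a numeric id column spanning roughly 1 to 1,000,000:

```python
# Parallel read: Spark issues one range query per partition,
# slicing id values [1, 1000000) into 8 non-overlapping chunks.
df = spark.read.jdbc(
    url=url,
    table="mytable",
    column="id",          # assumed numeric partition column
    lowerBound=1,
    upperBound=1000000,
    numPartitions=8,
    properties=properties
)
print(df.rdd.getNumPartitions())  # 8
```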
Write data to MySQL
Use the DataFrame.write.jdbc() method to write a DataFrame into MySQL:

```python
# Example DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

# Connection settings for the write
url = "jdbc:mysql://localhost:3306/mydatabase"
properties = {
    "user": "username",
    "password": "password",
    "driver": "com.mysql.cj.jdbc.Driver"
}
# Save modes: overwrite (replace existing data), append (add rows),
# ignore (no-op if the table exists), error (default: fail if the table exists)
df.write.jdbc(
    url=url,
    table="new_table",
    mode="overwrite",
    properties=properties
)
```
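Note that mode="overwrite" drops and recreates the table by default, discarding any indexes and constraints defined in MySQL. If the table should keep its schema, Spark's JDBC truncate option issues TRUNCATE TABLE instead; a sketch using the options API:

```python
# TRUNCATE instead of DROP + CREATE: the table's indexes,
# constraints, and column types are preserved.
df.write \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", "new_table") \
    .option("user", properties["user"]) \
    .option("password", properties["password"]) \
    .option("driver", properties["driver"]) \
    .option("truncate", "true") \
    .mode("overwrite") \
    .save()
```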
Tuning write performance
The batchsize parameter controls how many rows are written per batch:

```python
properties = {
    "user": "username",
    "password": "password",
    "driver": "com.mysql.cj.jdbc.Driver",
    "batchsize": "1000"  # 1000 rows per batch; property values must be strings
}
```
Use repartition() to increase write parallelism:

```python
df.repartition(10).write.jdbc(
    url=url,
    table="new_table",
    mode="append",
    properties=properties
)
```
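repartition(10) opens ten concurrent JDBC connections, which can overload a small MySQL instance. The opposite knob is Spark's JDBC numPartitions option, which coalesces the DataFrame before writing; a sketch:

```python
# Cap write parallelism: at most 4 concurrent connections to MySQL.
write_properties = dict(properties, numPartitions="4")
df.write.jdbc(
    url=url,
    table="new_table",
    mode="append",
    properties=write_properties
)
```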
The isolationLevel parameter reduces transaction overhead:

```python
properties = {
    "user": "username",
    "password": "password",
    "driver": "com.mysql.cj.jdbc.Driver",
    "isolationLevel": "NONE"  # no transaction isolation, less overhead per batch
}
```
Register a DataFrame as a temporary view

```python
df.createOrReplaceTempView("temp_table")
result = spark.sql("SELECT name, age FROM temp_table WHERE age > 30")
result.show()
```
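Once registered, a MySQL-sourced view can be joined with any other DataFrame in plain SQL. A sketch, where orders_df is a hypothetical second DataFrame (read from MySQL or elsewhere) with order_id and customer_name columns:

```python
# orders_df is assumed to exist; register it alongside temp_table.
orders_df.createOrReplaceTempView("orders")

result = spark.sql("""
    SELECT t.name, COUNT(o.order_id) AS order_count
    FROM temp_table t
    JOIN orders o ON o.customer_name = t.name
    GROUP BY t.name
""")
result.show()
```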
Run native SQL queries

```python
# spark.read.jdbc pushes the query down to MySQL via a subquery alias
query = "SELECT * FROM mytable WHERE category = 'books'"
result = spark.read.jdbc(
    url=url,
    table=f"({query}) AS subquery",
    properties=properties
)
result.show()
```
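On Spark 2.4 and later, the same pushdown can be expressed with the JDBC query option, which takes the SQL statement directly and spares you the "(...) AS subquery" wrapping:

```python
# The query is sent to MySQL as-is; no subquery alias needed.
result = spark.read.format("jdbc") \
    .option("url", url) \
    .option("query", "SELECT * FROM mytable WHERE category = 'books'") \
    .option("user", properties["user"]) \
    .option("password", properties["password"]) \
    .option("driver", properties["driver"]) \
    .load()
result.show()
```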
Data type mapping
Note the type mapping between Spark and MySQL:
| Spark data type | MySQL data type |
|---|---|
| StringType | VARCHAR, TEXT |
| IntegerType | INT |
| LongType | BIGINT |
| DoubleType | DOUBLE |
| BooleanType | TINYINT(1) |
| TimestampType | DATETIME, TIMESTAMP |
| DateType | DATE |
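One practical consequence: when Spark creates the target table, StringType becomes TEXT, which MySQL cannot index without a prefix length. The createTableColumnTypes option overrides the generated DDL; a sketch assuming the name/age DataFrame from above:

```python
# Override generated column types; unlisted columns keep the defaults.
df.write \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", "new_table") \
    .option("user", properties["user"]) \
    .option("password", properties["password"]) \
    .option("driver", properties["driver"]) \
    .option("createTableColumnTypes", "name VARCHAR(64), age INT") \
    .mode("overwrite") \
    .save()
```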
Complete example

```python
from pyspark.sql import SparkSession

# Create the SparkSession
spark = SparkSession.builder \
    .appName("SparkMySQLExample") \
    .config("spark.jars", "/path/to/mysql-connector-java-8.0.26.jar") \
    .getOrCreate()
# MySQL connection settings
url = "jdbc:mysql://localhost:3306/mydatabase"
properties = {
    "user": "root",
    "password": "password",
    "driver": "com.mysql.cj.jdbc.Driver"
}
# Read the source table
df = spark.read.jdbc(url=url, table="products", properties=properties)
df.show()

# Transform: keep items over 100 and add a discounted price
from pyspark.sql.functions import col
df_filtered = df.filter(col("price") > 100)
df_new = df_filtered.withColumn("discounted_price", col("price") * 0.9)

# Write the result to a new table
df_new.write.jdbc(
    url=url,
    table="discounted_products",
    mode="overwrite",
    properties=properties
)

# Query the result with SQL
df_new.createOrReplaceTempView("temp_products")
result = spark.sql("SELECT category, AVG(discounted_price) FROM temp_products GROUP BY category")
result.show()

# Stop the SparkSession
spark.stop()
```
Common issues

- ClassNotFoundException: the MySQL driver JAR is not on the classpath; check the spark.jars setting.
- Data skew: adjust the partition count with repartition() or coalesce().
- Slow writes: increase the batchsize parameter and cut transaction overhead (isolationLevel=NONE).
- Connection timeouts: raise the connectTimeout parameter:

```python
properties = {
    "user": "username",
    "password": "password",
    "driver": "com.mysql.cj.jdbc.Driver",
    "connectTimeout": "300000"  # connection timeout in milliseconds (here 300 s)
}
```
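If connections drop mid-query rather than while connecting, Connector/J's socketTimeout (also in milliseconds) is the relevant setting; a sketch combining both, with values that are assumptions to tune for your workload:

```python
properties = {
    "user": "username",
    "password": "password",
    "driver": "com.mysql.cj.jdbc.Driver",
    "connectTimeout": "300000",  # ms to establish the connection
    "socketTimeout": "600000"    # ms to wait on each network read (assumed value)
}
```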