Below is a complete guide showing how to consume a data stream from Kafka with PySpark and turn it into a table you can query with SQL.
Prerequisites: make sure PySpark is installed and the Kafka connector package (spark-sql-kafka-0-10) is available; it is pulled in below via spark.jars.packages.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, expr
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Initialize the SparkSession with Kafka support enabled
spark = SparkSession.builder \
    .appName("KafkaToSparkSQL") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0") \
    .getOrCreate()
# Define the schema of the data (adjust to your actual message structure)
schema = StructType([
    StructField("user_id", StringType()),
    StructField("item_id", StringType()),
    StructField("price", IntegerType()),
    StructField("timestamp", StringType())
])
# 1. Read the data stream from Kafka
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "your_topic_name") \
    .option("startingOffsets", "latest") \
    .load()
# 2. Cast the Kafka value from binary to string, then parse the JSON
parsed_df = kafka_df \
    .selectExpr("CAST(value AS STRING)") \
    .select(from_json(col("value"), schema).alias("data")) \
    .select("data.*")
# 3. Register each micro-batch as a temporary view and run SQL on it
def process_batch(df, epoch_id):
    # Register the micro-batch as a temporary view
    df.createOrReplaceTempView("kafka_stream_table")

    # Run a SQL query against it
    result_df = spark.sql("""
        SELECT
            user_id,
            item_id,
            price,
            timestamp,
            COUNT(*) OVER (PARTITION BY user_id) as user_purchase_count
        FROM kafka_stream_table
        WHERE price > 100
    """)

    # Output the result (swap in another sink as needed)
    result_df.show(truncate=False)
# 4. Start the stream processing
query = parsed_df.writeStream \
    .foreachBatch(process_batch) \
    .outputMode("update") \
    .start()

# 5. Wait for termination
query.awaitTermination()
.option("kafka.bootstrap.servers", "localhost:9092") # Kafka broker地址
.option("subscribe", "your_topic_name") # 订阅的topic
.option("startingOffsets", "latest") # 从最新offset开始
foreachBatch lets you register the DataFrame of each micro-batch as a temporary view and run ordinary batch SQL on it. .outputMode("update") emits only the rows that changed since the last trigger; the other modes are:
- append: only newly added rows are emitted
- complete: the full result is emitted on every trigger (used for aggregations)
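As a minimal sketch of how foreachBatch can also reuse batch sinks (the output path below is a placeholder, not from the guide), each micro-batch can be written out with the ordinary batch writer:

# Persist each filtered micro-batch with the batch writer inside foreachBatch
def save_batch(df, epoch_id):
    df.createOrReplaceTempView("kafka_stream_table")
    filtered = spark.sql("SELECT * FROM kafka_stream_table WHERE price > 100")
    # Writing to an epoch-specific path with overwrite keeps retried batches idempotent
    filtered.write.mode("overwrite").parquet(f"/tmp/kafka_filtered/epoch={epoch_id}")

query = parsed_df.writeStream \
    .foreachBatch(save_batch) \
    .outputMode("update") \
    .start()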
You can also enrich the stream by joining it against static reference data:
# Assume there is a static user_profile table
user_profile_df = spark.read.parquet("hdfs://path/to/user_profiles")
user_profile_df.createOrReplaceTempView("user_profiles")
# Inside the process_batch function you can join like this
result_df = spark.sql("""
    SELECT
        k.user_id,
        u.user_name,
        k.item_id,
        k.price
    FROM kafka_stream_table k
    JOIN user_profiles u ON k.user_id = u.user_id
""")
# Windowed aggregation (timestamp is a string in the schema, so cast it first)
result_df = spark.sql("""
    SELECT
        user_id,
        window(CAST(timestamp AS TIMESTAMP), '5 minutes') as window,
        SUM(price) as total_spent,
        COUNT(*) as purchase_count
    FROM kafka_stream_table
    GROUP BY user_id, window(CAST(timestamp AS TIMESTAMP), '5 minutes')
""")
# Write to a metastore table (e.g., a Hive table). format("hive") is not a supported
# streaming sink, so use toTable (Spark 3.1+); the table name here is a placeholder
query = result_df.writeStream \
    .outputMode("complete") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .toTable("your_hive_table")
# Or write back to Kafka (the Kafka sink expects a string or binary "value" column)
query = result_df.selectExpr("to_json(struct(*)) AS value") \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "output_topic") \
    .option("checkpointLocation", "/path/to/checkpoint_kafka") \
    .start()
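For quick checks during development, a console sink is often the simplest option (a sketch; truncation is just disabled so long rows stay readable):

# Print each micro-batch to stdout while developing
debug_query = parsed_df.writeStream \
    .format("console") \
    .outputMode("append") \
    .option("truncate", "false") \
    .start()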
Performance tuning:
- .trigger(processingTime='10 seconds'): process one micro-batch every 10 seconds
- .option("maxOffsetsPerTrigger", 10000): read at most 10,000 offsets per micro-batch (set on the Kafka source)
- tune the spark.sql.shuffle.partitions setting so aggregations and joins match your cluster size
Hope this complete guide helps you get streaming from Kafka into Spark SQL working!
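As a sketch of where these knobs go (the values are illustrative), maxOffsetsPerTrigger belongs on the Kafka source while the trigger is set on the writer:

# Rate-limit the source, fix the micro-batch cadence, and shrink shuffle partitions
spark.conf.set("spark.sql.shuffle.partitions", "8")  # illustrative value

kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "your_topic_name") \
    .option("maxOffsetsPerTrigger", 10000) \
    .load()

query = parsed_df.writeStream \
    .foreachBatch(process_batch) \
    .trigger(processingTime="10 seconds") \
    .start()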