Broadly speaking, this means doing Spark stream processing through SQL. Structured Streaming is a scalable and fault-tolerant stream processing engine built on top of Spark SQL. It lets users express a streaming computation the same way they would express a batch computation on static data. Under the hood, Structured Streaming uses the Spark SQL engine to run the computation incrementally and continuously, updating the final result as streaming data keeps arriving. With the Dataset/DataFrame API, users can cover the common streaming needs: aggregations, event-time windows, stream-to-batch and stream-to-stream joins, and so on. Through checkpointing and Write-Ahead Logs, Structured Streaming achieves end-to-end, exactly-once fault-tolerance semantics. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without requiring much intervention from the user.
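Stream-to-stream joins are mentioned above but not shown later in this article, so here is a minimal sketch for reference. The topic names impressions and clicks are made up for illustration; the broker address reuses the CentOS:9092 broker and the spark-sql-kafka dependency introduced below.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr
object StreamStreamJoin {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("streamStreamJoin")
.master("local[6]")
.getOrCreate()
spark.sparkContext.setLogLevel("FATAL")
// two streaming DataFrames read from two (hypothetical) Kafka topics
val impressions = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "CentOS:9092")
.option("subscribe", "impressions")
.load()
.selectExpr("CAST(value AS STRING) AS adId", "timestamp AS impressionTime")
val clicks = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "CentOS:9092")
.option("subscribe", "clicks")
.load()
.selectExpr("CAST(value AS STRING) AS clickAdId", "timestamp AS clickTime")
// stream-to-stream inner join on the ad id; stream-stream joins require the Append output mode
val joined = impressions.join(clicks, expr("adId = clickAdId"))
joined.writeStream
.outputMode("append")
.format("console")
.start()
.awaitTermination()
}
}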
By default the Structured Streaming engine processes data as micro-batches (the same model as DStreams), with end-to-end latencies of around 100ms. Besides the default, Spark offers other processing models to choose from: fixed-interval micro-batches, one-time micro-batches, and Continuous Processing with latencies of around 1ms (experimental), as sketched below.
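The processing model is chosen per query through its trigger. A minimal sketch, assuming the same socket source used in the word-count example below; only one trigger would be used at a time.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
object TriggerDemo {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("triggerDemo")
.master("local[6]")
.getOrCreate()
spark.sparkContext.setLogLevel("FATAL")
val lines = spark.readStream
.format("socket")
.option("host", "CentOS")
.option("port", 9999)
.load()
val query = lines.writeStream
.format("console")
// default (no trigger call): the next micro-batch starts as soon as the previous one finishes
.trigger(Trigger.ProcessingTime("5 seconds")) // fixed-interval micro-batches
// .trigger(Trigger.Once())                   // one-time micro-batch, then the query stops
// .trigger(Trigger.Continuous("1 second"))   // experimental continuous mode (~1ms latency, restricted operations)
.start()
query.awaitTermination()
}
}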
<properties>
    <spark.version>2.4.3</spark.version>
    <scala.version>2.11</scala.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>
package com.hw.demo01
import org.apache.spark.sql.SparkSession
/**
 * @author fql
 * @date 2019/10/10 18:31
 * Word count example
 */
object StructurestreamWordCount {
def main(args: Array[String]): Unit = {
//1. Create the SparkSession
val spark = SparkSession.builder()
.appName("wordCount")
.master("local[6]")
.getOrCreate()
spark.sparkContext.setLogLevel("FATAL")
import spark.implicits._
//2. Create a DataFrame from a streaming (socket) source
val lines = spark.readStream
.format("socket") //指定方式
.option("host", "CentOS") //指定主机名
.option("port", 9999) //指定端口号
.load()
//3. Apply SQL/Dataset operations (windows etc. are covered later)
val wordCounts = lines.as[String].flatMap(_.split("\\s+"))
.groupBy("value").count()
//4. Build the StreamingQuery and write the result out
val query = wordCounts.writeStream
.outputMode("complete") //表示全量输出,等价于Storm的updateStateByKey
.format("console") //输出到控制台
.start()
//5. Block until the query terminates
query.awaitTermination()
}
}
Batch: 0
-------------------------------------------
+-----+-----+
|value|count|
+-----+-----+
+-----+-----+
-------------------------------------------
Batch: 1
-------------------------------------------
+------+-----+
| value|count|
+------+-----+
| you| 1|
| how| 1|
| good| 1|
| is| 1|
| are| 1|
| hahh| 1|
| a| 1|
| this| 1|
+------+-----+
-------------------------------------------
Batch: 2
-------------------------------------------
+------+-----+
| value|count|
+------+-----+
| you| 1|
| how| 1|
| good | 1|
| is| 1|
| are| 1|
| hahh| 1|
| a| 1|
| this| 5|
+------+-----+
The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended to. The input data stream is viewed as the "Input Table": every data item arriving on the stream is like a new row appended to the Input Table.
A query on the Input Table produces the "Result Table". At every trigger interval (for example, every second), new rows are appended to the Input Table and eventually update the Result Table. Whenever the Result Table is updated, we want to write the changed result rows to an external sink.
The "output" is defined as what gets written to external storage. The following output modes are supported: Complete mode (the entire updated Result Table is written out on every trigger), Append mode (only the new rows appended to the Result Table since the last trigger are written), and Update mode (only the rows updated since the last trigger are written).
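A minimal sketch of how the output mode is chosen on the writer; it mirrors the word-count query of the previous example, so the Input Table is the stream of socket lines and the Result Table is the running word count.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode
object OutputModeDemo {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("outputModeDemo")
.master("local[6]")
.getOrCreate()
spark.sparkContext.setLogLevel("FATAL")
import spark.implicits._
// Input Table: every socket line arrives as a new row
val lines = spark.readStream
.format("socket")
.option("host", "CentOS")
.option("port", 9999)
.load()
// Result Table: a running word count over the Input Table
val wordCounts = lines.as[String].flatMap(_.split("\\s+")).groupBy("value").count()
// the output mode decides which part of the Result Table is written to the sink on each trigger
wordCounts.writeStream
.outputMode(OutputMode.Complete()) // or OutputMode.Update(); Append() is rejected for this aggregation without a watermark
.format("console")
.start()
.awaitTermination()
}
}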
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_${scala.version}</artifactId>
    <version>${spark.version}</version>
</dependency>
package com.hw.demo01
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode
/**
 * @author fql
 * @date 2019/10/10 19:08
 * Kafka source example
 */
object StructuredKafkaSource {
def main(args: Array[String]): Unit = {
//1. Create the SparkSession
val spark = SparkSession.builder()
.appName("wordCount")
.master("local[5]")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
//2. Create a streaming DataFrame from Kafka
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "CentOS:9092")
.option("subscribe", "topic01")
.load()
//3. Apply SQL/Dataset operations
val wordCounts = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "partition", "offset", "CAST(timestamp AS LONG)")
.flatMap(row => row.getAs[String]("value").split("\\s+"))
.map((_, 1))
.toDF("word", "num")
.groupBy($"word")
.agg(sum($"num"))
//4. Build the StreamingQuery and write the result out
val query = wordCounts.writeStream
.outputMode(OutputMode.Update())
.format("console")
.start()
//5. Block until the query terminates
query.awaitTermination()
}
}
Batch: 0
-------------------------------------------
+----+--------+
|word|sum(num)|
+----+--------+
+----+--------+
-------------------------------------------
Batch: 1
-------------------------------------------
+----+--------+
|word|sum(num)|
+----+--------+
|thia| 1|
| ha| 3|
|this| 1|
+----+--------+
-------------------------------------------
Batch: 2
-------------------------------------------
+-----+--------+
| word|sum(num)|
+-----+--------+
|thisa| 1|
| you| 4|
| d | 1|
| have| 1|
|dream| 1|
| I| 1|
| a| 1|
| this| 4|
+-----+--------+
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
//1. Create the SparkSession
val spark = SparkSession
.builder
.appName("StructuredNetworkWordCount")
.master("local[6]")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
//2. Create a streaming DataFrame from a JSON directory (a schema must be supplied)
val schema = new StructType()
.add("id",IntegerType)
.add("name",StringType)
.add("age",IntegerType)
.add("dept",IntegerType)
val df = spark.readStream
.schema(schema)
.format("json")
.load("file:///D:/demo/json")
//3. SQL operations
// omitted
//4. Build the StreamingQuery and write the result out
val query = df.writeStream
.outputMode(OutputMode.Update())
.format("console")
.start()
//5. Block until the query terminates
query.awaitTermination()
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode
val spark = SparkSession
.builder
.appName("filesink")
.master("local[6]")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
val lines = spark.readStream
.format("socket")
.option("host", "CentOS")
.option("port", 9999)
.load()
val wordCounts=lines.as[String].flatMap(_.split("\\s+"))
.map((_,1))
.toDF("word","num")
val query = wordCounts.writeStream
.outputMode(OutputMode.Append())
.option("path", "file:///D:/write/json")
.option("checkpointLocation", "file:///D:/checkpoints") //需要指定检查点
.format("json")
.start()
query.awaitTermination()
Note: the file sink only supports Append mode, so it is generally used for data cleansing output rather than for writing aggregated (analytical) results.
package com.hw.demo01
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode
/**
 * @author fql
 * @date 2019/10/11 19:03
 * Kafka sink example
 */
object KafkaSink02 {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder
.appName("kafkaSink")
.master("local[6]")
.getOrCreate()
spark.sparkContext.setLogLevel("FATAL")
import spark.implicits._
val lines = spark.readStream
.format("socket")
.option("host", "CentOS")
.option("port", 9999)
.load()
// sample input lines (id name item cost):
// 001 zhangsan iphone 15000
// 002 lisi apple 7.8
val userCost = lines.as[String].map(_.split("\\s+"))
.map(ts => (ts(0), ts(1), ts(3).toDouble))
.toDF("id", "name", "cost")
.groupBy("id", "name")
.sum("cost")
.as[(String, String, Double)]
.map(t => (t._1, t._2 + "\t" + t._3))
.toDF("key", "value") //输出字段中必须有value String类型
val query = userCost.writeStream
.outputMode(OutputMode.Update())
.format("kafka")
.option("topic", "topic02") //指定主题
.option("kafka.bootstrap.server", "CentOS:9092")
.option("checkpointLocation", "file:///D:/checkpoint01") //设置检查点
.start()
query.awaitTermination()
}
}
Note: the Kafka sink supports the Append, Update and Complete output modes.
The foreach and foreachBatch operations let you apply arbitrary logic and write your own output handling on the result of a streaming query. Their use cases differ slightly: foreach lets you define custom write logic for every row, while foreachBatch lets you run arbitrary operations and custom logic on the output of each micro-batch.
foreachBatch(...) lets you specify a function that is executed on the output data of every micro-batch of the streaming query. It is available in Scala, Java and Python since Spark 2.4. The function takes two parameters: a DataFrame or Dataset containing the micro-batch's output data, and the unique id of the micro-batch.
[root@CentOS ~]# nc -lk 9999
0001 zhangsan iphone 15700
002 lk apple 7.8
001 zhangsan orange 13444
002 lk apple 1345
package com.hw.demo01
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.streaming.OutputMode
/**
 * @author fql
 * @date 2019/10/11 19:20
 * foreachBatch sink example
 */
object ForeachBatchSink {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder
.appName("ForeachBatchSink")
.master("local[6]")
.getOrCreate()
spark.sparkContext.setLogLevel("FATAL")
import spark.implicits._
val lines = spark.readStream
.format("socket")
.option("host", "CentOS")
.option("port", 9999)
.load()
val userCost = lines.as[String].map(_.split("\\s+"))
.map(ts => (ts(0), ts(1), ts(3).toDouble))
.toDF("id", "name", "cost")
.groupBy("id", "name") //分组依据
.sum("cost")
.as[(String, String, Double)]
.map(t => (t._1, t._2 + "\t" + t._3))
.toDF("key", "value")
val query = userCost.writeStream
.outputMode(OutputMode.Update())
.foreachBatch((ds: Dataset[Row], batchId: Long) => {
ds.show()
})
.start()
query.awaitTermination()
}
}
+----+----------------+
| key| value|
+----+----------------+
|0001|zhangsan 15700.0|
| 002| lk 7.8|
+----+----------------+
+---+----------------+
|key| value|
+---+----------------+
|001|zhangsan 13444.0|
+---+----------------+
+---+---------+
|key| value|
+---+---------+
|002|lk 1352.8|
+---+---------+
With foreachBatch you can, for example, do the following, reusing a single cached micro-batch to write to multiple locations.
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF.persist() // cache the batch so it is not recomputed for each write
batchDF.write.format(...).save(...) // location 1
batchDF.write.format(...).save(...) // location 2
batchDF.unpersist() // release the cache
}
If foreachBatch is not an option, you can express custom writer logic with foreach instead. Concretely, you write data to the external system by implementing three methods of a custom ForeachWriter: open, process and close. foreach is available in Scala, Java and Python since Spark 2.4.
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.streaming.OutputMode
val spark = SparkSession
.builder
.appName("filesink")
.master("local[6]")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
val lines = spark.readStream
.format("socket")
.option("host", "CentOS")
.option("port", 9999)
.load()
import org.apache.spark.sql.functions._
//001 zhangsan iphonex 15000
val userCost=lines.as[String].map(_.split("\\s+"))
.map(ts=>(ts(0),ts(1),ts(3).toDouble))
.toDF("id","name","cost")
.groupBy("id","name")
.agg(sum("cost") as "cost")
val query = userCost.writeStream
.outputMode(OutputMode.Update())
.foreach(new ForeachWriter[Row] {
override def open(partitionId: Long, epochId: Long): Boolean = { // open resources / begin a transaction
// println(s"open:${partitionId},${epochId}")
return true // if true is returned, process is called for each row, followed by close
}
override def process(value: Row): Unit = {
val id=value.getAs[String]("id")
val name=value.getAs[String]("name")
val cost=value.getAs[Double]("cost")
println(s"${id},${name},${cost}") //提交事务
}
override def close(errorOrNull: Throwable): Unit = {
//println("close:"+errorOrNull) //errorOrNull!=nul 回滚事务
}
})
.start()
//5. Block until the query terminates
query.awaitTermination()
Aggregations over a sliding event-time window are straightforward in Structured Streaming and look very similar to grouped aggregations; the event time is embedded in the data itself.
package com.hw.demo02
import java.sql.Timestamp
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.{Row, SparkSession}
/**
 * @author fql
 * @date 2019/10/11 19:44
 * Event-time window word count
 */
object WindowWordCount {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder
.appName("windowWordCount")
.master("local[6]")
.getOrCreate()
spark.sparkContext.setLogLevel("FATAL")
import spark.implicits._
val lines = spark.readStream
.format("socket")
.option("host", "CentOS")
.option("port", 9999)
.load()
import org.apache.spark.sql.functions._
val wordcounts = lines.as[String].map(_.split("\\s+"))
.map(ts => (ts(0), new Timestamp(ts(1).toLong), 1))
.toDF("word", "timestamp", "num")
.groupBy(
window($"timestamp", "4 seconds", "2 seconds"), //设置窗口大小 和滑动间隔
$"word")
.agg(sum("num") as "sum")
.map(row=> {
val start = row.getAs[Row]("window").getAs[Timestamp]("start")
val end = row.getAs[Row]("window").getAs[Timestamp]("end")
val word = row.getAs[String]("word")
val sum = row.getAs[Long]("sum")
(start,end,word,sum)
}).toDF("start","end","word","sum")
wordcounts.printSchema()
//4. Build the StreamingQuery and write the result out
val query = wordcounts.writeStream
.outputMode(OutputMode.Complete())
.format("console")
.start()
query.awaitTermination()
}
}
a 1570795521000
a 1570795522000
-------------------------------------------
Batch: 1
-------------------------------------------
+-------------------+-------------------+----+---+
| start| end|word|sum|
+-------------------+-------------------+----+---+
|2019-10-11 20:05:18|2019-10-11 20:05:22| a| 1|
|2019-10-11 20:05:20|2019-10-11 20:05:24| a| 1|
+-------------------+-------------------+----+---+
-------------------------------------------
Batch: 2
-------------------------------------------
+-------------------+-------------------+----+---+
| start| end|word|sum|
+-------------------+-------------------+----+---+
|2019-10-11 20:05:18|2019-10-11 20:05:22| a| 1|
|2019-10-11 20:05:22|2019-10-11 20:05:26| a| 1|
|2019-10-11 20:05:20|2019-10-11 20:05:24| a| 2|
+-------------------+-------------------+----+---+
In windowed stream processing, data can arrive out of order because of the network. Suppose, for example, the compute node has already read and processed data with event time 12:11, which means the 12:00~12:10 window has already fired; normally any data arriving after that point should carry event times later than 12:11. In practice, though, network delays or failures cause out-of-order data: at 12:11 a record with event time 12:04 may arrive, and Spark has to add it to the 12:00~12:10 window. That means Spark must keep the computation state of the 12:00~12:10 window around, and by default Spark retains window state indefinitely so that out-of-order data can still be added to its window's computation.
Unlike batch jobs, a streaming job runs 24x7, so keeping computation state forever is not realistic; we need a way to tell the engine when intermediate state can be discarded. Spark 2.1 introduced the watermarking mechanism, which lets the engine know when a window's state can be dropped. The watermark is computed as: max event time seen by the engine - late threshold. When a window's end time T < watermark, Spark can discard that window's state. For example, with a threshold of 10 minutes, once the maximum event time seen reaches 12:21 the watermark becomes 12:11, so the state of the 12:00~12:10 window may be dropped. Data that later falls into such an already-expired window is called late data. Because the window has fallen behind the watermark, its state is no longer guaranteed to exist (Spark tries to clean up the state of windows behind the watermark), so the later a record arrives, the lower the chance it will still be processed.
Normally a window fires when watermark >= window end time, and what it emits is the final result. In Structured Streaming, however, the watermark by itself only controls when the engine deletes a window's computation state. If you want the query to emit only final results, that is, to output a window's result only once watermark >= window end time, you must combine the watermark with the Append output mode.
Watermarking cannot be used together with the Complete output mode, because Complete mode has to retain all aggregate state and therefore never drops it.
Watermark formula: watermark = max event time seen by the engine - late threshold (the latest event time seen minus the configured threshold).
Output condition: as long as a window's end time is still >= the watermark, its results keep being updated and emitted; once the watermark passes the window's end time, its state is dropped and no further output is produced for it.
package com.hw.demo02
import java.sql.Timestamp
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.{Row, SparkSession}
/**
 * @author fql
 * @date 2019/10/11 19:44
 * Window word count with watermark (Update mode)
 */
object WindowWordCountUpdate {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder
.appName("windowWordCount")
.master("local[6]")
.getOrCreate()
spark.sparkContext.setLogLevel("FATAL")
import spark.implicits._
val lines = spark.readStream
.format("socket")
.option("host", "CentOS")
.option("port", 9999)
.load()
import org.apache.spark.sql.functions._
val wordcounts = lines.as[String].map(_.split("\\s+"))
.map(ts => (ts(0), new Timestamp(ts(1).toLong), 1))
.toDF("word", "timestamp", "num")
.withWatermark("timestamp","1 second") //设置阈值
.groupBy(
window($"timestamp", "4 seconds", "2 seconds"),
$"word")
.agg(sum("num") as "sum")
.map(row=> {
val start = row.getAs[Row]("window").getAs[Timestamp]("start")
val end = row.getAs[Row]("window").getAs[Timestamp]("end")
val word = row.getAs[String]("word")
val sum = row.getAs[Long]("sum")
(start,end,word,sum)
}).toDF("start","end","word","sum")
wordcounts.printSchema()
//4. Build the StreamingQuery and write the result out
val query = wordcounts.writeStream
.outputMode(OutputMode.Update())
.format("console")
.start()
query.awaitTermination()
}
}
import java.sql.Timestamp
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.{Row, SparkSession}
//1. Create the SparkSession
val spark = SparkSession
.builder
.appName("windowWordcount")
.master("local[6]")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
//2. Create a streaming DataFrame
val lines = spark.readStream
.format("socket")
.option("host", "CentOS")
.option("port", 9999)
.load()
//3. Apply SQL/Dataset operations
// input line format: <word> <timestamp in millis>, e.g. a 1570795521000
import org.apache.spark.sql.functions._
val wordCounts = lines.as[String].map(_.split("\\s+"))
.map(ts => (ts(0), new Timestamp(ts(1).toLong), 1))
.toDF("word", "timestamp", "num")
.withWatermark("timestamp", "1 second") //设置阈值
.groupBy(
window($"timestamp", "4 seconds", "2 seconds"),
$"word")
.agg(sum("num") as "sum")
.map(row=> {
val start = row.getAs[Row]("window").getAs[Timestamp]("start")
val end = row.getAs[Row]("window").getAs[Timestamp]("end")
val word = row.getAs[String]("word")
val sum = row.getAs[Long]("sum")
(start,end,word,sum)
}).toDF("start","end","word","sum")
wordCounts.printSchema()
//4. Build the StreamingQuery and write the result out
val query = wordCounts.writeStream
.outputMode(OutputMode.Append())
.format("console")
.start()
//5. Block until the query terminates
query.awaitTermination()
Strictly speaking, Spark does not provide a mechanism for handling too-late data (what other streaming frameworks simply call late data; the "late" data discussed above is really just out-of-order data); the default policy is to drop it. Storm and Flink both offer ways to handle too-late data, and this is an area where Spark still has room to improve.