注:次文档参考 【尚硅谷】大数据高级 flink技术精讲(2020年6月) 编写。
1.由于视频中并未涉及到具体搭建流程,Flink 环境搭建部分并未编写。
2.视频教程 Flink 版本为 1.10.0,此文档根据 Flink v1.11.1 进行部分修改。
3.文档中大部分程序在 Windows 端运行会有超时异常,需要打包后在 Linux 端运行。
4.程序运行需要的部分 Jar 包,请根据情况去掉 pom 中的 “scope” 标签后再进行打包,才能在集群上运行。
5.原始文档在 Markdown 中编写,此处目录无法直接跳转。且因字数限制,分多篇发布。
此文档仅用作个人学习,请勿用于商业获利。
概念
Flink 是一个 框架 和 分布式处理引擎,用于对 无界和有界数据流 进行 状态 计算。
为什么选择 Flink
哪些行业需要处理流数据
Flink 主要特点
分层 API
Flink 其他特点
Flink VS SparkStreaming
# 创建用户
userdel -r flink && useradd flink && echo flink | passwd --stdin flink
# 下载
wget https://archive.apache.org/dist/flink/flink-1.11.1/flink-1.11.1-bin-scala_2.11.tgz
或
wget https://mirror.bit.edu.cn/apache/flink/flink-1.11.1/flink-1.11.1-bin-scala_2.11.tgz
# 解压并启动
tar -zxvf flink-1.11.1-bin-scala_2.11.tgz
/home/flink/flink-1.11.1/bin/start-cluster.sh
# UI
http://test01:8081/#/overview
sudo yum -y install nc
# 使用 linux 的 nc 命令来向 socket 当中发送一些单词
nc -lk 7777
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_2.11</artifactId>
    <version>1.11.1</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>1.11.1</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients_2.11</artifactId>
    <version>1.11.1</version>
    <scope>provided</scope>
</dependency>
<profiles>
    <profile>
        <id>dev</id>
        <activation>
            <activeByDefault>true</activeByDefault>
        </activation>
        <properties>
            <env>dev</env>
        </properties>
    </profile>
    <profile>
        <id>prod</id>
        <properties>
            <env>prod</env>
        </properties>
    </profile>
</profiles>
<build>
    <filters>
        <filter>src/main/resources/env/config-${env}.properties</filter>
    </filters>
    <resources>
        <resource>
            <directory>src/main/resources</directory>
            <includes>
                <include>*.properties</include>
                <include>*.txt</include>
                <include>*.xml</include>
                <include>*.yaml</include>
            </includes>
        </resource>
    </resources>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.4.6</version>
            <configuration>
                <args>
                    <arg>-target:jvm-1.8</arg>
                </args>
            </configuration>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <compilerVersion>1.8</compilerVersion>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.0.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Code
package com.mso.flink.dataset
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.api.scala._
object DataSetWordCount {
def main(args: Array[String]): Unit = {
// 创建一个批处理执行环境
val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
// 从文件中读取数据
// val resource: URL = getClass.getResource("/word.txt")
// val inputDataSet: DataSet[String] = environment.readTextFile(resource.getPath)
val params: ParameterTool = ParameterTool.fromArgs(args)
val inputDataSet: DataSet[String] = environment.readTextFile(params.get("input-path"))
// 基于 DataSet 做转换,首先按空格拆分,然后按照 word 作为 key 做 groupBy 分组聚合
val resultDataSet: AggregateDataSet[(String, Int)] = inputDataSet
.flatMap((_: String).split(" ")) // 分词得到 word 构成的数据集
.map(((_: String), 1)) // 转换成一个二元组 (word, count)
.groupBy(0) // 以二元组中第一个元素作为 key 分组
.sum(1) // 聚合二元组中第二个元素的值
resultDataSet.printOnTaskManager("DataSetWordCount")
environment.execute("DataSetWordCount")
// ~/flink-1.11.1/bin/flink run -p 1 -c com.mso.flink.dataset.DataSetWordCount FlinkPractice-1.0-SNAPSHOT-jar-with-dependencies.jar --input-path /home/flink/word.txt
}
}
Run
~/flink-1.11.1/bin/flink run -p 1 -c com.mso.flink.dataset.DataSetWordCount FlinkPractice-1.0-SNAPSHOT-jar-with-dependencies.jar --input-path /home/flink/word.txt
Code
package com.mso.flink.stream
import org.apache.flink.streaming.api.scala._
object StreamWordCount {
def main(args: Array[String]): Unit = {
// 创建流处理执行环境
val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// 接收 socket 文本流
val inputSocketDataStream: DataStream[String] = environment.socketTextStream("test01", 7777)
// 定义转换操作, word count
val resultDataStream: DataStream[(String, Int)] = inputSocketDataStream
.flatMap(_.split(" ")) // 分词得到 word 构成的数据集
.filter(_.nonEmpty) // 过滤空集
.map((_, 1)) // 转换成一个二元组 (word, count)
.keyBy(0) // 以二元组中第一个元素作为 key 分组
.sum(1) // 聚合二元组中第二个元素的值
// 打印输出
resultDataStream.print()
// 提交执行
environment.execute()
}
}
Run
~/flink-1.11.1/bin/flink run -c com.mso.flink.stream.StreamWordCount -p 1 FlinkPractice-1.0-SNAPSHOT-jar-with-dependencies.jar
Result
tail -f flink-flink-taskexecutor-0-test01.out
Stop
~/flink-1.11.1/bin/flink list
~/flink-1.11.1/bin/flink cancel jobID
规划
Node | JobManager | TaskManager | JPS |
---|---|---|---|
test01 | Y | Y | StandaloneSessionClusterEntrypoint、TaskManagerRunner |
test02 | N | Y | TaskManagerRunner |
test03 | N | Y | TaskManagerRunner |
安装
# 修改 flink-conf.yaml
vim ~/flink-1.11.1/conf/flink-conf.yaml
#jobmanager.rpc.address: localhost
jobmanager.rpc.address: test01
# 修改 workers
vim ~/flink-1.11.1/conf/workers
test01
test02
test03
# 免密
ssh-keygen
ssh-copy-id test02
ssh-copy-id test03
# 分发安装包
scp -r ~/flink-1.11.1/ flink@test02:
scp -r ~/flink-1.11.1/ flink@test03:
# 启动 Flink 集群
~/flink-1.11.1/bin/start-cluster.sh
# WebUI 界面访问
http://test01:8081/#/overview
提交任务
Session-cluster 模式
先启动集群,然后再提交作业。首先向 yarn 申请一块空间,之后资源永远保持不变,如果资源满了,下一个作业就无法提交。
所有作业共享 Dispatcher 和 ResourceManager。
适用于规模小且执行时间短的作业。
Per-Job-Cluster
每次提交 Job 都会对应一个 Flink 集群,每提交一个作业会根据自身的情况,都会单独向 yarn 申请资源,直到作业执行完成。
# 启动
~/flink-1.11.1/bin/yarn-session.sh -n 2 -s 2 -jm 1024 -tm 1024 -nm test -d
-n(--container) : TaskManager 的数量
-s(--slots) : 每个 TaskManager 的 slot 数量。默认一个 slot 对应一个 core,每个 TaskManager 的 slot 个数默认为 1,有时可以适当多配一些
-jm : JobManager 的内存(单位 MB)
-tm : 每个 TaskManager 的内存(单位 MB)
-nm : yarn 的 appName(现在 yarn 的 ui 上的名字)
-d : 后台执行
# 取消 yarn session
yarn application -kill <applicationId>
# 提交任务
~/flink-1.11.1/bin/flink run -m yarn-cluster -c com.mso.flink.stream.StreamWordCount -p 1 FlinkPractice-1.0-SNAPSHOT-jar-with-dependencies.jar
# 提交任务
~/flink-1.11.1/bin/flink run -c com.mso.flink.stream.StreamWordCount -p 1 FlinkPractice-1.0-SNAPSHOT-jar-with-dependencies.jar
搭建 Kubernetes 集群
略
配置各组件的 yaml 文件
在 k8s 上构建 Flink Session Cluster,需要将 Flink 集群的组件对应的 docker 镜像分别在 k8s 上启动。
包括 JobManager、TaskManager、JobManagerService 三个镜像服务。每个镜像服务都可以从中央镜像仓库中获取。
启动Flink Session Cluster
# 启动 jobmanager-service 服务
kubectl create -f jobmanager-service.yaml
# 启动 jobmanager-deployment 服务
kubectl create -f jobmanager-deployment.yaml
# 启动 taskmanager-deployment 服务
kubectl create -f taskmanager-deployment.yaml
访问 Flink UI 页面
http://(JobManagerHost:Port)/api/v1/namespaces/default/services/flink-jobmanager:ui/proxy
JobManager
TaskManager
ResourceManager
Dispatcher
任务提交流程
任务提交流程 On Yarn
代码中定义的每一步操作(算子、operator)就是一个任务。
算子可以设置并行度,所以每一步操作都可以有多个并行的子任务。
Flink 可以将前后执行的不同任务合并起来。
即,如果并行度相同,one-to-one 数据传输,那么可以把算子合并成一个任务链。
slot 是 TaskManager 拥有的计算资源的子集,一个任务必须在一个 slot 上执行。
每个算子的并行任务,必须执行在不同的 slot 上。
如果是不同算子的任务,可以共享一个 slot。
一般情况下,一段代码执行需要的 slot 数量,就是并行度最大的算子的并行度。
并行度和任务有关,就是每一个算子拥有的并行任务数量。
slot 数量只跟 TaskManager 配置有关,代表 TaskManager 并行处理数据的能力。
注:任务链合并与 slot 共享的相关配置(调用示例见下)
# 全局禁用任务链合并
environment.disableOperatorChaining()
# 算子级别的配置
Transform.slotSharingGroup("1")  # 设置 slot 共享组,不同共享组的任务不会共享同一个 slot
Transform.disableChaining()      # 当前算子不与前后算子合并为任务链
Transform.startNewChain()        # 从当前算子开始一条新的任务链
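下面给出一个简单示意(基于前面 StreamWordCount 中的 socket 流,非视频原代码,组名与算子划分仅为示例),展示这些 API 在转换链路中的调用位置:
val resultStream: DataStream[(String, Int)] = inputSocketDataStream
  .flatMap(_.split(" "))
  .slotSharingGroup("group1") // 当前算子划入名为 group1 的 slot 共享组
  .filter(_.nonEmpty)
  .disableChaining()          // 当前算子不与前后算子合并为任务链
  .map((_, 1))
  .startNewChain()            // 从当前算子开始一条新的任务链
  .keyBy(0)
  .sum(1)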
// 0. Create stream environment
val streamEnvironment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
streamEnvironment.setParallelism(1)
val dataSetEnvironment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
dataSetEnvironment.setParallelism(1)
// 0. 返回本地执行环境,需要在调用时指定默认的并行度
val streamLocalExeEnvironment: StreamExecutionEnvironment = StreamExecutionEnvironment.createLocalEnvironment(1)
// 0. 返回集群执行环境,将 Jar 提交到远程服务器。需要在调用时指定集群地址和要在集群运行的 Jar 包
val streamRemoteExeEnvironment: StreamExecutionEnvironment = StreamExecutionEnvironment.createRemoteEnvironment("test01", 6123,"PATH/something.jar")
streamRemoteExeEnvironment.setParallelism(1)
Code
// 输入数据的样例类
case class SensorReading(id: String, timestamp: Long, temperature: Double)
object SourceDemo {
def main(args: Array[String]): Unit = {
// 0. Create stream environment.
val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
environment.setParallelism(1)
// 1. Source from Collection
val sourceFromCollection1: DataStream[String] = environment.fromElements[String]("hello world", "hello flink")
val sourceFromCollection2: DataStream[SensorReading] = environment.fromCollection(List(
SensorReading("sensor_1", 1547718199, 35.8),
SensorReading("sensor_6", 1547718201, 15.4),
SensorReading("sensor_7", 1547718202, 6.7),
SensorReading("sensor_10", 1547718205, 38.1),
SensorReading("sensor_1", 1547718207, 37.2),
SensorReading("sensor_1", 1547718212, 33.5),
SensorReading("sensor_1", 1547718215, 38.1),
SensorReading("sensor_6", 1547718222, 35.8)
))
// 打印输出
sourceFromCollection1.print("sourceFromCollection1")
sourceFromCollection2.print("sourceFromCollection2")
environment.execute("Source demo")
}
}
File
sensor_1,1547718199,35.8
sensor_6,1547718201,15.4
sensor_7,1547718202,6.7
sensor_10,1547718205,38.1
sensor_1,1547718207,37.2
sensor_1,1547718212,33.5
sensor_1,1547718215,38.1
sensor_6,1547718222,35.8
Code
// 2. Source from File
val params: ParameterTool = ParameterTool.fromArgs(args)
val sourceFromFile: DataStream[String] = environment.readTextFile(params.get("path"))
Code
// 3. Source from socket
val sourceFromSocket: DataStream[String] = environment.socketTextStream("test01", 7777)
Create topic
kafka-topics --list --zookeeper localhost:2181/kafka
kafka-topics --create --zookeeper localhost:2181/kafka --replication-factor 3 --partitions 2 --topic sensor
kafka-topics --describe --zookeeper localhost:2181/kafka --topic sensor
kafka-console-producer --broker-list test01:9092,test02:9092,test03:9092 --topic sensor
kafka-console-consumer --bootstrap-server test01:9092,test02:9092,test03:9092 --topic sensor
pom
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.11.1</version>
    <scope>provided</scope>
</dependency>
Code
// 4. Source from kafka
val properties = new Properties()
properties.setProperty("bootstrap.servers", "test01:9092")
properties.setProperty("group.id", "test-group")
properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
properties.setProperty("auto.offset.reset", "latest")
// earliest : 当各分区下有已提交的 offset 时,从提交的 offset 开始消费;无提交的 offset 时,从头开始消费
// latest : 当各分区下有已提交的 offset 时,从提交的 offset 开始消费;无提交的 offset 时,消费新产生的该分区下的数据
// none : topic各分区都存在已提交的 offset 时,从 offset 后开始消费;只要有一个分区不存在已提交的 offset,则抛出异常
val sourceFromKafka: DataStream[String] = environment.addSource(new FlinkKafkaConsumer[String]("sensor", new SimpleStringSchema(), properties))
Code
// 5. Source from Custom Source
val sourceFromMySensorSource: DataStream[SensorReading] = environment.addSource(new MySensorSource)
sourceFromMySensorSource.print("sourceFromMySensorSource")
// 实现一个自定义的 SourceFunction,自动生成测试数据
class MySensorSource() extends SourceFunction[SensorReading] {
// 定义一个 flag,表示数据源是否正常运行
private var running: Boolean = true
override def cancel(): Unit = running = false
// 随机生成 SensorReading 数据
override def run(sourceContext: SourceFunction.SourceContext[SensorReading]): Unit = {
// 定义一个随机数发生器
val rand = new Random()
// 定义 10 个传感器的初始温度
var curTemps = 1.to(10).map(i => ("sensor_" + i, 60 + rand.nextGaussian() * 20))
// 无限循环,生成随机数据
while (running) {
// 在当前温度基础上,随机生成微小波动
curTemps = curTemps.map(data => (data._1, data._2 + rand.nextGaussian()))
// 包装成样例类,用 sourceContext 发出数据
curTemps.foreach(
data => sourceContext.collect(SensorReading(data._1, System.currentTimeMillis(), data._2))
)
Thread.sleep(1000L)
}
}
}
package com.mso.flink.stream.source
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import scala.util.Random
// 输入数据的样例类
case class SensorReading(id: String, timestamp: Long, temperature: Double)
object SourceDemo {
def main(args: Array[String]): Unit = {
// 0. Create stream environment.
val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
environment.setParallelism(1)
// 1. Source from Collection
val sourceFromCollection1: DataStream[String] = environment.fromElements[String]("hello world", "hello flink")
val sourceFromCollection2: DataStream[SensorReading] = environment.fromCollection(List(
SensorReading("sensor_1", 1547718199, 35.8),
SensorReading("sensor_6", 1547718201, 15.4),
SensorReading("sensor_7", 1547718202, 6.7),
SensorReading("sensor_10", 1547718205, 38.1),
SensorReading("sensor_1", 1547718207, 37.2),
SensorReading("sensor_1", 1547718212, 33.5),
SensorReading("sensor_1", 1547718215, 38.1),
SensorReading("sensor_6", 1547718222, 35.8)
))
// 2. Source from File
val params: ParameterTool = ParameterTool.fromArgs(args)
val sourceFromFile: DataStream[String] = environment.readTextFile(params.get("path"))
// 3. Source from socket
// val sourceFromSocket: DataStream[String] = environment.socketTextStream("test01", 7777)
// 4. Source from kafka
val properties = new Properties()
properties.setProperty("bootstrap.servers", "test01:9092")
properties.setProperty("group.id", "test-group")
properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
properties.setProperty("auto.offset.reset", "latest")
// earliest : 当各分区下有已提交的 offset 时,从提交的 offset 开始消费;无提交的 offset 时,从头开始消费
// latest : 当各分区下有已提交的 offset 时,从提交的 offset 开始消费;无提交的 offset 时,消费新产生的该分区下的数据
// none : topic各分区都存在已提交的 offset 时,从 offset 后开始消费;只要有一个分区不存在已提交的 offset,则抛出异常
val sourceFromKafka: DataStream[String] = environment.addSource(new FlinkKafkaConsumer[String]("sensor", new SimpleStringSchema(), properties))
// 5. Source from Custom Source
val sourceFromMySensorSource: DataStream[SensorReading] = environment.addSource(new MySensorSource)
// 打印输出
// sourceFromCollection1.print("sourceFromCollection1")
// sourceFromCollection2.print("sourceFromCollection2")
// sourceFromFile.print("sourceFromFile")
// sourceFromKafka.print("sourceFromKafka")
sourceFromMySensorSource.print("sourceFromMySensorSource")
environment.execute("Source demo")
}
}
// 实现一个自定义的 SourceFunction,自动生成测试数据
class MySensorSource() extends SourceFunction[SensorReading] {
// 定义一个 flag,表示数据源是否正常运行
private var running: Boolean = true
override def cancel(): Unit = running = false
// 随机生成 SensorReading 数据
override def run(sourceContext: SourceFunction.SourceContext[SensorReading]): Unit = {
// 定义一个随机数发生器
val rand = new Random()
// 定义 10 个传感器的初始温度
var curTemps = 1.to(10).map(i => ("sensor_" + i, 60 + rand.nextGaussian() * 20))
// 无限循环,生成随机数据
while (running) {
// 在当前温度基础上,随机生成微小波动
curTemps = curTemps.map(data => (data._1, data._2 + rand.nextGaussian()))
// 包装成样例类,用 sourceContext 发出数据
curTemps.foreach(
data => sourceContext.collect(SensorReading(data._1, System.currentTimeMillis(), data._2))
)
Thread.sleep(1000L)
}
}
}
转换算子,读取数据之后,sink 之前的操作。
dataStream.map { x => x * 2 }
dataStream.flatMap { str => str.split(" ") }
dataStream.filter { _ != 0 }
dataStream.filter { x => x==1 }
逻辑地将一个流拆分成不相交的分区,每个分区包含具有相同 key 的元素,在内部以 hash 的形式实现。
package com.mso.flink.stream.transform
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.api.java.functions.KeySelector
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
// 输入数据的样例类
case class SensorReading(id: String, timestamp: Long, temperature: Double)
object TransformDemo {
def main(args: Array[String]): Unit = {
val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// 从文件中读取数据
val params: ParameterTool = ParameterTool.fromArgs(args)
val sourceStream: DataStream[String] = environment.readTextFile(params.get("path"))
// 1. 基本转换
val basicTransDataStream: DataStream[SensorReading] = sourceStream
.map((data: String) => {
val dataArray: Array[String] = data.split(",")
SensorReading(dataArray(0), dataArray(1).toLong, dataArray(2).toDouble)
})
// 2. 分组滚动聚合
val aggStream: DataStream[SensorReading] = basicTransDataStream
// .keyBy(0)
// .keyBy("id")
// .keyBy(data => data.id)
.keyBy(new MyKeySelector)
// .min("temperature") // 取当前分组内,temperature 最小的数据,且其他字段取第一条数据的值
.minBy("temperature") //取当前分组内,temperature 最小的数据,且其他字段取 temperature 最小的那条数据的值
basicTransDataStream.print("basicTransDataStream")
aggStream.print("aggStream")
environment.execute()
}
}
// 自定义函数类,key 选择器
private class MyKeySelector extends KeySelector[SensorReading, String] {
override def getKey(in: SensorReading): String = in.id
}
keyedStream.sum(0)
keyedStream.sum("key")
keyedStream.min(0)
keyedStream.min("key")
keyedStream.max(0)
keyedStream.max("key")
keyedStream.minBy(0)
keyedStream.minBy("key")
keyedStream.maxBy(0)
keyedStream.maxBy("key")
// .reduce(new MyReduceFunction)
.reduce((curData: SensorReading, newData: SensorReading) =>
SensorReading(
curData.id,
curData.timestamp.max(newData.timestamp),
curData.temperature.min(newData.temperature)
)
) // 取 时间的最大值 和 温度的最小值
// 自定义 Reduce 方法
private class MyReduceFunction extends ReduceFunction[SensorReading] {
override def reduce(t: SensorReading, t1: SensorReading): SensorReading = {
SensorReading(t.id, t.timestamp.max(t1.timestamp), t.temperature.min(t1.temperature))
}
}
// 3. 分流
val splitStream: SplitStream[SensorReading] = aggStream
.split(
(data: SensorReading) => {
if (data.temperature > 30)
Seq("high")
else
Seq("low")
}
)
val highTempStream: DataStream[SensorReading] = splitStream.select("high")
val lowTempStream: DataStream[SensorReading] = splitStream.select("low")
val allTempStream: DataStream[SensorReading] = splitStream.select("high", "low")
val highTempOutputTag: OutputTag[String] = new OutputTag[String]("high")
val lowTempOutputTag: OutputTag[String] = new OutputTag[String]("low")
val mainDataStream: DataStream[SensorReading] = aggStream
.process(new ProcessFunction[SensorReading, SensorReading] {
override def processElement(
value: SensorReading,
ctx: ProcessFunction[SensorReading, SensorReading]#Context,
out: Collector[SensorReading]): Unit = {
if (value.temperature > 30) {
// 将数据发送到侧输出中
ctx.output(highTempOutputTag, String.valueOf(value))
}
else if (value.temperature < 20) {
ctx.output(lowTempOutputTag, String.valueOf(value))
}
else {
// 将数据发送到常规输出中
out.collect(value)
}
}
}
)
// 通过 getSideOutput 获取侧输出流
val sideOutputHighTempDataStream: DataStream[String] = mainDataStream.getSideOutput(highTempOutputTag)
val sideOutputLowTempDataStream: DataStream[String] = mainDataStream.getSideOutput(lowTempOutputTag)
highTempStream.print("highTempStream")
lowTempStream.print("lowTempStream")
allTempStream.print("allTempStream")
sideOutputHighTempDataStream.print("sideOutputHighTempDataStream")
sideOutputLowTempDataStream.print("sideOutputLowTempDataStream")
connect:用于合并两条流,两条流的类型可以不同;一次只能连接两条流,得到的 ConnectedStreams 不能再继续 connect。
// 4. 合流
val warningStream: DataStream[(String, Double)] = highTempStream.map((data: SensorReading) => (data.id, data.temperature))
val connectedStreams: ConnectedStreams[(String, Double), SensorReading] = warningStream.connect(lowTempStream)
val connectedResultStream: DataStream[Product] = connectedStreams.map(
(warningData: (String, Double)) => (warningData._1, warningData._2, "high temp warning"),
(lowTempData: SensorReading) => (lowTempData.id, "normal")
)
connectedResultStream.print("connectedResultStream")
union:用于合并两条或多条类型相同的流,且合并之后的流可以继续 union。
val unionStream: DataStream[SensorReading] = highTempStream.union(lowTempStream)
unionStream.print("unionStream")
package com.mso.flink.stream.transform
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.api.java.functions.KeySelector
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
// 输入数据的样例类
case class SensorReading(id: String, timestamp: Long, temperature: Double)
object TransformDemo {
def main(args: Array[String]): Unit = {
val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// 从文件中读取数据
val params: ParameterTool = ParameterTool.fromArgs(args)
val sourceStream: DataStream[String] = environment.readTextFile(params.get("path"))
// 1. 基本转换
val basicTransDataStream: DataStream[SensorReading] = sourceStream
.map((data: String) => {
val dataArray: Array[String] = data.split(",")
SensorReading(dataArray(0), dataArray(1).toLong, dataArray(2).toDouble)
})
// 2. 分组滚动聚合
val aggStream: DataStream[SensorReading] = basicTransDataStream
// .keyBy(0)
// .keyBy("id")
// .keyBy(data => data.id)
.keyBy(new MyKeySelector)
// .min("temperature") // 取当前分组内,temperature 最小的数据,且其他字段取第一条数据的值
// .minBy("temperature") //取当前分组内,temperature 最小的数据,且其他字段取 temperature 最小的那条数据的值
// .reduce(new MyReduceFunction)
.reduce((curData: SensorReading, newData: SensorReading) =>
SensorReading(
curData.id,
curData.timestamp.max(newData.timestamp),
curData.temperature.min(newData.temperature)
)
) // 取 时间的最大值 和 温度的最小值
// 3. 分流
val splitStream: SplitStream[SensorReading] = aggStream
.split(
(data: SensorReading) => {
if (data.temperature > 30)
Seq("high")
else
Seq("low")
}
)
val highTempStream: DataStream[SensorReading] = splitStream.select("high")
val lowTempStream: DataStream[SensorReading] = splitStream.select("low")
val allTempStream: DataStream[SensorReading] = splitStream.select("high", "low")
val highTempOutputTag: OutputTag[String] = new OutputTag[String]("high")
val lowTempOutputTag: OutputTag[String] = new OutputTag[String]("low")
val mainDataStream: DataStream[SensorReading] = aggStream
.process(new ProcessFunction[SensorReading, SensorReading] {
override def processElement(
value: SensorReading,
ctx: ProcessFunction[SensorReading, SensorReading]#Context,
out: Collector[SensorReading]): Unit = {
if (value.temperature > 30) {
// 将数据发送到侧输出中
ctx.output(highTempOutputTag, String.valueOf(value))
}
else if (value.temperature < 20) {
ctx.output(lowTempOutputTag, String.valueOf(value))
}
else {
// 将数据发送到常规输出中
out.collect(value)
}
}
}
)
// 通过 getSideOutput 获取侧输出流
val sideOutputHighTempDataStream: DataStream[String] = mainDataStream.getSideOutput(highTempOutputTag)
val sideOutputLowTempDataStream: DataStream[String] = mainDataStream.getSideOutput(lowTempOutputTag)
// 4. 合流
val unionStream: DataStream[SensorReading] = highTempStream.union(lowTempStream)
val warningStream: DataStream[(String, Double)] = highTempStream.map((data: SensorReading) => (data.id, data.temperature))
val connectedStreams: ConnectedStreams[(String, Double), SensorReading] = warningStream.connect(lowTempStream)
val connectedResultStream: DataStream[Product] = connectedStreams.map(
(warningData: (String, Double)) => (warningData._1, warningData._2, "high temp warning"),
(lowTempData: SensorReading) => (lowTempData.id, "normal")
)
basicTransDataStream.print("basicTransDataStream")
aggStream.print("aggStream")
highTempStream.print("highTempStream")
lowTempStream.print("lowTempStream")
allTempStream.print("allTempStream")
sideOutputHighTempDataStream.print("sideOutputHighTempDataStream")
sideOutputLowTempDataStream.print("sideOutputLowTempDataStream")
unionStream.print("unionStream")
connectedResultStream.print("connectedResultStream")
environment.execute()
}
}
// 自定义函数类,key 选择器
private class MyKeySelector extends KeySelector[SensorReading, String] {
override def getKey(in: SensorReading): String = in.id
}
// 自定义 Reduce 方法
private class MyReduceFunction extends ReduceFunction[SensorReading] {
override def reduce(t: SensorReading, t1: SensorReading): SensorReading = {
SensorReading(t.id, t.timestamp.max(t1.timestamp), t.temperature.min(t1.temperature))
}
}
Flink 支持所有的 Java 和 Scala 基础数据类型,Int、Double、Long、String …
val numbers: DataStream[Long] = env.fromElements(1L, 2L, 3L, 4L)
numbers.map(n => n + 1)
val persons: DataStream[(String, Integer)] = env.fromElements(
("Adam", 17),
("Sarah", 23) )
persons.filter(p => p._2 > 18)
case class Person(name: String, age: Int)
val persons: DataStream[Person] = env.fromElements(
Person("Adam", 17),
Person("Sarah", 23)
)
persons.filter(p => p.age > 18)
public class Person {
public String name;
public int age;
public Person() {}
public Person(String name, int age) {
this.name = name;
this.age = age;
}
}
DataStream persons = env.fromElements(
new Person("Alex", 42),
new Person("Werdy", 23));
Flink 对 Java 和 Scala 中的一些特殊的数据类型也是支持的。比如 Java 的 ArrayList、HashMap、Enum 等
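例如,下面是一个简单示意(非视频原代码):直接把 Java 的 ArrayList 作为流元素类型,Flink 会退化为通用(Kryo)序列化方式处理:
import java.util

val listStream: DataStream[util.ArrayList[String]] = env.fromElements(
  new util.ArrayList[String](util.Arrays.asList("sensor_1", "sensor_2"))
)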
Flink 暴露了所有 UDF 函数的接口,实现方式为接口或抽象类。比如上面例子中的 MyKeySelector,MyReduceFunction
val tweets: DataStream[String] = ...
val flinkTweets = tweets.filter(_.contains("flink"))
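函数类写法的一个简单示意(非视频原代码),与上面的 lambda 写法等价:
import org.apache.flink.api.common.functions.FilterFunction

// 自定义函数类,实现 FilterFunction 接口
class FlinkFilter extends FilterFunction[String] {
  override def filter(value: String): Boolean = value.contains("flink")
}

// 使用方式
// val flinkTweets = tweets.filter(new FlinkFilter)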
富函数 是 DataStream API 提供的一个函数类的接口,所有 Flink 函数类都有其 Rich 版本。
它与常规函数的不同在于,可以获取运行环境的上下文,并拥有一些生命周期方法,所以可以实现更复杂的功能。
Rich Function 有一个生命周期的概念,典型的生命周期方法有:
private class MyRichMap extends RichMapFunction[SensorReading, String] {
// 调用方法最初的操作,常用于创建数据库连接
override def open(parameters: Configuration): Unit = {
// 富函数可以获取运行时上下文,例如并行子任务的索引
getRuntimeContext.getIndexOfThisSubtask
}
override def map(in: SensorReading): String = in.id
// 调用方法最后的操作,常用于关闭数据库连接
override def close(): Unit = super.close()
}
keyBy
基于 keyBy 的 hash code 重分区
同一个 key 只能在一个分区内处理,一个分区内可以有不同 key 的数据
keyBy 之后的所有操作,作用域都只是当前的 key。不同于 Spark 的 reduceByKey() 会在本地先做聚合,keyBy 不涉及计算,仅确定当前数据要发往哪个分区
滚动聚合操作
DataStream 没有聚合操作,目前所有的聚合操作都是针对 KeyedStream
多流转换算子
split-select, connect-comap/coflatmap 成对出现
先转换成 SplitStream, ConnectedStreams,然后再通过 select/comap 操作转换回 DataStream
所谓 coMap,其实就是基于 ConnectedStreams 的 map 方法,里面传入的参数是 CoMapFunction
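下面是一个简单示意(非视频原代码),用 CoMapFunction 显式实现与前面合流示例中 lambda 写法等价的 coMap 操作:
import org.apache.flink.streaming.api.functions.co.CoMapFunction

// 第一条流为 (id, temperature) 的高温报警流,第二条流为低温的 SensorReading 流
class MyCoMapFunction extends CoMapFunction[(String, Double), SensorReading, Product] {
  override def map1(value: (String, Double)): Product = (value._1, value._2, "high temp warning")
  override def map2(value: SensorReading): Product = (value.id, "normal")
}

// 使用方式
// val connectedResultStream: DataStream[Product] = connectedStreams.map(new MyCoMapFunction)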
富函数
富函数是函数类的增强版,可以有生命周期方法,还可以获取运行时上下文,在运行时上下文可以对 state 进行操作
Flink 有状态的流式计算,状态编程,就是基于 RichFunction
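下面是一个简单示意(非视频原代码),在 RichMapFunction 中通过运行时上下文注册并使用 keyed state,输出每个传感器本次与上次的温差:
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration

class TempDiffMapper extends RichMapFunction[SensorReading, (String, Double)] {
  // 保存上一次温度值的状态
  private var lastTempState: ValueState[Double] = _

  override def open(parameters: Configuration): Unit = {
    // 在运行时上下文中注册 ValueState
    lastTempState = getRuntimeContext.getState(
      new ValueStateDescriptor[Double]("last-temp", classOf[Double]))
  }

  override def map(value: SensorReading): (String, Double) = {
    val lastTemp: Double = lastTempState.value()
    lastTempState.update(value.temperature)
    (value.id, value.temperature - lastTemp)
  }
}

// keyed state 只能用在 keyBy 之后的算子上,例如:
// sourceDataStream.keyBy(_.id).map(new TempDiffMapper)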
更多连接器:DataStream Connectors
Create topic
kafka-topics --list --zookeeper localhost:2181/kafka
kafka-topics --create --zookeeper localhost:2181/kafka --replication-factor 3 --partitions 1 --topic flink-sink
kafka-topics --describe --zookeeper localhost:2181/kafka --topic flink-sink
kafka-console-producer --broker-list test01:9092,test02:9092,test03:9092 --topic flink-sink
kafka-console-consumer --bootstrap-server test01:9092,test02:9092,test03:9092 --topic flink-sink
Data
sensor_1,1547718199,35.8
sensor_6,1547718201,15.4
sensor_7,1547718202,6.7
sensor_10,1547718205,38.1
sensor_1,1547718207,37.2
sensor_1,1547718212,33.5
sensor_1,1547718215,38.1
sensor_6,1547718222,35.8
Code
package com.mso.flink.stream.sink
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}
// 输入数据的样例类
case class SensorReading(id: String, timestamp: Long, temperature: Double)
object KafkaSinkDemo {
def main(args: Array[String]): Unit = {
val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// Source
// 输入数据同 sensor.txt 格式为:sensor_1, 1547718199, 35.8
val properties = new Properties()
properties.setProperty("bootstrap.servers", "test01:9092")
properties.setProperty("group.id", "test-group")
properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
properties.setProperty("auto.offset.reset", "latest")
val sourceFromKafka: DataStream[String] = environment.addSource(
new FlinkKafkaConsumer[String](
"sensor",
new SimpleStringSchema(),
properties))
// Transform
val basicTransDataStream: DataStream[String] = sourceFromKafka
.map((data: String) => {
val dataArray: Array[String] = data.split(",")
SensorReading(dataArray(0), dataArray(1).trim.toLong, dataArray(2).toDouble).toString
})
// Sink
basicTransDataStream.addSink(new FlinkKafkaProducer[String](
"test01:9092",
"flink-sink",
new SimpleStringSchema()))
environment.execute("Kafka sink demo")
}
}
POM
<dependency>
    <groupId>org.apache.bahir</groupId>
    <artifactId>flink-connector-redis_2.11</artifactId>
    <version>1.0</version>
    <scope>provided</scope>
</dependency>
Data
sensor_1,1547718199,35.8
sensor_6,1547718201,15.4
sensor_7,1547718202,6.7
sensor_10,1547718205,38.1
sensor_1,1547718207,37.2
sensor_1,1547718212,33.5
sensor_1,1547718215,38.1
sensor_6,1547718222,35.8
Code
package com.mso.flink.stream.sink
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}
object RedisSinkDemo {
def main(args: Array[String]): Unit = {
val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// 从文件中读取数据
val params: ParameterTool = ParameterTool.fromArgs(args)
val sourceStream: DataStream[String] = environment.readTextFile(params.get("path"))
// Transform
val sourceDataStream: DataStream[SensorReading] = sourceStream
.map((data: String) => {
val dataArray: Array[String] = data.split(",")
SensorReading(dataArray(0), dataArray(1).toLong, dataArray(2).toDouble)
})
// Sink
val conf = new FlinkJedisPoolConfig.Builder().setHost("localhost").setPort(6379).build()
sourceDataStream.addSink(new RedisSink[SensorReading](conf, new MyRedisMapper))
environment.execute("Redis sink demo")
}
}
class MyRedisMapper extends RedisMapper[SensorReading] {
// 定义保存到 redis 的命令,hset table_name key value
override def getCommandDescription: RedisCommandDescription = {
new RedisCommandDescription(RedisCommand.HSET, "sensor_temp")
}
override def getKeyFromData(data: SensorReading): String = data.id
override def getValueFromData(data: SensorReading): String = data.temperature.toString
}
POM
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-elasticsearch6_2.11</artifactId>
    <version>1.11.1</version>
    <scope>provided</scope>
</dependency>
Data
sensor_1,1547718199,35.8
sensor_6,1547718201,15.4
sensor_7,1547718202,6.7
sensor_10,1547718205,38.1
sensor_1,1547718207,37.2
sensor_1,1547718212,33.5
sensor_1,1547718215,38.1
sensor_6,1547718222,35.8
Code
package com.mso.flink.stream.sink
import java.util
import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.elasticsearch.{ElasticsearchSinkFunction, RequestIndexer}
import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink
import org.apache.http.HttpHost
import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.client.Requests
object ESSinkDemo {
def main(args: Array[String]): Unit = {
val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// 从文件中读取数据
val params: ParameterTool = ParameterTool.fromArgs(args)
val sourceStream: DataStream[String] = environment.readTextFile(params.get("path"))
// Transform
val sourceDataStream: DataStream[SensorReading] = sourceStream
.map((data: String) => {
val dataArray: Array[String] = data.split(",")
SensorReading(dataArray(0), dataArray(1).toLong, dataArray(2).toDouble)
})
// Sink
val httpHosts = new util.ArrayList[HttpHost]()
httpHosts.add(new HttpHost("127.0.0.1", 9200, "http"))
val myEsSinkFunc: ElasticsearchSinkFunction[SensorReading] = new ElasticsearchSinkFunction[SensorReading] {
override def process(t: SensorReading, runtimeContext: RuntimeContext, requestIndexer: RequestIndexer): Unit = {
// 包装写入 es 的数据
val dataSource = new util.HashMap[String, String]()
dataSource.put("sensor_id", t.id)
dataSource.put("timestamp", t.timestamp.toString)
dataSource.put("temperature", t.temperature.toString)
// 创建一个 index request
val indexRequest: IndexRequest = Requests.indexRequest()
.index("sensor_temp")
.`type`("readingdata")
.source(dataSource)
requestIndexer.add(indexRequest)
}
}
sourceDataStream.addSink(new ElasticsearchSink.Builder[SensorReading](httpHosts, myEsSinkFunc).build())
environment.execute("Elasticsearch sink demo")
}
}
POM
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-jdbc_2.11</artifactId>
    <version>1.11.1</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.21</version>
    <scope>provided</scope>
</dependency>
Create table
# 注,此处参照生产环境添加了唯一性索引。
# 若没有唯一性索引,可使用第一种方法进行新增和修改数据
# 若有唯一性索引,两种方法都可以使用
DROP TABLE IF EXISTS `testdb`.`sensor_table`;
CREATE TABLE `testdb`.`sensor_table` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`sensor` varchar(20) CHARACTER SET latin1 COLLATE latin1_swedish_ci NOT NULL,
`temperature` double NULL DEFAULT NULL,
PRIMARY KEY (`id`) USING BTREE,
UNIQUE INDEX `sensor`(`sensor`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 80 CHARACTER SET = latin1 COLLATE = latin1_swedish_ci ROW_FORMAT = Compact;
SET FOREIGN_KEY_CHECKS = 1;
Data
sensor_1,1547718199,35.8
sensor_6,1547718201,15.4
sensor_7,1547718202,6.7
sensor_10,1547718205,38.1
sensor_1,1547718207,37.2
sensor_1,1547718212,33.5
sensor_1,1547718215,38.1
sensor_6,1547718222,35.8
Code 1
package com.mso.flink.stream.sink
import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}
import org.apache.flink.streaming.api.scala._
object JdbcSinkDemo1 {
def main(args: Array[String]): Unit = {
val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// 从文件中读取数据
val params: ParameterTool = ParameterTool.fromArgs(args)
val sourceStream: DataStream[String] = environment.readTextFile(params.get("path"))
// Transform
val sourceDataStream: DataStream[SensorReading] = sourceStream
.map((data: String) => {
val dataArray: Array[String] = data.split(",")
SensorReading(dataArray(0), dataArray(1).toLong, dataArray(2).toDouble)
})
// Sink
sourceDataStream.addSink(new MyJdbcSink).setParallelism(1)
environment.execute("Jdbc sink demo")
}
}
class MyJdbcSink extends RichSinkFunction[SensorReading] {
// 首先定义 sql 连接,以及预编译语句
var conn: Connection = _
var insertStmt: PreparedStatement = _
var updateStmt: PreparedStatement = _
// 在 open 生命周期方法中创建连接以及预编译语句
override def open(parameters: Configuration): Unit = {
Class.forName("com.mysql.jdbc.Driver")
// conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/testdb?useUnicode=true&characterEncoding=utf-8", "admin", "12345678")
conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/testdb", "admin", "12345678")
updateStmt = conn.prepareStatement("UPDATE sensor_table set temperature=? WHERE sensor = ?")
insertStmt = conn.prepareStatement("INSERT INTO sensor_table (id, sensor, temperature) VALUES (null,?,?)")
}
// 调用连接 执行sql
override def invoke(value: SensorReading, context: SinkFunction.Context[_]): Unit = {
// 执行更新语句
updateStmt.setDouble(1, value.temperature)
updateStmt.setString(2, value.id)
updateStmt.execute()
// 如果 update 没有更新,即没有查询到数据,那么执行插入操作
if (updateStmt.getUpdateCount == 0) {
insertStmt.setString(1, value.id)
insertStmt.setDouble(2, value.temperature)
insertStmt.execute()
}
}
// 关闭操作
override def close(): Unit = {
insertStmt.close()
updateStmt.close()
conn.close()
}
}
Code 2
package com.mso.flink.stream.sink
import java.sql.PreparedStatement
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.connector.jdbc.{JdbcConnectionOptions, JdbcExecutionOptions, JdbcSink, JdbcStatementBuilder}
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.scala._
object JdbcSinkDemo2 {
def main(args: Array[String]): Unit = {
val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// 从文件中读取数据
val params: ParameterTool = ParameterTool.fromArgs(args)
val sourceStream: DataStream[String] = environment.readTextFile(params.get("path"))
// Transform
val sourceDataStream: DataStream[(String, Double, Double)] = sourceStream
.map((data: String) => {
val dataArray: Array[String] = data.split(",")
(dataArray(0), dataArray(2).toDouble, dataArray(2).toDouble)
})
// Sink
val insertSql = "INSERT INTO sensor_table (id, sensor, temperature) VALUES (NULL,?,?)"
val updateSql = "UPDATE sensor_table set temperature=? WHERE sensor = ?"
val upsertSql = "INSERT INTO sensor_table (sensor, temperature) VALUES (?,?) ON DUPLICATE KEY UPDATE temperature=?"
val myJdbcSinkFunction: SinkFunction[(String, Double, Double)] = JdbcSink.sink(
upsertSql,
new MyJdbcSinkBuilder(),
new JdbcExecutionOptions.Builder().withBatchSize(500).build(),
new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
.withDriverName("com.mysql.jdbc.Driver")
.withUrl("jdbc:mysql://localhost:3306/testdb")
.withUsername("admin")
.withPassword("12345678")
.build())
sourceDataStream.addSink(myJdbcSinkFunction)
environment.execute("Jdbc sink demo2")
}
}
//手动实现 interface 的方式来传入相关 JDBC Statement build 函数
class MyJdbcSinkBuilder extends JdbcStatementBuilder[(String, Double, Double)] {
override def accept(t: PreparedStatement, u: (String, Double, Double)): Unit = {
t.setString(1, u._1)
t.setDouble(2, u._2)
t.setDouble(3, u._3)
}
}
对于流式计算,数据是无界的,无法直接对全部数据求取最大值、最小值、平均值等聚合结果。
通常使用窗口(Window)把无界的数据流切分成有界的数据集,再进行聚合计算。
滚动窗口(Tumbling Windows)
滑动窗口(Sliding Windows)
会话窗口(Session Windows)
滚动窗口是特殊的滑动窗口,当滑动步长等于窗口长度时,两者内的数据相同,代码写法相同。
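例如,基于后文介绍的 timeWindow 简写,下面两种写法得到的窗口划分是相同的(仅为示意):
// 滚动时间窗口,窗口大小 15s
sourceDataStream.keyBy(data => data.id).timeWindow(Time.seconds(15))
// 等价于滑动步长等于窗口大小的滑动时间窗口
sourceDataStream.keyBy(data => data.id).timeWindow(Time.seconds(15), Time.seconds(15))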
stream
.keyBy(...) <- keyed versus non-keyed windows
.window(...) <- required: "assigner"
[.trigger(...)] <- optional: "trigger" (else default trigger)
[.evictor(...)] <- optional: "evictor" (else no evictor)
[.allowedLateness(...)] <- optional: "lateness" (else zero)
[.sideOutputLateData(...)] <- optional: "output tag" (else no side output for late data)
.reduce/aggregate/fold/apply() <- required: "function"
[.getSideOutput(...)] <- optional: "output tag"
stream
.windowAll(...) <- required: "assigner"
[.trigger(...)] <- optional: "trigger" (else default trigger)
[.evictor(...)] <- optional: "evictor" (else no evictor)
[.allowedLateness(...)] <- optional: "lateness" (else zero)
[.sideOutputLateData(...)] <- optional: "output tag" (else no side output for late data)
.reduce/aggregate/fold/apply() <- required: "function"
[.getSideOutput(...)] <- optional: "output tag"
在上面的结构中,方括号([…])中的 API 是可选的;这些可选方法必须在 window() 和窗口函数(reduce/aggregate/apply 等聚合操作)之间调用。
Flink 允许以多种不同的方式自定义窗口逻辑,以实现需求。
.trigger() - 触发器,定义 window 什么时候关闭,计算,输出结果
.evictor() - 移除器,定义移除某些数据的逻辑
.allowedLateness() - 允许处理迟到的数据
.sideOutputLateData() - 将迟到的数据放入侧输出流
.getSideOutput() - 获取侧输出流
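下面是一个简单示意(非视频原代码,基于前面的 sourceDataStream,其中 1 分钟的等待时间仅为示例值),展示 allowedLateness 与侧输出流处理迟到数据的用法:
import org.apache.flink.streaming.api.windowing.time.Time

// 定义收集迟到数据的侧输出流标签
val lateDataTag: OutputTag[SensorReading] = new OutputTag[SensorReading]("late-data")

val windowedResult: DataStream[SensorReading] = sourceDataStream
  .keyBy(data => data.id)
  .timeWindow(Time.seconds(15))
  .allowedLateness(Time.minutes(1))  // 窗口触发后再等待 1 分钟的迟到数据
  .sideOutputLateData(lateDataTag)   // 超过等待时间的迟到数据进入侧输出流
  .reduce((r1, r2) => SensorReading(r1.id, r1.timestamp.max(r2.timestamp), r1.temperature.min(r2.temperature)))

// 通过 getSideOutput 获取迟到数据
val lateDataStream: DataStream[SensorReading] = windowedResult.getSideOutput(lateDataTag)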
窗口分配器 - window() 方法
val minTempPerWindow = dataStream
.map(r => (r.id , r.temperature))
.keyBy(_._1)
.timeWindow(Time.seconds(15))
.reduce((r1,r2) => (r1._1, r1._2.min(r2._2)))
窗口分配器(Window assigner)
Window 是一种可以把数据切割成有限数据块的手段,窗口可以是 时间驱动[Time Window]的(比如每30秒) 或者 数据驱动[Count Window]的(比如每100个)
创建不同类型的窗口
sourceDataStream
.keyBy(data => data.id)
// 会话窗口,10min 失效
.window(EventTimeSessionWindows.withGap(Time.minutes(10)))
.window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
// 滚动时间窗口,窗口大小 1h (第三种为前两种的简写)
.window(TumblingEventTimeWindows.of(Time.hours(1), Time.hours(-8))) // 第二个参数为偏移量,常用于表示时区
.window(TumblingProcessingTimeWindows.of(Time.hours(1), Time.hours(-8))) // 第二个参数为偏移量,常用于表示时区
.timeWindow(Time.hours(1))
// 滑动时间窗口,窗口大小 1h ,每过 1min 滑动一次 (第三种为前两种的简写)
.window(SlidingEventTimeWindows.of(Time.hours(1), Time.minutes(1), Time.hours(-8))) // 第三个参数为偏移量,常用于表示时区,可以省略
.window(SlidingProcessingTimeWindows.of(Time.hours(1), Time.minutes(1), Time.hours(-8))) // 第三个参数为偏移量,常用于表示时区,可以省略
.timeWindow(Time.hours(1), Time.minutes(1))
// 滚动计数窗口,窗口大小为 10
.countWindow(10L)
// 滑动计数窗口,窗口大小为 10,每过 2条 滑动一次
.countWindow(10L, 2L)
window function 定义了要对窗口中收集的数据做的计算操作,分为以下两类:
增量聚合函数更符合流式处理的架构,但有局限性:每来一条数据就计算一次,只能保存一个简单的中间状态。
像求中位数、基于窗口内全部排序数据的复杂计算等场景,增量聚合函数并不适合,需要使用全窗口函数。
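下面是一个增量聚合函数的简单示意(非视频原代码):用 AggregateFunction 增量计算每个传感器在窗口内的平均温度,其中 15s 的窗口大小仅为示例值。
import org.apache.flink.api.common.functions.AggregateFunction

// 累加器为 (温度和, 数据条数),输出为平均温度
class AvgTempAgg extends AggregateFunction[SensorReading, (Double, Int), Double] {
  override def createAccumulator(): (Double, Int) = (0.0, 0)

  override def add(value: SensorReading, acc: (Double, Int)): (Double, Int) =
    (acc._1 + value.temperature, acc._2 + 1)

  override def getResult(acc: (Double, Int)): Double = acc._1 / acc._2

  override def merge(a: (Double, Int), b: (Double, Int)): (Double, Int) =
    (a._1 + b._1, a._2 + b._2)
}

// 使用方式
// sourceDataStream.keyBy(_.id).timeWindow(Time.seconds(15)).aggregate(new AvgTempAgg)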
增量聚合函数 - reduce - Demo
package com.mso.flink.stream.window
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
// 输入数据的样例类
case class SensorReading(id: String, timestamp: Long, temperature: Double)
object IncrementalWindowDemo {
def main(args: Array[String]): Unit = {
val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val sourceStream: DataStream[String] = environment.socketTextStream("localhost", 7777)
// Transform
val sourceDataStream: DataStream[SensorReading] = sourceStream
.map((data: String) => {
val dataArray: Array[String] = data.split(",")
SensorReading(dataArray(0), dataArray(1).toLong, dataArray(2).toDouble)
})
val resultStream: DataStream[SensorReading] = sourceDataStream
.keyBy(data => data.id)
.timeWindow(Time.seconds(15), Time.seconds(5))
.reduce(new MyReduceFunction)
resultStream.print()
environment.execute("Incremental window demo")
}
}
// 自定义 Reduce 方法
private class MyReduceFunction extends ReduceFunction[SensorReading] {
override def reduce(t: SensorReading, t1: SensorReading): SensorReading = {
SensorReading(t.id, t.timestamp.max(t1.timestamp), t.temperature.min(t1.temperature))
}
}
此处 demo 为滑动时间窗口,窗口大小为 15s,每 5s 滑动一次。
从测试结果中可以发现,keyBy 后的每一个 key 都会被输出三次,输出三次后即被丢弃,且输出顺序与输入顺序并不一致。
每个 key 会输出三次,是因为窗口大小为 15s、每 5s 滑动一次(每次滑动输出一次),输出次数等于该数据所属窗口的个数(size/slide = 3)。每一条数据会存在于三个窗口中,三个窗口都触发计算后,这条数据就会被丢弃。
输出乱序是因为,window 相当于在聚合端为每个 key 开了一个桶,数据在各自的桶内进行计算,按照窗口触发计算的先后顺序输出。
全窗口函数 - apply - Demo
package com.mso.flink.stream.window
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
object FullWindowDemo {
def main(args: Array[String]): Unit = {
val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val sourceStream: DataStream[String] = environment.socketTextStream("localhost", 7777)
// Transform
val sourceDataStream: DataStream[SensorReading] = sourceStream
.map((data: String) => {
val dataArray: Array[String] = data.split(",")
SensorReading(dataArray(0), dataArray(1).toLong, dataArray(2).toDouble)
})
val resultStream: DataStream[(String, Long, Int)] = sourceDataStream
.keyBy(data => data.id)
.timeWindow(Time.seconds(15), Time.seconds(5))
.apply(new MyWindowFunction)
resultStream.print("Full window demo")
environment.execute()
}
}
// 自定义全窗口函数。 不同于 ReduceFunction 和 MapFunction 仅能处理一条数据,全窗口函数可以处理一堆数据
/**
 * Base interface for functions that are evaluated over keyed (grouped) windows.
 * trait WindowFunction[IN, OUT, KEY, W <: Window] extends Function with Serializable
 * @tparam IN The type of the input value.
 * @tparam OUT The type of the output value.
 * @tparam KEY The type of the key.
 * @tparam W The type of the window.
 */
class MyWindowFunction extends WindowFunction[SensorReading, (String, Long, Int), String, TimeWindow] {
override def apply(key: String, window: TimeWindow, input: Iterable[SensorReading], out: Collector[(String, Long, Int)]): Unit = {
// 获取当前时间窗的 起始时间 和 数据量
// 注意此处可发现窗口的起始点为 h/min/s 取整的时间,不是程序启动时间
out.collect((key, window.getStart, input.size))
// val id: String = input.head.id
// val id: String = key.asInstanceOf[Tuple1[String]].f0
}
}
全窗口函数 - process - Demo
package com.practice.flink.stream.demo6
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.datastream.DataStreamSink
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow
import org.apache.flink.util.Collector
object CountWindowAvg {
def main(args: Array[String]): Unit = {
val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// Source
import org.apache.flink.api.scala._
// 使用 nc -lk 手动生成一些数字进行计算
val sourceStream: DataStream[String] = environment.socketTextStream("test01", 9002)
// Transform & Sink
val avgResult: DataStreamSink[Double] = sourceStream.map(x => (1, x.toInt))
.keyBy(0)
.countWindow(3) // 滚动窗口。窗口大小为 3条 数据
// .countWindow(5, 3) // 滑动窗口,窗口大小为 5条 数据,每 3条 数据向前滑动
.process(new MyProcessWindow)
.print()
// execute
environment.execute()
}
}
/**
* IN - (Int, Int) : The type of the input value.
* OUT - Double : The type of the output value.
* KEY - Tuple : The type of the key.
* W - GlobalWindow : The type of the window.
*/
private class MyProcessWindow extends ProcessWindowFunction[(Int, Int), Double, Tuple, GlobalWindow] {
/**
*
* @param key 定义我们聚合的 key
* @param context 上下文对象。用于将数据进行一些上下文的获取
* @param elements 传入的数据
* @param out 用于输出计算结果的收集器
*/
override def process(key: Tuple, context: Context, elements: Iterable[(Int, Int)], out: Collector[Double]): Unit = {
// 用于统计一共有多少条数据
var totalNum: Int = 0
// 用于累加所有数据的和
var totalResult: Int = 0
for (element <- elements) {
totalNum += 1
totalResult += element._2
}
// 注意先转换为 Double 再做除法,避免整数除法丢失精度
out.collect(totalResult.toDouble / totalNum)
}
}
Window 操作主要包含两个步骤:用窗口分配器(window assigner)划分窗口,再用窗口函数(window function)对窗口内的数据进行计算。
window 类型
注:滑动窗口中,每条数据可以属于 size/slide 个窗口。且滑动步长是多大,就多久输出一次。
若 size 远大于 slide 会造成同一条数据存在于多个桶中,会占用大量的资源。
会话窗口:窗口长度不固定,需要指定会话间隔时间(gap),超过间隔时间没有新数据到来时窗口关闭。
窗口函数 - 窗口函数是基于当前窗口内的数据的,是有界数据集的计算,通常只在窗口关闭时输出一次。
window function 定义了要对窗口中收集的数据做的计算操作,分为以下两类:
程序默认的时间语义是 Processing Time。
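若需要使用 Event Time,可参考下面的简单示意(基于 Flink 1.11 的 API,非视频原代码,其中 1 秒的乱序延迟仅为示例值):
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time

// 将时间语义切换为 Event Time
environment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// 为数据流指定事件时间戳,并生成允许 1 秒乱序的 watermark
val withTimestamps: DataStream[SensorReading] = sourceDataStream.assignTimestampsAndWatermarks(
  new BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.seconds(1)) {
    // SensorReading.timestamp 为秒,转换为毫秒作为事件时间
    override def extractTimestamp(element: SensorReading): Long = element.timestamp * 1000L
  }
)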