Flink Review Notes (15): The FlinkSQL Top-Level API - Converting Real-Time Data Streams into SQL Tables

Flink Learning Notes

Preface: Today is day 15 of learning Flink! I went through the basics of FlinkSQL, which tackles data processing in the big data world with tables instead of complex code logic. I learned how to initialize the environment, how to convert a data stream into a table, and how to query table data. Combining my own experiments and code practice, I have summarized many of my own insights and ideas here, and I look forward to discussing them with you!

Tips: "Sharing is the source of happiness. My blog offers not just an ocean of knowledge but also plenty of positive energy, so come and share this happiness with me!

If you like my blog, remember to leave a like ❤️ and a follow! Your support is what keeps me writing!"


Table of Contents

  • Flink Learning Notes
    • I. Getting Started with FlinkSQL
      • 1. Adding Dependencies
      • 2. Creating a TableEnvironment
        • 2.1 Configuring the old-planner streaming environment (Flink-Streaming-Query)
        • 2.2 Configuring the old-planner batch environment (Flink-Batch-Query)
        • 2.3 Configuring the Blink-planner streaming environment (Blink-Streaming-Query)
        • 2.4 Configuring the Blink-planner batch environment (Blink-Batch-Query)
      • 3. Querying a Table
        • 3.1 Imports
        • 3.2 The Table API call model
        • 3.3 The SQL query model
      • 4. Converting a DataStream into a Table

I. Getting Started with FlinkSQL

1. Adding Dependencies

For this stage of FlinkSQL learning, the required pom file is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>cn.itcast</groupId>
    <artifactId>flinksql_pro</artifactId>
    <version>1.0-SNAPSHOT</version>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <properties>
        <flink.version>1.13.1</flink.version>
        <java.version>1.8</java.version>
        <scala.binary.version>2.11</scala.binary.version>
        <hadoop.version>2.7.5</hadoop.version>
        <hbase.version>2.0.0</hbase.version>
        <zkclient.version>0.8</zkclient.version>
        <hive.version>2.1.1</hive.version>
        <mysql.version>5.1.47</mysql.version>
    </properties>

    <!-- Repository configuration -->
    <repositories>
        <repository>
            <id>nexus-aliyun</id>
            <name>Nexus aliyun</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public</url>
        </repository>
        <repository>
            <id>central_maven</id>
            <name>central maven</name>
            <url>https://repo1.maven.org/maven2</url>
        </repository>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>

    </repositories>

    <dependencies>
        <!-- Apache Flink dependencies -->
        <!-- These dependencies are provided, because they should not be packaged into the JAR file. -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <!-- https://mvnrepository.com/artifact/junit/junit -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.13.2</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <!--<scope>compile</scope>-->
            <exclusions>
                <exclusion>
                    <groupId>log4j</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>log4j</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-table -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner-blink_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-runtime-blink_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-api-scala-bridge_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>


        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>log4j</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <!--kafka-->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-sql-connector-kafka_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>2.3.0</version>

        </dependency>
        <!--es6-->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-elasticsearch6_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <!-- flink-jdbc -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-jdbc_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <!--flink-hbase-->

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-hbase-2.2_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>


        <!--hadoop-->
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-hadoop-compatibility_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-shaded-hadoop-2-uber</artifactId>
            <version>2.7.5-10.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-csv</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <!-- flink-json -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-json</artifactId>
            <version>${flink.version}</version>
            <!--<scope>test</scope>-->
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-runtime_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <!--<scope>test</scope>-->
        </dependency>

        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
            <version>1.2.3</version>
            <!--<scope>test</scope>-->
        </dependency>


        <!-- https://mvnrepository.com/artifact/ch.qos.logback/logback-core -->
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-core</artifactId>
            <version>1.2.3</version>
        </dependency>

        <!-- json -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.5</version>
        </dependency>

        <!-- On hive -->
        <!-- Flink Dependency -->

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-hive_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- Hive Dependency -->
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>${hive.version}</version>
        </dependency>

        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.2</version>
            <scope>provided</scope>
        </dependency>

        <!-- MySQL JDBC driver -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.version}</version>
        </dependency>
    </dependencies>
</project>

2. Creating a TableEnvironment

Overview: TableEnvironment is the core concept of the Table API and SQL.

Responsibilities (a minimal usage sketch follows this list):

  • 1- Register catalogs, and register tables inside them
  • 2- Execute SQL queries
  • 3- Register user-defined functions
  • 4- Convert a DataStream into a Table
  • 5- Hold references to the ExecutionEnvironment and StreamExecutionEnvironment
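
Below is a minimal sketch (my own illustration, not from the course material) that touches each of these responsibilities, using a Blink streaming environment like the one configured in 2.3 below; the names inputTable, myUpper and MyUpper are made up for the example:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.functions.ScalarFunction;

import static org.apache.flink.table.api.Expressions.$;

public class TableEnvRolesSketch {

    // a user-defined scalar function for role 3 (the name is illustrative)
    public static class MyUpper extends ScalarFunction {
        public String eval(String s) {
            return s == null ? null : s.toUpperCase();
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // role 5: the table environment keeps a reference to the stream environment
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(
                env,
                EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build());

        // role 4: convert a DataStream into a Table (naming the single column "word")
        Table words = tableEnv.fromDataStream(env.fromElements("a", "b", "c"), $("word"));

        // role 1: register the table under a name in the default catalog
        tableEnv.createTemporaryView("inputTable", words);

        // role 3: register a user-defined function
        tableEnv.createTemporarySystemFunction("myUpper", MyUpper.class);

        // role 2: execute a SQL query against the registered table and print it
        tableEnv.sqlQuery("SELECT myUpper(word) FROM inputTable").execute().print();
    }
}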

2.1 Configuring the old-planner streaming environment (Flink-Streaming-Query)
// **********************
// FLINK STREAMING QUERY
// **********************
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment fsEnv = StreamExecutionEnvironment.getExecutionEnvironment();

EnvironmentSettings fsSettings = EnvironmentSettings.newInstance()
	.useOldPlanner() // use the old planner
	.inStreamingMode() // streaming mode
	.build();

StreamTableEnvironment fsTableEnv = StreamTableEnvironment.create(fsEnv, fsSettings);
// or TableEnvironment fsTableEnv = TableEnvironment.create(fsSettings);
2.2 Configuring the old-planner batch environment (Flink-Batch-Query)
// ******************
// FLINK BATCH QUERY
// ******************
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.BatchTableEnvironment;

ExecutionEnvironment fbEnv = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment fbTableEnv = BatchTableEnvironment.create(fbEnv);
2.3 Configuring the Blink-planner streaming environment (Blink-Streaming-Query)

The difference from the old planner is the useBlinkPlanner() call:

// **********************
// BLINK STREAMING QUERY
// **********************
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment bsEnv = StreamExecutionEnvironment.getExecutionEnvironment();

EnvironmentSettings bsSettings = EnvironmentSettings.newInstance()
	.useBlinkPlanner() // use the Blink planner
	.inStreamingMode() // streaming mode
	.build();
StreamTableEnvironment bsTableEnv = StreamTableEnvironment.create(bsEnv, bsSettings);
// or TableEnvironment bsTableEnv = TableEnvironment.create(bsSettings);
2.4 Configuring the Blink-planner batch environment (Blink-Batch-Query)

The differences from the old planner:

  • The old planner uses BatchTableEnvironment and is created from fbEnv

  • The Blink planner uses TableEnvironment and is created from bbSettings

// ******************
// BLINK BATCH QUERY
// ******************
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

EnvironmentSettings bbSettings = EnvironmentSettings.newInstance()
	.useBlinkPlanner() // use the Blink planner
	.inBatchMode() // batch mode
	.build();
TableEnvironment bbTableEnv = TableEnvironment.create(bbSettings);
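
Note: in Flink 1.13 (the version declared in the pom above), the Blink planner is already the default, so as far as I can tell the explicit useBlinkPlanner() call can usually be dropped. A minimal equivalent sketch:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// Blink has been the default planner since Flink 1.11, so this is equivalent to
// EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build()
EnvironmentSettings settings = EnvironmentSettings.newInstance()
        .inBatchMode()
        .build();
TableEnvironment tableEnv = TableEnvironment.create(settings);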

3. Querying a Table

3.1 Imports
import static org.apache.flink.table.api.Expressions.*;
3.2 The Table API call model
// Use the table environment to look up the registered source table
Table orderTable = tableEnv.from("inputTable");

// Query with the Table API:
// in select(), reference fields with $("fieldName")
// in filter(), apply filter conditions
Table resultTable = orderTable
        .select($("id"), $("timestamp"), $("category"), $("areaName"), $("money"))
        .filter($("areaName").isEqual("北京"));

// For grouped aggregations, groupBy() must be called before select()
Table aggResultSqlTable = orderTable
        .groupBy($("areaName"))
        .select($("areaName"), $("id").count().as("cnt"));
3.3 The SQL query model
// Use the table environment to look up the registered source table
Table orderTable = tableEnv.from("inputTable");

// Run a SQL query with the sqlQuery method
Table resultTable2 = tableEnv.sqlQuery("select id,`timestamp`,category,areaName,money from inputTable where areaName='北京'");
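
Note that sqlQuery refers to the table by its registered name. If you only hold a Table object (for example one produced by fromDataStream, as in the full example in section 4), it has to be registered first; a minimal sketch, where dataTable stands for such an unregistered Table:

// A Table object is not visible to SQL until it is registered as a (temporary) view
tableEnv.createTemporaryView("inputTable", dataTable);
Table resultTable3 = tableEnv.sqlQuery("SELECT * FROM inputTable");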

4. Converting a DataStream into a Table

// The data file to read can be placed under the resources directory
String filePath = TheEnclosingClass.class.getClassLoader().getResource("yourFile").getPath();

// Read the data as a text stream
env.readTextFile(filePath);

// Use a map function to convert each line into the target type

// Convert the data stream into a table
tableEnv.fromDataStream(dataStream);

Example: converting a DataStream into a Table

Data source: order.csv, placed under the resources directory

user_001,1621718199,10.1,电脑
user_001,1621718201,14.1,手机
user_002,1621718202,82.5,手机
user_001,1621718205,15.6,电脑
user_004,1621718207,10.2,家电
user_001,1621718208,15.8,电脑
user_005,1621718212,56.1,电脑
user_002,1621718260,40.3,家电
user_001,1621718580,11.5,家居
user_001,1621718860,61.6,家居

Code:

package cn.itcast.day01;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

import static org.apache.flink.table.api.Expressions.$;


/**
 * @author lql
 * @time 2024-03-12 14:34:40
 * @description TODO: convert a DataStream into a Table
 */
public class DataStreamToTable {
    public static void main(String[] args) throws Exception {
        // todo 1) initialize the table environment
        // 1.1 stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 1.2 environment settings
        EnvironmentSettings bsSettings = EnvironmentSettings.newInstance()
                .useBlinkPlanner()
                .inStreamingMode()
                .build();
        // 1.3 table environment
        StreamTableEnvironment bsTableEnv = StreamTableEnvironment.create(env, bsSettings);

        // todo 2) read the data source with the stream environment
        String filePath = DataStreamToTable.class.getClassLoader().getResource("order.csv").getPath();

        DataStreamSource<String> inputStream = env.readTextFile(filePath);

        // todo 3) map each line into the POJO type
        SingleOutputStreamOperator<OrderInfo> dataStream = inputStream.map(new MapFunction<String, OrderInfo>() {
            @Override
            public OrderInfo map(String data) throws Exception {
                String[] dataArray = data.split(",");
                return new OrderInfo(
                        dataArray[0],
                        dataArray[1],
                        Double.parseDouble(dataArray[2]),
                        dataArray[3]);
            }
        });

        // todo 4) convert the data stream into a table
        Table dataTable = bsTableEnv.fromDataStream(dataStream);

        // todo 5) query the table data
        // Method 1: query with the Table API
        Table resultTable = dataTable
                .select($("id"), $("timestamp"), $("money"),$("category"))
                .filter($("category").isEqual("电脑"));

        // convert the table back to a stream and print it
        bsTableEnv.toAppendStream(resultTable, Row.class).print("Method 1 (Table API result)");


        // Method 2: register a temporary view and query it with SQL
        bsTableEnv.createTemporaryView("inputTable",dataTable);
        Table resultTable1 = bsTableEnv.sqlQuery("SELECT * FROM inputTable WHERE category = '电脑'");
        bsTableEnv.toAppendStream(resultTable1, Row.class).print("Method 2 (SQL query result)");

        env.execute();
    }

    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public static class OrderInfo {
        private String id;
        private String timestamp;
        private Double money;
        private String category;
    }
}
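
A side note on step 4 (my own addition): fromDataStream derives the column names id, timestamp, money and category from the OrderInfo POJO. If you want to select or rename columns during the conversion, the fields can be listed explicitly; a minimal sketch, where the alias "ts" is made up for illustration:

// Explicitly select and rename POJO fields while converting the stream into a table
Table renamedTable = bsTableEnv.fromDataStream(
        dataStream,
        $("id"),
        $("timestamp").as("ts"),
        $("money"),
        $("category"));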

Result:

Method 1 (Table API result):5> +I[user_001, 1621718205, 15.6, 电脑]
Method 2 (SQL query result):5> +I[user_001, 1621718205, 15.6, 电脑]
Method 1 (Table API result):3> +I[user_001, 1621718199, 10.1, 电脑]
Method 1 (Table API result):7> +I[user_001, 1621718208, 15.8, 电脑]
Method 2 (SQL query result):3> +I[user_001, 1621718199, 10.1, 电脑]
Method 2 (SQL query result):7> +I[user_001, 1621718208, 15.8, 电脑]
Method 1 (Table API result):7> +I[user_005, 1621718212, 56.1, 电脑]
Method 2 (SQL query result):7> +I[user_005, 1621718212, 56.1, 电脑]

Summary:

  • 1- Parallelism was not set to 1, so the printed results appear out of order (see the sketch after this list)
  • 2- Method 1: Table API calls; convert the stream into a table and express the query logic with API calls
  • 3- Method 2: SQL query; convert the stream into a table and register it as a temporary view
  • 4- Both methods need to convert the table back into a stream before it can be printed!
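
A minimal sketch of the fix for point 1, assuming the example code above: set the parallelism to 1, either globally or just on the print sink, so that a single task emits the output in order:

// either globally, right after creating the stream environment ...
env.setParallelism(1);

// ... or only on the sink, keeping the rest of the job parallel
bsTableEnv.toAppendStream(resultTable, Row.class)
        .print("Method 1 (Table API result)")
        .setParallelism(1);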
