Flink CDC Usage and Parameter Settings

Monitoring MySQL from Flink SQL via CDC

CREATE TABLE order_source_ms (
    id BIGINT,
    deal_amt DOUBLE,
    shop_id STRING,
    customer_id STRING,
    city_id BIGINT,
    product_count DOUBLE,
    order_at TIMESTAMP(3),
    last_updated_at TIMESTAMP(3),
    pay_at TIMESTAMP,
    refund_at TIMESTAMP,
    tenant_id STRING,
    order_category STRING,
    h AS HOUR(last_updated_at),
    pay_hour AS HOUR(pay_at),
    refund_hour AS HOUR(refund_at),
    m AS MINUTE(last_updated_at),
    dt AS TO_DATE(CAST(last_updated_at AS STRING)),
    pay_dt AS TO_DATE(CAST(pay_at AS STRING)),
    refund_dt AS TO_DATE(CAST(refund_at AS STRING)),
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'ip',
    'port' = '3306',
    'username' = 'username',
    'password' = 'password',
    'database-name' = 'databasename',
    'scan.startup.mode' = 'latest-offset',
    'debezium.skipped.operations' = 'd',
    'table-name' = 'tablename'
)

Executing the SQL statement above through the SQL Client establishes the connection to the corresponding MySQL table. The prerequisite is that the required dependency jar, flink-sql-connector-mysql-cdc-2.2-SNAPSHOT.jar, has already been placed in Flink's lib directory.
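Once registered, the table can be queried from the SQL Client like any other table. The aggregation below is a hypothetical sketch, not part of the original setup; it continuously sums order amounts per shop and day over the change stream:

```sql
-- Hypothetical continuous query over the CDC table defined above:
-- each incoming change event updates the per-shop, per-day totals.
SELECT shop_id, dt, SUM(deal_amt) AS total_amt
FROM order_source_ms
GROUP BY shop_id, dt;
```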

The table definition in this example uses Flink SQL's built-in time functions to extract the hour, minute, and date from each row:

h AS HOUR(last_updated_at),
pay_hour AS HOUR(pay_at),
refund_hour AS HOUR(refund_at),
m AS MINUTE(last_updated_at),
dt AS TO_DATE(CAST(last_updated_at AS STRING)),
pay_dt AS TO_DATE(CAST(pay_at AS STRING)),
refund_dt AS TO_DATE(CAST(refund_at AS STRING)),

These are Flink SQL built-in functions; the full list is in the official documentation.
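To make the computed columns concrete, here is a small sketch with an assumed timestamp literal (the value is purely illustrative):

```sql
-- Assuming a timestamp of 2022-05-01 14:30:25, the functions evaluate as:
SELECT
    HOUR(TIMESTAMP '2022-05-01 14:30:25'),                      -- 14
    MINUTE(TIMESTAMP '2022-05-01 14:30:25'),                    -- 30
    TO_DATE(CAST(TIMESTAMP '2022-05-01 14:30:25' AS STRING));   -- 2022-05-01
```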

Parameter reference:

The common parameters are documented on the official site:

connector (required, String, default: none)
    Specify what connector to use; here it should be 'mysql-cdc'.
hostname (required, String, default: none)
    IP address or hostname of the MySQL database server.
username (required, String, default: none)
    Username to use when connecting to the MySQL database server.
password (required, String, default: none)
    Password to use when connecting to the MySQL database server.
database-name (required, String, default: none)
    Database name of the MySQL server to monitor. Regular expressions are also supported, to monitor multiple databases matching the expression.
table-name (required, String, default: none)
    Table name of the MySQL database to monitor. Regular expressions are also supported, to monitor multiple tables matching the expression.
port (optional, Integer, default: 3306)
    Integer port number of the MySQL database server.
server-id (optional, Integer, default: none)
    A numeric ID or numeric ID range for this database client, e.g. '5400' or '5400-5408'. The range syntax is recommended when 'scan.incremental.snapshot.enabled' is enabled. Every ID must be unique across all currently running database processes in the MySQL cluster, because the connector joins the MySQL cluster as another server (with this unique ID) in order to read the binlog. By default a random number between 5400 and 6400 is generated, but setting an explicit value is recommended.
scan.incremental.snapshot.enabled (optional, Boolean, default: true)
    Incremental snapshot is a new mechanism for reading the snapshot of a table. Compared to the old mechanism it has several advantages: (1) the source can read the snapshot in parallel, (2) the source can checkpoint at chunk granularity during snapshot reading, and (3) the source does not need to acquire a global read lock (FLUSH TABLES WITH READ LOCK) before reading the snapshot. To run the source in parallel, each parallel reader needs a unique server id, so 'server-id' must be a range such as '5400-6400', and the range must be larger than the parallelism. See the Incremental Snapshot Reading section for details.
scan.incremental.snapshot.chunk.size (optional, Integer, default: 8096)
    The chunk size (number of rows) of the table snapshot; captured tables are split into multiple chunks when the snapshot is read.
scan.snapshot.fetch.size (optional, Integer, default: 1024)
    The maximum fetch size per poll when reading a table snapshot.
scan.startup.mode (optional, String, default: initial)
    Startup mode for the MySQL CDC consumer; valid values are "initial" and "latest-offset". See the Startup Reading Position section for details.
server-time-zone (optional, String, default: UTC)
    The session time zone of the database server, e.g. "Asia/Shanghai". It controls how the MySQL TIMESTAMP type is converted to STRING.
debezium.min.row.count.to.stream.result (optional, Integer, default: 1000)
    During a snapshot the connector queries each included table to produce a read event for every row. This parameter determines whether the MySQL connection pulls all results for a table into memory (fast, but memory-hungry) or streams them (can be slower, but works for very large tables). The value is the minimum number of rows a table must contain before the connector streams results; set it to '0' to skip all table size checks and always stream during a snapshot.
connect.timeout (optional, Duration, default: 30s)
    The maximum time the connector waits after trying to connect to the MySQL database server before timing out.
debezium.* (optional, String, default: none)
    Pass-through properties for the Debezium Embedded Engine, which captures data changes from the MySQL server. For example: 'debezium.snapshot.mode' = 'never'. See the Debezium MySQL connector documentation for the full list.
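As an example of the server-id rule above: to run an incremental-snapshot source with parallelism 4, the WITH clause needs a server-id range at least as large as the parallelism (the values below are hypothetical):

```sql
'scan.incremental.snapshot.enabled' = 'true',
'server-id' = '5400-5404'
```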

One parameter setting deserves special mention:

'debezium.skipped.operations' = 'd',

This option tells the connector to skip delete operations when reading the MySQL binlog. It took a long time to find: the business requirement was to filter out deletes, and no Flink SQL-level parameter for this turned up anywhere, until I noticed:

the last row of the option table on the official site: any Debezium property can be passed through with the 'debezium.' prefix. By following the linked Debezium documentation you can find whatever additional capture options you need; that is where I found the parameter for filtering delete operations.
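In Debezium, skipped.operations takes a comma-separated list of operation codes: c (create/insert), u (update), d (delete). So to skip both deletes and updates, for example, the WITH clause would contain:

```sql
'debezium.skipped.operations' = 'd,u'
```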


Connecting to MySQL CDC from code

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import com.ververica.cdc.connectors.mysql.source.MySqlSource;

public class MySqlSourceExample {
  public static void main(String[] args) throws Exception {
    MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
        .hostname("yourHostname")
        .port(yourPort)
        .databaseList("yourDatabaseName") // set captured database
        .tableList("yourDatabaseName.yourTableName") // set captured table
        .username("yourUsername")
        .password("yourPassword")
        .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
        .build();

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // enable checkpoint
    env.enableCheckpointing(3000);

    env
      .fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL Source")
      // set 4 parallel source tasks
      .setParallelism(4)
      .print().setParallelism(1); // use parallelism 1 for sink to keep message ordering

    env.execute("Print MySQL Snapshot + Binlog");
  }
}
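The JsonDebeziumDeserializationSchema configured above serializes each change event into a Debezium-style JSON string. Its shape is roughly as follows (the field values and the subset of source fields shown are illustrative):

```json
{
  "before": null,
  "after": {"id": 1, "deal_amt": 9.9, "shop_id": "s001"},
  "source": {"db": "yourDatabaseName", "table": "yourTableName"},
  "op": "c",
  "ts_ms": 1650000000000
}
```

The op field is 'c' for insert, 'u' for update, 'd' for delete, and 'r' for rows read during the snapshot phase.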

One more point worth adding: when creating the Flink SQL table I originally used the DOUBLE data type for the amount columns to match MySQL's DECIMAL. Table creation reports no error, but arithmetic on the amounts then suffers precision loss, so the table should be declared with DECIMAL instead.
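The precision issue is easy to reproduce in plain Java, independent of Flink (a minimal sketch): binary doubles cannot represent decimal amounts such as 0.1 exactly, while BigDecimal, which is what Flink's DECIMAL type corresponds to, performs exact decimal arithmetic.

```java
import java.math.BigDecimal;

public class AmountPrecisionDemo {
    public static void main(String[] args) {
        // Summing amounts as double accumulates binary rounding error.
        double doubleSum = 0.1 + 0.2;
        System.out.println(doubleSum);    // 0.30000000000000004, not 0.3

        // BigDecimal keeps the decimal value exact.
        BigDecimal exactSum = new BigDecimal("0.1").add(new BigDecimal("0.2"));
        System.out.println(exactSum);     // 0.3
    }
}
```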

For more on using Flink SQL, see: a detailed and systematic guide to Flink SQL and its practical use in production.
