SeaTunnel: consuming Kafka data and writing it to ClickHouse

SeaTunnel is an easy-to-use, high-performance data integration platform for massive data that supports both real-time streaming and offline batch processing. It is built on top of Apache Spark and Apache Flink and supports real-time synchronization and transformation of large volumes of data.

This post walks through using SeaTunnel to consume data from a Kafka topic and write it into ClickHouse.

SeaTunnel: 2.1.0

Spark: 2.4.*

Kafka: 2.7.0

ClickHouse: 21.7.5.29

The pipeline is: Filebeat ships log files into Kafka, and SeaTunnel consumes the Kafka data and writes it into ClickHouse.

The records in Kafka look like this:

{"@timestamp":"2022-06-25T13:01:01.211Z","@metadata":{"beat":"filebeat","type":"_doc","version":"7.13.1"},"log":{"offset":4333868,"file":{"path":"/opt/servers/tomcatServers/tomcat-record/t1/webapps/recordlogs/test.log"}},"message":"[2022-06-25 21:01:00] [heartDataLog(10)] [ConsumeMessageThread_3] [INFO]-21471190\t汇文\t启动器\t广东省\t东莞市\t\t1060942144065537278\t1060942144065537278\t906\t114.087797\t22.815885\tnull\t中国广东省东莞市环市西路2号\t113.77.226.5\t1\t1\t1656162060855\t1656162060846\tnull\tnull\tnull\tnull\tnull\tnull","input":{"type":"log"},"fields":{"filetype":"heart","fields_under_root":true},"ecs":{"version":"1.8.0"},"host":{"name":"hd.n12"},"agent":{"version":"7.13.1","hostname":"hd.n12","ephemeral_id":"5260b6bd-d5ba-4020-86fd-bb02e1437cab","id":"72eafd44-b71d-452f-bdb5-b986d2a12c15","name":"hd.n12","type":"filebeat"}}

Create the dw_testdata table in ClickHouse:


CREATE DATABASE IF NOT EXISTS seatunnel;
DROP TABLE IF EXISTS seatunnel.dw_testdata;
CREATE TABLE seatunnel.dw_testdata
(
    `userid` String,
    `username` String,
    `app_name` String,
    `province_name` String,
    `city_name` String,
    `district_name` String,
    `machine_no` String,
    `real_machineno` String,
    `product_name` String,
    `longitude` String,
    `latitude` String,
    `rd` String,
    `address` String,
    `ip` String,
    `screenlight` String,
    `screenlock` String,
    `hearttime` String,
    `time` String,
    `ssid` String,
    `mac_address` String,
    `rssi` String,
    `timezone` String,
    `machine_current_time` String,
    `program_version` String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(FROM_UNIXTIME(toInt32(time)))
PRIMARY KEY (userid, machine_no)
ORDER BY (userid, machine_no)
TTL FROM_UNIXTIME(toInt32(time)) + toIntervalMonth(1)
SETTINGS index_granularity = 8192
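
The PARTITION BY and TTL expressions above treat time as a string holding an epoch timestamp in seconds; a millisecond value (as in the sample log line) would need to be divided by 1000 first. A hypothetical check of what the expressions evaluate to for a seconds-precision value:

-- Hypothetical check of the partition key and TTL deadline
-- (1656162060 is 2022-06-25 21:01 Asia/Shanghai, matching the sample line)
SELECT
    toYYYYMMDD(FROM_UNIXTIME(toInt32('1656162060')))           AS partition_key, -- 20220625
    FROM_UNIXTIME(toInt32('1656162060')) + toIntervalMonth(1)  AS ttl_deadline;  -- one month later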

The columns of dw_testdata correspond, in order, to the tab-separated values that follow [INFO]- in the message field.

Create the SeaTunnel job config in the conf directory: vi spark.streaming.kafka.conf

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

######
###### This config file is a demonstration of stream processing in seatunnel config
######

env {
  # You can set spark configuration here
  # see available properties defined by spark: https://spark.apache.org/docs/latest/configuration.html#available-properties
  spark.app.name = "SeaTunnel"
  spark.executor.instances = 2
  spark.executor.cores = 1
  spark.executor.memory = "1g"
  spark.streaming.batchDuration = 5
}

source {
  # This is an example input plugin **only for testing and demonstrating the input plugin feature**
 # fakeStream {
 #   content = ["Hello World, SeaTunnel"]
 # }
  kafkaStream {
    topics = "filebeat_tests"
    consumer.bootstrap.servers = "localhost:9092"
    consumer.group.id = "seatunnel_group"
  }

  # You can also use other input plugins, such as file
  # file {
  #   result_table_name = "accesslog"
  #   path = "hdfs://hadoop-cluster-01/nginx/accesslog"
  #   format = "json"
  # }

  # If you would like to get more information about how to configure seatunnel and see full list of input plugins,
  # please go to https://seatunnel.apache.org/docs/spark/configuration/source-plugins/FakeStream
}

transform {

 # split {
 #   fields = ["msg", "name"]
 #   delimiter = ","
 # }
  # The raw record from Kafka is JSON, so parse it with the json transform
  json {
    source_field = "raw_message"
  }

  # Split the `message` field of the parsed JSON on the [INFO]- marker
  split {
    source_field = "message"
    # target_field = "info"
    separator = "\\[INFO\\]-"
    fields = ["title","data"]
  }

  # Split the extracted `data` part on tabs into the individual columns
  split {
    source_field = "data"
    separator = "\t"
    fields = ["userid","username","app_name","province_name","city_name","district_name","machine_no","real_machineno","product_name","longitude","latitude","rd","address","ip","screenlight","screenlock","hearttime","time","ssid","mac_address","rssi","timezone","machine_current_time","program_version"]
    result_table_name = "tests"
  }

  # you can also use other filter plugins, such as sql
   Sql {
     table_name = "tests"
     sql = "select * from (select  userid,username,app_name,province_name,city_name,district_name,machine_no,real_machineno,product_name,longitude,latitude,rd,address,ip,screenlight,screenlock,string(hearttime),string(time),ssid,mac_address,rssi,timezone,machine_current_time,program_version from tests) t1"
   }

  # If you would like to get more information about how to configure seatunnel and see full list of filter plugins,
  # please go to https://seatunnel.apache.org/docs/spark/configuration/transform-plugins/Split
}

sink {
  # choose stdout output plugin to output data to console
  #Console {}
  # Print a sample of the output to the console
  console {
    limit = 10,
    serializer = "json"
  }
  # Write to ClickHouse
  clickhouse {
    host = "localhost:8123"
    clickhouse.socket_timeout = 50000
    database = "seatunnel"
    table = "dw_testdata"
    fields = ["userid","username","app_name","province_name","city_name","district_name","machine_no","real_machineno","product_name","longitude","latitude","rd","address","ip","screenlight","screenlock","hearttime","time","ssid","mac_address","rssi","timezone","machine_current_time","program_version"]
    username = "default"
    password = ""
    bulk_size = 20000
}
  # you can also use other output plugins, such as hdfs
  # hdfs {
  #   path = "hdfs://hadoop-cluster-01/nginx/accesslog_processed"
  #   save_mode = "append"
  # }

  # If you would like to get more information about how to configure seatunnel and see full list of output plugins,
  # please go to https://seatunnel.apache.org/docs/spark/configuration/sink-plugins/Console
}

Then create a launch script for SeaTunnel: vi start-seatunnel.sh

./bin/start-seatunnel-spark.sh \
--master yarn \
--deploy-mode client \
--config $1

sh start-seatunnel.sh  conf/spark.streaming.kafka.conf 

With this, data is written into ClickHouse in real time.
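
Once the job is running, a couple of hypothetical spot checks in clickhouse-client confirm that rows are arriving and that daily partitions are being created:

-- Row count and the range of raw timestamp strings ingested so far
SELECT count() AS rows, min(time) AS first_ts, max(time) AS last_ts
FROM seatunnel.dw_testdata;

-- Active partitions created by the streaming inserts
SELECT partition, sum(rows) AS rows
FROM system.parts
WHERE database = 'seatunnel' AND table = 'dw_testdata' AND active
GROUP BY partition
ORDER BY partition;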

Open issues:

1. How can the string-format timestamps be converted into ClickHouse DateTime values? (One possible direction is sketched after this list.)

2. If the message field contains a JSON array, how can it be split into multiple rows before writing?
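
For question 1, one possible direction (a sketch only, not verified against this exact pipeline): if time holds an epoch value in milliseconds as a string, ClickHouse can convert it with fromUnixTimestamp64Milli, or with toDateTime after dividing by 1000, for example as a MATERIALIZED column (the event_time name below is hypothetical) so the sink can keep writing the raw string:

-- Hypothetical conversion of the string epoch value into DateTime types
SELECT
    fromUnixTimestamp64Milli(toInt64('1656162060846'))            AS dt_millis,   -- DateTime64(3)
    toDateTime(toUInt32(intDiv(toInt64('1656162060846'), 1000)))  AS dt_seconds;  -- DateTime

-- Hypothetical typed column derived from the existing string column
ALTER TABLE seatunnel.dw_testdata
    ADD COLUMN event_time DateTime MATERIALIZED toDateTime(toUInt32(intDiv(toInt64(time), 1000)));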

If anyone has solved these, replies are very welcome. Thanks in advance.
