SeaTunnel is an easy-to-use, high-performance data integration platform for both real-time streaming and offline batch processing. It is built on top of Apache Spark and Apache Flink and supports real-time synchronization and transformation of massive data volumes.
In this post, SeaTunnel consumes a Kafka topic and writes the data into ClickHouse.
seatunnel: 2.1.0
spark: 2.4.*
kafka: 2.7.0
clickhouse: 21.7.5.29
Filebeat ships log data into Kafka; SeaTunnel consumes the Kafka topic and writes the rows into ClickHouse.
The Kafka messages look like this:
{"@timestamp":"2022-06-25T13:01:01.211Z","@metadata":{"beat":"filebeat","type":"_doc","version":"7.13.1"},"log":{"offset":4333868,"file":{"path":"/opt/servers/tomcatServers/tomcat-record/t1/webapps/recordlogs/test.log"}},"message":"[2022-06-25 21:01:00] [heartDataLog(10)] [ConsumeMessageThread_3] [INFO]-21471190\t汇文\t启动器\t广东省\t东莞市\t\t1060942144065537278\t1060942144065537278\t906\t114.087797\t22.815885\tnull\t中国广东省东莞市环市西路2号\t113.77.226.5\t1\t1\t1656162060855\t1656162060846\tnull\tnull\tnull\tnull\tnull\tnull","input":{"type":"log"},"fields":{"filetype":"heart","fields_under_root":true},"ecs":{"version":"1.8.0"},"host":{"name":"hd.n12"},"agent":{"version":"7.13.1","hostname":"hd.n12","ephemeral_id":"5260b6bd-d5ba-4020-86fd-bb02e1437cab","id":"72eafd44-b71d-452f-bdb5-b986d2a12c15","name":"hd.n12","type":"filebeat"}}
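Before writing the SeaTunnel config, it helps to see what the transforms below have to do with this record. A minimal Python sketch (for illustration only, not part of the pipeline; the sample is trimmed to the `message` field):

```python
import json

# Trimmed Filebeat record: only the "message" field matters for the transforms;
# the real record also carries @timestamp, host, agent, etc.
raw = json.dumps({
    "message": "[2022-06-25 21:01:00] [heartDataLog(10)] [ConsumeMessageThread_3] "
               "[INFO]-21471190\t汇文\t启动器\t广东省\t东莞市\t\t1060942144065537278"
               "\t1060942144065537278\t906\t114.087797\t22.815885\tnull"
               "\t中国广东省东莞市环市西路2号\t113.77.226.5\t1\t1\t1656162060855"
               "\t1656162060846\tnull\tnull\tnull\tnull\tnull\tnull"
})

record = json.loads(raw)
# First split: everything after the "[INFO]-" marker is the business payload.
title, data = record["message"].split("[INFO]-", 1)
# Second split: the payload is tab-separated into 24 columns.
fields = data.split("\t")
print(len(fields))   # 24
print(fields[0])     # userid: 21471190
```

These two splits are exactly what the `split` transforms in the config below perform.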
Create the dw_testdata table in ClickHouse:
CREATE DATABASE IF NOT EXISTS seatunnel;
DROP TABLE IF EXISTS seatunnel.dw_testdata;
CREATE TABLE seatunnel.dw_testdata
(
`userid` String,
`username` String,
`app_name` String,
`province_name` String,
`city_name` String,
`district_name` String,
`code` String,
`real_code` String,
`product_name` String,
`longitude` String,
`latitude` String,
`rd` String,
`address` String,
`ip` String,
`screenlight` String,
`screenlock` String,
`hearttime` String,
`time` String,
`ssid` String,
`mac_address` String,
`rssi` String,
`timezone` String,
`current_time` String,
`program_version` String
)
ENGINE = MergeTree
-- `time` arrives as a millisecond epoch string (e.g. 1656162060846), so divide
-- by 1000 before FROM_UNIXTIME; toInt32 would overflow on a value that size.
PARTITION BY toYYYYMMDD(FROM_UNIXTIME(intDiv(toUInt64OrZero(time), 1000)))
PRIMARY KEY (userid, code)
ORDER BY (userid, code)
TTL FROM_UNIXTIME(intDiv(toUInt64OrZero(time), 1000)) + toIntervalMonth(1)
SETTINGS index_granularity = 8192;
The dw_testdata columns map one-to-one to the tab-separated values that follow [INFO]- in the message field.
In SeaTunnel's conf directory, create the job config: vi spark.streaming.kafka.conf
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
######
###### This config file is a demonstration of stream processing in seatunnel config
######
env {
# You can set spark configuration here
# see available properties defined by spark: https://spark.apache.org/docs/latest/configuration.html#available-properties
spark.app.name = "SeaTunnel"
spark.executor.instances = 2
spark.executor.cores = 1
spark.executor.memory = "1g"
spark.streaming.batchDuration = 5
}
source {
# This is an example input plugin, **only for testing and demonstrating input plugins**
# fakeStream {
# content = ["Hello World, SeaTunnel"]
# }
kafkaStream {
topics = "filebeat_tests"
consumer.bootstrap.servers = "localhost:9092"
consumer.group.id = "seatunnel_group"
}
# You can also use other input plugins, such as file
# file {
# result_table_name = "accesslog"
# path = "hdfs://hadoop-cluster-01/nginx/accesslog"
# format = "json"
# }
# If you would like to get more information about how to configure seatunnel and see full list of input plugins,
# please go to https://seatunnel.apache.org/docs/spark/configuration/source-plugins/FakeStream
}
transform {
# split {
# fields = ["msg", "name"]
# delimiter = ","
# }
# The payload is already JSON, so use the json transform
json {
source_field = "raw_message"
}
# Next, split the message field from the JSON on the [INFO]- marker
split {
source_field = "message"
# target_field = "info"
separator = "\\[INFO\\]-"
fields = ["title","data"]
}
# Split the extracted data field again on tab characters
split {
source_field = "data"
separator = "\t"
fields = ["userid","username","app_name","province_name","city_name","district_name","code","real_code","product_name","longitude","latitude","rd","address","ip","screenlight","screenlock","hearttime","time","ssid","mac_address","rssi","timezone","current_time","program_version"]
result_table_name = "tests"
}
# you can also use other filter plugins, such as sql
Sql {
table_name = "tests"
sql = "select userid,username,app_name,province_name,city_name,district_name,code,real_code,product_name,longitude,latitude,rd,address,ip,screenlight,screenlock,string(hearttime) as hearttime,string(time) as time,ssid,mac_address,rssi,timezone,current_time,program_version from tests"
}
# If you would like to get more information about how to configure seatunnel and see full list of filter plugins,
# please go to https://seatunnel.apache.org/docs/spark/configuration/transform-plugins/Split
}
sink {
# choose stdout output plugin to output data to console
#Console {}
# Print to the console for debugging
console {
limit = 10
serializer = "json"
}
# Write to ClickHouse
clickhouse {
host = "localhost:8123"
clickhouse.socket_timeout = 50000
database = "seatunnel"
table = "dw_testdata"
fields = ["userid","username","app_name","province_name","city_name","district_name","code","real_code","product_name","longitude","latitude","rd","address","ip","screenlight","screenlock","hearttime","time","ssid","mac_address","rssi","timezone","current_time","program_version"]
username = "default"
password = ""
bulk_size = 20000
}
# you can also use other output plugins, such as hdfs
# hdfs {
# path = "hdfs://hadoop-cluster-01/nginx/accesslog_processed"
# save_mode = "append"
# }
# If you would like to get more information about how to configure seatunnel and see full list of output plugins,
# please go to https://seatunnel.apache.org/docs/spark/configuration/sink-plugins/Console
}
Then create a small launcher script and start SeaTunnel: vi start-seatunnel.sh
./bin/start-seatunnel-spark.sh \
--master yarn \
--deploy-mode client \
--config $1
sh start-seatunnel.sh conf/spark.streaming.kafka.conf
Data now flows into ClickHouse in real time.
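To spot-check that rows are arriving, a query along these lines can be run in clickhouse-client (assuming the table created above; toUInt64OrZero guards against non-numeric time values):

```sql
SELECT
    count() AS rows,
    max(FROM_UNIXTIME(intDiv(toUInt64OrZero(time), 1000))) AS latest
FROM seatunnel.dw_testdata;
```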
Open questions:
1. How do you convert a string-format timestamp into ClickHouse's DateTime type?
2. If the message field contains a JSON array, how do you split it into multiple rows before writing?
Replies from anyone who has solved these are welcome; thanks in advance.
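For the record, two directions that may be worth trying (sketches only, not verified against this exact pipeline):

```sql
-- Question 1: a millisecond epoch stored as String can be converted with
-- intDiv + toDateTime; parseDateTimeBestEffort handles common string formats.
SELECT toDateTime(intDiv(toUInt64OrZero('1656162060846'), 1000)) AS ts;

-- Question 2: on the ClickHouse side, arrayJoin fans an array out into rows;
-- combined with JSONExtractArrayRaw it can split a JSON-array string.
SELECT arrayJoin(JSONExtractArrayRaw('[{"a":1},{"a":2}]')) AS element;
```

Alternatively, the fan-out for question 2 could be done on the Spark side with an explode-style transform before the sink.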