Deserialization approaches when Flink consumes Kafka data


  • Flink's top-level deserialization interfaces
          • 1. The DeserializationSchema interface
          • 2. The KafkaDeserializationSchema interface
  • Commonly used deserialization classes when consuming Kafka data
          • 1. The SimpleStringSchema class
          • 2. The JSONKeyValueDeserializationSchema class
          • 3. A custom deserialization class implementing KafkaDeserializationSchema

When the Flink DataStream API consumes Kafka data, the records must be deserialized before any downstream logic can process them. Drawing on the author's development experience, this article briefly introduces several commonly used Kafka deserialization approaches.

Flink's top-level deserialization interfaces

1. The DeserializationSchema interface

The source code is as follows:

/**
 * The deserialization schema describes how to turn the byte messages delivered by certain
 * data sources (for example Apache Kafka) into data types (Java/Scala objects) that are
 * processed by Flink.
 *
 * <p>In addition, the DeserializationSchema describes the produced type ({@link #getProducedType()}),
 * which lets Flink create internal serializers and structures to handle the type.
 *
 * <p>Note: In most cases, one should start from {@link AbstractDeserializationSchema}, which
 * takes care of producing the return type information automatically.
 *
 * <p>A DeserializationSchema must be {@link Serializable} because its instances are often part of
 * an operator or transformation function.
 *
 * @param <T> The type created by the deserialization schema.
 */
@Public
public interface DeserializationSchema<T> extends Serializable, ResultTypeQueryable<T> {

	/**
	 * Deserializes the byte message.
	 *
	 * @param message The message, as a byte array.
	 *
	 * @return The deserialized message as an object (null if the message cannot be deserialized).
	 */
	T deserialize(byte[] message) throws IOException;

	/**
	 * Method to decide whether the element signals the end of the stream. If
	 * true is returned the element won't be emitted.
	 *
	 * @param nextElement The element to test for the end-of-stream signal.
	 * @return True, if the element signals end of stream, false otherwise.
	 */
	boolean isEndOfStream(T nextElement);
}

DeserializationSchema is the top-level interface; it defines how data from a given source is turned into a data type that Flink can process.
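
As the Javadoc above recommends, the easiest way to implement this interface is usually to extend AbstractDeserializationSchema, which produces the return type information automatically. A minimal sketch, assuming a Flink 1.x dependency (the class name and the upper-casing logic are purely illustrative):

import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Minimal DeserializationSchema: decode each raw byte message as UTF-8
// and upper-case it. isEndOfStream and getProducedType are inherited.
public class UpperCaseSchema extends AbstractDeserializationSchema<String> {

    @Override
    public String deserialize(byte[] message) throws IOException {
        return new String(message, StandardCharsets.UTF_8).toUpperCase();
    }
}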

2. The KafkaDeserializationSchema interface

/**
 * The deserialization schema describes how to turn the Kafka ConsumerRecords
 * into data types (Java/Scala objects) that are processed by Flink.
 *
 * @param <T> The type created by the keyed deserialization schema.
 */
@PublicEvolving
public interface KafkaDeserializationSchema<T> extends Serializable, ResultTypeQueryable<T> {

	/**
	 * Method to decide whether the element signals the end of the stream. If
	 * true is returned the element won't be emitted.
	 *
	 * @param nextElement The element to test for the end-of-stream signal.
	 *
	 * @return True, if the element signals end of stream, false otherwise.
	 */
	boolean isEndOfStream(T nextElement);

	/**
	 * Deserializes the Kafka record.
	 *
	 * @param record Kafka record to be deserialized.
	 *
	 * @return The deserialized message as an object (null if the message cannot be deserialized).
	 */
	T deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception;
}

Note that DeserializationSchema and KafkaDeserializationSchema are sibling interfaces: both extend Serializable and ResultTypeQueryable. They differ in the parameter of the deserialize method. KafkaDeserializationSchema was built specifically for deserializing Kafka records and exposes the whole ConsumerRecord, whereas DeserializationSchema can deserialize arbitrary binary data and is therefore more general. In practice, the difference determines which FlinkKafkaConsumer constructor overload you use, as sketched below.
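
A sketch of the wiring, assuming the universal Kafka connector of Flink 1.x; the topic name, group id and bootstrap address are placeholders, and the import paths (in particular for the shaded Jackson ObjectNode and for JSONKeyValueDeserializationSchema, discussed below) can differ between Flink versions:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.util.serialization.JSONKeyValueDeserializationSchema;

import java.util.Properties;

public class ConsumerWiring {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "demo-group");

        // DeserializationSchema<T> overload: the schema only ever sees the
        // record value as a byte array.
        FlinkKafkaConsumer<String> byValue =
                new FlinkKafkaConsumer<>("demo-topic", new SimpleStringSchema(), props);

        // KafkaDeserializationSchema<T> overload: the schema receives the whole
        // ConsumerRecord, including key, topic, partition and offset.
        FlinkKafkaConsumer<ObjectNode> byRecord = new FlinkKafkaConsumer<>(
                "demo-topic", new JSONKeyValueDeserializationSchema(true), props);
    }
}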

Commonly used deserialization classes when consuming Kafka data

1. The SimpleStringSchema class

/**
 * Very simple serialization schema for strings.
 *
 * <p>By default, the serializer uses "UTF-8" for string/byte conversion.
 */
public class SimpleStringSchema implements DeserializationSchema<String>, SerializationSchema<String>

This class implements both the DeserializationSchema and SerializationSchema interfaces, so it can serialize as well as deserialize, and it can be used both when consuming from and when producing to Kafka.
As the official Javadoc puts it, it is a "very simple serialization schema for strings": data is serialized and deserialized as String values, with UTF-8 as the default character encoding.
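
Because one instance serves both directions, the same schema can be handed to a Kafka producer unchanged. A sketch, assuming the FlinkKafkaProducer constructor that takes a topic, a SerializationSchema and producer properties (topic and address are placeholders):

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

import java.util.Properties;

public class ProducerWiring {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");

        // The same SimpleStringSchema acts as a SerializationSchema<String>
        // when writing records back to Kafka.
        FlinkKafkaProducer<String> producer =
                new FlinkKafkaProducer<>("out-topic", new SimpleStringSchema(), props);
    }
}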

2. The JSONKeyValueDeserializationSchema class

/**
 * DeserializationSchema that deserializes a JSON String into an ObjectNode.
 *
 * <p>Key fields can be accessed by calling objectNode.get("key").get(<name>).as(<type>)
 *
 * <p>Value fields can be accessed by calling objectNode.get("value").get(<name>).as(<type>)
 *
 * <p>Metadata fields can be accessed by calling objectNode.get("metadata").get(<name>).as(<type>) and include
 * the "offset" (long), "topic" (String) and "partition" (int).
 */
@PublicEvolving
public class JSONKeyValueDeserializationSchema implements KafkaDeserializationSchema<ObjectNode> {

	private static final long serialVersionUID = 1509391548173891955L;

	private final boolean includeMetadata;
	private ObjectMapper mapper;

	public JSONKeyValueDeserializationSchema(boolean includeMetadata) {
		this.includeMetadata = includeMetadata;
	}

	@Override
	public ObjectNode deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
		if (mapper == null) {
			mapper = new ObjectMapper();
		}
		ObjectNode node = mapper.createObjectNode();
		if (record.key() != null) {
			node.set("key", mapper.readValue(record.key(), JsonNode.class));
		}
		if (record.value() != null) {
			node.set("value", mapper.readValue(record.value(), JsonNode.class));
		}
		if (includeMetadata) {
			node.putObject("metadata")
				.put("offset", record.offset())
				.put("topic", record.topic())
				.put("partition", record.partition());
		}
		return node;
	}

	@Override
	public boolean isEndOfStream(ObjectNode nextElement) {
		return false;
	}

	@Override
	public TypeInformation<ObjectNode> getProducedType() {
		return getForClass(ObjectNode.class);
	}
}

The constructor of JSONKeyValueDeserializationSchema takes a boolean. When it is true, the deserialized data carries the Kafka metadata, including the offset, topic and partition; when it is false, the result contains only the key and value of the Kafka record.

The deserialize method returns an ObjectNode. As the source excerpt below shows, an ObjectNode is backed by a Map, so when reading Kafka data we can fetch the fields we want via objectNode.get("key"), objectNode.get("value").get(<field name>) and objectNode.get("metadata").get("offset").

public class ObjectNode extends ContainerNode<ObjectNode> {
    protected final Map<String, JsonNode> _children;

Note: when deserializing with this class, the data transported through Kafka must be JSON strings; otherwise deserialization fails.
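
Putting this together, a downstream operator can pull individual fields out of the ObjectNode, for example as below (a sketch: the "user" field name is made up, and real code should null-check fields that may be absent):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.flink.streaming.api.datastream.DataStream;

public class JsonFieldAccess {

    // `stream` is assumed to come from a consumer built with
    // new JSONKeyValueDeserializationSchema(true), so "metadata" is present.
    public static DataStream<String> extract(DataStream<ObjectNode> stream) {
        return stream.map(new MapFunction<ObjectNode, String>() {
            @Override
            public String map(ObjectNode node) {
                String user = node.get("value").get("user").asText();
                long offset = node.get("metadata").get("offset").asLong();
                return user + "@" + offset;
            }
        });
    }
}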

3. A custom deserialization class implementing KafkaDeserializationSchema

In the author's project, the key and the topic of each Kafka record were needed, and some of the data in Kafka was not JSON, so a custom implementation of KafkaDeserializationSchema was required to obtain the key and topic information.

/**
 * @program:
 * @description: Custom deserialization class that extracts the key, value and topic
 * @author:
 * @create:
 **/
public class CustomKeyValueDeserializationSchema implements KafkaDeserializationSchema<String> {

    @Override
    public String deserialize(ConsumerRecord<byte[], byte[]> record) {
        // Kafka records may have no key; guard against a NullPointerException.
        String key = record.key() == null ? "" : new String(record.key(), StandardCharsets.UTF_8);
        String value = record.value() == null ? "" : new String(record.value(), StandardCharsets.UTF_8);
        // Join the fields with "\t" so downstream logic can split them easily.
        // The return value is a String.
        return value + "\t" + key + "\t" + record.topic();
    }

    @Override
    public boolean isEndOfStream(String nextElement) {
        return false;
    }

    // Declare the produced type.
    @Override
    public TypeInformation<String> getProducedType() {
        return TypeInformation.of(String.class);
    }
}

A custom class gives you the flexibility to process the deserialized data and extract exactly the information you need. A sketch of plugging it into a job follows.
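
For completeness, here is a sketch of wiring the custom schema into a job (topic, group id and bootstrap address are placeholders):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class CustomSchemaJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "demo-group");

        DataStream<String> stream = env.addSource(new FlinkKafkaConsumer<>(
                "demo-topic", new CustomKeyValueDeserializationSchema(), props));

        // Each element is "value\tkey\ttopic"; split on "\t" downstream
        // to recover the individual fields.
        stream.print();

        env.execute("custom-deserialization-demo");
    }
}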
