Using Apache FlinkPDF version

Schema Registry with Flink

When Kafka is chosen as source and sink for your application, you can use Cloudera Schema Registry to register and retrieve schema information of the different Kafka topics. You must add Schema Registry dependency to your project and add the appropriate schema object to your Kafka topics.

There are several reasons why you should prefer the Schema Registry instead of custom serializer implementations on both consumer and producer side:
  • Offers automatic and efficient serialization/deserialization for Avro, JSON and basic types
  • Guarantees that only compatible data can be written to a given topic (assuming that every producer uses the registry)
  • Supports safe schema evolution on both producer and consumer side
  • Offers visibility to developers on the data types and they can track schema evolution for the different Kafka topics
Add the following Maven dependency or equivalent to use the schema registry integration in your project:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-cloudera-registry</artifactId>
    <version>${flink.version}</version>
</dependency>
The schema registry can be plugged directly into the KafkaSource and KafkaSink using the appropriate schema:
org.apache.flink.formats.registry.cloudera.avro.ClouderaRegistryAvroKafkaRecordSerializationSchema
org.apache.flink.formats.registry.cloudera.avro.ClouderaRegistryAvroKafkaDeserializationSchema

See the Apache Flink documentation for Kafka consumer and producer basics.

Currently, the following data types are supported for producers and consumers:
  • Avro Specific Record types
  • Avro Generic Records
  • JSON data types
  • Basic Java Data types: byte[], Byte, Integer, Short, Double, Float, Long, String, Boolean

To get started with Avro schemas and generated Java objects, see the Apache Avro documentation.

You need to include every SSL configuration into a Map that is passed to the Schema Registry configuration.
Map<String, String> sslClientConfig = new HashMap<>();
sslClientConfig.put(K_TRUSTSTORE_PATH, params.get(K_SCHEMA_REG_SSL_CLIENT_KEY + "." + K_TRUSTSTORE_PATH));
sslClientConfig.put(K_TRUSTSTORE_PASSWORD, params.get(K_SCHEMA_REG_SSL_CLIENT_KEY + "." + K_TRUSTSTORE_PASSWORD));

Map<String, Object> schemaRegistryConf = new HashMap<>();
schemaRegistryConf.put(K_SCHEMA_REG_URL, params.get(K_SCHEMA_REG_URL));
schemaRegistryConf.put(K_SCHEMA_REG_SSL_CLIENT_KEY, sslClientConfig);
For Kerberos authentication, Flink can maintain the authentication and ticket renewal automatically. You can define an additional RegistryClient property to the security.kerberos.login.contexts parameter in Cloudera Manager.
security.kerberos.login.contexts=Client,KafkaClient,RegistryClient

You can construct the schema serialization with the ClouderaRegistryAvroKafkaRecordSerializationSchema and ClouderaRegistryJsonKafkaRecordSerializationSchema objects for FlinkKafkaProducer based on the required data format. You must set the topic configuration and RegistryAddress parameter in the object.

Required settings:
  • Topic configuration when creating the builder, which can be static or dynamic (extracted from the data)
  • RegistryAddress parameter on the builder to establish the connection
Optional settings:
  • Arbitrary SchemaRegistry client configuration using the setConfig method
  • Key configuration for the produced Kafka messages
    • Specifying a KeySelector function that extracts the key from each record
    • Using a Tuple2 stream for (key, value) pairs directly
  • Security configuration
        KafkaSerializationSchema<ItemTransaction> schema = 
                ClouderaRegistryAvroKafkaRecordSerializationSchema.<ItemTransaction>builder(topic)
                         .setRegistryAddress(registryAddress)
                         .setKey(ItemTransaction::getItemId)
                         .build();
        KafkaSink<ItemTransaction> kafkaSink =
                KafkaSink.<Row>builder()
                        .setKafkaProducerConfig(kafkaProps)
                        .setRecordSerializer(schema)
                        .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
                        .build();

You can construct the schema deserialization with the ClouderaRegistryAvroKafkaRecordDeserializationSchema and ClouderaRegistryJsonKafkaRecordDeserializationSchema objects for FlinkKafkaProducer to read the messages in the same schema from the FlinkKafkaProducer. You must set the class or schema of the input messages and the RegistryAddress parameter in the object.

When reading messages (and keys), you always have to specify the expected Class<T> or record Schema of the input records. This way Flink can do any necessary conversion between the raw data received from Kafka and the expected output of the deserialization.

Required settings:
  • Class or Schema of the input messages depending on the data type
  • RegistryAddress parameter on the builder to establish the connection
Optional settings:
  • Arbitrary SchemaRegistry client configuration using the setConfig method
  • Key configuration for the consumed Kafka messages (only to be specified if reading the keys into a key or value stream is necessary)
  • Security configuration
        KafkaDeserializationSchema<ItemTransaction> schema = 
                        ClouderaRegistryAvroKafkaDeserializationSchema.builder(ItemTransaction.class)
                                .setRegistryAddress(registryAddress)
                                .build();
        KafkaSource<String> transactionSource = KafkaSource.<String>builder()
                .setBootstrapServers("<your_broker_url>")
                .setTopics(inputTopic)
                .setGroupId(groupId)
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(schema)
                .build();