Schema registries in Kafka and how they’re used

Short introduction: if you use Kafka you’re most likely dealing with various kind of messages. You store the schema of each message in a schema-registry. But how does Kafka use a schema-registry? Here’s my short introduction.
If you want to know why you’d need a schema-registry, what a schema is, or what Kafka is, follow up the links or use a search-engine or contact me. You’d most likely get to this article because you want to know the details, so I’ll just skip to them.

Kafka and schemas?

Any application that uses Kafka either produces messages or consumes them. Producing means writing messages (also sometimes called records) to a kafka topic. Consuming means reading messages from a a kafka topic. Kafka is pretty liberal on what the content is of those messages. Basically it can be any textual or binary information. But to make it more manageable almost always you’d stick with Json, Avro or Protobuf.

How does kafka interact with the schema serializer?

To understand the schema registry, knowing more about kafka serde, which consists of serializer/deserializer.

The serializer translates e.g. transforms messages from your application to a kafka message:

  • Get the schema from the message based on some key (usually type of message, but can be topic name too)
  • Check if schema is some local in-memory cache
  • if not, check the configured schemaregistry using a http call
  • if it’s present use it and put it in the cache
  • if not present, add it to the registry
  • Create kafka record (e.g. message) that includes the original message. If the original message contains the full schema, remove it and replace it by reference by key. Or add the reference
  • Write the message it to the Kafka topic

The deserializer basically works the other way around

  • Read message
  • Get the schema-key
  • Read it from the in-memory cache based on key, if present use the schema
  • If not present, call the configured schemaregistry using http. If present, store it in the in-memory cache and use it
  • Use the schema to read and decode the kafka record
  • If no schema is present, error!

The source code of the serializer of confluent can be interesting to learn more about it, see for example the one of confluent or the one of apicurio .