Schema evolution and compatibility
This chapter examines how Pulsar schema evolves and what Pulsar schema compatibility check strategies are.
Pulsar schema is defined in a data structure called .
Each SchemaInfo
stored with a topic has a version. The version is used to manage the schema changes happening within a topic.
The message produced with SchemaInfo
is tagged with a schema version. When a message is consumed by a Pulsar client, the Pulsar client can use the schema version to retrieve the corresponding SchemaInfo
and use the correct schema information to deserialize data.
Schemas store the details of attributes and types. To satisfy new business requirements, you need to update schemas inevitably over time, which is called schema evolution.
Any schema changes affect downstream consumers. Schema evolution ensures that the downstream consumers can seamlessly handle data encoded with both old schemas and new schemas.
How Pulsar schema should evolve?
The answer is Pulsar schema compatibility check strategy. It determines how schema compares old schemas with new schemas in topics.
For more information, see .
How does Pulsar support schema evolution?
When a producer/consumer/reader connects to a broker, the broker deploys the schema compatibility checker configured by
schemaRegistryCompatibilityCheckers
to enforce schema compatibility check.The schema compatibility checker is one instance per schema type.
Currently, Avro and JSON have their own compatibility checkers, while all the other schema types share the default compatibility checker which disables schema evolution.
The producer/consumer/reader sends its client
SchemaInfo
to the broker.The broker knows the schema type and locates the schema compatibility checker for that type.
The broker uses the checker to check if the
SchemaInfo
is compatible with the latest schema of the topic by applying its compatibility check strategy.Currently, the compatibility check strategy is configured at the namespace level and applied to all the topics within that namespace.
Pulsar has 8 schema compatibility check strategies, which are summarized in the following table.
Suppose that you have a topic containing three schemas (V1, V2, and V3), V1 is the oldest and V3 is the latest:
Compatibility check strategy | Definition | Note |
---|---|---|
ALWAYS_COMPATIBLE | Disable schema compatibility check. | None |
ALWAYS_INCOMPATIBLE | Disable schema evolution, that is, any schema change is rejected. | FULL . |
Example
-
In some situations, an application needs to store events of several different types in the same Pulsar topic.
In particular, when developing a data model in an
Event Sourcing
style, you might have several kinds of events that affect the state of an entity.For example, for a user entity, there are
userCreated
,userAddressChanged
anduserEnquiryReceived
events. The application requires that those events are always read in the same order.Consequently, those events need to go in the same Pulsar partition to maintain order. This application can use
ALWAYS_COMPATIBLE
to allow different kinds of events co-exist in the same topic.
BACKWARD and BACKWARD_TRANSITIVE
Suppose that you have a topic containing three schemas (V1, V2, and V3), V1 is the oldest and V3 is the latest:
Example
Example 1
Remove a field.
A consumer constructed to process events without one field can process events written with the old schema containing the field, and the consumer will ignore that field.
Example 2
You want to load all Pulsar data into a Hive data warehouse and run SQL queries against the data.
Same SQL queries must continue to work even the data is changed. To support it, you can evolve the schemas using the
BACKWARD
strategy.
FORWARD and FORWARD_TRANSITIVE
Compatibility check strategy | Definition | Description |
---|---|---|
FORWARD | Consumers using the last schema can process data written by producers using a new schema, even though they may not be able to use the full capabilities of the new schema. | The consumers using the schema V3 or V2 can process data written by producers using the schema V3. |
FORWARD_TRANSITIVE | Consumers using all previous schemas can process data written by producers using a new schema. | The consumers using the schema V3, V2, or V1 can process data written by producers using the schema V3. |
Example
Example 1
Add a field.
In most data formats, consumers written to process events without new fields can continue doing so even when they receive new events containing new fields.
Example 2
If a consumer has an application logic tied to a full version of a schema, the application logic may not be updated instantly when the schema evolves.
In this case, you need to project data with a new schema onto an old schema that the application understands.
Consequently, you can evolve the schemas using the
FORWARD
strategy to ensure that the old schema can process data encoded with the new schema.
Suppose that you have a topic containing three schemas (V1, V2, and V3), V1 is the oldest and V3 is the latest:
Example
In some data formats, for example, Avro, you can define fields with default values. Consequently, adding or removing a field with a default value is a fully compatible change.
tip
You can set schema compatibility check strategy at the topic, namespace or broker level. For how to set the strategy, see here.
When a producer or a consumer tries to connect to a topic, a broker performs some checks to verify a schema.
Producer
When a producer tries to connect to a topic (suppose ignore the schema auto creation), a broker does the following checks:
Check if the schema carried by the producer exists in the schema registry or not.
If the schema is already registered, then the producer is connected to a broker and produce messages with that schema.
If the schema is not registered, then Pulsar verifies if the schema is allowed to be registered based on the configured compatibility check strategy.
Consumer
When a consumer tries to connect to a topic, a broker checks if a carried schema is compatible with a registered schema based on the configured schema compatibility check strategy.
Compatibility check strategy | Check logic |
---|---|
ALWAYS_COMPATIBLE | All pass |
ALWAYS_INCOMPATIBLE | No pass |
BACKWARD | Can read the last schema |
BACKWARD_TRANSITIVE | Can read all schemas |
FORWARD | Can read the last schema |
FORWARD_TRANSITIVE | Can read the last schema |
FULL | Can read the last schema |
FULL_TRANSITIVE | Can read all schemas |
The order of upgrading client applications is determined by the compatibility check strategy.