Event Design for Streaming Systems: A Primer

Introduction

Event streaming architectures are fundamentally different from traditional request-response systems. Instead of services calling each other directly, teams publish streams of immutable facts about what has happened in their domain. Other teams subscribe to these streams and react independently. This creates loose coupling: the publisher doesn’t know or care who consumes their events, and consumers don’t need to query the publisher for context.

The key challenge is managing these public event streams as contracts between teams. When one team changes their event structure, how do we prevent breaking every downstream consumer? When multiple teams are evolving their schemas simultaneously, how do we ensure consistency? This is where schema registries and structured event design become critical.

Events are Facts, Commands are Actions

The foundational principle of event streaming is that events in the log record immutable facts about what happened, expressed in the past tense. An event like OrderPlaced describes a business fact: at this timestamp, this customer placed this order with these items. Once written, this fact never changes. It’s part of your organization’s permanent history.

Commands, by contrast, are imperative requests for something to happen. They live outside the event log. When a consumer reads an OrderPlaced event, it might emit a ReserveInventory command to the warehouse system, or a ChargeCustomer command to the payment processor. These commands are ephemeral messages between specific services, not part of the shared historical record.

This distinction matters because it defines the boundaries of your system. The event log is a shared, append-only ledger that multiple teams depend on. Commands are private conversations between services that can be retried, failed, or changed without affecting the broader organization. You can replay events to rebuild state; you cannot (and should not) replay commands to re-execute actions like charging credit cards.

Consider a concrete example. When a user registers, the registration service emits a UserRegistered event containing the user’s ID, email, name, and registration timestamp. The email service consumes this event and generates a SendWelcomeEmail command with the user’s email and name. The analytics service consumes the same event and records metrics. The onboarding service consumes it and creates initial tutorial tasks. Each consumer decides independently how to react. The event stream is the shared truth; commands are derived actions.
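
As a sketch (using illustrative domain types like UserId, EmailAddress, Text, and UTCTime, in the same spirit as the examples later in this article), the event and one command derived from it might look like:

-- Illustrative types; UserId, EmailAddress, Text, and UTCTime stand in for
-- your domain and base-library types.
data UserRegistered = UserRegistered
  { userId       :: UserId
  , email        :: EmailAddress
  , name         :: Text
  , registeredAt :: UTCTime
  }

-- A command the email service derives from the event. Commands like this are
-- ephemeral and never written to the shared log.
data SendWelcomeEmail = SendWelcomeEmail
  { recipient     :: EmailAddress
  , recipientName :: Text
  }

welcomeEmailFor :: UserRegistered -> SendWelcomeEmail
welcomeEmailFor evt = SendWelcomeEmail (email evt) (name evt)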

What is a Schema Registry and Why Do We Need It?

A schema registry is a standalone service that stores, versions, and validates the schemas for your event streams. When you’re working in a statically-typed language, you might wonder why you need external schema management at all (doesn’t the type system handle this?). The answer lies in the distributed, polyglot nature of event streaming architectures.

Your service might produce events that services in a different language consume, or vice versa. Even within a single language, you’ll often have multiple services compiled separately, deployed independently, at different times. The schema registry provides a language-neutral, centralized source of truth for what your events look like. It acts as a contract enforcement mechanism: when Team A tries to deploy a producer that changes the structure of OrderPlaced events, the registry checks whether this change is compatible with the schemas that Team B’s consumer expects.

Without a schema registry, you have several painful options. You could manually coordinate deployments: “Everyone freeze, we’re upgrading the order schema this Tuesday at 3pm.” This doesn’t scale. You could embed version numbers in your event types and handle multiple versions in every consumer: OrderPlacedV1, OrderPlacedV2, OrderPlacedV3. This pushes version-handling complexity into every consumer and grows worse with each revision. You could just break things and fix them when consumers start failing in production. This is obviously bad.

The schema registry solves this by shifting validation to CI/deploy time rather than runtime. When you change a schema, the registry checks compatibility before you deploy. If you try to change a field type from String to Int, the registry rejects the schema and your build fails. This prevents broken contracts from reaching production. When you add an optional field, the registry accepts the schema, assigns it a new version number, and allows deployment to proceed. Consumers using the old schema continue working; consumers that have upgraded see the new field.

How Schema Registry Implementations Work

The schema registry operates as a versioned key-value store with compatibility checking. Each event type maps to a subject in the registry (technically, a topic’s key and value schemas each get their own subject, conventionally named along the lines of orders-key and orders-value). A subject contains an ordered list of schema versions. When you register a new schema, the registry checks it against the existing versions according to configurable compatibility rules, then assigns it a unique schema ID and version number.
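
You can picture this data model with a few illustrative types (a sketch of the concepts, not any particular registry’s API):

-- Illustrative only: a registry maps subjects to ordered lists of versions,
-- and every registered schema also gets a globally unique ID.
newtype Subject  = Subject Text  deriving (Eq, Ord)  -- e.g. "orders-value"
newtype SchemaId = SchemaId Int32                    -- unique across all subjects

data RegisteredSchema = RegisteredSchema
  { schemaId   :: SchemaId
  , version    :: Int              -- 1, 2, 3, ... within the subject
  , definition :: Text             -- the Avro/Protobuf/JSON Schema source
  }

type Registry = Map Subject [RegisteredSchema]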

At runtime, producers and consumers interact with the registry through a simple protocol. When a producer serializes an event, it first registers or looks up the schema in the registry (this is usually cached after the first call). The registry returns a schema ID (a 32-bit integer). The producer then serializes the event data according to the schema (using Avro, Protobuf, or JSON Schema) and prepends the schema ID to the binary payload. This is what actually gets written to Kafka: a 4-byte schema ID followed by the encoded data.

When a consumer reads the message, it extracts the schema ID from the first 4 bytes, fetches the corresponding schema from the registry (again, cached), and uses that schema to deserialize the binary payload back into a structured data type. The beauty of this approach is that the schema ID is tiny overhead, the schema itself is shared and cached, and the registry ensures that consumers always know how to interpret the data they’re reading.
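
Here is a minimal sketch of that framing, following the simplified layout described above (Confluent’s actual wire format also prepends a single magic byte before the ID, but the idea is identical):

import qualified Data.ByteString as BS
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int32)

-- Producer side: 4-byte big-endian schema ID, then the schema-encoded payload.
frameMessage :: Int32 -> BS.ByteString -> BS.ByteString
frameMessage schemaId payload =
  BL.toStrict (B.toLazyByteString (B.int32BE schemaId <> B.byteString payload))

-- Consumer side: split the schema ID back off so we know which schema to
-- fetch from the registry before deserializing the rest.
unframeMessage :: BS.ByteString -> Maybe (Int32, BS.ByteString)
unframeMessage bs
  | BS.length bs < 4 = Nothing
  | otherwise =
      let (idBytes, payload) = BS.splitAt 4 bs
          schemaId = BS.foldl' (\acc w -> acc * 256 + fromIntegral w) 0 idBytes
      in Just (schemaId, payload)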

The registry supports several compatibility modes, but the default and most useful is backward compatibility. This means a new schema must be able to read data written with the old schema. Adding optional fields is backward compatible (old data simply lacks those fields, and the deserializer fills in defaults). Removing fields is backward compatible (new code ignores fields that old data contains). Changing field types is not backward compatible (you can’t safely interpret an old integer as a new string). The registry enforces these rules at schema registration time, preventing incompatible changes from being deployed.

Include Complete Context: The No-Lookups Principle

A critical but often overlooked principle is that events should contain all the data that consumers need to react. This is counterintuitive if you’re coming from a normalized database background, where you store an ID and join to get details. In event streaming, denormalization is your friend.

Consider an OrderPlaced event. You could emit just an order ID and expect consumers to query the order service for details. But this creates several problems. First, it couples consumers to the order service’s API (they need to know how to call it, handle its failures, and interpret its response format). Second, it introduces latency: every consumer must make a network call to process each event. Third, it creates a temporal coupling: if the order details change between when the event was emitted and when the consumer processes it, the consumer sees inconsistent state. Fourth, it makes the system fragile: if the order service is down, consumers can’t process events.

Instead, enrich the event at write time. Include the order ID, user ID, user email, item SKUs with quantities and prices, shipping address, total amount, and timestamp. This makes events self-contained. The fulfillment service can extract the shipping address and generate warehouse commands without calling the order service. The email service can send order confirmations using the email address in the event. The analytics service can record revenue using the total. The fraud detection service can analyze the complete order details. All of this happens in parallel, without additional API calls, without failures cascading.

Think of events as rich data structures that carry all relevant context:

data OrderPlaced = OrderPlaced
  { orderId :: OrderId
  , userId :: UserId
  , userEmail :: EmailAddress
  , items :: NonEmpty OrderItem
  , total :: Money
  , shippingAddress :: Address
  , placedAt :: UTCTime
  } deriving (Generic, ToJSON, FromJSON)

Not just data OrderPlaced = OrderPlaced { orderId :: OrderId }. The type should represent everything downstream consumers need, making them autonomous.

Schema Design and Naming

Event names should be past-tense verbs that describe business facts: OrderPlaced, PaymentCompleted, UserRegistered, ItemShipped. Some organizations standardize on present tense instead, which is fine as long as the convention is applied across the entire system. The key is that the name should unambiguously describe what happened, not what should happen next.

Each event type should have its own schema. Avoid creating generic events with type discriminators like { eventType: "OrderPlaced", payload: {...} }. This bundles unrelated events into a single schema, making versioning painful. When you need to add a new order event type or change the structure of order placement, you’re forced to version the entire generic schema, affecting consumers who only care about order cancellations. Keep schemas focused and independent.

Schema Registries Beyond Event Streams

The schema registry pattern solves a fundamental problem in distributed systems: how do you maintain type-safe contracts between services that evolve independently? While we’ve been discussing this in the context of event streaming, the problem is universal. Every service-to-service API needs schema management, whether it’s REST endpoints, RPC calls, or event streams.

Unfortunately, the current schema registry ecosystem is poorly adapted to traditional service-oriented architectures. The Confluent Schema Registry and similar tools were built for Kafka. They work brilliantly for event streams where schemas are embedded in message payloads. But when you’re making synchronous service calls, the impedance mismatch is obvious. You’re forced to bolt schema validation onto your API gateway or build custom tooling to check compatibility at deploy time. The integration feels awkward because these tools weren’t designed for request-response patterns.

GraphQL, whatever its other tradeoffs, got this right. GraphQL has schema-first development built into its DNA. The schema is the primary artifact. Every GraphQL service publishes a schema that describes exactly what queries it supports, what types it returns, and how those types relate to each other. Clients can introspect the schema at runtime, generate strongly-typed client code, and detect breaking changes before deployment. The tooling ecosystem is mature: schema linters, compatibility checkers, federation systems that compose multiple services into a unified graph.

The key insight GraphQL captured is that the schema should be queryable and composable. A client can ask “what can I do here?” and get a machine-readable answer. Multiple services can publish schemas that get merged into a single namespace with clear ownership boundaries. Breaking changes are detected automatically because the schema is versioned and validated on every deployment. This is exactly what Kafka’s schema registry provides for events, applied instead to synchronous APIs.

You might not like GraphQL’s query language, or its N+1 problem, or the way it encourages clients to request arbitrary fields. Those are valid concerns. But the schema management story is excellent. If you’re building service-oriented architectures and you’re not using GraphQL, you need to solve the same problems GraphQL solved. You need a schema registry for your APIs, you need compatibility checking, you need code generation, and you need introspection. Most organizations end up building these capabilities themselves in fragmented ways across different teams.

The lesson is that schema management is a first-class concern for any distributed system. The event streaming world adopted schema registries early because the problem was obvious: you’re publishing data into a shared log that multiple teams consume, and you need a contract. Synchronous APIs have the same problem, but it’s less visible because there’s typically one consumer per endpoint, and you can coordinate deployments more easily. As systems scale, that coordination becomes impossible, and you need the same machinery.

Schema-First Development and Code Generation

A critical architectural decision is whether to generate schemas from your application code or generate application code from schemas. For distributed systems with multiple services in potentially different languages, schema-first development with code generation is strongly preferred.

In a schema-first approach, teams define events using the schema language directly (writing Avro IDL, Protobuf definitions, or JSON Schema documents). These schema definitions live in a shared repository that all teams can access. Each service then generates language-specific types from these schemas using codegen tools. A service written in TypeScript generates TypeScript interfaces, a service in Rust generates Rust structs, and a service in your language of choice generates appropriate types, all from the same canonical schema definition.

This approach has several advantages. First, it avoids privileging any one service over others. If Team A’s producer generates schemas from their internal types and publishes them, consumers must adapt to Team A’s modeling choices, language idioms, and type system constraints. The schema becomes an artifact of Team A’s implementation rather than a neutral contract. With schema-first, the schema is the primary artifact that everyone agrees on, and no service’s internal representation is special.

Second, schema-first development produces better tooling support. Schema languages like Avro and Protobuf have mature ecosystems with validators, linters, compatibility checkers, and documentation generators. When the schema is your source of truth, you can leverage these tools in your CI pipeline. Schemas can be reviewed, versioned, and validated independently of any service deployment. You catch incompatibilities before any code is written.

Third, code generation ensures that all services interpret events identically. When you manually write serialization code or derive schemas from types, subtle differences can creep in. One service might represent optional fields as nullable references while another uses a different convention. Generated code eliminates this category of bugs (the generator guarantees that the serialization format matches the schema exactly).

Fourth, when schemas change, codegen makes updates mechanical. If a new optional field is added to OrderPlaced, you regenerate your types, and the compiler immediately tells you everywhere that needs updating. With hand-written serialization code, you might miss a field or incorrectly handle the optional semantics. The generator handles the boilerplate correctly every time.

The workflow looks like this: schemas are defined in a shared repository using Avro IDL or Protobuf syntax. When schemas change, they’re validated against the schema registry for compatibility. If validation passes, the schemas are published to a package repository or artifact store. Each service declares a dependency on specific schema versions and runs a codegen step during its build process to produce language-specific types. Application code imports these generated types and uses them for event production and consumption.

Here’s what an Avro IDL definition might look like:

record OrderPlaced {
  string orderId;
  string userId;
  string userEmail;
  array<OrderItem> items;
  bytes total;  // Money as decimal encoded bytes
  Address shippingAddress;
  timestamp_ms placedAt;
}

From this single definition, you generate types in every language that needs them. No service’s internal representation is privileged; the schema is the shared contract. Changes to the schema require explicit updates to the schema file and validation through the registry, creating a deliberate, reviewable change process rather than implicit changes buried in application code.

This doesn’t mean you can’t also maintain rich domain types in your application. You might generate a flat OrderPlacedEvent type from the schema but internally work with a richer Order type that has behavior and invariants. The generated type is your serialization boundary (the contract with other services) while your domain types are internal. You write small adapter functions to convert between them. This separation is healthy: it keeps your domain model flexible while maintaining stable contracts externally.
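
A sketch of what that boundary can look like, where OrderPlacedEvent is the codegen output, Order is the internal domain type, and every helper name is illustrative:

-- Adapters between the generated serialization type and the domain type.
fromEvent :: OrderPlacedEvent -> Either ValidationError Order
fromEvent = validateOrderPlaced   -- hypothetical validation helper

toEvent :: Order -> OrderPlacedEvent
toEvent = renderOrderPlaced       -- hypothetical flattening helper

-- A consumer deserializes into the generated type, then immediately crosses
-- into the domain model and works with the richer Order type from there on.
handleOrderPlacedEvent :: MonadDB m => OrderPlacedEvent -> m ()
handleOrderPlacedEvent raw =
  case fromEvent raw of
    Left err    -> recordInvalidEvent err   -- hypothetical dead-letter handling
    Right order -> DB.insertOrder order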

Schema Evolution and Versioning

Schemas will evolve. Requirements change, you discover you need additional fields, you realize you modeled something incorrectly. The schema registry’s job is to make evolution safe by enforcing backward compatibility.

Backward compatibility means new code can read old data. This is the critical property because in a distributed system, you can’t upgrade everything atomically. A consumer built against schema version 2 will still encounter events written with schema version 1: older records already in the log, plus records from producers that haven’t upgraded yet. Backward compatibility ensures this works: the new reader decodes the old data, filling in defaults for any fields those events lack. (The mirror image, old readers meeting data written with a newer schema, is forward compatibility, which the envelope pattern later in this article addresses.)

Compatible changes include adding optional fields, removing fields, and adding default values. If you have an OrderPlaced event and add an optional taxAmount :: Maybe Money field, consumers still on the old schema simply ignore the new field, while upgraded consumers see Nothing when deserializing events written before the change and a value once producers start populating it. Both versions coexist peacefully.
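
In terms of the generated Haskell types, such a change is just a new optional field (a sketch; in a schema-first workflow these types come out of codegen):

-- The generated OrderPlaced type after the compatible change (other fields
-- elided). Events written before the change deserialize with taxAmount = Nothing.
data OrderPlaced = OrderPlaced
  { orderId   :: OrderId
  , total     :: Money
  , taxAmount :: Maybe Money   -- added in schema version 2
  }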

Incompatible changes include changing field types (turning total :: Decimal into total :: Integer), renaming fields, adding required fields, or changing the semantic meaning of a field. If you realize your temperature field was in Celsius but you want Fahrenheit, that’s a semantic breaking change even though the type stays the same. Downstream consumers have no way to detect this. Don’t do it. Add a new field called temperatureFahrenheit instead.

When you must make breaking changes, you have two options. The preferred approach is to create a new event type with a new schema: OrderPlacedV2 or better yet, give it a semantically meaningful name that reflects what changed. This allows old and new schemas to coexist indefinitely. Consumers can migrate on their own schedule. The second option is coordinated deployment: upgrade all producers and consumers simultaneously. This is painful and doesn’t scale, so avoid it when possible.

The schema registry integrates with your CI/CD pipeline to catch incompatible changes before deployment. When you modify a schema, a build-time validation step checks compatibility against the existing registered versions. If the change is incompatible, the build fails and you’re forced to reconsider your approach. This shifts errors left, catching contract violations at development time rather than in production.

Consumer Patterns

Consumers fall into two broad categories: those that execute side effects directly and those that re-emit commands.

Direct side effect consumers read an event and immediately perform an action. An analytics consumer might read OrderPlaced and write directly to a data warehouse. A cache invalidation consumer might read ProductUpdated and immediately evict cache entries. A replication consumer might read events and write them to a different data store. This pattern works well when the action is simple, synchronous, and entirely owned by your team.

Command-emitting consumers read an event and transform it into commands relevant to their domain. A fulfillment consumer reads OrderPlaced and emits warehouse-specific commands like AllocateInventory and GeneratePickList. An email consumer reads UserRegistered and emits SendWelcomeEmail to an email service. A fraud consumer reads PaymentCompleted and might emit ReviewTransaction to a manual review queue. This pattern works well when the action requires orchestration, involves multiple steps, or needs to be decoupled from the event processing itself.

Both patterns are valid and often coexist in the same system. Choose based on your architecture and requirements. The important part is that consumers can make this choice independently without affecting the event producer or other consumers.

Here’s an example of how you might structure a consumer that does both:

handleOrderPlaced :: (MonadDB m, MonadKafka m) => OrderPlaced -> m ()
handleOrderPlaced event = do
  -- Direct side effect: write to database
  DB.insertOrder (orderId event) (items event)
  
  -- Or emit command: publish to different topic
  Kafka.produce "warehouse-commands" (AllocateInventory $ items event)

The key is that the event itself contains everything you need to perform either action without additional lookups.

Avro versus JSON Schema

The two most common schema formats in the Kafka ecosystem are Apache Avro and JSON Schema. Protobuf is also supported but less common in this context.

Avro uses a compact binary encoding with schemas defined separately from the data. The schema defines field names and types, and the binary format omits field names entirely, relying on field order. This makes Avro very space-efficient. Avro also has well-defined schema evolution semantics: the specification clearly states what changes are compatible and how to resolve differences between reader and writer schemas. The tooling is mature, and the Kafka ecosystem has strong support for Avro. The downside is that Avro data is opaque (you can’t inspect it without the schema, and debugging requires additional tools).

JSON Schema uses human-readable JSON with schemas that validate JSON documents. The schema defines structure, types, and constraints. JSON Schema’s advantage is familiarity and debuggability: you can read the data directly, and any JSON tool can work with it. The disadvantage is verbosity (field names are repeated in every message) and less mature schema evolution semantics. Compatibility checking is more manual and error-prone with JSON Schema than with Avro.

For production systems, Avro is generally the better choice. The binary efficiency matters at scale, the evolution semantics are more robust, and the ecosystem tooling is more mature. JSON Schema is reasonable if you’re just getting started or if human readability is critical for your use case (perhaps you’re building developer tools or debugging workflows). But for inter-service communication at scale, Avro’s advantages outweigh its complexity.

Practical Implementation

Your services will interact with the schema registry through client libraries. In a schema-first workflow, you start with schemas defined in a shared repository, generate language-specific types during your build process, and use those generated types in your application code.

At build time, integrate schema validation into your CI pipeline. When a schema changes in the shared repository, the CI process validates it against the schema registry for compatibility. If the schema is incompatible with existing versions, the build fails. If it’s compatible, the schema is registered and a new version is assigned. Services that depend on these schemas then regenerate their types from the updated schema definitions.

At runtime, producers use the registry client to look up the schema (typically cached after the first call), obtain a schema ID, serialize the event data using the generated serialization code, and write the schema ID plus serialized data to Kafka. Consumers read messages from Kafka, extract the schema ID, fetch the corresponding schema from the registry (cached), and use the generated deserialization code to reconstruct the data into typed objects.
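
The “cached after the first call” behaviour is easy to picture; here is a minimal sketch using an in-memory map, with fetchSchemaIdRemote standing in for whatever call your registry client actually makes:

import qualified Data.Map.Strict as Map
import Data.IORef (IORef, readIORef, atomicModifyIORef')
import Data.Int (Int32)
import Data.Text (Text)

-- Cache of subject name -> schema ID, so the registry is only consulted on
-- the first serialization for each subject (create with newIORef Map.empty).
type SchemaCache = IORef (Map.Map Text Int32)

-- Hypothetical registry call: look up (or register) the schema for a subject.
fetchSchemaIdRemote :: Text -> IO Int32
fetchSchemaIdRemote = error "wire this to your schema registry client"

getSchemaId :: SchemaCache -> Text -> IO Int32
getSchemaId cache subject = do
  cached <- Map.lookup subject <$> readIORef cache
  case cached of
    Just sid -> pure sid
    Nothing  -> do
      sid <- fetchSchemaIdRemote subject
      atomicModifyIORef' cache (\m -> (Map.insert subject sid m, ()))
      pure sid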

The beauty of this approach is that it’s transparent once set up. Your application code works with normal typed objects. The generated serialization layer handles the schema registry interaction automatically. The registry ensures compatibility. And all services, regardless of language, work from the same canonical schema definitions.

Using Envelopes for Forward Compatibility

Backward compatibility ensures new consumers can read old events. But there’s a dual concern: forward compatibility, where old consumers need to handle new event types they don’t understand. This becomes critical when multiple related event types share a topic, and producers want to introduce new event types without breaking existing consumers.

The solution is to use Kafka message headers as an envelope that describes the payload. Headers are key-value pairs of metadata that Kafka attaches to each message independently of the message body. By including an event type identifier in the headers, consumers can examine the envelope before deserializing the payload and decide whether to process or skip the message.

Consider a payment processing system where a topic contains multiple payment-related events: PaymentInitiated, PaymentCompleted, PaymentFailed, PaymentRefunded. Each event has its own schema registered in the schema registry. When you want to add a new event type (say, PaymentDisputed), older consumers that don’t know about disputes could break if they try to deserialize it. With header-based envelopes, they can safely ignore it.

The pattern works like this: when a producer writes an event, it includes a header like event-type: PaymentDisputed or event-schema-id: 12345 alongside the message payload. Consumers read the headers first, check the event type, and make a decision:

-- Each payment event is its own type, with its own schema in the registry
data PaymentInitiated = PaymentInitiated { ... } deriving (Generic, FromJSON)
data PaymentCompleted = PaymentCompleted { ... } deriving (Generic, FromJSON)
data PaymentFailed    = PaymentFailed    { ... } deriving (Generic, FromJSON)
data PaymentRefunded  = PaymentRefunded  { ... } deriving (Generic, FromJSON)

handlePaymentMessage :: ConsumerRecord -> m ()
handlePaymentMessage record = do
  let eventType = lookup "event-type" (headers record)
  case eventType of
    Just "PaymentInitiated" -> 
      deserializeAndHandle @PaymentInitiated (value record)
    Just "PaymentCompleted" -> 
      deserializeAndHandle @PaymentCompleted (value record)
    Just "PaymentFailed" -> 
      deserializeAndHandle @PaymentFailed (value record)
    Just "PaymentRefunded" -> 
      deserializeAndHandle @PaymentRefunded (value record)
    Just unknownType -> do
      -- Unknown event type: log and skip
      logInfo $ "Skipping unknown event type: " <> unknownType
      pure ()
    Nothing -> 
      -- No event-type header: could be legacy format
      handleLegacyMessage record

When the producer adds PaymentDisputed events, older consumers see event-type: PaymentDisputed in the header, don’t recognize it, and skip processing. They don’t attempt deserialization, so they don’t break. Meanwhile, newer consumers that understand disputes can handle them. The producer can safely introduce new event types without coordinating deployments across all consumers.

This pattern is particularly valuable when topics contain heterogeneous events (multiple related event types that share a domain but have distinct schemas). It decouples producers from consumers more than traditional schema evolution, because you’re not just adding fields to an existing event; you’re adding entirely new event types that only relevant consumers need to understand.

Headers also support other metadata useful for routing and filtering: correlation IDs for tracing, tenant IDs for multi-tenant systems, priority levels for queue management, or content encoding flags. All of this lives in the envelope, separate from the business payload, making it trivial to inspect and route without deserializing the entire message.
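
On the producer side, the envelope is just a small list of header key-value pairs attached alongside the framed payload. A sketch (the header names here are conventions for illustration, not a standard):

{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString (ByteString)

-- Envelope metadata for one message: event type plus routing/tracing fields.
paymentHeaders :: ByteString -> ByteString -> [(ByteString, ByteString)]
paymentHeaders eventType correlationId =
  [ ("event-type", eventType)
  , ("correlation-id", correlationId)
  , ("content-encoding", "avro-binary")
  ]

-- Usage, with a hypothetical produceWithHeaders from your Kafka client:
--   produceWithHeaders "payment-events"
--     (paymentHeaders "PaymentDisputed" cid)
--     framedPayload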

The key architectural principle is that headers provide structured metadata about the message that consumers can use for control flow decisions before committing to deserializing the payload. This creates a natural extension point: as your system evolves and new event types emerge, consumers can opt into handling them or gracefully ignore them. The schema registry still validates compatibility for each individual event type, and headers provide the mechanism for consumers to select which types they understand.

In Haskell, you might encode this pattern more type-safely by having consumers explicitly declare which event types they handle, using type-level lists or extensible records to track capabilities at compile time. But the runtime mechanism remains the same: check headers, filter events you understand, skip the rest without error.

Organizational Workflow

In practice, event-driven architecture with schema management works like this: teams design events by identifying the business facts they want to publish and determining what context downstream consumers will need. They define these events using a schema language like Avro or JSON Schema, focusing on making them self-contained and rich with information.

They generate language-specific modules from these schema definitions and register the schemas in the schema registry through their CI/CD pipeline. The registry validates compatibility and assigns version numbers. If the schema is incompatible with existing versions, the build fails, and the team must either make the change compatible (for example, by making new fields optional) or create a new event type.

Once deployed, the producing service emits events enriched with relevant context. Other teams consume these events through their own services, which are completely decoupled from the producer. Consumers can execute side effects directly, re-emit commands to other services, or do both. The schema registry ensures that all participants agree on the structure of the data they’re exchanging.

When schemas need to evolve, teams add optional fields for backward-compatible changes, and the registry allows these changes to flow through immediately. For breaking changes, teams create new event types or coordinate deployments, treating the breaking change as a new contract rather than an evolution of the existing one. When introducing entirely new event types to shared topics, teams use header-based envelopes to allow older consumers to gracefully skip events they don’t recognize.

This workflow enables teams to operate independently while maintaining strong contracts. The schema registry acts as the enforcement mechanism, catching problems at build time rather than in production. The result is a loosely coupled system where teams can evolve their services without constant cross-team coordination, while maintaining data consistency and reliability across the entire organization.