Kafka

What is it

Event streaming platform used to broadcast events between systems.

Used for:

  • Writing and reading events
  • Storing those events for a period of time, allowing look-back

Kafka provides at-least-once semantics in most cases.

  • Needs additional configuration to make this a guarantee

What is it not

Not a queue

  • Can mimic a FIFO queue with a single producer and a single consumer group, but lacks the features of a dedicated queue (per-message acknowledgement, dead-lettering, etc.)

Not a transport for RPC calls

  • Kafka is fire-and-forget: producers should not care about the results of consuming events

Topics

Producers and Consumers exchange information on a named topic.

Topics are always multi-producer and multi-consumer.

  • allows multiple systems to publish to one topic, and multiple different applications to consume the same events.

Partitions

Each topic is divided into Partitions.

  • Each partition is a separate bucket of events within the topic.

Each partition can be located on a different broker node, which makes partitioning important for spreading load across the Kafka cluster.

Each message is produced to a single partition.

  • Consumers must consume all partitions; this is handled automatically via a ConsumerGroup.

Order of events is guaranteed within a partition, not within a topic.

  • As a consequence, it is useful to use a fixed partition key, such as the account_id, so that message ordering is preserved within an account.

Producer => Brokers => All Partitions (to ConsumerGroup)

Publishing and Consuming

Producers connect to Kafka Brokers to publish their messages.

Each message consists of a Key and a Value field.

  • The Value field is the payload of the event; the Key is used to allocate the event to a particular partition (see the producer sketch below).
  • The partitions of the topic are spread across the broker nodes allowing for spreading the load of a single topic across multiple brokers.
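
A minimal producer sketch in Java using the official Kafka client; the broker address, topic name, and account id are placeholder assumptions:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ProduceExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Key = account_id, so all events for one account hash to the same
                // partition and therefore stay ordered relative to each other.
                producer.send(new ProducerRecord<>("payments", "account-42", "{\"amount\":10}"));
            } // close() flushes any buffered sends
        }
    }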

Consumption is typically done via a ConsumerGroup.

The group can contain one or more consumers, which can run on different hosts, allowing for parallel processing.

  • The Kafka brokers ensure that each partition is assigned to a single consumer in the group.

The maximum parallelism is limited to the number of partitions in the topic.

  • Each consumer group has a unique id; all members of the group register with that id, and the partitions of the topic are divided among them (see the consumer sketch below).
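
A minimal consumer sketch in Java; the group id and topic name are placeholder assumptions. Running several copies of this process with the same group id splits the partitions among them:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ConsumeExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-processor");      // shared group id
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("payments"));
                while (true) {
                    // The brokers assign this consumer a subset of the partitions.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d key=%s value=%s%n",
                                record.partition(), record.key(), record.value());
                    }
                }
            }
        }
    }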

Deterministic Hashing

In order to ensure all events for a particular account are produced to a single partition, and thus retain ordering, one needs a deterministic way to go from some_id to a partition number.

  • It is important that all producers follow the same convention.

Key is set to some_id.

Partition = murmur3(some_id) % numPartitions

murmur3 is a common non-cryptographic hashing function available in virtually any language, so it makes a good choice as a convention.
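
A sketch of the convention in Java, using Guava's murmur3 implementation (the library choice and the 32-bit, seed-0 variant are assumptions; any shared murmur3 implementation works, as long as every producer uses the same one):

    import com.google.common.hash.Hashing;
    import java.nio.charset.StandardCharsets;

    public class Partitioning {
        static int partitionFor(String someId, int numPartitions) {
            // murmur3 32-bit with seed 0 -- must be identical across all producers
            int hash = Hashing.murmur3_32_fixed()
                    .hashString(someId, StandardCharsets.UTF_8)
                    .asInt();
            // floorMod keeps the result non-negative even when the hash is negative
            return Math.floorMod(hash, numPartitions);
        }
    }

The computed partition can then be passed explicitly via the ProducerRecord(topic, partition, key, value) constructor. Note that the Java client's default partitioner hashes keys with murmur2, so a murmur3 convention has to be applied explicitly rather than left to the default.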

Idempotent Producers

When producing to Kafka, one way to reduce the possibility of duplicates is to enable idempotence on the producer.

The producer will do extra bookkeeping, such as including a sequence number with each batch, to ensure retried events are not unintentionally written twice.
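
Enabling it is a single setting on the producer configuration; a minimal sketch, reusing the props object from the producer sketch above:

    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
    // Idempotence requires acks=all, so the write is acknowledged by all in-sync replicas.
    props.put(ProducerConfig.ACKS_CONFIG, "all");

In recent client versions (Kafka 3.0 and later) idempotence is enabled by default.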

Reference:

  • https://stackoverflow.com/questions/58894281/difference-between-idempotence-and-exactly-once-in-kafka-stream

Replication Factor and Min In-Sync Replicas

Kafka Topics have two important settings which, when misconfigured, can cause the topics to become unusable.

The first is replication factor.

  • the number of copies of the message written to the Kafka cluster.

The second is min.insync.replicas, the minimum number of in-sync replicas.

  • indicates the number of replicas that must acknowledge a write (when producing with acks=all) before the message is considered durably stored.

If the replication factor is less than min.insync.replicas, the minimum can never be met, and the cluster will no longer accept publishing to that topic.

The solution is to increase the replication factor or to decrease min.insync.replicas.
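
A sketch of creating a topic with a safe combination of the two settings, using the Java AdminClient; the topic name, partition count, and broker address are placeholder assumptions:

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

            try (AdminClient admin = AdminClient.create(props)) {
                // replicationFactor = 3, min.insync.replicas = 2: every message is
                // copied to 3 brokers, and a write is durable once 2 of them have it,
                // so one broker can be down without the topic rejecting publishes.
                NewTopic topic = new NewTopic("payments", 6, (short) 3)
                        .configs(Map.of("min.insync.replicas", "2"));
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }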
