Kafka
What is it
Event streaming platform used to broadcast event between systems.
Used for:
- Writing and reading events
- Storing those events for a period of time, allowing look-back
Kafka provides at-least-once semantics in most cases.
- Needs additional configuration to guarantee
What is it not
Not a queue
- Can mimic
FIFO
with a singleproducer
and singleconsumer
group, but lacks features to be a dedicatedqueue
Transport for RPC
calls
Kafka
fire and forgets - should not care about the results of consumptions of events
Topics
Producers
and Consumers
exchange information on a named topic
.
Topics
are always multi-producer
and multi-consumer
.
- allows multiple systems to publish to one
topic
and consume the same events in multiple different applications.
Partitions
Each topic
is divided into Partitions
.
- Each
partition
is a separate bucket of events within thetopic
.
Each partition
can be located on a different broker
node, making this important for the performance of the Kafka cluster.
Each message
is produced to a single partition
.
Consumers
must consume allpartitions
, this is automatically handled via aConsumerGroup
.
Order of events is guaranteed within a partition
not within a topic
.
- As a consequence, it is useful to use a fixed
partition key
, such as theaccount_id
, so that message ordering is preserved within an account.
Producer => Brokers => All Partitions (to ConsumerGroup)
Publishing and Consuming
Producers
connect to Kafka Brokers
to publish their messages.
Each message
consists of a Key and a Value field.
- The
Value
field is thepayload
of theevent
, the Key is used to allocate theevent
to a particularpartition
. - The
partitions
of thetopic
are spread across thebroker
nodes allowing for spreading the load of a singletopic
across multiplebrokers
.
Consumption
is typically done via a ConsumerGroup
.
The group can contain 1 or more consumers
and these can be across different hosts
, allowing for parallel processing.
- The Kafka
brokers
ensure that eachpartition
is assigned to a singleconsumer
in the group.
The maximum parallelism is limited to the number of partitions
in the topic
.
- Each
consumer group
has a unique id, so allmembers
of that group would register with thesame id
and thepartitions
of atopic
would be divided among them.
Deterministic Hashing
In order to ensure all events
for a particular account are produced to a single partition
, and thus retain ordering, one needs a deterministic way to go from some_id
→ partition #
.
- It is important that all
producers
follow the same convention.
Key is set to some_id
.
Partition = murmur3(some_id) % numPartitions
murmur3
is a common non-cryptographic hashing function available in virtually any language, so it makes a good choice as a convention.
Idempotent Producers
When producing to Kafka, one way to reduce the possibility of duplicates is to enable idempotence on the producer.
Producer will do extra bookkeeping, such as including a sequence number, to ensure events are not unintentionally published twice.
Reference:
- https://stackoverflow.com/questions/58894281/difference-between-idempotence-and-exactly-once-in-kafka-stream
Replication Factor and Min In-Sync Replicas
Kafka Topics have two important settings which, when misconfigured, can cause the topics to become unusable.
The first is replication factor
.
- the number of copies of the message written to the Kafka cluster.
The second is the minimum number of in sync replicas
.
- indicates the number of replicas that have been published to before the message is considered durably stored.
If replicationFactor
< min in sync replicas
, the cluster will no longer accept publishing to that topic.
The solution is to increase the replication factor
or decrease the number of in-sync replicas
.