Kafka Fundamentals
Data Streaming and Big Data using Apache Kafka
Based on this series
Managed offerings: AWS = Amazon MSK, Azure = Kafka on HDInsight (Event Hubs also exposes a Kafka-compatible endpoint)
Introduction
Kafka is a distributed event log and pub/sub service
Data is increasingly produced as streams of events, and we need a platform that connects these events and lets us do more useful things with the data
The emphasis is on continuous data rather than discrete chunks. Kafka provides a platform that allows us to:
- Scale globally
- Process in real time
- Store persistently
- Process streams
Kafka enables all sorts of connections, such as:
```mermaid
graph TD
  IoT --> Kafka
  Mobile --> Kafka
  SaaS --> Kafka
  Edge --> Kafka
  Kafka --> Datacenter
  Kafka --> Microservices
  Kafka --> Databases
  Kafka --> DL(Data Lake)
  Kafka --> ML(Machine Learning)
  Kafka --> Cloud
```
Data in Kafka is stored as a stream of events
Used in industries like:
- Banking
- Automotive
- Commerce
- Healthcare
- Gaming
- Government
Fundamentals
Producer
Producers are applications that create data that needs to go into a Kafka cluster
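As a minimal sketch of a producer, assuming the kafka-python client and a broker reachable at localhost:9092 (the topic name device-events is hypothetical):

```python
from kafka import KafkaProducer

# Connect to the cluster; in practice you would list several brokers.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# send() is asynchronous: it buffers the record and batches it to the
# broker in the background.
producer.send("device-events", value=b"temperature=21.3")

# flush() blocks until all buffered records have been acknowledged.
producer.flush()
producer.close()
```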
Broker
A broker is an individual machine, container, or VM that runs the Kafka server process. A Kafka cluster is made up of multiple brokers, which allows the data to be distributed across them
Consumer
Consumers are applications that consume data from Kafka and can then pass that data on to other systems. Consumers poll Kafka for new messages or events
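As a matching consumer sketch, again assuming kafka-python and the hypothetical device-events topic; auto_offset_reset="earliest" makes a new consumer group start from the beginning of the log:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "device-events",
    bootstrap_servers="localhost:9092",
    group_id="device-dashboard",      # hypothetical consumer group
    auto_offset_reset="earliest",     # start from the oldest retained event
)

# Iterating the consumer polls Kafka under the hood and yields records.
for record in consumer:
    print(record.topic, record.partition, record.offset, record.value)
```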
Architecture
The Kafka architecture looks like the diagram below; here, ZooKeeper handles things like consensus in the cluster
```mermaid
graph LR
  Zookeeper --> Brokers --> Zookeeper
  Producers --> Brokers --> Consumers
```
ZooKeeper is in the process of being removed from Kafka (replaced by KRaft, Kafka's built-in Raft-based consensus)
Producers are decoupled from consumers, so producers can be added or removed as needed. Additionally, new consumers can be added and will be able to consume the entire retained history of events, not just the current ones
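As a sketch of that replay behaviour, assuming kafka-python: instead of joining a consumer group, a consumer can be assigned a partition directly and rewound to the start of the log:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")

# Manually assign partition 0 of the (hypothetical) topic, then seek to
# the beginning to replay every retained event, not just new ones.
partition = TopicPartition("device-events", 0)
consumer.assign([partition])
consumer.seek_to_beginning(partition)

for record in consumer:
    print(record.offset, record.value)
```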
Topics, Partitions, and Segments
Scaling is handled by splitting a topic into partitions. Each partition within a topic can be placed on its own server, which lets Kafka scale out so that a single machine's compute and I/O are never the bottleneck
Every partition is a log of data and events
On disk, a segment is a piece of a partition; each partition's log is stored as a sequence of segment files
Kafka has a couple of different types of topics. Namely:
- Regular topics allow us to retrieve all records from a topic
  - Something like streaming data
- Compacted topics allow us to retrieve the latest record per key from a topic (see the sketch after this list)
  - Something like the latest change to a record or set of records
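As a minimal sketch of creating a compacted topic with kafka-python's admin client (the topic name device-state is hypothetical); cleanup.policy=compact is the topic config that enables compaction:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# cleanup.policy=compact keeps only the latest record per key instead of
# expiring data purely by age or size.
topic = NewTopic(
    name="device-state",
    num_partitions=3,
    replication_factor=1,
    topic_configs={"cleanup.policy": "compact"},
)
admin.create_topics([topic])
```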
Records
A piece of data (a record) in Kafka looks like this:
```mermaid
classDiagram
  class Record {
    Key
    Value
    headers(Optional)
    timestamp(CreationTime or IngestionTime)
  }
```
Messages with the same key always end up in the same partition. A use case for this is to use something like a producer's device ID (e.g. the ID of an IoT device) as the key, so that device's records all land in a single partition and their ordering is guaranteed
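As a sketch of producing a full record with kafka-python (names are hypothetical): the key routes the record to a partition, while headers and the timestamp fill in the optional fields shown above:

```python
import time
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Records with the same key hash to the same partition, so every event
# for device-42 stays in order within that partition.
producer.send(
    "device-events",
    key=b"device-42",
    value=b'{"temperature": 21.3}',
    headers=[("source", b"iot-gateway")],  # optional (str, bytes) pairs
    timestamp_ms=int(time.time() * 1000),  # CreationTime set by the producer
)
producer.flush()
```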
Broker Replication
Brokers replicate each partition, usually with 3 replicas per partition. For each partition, one replica acts as the leader and the rest are followers; the brokers manage this replication
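As a sketch of how replication surfaces on the producer side, assuming kafka-python: with acks="all", the partition leader only acknowledges a write once it has been replicated to the in-sync followers:

```python
from kafka import KafkaProducer

# acks="all" trades a little latency for durability: the write must reach
# all in-sync replicas before the producer considers it successful.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",
)

producer.send("device-events", key=b"device-42", value=b"...")
producer.flush()
```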