Apache Kafka is a high-throughput, low-latency, and scalable messaging system. Its distributed architecture lets it handle large-scale data streams reliably. In this post, we’ll explore the fundamental concepts of Kafka and how it works.
Kafka is a messaging system originally developed by LinkedIn and later open-sourced under the Apache Software Foundation. It is widely used for big data processing, real-time stream analytics, and event-driven architectures.
Figure 1. High-level architecture overview of Apache Kafka showing the interaction between producers, brokers, and consumers. Source: Created by the author.
To understand Kafka, it’s important first to get familiar with its core components:
Broker: A server that stores and distributes messages. A Kafka cluster consists of multiple brokers.
Producer: A component that writes messages to a specific topic.
Consumer: A component that reads messages from a specific topic.
Topic: A logical category where messages are organized. A topic can be split into multiple partitions.
Partition: A subdivision of a topic that enables scalability. Each partition is stored on a specific broker.
Segment: A smaller file that stores messages within a partition.
Log: The physical file where data within a partition is stored. In Kafka, each partition is maintained as an append-only log file.
Record: The smallest unit of data in Kafka, consisting of an optional key, a value, and a timestamp. Kafka logs are essentially ordered collections of records.
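To make these terms concrete, here is a rough sketch of how a broker might lay out a topic named "user-activity" with three partitions on disk (the paths and offsets are illustrative; the actual location depends on the log.dirs setting):
/tmp/kraft-logs/
├── user-activity-0/ ← partition 0: an append-only log made of segments
│   ├── 00000000000000000000.log ← segment file containing the records
│   ├── 00000000000000000000.index ← offset index for that segment
│   └── 00000000000000000000.timeindex ← time index for that segment
├── user-activity-1/
└── user-activity-2/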
Figure 2. Kafka topic structure showing the relationship between topics, partitions, and segments in a distributed setup. Source: Created by the author.
A topic in Kafka is a logical category where messages are grouped. Producers send messages to specific topics, while consumers read from them. Topics are divided into partitions to handle data more efficiently.
Partitions enable horizontal scaling of a topic. Each partition behaves as an independent append-only log, meaning messages are added sequentially and are not deleted (until the retention period expires).
Key features:
A topic can have multiple partitions.
Kafka distributes messages across partitions for load balancing.
Partitions of the same topic can be stored on different brokers to improve scalability.
Each partition has one leader and zero or more follower replicas. The leader is the primary copy that handles writes, while the followers maintain backup copies.
Figure 3. Leader-follower replication model in Kafka demonstrating how data is replicated across multiple brokers for fault tolerance. Source: Created by the author.
Example:
Topic: "user-activity"
Partition 0 → Broker 1
Partition 1 → Broker 2
Partition 2 → Broker 3
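As a sketch, assuming a broker is reachable on localhost:9092, a topic like this could be created with the Kafka CLI (Kafka itself decides which broker hosts each partition):
bin/kafka-topics --bootstrap-server localhost:9092 --create --topic user-activity --partitions 3 --replication-factor 1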
The replication factor in Kafka defines how many copies of a partition are stored on different brokers. It enhances data durability and protects against broker failures.
Key features:
Replication factor = 1: No backup; the partition exists on only one broker.
Replication factor ≥ 2: Multiple replicas of the partition are stored on different brokers.
Each partition has only one leader; others act as followers.
By default, consumers read only from the leader partition.
If the leader broker fails, Kafka elects a follower as the new leader.
Example replication setup:
Topic: "user-activity", Replication Factor: 3
Partition 0 → Leader: Broker 1, Replicas: Broker 2, Broker 3
Partition 1 → Leader: Broker 2, Replicas: Broker 1, Broker 3
Partition 2 → Leader: Broker 3, Replicas: Broker 1, Broker 2
This setup protects against data loss and keeps the topic available if a single broker fails.
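You can inspect how leaders and replicas were assigned with the describe command. Assuming a three-broker cluster with broker IDs 1, 2, and 3, the output would look roughly like this (exact formatting varies by version):
bin/kafka-topics --bootstrap-server localhost:9092 --describe --topic user-activity
Topic: user-activity  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
Topic: user-activity  Partition: 1  Leader: 2  Replicas: 2,1,3  Isr: 2,1,3
Topic: user-activity  Partition: 2  Leader: 3  Replicas: 3,1,2  Isr: 3,1,2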
Partitions are divided into smaller files called segments. This helps optimize disk usage and speeds up data lookup.
Key features:
Each partition contains multiple segments.
Kafka creates a new segment based on a time interval or file size threshold.
Old segments are cleaned up based on Kafka’s log retention policy.
Figure 4. Internal structure of a Kafka broker showing how messages are organized in segments within partitions. Source: Created by the author.
Example (each segment file is named after the offset of the first record it contains):
Partition 0
├── 00000000000000000000.log
├── 00000000000000012345.log
├── 00000000000000024690.log
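Segment rolling can also be tuned per topic. As an illustration (the values here are examples, not recommendations), the following sets a 1 GB size limit and a 7-day time limit for starting a new segment:
bin/kafka-configs --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name user-activity --add-config segment.bytes=1073741824,segment.ms=604800000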
Kafka’s log-based architecture stores messages in log files. Each partition acts as an append-only log, where messages are written sequentially and remain immutable.
A log is a collection of records stored in a partition.
A record is the smallest data unit in Kafka, containing:
Key (optional): An identifier for the message
Value: The actual message payload
Timestamp: When the message was created
Example of a Kafka record:
Key: "user_123"
Value: "{"action": "login", "timestamp": "2025-03-18T12:34:56Z"}"
Timestamp: "2025-03-18T12:34:56Z"
Consumers read and process records from the logs of specific topics.
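As a minimal sketch, keyed records like the one above can be produced and inspected with the console tools, assuming a broker is running on localhost:9092. The parse.key, key.separator, and print.key settings are standard console producer/consumer properties; the "|" separator is an arbitrary choice that avoids clashing with the colons inside the JSON value:
bin/kafka-console-producer --bootstrap-server localhost:9092 --topic user-activity --property parse.key=true --property "key.separator=|"
Each line you type is then split into a key and a value, for example:
user_123|{"action": "login", "timestamp": "2025-03-18T12:34:56Z"}
bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic user-activity --from-beginning --property print.key=true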
Figure 5. Kafka's log-based storage model illustrating how messages are appended sequentially in partition logs. Source: Created by the author.
To understand Kafka’s operation, consider the following steps:
A producer sends a message to a specific topic.
A broker stores the message in the appropriate partition of that topic.
A consumer reads messages from the topic and processes them.
Messages are retained for a configured duration (retention period) and then deleted.
This process enables Kafka to deliver real-time data streaming and forms the backbone of event-driven systems.
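For example, the retention period mentioned in step 4 is controlled by broker-level settings such as log.retention.hours and can be overridden per topic. A sketch of a 7-day topic-level override (604,800,000 ms) would look like this:
bin/kafka-configs --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name user-activity --add-config retention.ms=604800000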
Here’s why Kafka is widely adopted:
High Performance: Handles large-scale data streams with low latency.
Scalability: Horizontally scalable thanks to its distributed architecture.
Durability: Data is replicated across brokers for fault tolerance.
Flexibility: Can be integrated with various data processing systems (e.g., Flink, Spark, Elasticsearch).
Confluent Platform extends Apache Kafka with additional tools and integrations, and it now supports KRaft mode (ZooKeeper-free architecture). Here’s how to set up Confluent Platform with KRaft mode on macOS using the .tar.gz archive.
Confluent Platform is an enhanced distribution of Apache Kafka built by its original creators. It bundles Apache Kafka with additional tools like Schema Registry, Kafka Connect, REST Proxy, and ksqlDB, making it easier to build, monitor, and scale real-time data pipelines.
Traditionally, Kafka required ZooKeeper—a separate coordination service—to manage metadata and cluster state. However, ZooKeeper adds operational complexity. Confluent Platform now supports KRaft mode, a newer architecture where Kafka handles its own metadata internally, removing the need for ZooKeeper altogether.
In this guide, we use Confluent Platform in KRaft mode, allowing a simpler, ZooKeeper-free setup.
Before you begin, ensure the following:
macOS (Intel or Apple Silicon)
Java 11+ is installed
Check: java -version
Install (if needed): brew install openjdk@17
Basic terminal knowledge
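Note that Homebrew installs openjdk@17 as keg-only, so it is not on your PATH by default. On Apple Silicon you would typically add something like the following to your shell profile (Intel Macs use /usr/local/opt/openjdk@17/bin instead):
export PATH="/opt/homebrew/opt/openjdk@17/bin:$PATH"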
Go to the Confluent downloads page and download the tar.gz version.
Or use curl:
curl -O https://packages.confluent.io/archive/7.5/confluent-7.5.0.tar.gz
Extract it:
tar -xzf confluent-7.5.0.tar.gz
cd confluent-7.5.0
Edit the KRaft server config:
vim etc/kafka/kraft/server.properties
Make sure it includes:
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
log.dirs=/tmp/kraft-logs
You can use a different log path if you like. Save and exit the file.
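Depending on the Confluent Platform version, the same file also ships with related defaults such as controller.listener.names=CONTROLLER and advertised.listeners=PLAINTEXT://localhost:9092; for this single-node setup they can normally be left as they are.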
Before starting the Kafka server, format the storage directory for KRaft:
bin/kafka-storage format -t $(bin/kafka-storage random-uuid) -c etc/kafka/kraft/server.properties
Start the Kafka broker (now also acting as the controller):
bin/kafka-server-start etc/kafka/kraft/server.properties
No ZooKeeper is needed. You should see logs showing both the broker and controller roles starting.
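Once the broker is up, you can sanity-check it from another terminal; on a fresh installation, the topic list will simply come back empty:
bin/kafka-topics --bootstrap-server localhost:9092 --list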
Next, create a test topic and try producing and consuming a few messages to verify the setup end to end.
In a new terminal:
bin/kafka-topics --bootstrap-server localhost:9092 --create --topic test-topic --partitions 1 --replication-factor 1
Then, produce messages (also in a new terminal):
bin/kafka-console-producer --bootstrap-server localhost:9092 --topic test-topic
And consume them (also in a new terminal):
bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic test-topic --from-beginning
As you type messages into the producer terminal, you’ll instantly see them appear in the consumer terminal. This demonstrates Kafka’s real-time streaming capability.
To run the Confluent Kafka CLI tools from any directory, add the following lines to your shell profile (~/.zshrc or ~/.bash_profile):
export CONFLUENT_HOME=~/path/to/confluent-7.5.0
export PATH=$PATH:$CONFLUENT_HOME/bin
Then reload the profile:
source ~/.zshrc # or ~/.bash_profile
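To confirm that the tools are now resolved from your Confluent installation, you can check where the shell finds one of them; it should print a path inside your confluent-7.5.0/bin directory:
which kafka-topics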
In this post, we covered the core concepts of Apache Kafka and walked through installing Kafka in KRaft mode using the Confluent Platform on macOS. By eliminating ZooKeeper and leveraging the Confluent distribution, we achieved a cleaner and easier Kafka setup—ideal for modern, event-driven applications.
You’re now ready to start building real-time streaming pipelines locally. For production use, consider exploring multi-node clusters, replication strategies, and integrating tools like Schema Registry or Kafka Connect.