Kafka Basics: A Simple Guide for Beginners
Problem
First, let's understand the problem at hand. Suppose you are ordering food on the Zomato app and you can see live (real-time) updates of the driver's location on your mobile app.
The driver is moving from point A to point B, and every second, their current location updates on the app.
How do we design this?
A simple, naive solution could be to continuously send the location details to the Zomato server, which then stores them in a database. The customer's mobile app would keep reading the information from the database and display it on the UI.
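To make this naive design concrete, here is a minimal sketch in Node.js (hypothetical endpoint names; an in-memory Map stands in for the real database):
const express = require("express"); //assumes: npm install express
const app = express();
app.use(express.json());
//stand-in for a real database table
const locations = new Map();
//every driver POSTs its location every second → one write per driver per second
app.post("/driver/:id/location", (req, res) => {
  locations.set(req.params.id, req.body);
  res.sendStatus(200);
});
//the customer's app keeps polling this endpoint for fresh coordinates
app.get("/driver/:id/location", (req, res) => {
  res.json(locations.get(req.params.id) || {});
});
app.listen(3000);
Each location update becomes its own standalone write, which is exactly what hurts at scale.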
The problem with this solution: suppose you have 1,000 drivers continuously sending their location to the server, and the server keeps performing insert operations on the database. As the write operations per second (OPS) grow, the database will eventually buckle under the load and crash.
Another problem is that traditional databases have relatively low write throughput. So even if you somehow keep the database from crashing, it will not be able to handle the high-throughput load, which will lead to delays in updating the location on the customer's mobile app.
Solution
Kafka, on the other hand, is built for very high throughput: it can absorb a massive stream of writes and hand them to consumers in real time.
Let's understand it with an example of Uber.
In the case of Uber, suppose we have 100,000 cars producing data every second, such as speed and current location. Each car is a producer in Kafka terminology, sending data to the Kafka service. Other services, like fare calculation, analytics, and customer service, keep polling data from Kafka. After processing, they bulk insert the data into the database. This drastically reduces the operations per second (OPS) hitting the database, because many messages are written in a single bulk insert instead of one insert per message.
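As a rough sketch of the consumer side (hypothetical topic and helper names; kafkajs itself is set up in the hands-on section below), a fare service could read a whole batch of messages and write them with a single bulk insert:
const { Kafka } = require("kafkajs");
const kafka = new Kafka({ clientId: "fare-service", brokers: ["localhost:9092"] });
const consumer = kafka.consumer({ groupId: "fare-service" });
//hypothetical stand-in for one multi-row INSERT into the database
async function bulkInsert(rows) {
  console.log(`bulk inserting ${rows.length} rows`);
}
async function run() {
  await consumer.connect();
  await consumer.subscribe({ topics: ["car-updates"] });
  await consumer.run({
    //eachBatch hands us many messages at once instead of one at a time
    eachBatch: async ({ batch, resolveOffset, heartbeat }) => {
      const rows = batch.messages.map((m) => JSON.parse(m.value.toString()));
      await bulkInsert(rows); //one write for the whole batch, not one per message
      for (const m of batch.messages) resolveOffset(m.offset);
      await heartbeat();
    },
  });
}
run();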
Architecture of Kafka
Components
Producers
e.g., Uber cars, Zomato riders
Consumers
e.g., fare service, customer service, analytics service
Topics (logical partitioning of messages)
e.g., rider updates, hotel updates
Now, a single topic can get cluttered when a large number of messages flow into it. To manage this, we can divide the topic into smaller parts, which is called partitioning.
For example, we can partition by location or user name.
Rider updates (topic) -> South India, North India
Important Architectural Points
If there's one consumer, it will consume all partitions.
When another consumer joins, Kafka automatically redistributes the partitions among the consumers (auto-balancing).
If the number of consumers is less than the number of topic partitions, then multiple partitions can be assigned to one consumer in the group.
If the number of consumers equals the number of topic partitions, each consumer is assigned exactly one partition (a one-to-one mapping).
If the number of consumers is greater than the number of topic partitions, the extra consumers are left idle; for example, with 4 partitions and 5 consumers, Consumer 5 gets nothing, which is not effective.
In short:
1 consumer can consume multiple partitions
1 partition can be consumed by at most 1 consumer within a group
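You can watch these rules play out once the hands-on setup below is running: kafkajs emits a GROUP_JOIN event on every rebalance, so a consumer can log the partitions it currently owns (a sketch; broker address and topic assumed):
const { Kafka } = require("kafkajs");
const kafka = new Kafka({ clientId: "demo", brokers: ["localhost:9092"] });
const consumer = kafka.consumer({ groupId: "demo-group" });
async function main() {
  //GROUP_JOIN fires on every rebalance, e.g., when another consumer joins the group
  consumer.on(consumer.events.GROUP_JOIN, (e) => {
    console.log("assigned partitions:", e.payload.memberAssignment);
  });
  await consumer.connect();
  await consumer.subscribe({ topics: ["rider-updates"] });
  await consumer.run({ eachMessage: async () => {} });
}
main();
Start a second copy of this script and both processes will log new, smaller assignments.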
Okay, that was a lot of theoretical knowledge. Now it's time for some hands-on practice.
Example
Prerequisites
Node.js
Docker
VS Code
Run ZooKeeper
docker run -p 2181:2181 zookeeper
Run Kafka
docker run -p 9092:9092 \
-e KAFKA_ZOOKEEPER_CONNECT=<PRIVATE_IP>:2181 \
-e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://<PRIVATE_IP>:9092 \
-e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 \
confluentinc/cp-kafka
<PRIVATE_IP> → your machine's private network IP (e.g., the address shown by ifconfig or ipconfig)
Create a node application
mkdir <project_name>
cd <project_name>
npm init # or yarn init
#install kafkajs
npm install kafkajs
client.js
const { Kafka } = require("kafkajs");

//setting up a kafka client
exports.kafka = new Kafka({
  clientId: "my-app",
  brokers: ["<PRIVATE_IP>:9092"],
});
admin.js
This file handles your Kafka infrastructure, such as creating topics and setting the number of partitions for each topic.
const { kafka } = require("./client");

async function init() {
  const admin = kafka.admin();
  console.log("Admin connecting...");
  await admin.connect();
  console.log("Admin Connection Success...");

  console.log("Creating Topic [rider-updates]");
  //creating the topic and its partitioning layout
  await admin.createTopics({
    topics: [
      {
        topic: "rider-updates",
        numPartitions: 2,
      },
    ],
  });
  console.log("Topic Created Success [rider-updates]");

  console.log("Disconnecting Admin..");
  await admin.disconnect();
}

init();
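To confirm the topic really has two partitions, you can run the kafka-topics CLI that ships inside the cp-kafka container (container name assumed; find yours with docker ps):
docker exec -it <kafka_container> kafka-topics --bootstrap-server localhost:9092 --describe --topic rider-updates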
producer.js
File that will produce messages to the Kafka service, which will later be consumed by the consumer service.
const { kafka } = require("./client");
//readline lets us take input from the command line
const readline = require("readline");

const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout,
});

async function init() {
  const producer = kafka.producer();

  console.log("Connecting Producer");
  //connecting producer
  await producer.connect();
  console.log("Producer Connected Successfully");

  rl.setPrompt("> ");
  rl.prompt();

  rl.on("line", async function (line) {
    //split the command-line input into rider name and location
    const [riderName, location] = line.split(" ");
    await producer.send({
      //send to the "rider-updates" topic created by admin.js
      topic: "rider-updates",
      messages: [
        {
          //route by location: "north" goes to partition 0, everything else to partition 1
          partition: location.toLowerCase() === "north" ? 0 : 1,
          key: "location-update",
          value: JSON.stringify({ name: riderName, location }),
        },
      ],
    });
  }).on("close", async () => {
    await producer.disconnect();
  });
}

init();
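Hard-coding partition numbers is fine for a demo, but you can also omit partition and set a message key: Kafka's default partitioner hashes the key, so all messages with the same key land in the same partition and stay ordered. A sketch reusing client.js:
const { kafka } = require("./client");
async function send(riderName, location) {
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: "rider-updates",
    messages: [
      //no explicit partition: same key ⇒ same partition ⇒ per-rider ordering
      { key: riderName, value: JSON.stringify({ name: riderName, location }) },
    ],
  });
  await producer.disconnect();
}
send("abdullah", "north");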
consumer.js
This file consumes messages from the Kafka topic it subscribes to.
const { kafka } = require("./client");

//consumer group name comes from the command line, e.g., node consumer.js user-1
const group = process.argv[2];

async function init() {
  const consumer = kafka.consumer({ groupId: group });
  await consumer.connect();

  //subscribing to the "rider-updates" topic
  await consumer.subscribe({ topics: ["rider-updates"], fromBeginning: true });

  await consumer.run({
    eachMessage: async ({ topic, partition, message, heartbeat, pause }) => {
      console.log(
        `${group}: [${topic}]: PART:${partition}:`,
        message.value.toString()
      );
    },
  });
}

init();
Executing Code
Run admin.js
node admin.js
Admin connecting...
Admin Connection Success...
Creating Topic [rider-updates]
Topic Created Success [rider-updates]
Disconnecting Admin..
Run consumer.js
node consumer.js user-1
#here user-1 is the consumer group name
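To see the auto-balancing described earlier, open a second terminal and start another consumer in the same group:
node consumer.js user-1
#Kafka rebalances so each of the two consumers owns one of the two partitions
A consumer started with a different group name receives every message independently:
node consumer.js user-2
#a separate group gets its own copy of the topic's messages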
Run producer.js
node producer.js
{"level":"WARN","timestamp":"2024-05-26T09:02:34.450Z","logger":"kafkajs","message":"KafkaJS v2.0.0 switched default partitioner. To retain the same partitioning behavior as in previous versions, create the producer with the option \"createPartitioner: Partitioners.LegacyPartitioner\". See the migration guide at https://kafka.js.org/docs/migration-guide-v2.0.0#producer-new-default-partitioner for details. Silence this warning by setting the environment variable \"KAFKAJS_NO_PARTITIONER_WARNING=1\""}
Connecting Producer
Producer Connected Successfully
> abdullah north
#"abdullah north" is the input given to the producer: abdullah is the
#rider name and north is the location that decides this user's partition
Output at consumer.js terminal
user-1: [rider-updates]: PART:0: {"name":"abdullah","location":"north"}
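Try a rider from the south as well (hypothetical name; any location other than "north" maps to partition 1 in producer.js). The consumer terminal should print something like:
> alice south
user-1: [rider-updates]: PART:1: {"name":"alice","location":"south"}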