Kafka Basics: A Simple Guide for Beginners

Problem

First, let's understand the problem at hand. Suppose you are ordering food on the Zomato app and can see live (real-time) updates of the driver's location on your phone.

The driver is moving from point A to point B, and every second, their current location updates on the app.

How do we design this?

A simple, naive solution could be to continuously send the location details to the Zomato server, which then stores them in a database. The customer's mobile app would keep reading the information from the database and displaying it on the UI.

The problem with this solution is scale: suppose you have 1,000 drivers who keep sending their location to the server, and the server performs an insert on the database for every single update. As the write operations per second (OPS) climb, the database will eventually crash.

Another problem is that traditional databases offer relatively low write throughput. So even if you somehow keep the database from crashing, it will not be able to ingest updates fast enough, which leads to delays in updating the location on the customer's mobile app.
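To make the naive design concrete, here is a minimal sketch of that per-update write path. The Express endpoint and the db.insert helper are illustrative assumptions, not part of any real Zomato design; the point is simply that every driver ping becomes one database insert.

const express = require("express");
const db = require("./db"); // hypothetical database client, stand-in only
const app = express();
app.use(express.json());

// Every driver ping becomes one INSERT:
// 1,000 drivers x 1 ping/sec = 1,000 inserts/sec hitting the database.
app.post("/driver/location", async (req, res) => {
  const { driverId, lat, lng } = req.body;
  await db.insert("driver_locations", { driverId, lat, lng, ts: Date.now() });
  res.sendStatus(200);
});

app.listen(3000);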

Solution

Kafka is built for exactly this situation: it offers very high throughput and can absorb a massive stream of incoming messages.

Let's understand it with an example of Uber.

In the case of Uber, suppose we have 100,000 cars producing data every second, like speed and current location. Each car is considered a producer in Kafka terminology, sending data to the Kafka service. Other services like fare calculation, analytics, and customer service keep polling data from Kafka. After processing, they bulk insert the data into the database. This significantly reduces the operations per second (OPS) hitting the database, because each bulk insert writes many records in a single operation while Kafka absorbs the high-throughput stream in front of it.
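Here is a hedged sketch of that consume-then-bulk-insert pattern using kafkajs. The topic name car-telemetry, the group id analytics-service, and the db.bulkInsert helper are all illustrative assumptions; the eachBatch handler gives the consumer a whole batch of messages, which we write in one database round trip.

const { Kafka } = require("kafkajs");
const db = require("./db"); // hypothetical database client with a bulkInsert method

const kafka = new Kafka({ clientId: "analytics", brokers: ["localhost:9092"] });

async function run() {
  const consumer = kafka.consumer({ groupId: "analytics-service" });
  await consumer.connect();
  await consumer.subscribe({ topics: ["car-telemetry"] });

  await consumer.run({
    // eachBatch delivers many messages at once instead of one at a time
    eachBatch: async ({ batch, resolveOffset, heartbeat }) => {
      const rows = batch.messages.map((m) => JSON.parse(m.value.toString()));
      await db.bulkInsert("telemetry", rows); // one insert for the whole batch
      for (const m of batch.messages) resolveOffset(m.offset);
      await heartbeat();
    },
  });
}

run();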

Architecture of Kafka

Components

  • Producers
    e.g., Uber Cars, Zomato Riders

  • Consumers
    e.g., Fare service, customer service, analytics service

  • Topics (named logical channels that group related messages)
    e.g., Rider updates, hotel updates

Now, a single topic can end up cluttered with a very large number of messages. To manage this, we can divide the topic into pieces, which is called partitioning.
For example, we can partition by location or user name:
Rider updates (topic) -> South India, North India
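As a tiny illustration, the producer itself can decide which partition a message belongs to. This is only a sketch, assuming the topic has exactly two partitions (0 for north, 1 for south); the full producer code later in this guide does the same thing.

// Map a rider update to a partition based on its region.
// Assumes the topic was created with exactly 2 partitions.
function partitionFor(update) {
  return update.region === "north" ? 0 : 1;
}

console.log(partitionFor({ rider: "abdullah", region: "north" })); // 0
console.log(partitionFor({ rider: "alice", region: "south" })); // 1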

Important Architectural Points

If there's one consumer, it will consume all partitions of the topic.
When another consumer joins the same group, Kafka automatically redistributes the partitions between the consumers (rebalancing).

If the number of consumers is less than the number of topic partitions, then multiple partitions can be assigned to one consumer in the group: with 4 partitions and 2 consumers, Consumer 1 might get partitions 0 and 1 while Consumer 2 gets partitions 2 and 3.

If the number of consumers is the same as the number of topic partitions, the partition-to-consumer mapping is one-to-one: each consumer gets exactly one partition.

If the number of consumers is higher than the number of topic partitions, the extra consumers sit idle: with 4 partitions and 5 consumers, Consumer 5 receives nothing, which is not effective.

Within a consumer group:
1 consumer can consume multiple partitions
1 partition can only be consumed by at most 1 consumer
(Different consumer groups each receive their own full copy of the topic.)
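You can see this behavior with the consumer.js script built later in this guide (it takes the group name as a command-line argument): two consumers in the same group split the partitions between them, while a consumer in a different group receives every message again.

# terminal 1 and terminal 2: same group, so the partitions are split between them
node consumer.js group-a
node consumer.js group-a

# terminal 3: a different group gets its own full copy of the topic
node consumer.js group-b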

Okay, that was a lot of theoretical knowledge. Now it's time for some hands-on practice.

Example

Prerequisites

Node.js
Docker
VS Code

Run ZooKeeper

docker run -p 2181:2181 zookeeper

Run Kafka

docker run -p 9092:9092 \
-e KAFKA_ZOOKEEPER_CONNECT=<PRIVATE_IP>:2181 \
-e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://<PRIVATE_IP>:9092 \
-e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 \
confluentinc/cp-kafka

<PRIVATE_IP> → your machine's network IP (find it with ifconfig on macOS/Linux or ipconfig on Windows)

Create a node application

mkdir <project_name>
cd <project_name>
npm init -y   #or: yarn init

#install kafkajs
npm install kafkajs

client.js

const { Kafka } = require("kafkajs");

//setting up a kafka client
exports.kafka = new Kafka({
  clientId: "my-app",
  brokers: ["<PRIVATE_IP>:9092"],
});

admin.js

This file handles your Kafka infrastructure: it creates topics and sets the number of partitions for each topic.

const { kafka } = require("./client");

async function init() {
  const admin = kafka.admin();
  console.log("Admin connecting...");
  await admin.connect();
  console.log("Admin Connection Success...");

  console.log("Creating Topic [rider-updates]");
 //creating topic infra and partitioning logic
  await admin.createTopics({
    topics: [
      {
        topic: "rider-updates",
        numPartitions: 2,
      },
    ],
  });
  console.log("Topic Created Success [rider-updates]");

  console.log("Disconnecting Admin..");
  await admin.disconnect();
}

init();

producer.js

This file produces messages to the Kafka topic, which are later consumed by the consumer service.

const { kafka } = require("./client");

//package for taking input from the command line
const readline = require("readline");

const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout,
});

async function init() {
  const producer = kafka.producer();

  console.log("Connecting Producer");
//connecting producer
  await producer.connect();
  console.log("Producer Connected Successfully");

  rl.setPrompt("> ");
  rl.prompt();

  rl.on("line", async function (line) {

    //destructure rider name and location from the command-line input
    const [riderName, location] = line.split(" ");
    //guard: expects "<riderName> <location>", e.g. "abdullah north"
    if (!riderName || !location) return rl.prompt();
    await producer.send({
       //sending information to "rider-updates" topic created by admin.js file
      topic: "rider-updates",
      messages: [
        {
          //route by location: "north" goes to partition 0, everything else to partition 1
          partition: location.toLowerCase() === "north" ? 0 : 1,
          key: "location-update",
          value: JSON.stringify({ name: riderName, location }),
        },
      ],
    });
  }).on("close", async () => {
    await producer.disconnect();
  });
}

init();

consumer.js

This file consumes messages from the Kafka topic it subscribes to. The consumer group name is passed as a command-line argument.

const { kafka } = require("./client");
const group = process.argv[2];

async function init() {
  const consumer = kafka.consumer({ groupId: group });
  await consumer.connect();
  //subscribing to the "rider-updates" topic
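  //fromBeginning: true makes a new consumer group start reading from the earliest offset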
  await consumer.subscribe({ topics: ["rider-updates"], fromBeginning: true });

  await consumer.run({
    eachMessage: async ({ topic, partition, message, heartbeat, pause }) => {
      console.log(
        `${group}: [${topic}]: PART:${partition}:`,
        message.value.toString()
      );
    },
  });
}

init();

Executing Code

Run admin.js

node admin.js
Admin connecting...
Admin Connection Success...
Creating Topic [rider-updates]
Topic Created Success [rider-updates]
Disconnecting Admin..

Run consumer.js

node consumer.js user-1
#here user-1 is the consumer group name

Run producer.js

node producer.js
{"level":"WARN","timestamp":"2024-05-26T09:02:34.450Z","logger":"kafkajs","message":"KafkaJS v2.0.0 switched default partitioner. To retain the same partitioning behavior as in previous versions, create the producer with the option \"createPartitioner: Partitioners.LegacyPartitioner\". See the migration guide at https://kafka.js.org/docs/migration-guide-v2.0.0#producer-new-default-partitioner for details. Silence this warning by setting the environment variable \"KAFKAJS_NO_PARTITIONER_WARNING=1\""}
Connecting Producer
Producer Connected Successfully
> abdullah north
#"abdullah north" is the input given to the producer: abdullah is the
#rider name and north determines the partition this update goes to
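A note on the KafkaJS warning in the producer output above: v2.0.0 changed the default partitioner, which only matters when KafkaJS picks the partition for you. Since producer.js sets the partition explicitly, the warning is harmless here. If you want the old behavior or just want to silence it, something like this should work:

const { Partitioners } = require("kafkajs");
const { kafka } = require("./client");

// Option 1: keep the pre-v2.0.0 partitioning behavior
const producer = kafka.producer({
  createPartitioner: Partitioners.LegacyPartitioner,
});

// Option 2: silence the warning with an environment variable
// KAFKAJS_NO_PARTITIONER_WARNING=1 node producer.js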

Output at consumer.js terminal

user-1: [rider-updates]: PART:0: {"name":"abdullah","location":"north"}
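And if you then type a south-side update at the producer prompt, say alice south, the message should land on partition 1 and the consumer output should look like:

user-1: [rider-updates]: PART:1: {"name":"alice","location":"south"}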