Kafka’s Revolutionary Leap: Transitioning from ZooKeeper to KRaft for Enhanced Real-Time Data Processing

In the realm of real-time data processing, Kafka, developed by Confluent, has garnered a stronghold with a sprawling presence in over 150,000 organizations. However, with rapidly growing data and throughput requirements, the Kafka platform has been facing the heat, primarily due to its dependence on Apache Zookeeper for managing its crucial system metadata. On the quest for a more nimble solution, the architecture now embarks on a transformational journey from Zookeeper to KRaft.

The Achilles Heel: Apache Zookeeper

Where does the problem lie? Critics have identified a significant part of the problem in how Zookeeper operates. According to the Java expertise site Baeldung, ZooKeeper functions entirely independently of Kafka, which exacerbates the system admin’s management dilemmas. It also retards the system’s overall responsiveness.

Distinctively, other distributed systems, like Elasticsearch, have internalized the synchronization aspect. Kafka, however, is unable to monitor the event log and this results in a lag between the controller memory and the ZooKeeper’s state.

As explained by Colin McCabe from Confluent, ZooKeeper stores metadata about the system itself, such as information about partitions. Over time, the number of partitions that users manage has significantly increased, causing a lag in the system’s responsiveness. When a new controller is elected, the partition metadata fed to the nodes also takes more time, slowing down the entire system.

Dissolving the Dependence: The Advent of KRaft

The solution comes in the form of KRaft. Kafka deployments can now maintain hot standbys with KRaft, eliminating the need for a controller to load all the partition data. Underpinning Kafka’s architecture, KRaft is based on a stream metaphor that houses an inflow of changes. This makes it possible to monitor the stream, identify the current position, and effectively catch up if there’s any lag.

The exploration doesn’t end here. Looking to minimize metadata divergence, the idea is to manage metadata itself through this stream process. In simpler terms, a log will be employed to manage streaming changes to the metadata. This ensures a clear ordering of events and the maintenance of a single timeline.

The outcome? KRaft has successfully managed to lower the latency of metadata reads by a factor of 14, meaning that Kafka can recover 14 times faster from any problem. The platform can now store and maintain up-to-date metadata on as many as 2 million partitions.

Stepping Stones: Towards Full KRaft Implementation

The maiden steps to KRaft implementation have been made with Kafka 3.3, but the journey towards fully ditching Zookeeper is a measured one, expected to culminate with version 4 release. By then, users still reliant on ZooKeeper will have to transition to a Bridge Release.

KIP-833, designating Kafka 3.5 as a bridge release, facilitates the migration from ZooKeeper without downtime. The upgrade process involves accentuating new controller nodes and adding functionality to the existing ones. The new KRaft controller will lead the ZooKeeper nodes.

As explained by McCabe, the system will run on the old mode for a while during the transition, allowing for gradual enrollment of brokers. When all brokers are in KRaft mode, the system will function in dual write mode, making it easier to revert to ZooKeeper if required.

With widespread expectations of enhanced performance and streamlined management, the move from ZooKeeper to KRaft is indeed a significant milestone in Kafka’s evolution. The glowing prospects of Confluent’s Kafka are indeed heartening to observe.

Tags: #Kafka, #Confluent, #ZooKeeper, #KRaft, #RealTimeProcessing

Reference Link