Optimizing Kafka Streams: A Guide for Reducing Processing Downtime in Real-Time Data Streams

Streaming platforms like Apache Kafka bring a great deal of efficiency to data processing. However, a common issue many users face is a Kafka Streams application that pauses processing for more than 45 seconds whenever an instance is restarted. In this blog, we will work through why this happens and provide a solution for it.

Table of Contents

  1. A Look into Kafka Consumer Groups
  2. Kafka Rebalance Protocol
  3. Descriptive Scenario: Impact of Kubernetes Pod Replacement
  4. Mitigating Processing Downtimes
  5. Visualizing the ‘leaveGroupOnClose’ Optimization
  6. Pro Tips
  7. Conclusion

A Look into Kafka Consumer Groups

As a brief reminder about Kafka consumers and consumer groups: during a rebalance, consumers do not process any data from partitions that have not yet been reassigned. A rebalance is typically fast, ranging from about 50 ms to several seconds, depending on factors such as the load on your Kafka cluster and the complexity of your Streams topology.
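
For reference, here is a minimal sketch of a consumer joining a group. The broker address, group id, and topic name are illustrative placeholders:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class GroupMember {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-app"); // the consumer group
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Subscribing makes this consumer join the group; partitions are
                // assigned (and reassigned on membership changes) via rebalances.
                consumer.subscribe(List.of("orders"));
                consumer.poll(Duration.ofMillis(500)); // the first poll completes the group join
            }
        }
    }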

Note: Kafka Streams overrides the internal consumer configuration ‘internal.leave.group.on.close’ (referenced in the consumer code as LEAVE_GROUP_ON_CLOSE_CONFIG) to “false” through StreamsConfig.CONSUMER_DEFAULT_OVERRIDES. This differs from a plain KafkaConsumer, which by default sends a LeaveGroup request to the group coordinator when it closes.
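
As a rough sketch, this is roughly what the override looks like inside Kafka Streams (simplified; the real CONSUMER_DEFAULT_OVERRIDES map in StreamsConfig contains additional entries, and internal constants can change between versions):

    import java.util.Map;

    public class StreamsDefaultsSketch {
        // Simplified sketch of Kafka Streams' built-in consumer overrides;
        // only the entry relevant to this post is shown here.
        static final Map<String, Object> CONSUMER_DEFAULT_OVERRIDES =
                Map.of("internal.leave.group.on.close", "false");
    }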

Kafka Rebalance Protocol

Because of that override, a Kafka Streams instance that shuts down does not send a LeaveGroup request to the group coordinator. The coordinator only detects the missing member once its session expires, and the default consumer session timeout is 45 seconds, as per KIP-735. For example, if a pod is stopped at t = 0, the partitions it owned stay assigned to the dead member until roughly t = 45 s, when the session expires and a rebalance finally redistributes them. This is why the application can pause processing for more than 45 seconds.

Descriptive Scenario: Impact of Kubernetes Pod Replacement

For example, suppose your apps are running on Kubernetes, a common route to a highly available deployment. Kubernetes monitors your container health, scales your deployment, and ensures all desired replicas are running according to your specification. During these processes, however, there are instances where a Kubernetes Pod is evicted and replaced. Each replaced pod is a Streams instance that disappears without sending a LeaveGroup request, so its partitions sit idle until the session timeout expires, leading to increased downtime.

Mitigating Processing Downtimes

The challenge here lies in sustaining low-latency, real-time data streams. There are a couple of strategies to reduce processing downtime.

Option 1: Reduce the consumer session timeout by adjusting session.timeout.ms and heartbeat.interval.ms to lower values, as shown in the sketch below. The trade-off is that transient pauses (a long GC, a brief network hiccup) can then trigger spurious rebalances.
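
A minimal sketch of Option 1, with illustrative values (tune them for your workload; heartbeat.interval.ms should stay well below session.timeout.ms, conventionally a third or less):

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.streams.StreamsConfig;

    public class ShortSessionTimeoutConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // illustrative
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
            // Detect dead members after 10 s instead of the default 45 s.
            props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 10_000);
            // Heartbeat often enough to keep the shorter session alive.
            props.put(StreamsConfig.consumerPrefix(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), 3_000);
        }
    }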

Option 2: For stateless applications, enable the ‘leaveGroupOnClose’ behaviour by adding an internal override, as shown in the sketch below.
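
A minimal sketch of Option 2. Keep in mind that internal.leave.group.on.close is an internal, unsupported configuration, so verify it against the Kafka version you run; the application id and broker address are illustrative:

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class LeaveGroupOnCloseConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-stateless-app");  // illustrative
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
            // Internal override: send a LeaveGroup request on close so the broker
            // rebalances immediately instead of waiting for session.timeout.ms.
            props.put(StreamsConfig.consumerPrefix("internal.leave.group.on.close"), "true");
        }
    }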

Visualizing the ‘leaveGroupOnClose’ Optimization

By enabling ‘leaveGroupOnClose’, processing downtime can be significantly reduced. The visualisation below illustrates how this setting shortens the pause during a restart.

[Visualisation: processing downtime during a restart, with and without ‘leaveGroupOnClose’]

Pro Tips

  • Stateless vs. stateful: Set ‘leaveGroupOnClose’ to true for stateless applications only. For stateful applications, an immediate LeaveGroup on every restart triggers a full rebalance, which can move state stores between instances and force costly state restoration from the changelog topics.

  • Kubernetes Deployment .spec.minReadySeconds: Kubernetes uses this setting to ensure a newly started pod has been ready for a minimum period before it is considered available during a rolling update. Too short a value means pods are replaced in quick succession, causing back-to-back rebalances that delay processing and impose extra load on the Kafka cluster.

Conclusion

Setting ‘leaveGroupOnClose’ to ‘true’ triggers rebalancing immediately on shutdown instead of after the session timeout, significantly reducing processing downtime. This property is mainly suited to stateless Kafka Streams apps, making the shutdown process efficient and improving the app’s elasticity and resilience.

Remember: When writing code for Kafka Streams, the version you run can significantly affect behaviour, especially for internal configurations like this one. As of writing this blog, the latest version of kafka-streams was 3.4.1.

#Kafka #DataProcessing #Kubernetes #StreamsApp
