Apache Kafka is a well-known distributed event store and stream-processing platform, recognized for handling large volumes of data at high speed. On top of it sits Kafka Streams, a library designed to simplify building data pipelines and performing high-level operations such as aggregation and joining.
In this blog, we will dive into what we learned about Kafka while building Kestra: leveraging its strengths for stream processing, navigating its limitations, and customizing it to suit our specific requirements.
Why Apache Kafka?
We needed a persistent queue for our application without adding any extra dependencies, and we evaluated numerous candidates: RabbitMQ, Apache Pulsar, Redis, and others. Apache Kafka was the one that stood out, efficiently covering all our project needs.
One major advantage is that Kafka Streams is a plain library: the stream-processing application is embedded directly within our Java application, removing the need to manage a separate processing platform, quite literally taking microservices to the next level.
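To illustrate what that embedding looks like, here is a minimal sketch of a Kafka Streams application running entirely inside a single Java process. The topic names, application id, and broker address are illustrative, not Kestra's actual configuration:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class EmbeddedStreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "embedded-streams-app"); // hypothetical id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // The whole processing topology is built and runs inside this Java process.
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic")
               .mapValues(String::toUpperCase)
               .to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Stop the embedded streams engine cleanly on shutdown.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The only external dependency is the Kafka broker itself; there is no separate processing cluster to deploy or operate.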
Working with Kafka Topics
Kafka isn't a database, so it comes with its own set of constraints. In particular, using the same Kafka topic as both source and destination can be confusing at first.
Consider the topology sketched below: one topic is the source, some branching logic follows, and two separate branches write back to that same topic. The risk of one branch overwriting the value just written by the other becomes evident, ultimately resulting in data loss.
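A rough sketch of that problematic shape; the topic name is hypothetical and the values are plain strings only to keep the example short:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;

class SameTopicTopology {
    static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> executions = builder.stream("executions");

        // Branch 1: rewrites the execution when a task starts.
        executions
            .mapValues(value -> value + " | task started")
            .to("executions");

        // Branch 2: rewrites the same execution when a task result arrives.
        executions
            .mapValues(value -> value + " | task result")
            .to("executions");

        // Both branches write a full copy for the same key back to the source topic,
        // so whichever record lands last silently overwrites the other's update.
        return builder.build();
    }
}
```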
The Custom Joiner for Kafka Streams
To work around this issue, we built a custom joiner for Kafka Streams. It was designed to process the executions and to split the data across multiple topics:
- A topic with the executions (multiple tasks)
- A topic with task results
Our custom joiner has to manually create a state store, merge each incoming task result into the stored value, and always fetch the last value. This ensures that regardless of how many task results arrive in parallel, the execution state is always the latest version.
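Here is a minimal sketch of that idea using the Processor API of recent Kafka Streams versions (3.3+). The store, topic, and type names are simplified placeholders, not Kestra's actual implementation:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

class CustomJoinerTopology {
    static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Manually created store holding the latest execution state per execution key.
        builder.addStateStore(Stores.keyValueStoreBuilder(
            Stores.persistentKeyValueStore("execution-store"),
            Serdes.String(),
            Serdes.String()
        ));

        // Each task result is merged into the stored execution, so no matter how many
        // results arrive in parallel, the store always contains the latest version.
        builder.<String, String>stream("task-results")
            .process(() -> new Processor<String, String, String, String>() {
                private ProcessorContext<String, String> context;
                private KeyValueStore<String, String> store;

                @Override
                public void init(ProcessorContext<String, String> context) {
                    this.context = context;
                    this.store = context.getStateStore("execution-store");
                }

                @Override
                public void process(Record<String, String> taskResult) {
                    String current = store.get(taskResult.key());        // last known execution state
                    String merged = merge(current, taskResult.value());  // custom merge function
                    store.put(taskResult.key(), merged);                 // keep only the latest version
                    context.forward(taskResult.withValue(merged));       // emit the merged execution
                }

                private String merge(String execution, String taskResult) {
                    // Placeholder merge: a real joiner would apply the task result
                    // to the matching task inside the execution.
                    return (execution == null ? "" : execution + " + ") + taskResult;
                }
            }, "execution-store")
            .to("executions");

        return builder.build();
    }
}
```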
Dealing with a Distributed Workload Between Multiple Backends
In Kestra, the scheduler handles both scheduled executions and long-polling triggers (for example, detecting new files on S3 or SFTP), and it has to look up all flows. To avoid making this service a single point of failure, we needed to split the flows between all scheduler instances.
We did this by relying on Kafka's consumer groups, which handle the complexities of a distributed system for us. With a thousand flows and two scheduler instances, each consumer ends up with roughly 500 flows (see the sketch after this list), thanks to Kafka's handling of:
- Heartbeat to detect consumer failure
- Notifications for rebalancing
- Partition assignment, so that each flow is handled by exactly one scheduler at a time
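To make the mechanism concrete, here is a minimal sketch of a scheduler instance joining a shared consumer group. The flows topic name, group id, and the schedule helper are illustrative, not Kestra's actual code:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SchedulerInstance {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Every scheduler instance uses the same group id, so Kafka splits the
        // partitions of the flows topic between them and rebalances on failure.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "scheduler");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("flows"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Only the flows assigned to this instance are scheduled here.
                    schedule(record.key(), record.value());
                }
            }
        }
    }

    private static void schedule(String flowId, String flow) {
        // Placeholder: evaluate the flow's triggers (cron, S3/SFTP polling, ...).
    }
}
```

Running a second instance of this process with the same group id is all it takes for Kafka to redistribute the partitions, and therefore the flows, between the two schedulers.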
Monitoring and Debugging
While working with Kafka Streams, monitoring and debugging can be a real challenge, largely because of consumer lag across the many streams involved. To alleviate this, we chose to deal with only one topic at a time.
This approach helped us minimize network transit and group all streams by source topic.
Throughout this process, we picked up a number of lessons that helped us navigate these challenges and adapt our code to Kafka so that it works well for our use case.
In the end, the experiences and learnings derived from working closely with Apache Kafka and Kestra have been immensely rewarding. If you’re interested in our work and want to learn more, you can find us on GitHub, Twitter, or join our discussions on Slack.
Message us if you found this article helpful or if you have any questions about Apache Kafka.
Tags: #ApacheKafka #Kestra #StreamProcessing #Microservices