Enhancing Stream Processing with Apache Kafka in Kestra Application Development

Apache Kafka is a well-known distributed event store and stream-processing platform, recognized for handling high volumes of data at speed. Built on top of it, Kafka Streams is designed to simplify the creation of data pipelines and to support higher-level operations such as aggregation and joining.

In this blog, we will take a close look at what it is like to work with Kafka while building Kestra: leveraging its strengths in stream processing, navigating its limitations, and customizing it to suit our specific requirements.

Why Apache Kafka?

Faced with the challenge of choosing a persistent queue for our application without adding extra dependencies, we evaluated numerous candidates such as RabbitMQ, Apache Pulsar, and Redis. Apache Kafka was the one that stood out, efficiently catering to all our project needs.

One major advantage of Kafka is that Kafka Streams can be embedded directly within our Java application, removing the need to manage a separate stream-processing platform.

Working with Kafka Topics

Kafka is not a database, so it comes with its own set of constraints. In particular, using the same Kafka topic as both source and destination can be confusing at first.

Consider a topology that has one topic as the source, some branching logic, and two separate processes writing to the same destination topic. Here, the risk of one process overwriting the other's previous value becomes evident, ultimately resulting in data loss.
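To make the risk concrete, here is a minimal sketch in plain Java (a `HashMap` stands in for the destination topic's latest-value-per-key view; none of this is Kestra's actual code). Two branches read the same record, each updates a different part, and both write the full record back: the second write silently discards the first branch's change.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of the overwrite risk: both branches write the
// whole record to the same destination, so last-write-wins loses data.
public class OverwriteRisk {
    public record Execution(String state, int taskResultCount) {}

    static Execution simulate() {
        Map<String, Execution> destinationTopic = new HashMap<>();
        destinationTopic.put("exec-1", new Execution("CREATED", 0));

        // Both branches read the SAME original record.
        Execution seenByA = destinationTopic.get("exec-1");
        Execution seenByB = destinationTopic.get("exec-1");

        // Branch A advances the execution state.
        destinationTopic.put("exec-1", new Execution("RUNNING", seenByA.taskResultCount()));
        // Branch B records a task result, overwriting branch A's write.
        destinationTopic.put("exec-1", new Execution(seenByB.state(), seenByB.taskResultCount() + 1));

        // The RUNNING state from branch A is gone: data loss.
        return destinationTopic.get("exec-1");
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

The fix is not to make the writes atomic but to restructure the topology so the two results never race on the same record, which is what the custom joiner below does.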

The Custom Joiner for Kafka Streams

To combat this issue, we designed a custom joiner for Kafka Streams. It processes the executions and splits the data across multiple topics:

  • A topic with the executions (multiple tasks)
  • A topic with task results

Our custom joiner manually creates a store, incorporates a merge function, and fetches the last value. This ensures that no matter how many task results arrive in parallel, the stored execution state is always the latest version.
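The shape of that joiner can be sketched as follows. This is a simplified stand-in, not Kestra's actual API: a `HashMap` plays the role of the manually created Kafka Streams state store, and `merge` folds each incoming task result into the last stored execution before writing it back, so the store always holds the latest version.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of the custom-joiner idea: a manually managed
// store plus a merge function that always reads the last value,
// merges the new task result in, and stores the updated execution.
public class ExecutionJoiner {
    public record Execution(String id, List<String> taskResults) {}

    // Stand-in for a Kafka Streams KeyValueStore.
    private final Map<String, Execution> store = new HashMap<>();

    // Merge an incoming task result into the last stored execution.
    public Execution merge(String executionId, String taskResult) {
        Execution last = store.getOrDefault(executionId, new Execution(executionId, List.of()));
        List<String> results = new ArrayList<>(last.taskResults());
        results.add(taskResult);
        Execution merged = new Execution(executionId, List.copyOf(results));
        store.put(executionId, merged); // the store always keeps the latest version
        return merged;
    }
}
```

Because every update goes through the same read-merge-write path against the store, two task results arriving for the same execution can no longer overwrite each other the way the two topology branches did.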

Dealing with Distributed Workload Between Multiple Backends

In our application, Kestra, a scheduler looks up all flows, whether through scheduled execution or a long-polling mechanism (detecting files on S3 or SFTP). To avoid a single point of failure in this service, we needed to split the flows between all running scheduler instances.

We did this by relying on Kafka's consumer groups, which take on the heavy parts of a distributed system for us. With a thousand flows split across two schedulers, each consumer handles roughly 500 flows, thanks to Kafka's handling of:

  • Heartbeat to detect consumer failure
  • Notifications for rebalancing
  • Ensuring exactly-once semantics for a topic
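The effect of that assignment can be illustrated with a small sketch (again plain Java, not Kestra's code, and a deliberate simplification of Kafka's real assignors): each flow is hashed to a partition, partitions are divided among the group's members, and with two consumers a thousand flows end up split roughly in half. In reality Kafka performs this assignment, plus the heartbeating and rebalancing, on our behalf.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of how a consumer group splits a workload: flows hash to
// partitions, and partitions are divided among the group's consumers.
public class FlowAssignment {
    static int partitionFor(String flowId, int partitions) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(flowId.hashCode(), partitions);
    }

    // Simplified assignment: partition p is owned by consumer p % consumers.
    static Map<Integer, Integer> countFlowsPerConsumer(List<String> flows, int partitions, int consumers) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (String flow : flows) {
            int consumer = partitionFor(flow, partitions) % consumers;
            counts.merge(consumer, 1, Integer::sum);
        }
        return counts;
    }
}
```

If a consumer dies, Kafka detects the missed heartbeats and rebalances its partitions to the survivors, so the remaining scheduler transparently picks up all the flows.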

Monitoring and Debugging

While working with Kafka Streams, monitoring and debugging can be a real challenge because of consumer lag across streams. To alleviate this, we chose to process only one topic at a time.

This approach helped us minimize network transit and group all streams by source topics.

Throughout this process, we learned some notable tips that helped us navigate our challenges. We were able to adapt our code efficiently to Kafka and make it work well for our use case.

In the end, the experiences and learnings derived from working closely with Apache Kafka and Kestra have been immensely rewarding. If you’re interested in our work and want to learn more, you can find us on GitHub, Twitter, or join our discussions on Slack.

Message us if you found this article helpful or if you have any questions about Apache Kafka.

Tags: #ApacheKafka #Kestra #StreamProcessing #Microservices


Efficient Stream Processing with Apache Kafka, Apache Flink in Confluent Cloud

In today’s vast digital landscape, big data concepts have revolutionized the methods we use to handle, process, and analyze information. Streams of data generated every second provide invaluable insights into various aspects of our online lives. Apache Kafka and Apache Flink are two major contributors in this realm. Confluent, which offers a fully managed streaming service based on Apache Kafka, combines the advantages of Kafka with the capabilities of Apache Flink.

Deliver Intelligent, Secure, and Cost-Effective Data Pipelines

Apache Flink on Confluent Cloud

Apache Flink was recently made available on Confluent Cloud, initially as a preview in select AWS regions. Flink has been re-architected as a cloud-native service on Confluent Cloud, further enhancing the capabilities the platform offers.


Event-Driven Architectures with Confluent and AWS Lambda

When adopting event-driven architectures with AWS Lambda, integrating Confluent can provide multiple benefits. To get the most out of this combination, understanding the best practices is crucial.

To Be Continued…

Tags: #ApacheKafka, #ApacheFlink, #ConfluentCloud, #StreamProcessing
