Flink vs Kafka Streams/ksqlDB: Comparing Stream Processing Tools

1 day ago

cnfl.io/podcast-episode-217 | Stream processing can be hard or easy depending on the approach you take, and the tools you choose. This sentiment is at the heart of the discussion with Matthias J. Sax (Apache Kafka® PMC member; Software Engineer, ksqlDB and Kafka Streams, Confluent) and Jeff Bean (Sr. Technical Marketing Manager, Confluent). With immense collective experience in Kafka, ksqlDB, Kafka Streams, and Apache Flink®, they delve into the types of stream processing operations and explain the different ways of solving for their respective issues.
The stream processing tools they consider are Flink along with the options from the Kafka ecosystem: Java-based Kafka Streams and its SQL-wrapped variant, ksqlDB. Flink and ksqlDB tend to be used by divergent types of teams, since they differ in terms of both design and philosophy.
Why Use Apache Flink?
The teams using Flink are often highly specialized, with deep expertise, and with an absolute focus on stream processing. They tend to be responsible for unusually large, industry-outlying amounts of both state and scale, and they usually require complex aggregations. Flink can excel in these use cases, which potentially makes the difficulty of its learning curve and implementation worthwhile.
Why Use ksqlDB/Kafka Streams?
Conversely, teams employing ksqlDB/Kafka Streams require less expertise to get started and also less expertise and time to manage their solutions. Jeff notes that the skills of a developer may not even be needed in some cases; those of a data analyst may suffice. ksqlDB and Kafka Streams seamlessly integrate with Kafka itself, as well as with external systems through the use of Kafka Connect. In addition to being easy to adopt, ksqlDB is also deployed in production stream processing applications requiring large scale and state.
There are also other considerations beyond the strictly architectural. Local support availability, the administrative overhead of using a library versus a separate framework, and the availability of stream processing as a fully managed service all matter.
Choosing a stream processing tool is a fraught decision partially because switching between them isn't trivial: the frameworks are different, the APIs are different, and the interfaces are different. In addition to the high-level discussion, Jeff and Matthias also share lots of details you can use to understand the options, covering deployment models, transactions, batching, and parallelism, as well as a few interesting tangential topics along the way such as the tyranny of state and the Turing completeness of SQL.
EPISODE LINKS
► The Future of SQL: Databases Meet Stream Processing: cnfl.io/the-future-of-sql-epi...
► Building Real-Time Event Streams in the Cloud, On Premises: cnfl.io/real-time-event-strea...
► Kafka Streams 101 course: cnfl.io/kafka-streams-101-epi...
► ksqlDB 101 course: cnfl.io/ksqldb-101-episode-217
► Kris Jenkins’ Twitter: / krisajenkins
► Join the Confluent Community: cnfl.io/confluent-community-e...
► Learn more with Kafka tutorials, resources, and guides: cnfl.io/confluent-developer-e...
► Live demo: Intro to Event-Driven Microservices with Confluent: cnfl.io/event-driven-microser...
► Use PODCAST100 to get $100 of free Confluent Cloud usage: cnfl.io/try-cloud-episode-217
► Promo code details: cnfl.io/podcast100-details-ep...
TIMESTAMPS
0:00 - Intro
2:06 - The world of stream processing
6:26 - Flink vs ksqlDB
18:34 - Example use case
20:03 - SQL was built for static data
25:51 - Concept of event time
29:30 - Session-based window joins
35:47 - Processing streaming data with SQL
39:47 - Scaling Kafka Streams/ksqlDB
45:39 - Exactly-once semantics
48:15 - Choosing stream processing tools
53:52 - It's a wrap
ABOUT CONFLUENT
Confluent is pioneering a fundamentally new category of data infrastructure focused on data in motion. Confluent’s cloud-native offering is the foundational platform for data in motion - designed to be the intelligent connective tissue enabling real-time data, from multiple sources, to constantly stream across the organization. With Confluent, organizations can meet the new business imperative of delivering rich, digital front-end customer experiences and transitioning to sophisticated, real-time, software-driven backend operations. To learn more, please visit www.confluent.io.
#streamprocessing #ksqldb #apachekafka #kafka #confluent

COMMENTS: 23

@benjinguyen9965 6 months ago

The dude in the middle is brilliant! Asks the correct questions for the uninitiated!

@havokgames8297 2 months ago

He has his own podcast channel and it is fantastic: www.youtube.com/@DeveloperVoices

@FnordFandango 1 year ago

This was an excellent episode. KrisJ - I really like your host/interviewing style. This was an interesting topic and very well presented.

@krisajenkins 1 year ago

Thanks! Glad you enjoyed it. 😊

@flyaruu 1 year ago

Oh, liked this one! For Kafka Streams/ksqlDB *everything* is about Kafka: all input and all output move through a single Kafka cluster. That has bitten me a few times, and Flink is more flexible there: you can read from one cluster and write to another. Or join data from different clusters. Or read data from a cluster you only have read access to.

@mateuszkopij4120 1 year ago

Great debate, thanks for sharing.

@hellenjiang1004 1 year ago

Excellent video, Kris

@affaffofa 1 year ago

OMG, it really worked. Thank you so much!!

@jdang67 1 year ago

Where do you persist that state? How easy is it to share that state when you move from one Kubernetes cluster to a new one? Currently, I persist state in Redis.

@themaninjork 1 year ago

Nice Episode!

@rogers2934 1 year ago

Kstreams is my favourite simply because of the deployment model, as long as I already have a Kafka cluster. If the ecosystem does not use Kafka and uses AWS Kinesis, I would choose Flink.

@keja0 8 months ago

ksqlDB is the 🐐!

@mikiallen7733 1 year ago

What about latency? Does it perform as well as the other options on the market, if not better, at a fraction of the cost? Would you provide some benchmark numbers relative to other candidate streaming languages/frameworks? Targeted use case: streaming large financial datasets in many formats (text, integer, float, etc.). Your input is highly appreciated.

@AP-eh6gr 1 month ago

13:41 the big takeaway as to why/when Flink vs Kstreams

@stamyztekrt340 1 year ago

BEEST!!!

@podunkman2709 1 year ago

Flink is not as advanced a product as you present it. It is more like libraries and scripts for creating software than software itself. In Flink you cannot do many trivial things that you normally do with data. Flink also changes drastically from version to version and is not compatible with previous ones. The documentation is unclear. Flink disappointed me a lot.

@MisterKanisterXXX 1 year ago

Sorry I think that's nonsense. While KStreams might be easier to use, you might hit a wall with a problem you could only solve in Flink, not the other way around. As they said in the video.

@pawar2946 1 year ago

well, tNice tutorials is going to take forever...

@7912249 1 year ago

Using SQL syntax in a streaming application makes things even worse. How do you test ksql together with Kafka Streams? They just belong to two different worlds. The idea of enabling non-Java developers to work with Kafka will fail in the end. If someone can't even write Java code, they are definitely not qualified to develop or handle the complexity of such streaming applications.

@djl3009 1 year ago

That is an "interesting" take on what qualifies someone to work with the complexity of streaming applications :)

@NikolasHonnef 1 year ago

Idk what you are talking about regarding "two different worlds". All ksql queries are converted to Kafka Streams processes afaik, so they are literally "the same world". SQL syntax is just domain specific "code" that hides some complexity and implementation details behind abstractions. Also, I think pretty much all devs can or could write their stream code in Java, it's just a matter of preference or not wanting to add another language to your project.

@7912249 1 year ago

@NikolasHonnef Yes, ksql under the hood is just kafka-streams. But just imagine you're building a microservice event sourcing system. One service needs to collect and transform multiple data sources. If kafka-streams can cover everything within one Kafka Streams topology, why should I use ksql additionally, and how would you write your test code? You will also need a separate deployment if you're using Kubernetes. In the end, ksql will bring you more operational overhead. Other frameworks like Flink or Spark all have a built-in SQL-like high-level API, and it combines well with the low-level part. For testing or deployment you maintain only one codebase instead of handling the SQL part separately. So based on that, I don't think it's only a matter of preference...

@7912249 1 year ago

@djl3009 I don't know either. But if someone can't even handle CSV or database ETL, they will definitely have more problems with streaming data, because they can't catch one record from the running stream, manipulate it manually, and put it back into the stream ;-)