ETL Is Dead, Long Live Streams: real-time streams w/ Apache Kafka

275,463 views

InfoQ

7 years ago

Neha Narkhede talks about LinkedIn's experience moving from batch-oriented ETL to real-time streams using Apache Kafka, and how the design and implementation of Kafka were driven by the goal of acting as a real-time platform for event data. She covers some of the challenges of scaling Kafka to hundreds of billions of events per day at LinkedIn, supporting thousands of engineers, etc.
Download the slides & audio at InfoQ: bit.ly/2ldN6P0
This presentation was recorded at QCon San Francisco 2016.

COMMENTS: 105
@cenai61983 5 years ago
Very good introduction to streaming ETL architecture and Kafka. Misleading title, though: streaming ETL is just another way of implementing ETL. Traditional batch-oriented ETL doesn't have to be totally replaced by streaming ETL.
@IntrepidClown 5 years ago
Introduction to Kafka really starts at 17:36.
@tonybernoulli7859 4 years ago
Comments like this are helping this world become a better place
@mwandulu 4 years ago
At a speed of 1.25 too
@ch012 4 years ago
@mwandulu You can go to 1.5 too with very little difference. :)
@bernardlowe5433 4 years ago
For me the whole talk was pretty good. See no reason to skip.
@oluwoleoyekanmi6052 3 years ago
No reason to skip. The preamble puts things into context.
@niranchanadevirajmohan3232 3 years ago
This was a well-thought-out presentation: a brief introduction to existing systems and their limitations, then a transition to the need for Kafka, the way it is designed, and how Kafka addresses those limitations. Good one.
@filipedelbel 6 years ago
Very clarifying explanation of Kafka; it helped me a lot to understand the concept.
@smyk1975 7 years ago
Great architecture overview of Kafka Streams. Convinced me to look deeper into the Streams API and capabilities.
@gcbzzzz 5 years ago
"event and batch have tradeoffs. now ignore the trade offs and try to use streams for everything" :/
@jocalvo 5 years ago
ETLs are not dead, they just transformed. The KEY is not Apache Kafka, the key is DATA ARCHITECTURE; otherwise it will add more mess.
@MrNau007 5 years ago
2 observations: 1) History of ETL: it missed the entire evolution of the data warehouse from MIS systems. 2) The example of old and new "T": you applied "remove PII fields" at the streaming platform. Who will identify which common transformations have to be applied at the streaming platform? One benefit is one higher level of abstraction.
@sharathchandra5314 5 years ago
Nice presentation... I would like to know what vendors of ETL tools like Informatica, DataStage, etc. have to say about their products in light of this briefing, because these two are quite busy coming up with new versions.
@abobakrnasr9814 3 years ago
Wonderful talk... thank you so much Neha for the presentation.
@renatoalencar4451 5 years ago
So many angry comments. It's just an attractive title, not an actual PhD thesis.
@babylon_bob 5 years ago
I think instead of saying ETL is dead, just say "I have no clue". I've never in my life recreated two streams to process the same data into different destinations (12:59). I'd do exactly the same as at (13:29), but with ETL tools.
@prashanthtalla 5 years ago
I agree. If step 2 is the same for both destinations, why would you repeat it for Cassandra? You'll just add that destination to the load logic of the existing ETLs. I wish the speaker had given a better example of where we may end up doing this and how streaming could have helped. I believe streaming is advantageous from a cost perspective (ETL tools are super expensive), and for real-time, very large volumes ETL tools cannot scale. I'm also not sure if streaming really solves this problem; I've yet to work on streaming technologies.
@sanchitkumar9862 5 years ago
Absolutely true. No one is dumb enough to run the computation twice when we have the option of adding the data to multiple destinations.
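For readers following this thread, here is a minimal sketch of the pattern being debated: apply a shared transform (for example, "remove PII fields") once, write the result to a log, and let each destination read that log at its own pace. The `Log` class is a toy stand-in written for this illustration and is not the Kafka API; the field names in `PII_FIELDS` and the consumer names "warehouse" and "cassandra" are assumptions.

```python
from dataclasses import dataclass, field

PII_FIELDS = {"email", "ip"}  # hypothetical field names, for illustration only


def scrub_pii(event: dict) -> dict:
    """Shared transform: drop PII fields once, upstream of every destination."""
    return {k: v for k, v in event.items() if k not in PII_FIELDS}


@dataclass
class Log:
    """Toy append-only log standing in for a topic (not the Kafka API)."""
    records: list = field(default_factory=list)
    offsets: dict = field(default_factory=dict)  # consumer name -> next index

    def append(self, record: dict) -> None:
        self.records.append(record)

    def poll(self, consumer: str) -> list:
        """Return records this consumer has not seen yet and advance its offset."""
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:]
        self.offsets[consumer] = len(self.records)
        return batch


raw_events = [
    {"user": "u1", "email": "a@example.com", "page": "/home"},
    {"user": "u2", "ip": "10.0.0.1", "page": "/jobs"},
]

clean = Log()
for event in raw_events:
    clean.append(scrub_pii(event))  # the transform runs exactly once per event

# Two destinations read the same cleaned stream independently.
warehouse_rows = clean.poll("warehouse")
cassandra_rows = clean.poll("cassandra")
```

Whether this beats adding a second destination to an existing ETL job's load step is exactly what the commenters disagree on; the sketch only shows that the transform itself does not need to run twice in either design.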
@msambare 6 years ago
Great presentation and talk. Now I want to explore streaming platforms in detail.
@chakrapanireddy1358 7 years ago
Really helpful.. Nice explanation..
@Ravi86055 6 years ago
Great content... useful information
@susmitdey9172 3 years ago
ETL and EAI probably address different problems compared to streaming. Practically, as I see it, streaming is more about using the capabilities of the platform to integrate rather than using a tool to do ETL or real-time integration; it addresses the data transfer logic, so we can avoid tools. Correct me if I'm wrong.
@flynntsang 6 years ago
This is an intelligent and articulate overview of how Kafka in particular manages increasing volume, velocity and variety of "big data" using real-time streams. It may not resonate with everyone; not everyone needs this. Excellent for those getting started with streaming data and transitioning away from messaging queues or redundant ETL processes.
@im2crazyin 1 year ago
Very informative, precise and to-the-point introductory talk on data streams. It gives enough information that one knows why and when to look for streaming solutions, and which specific areas to dig into once they decide to go for such a solution.
@arunasjunevicius533 6 years ago
Really? Data integration and application integration are not the same. ETL and EAI solve two totally unrelated problems. And how can one say that MQ does not scale when, if one wants to scale, one can choose DDS or whatever different messaging technology?
@audreymciver4863 5 years ago
All principles should be implemented in any streaming data to be in compliance at all times.
@dx4816 4 years ago
The "messy" diagram can simply be redrawn to match the Kafka-based diagram. Lots of good information, but the real differentiator is not the integration patterns. Anyway, Kafka is a great product.
@chandraprakashmatam672 5 years ago
Though it is not a paradigm shift, the approach given here eventually leads to a modern EDW with real-time streams.
@IliaTernovich 6 years ago
31:56 link to video please. Unfortunately can't hear names clearly
@MrLyonliang 5 years ago
Thanks a lot for explaining clearly about: what happened yesterday, what's the pain point, what's the new requirement, and HOW.
@ericpham6192 4 years ago
Can parallel processing in bandwidth fill multiple packages help in big data and distributed database and buffering work in hand in hand help in streaming. Also 3d volume fill data storage and extracting data format
@navalsaini 5 years ago
A very well structured talk. Thanks for it. :-)
@ericpham6192 4 years ago
Share distributed processing by using a percentage of idling resources in a cloud sharing processing network
@manojsembekar5703 5 years ago
great thanks for information..
@sreeRocksRocks 6 years ago
Great video; it gave an overall idea of what Kafka is and how to play with it in real use cases. Excellent and kudos!
@tansudasli 5 years ago
Data integration and service integration layers are handled by different products on the market. That's the main problem, and it is good to see them converge. That's why Kafka is in the spotlight. This convergence brings organizational effectiveness to the enterprise, because you can now combine BI's ETL team and the middleware team and get holistic integration capabilities, which also creates an advantage for transformation. On the other hand, scalability is a relative concept. In an enterprise, EAI or ESB is scalable. ETL is batch-oriented, but it is feasible for an enterprise's near-real-time concerns.
@jersute 6 years ago
the T in ETL has nothing to do with scrubbing ('data cleaning') or normalization. if you're using ETL to scrub you're already too late in the pipeline and using a hammer as a screwdriver when you want a paintbrush. it's gibberish. ETL is for data snapshots to move between environments where you want only a subset of the data but it is transactionally stable. ETL is how you leave the house. Kafka is the road you drive on to deliver the payload from said house. different topics. Kafka should be viewed as a simd replacement for amqp/zmq or as she has presented it a comparison vs elk for log processing as a limited use case. the streams discussion should be compared with apache storm for analytical capability or a distributed replacement for memcached performance counters. local state is a poor way of saying cache locality and migration. this talk is all over the place. no mention of the problem of dealing with subaggregation and priority dependency issues inherent in kafka/storm without explicit payload tagging or reentrant use of the architecture in general as befits any simd speedup discussion. if you are familiar with the concepts of noshared architectures for data presentation and want a messaging solution with the same principles then kafka may interest you. do not expect magic.
@RicardoMontee 4 years ago
7:30 "ETL (Extract Transform Load) and EAI (Enterprise Application Integration) are outdated"
@veerun3104 6 years ago
ETL is not only meant for data integration.. what about business intelligence and analytics apps..
@8Trails50 4 years ago
I think they are saying ETL in the form of ingestion of data INTO some tool, not Spark or Hadoop jobs. In that case you could just subscribe to Kafka.
@anuragakella 4 years ago
Awesome presentation skills and a clear explanation of ETL changing from batch to real-time
@michalmefli 6 years ago
Great talk.
@anant3104 5 years ago
Great and it is very helpful, thank you
@xfactor740501 4 years ago
Great presentation. She makes it look simple... Does anyone know the program used to create the presentation? I like the look, as though it was drawn "free-hand"... very sharp
@kevinshoang 3 years ago
April 2021, Batch is still more popular than stream.
@gauravaithmia 3 years ago
First 15 minutes are more like a pitch deck dumbed down for a VC.
@piggybox 3 months ago
6 years later, ETL is still alive
@allanhouston22 4 years ago
Kafka is not an ETL replacement; it is a streaming/message broker. ETL is a platform that offers adapters for receiving and writing data from/to multiple source/destination types (files, DBs, queue systems); it's a centralized mapper tool (say XMLCVS), and supports various integration patterns (best practices). So ETL can typically be used to read/write from/to Kafka while it is performing mapping so that the destination system understands what the source system is trying to send, in real time. EAI systems, another platform type she mentioned, are written specifically for event/real-time purposes, so they are a more suitable platform for such work, as they support transactional behavior and unified monitoring of what is flowing through them, in addition to adapters and centralized mapping. How this woman managed to compare oranges and apples without receiving more down votes is beyond me.
@void0818 6 years ago
E --(k)-- T --(k)-- L: this is where Kafka fits in ETL. ETL will never die, but Kafka is a good stream to use in ETL processes.
@MM-zd8sx 4 years ago
Very helpful video! Quality content and great presentation. Stellar job
@zisispontikas2038 5 years ago
36:50 come on. You just took the stream processing Java app and the dashboard app and put them inside one application. So the database is inside Kafka and the job processing and dashboard are merged. There should have been 2 boxes, not 1.
@davidk7212 10 months ago
Not all data is big data, and all data will never be all big data. There will always be a huge place for standard ETL.
@allmhuran 5 years ago
ETL is outdated? That's news to any company that has no need to process terabytes of data in real time. This is the problem with keynotes from super giant companies. They only speak from the perspective of a super giant company. The overwhelming majority of enterprises do not have scale problems in this category, but people from such companies walk out of the keynotes thinking "yeah, this is what we should do!". No, you probably shouldn't.
@wennwenn1422 5 years ago
this butthurted all ETL folks..
@md.mottakinchowdhury7898 2 years ago
Misguiding. Why would batch processing be dead if it is just enough to do batch processing of your data?
@blobbyflobby6752 3 years ago
ETL is dead. Long live ETL!
@pajeetsingh 1 year ago
Just use dma.
@audreymciver4863 5 years ago
im only using this to identify any hackers uploading anything of any kind. number one it was without my permission. hacking is a federal offense.and it violates my privacy rights.
@saurabh3614 6 years ago
This is not at all comparable; both are meant for different purposes. I doubt she has ever looked at DWH code and design. And I bet you can't show me one single implementation which includes a complete fact table design to solve a customer's business problem.
@Ranjan316 5 years ago
Saurabh u are right, if u look at her work history she worked for just 1 company (LinkedIn) and took Kafka out as a new company, she is trying to just make money out of that... she has no idea why facts and dimensions are needed. You add any stream, someone needs to transform it into data which data analysts or data scientists can use.
@robinsoncarter3432 6 years ago
hello you use chroma key in this video?
@sanchitkumar9862 5 years ago
It's a very harsh statement to say ETL is dead. No, ETL is not dead.
@dantepraxedis 4 years ago
catchy title
@nareshgb1 6 years ago
elsewhere: ukposts.info/have/v-deo/bHOchpuupIifs5c.html
@pajeetsingh 1 year ago
By the 5:00 mark you'd figure out all the shenanigans regarding streams, data integration and why these corporate tech lords created Kafka. Good presentation.
@vikramachandranselvakumar6316 5 years ago
The speaker has no inkling of what ETL is or what a data warehouse is and how they are architected, designed, developed, provisioned and sustained. Apache Kafka is a great open source tool for integrating streaming data into your data lake and is not a paradigm that will replace the technology-agnostic paradigm named ETL. I have used Spark SQL to realize an ETL-based solution. Again, Spark SQL is a tool and not a paradigm.
@podunkman2709 4 years ago
You confuse two loosely connected areas. Kafka is NOT the successor to ETL. ETL is a completely different group of products with a completely different application. Kafka may be the next generation of ESB. In addition, you must know that in the vast majority of companies around the world their "event-driven architecture" is MS Excel. Why do companies like Google or Facebook have their power? Because they are really unique. Meanwhile most companies do things like 20 years ago. For them ETL is a miracle. They do not need any Kafka. It's beyond their perception.
@dataguygamer 5 years ago
Trolling title... I'm not sure if the speaker would approve of this title. It opens her idea to ridicule
@chandanjha3205 4 years ago
It was a nice presentation, but the majority of data generated by user actions is still stored in databases (SQL, Oracle), and thus ETL tools like SSIS are still needed to read it and send processed data to destinations. Some data could be in flat files, but that is not seen too often these days unless we are gathering from multiple public sources. Whenever I try to read into the minds of speakers in YouTube presentations to see why they are using Kafka or Spark, all they give is an example of 'word count', which is sad. Take Spark: sure, it can do distributed computing, but so can a lot of other tools if you have an array of cheap servers.
@chrisl.9750 3 years ago
ETL is not dead and if you want to be taken seriously in the world of data, I recommend you drop this suggestion...
@VoxNerdula 5 years ago
I vant to try her curry
@MelvinStudios 3 years ago
Do you even know what "dead" means? ETL is used in many companies. Therefore ETL is not dead. Floppy disk is dead.
@debashishroy3485 6 years ago
I think you are HR rather than technical... from your scrap it is clear that you know neither Hadoop nor ETL
@NothingMatress 6 years ago
Think again.
@Ranjan316 5 years ago
She is clueless i am shocked she is even allowed to talk at a summit
@IA-xh5ly 6 years ago
From what I’ve heard from this lady, I’m assuming she has very little experience in ETL development (manual validation, for example); she just follows the modern fashion.
@Ranjan316 5 years ago
Igor Andriychuk yup, lets see how much longer silicon valley supports such scam artists in the name of VC funding....
@Yi5Zhou 3 years ago
you don't have to use this kind of name to attract viewers
@attilaviniczai7215 3 years ago
I love how Americans can make acronyms out of the most important words in a title and just assume everyone knows what they abbreviate. It always amazes me how they try to get thoughts across to an audience with a bunch of these 3-letter, context-specific magic words flying around.
@KC-zn4gt 6 years ago
It's a shame that someone who knows just one tiny topic thinks she knows how it works and applies to everything. On the final diagram there is an icon of a DWH; I wonder how she explains how that DWH gets populated without ETL. Oh... she probably thinks it is ready-made and available for her to stream from. lol.
@onlyitj 6 years ago
You would subscribe to multiple topics and process those messages using the Streams API, which can potentially do the job.
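A rough sketch of what this reply describes, assuming two hypothetical topics (`page_views` and `purchases`) that feed one warehouse-style fact table. It uses plain Python lists and `heapq.merge` in place of real topic subscriptions and does not use the Kafka Streams API; every name and value here is illustrative.

```python
import heapq
from collections import defaultdict

# Hypothetical topics: each is a list of (timestamp, event) pairs in time order.
page_views = [(1, {"user": "u1", "page": "/home"}), (4, {"user": "u2", "page": "/jobs"})]
purchases = [(2, {"user": "u1", "amount": 30}), (3, {"user": "u1", "amount": 12})]


def subscribe(*topics):
    """Yield (timestamp, topic_index, event) from several topics in timestamp order."""
    tagged = (
        [(ts, i, event) for ts, event in topic] for i, topic in enumerate(topics)
    )
    # Merge on timestamp only, so event dicts are never compared with each other.
    yield from heapq.merge(*tagged, key=lambda record: record[0])


def build_fact_table(stream):
    """Fold the merged stream into one row per user (a toy fact table)."""
    facts = defaultdict(lambda: {"views": 0, "revenue": 0})
    for _ts, topic_index, event in stream:
        row = facts[event["user"]]
        if topic_index == 0:      # first topic passed in: page views
            row["views"] += 1
        else:                     # second topic passed in: purchases
            row["revenue"] += event["amount"]
    return dict(facts)


fact_table = build_fact_table(subscribe(page_views, purchases))
```

The fold still performs a transform and a load, which is the point several commenters make: the "T" and "L" do not go away, they move into a continuously running consumer.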
@darshansangodkar6173 6 years ago
I wish this presentation was given by some techie guy.
@tinameh 4 years ago
Darshan Sangodkar really? I’d like to actually hear your tech talk some day. Do pick a deeply technical topic please. And an original one while you’re at it. If you struggle with that though, drop me a word. Happy to share some tips.
@atulavhad1661 4 years ago
@tinameh I guess many are not aware that she was among the people who built Kafka. I have seen her other talks and found them enlightening, and she also built a unicorn startup.
@debashishroy3485 6 years ago
bullshit ...I don't know which platform give these people to open their mouth even they don't have clear knowledge this shows the quality of Indian IT managers and Leaders
@Ranjan316 5 years ago
Completely agree, Kafka is nice technology but this person doesn’t seem to have any idea about enterprise architecture or the problems ETL tries to solve...
@nguyen4so9 7 years ago
Crap talks. ETL is a concept that is always there.
@TheEnfernuz 6 years ago
She doesn't deny it in the talk actually. She says that batch ETL is dead / outdated, and that streaming ETL is now the way to go. Though I agree that part of the title is a bit misleading.
@temaz3334 6 years ago
Shitty comment. U dont even understand what she is talking about.
@flipper71100 6 years ago
People always have a tendency to resist change; as a result, they don't listen carefully
@jcrshankar 5 years ago
she mentioned etl tools not concept
@Ranjan316 5 years ago
Shankar K the title says ETL is dead, she is dumb as a rock....
@20cmusic 5 years ago
2018. ETL is still alive. I really hate this kind of marketer style shitty title.
@rajeshn5829 5 years ago
U r relly pretty
@b4bhanu 5 years ago
click bait title... kafka is great but this talk is a disaster
@ShivaKumar-ps1vh 3 years ago
Not worth......
@jianhuang7993 6 years ago
This talk is a disaster
@danpal6737 5 years ago
Rubbish material, holy cow letter from india
@MalleusDei275 1 year ago
Lol, A silica nigre.... 😉
@MalleusDei275 1 year ago
Your mum should have advice to you for dont play with the Hammer... Yes, Tyrannosaurus burgers were greeeeeeat.
@msftora3 2 months ago
BullSsssssssst