Data Pipeline Frameworks: The Dream and the Reality | Beeswax

  34,515 views

Data Council


1 day ago

Get the slides: www.datacouncil.ai/talks/data...
ABOUT THE TALK:
There are several commercial, managed service and open source choices of data pipeline frameworks on the market. In this talk, we will discuss two of them: the AWS Data Pipeline managed service and the open source software Airflow. These frameworks have very different feature sets and operational models; however, they have both benefited us and fallen short of our needs in similar ways.
To understand the reasons, we analyze our experience of first building a data processing platform on Data Pipeline, and then developing the next generation platform on Airflow. We find that both the managed service and the open source framework are leaky abstractions, and thus both required us to understand and build primitives to support deployment and operations.
Likewise, we discuss the necessity of implementing cross-cutting aspects such as logging, monitoring, security and configuration, which arises from the shortcomings of existing pre-implemented components. Generalizing from specific pain points and solutions, we posit that almost any organization building a data platform using a pipeline framework or service will run into many of the same issues, because opinionated framework/service implementations will conflict with an organization's existing code, preferences and procedures.
So where is the line? What value can you expect to get from a data pipeline framework or service? What will you need to wrap, integrate with or fully implement yourself? To develop a robust data pipeline platform for your organization, you will need to bridge the gap between the framework dream and production reality. This talk will help you do that.
ABOUT THE SPEAKER:
Mark Weiss is a Senior Software Engineer at Beeswax, the online advertising industry’s first extensible programmatic buying platform, where he focuses on designing and building data processing infrastructure and applications supporting reporting and machine learning. He has previously held various engineering individual contributor and leadership roles, and has worked on ETL systems and data-driven distributed platforms for much of his career. Mark has spoken previously at DataEngConf NYC, and regularly speaks and mentors at the NYC Python Meetup. He also blogs and hosts the podcast "Using Reflection" at www.usingreflection.com, and can be found on GitHub, Twitter and LinkedIn under @marksweiss. He lives in Brooklyn, NY.
ABOUT DATA COUNCIL:
Data Council (www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers. Make sure to subscribe to our channel for more videos, including DC_THURS, our series of live online interviews with leading data professionals from top open source projects and startups.
FOLLOW DATA COUNCIL:
Twitter: / datacouncilai
LinkedIn: / datacouncil-ai

COMMENTS: 7
@igrai 5 years ago
Learned a few things about workflows programming and pain points. Thanks!
@rembautimes8808 3 years ago
Good talk very insightful
@burtgulash 3 years ago
Airflow is Ruby on Rails and Django of ETL world. It gives you the monolithic batteries included framework which ships with a lot of know how, so that you don't need to reinvent the wheel. The moment you need to do something custom, you're out of luck. Wait for the microframeworks to come.
@nashdashflash 3 months ago
It's been > 2 years, what would you say has come or is on the horizon?
@gilbertg.96 5 years ago
Thanks for the wonderful talk. I'm just getting familiar with data engineering tools, mostly on AWS. Just curious, is AWS step functions ideal in a scenario like this, and what are some perspectives that could be formed towards step functions and data engineering?
@anonymous4711_ 4 years ago
Pretty cool tech, very good talk. Imagine all this brainpower and engineering didn't go into this surveillance ad system, but to process climate data, research cancer or pandemics. Tragic.
@matthewprestifilippo7673 4 years ago
The most interesting thing he said comes at the end where he talks about wrapping airflow