Apache Spark Core - Practical Optimization | Daniel Tomes (Databricks)

44,902 views

Databricks

Properly shaping partitions and jobs enables powerful optimizations, eliminates skew, and maximizes cluster utilization. We will explore various Spark partition-shaping methods along with several optimization strategies, including join optimizations, aggregate optimizations, salting, and multi-dimensional parallelism.
About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Read more here: databricks.com/product/unifie...
Connect with us:
Website: databricks.com
Facebook: / databricksinc
Twitter: / databricks
LinkedIn: / databricks
Instagram: / databricksinc
Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here: databricks.com/databricks-nam...

COMMENTS: 32
@premendrasrivastava114 · 3 years ago
Great video, it covers almost all the points that cause performance issues. It's easy to write a Spark application because of the available APIs, but it's important to understand Spark's architecture to get real value out of it.
@TheSQLPro · 3 years ago
Very deep technical explanation of Spark optimization. Nice stuff, thank you for sharing it with the community.
@ayushy2537 · 4 years ago
Very nice explanation of the Spark architecture and terminology.
@dhrub6 · 2 years ago
One of the best videos on Spark job optimization.
@randysuarezrodes288 · 2 years ago
Awesome video. Where can we find the slides / reference material?
@SpiritOfIndiaaa · 3 years ago
Thanks a lot. At ~27 min, where did you get the Stage 21 input read from, i.e. 45.4 GB + 8.6 GB = 54 GB?
@mrajcok · 3 years ago
From stages 19 and 20, the "Shuffle Write" column numbers. These become input to stage 21, as indicated by the DAG graph.
@harishreddyanam2078 · 3 years ago
@mrajcok can you please explain lazy loading?
@mrajcok · 3 years ago
@harishreddyanam2078 Just do a Google search for "spark lazy loading" and look at the results from Stack Overflow and the official Spark docs. It would be too much to try to explain in a comment.
@cozos · 4 years ago
Are the suggested sizes (target shuffle size, target file write) compressed or uncompressed?
@mrajcok · 3 years ago
Normally you use compressed sizes for these. When Spark writes the output of a stage to disk, it compresses the data by default (spark.shuffle.compress is true by default) to reduce I/O, which is normally what you want. The Spark UI shows the compressed sizes, so that's another reason to use compressed sizes in your calculations. As he discusses in the video, if your data compresses more than average, you'll want more partitions than average, since the decompressed data will consume more executor memory than average.
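A minimal PySpark sketch of the sizing approach this reply describes, working from the compressed figures the UI reports; the 54 GB input and 200 MB target are illustrative values taken from this thread, and the session setup is assumed.

```python
# Sketch: sizing shuffle partitions from the *compressed* shuffle-write figure
# shown in the Spark UI. Numbers are illustrative, taken from this comment thread.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Shuffle output is compressed by default; the UI reports compressed bytes.
compress = spark.sparkContext.getConf().get("spark.shuffle.compress", "true")
print(f"spark.shuffle.compress = {compress}")

compressed_shuffle_bytes = 54 * 1024**3      # e.g. ~54 GB read from the UI
target_partition_bytes = 200 * 1024**2       # ~200 MB of compressed data per task

num_partitions = max(1, compressed_shuffle_bytes // target_partition_bytes)
spark.conf.set("spark.sql.shuffle.partitions", int(num_partitions))
```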
@pshar2931 · 1 year ago
Excellent presentation but terrible screenshots; it's very hard to read what's written.
@MULTIMAX1989 · 3 years ago
At ~30 min, how did he get to the percentage that only 60% of the cluster is being utilized?
@samuel_william · 2 years ago
540 partitions / 96 cores = 5.625. He says that after running 5 full sets only ~62 percent of the cores will be utilized, and that's because of the remainder. If he instead uses 480 partitions, then 480 / 96 gives a whole number, 5.
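A quick worked version of the arithmetic in this reply (plain Python, numbers from the talk):

```python
# Wave math behind the ~62% figure: with 540 shuffle partitions and 96 cores,
# the sixth wave runs only 60 tasks.
cores = 96
partitions = 540

full_waves, remainder = divmod(partitions, cores)     # 5 full waves, 60 tasks left over
last_wave_utilization = remainder / cores             # 60 / 96 = 0.625
print(full_waves, remainder, f"{last_wave_utilization:.0%}")   # 5 60 62%

# With 480 partitions, 480 / 96 = 5 exact waves, so every wave uses all cores.
```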
@loganboyd · 3 years ago
Great video! I've been thinking about our Spark input read partitions since we have a lot of heavily compressed data using BZIP2. Sometimes the compression is 20-25x, so a 128 MB blah.csv.bz2 file is really a 2.5-3 GB blah.csv file. Should we reduce spark.sql.files.maxPartitionBytes to accommodate this, so that more partitions are created on read?
@mrajcok · 3 years ago
Most likely "yes", but it depends on how much executor memory you have per core, how many cores you have, and how much processing/CPU you use on each partition. You want all of your cores busy (i.e., at least one input partition per core). If you do that but then run out of executor memory, you could try smaller (hence more) partitions by setting maxPartitionBytes even lower.
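A hedged sketch of the suggestion above, assuming a splittable .bz2 CSV input; the path and the 16 MB value are made-up illustrations, not figures from the talk.

```python
# Sketch: heavily compressed (~20-25x) but splittable bz2 input, so shrink the
# read-split size to keep the decompressed data per task manageable.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default is 128 MB; with ~20x compression each 128 MB split decompresses to
# roughly 2.5 GB, so cut the split size to get more, smaller read partitions.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(16 * 1024 * 1024))  # 16 MB

# Hypothetical path -- replace with your own bz2 data set.
df = spark.read.csv("/data/blah.csv.bz2", header=True)
print(df.rdd.getNumPartitions())
```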
@rishigc · 2 years ago
At 39:22, I did not understand why my job would fail if I used coalesce() to reduce the number of partitions while writing output. Can anyone please explain? What happens when coalesce() is pushed up to the previous stage? How does that make the job fail?
@jachu61 · 1 year ago
Since Spark is lazy, it looks at what it has to do as the last step and works backwards from there. Say it needs to write the data to 10 files because you used coalesce(10) at the end. In an earlier step it has to do a huge join to produce that data, but because you want only 10 files at the end, it will already run the join's shuffle with those 10 partitions, which would be huge, not fit into memory, and fail the job.
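A small PySpark sketch contrasting the two approaches this reply describes; the table names and output path are hypothetical.

```python
# coalesce(10) can be pushed up into the previous stage, so the join itself may
# run with only 10 (huge) partitions; repartition(10) keeps the join's own
# parallelism and only reduces the final write.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.table("events")        # hypothetical large tables
right = spark.table("customers")

joined = left.join(right, "customer_id")

# Risky: the join may execute with only 10 partitions.
# joined.coalesce(10).write.mode("overwrite").parquet("/out/joined")

# Safer: repartition() inserts its own shuffle, so the join runs at full
# parallelism and only the output is written as 10 files.
joined.repartition(10).write.mode("overwrite").parquet("/out/joined")
```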
@rishigc · 2 years ago
I didn't understand the part at 31:00 where he chooses 480 partitions instead of 540. Can anyone please explain why?
@ffckinga1375 · 1 year ago
Because he wants to give every core the same number of partitions and avoid skew.
@jachu61 · 1 year ago
He has 96 cores, so 480 partitions (fewer, but a little bigger) allow him to process them in 5 "waves". If he stayed with 540, it would take 6 "waves", and the last one would only utilize 60 cores with the remaining 36 idle, so it's not as optimal as the first option.
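For completeness, a tiny sketch of applying the chosen count (assuming a 96-core cluster as in the talk; session creation assumed):

```python
# Set the shuffle-partition count to a whole multiple of the core count so the
# last wave is not mostly idle: 480 = 5 full waves on 96 cores.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "480")
```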
@sashgorokhov · 4 years ago
If you have several joins on one table, how do you set the shuffle partition count for a specific join? As I currently understand it, this config is rendered into the physical plan only when an action is triggered.
@prudvisagar6369 · 4 years ago
Nice question, this is the one I wanted to know about. I haven't got the answer yet.
@josephkambourakis · 3 years ago
spark.conf.set("spark.sql.shuffle.partitions", value), but Spark 3.0 should fix this kind of thing for you with AQE.
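A short sketch along the lines of this reply, assuming Spark 3.x; the AQE settings shown are standard Spark configs, and the 2000/480 values are only illustrative.

```python
# Adaptive Query Execution can size each shuffle on its own, which avoids
# hand-tuning spark.sql.shuffle.partitions per join.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Upper bound AQE starts from before coalescing small shuffle partitions.
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "2000")

# Without AQE, spark.sql.shuffle.partitions is read when the plan is built for
# an action, so one value applies to every shuffle in that query; to use
# different values you would have to materialize intermediate results between joins.
spark.conf.set("spark.sql.shuffle.partitions", "480")
```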
@loganboyd · 3 years ago
@josephkambourakis Really excited for AQE; we are upgrading our cluster in about a month.
@MrEternalFool · 3 years ago
29:20 If the shuffle is still 550 GB, why is columnar compression good?
@aj0032452 · 3 years ago
Got the same question!
@bikashpatra119 · 3 years ago
And what is the logic behind using a target size of 100 MB when your input is 54 GB and the shuffle spill is 550 GB?
@mrajcok · 3 years ago
He is saying that the compressed input to stage 21 is 54 GB, but the uncompressed size is 550 GB, so the data compresses very well (about 10-to-1), hence the need for more partitions than "normal" because the data "blows out" more than normal: ten times its size when uncompressed. So he decided to use 100 MB per input partition in his calculation, rather than his normal recommended value of 200 MB of compressed input per (input) partition, to account for the larger-than-typical blow-out/decompressed size. If the data instead compressed at a 5-to-1 ratio (which is likely more typical), then try the normally recommended 200 MB per partition as a starting point.
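The arithmetic in this reply, written out (plain Python; decimal units to keep the numbers round, figures from the talk):

```python
# ~10:1 compression, so the per-partition target drops to 100 MB of compressed input.
compressed_input_mb = 54_000                 # stage 21 shuffle input, compressed
uncompressed_mb = 550_000                    # roughly the decompressed/spill size

compression_ratio = uncompressed_mb / compressed_input_mb     # ~10.2, call it 10:1
target_mb_per_partition = 100                # half of the usual ~200 MB target

partitions = compressed_input_mb // target_mb_per_partition   # 540, then rounded
print(compression_ratio, partitions)         # down to a multiple of the core count
```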
@mrajcok · 3 years ago
I should note that the choice of 100 MB or 200 MB of compressed input per partition also depends on how much executor memory you have. He has 7.6 GB per core. If you have half that, then you want your input partitions to also be 50% smaller. I often run Spark jobs with just 1-1.5 GB per core, and my rule of thumb is therefore much lower: 30-50 MB per partition.
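A small helper sketching the scaling rule in this reply; the function itself and the linear scaling are assumptions, and only the 7.6 GB / 200 MB and 1-1.5 GB / 30-50 MB anchor points come from this thread.

```python
# Scale the target compressed size per partition with executor memory per core.
def target_partition_mb(mem_per_core_gb: float,
                        baseline_mem_gb: float = 7.6,
                        baseline_target_mb: int = 200) -> int:
    """Scale the ~200 MB-per-partition target linearly with memory per core."""
    return max(1, int(baseline_target_mb * mem_per_core_gb / baseline_mem_gb))

print(target_partition_mb(7.6))   # 200 MB, the setup in the talk
print(target_partition_mb(1.5))   # ~39 MB, near the 30-50 MB rule of thumb
```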
@7777linpaws · 2 years ago
@mrajcok Thanks for your replies, these are very helpful. Just one question though: do you assume only about 25-30% of the available memory per core is usable for shuffle to get to that 30-50 MB figure?