Building a GPU cluster for AI

  95,481 views

Lambda Cloud

1 day ago

Whitepaper: lambdalabs.com/gpu-cluster/ec...
Learn, from start to finish, how to build a GPU cluster for deep learning. We'll cover the entire process, including cluster-level design, rack-level design, node-level design, CPU and GPU selection, power distribution, storage, and networking.
This talk is based on the Lambda Echelon GPU Cluster whitepaper. The whitepaper can be found above.
Slides for the talk can be found here:
files.lambdalabs.com/How%20to%...
Errata:
- Slide 46 contains an erroneous diagram showing a connection from the storage server to the compute fabric network. The storage server does not connect to the compute fabric network. The correct diagram is available in the whitepaper.

COMMENTS: 45
@randahan215 8 months ago
Extraordinary presentation. Covered all the important topics in depth and with real teaching talent. Many thanks!!
@randahan215 4 months ago
Most professional and holistic explanation I heard about this topic. Thank you so much!!
@yassinebouchoucha 1 year ago
Thank you for highlighting an underrated topic/options that company should re-consider within their compute infrastructure.
@dr.mikeybee 7 months ago
Thank you. You got me started years ago with your lambda stack -- the only way I could get TensorFlow installed on Linux.
@NSPK- 7 months ago
Very expert suggestions for hpc and compute sizing.
@ilyboc 6 months ago
Really good analysis and presentation!
@brianwesley28 3 years ago
Thanks for the video.
@anatolystrashkevich7621 1 year ago
very informative, thanks!
@uzairqarni7782 3 months ago
This was amazing. Thank you.
@thePyiott 1 year ago
Great insight!
@carlschumacher5510 2 years ago
It's nice to see a holistic explanation of designing / building / installing a complex multi-rack system...As someone that has spent years working on both sides of the "analog/digital divide" (physical data center world / digital world's various segments), the un-sexy physical aspects of available rack space / power / cooling / floor loading / network uplink bandwidth are often overlooked (often assumed)...A semi arrives with a pallet: "Hey Carl, you can have this online in a couple days, right?"
@lambdacloud 2 years ago
Hey Carl, thanks for the kind comment. Glad you like the video. It's always funny how difficult it can be to 'bridge the divide' between the physical world and virtual world. Many SWEs expect to be able to "spin up" 1000 servers with an API call and forget that there are actual physical objects and tons of people that actually make that happen when you're on-prem.
@sanaullah-qureshi 1 year ago
very informative , thank you.
@peterxyz3541 11 months ago
Thanks. I’m planning on building a “massive” 2 GPU system for home use.
@fundoo203 6 months ago
How did it go man? I also want to build something like that and then stumbled on this video, which is excellent
@vtrandal 2 years ago
Excellent.
@ProjectPhysX 2 years ago
Lots and lots of A100 GPUs. Every single one of them is a monster, almost 2x faster memory than the next best GPU. An entire room full of A100 racks... holy cow.
@natexetan5732 2 years ago
thanks for the inspiration
@austynr 1 year ago
Genius bait and switch. Props!
@metal_mo 1 year ago
Lambda needs an explanation on the difference between "building" and "designing".
@HarishN.J 3 days ago
Hey Stephen, this is highly informative. I work on this clustering. Now I am able to connect the dots and get the bigger picture. Where can I read about the relationship between NUMA topology and GPU peering capability?
@petevenuti7355 8 months ago
What if I have a model that I just want to run as provided, it hasn't really been optimized to run around the cluster and has memory requirements greater than any individual system I have. I feel safe to assume that for that specific case a shared distributed memory model would be the solution to run that specific app, yes? Is there any distribution of Linux that has support for such a memory model? It doesn't have to be a full-blown single system image. Perhaps a patch to the memory management driver so storage can be treated as an extension of system memory and not swap memory? Does any such software exist?
@glennisholcomb592 8 months ago
I have three computers, and a nas, and a external hub. I think that I don’t need a another server because of the NAS. As far as my architecture goes, is there anything else that you can advise?
@Bloodycub666 1 year ago
I just love this kind of thing. How can I start this kind of business? How can I find customers for, like, a small node and start building up?
@programmingwiththotho4641 3 months ago
You are insane, thank you
@eyadmufti 1 year ago
it is a lecture more than a tutorial, Thx.
@ikbo 2 years ago
Do you guys have a GPU cluster optimized for 3D rendering?
@rosenangelow6082 6 months ago
Tell me how difficult it is so i can buy your solution kind of talk
@jleonardoperez5402 1 month ago
Looking for work would love to help
@chaoticblankness 4 months ago
Very Based
@meng-hub 9 months ago
Does it work in man????
@nathanthomas9395 2 years ago
Do Lambda products (GPU cluster) ship with a manual to help you set up the servers for use?
@mengxu2026 2 years ago
Our group ordered around 10 lambda PCs 1 year ago. Right now more than 5 have problems. Some of them do not start up. Mine gets stuck randomly....
@yugr 2 years ago
Have you tried looking into the reasons?
@lambdacloud 2 years ago
Meng Xu, you can email support@lambdalabs.com 24/7 or call +1 (866) 711-2025 during business hours. Sorry to hear you're having issues, I'm sure we'll be able to resolve them quickly.
@danielleza908 1 year ago
Our team has 5 lambda laptops, they work perfectly for over a year now.. We also have a workstation with 3 GPUs, works great too.
@ravnodinson 7 months ago
Hell yes Lambda Lambda Lambda.
@julianfiacconi709 1 year ago
Still most relevant today, 2 years later. Thanks.
@JustPlainRob 3 months ago
Now if only I was a billionaire so I could make use of this great information...
@thinkinginsomething1859 9 months ago
Half Life man!
@huaveihuavei1045 3 years ago
headeggs
@harshikamahesh9459 10 days ago
Talk about what ur expert.. don’t talk useless stuff without knowing all facts
@orthodoxNPC 2 years ago
speak UP
@mikepict9011 10 months ago
This dudes in full submission mode . Sad