Exploring the Latency/Throughput & Cost Space for LLM Inference // Timothée Lacroix // CTO Mistral

8,268 views

MLOps.community

6 months ago

Join us at our first in-person conference on June 25 all about AI Quality: www.aiqualityconference.com/
// Abstract
Getting the right LLM inference stack means choosing the right model for your task and running it on the right hardware, with proper inference code. This talk will go through popular inference stacks and setups, detailing what makes inference costly. We'll talk about the current generation of open-source models and how to make the best use of them, but we will also touch on features currently missing from the open-source serving stack, as well as what future generations of models will unlock.
// Bio
Timothée Lacroix, aged 31, is Chief Technical Officer in charge of technical issues relating to product efficacy and research. He started as an engineer at Facebook AI Research in New York in 2015, and completed his thesis between 2016 and 2019, in collaboration with École des Ponts, on tensor factorization for recommender systems. He continued his career at Meta until 2023, when he co-founded @Mistral-AI.
// Sign up for our Newsletter to never miss an event:
mlops.community/join/
// Watch all the conference videos here:
home.mlops.community/home/col...
// Check out the MLOps Community podcast: open.spotify.com/show/7wZygk3...
// Read our blog:
mlops.community/blog
// Join an in-person local meetup near you:
mlops.community/meetups/
// MLOps Swag/Merch:
mlops-community.myshopify.com/
// Follow us on Twitter:
/ mlopscommunity
// Follow us on LinkedIn:
/ mlopscommunity

COMMENTS: 13
@evermorecurious91 4 months ago
This is gold!!!
@iandanforth 6 months ago
There seems to be a mistake in the cost estimate at 21:53. It uses the price for the A10 but the throughput of the H100. I believe the actual cost estimate would be $48, not $15.
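To see how that mismatch changes the figure, here is a minimal sketch of the cost formula; every price and throughput below is a placeholder of mine, not a number from the talk:

```python
# $/1M-token cost model, showing why pairing one GPU's hourly price with
# another GPU's throughput skews the estimate.
# All prices and throughputs are hypothetical placeholders.
def usd_per_million_tokens(usd_per_hour: float, tokens_per_second: float) -> float:
    return usd_per_hour / (tokens_per_second * 3600) * 1e6

a10_price_hr = 1.2   # $/hr, placeholder
a10_tps = 25.0       # tokens/s on the A10, placeholder
h100_tps = 80.0      # tokens/s on the H100, placeholder

print(usd_per_million_tokens(a10_price_hr, h100_tps))  # mixed figures: too low
print(usd_per_million_tokens(a10_price_hr, a10_tps))   # consistent A10 figure
```

Whatever the talk's actual numbers, mixing the cheaper GPU's price with the faster GPU's throughput understates the cost by the ratio of the two throughputs, which is the kind of gap the comment describes.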
@eduardoalvarez7152 3 months ago
The math around 6:50 for the A100 batch size doesn't work out. It would be great if the values used to calculate the batch size of 400 were provided. Based on the equations given for compute time and model-load time, the point of intersection is FLOPS/(2*MemoryBandwidth), NOT the (2*FLOPS)/MemoryBandwidth shown in the video.
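For anyone re-deriving this, a minimal sketch of the intersection arithmetic, assuming fp16 weights (2 bytes per parameter) and 2 FLOPs per parameter per token; the A100 figures and parameter count are my placeholders, not values from the talk:

```python
# Back-of-envelope batch size at which decoding flips from
# memory-bound to compute-bound. Assumptions (mine, not the video's):
# fp16 weights (2 bytes/param), 2 FLOPs per parameter per token.
FLOPS = 312e12      # A100 dense bf16 peak, FLOP/s
MEM_BW = 1.555e12   # A100 40GB HBM bandwidth, bytes/s
P = 7e9             # parameter count (illustrative; it cancels out below)

def compute_time(batch: int) -> float:
    """Seconds to compute one decoding step for `batch` sequences."""
    return 2 * P * batch / FLOPS

def load_time() -> float:
    """Seconds to stream the fp16 weights from HBM once."""
    return 2 * P / MEM_BW

# Intersection: 2*P*B/FLOPS == 2*P/MEM_BW  =>  B* = FLOPS / MEM_BW
b_star = FLOPS / MEM_BW
print(f"B* ≈ {b_star:.0f}")   # ≈ 200 under these assumptions
```

Under these assumptions the parameter count cancels and B* = FLOPS/bandwidth ≈ 200; the video's (2*FLOPS)/bandwidth and the comment's FLOPS/(2*bandwidth) correspond to carrying the 2-bytes-per-weight and 2-FLOPs-per-weight factors differently.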
@mndflctzn 5 months ago
This is awesome. Thanks for sharing, super useful.
@windmaple 6 months ago
Great talk!
@boussouarsari4482 2 months ago
It's possible that I'm misunderstanding, but given a significantly large key-value cache (2 GB multiplied by the batch size), can we still assert that memory traffic is dominated solely by the model's weights?
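A minimal sketch of that sanity check, assuming a roughly Mistral-7B-like shape (32 layers, 8 KV heads, head dim 128, fp16); these are my assumptions, not figures from the talk:

```python
# Rough KV-cache footprint per sequence, to check when weight traffic
# stops dominating. Model shape is illustrative, not taken from the talk.
layers, kv_heads, head_dim = 32, 8, 128
seq_len, bytes_per_elt = 4096, 2   # fp16

# Factor of 2 = one K and one V tensor per layer.
kv_bytes_per_seq = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elt
print(f"{kv_bytes_per_seq / 1e9:.2f} GB per sequence")   # ~0.54 GB here
```

Each decoding step reads the whole cache for every sequence, so at batch B the cache adds roughly B times that many bytes of traffic per step. Once that rivals the ~14 GB of fp16 weights, the weights-only approximation does break down, which is exactly the commenter's point.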
@taohe9699 3 months ago
Great talk! Is there a link to the slides?
@MLOps 23 days ago
Join us at our first in-person conference on June 25 all about AI Quality: www.aiqualityconference.com/
@Gerald-iz7mv 1 month ago
Hi, what benchmark did he run to generate the plots? Any open-source GitHub links?
@janilbolswong1953 5 months ago
@5:40 Why do we need to load the entire model all the time? Can't we just load it once? If so, we would lower the memory-movement requirement, and the intersection point would shift left.
@jjh5474 5 months ago
I guess "memory movement" means movement from GPU memory (HBM) to the GPU's compute units. Model parameters are stored in GPU memory, not in the compute units, so they must be moved from HBM to the compute units on every forward pass.
@fraternitas5117 8 days ago
Yes, it needs to be loaded into the compute units on every pass. Advanced users optimize their applications by moving as many bytes per clock cycle as the memory system allows, maximizing memory-bandwidth utilization.
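This re-reading of the weights is why small-batch decoding has a memory-bound latency floor. A minimal sketch, assuming fp16 weights and A100-class bandwidth (my numbers, not the talk's):

```python
# Memory-bound latency floor at batch size 1: every decoded token
# streams all weights from HBM to the compute units.
# Numbers are illustrative assumptions.
P = 7e9            # parameters
BYTES = 2          # fp16 bytes per parameter
MEM_BW = 1.555e12  # A100 40GB HBM bandwidth, bytes/s

t_floor = P * BYTES / MEM_BW   # seconds per decoding step
print(f"{t_floor * 1e3:.1f} ms/token (~{1 / t_floor:.0f} tokens/s)")  # ~9 ms
```

Batching amortizes this: the same weight read serves every sequence in the batch, which is why throughput climbs with batch size until compute becomes the bottleneck.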
@AbdulK-kr2jv 26 days ago
What a horrible, unethical response on the ethics of training data.