Fast LLM Serving with vLLM and PagedAttention

14,865 views

Anyscale

7 months ago

LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. To address this problem, we are developing vLLM, an open-source library for fast LLM inference and serving. vLLM utilizes PagedAttention, our new attention algorithm that effectively manages attention keys and values. vLLM equipped with PagedAttention achieves up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes. vLLM has been developed at UC Berkeley and deployed for Chatbot Arena and Vicuna Demo for the past three months. In this talk, we will discuss the motivation, features, and implementation of vLLM in depth, and present our future plans.
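For readers who want the gist of the memory-management idea the abstract alludes to, here is a toy sketch of PagedAttention-style block bookkeeping: the KV cache is split into fixed-size blocks, a per-request block table maps logical positions to physical blocks, and blocks can be shared via reference counting with copy-on-write. All names below (Block, BlockAllocator, Sequence, BLOCK_SIZE) are illustrative assumptions, not vLLM's actual code; the real block manager also stores KV tensors on the GPU and handles swapping and preemption.

```python
# Toy sketch of PagedAttention-style KV-cache bookkeeping (illustrative only).
from dataclasses import dataclass

BLOCK_SIZE = 16  # tokens per KV block (hypothetical value)

@dataclass
class Block:
    block_id: int
    ref_count: int = 1

class BlockAllocator:
    """Hands out fixed-size KV-cache blocks from a free pool."""
    def __init__(self, num_blocks: int):
        self.free_ids = list(range(num_blocks))

    def allocate(self) -> Block:
        return Block(self.free_ids.pop())

    def free(self, block: Block) -> None:
        block.ref_count -= 1
        if block.ref_count == 0:
            self.free_ids.append(block.block_id)

class Sequence:
    """A request's logical KV cache: a block table mapping logical
    positions to (possibly shared) physical blocks."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[Block] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is needed only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots are wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # Sharing (e.g. for parallel sampling): the child reuses the
        # parent's blocks and only bumps their reference counts.
        child = Sequence(self.allocator)
        child.num_tokens = self.num_tokens
        child.block_table = list(self.block_table)
        for block in child.block_table:
            block.ref_count += 1
        return child

    def copy_on_write(self, idx: int) -> None:
        # Before writing into a shared block, detach a private copy.
        block = self.block_table[idx]
        if block.ref_count > 1:
            block.ref_count -= 1
            self.block_table[idx] = self.allocator.allocate()
            # (A real implementation would also copy the block's KV tensors.)
```

Forking is cheap because only reference counts change; physical copies happen lazily, and only for the single block actually being written.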
About Anyscale
---
Anyscale is the AI Application Platform for developing, running, and scaling AI.
www.anyscale.com/
If you're interested in a managed Ray service, check out:
www.anyscale.com/signup/
About Ray
---
Ray is the most popular open source framework for scaling and productionizing AI workloads. From Generative AI and LLMs to computer vision, Ray powers the world’s most ambitious AI workloads.
docs.ray.io/en/latest/
#llm #machinelearning #ray #deeplearning #distributedsystems #python #genai

COMMENTS: 12
@hemanthsethuram6740 2 months ago
Beautiful adaptation of a fundamental idea of paging, reference counting, and copy-on-write.👌
@dinoscheidt 6 months ago
Full circle: dynamic memory management and garbage collection. Great talk!
@LiangyueLi 9 days ago
great work
@simonguo1048 3 months ago
Such an elegant idea and amazingly clear explanation!
@vaporeon2822 2 days ago
Interesting talk. Curious about the underlying implementation of the KV block sharing: you have a copy-on-write mechanism, but how does it avoid a dirty-read condition where both requests read a ref count of 2 and both copy the block simultaneously?
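One plausible answer, not confirmed in the talk: if scheduling and block-table updates run in a single control thread (as vLLM's engine loop effectively does), the race never arises; under true concurrency you would make the check-and-copy atomic, e.g. with a per-block lock. A minimal sketch under that assumption, with all names hypothetical:

```python
# Hypothetical illustration of avoiding the ref-count race the comment
# describes: the copy-on-write decision is made under a per-block lock,
# so two requests cannot both observe ref_count == 2 and both copy.
import threading

class SharedBlock:
    def __init__(self, block_id: int):
        self.block_id = block_id
        self.ref_count = 1
        self._lock = threading.Lock()

    def write_or_copy(self, allocate_fn):
        """Return the block to write into: self if exclusively owned,
        otherwise a fresh private copy (copy-on-write)."""
        with self._lock:
            if self.ref_count == 1:
                return self  # sole owner: write in place
            # Shared: detach under the lock, so check-and-copy is atomic.
            self.ref_count -= 1
            return allocate_fn()  # caller copies the KV contents over
```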
@harshadkunjir5800 6 months ago
This is so great!
@mshonle 6 months ago
It seems like there would be a performance increase for beam search as well? (That is, in addition to the memory savings it gets.) Would be great to see some benchmarks for that!
@billykotsos4642 4 months ago
sick
@alankhor2000 3 months ago
I think the last question was about the impact on latency.
@julien3578 3 months ago
brilliant guys
@ameynaik2743 6 months ago
Is the vLLM engine running on the host?
@fxhp1 3 months ago
You run the server on the host that has the GPU installed; the server can then be accessed remotely over an API using OpenAI's client. Follow me for more AI vids.
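To make that reply concrete, here is a sketch of the workflow using vLLM's OpenAI-compatible server. The model name, host, and port are placeholders, and the exact entrypoint and flags may differ across vLLM versions, so check the vLLM docs before relying on them.

```python
# On the GPU host, start vLLM's OpenAI-compatible server first, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf
# (entrypoint and flags may vary by vLLM version; see the vLLM docs)
from openai import OpenAI

# Point the standard OpenAI client at the remote vLLM host.
client = OpenAI(
    base_url="http://<gpu-host>:8000/v1",  # placeholder host; default port 8000
    api_key="EMPTY",  # vLLM does not check the key by default
)

completion = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",  # must match the served model
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```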