Fast LLM Serving with vLLM and PagedAttention
Published: 12-10-2023 - Duration: 00:32:07 - Likes: 478
Description:
LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. To address this problem, we are developing vLLM, an open-source library for fast LLM inference and serving. vLLM utilizes PagedAttention, our new attention algorithm...
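To make the serving interface concrete, below is a minimal sketch of offline batched inference using vLLM's documented Python quickstart API; the model name, prompts, and sampling settings are illustrative choices, not taken from the talk.

```python
# Minimal offline-inference sketch using vLLM's Python API.
# Model name, prompts, and sampling settings are illustrative.
from vllm import LLM, SamplingParams

prompts = [
    "The key idea behind PagedAttention is",
    "LLM serving is slow because",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# LLM() loads the model and pre-allocates the KV cache; internally,
# PagedAttention manages that cache in fixed-size blocks.
llm = LLM(model="facebook/opt-125m")

# generate() batches all prompts and returns one RequestOutput per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

PagedAttention is invisible at this API level: by storing each sequence's KV cache in fixed-size blocks addressed through a per-sequence block table, much like virtual-memory pages in an operating system, vLLM avoids the fragmentation of contiguous cache allocation and can batch many more concurrent sequences on the same hardware.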
Related Videos:
Intellectual Property with GenAI: What LLM Developers Need to Know (By: Anyscale)
[1hr Talk] Intro to Large Language Models (By: Andrej Karpathy)
Erlang Factory SF 2016 - The Climb: Experiencing the Rise of Elixir from the Inside (By: Erlang Solutions)
Enabling Cost-Efficient LLM Serving with Ray Serve (By: Anyscale)
Exploring the Latency/Throughput & Cost Space for LLM Inference // Timothée Lacroix // CTO Mistral (By: MLOps.community)
FlashAttention - Tri Dao | Stanford MLSys #67 (By: Stanford MLSys Seminars)