New research has dropped showing how Llama-style models can be drastically shrunk without sacrificing output quality. The new method can take advantage of specialized hardware and run so much faster than before that Nvidia should be worried.
This video is based on this paper: arxiv.org/pdf/2402.17764.pdf
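The paper's core trick is quantizing each weight to one of three values, -1, 0, or +1 (about 1.58 bits). A minimal sketch of that absmean ternary quantization, as described in the linked paper (function and variable names here are my own, not from the paper's code):

```python
import numpy as np

def absmean_ternary_quantize(W, eps=1e-6):
    """Quantize a weight matrix to ternary values {-1, 0, +1}.

    Sketch of the absmean quantization from the linked paper:
    scale by the mean absolute weight, then round and clip to [-1, 1].
    """
    gamma = np.abs(W).mean()        # absmean scaling factor
    W_scaled = W / (gamma + eps)    # normalize by the average magnitude
    return np.clip(np.round(W_scaled), -1, 1), gamma

# Example: a full-precision weight matrix collapses to three values,
# so matrix multiplies need only additions and subtractions.
W = np.random.randn(4, 4)
W_q, gamma = absmean_ternary_quantize(W)
assert set(np.unique(W_q)).issubset({-1.0, 0.0, 1.0})
```

Because every weight is -1, 0, or +1, inference replaces floating-point multiplications with additions, which is what makes dedicated low-bit hardware so attractive.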