SEQUOIA: Exact Llama2-70B on an RTX4090 with half-second per-token latency

Sequoia introduces a scalable, robust, and hardware-aware speculative decoding framework for serving large language models such as Llama2-70B with low latency on consumer GPUs like the RTX4090. Sequoia achieves notable speedups over previous systems and adapts to different hardware with no loss of output quality. Its robustness is attributed to efficient handling of large speculation budgets and a sampling-without-replacement algorithm that keeps generation consistent across sampling temperatures.
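
Sequoia builds on speculative decoding: a small draft model proposes tokens that the large target model verifies in a single pass, with an accept/reject rule that preserves the target model's output distribution exactly. Below is a minimal sketch of the classic single-sequence accept/reject step that Sequoia generalizes to token trees with sampling without replacement; the function names and toy Dirichlet distributions are illustrative stand-ins, not Sequoia's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size, purely illustrative

def accept_or_resample(p_draft, p_target, token):
    # Accept the speculated token with probability
    # min(1, p_target[token] / p_draft[token]); on rejection, resample
    # from the normalized residual max(p_target - p_draft, 0).
    # This rule reproduces the target model's distribution exactly.
    if rng.random() < min(1.0, p_target[token] / p_draft[token]):
        return token, True
    residual = np.maximum(p_target - p_draft, 0.0)
    return rng.choice(VOCAB, p=residual / residual.sum()), False

# Stand-ins for per-position next-token distributions from a small draft
# model and the large target model (e.g. Llama2-70B).
draft = rng.dirichlet(np.ones(VOCAB), size=4)
target = rng.dirichlet(np.ones(VOCAB), size=4)

verified = []
for i in range(len(draft)):
    token = rng.choice(VOCAB, p=draft[i])      # draft model speculates
    token, accepted = accept_or_resample(draft[i], target[i], token)
    verified.append(token)
    if not accepted:                           # stop at first rejection
        break
print("tokens emitted this verification step:", verified)
```

Because several speculated tokens can be verified in one target-model forward pass, each expensive 70B-model invocation emits more than one token on average, which is where the latency reduction comes from.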

  • Sequoia reduces per-token latency to 0.57 s for Llama2-70B on an RTX4090.
  • The Vicuna-33B model can be served on a 2080Ti GPU with a time between tokens (TBT) of 0.87 s.
  • Sequoia scales across different GPUs and model sizes.
  • The framework adapts to hardware variations by dynamically adjusting its speculation budget (see the sketch after this list).
  • Sequoia stays robust as the sampling temperature of speculated tokens varies.
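
The hardware-aware adjustment can be pictured as profiling the target model's verification latency at several speculation budgets on the deployment GPU and picking the budget that maximizes expected tokens per second. The numbers below are invented for this sketch, and Sequoia's actual optimizer searches over tree size, depth, and structure rather than a single scalar budget.

```python
import numpy as np

# Hypothetical profiling data for one GPU: verification latency for a batch
# of `budget` speculated tokens, and the expected number of tokens accepted
# per verification pass at that budget. Both arrays are made up.
budgets = np.array([8, 16, 32, 64, 128])
verify_latency_s = np.array([0.55, 0.57, 0.61, 0.70, 0.94])
expected_accepted = np.array([2.1, 2.8, 3.6, 4.3, 4.8])

# Throughput of the verify loop: accepted tokens per second of target time.
throughput = expected_accepted / verify_latency_s
best = int(np.argmax(throughput))
print(f"chosen speculation budget: {budgets[best]} tokens "
      f"({throughput[best]:.2f} tok/s)")
```

Since the latency curve differs between, say, an RTX4090 and a 2080Ti, the optimal budget differs too, which is why a fixed speculation size cannot be optimal across hardware.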