Sequoia introduces a scalable, robust, and hardware-aware speculative decoding framework for serving large language models such as Llama2-70B with low latency on consumer GPUs like the RTX 4090. Sequoia achieves notable speedups over previous systems and adapts to different hardware with no loss of output quality. Its robustness comes from allocating the speculation budget efficiently and from a sampling-without-replacement algorithm that preserves the target model's output distribution at any generation temperature.
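The sampling-without-replacement idea can be illustrated with a minimal sketch: draw several distinct candidate tokens from a categorical distribution, renormalizing after each draw so no token is proposed twice. The helper `sample_without_replacement` below is a hypothetical illustration, not Sequoia's actual implementation:

```python
import numpy as np

def sample_without_replacement(probs, k, rng):
    """Draw k distinct token ids from a categorical distribution.

    After each draw, the chosen token's probability is zeroed and the
    remaining mass is renormalized (hypothetical helper for illustration).
    """
    probs = np.array(probs, dtype=float)
    chosen = []
    for _ in range(k):
        p = probs / probs.sum()          # renormalize remaining mass
        tok = rng.choice(len(p), p=p)    # sample one token id
        chosen.append(int(tok))
        probs[tok] = 0.0                 # exclude it from later draws
    return chosen

rng = np.random.default_rng(0)
# Propose 3 distinct candidates from a 4-token distribution.
draws = sample_without_replacement([0.5, 0.3, 0.15, 0.05], 3, rng)
```

Because every draw excludes previously chosen tokens, the speculated candidates are guaranteed to be distinct, which is what lets a verification step cover more of the target distribution per speculation budget.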