FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision

FlashAttention-3 introduces new optimizations for NVIDIA Hopper GPUs, exploiting hardware asynchrony and low-precision arithmetic to run 1.5-2.0x faster than its predecessor with FP16 and reach nearly 1.2 PFLOPS with FP8, all while reducing quantization error. These improvements boost efficiency and enable longer-context AI models.

  • FlashAttention-3 is 1.5-2x faster than FlashAttention-2.
  • Achieves up to 740 TFLOPS with FP16 on an H100 GPU.
  • Reaches close to 1.2 PFLOPS with FP8.
  • Optimized for NVIDIA's Hopper GPU architecture.
  • Reduces FP8 quantization error by 2.6x compared with baseline FP8 attention.
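
For context, the sketch below shows how a fused attention kernel is typically invoked through the flash-attn Python package. It uses `flash_attn_func`, the interface exposed for FlashAttention-2; assuming the FlashAttention-3 Hopper kernels are reachable through a similar call is an illustration, not something this summary confirms.

```python
# Minimal sketch of calling a fused attention kernel via the flash-attn package.
# flash_attn_func is the FlashAttention-2 interface; using it as a stand-in for
# the FlashAttention-3 entry point is an assumption, not confirmed above.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 4096, 16, 64

# FlashAttention expects (batch, seqlen, nheads, headdim) tensors in FP16/BF16 on GPU.
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# A single fused kernel computes softmax(QK^T / sqrt(d)) V without materializing
# the full attention matrix; causal=True applies an autoregressive mask.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([2, 4096, 16, 64])
```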