Mistral AI announced the release of Pixtral 12B, its first multimodal model, which pairs a 400M-parameter vision encoder with a 12B-parameter multimodal decoder. Trained on interleaved image and text data, it delivers strong performance on multimodal tasks without sacrificing scores on text benchmarks. Pixtral 12B scores 52.5% on the MMMU reasoning benchmark, surpassing a number of larger models, and shows robust chart understanding, document question answering, and instruction following. The model is released under the Apache 2.0 license and is available to try on La Plateforme or Le Chat.
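As a rough sketch of what querying Pixtral 12B on La Plateforme involves, the snippet below assembles a multimodal chat message that mixes text and an image. It assumes the `mistralai` Python client's message format (text and `image_url` content parts); the model name and image URL are illustrative, not verified.

```python
# Sketch: building a multimodal chat request for Pixtral 12B.
# Assumes the `mistralai` client's message format; the model name
# and image URL below are illustrative assumptions.
model = "pixtral-12b-2409"

messages = [
    {
        "role": "user",
        "content": [
            # A text instruction and an image reference in one turn.
            {"type": "text", "text": "Describe the chart in this image."},
            {"type": "image_url", "image_url": "https://example.com/chart.png"},
        ],
    }
]

# With an API key configured, the request would be sent roughly as:
# from mistralai import Mistral
# client = Mistral(api_key="...")
# response = client.chat.complete(model=model, messages=messages)
# print(response.choices[0].message.content)
```

The interleaved text/image content list mirrors how the model was trained on mixed image and text data.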