We present Chameleon, a family of early-fusion, token-based mixed-modal models capable of understanding and generating images and text in arbitrary interleaved sequences. Chameleon achieves state-of-the-art performance in image captioning and competitive results on other tasks, outperforming several existing models across a broad range of evaluations.