Chameleon: Meta's New Multi-Modal LLM


We present Chameleon, a family of early-fusion, token-based mixed-modal models capable of understanding and generating images and text in arbitrary sequences. The models demonstrate state-of-the-art performance in image captioning, outperform Llama-2 on text-only tasks, and remain competitive with models such as Mixtral 8x7B and Gemini-Pro.

  • Understands and generates images and text in arbitrary interleaved sequences.
  • Describes a training approach and an alignment recipe.
  • Outperforms Llama-2 in text-only tasks.
  • Competitive with Mixtral 8x7B and Gemini-Pro.
  • Matches or exceeds GPT-4V in a human-judged mixed-modal generation evaluation.
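
To make "early-fusion, token-based" more concrete, here is a minimal sketch of the general idea: text and images are both turned into discrete tokens and placed in one sequence over a shared vocabulary, so a single model can read and emit either modality. Everything in it is a hypothetical illustration rather than Chameleon's actual implementation: the vocabulary sizes, the `BOI`/`EOI` sentinel ids, and the `fuse` and `modality` helpers are assumptions made up for this example, and the image codes are assumed to come from some separate image tokenizer that is not shown.

```python
from dataclasses import dataclass
from typing import List, Union

# All sizes and ids are illustrative assumptions, not values from the source.
TEXT_VOCAB_SIZE = 1000       # hypothetical number of text tokens
IMAGE_CODEBOOK_SIZE = 512    # hypothetical number of discrete image codes
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # hypothetical begin-of-image id
EOI = BOI + 1                                # hypothetical end-of-image id


@dataclass
class Text:
    token_ids: List[int]  # ids in [0, TEXT_VOCAB_SIZE)


@dataclass
class Image:
    codes: List[int]  # discrete codes in [0, IMAGE_CODEBOOK_SIZE)


def fuse(segments: List[Union[Text, Image]]) -> List[int]:
    """Flatten interleaved text and image segments into one token sequence."""
    sequence: List[int] = []
    for segment in segments:
        if isinstance(segment, Text):
            sequence.extend(segment.token_ids)
        else:
            # Shift image codes past the text range so both modalities
            # live in a single shared vocabulary.
            sequence.append(BOI)
            sequence.extend(TEXT_VOCAB_SIZE + code for code in segment.codes)
            sequence.append(EOI)
    return sequence


def modality(token_id: int) -> str:
    """Report which part of the shared vocabulary a token id belongs to."""
    if token_id < TEXT_VOCAB_SIZE:
        return "text"
    if token_id < BOI:
        return "image"
    return "sentinel"


if __name__ == "__main__":
    document = [Text([1, 2]), Image([0, 5]), Text([3])]
    tokens = fuse(document)
    print(tokens)
    print([modality(t) for t in tokens])
```

Because the fused sequence is just a list of integer ids, the same next-token predictor can in principle continue it with either text or image tokens, which is what allows generation in arbitrary order.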