We present Chameleon, a family of early-fusion, token-based mixed-modal models capable of understanding and generating images and text in arbitrary interleaved sequences. Chameleon achieves state-of-the-art performance in image captioning and competitive results on other tasks, outperforming several existing models across a broad range of evaluations.