Transfusion: Predict the Next Token and Diffuse Images with One Multimodal Model


Transfusion is a method for training a single multimodal model on both discrete and continuous data by combining next-token language modeling with diffusion. Pretraining models of several sizes, up to 7B parameters, shows that the approach scales significantly better than tokenizing images and training a language model over the resulting discrete tokens, in both uni-modal and cross-modal settings.

Pretrained on text and image data
Models up to 7B parameters
Scales better than image tokenization
Modality-specific encoders and decoders
Generates competitive images and text
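
The idea of combining a language-modeling loss with a diffusion loss can be illustrated with a minimal sketch. It assumes the text part is scored with next-token cross-entropy, the image part with a mean-squared error on predicted noise, and the two are summed with a weight. All function names and the weight `lam` are hypothetical choices for illustration, not the authors' implementation.

```python
import math

def next_token_loss(logits, target):
    """Cross-entropy for one next-token prediction, computed from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def diffusion_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and actual noise for one image patch."""
    return sum((p - t) ** 2 for p, t in zip(predicted_noise, true_noise)) / len(true_noise)

def combined_loss(text_terms, image_terms, lam=1.0):
    """Average each modality's loss, then add them with a weight on the image term.

    text_terms:  list of (logits, target_index) pairs for text positions
    image_terms: list of (predicted_noise, true_noise) pairs for image patches
    lam:         hypothetical balancing weight (an assumption of this sketch)
    """
    lm = sum(next_token_loss(l, t) for l, t in text_terms) / len(text_terms)
    diff = sum(diffusion_loss(p, n) for p, n in image_terms) / len(image_terms)
    return lm + lam * diff

# Toy usage: one text position over a 2-word vocabulary, one 2-value image patch.
loss = combined_loss(
    text_terms=[([0.0, 0.0], 0)],
    image_terms=[([1.0, 0.0], [0.0, 0.0])],
    lam=2.0,
)
```

In this toy setup, a single set of model parameters would receive gradients from both terms, which is one way a model could be trained on mixed text and image sequences without first quantizing images into discrete tokens.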

