🧀 BigCheese.ai

Transfusion: Predict the Next Token and Diffuse Images with One Multimodal Model

Transfusion is a method for training a single multimodal model on both discrete and continuous data: it combines the language-modeling objective (next-token prediction) for text with a diffusion objective for images. Models pretrained at several sizes, up to 7B parameters, scale significantly better than the image-tokenization approach, on both uni-modal and cross-modal tasks.

  • Pretrained on text and image data
  • Models up to 7B parameters
  • Scales better than image tokenization
  • Modality-specific encoders and decoders
  • Generates competitive images and text
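
To make the idea of combining next-token prediction with diffusion more concrete, here is a minimal sketch of what a joint training loss could look like. It is an illustration under stated assumptions, not the authors' implementation: the function names, the plain-list inputs, and the balancing weight `lam` are all hypothetical, and the noise-prediction form of the diffusion loss is one common choice rather than something the summary specifies.

```python
import math


def next_token_loss(logits, targets):
    """Mean cross-entropy over text positions.

    logits[i] is a list of unnormalized scores over the vocabulary,
    targets[i] is the index of the correct next token.
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
    return total / len(targets)


def diffusion_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and actual noise values."""
    squared = [(p - t) ** 2 for p, t in zip(predicted_noise, true_noise)]
    return sum(squared) / len(squared)


def combined_loss(logits, targets, predicted_noise, true_noise, lam=1.0):
    """Text loss plus a weighted image loss.

    `lam` is a hypothetical balancing coefficient, not a value from the source.
    """
    text_term = next_token_loss(logits, targets)
    image_term = diffusion_loss(predicted_noise, true_noise)
    return text_term + lam * image_term
```

In this sketch the text positions contribute only to the cross-entropy term and the image values only to the noise-prediction term, so one set of model parameters receives gradients from both objectives.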