A Picture Is Worth 170 Tokens: How Does GPT-4o Encode Images?


The article examines how GPT-4o splits high-resolution images into tiles and speculates about CNN architectures that could turn each tile into tokens. It asks why OpenAI charges 170 tokens per tile and suggests a 'pyramid strategy' as one possible explanation for that number. Finally, it explores how GPT-4o handles OCR and alpha channels in images.

  • GPT-4o charges 170 tokens per 512x512 image tile.
  • The CLIP model embeds an image as a single vector, unlike GPT-4o.
  • GPT-4o might use a CNN architecture mixing CLIP and YOLO.
  • The 'pyramid strategy' could explain GPT-4o's token structure.
  • GPT-4o ignores the alpha channel, treating transparent white pixels as plain white.
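
To make the per-tile charge concrete, here is a minimal Python sketch of the arithmetic. The 170-token and 512x512 figures come from the summary above. The function name `estimate_tile_tokens` is hypothetical, and the rule that a partially covered tile is billed as a whole tile is an assumption. The sketch deliberately ignores any resizing or fixed per-image overhead that the real API may apply, so treat it as an illustration rather than a billing calculator.

```python
import math


def estimate_tile_tokens(width, height, tile_size=512, tokens_per_tile=170):
    """Estimate the token cost of an image cut into square tiles.

    Assumption: a tile that is only partly covered by the image still
    counts as a full tile, hence the ceiling division on each axis.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    tiles_across = math.ceil(width / tile_size)
    tiles_down = math.ceil(height / tile_size)
    return tiles_across * tiles_down * tokens_per_tile


if __name__ == "__main__":
    for size in [(512, 512), (1024, 768), (1024, 1024)]:
        print(size, estimate_tile_tokens(*size))
```

Under these assumptions a single 512x512 tile costs 170 tokens, while a 1024x768 image spans a 2x2 grid of tiles and costs 680.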