
A review of OpenAI o1 and how we evaluate coding agents

The Cognition team explores the capabilities of OpenAI's new o1-mini and o1-preview models through Devin, their AI software engineering agent. The post highlights gains in reasoning and performance, and argues that a reliable evaluation methodology is essential for coding agents.

  • Devin uses LLMs to reason about code.
  • The o1 models show improved analytical ability.
  • Devin-Base showed significant performance improvements with the o1 models.
  • A structured internal benchmark, cognition-golden, is used for evaluation.
  • The team aims to build realistic, reproducible evaluation environments (a minimal harness sketch follows this list).
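To make the evaluation theme concrete, below is a minimal, hypothetical harness sketch. It is not Cognition's actual cognition-golden suite: the `EvalTask` dataclass, the `run_agent` callable, and the sample task are assumptions introduced purely for illustration. It shows the general pattern the post alludes to: run the agent against isolated tasks, each in a fresh seeded workspace, and score pass/fail with a verification command.

```python
# Hypothetical coding-agent evaluation harness -- an illustrative sketch,
# not Cognition's cognition-golden suite. All names here are assumptions.
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

@dataclass
class EvalTask:
    name: str
    prompt: str        # instruction given to the agent
    setup_files: dict  # relative path -> contents, seeds the workspace
    check_cmd: list    # command whose exit code 0 means "pass"

def evaluate(tasks: list, run_agent: Callable[[str, Path], None]) -> dict:
    """Run each task in a fresh temporary workspace and record pass/fail."""
    results = {}
    for task in tasks:
        with tempfile.TemporaryDirectory() as workdir:
            root = Path(workdir)
            # Reproducibility: every run starts from the same seeded state.
            for rel_path, contents in task.setup_files.items():
                dest = root / rel_path
                dest.parent.mkdir(parents=True, exist_ok=True)
                dest.write_text(contents)
            run_agent(task.prompt, root)  # the agent edits files in-place
            check = subprocess.run(task.check_cmd, cwd=root)
            results[task.name] = (check.returncode == 0)
    return results

if __name__ == "__main__":
    # Toy task: the "agent" is a stub that writes the fix directly.
    task = EvalTask(
        name="fix-off-by-one",
        prompt="Fix the off-by-one bug in count.py so the test passes.",
        setup_files={
            "count.py": "def count(xs):\n    return len(xs) - 1\n",
            "test_count.py": "from count import count\n"
                             "assert count([1, 2, 3]) == 3\n",
        },
        check_cmd=["python", "test_count.py"],
    )

    def stub_agent(prompt: str, root: Path) -> None:
        (root / "count.py").write_text("def count(xs):\n    return len(xs)\n")

    print(evaluate([task], stub_agent))  # {'fix-off-by-one': True}
```

A real harness would additionally pin the agent's model version and sandbox the check command; the point of the sketch is just the shape: seeded state, isolated run, binary verification.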