
A review of OpenAI o1 and how we evaluate coding agents

The Cognition team explores the capabilities of OpenAI's new o1-mini and o1-preview models through Devin, their AI software engineering agent. The post highlights gains in reasoning and performance, and argues that a reliable evaluation methodology is essential for coding agents.

  • Devin uses LLMs to reason about code.
  • The o1 models show improved analytical ability.
  • Devin-Base showed significant performance improvements with the o1 models.
  • A structured internal benchmark, cognition-golden, is used for evaluation.
  • The team aims to build realistic, reproducible evaluation environments (a minimal harness sketch follows this list).
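To make the evaluation theme concrete, below is a minimal, hypothetical harness sketch. It is not Cognition's actual cognition-golden suite: the `EvalTask` dataclass, the `run_agent` callable, and the sample task are assumptions introduced purely for illustration. It shows the general pattern the post alludes to: run the agent against isolated tasks, each in a fresh seeded workspace, and score pass/fail with a verification command.

```python
# Hypothetical coding-agent evaluation harness -- an illustrative sketch,
# not Cognition's cognition-golden suite. All names here are assumptions.
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

@dataclass
class EvalTask:
    name: str
    prompt: str        # instruction given to the agent
    setup_files: dict  # relative path -> contents, seeds the workspace
    check_cmd: list    # command whose exit code 0 means "pass"

def evaluate(tasks: list, run_agent: Callable[[str, Path], None]) -> dict:
    """Run each task in a fresh temporary workspace and record pass/fail."""
    results = {}
    for task in tasks:
        with tempfile.TemporaryDirectory() as workdir:
            root = Path(workdir)
            # Reproducibility: every run starts from the same seeded state.
            for rel_path, contents in task.setup_files.items():
                dest = root / rel_path
                dest.parent.mkdir(parents=True, exist_ok=True)
                dest.write_text(contents)
            run_agent(task.prompt, root)  # the agent edits files in-place
            check = subprocess.run(task.check_cmd, cwd=root)
            results[task.name] = (check.returncode == 0)
    return results

if __name__ == "__main__":
    # Toy task: the "agent" is a stub that writes the fix directly.
    task = EvalTask(
        name="fix-off-by-one",
        prompt="Fix the off-by-one bug in count.py so the test passes.",
        setup_files={
            "count.py": "def count(xs):\n    return len(xs) - 1\n",
            "test_count.py": "from count import count\n"
                             "assert count([1, 2, 3]) == 3\n",
        },
        check_cmd=["python", "test_count.py"],
    )

    def stub_agent(prompt: str, root: Path) -> None:
        (root / "count.py").write_text("def count(xs):\n    return len(xs)\n")

    print(evaluate([task], stub_agent))  # {'fix-off-by-one': True}
```

A real harness would additionally pin the agent's model version and sandbox the check command; the point of the sketch is just the shape: seeded state, isolated run, binary verification.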