Kagi has introduced the LLM Benchmarking Project to evaluate major large language models on their reasoning, coding, and instruction-following abilities. The benchmark uses novel, frequently changing tasks to prevent models from overfitting to published test sets. Results show wide variation in accuracy, cost, latency, and token-generation speed across models, with OpenAI's gpt-4o currently leading the rankings at 52% accuracy. The results illustrate how quickly LLM performance is evolving for applications such as the reasoning and instruction-following features in Kagi Search.
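
To make the reported metrics concrete, here is a minimal sketch of how accuracy, latency, and token speed could be tabulated per model. This is not Kagi's actual harness; the `run_task` callable, the task list, and the result fields are hypothetical stand-ins for whatever a real benchmark would use.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    correct: bool      # did the model's answer match the expected output?
    latency_s: float   # wall-clock seconds for the completion
    tokens: int        # tokens generated in the response

def score_model(run_task, tasks):
    """Aggregate per-model metrics as a benchmark table might report them.

    `run_task` is a hypothetical callable that sends one task to a model
    and returns a RunResult; `tasks` is the list of benchmark prompts.
    """
    results = [run_task(task) for task in tasks]
    accuracy = sum(r.correct for r in results) / len(results)
    median_latency = sorted(r.latency_s for r in results)[len(results) // 2]
    tokens_per_s = sum(r.tokens for r in results) / sum(r.latency_s for r in results)
    return {
        "accuracy": accuracy,
        "median_latency_s": median_latency,
        "tokens_per_s": tokens_per_s,
    }
```

A harness along these lines would be re-run whenever the task set rotates, which is what keeps models from memorizing a fixed benchmark.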