Stanford HAI Finds AI Benchmarks Hitting Ceiling
Stanford's HAI 2025 AI Index reveals that leading AI models now saturate most major benchmarks, raising urgent questions…
38 articles about 'ai benchmarks'
Hugging Face debuts an open leaderboard for evaluating agentic AI systems, bringing transparency to one of AI's fastest-…
Google DeepMind launches Gemini 2.5 Ultra, its most powerful AI model yet, featuring a million-token context window and …
OpenAI releases GPT-5.5 Instant with 52.5% fewer hallucinations and major math gains, replacing GPT-5.3 Instant for all …
Anthropic's Claude 4 sets new records on GPQA and other graduate-level evaluations, outperforming GPT-4o and Gemini Ultr…
Elon Musk's xAI releases Grok 3.5, which outperforms OpenAI's GPT-5 across major mathematical reasoning benchmarks.
Anthropic's Claude Opus 4 achieves state-of-the-art results on GPQA Diamond, outperforming OpenAI and Google on PhD-leve…
OpenAI's GPT-5 Turbo achieves breakthrough scores on complex reasoning benchmarks, outpacing rivals by significant margi…
A growing debate among researchers and industry leaders questions whether popular AI benchmarks reflect genuine intellig…
Mistral AI releases Codestral 2.0, a code-focused LLM matching or exceeding GPT-5 on key coding benchmarks.
The context window race is effectively over. The real competition now shifts to reasoning depth, efficiency, and archite…
Carnegie Mellon researchers unveil new techniques to enhance reasoning capabilities in vision-language models, closing k…