🏷️ Large Model Evaluation

3 articles about 'Large Model Evaluation'

Latest AI Models Still Make Three Types of Systematic Reasoning Errors

2026-05-02 research

The ARC Prize Foundation analyzed 160 test runs of OpenAI's and Anthropic's latest models on the ARC-AGI-3 benchmark, id…

2026-04-30 research

A new preregistered study using option-order randomization experiments found that when large language models are prompte…

2026-04-30 opinion

DeepSeek V4's technical report has sparked an industry-wide frenzy, but beyond the impressive specs on paper, 10 frontline …