Model Evaluation - AI News

UK AI Safety Institute Teams Up With Anthropic

2026-05-05 industry 👁 10

The UK AI Safety Institute announces a landmark partnership with Anthropic to conduct pre-deployment evaluations of fron…

2026-05-02 research 👁 9

The ARC Prize Foundation analyzed 160 test runs of OpenAI's and Anthropic's latest models on the ARC-AGI-3 benchmark, id…

2026-05-01 research 👁 15

A new study introduces CL-bench Life, a benchmark that systematically evaluates the ability of large language models to …

2026-05-01 research 👁 11

A research team has released BatteryPass-12K, the first publicly available benchmark dataset for Digital Battery Passpor…

2026-05-01 research 👁 12

A recent arXiv paper proposes a Bayesian statistics-based framework for migrating production LLMs that enables reliable…

2026-04-30 research 👁 11

A new preregistered study using option-order randomization experiments found that when large language models are prompte…

2026-04-30 research 👁 10

A recent arXiv paper proposes a method to correct performance estimation bias for minority class sub-concepts in imbalan…

2026-04-30 research 👁 11

A systematic study covering 115 large language models introduces the DenialBench benchmark, quantitatively analyzing h…

2026-04-30 llm 👁 10

The developer community has launched a new benchmarking tool specifically designed to evaluate whether large language mo…

2026-04-30 opinion 👁 13

DeepSeek V4's technical report has sparked an industry-wide frenzy, but beyond the impressive specs on paper, 10 frontline …

2026-04-29 research 👁 10

Researchers introduce XTC-Bench, the first benchmark to systematically evaluate semantic consistency between visual unde…

2026-04-29 research 👁 10

A recent arXiv paper reexamines the assumption that a 'True Target' exists in machine learning from a philosophical…