UK AI Safety Institute Teams Up With Anthropic
The UK AI Safety Institute announces a landmark partnership with Anthropic to conduct pre-deployment evaluations of fron…
14 articles about 'Model Evaluation'
The ARC Prize Foundation analyzed 160 test runs of OpenAI's and Anthropic's latest models on the ARC-AGI-3 benchmark, id…
A new study introduces CL-bench Life, a benchmark that systematically evaluates the ability of large language models to …
A research team has released BatteryPass-12K, the first publicly available benchmark dataset for Digital Battery Passpor…
A recent arXiv paper proposes a Bayesian statistics-based framework for migrating production LLMs that enables reliable…
A new preregistered study using option-order randomization experiments found that when large language models are prompte…
A recent arXiv paper proposes a method to correct performance-estimation bias for minority-class sub-concepts in imbalan…
A systematic study covering 115 large language models has released the DenialBench benchmark, quantitatively analyzing h…
The developer community has launched a new benchmarking tool specifically designed to evaluate whether large language mo…
DeepSeek V4's technical report has sparked an industry-wide frenzy, but beyond the impressive specs on paper, 10 frontline …
Researchers introduce XTC-Bench, the first benchmark to systematically evaluate semantic consistency between visual unde…
A recent arXiv paper reexamines the assumption that a 'True Target' exists in machine learning from a philosophical…