Agent Evaluation - AI News

New Benchmark BTF-2: Evaluating Strategic Reasoning Capabilities of AI Forecasting Agents

2026-04-30 research

A new arXiv paper introduces "Bench to the Future 2," a benchmark that systematically evaluates reasoning strategy diffe…

2026-04-29 research

A research team introduces the BenchGuard framework, the first to leverage frontier large language models to automatical…

2026-04-29 research

Researchers introduce GAIA-v2-LILT, a refined pipeline combining functional alignment and cultural adaptation to address…