New Benchmark BTF-2: Evaluating Strategic Reasoning Capabilities of AI Forecasting Agents
A new arXiv paper introduces "Bench to the Future 2," a benchmark that systematically evaluates reasoning strategy diffe…
3 articles about 'Agent Evaluation'
A research team introduces the BenchGuard framework, the first to leverage frontier large language models to automatical…
Researchers introduce GAIA-v2-LILT, a refined pipeline combining functional alignment and cultural adaptation to address…