LLM evaluation - AI News

0% Pass Rate: New Benchmark Stumps All AI Models

2026-05-07 research 👁 9

The creators of SWE-Bench release a devastating new benchmark where Claude, GPT, and Gemini all score zero, exposing fun…

2026-05-07 tutorial 👁 12

A practical guide to measuring LLM quality using RAGAS and DeepEval, two leading open-source evaluation frameworks.

2026-05-06 industry 👁 9

Weights and Biases unveils a dedicated MLOps platform designed to streamline LLM evaluation pipelines for enterprise AI …

2026-05-06 research 👁 8

IIT Bombay researchers release a comprehensive evaluation framework targeting AI performance across dozens of underserve…

2026-05-06 opinion 👁 10

A growing debate among researchers and industry leaders questions whether popular AI benchmarks reflect genuine intellig…

2026-05-05 app 👁 8

Weights and Biases releases Weave 2.0, a comprehensive automated evaluation framework designed to streamline LLM testing…

2026-05-05 app 👁 9

Weights and Biases releases Weave, an open-source platform for monitoring, evaluating, and debugging LLM applications in…

2026-05-05 tutorial 👁 10

A complete guide to using the RAGAS framework for measuring and improving LLM output quality in RAG pipelines.

2026-05-04 research 👁 9

MathNet introduces 30,000 competition-level math problems to rigorously test AI mathematical reasoning, raising the bar …

2026-04-29 research 👁 10

A recent arXiv paper systematically quantifies the 'self-preference bias' that arises when large language models serve as…