Every Top AI Model Scores 0% on ProgramBench

📁 Research · ⏱️ 11 min read
💡 Meta, Stanford, and Harvard launch ProgramBench, a brutal new benchmark that asks AI to build software from scratch. GPT, Claude, and Gemini all fail completely.

All 9 Top AI Models Score Zero on Brutal New Software Benchmark

Every leading AI model — including GPT, Claude, and Gemini — has scored a flat 0% on ProgramBench, a devastating new benchmark that asks AI to write entire software programs from scratch. Released by a joint team from Meta, Stanford University, and Harvard University, the benchmark tests 9 frontier models across 200 real-world software projects, and not a single model managed to pass even one task.

The result is a sobering reality check for an industry that has spent the past year celebrating AI coding agents. While models have gotten remarkably good at fixing bugs and completing code snippets, ProgramBench reveals they cannot yet do what human software engineers do every day: build functional software from the ground up.

Key Takeaways

  • ProgramBench tests AI models on 200 real software projects with a unified evaluation framework
  • Nine frontier AI models were tested, including GPT, Claude, and Gemini, and all scored 0%
  • The benchmark was created by the same team behind SWE-Bench, the industry-standard coding benchmark
  • Unlike previous benchmarks, ProgramBench asks models to write entire programs from scratch, not just fix bugs
  • The project includes systematic anti-cheating measures and standardized scaffolding
  • Lead author John Yang is a Stanford PhD student who also created SWE-Bench and SWE-agent

Not Bug Fixing — Building Software From Nothing

Here is the challenge ProgramBench poses to AI: you receive a usage document for a tool like FFmpeg, along with a compiled executable file. Now, write the entire program from scratch so it produces identical behavior.

This is fundamentally different from what existing benchmarks measure. SWE-Bench, the current gold standard for evaluating AI coding ability, gives models a complete codebase and asks them to locate and fix specific bugs or implement particular features. The model works within existing architecture, existing patterns, and existing tests.

ProgramBench throws all of that away. There is no existing codebase to reference. No architecture to follow. No test suite to guide the implementation. The AI must understand what a piece of software does from its documentation and binary alone, then recreate the entire thing — every module, every function, every edge case.

This distinction matters enormously. In the real world, the hardest part of software engineering is not patching a known bug. It is making the thousands of architectural decisions required to build a working system from a blank file.
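To make the task setup concrete, here is a minimal Python sketch of a behavioral-equivalence check: run a reference executable and a rewritten candidate on the same inputs and compare what they produce. This is an illustration only, not ProgramBench's actual harness. The names `RunResult`, `run`, and `behaves_identically` are hypothetical, and comparing only the exit code and stdout is an assumption. A real evaluation would likely also need to look at output files, stderr, and timing.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Observable behavior of one program invocation (assumed: exit code + stdout)."""
    exit_code: int
    stdout: bytes


def run(command: list[str], stdin_data: bytes = b"", timeout: float = 10.0) -> RunResult:
    """Run a command, feed it stdin_data, and capture its exit code and stdout."""
    done = subprocess.run(command, input=stdin_data, capture_output=True, timeout=timeout)
    return RunResult(done.returncode, done.stdout)


def behaves_identically(
    reference: list[str],
    candidate: list[str],
    cases: list[tuple[list[str], bytes]],
) -> bool:
    """True only if both programs agree on every (extra args, stdin) case."""
    return all(
        run(reference + args, stdin) == run(candidate + args, stdin)
        for args, stdin in cases
    )


if __name__ == "__main__":
    # Stand-ins for a real reference binary and a model-written rewrite:
    # two Python one-liners that both upper-case their stdin.
    ref = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    cand = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper(), end='')"]
    print(behaves_identically(ref, cand, [([], b"hello\n"), ([], b"")]))
```

The check is all-or-nothing by design: a single mismatching case fails the candidate, which mirrors why reproducing "every edge case" is so much harder than passing a targeted bug-fix test.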

The Team Behind the Benchmark Has Serious Credibility

ProgramBench was not created by outsiders trying to embarrass AI companies. The lead author, John Yang, is a PhD student at Stanford and the original creator of both SWE-Bench and SWE-agent — two of the most widely used tools for evaluating AI coding performance.

SWE-Bench has become the de facto standard that companies like OpenAI, Anthropic, and Google use to demonstrate their models' coding capabilities. When these companies announce improvements in coding performance, they almost always cite SWE-Bench scores. Yang and his collaborators are now saying: passing SWE-Bench is not enough.

The collaboration between Meta, Stanford, and Harvard gives the benchmark institutional weight. The research paper is available at programbench.com, and the benchmark is designed to be reproducible and extensible — a critical feature that distinguishes rigorous academic benchmarks from one-off demonstrations.

Why Previous 'AI Builds Software' Claims Fall Short

Over the past year, headlines about AI building software from scratch have proliferated. Anthropic demonstrated a group of parallel Claude instances writing a C compiler. Cursor published blog posts about long-duration autonomous programming sessions. Epoch AI's MirrorCode project explored similar territory.

But ProgramBench's creators identified critical flaws shared by these demonstrations:

  • Each case tested only a handful of projects, making results statistically meaningless
  • The scaffolding and prompting were manually tuned for each specific task
  • There were no standardized evaluation criteria across different projects
  • Anti-cheating measures were absent or minimal
  • Results were not reproducible by independent researchers

ProgramBench addresses every one of these shortcomings. With 200 tasks, unified scaffolding, systematic anti-contamination checks, and a standardized evaluation pipeline, it brings the rigor of a true benchmark to a problem that has previously been evaluated only through cherry-picked anecdotes.

The difference between a curated demo and a systematic benchmark cannot be overstated. A company can always find one project where its model performs impressively. A benchmark with 200 diverse tasks reveals the model's actual capability distribution — and right now, that distribution is centered squarely on zero.
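Why does task count matter so much? A rough way to see it is to ask how high the true pass rate could still be after observing zero passes. The sketch below uses a standard one-sided binomial bound. It assumes tasks are independent and equally hard, which real benchmark tasks are not, so treat the numbers as an illustration rather than a claim about ProgramBench itself. The function name `zero_pass_upper_bound` is made up for this example.

```python
def zero_pass_upper_bound(num_tasks: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on the true pass rate after 0 passes in num_tasks.

    If the true pass rate is p, the chance of seeing no passes in n
    independent tasks is (1 - p) ** n. Setting that equal to
    (1 - confidence) and solving for p gives the bound returned here.
    """
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / num_tasks)


if __name__ == "__main__":
    # Compare a handful-of-projects demo with a 200-task benchmark.
    for n in (5, 200):
        bound = zero_pass_upper_bound(n)
        print(f"{n:>3} tasks, 0 passes: true pass rate plausibly below {bound:.1%}")
```

Under these assumptions, zero passes on 5 projects is still compatible with a true pass rate of roughly 45%, while zero passes on 200 tasks narrows that to roughly 1.5%. That is the statistical gap between an anecdote and a benchmark.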

What This Reveals About the State of AI Coding

The 0% pass rate does not mean AI coding tools are useless. Far from it. Tools like GitHub Copilot, Claude, and Cursor genuinely boost developer productivity on day-to-day tasks. They excel at:

  • Autocompleting code within existing files
  • Translating natural language descriptions into function implementations
  • Fixing bugs when given sufficient context
  • Writing unit tests for existing code
  • Refactoring and optimizing known patterns

What they cannot yet do, as the ProgramBench results indicate, is handle the full complexity of software creation. Building a program from scratch requires a qualitatively different set of capabilities:

  • Architectural reasoning: deciding how to decompose a problem into modules
  • Long-horizon planning: maintaining coherence across thousands of lines of code
  • Implicit knowledge: understanding unstated conventions, edge cases, and platform behaviors
  • Self-verification: knowing whether the code actually works without external test suites
  • Resource management: handling memory, concurrency, and system-level concerns

Current models appear to lack all of these capabilities at the level required for from-scratch software construction. The benchmark reveals a chasm between 'impressive coding assistant' and 'autonomous software engineer' that the industry has been too eager to paper over.

The Leaderboard Problem and Benchmark Gaming

ProgramBench also arrives at a moment when the AI community is increasingly skeptical of benchmark scores. Major model providers have been accused of 'teaching to the test' — optimizing their models specifically for popular benchmarks like SWE-Bench, MMLU, and HumanEval.

This creates a misleading picture of AI capabilities. A model might achieve an impressive 50% on SWE-Bench while being fundamentally unable to build even the simplest software project independently. ProgramBench's 0% across all models suggests that previous benchmark scores may have been inflating our perception of AI coding ability.

The anti-cheating measures built into ProgramBench are particularly noteworthy. The team implemented systematic contamination checks to ensure models have not simply memorized the source code of test projects during training. This is a growing concern as training datasets expand to include virtually all public code on the internet.
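This article does not spell out how ProgramBench's contamination checks work, so the following is purely a hypothetical illustration of one common idea: measure how many token n-grams in a model's output also appear verbatim in the original project's source. The function names, the tokenizer regex, and the n-gram length of 8 are all assumptions made for this sketch.

```python
import re


def token_ngrams(source: str, n: int = 8) -> set[tuple[str, ...]]:
    """Split source into identifier, number, and punctuation tokens; return all n-grams."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(generated: str, original: str, n: int = 8) -> float:
    """Fraction of the generated code's n-grams that also occur in the original."""
    generated_grams = token_ngrams(generated, n)
    if not generated_grams:
        return 0.0  # Too short to compare; treat as no measurable overlap.
    shared = generated_grams & token_ngrams(original, n)
    return len(shared) / len(generated_grams)


if __name__ == "__main__":
    original = "int add(int a, int b) { return a + b; }"
    rewritten = "def add(x, y):\n    return x + y"
    print(overlap_ratio(original, original))   # verbatim copy
    print(overlap_ratio(rewritten, original))  # independent rewrite
```

A high ratio would hint that the model is reciting memorized source rather than reconstructing behavior. A real check would also have to handle renamed identifiers, reformatting, and translation between languages, none of which this sketch attempts.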

Industry Implications Are Significant

For enterprise leaders evaluating AI coding investments, ProgramBench delivers a clear message: AI coding agents are powerful assistants, not autonomous developers. Organizations should continue investing in AI-augmented development workflows while maintaining realistic expectations about what these tools can and cannot do independently.

For AI researchers, the benchmark opens a new frontier. The gap between bug-fixing (where models perform reasonably well) and program creation (where they score zero) represents one of the most important unsolved problems in AI. Closing this gap likely requires advances in planning, reasoning, and long-context coherence that go beyond simply scaling existing architectures.

For developers using AI tools daily, the results validate what many have intuited: these tools are extraordinary at tactical coding tasks but struggle with strategic software decisions. The most effective workflow remains human-led architecture with AI-assisted implementation.

Looking Ahead: What Comes Next

ProgramBench sets a clear, measurable target for the AI industry. The first model to score even 10% on this benchmark will represent a genuine breakthrough in autonomous software engineering. Given the current trajectory of model improvements, that milestone could arrive within 12 to 24 months — but the 0% baseline suggests it will require more than incremental improvements.

The benchmark also raises fundamental questions about the path to Artificial General Intelligence (AGI). If the most capable AI systems on Earth cannot replicate software that a skilled human developer could build in a few weeks, the gap between current AI and human-level intelligence may be wider than recent hype suggests.

John Yang and the ProgramBench team have given the AI industry something it desperately needed: an honest, rigorous, and humbling measure of where we actually stand. The leaderboards that have dominated AI discourse for the past two years just got a new column, and every entry reads zero.