
DeepMind's Aletheia Cracks 13 Erdős Problems—Then Hits a Blunder

💡 Google DeepMind's Aletheia system solved 13 long-standing Erdős conjectures in 7 days, but the process revealed AI's embarrassing blind spots.

Google DeepMind's new mathematical reasoning system, dubbed Aletheia, has cracked 13 of legendary mathematician Paul Erdős's unsolved conjectures in just 7 days. But buried in the triumph is an awkward revelation: the AI sometimes spent dozens of pages on rigorous derivations—only to discover the problem statement itself was flawed.

The results, published in a paper on arXiv, mark one of the most ambitious attempts to weaponize large language models against open problems in pure mathematics. Yet they also expose the fundamental gap between computational brute force and genuine mathematical insight.

Key Takeaways

  • Aletheia solved 13 of the 700 Erdős conjectures it was given, some of which had remained open for up to 50 years
  • The system processed candidates through a brutal multi-stage pipeline, filtering 700 problems down to 200 candidates, then 63, and finally 13 confirmed solutions
  • In several cases, the AI produced extensive multi-page proofs before realizing the problem's premises contained errors
  • The system is built on Gemini's Deep Think mode, leveraging massive compute to generate candidate solutions
  • Human mathematicians still performed final verification on all 13 results
  • The work raises serious questions about whether AI is truly accelerating science—or just industrializing a narrow slice of it

The Erdős Bounty: 50 Years of Unsolved Challenges

Paul Erdős, widely regarded as the 20th century's most prolific mathematician, left behind hundreds of open conjectures before his death in 1996. Many came with a cash bounty ranging from $50 to $5,000, modest sums that belied their extraordinary difficulty.

For half a century, some of the world's brightest mathematical minds attacked these problems and came away empty-handed. The conjectures span combinatorics, number theory, graph theory, and probabilistic methods—areas where human intuition has traditionally been essential.

DeepMind's decision to target this specific collection was strategic. Erdős problems are well-documented, precisely stated, and carry an almost mythical status in the mathematics community. Solving even one would generate headlines. Solving 13 in a week? That's a publicity bonanza.

Inside Aletheia's Brutal Filtering Pipeline

Aletheia's approach is less 'beautiful mind' and more 'industrial processing plant.' The system doesn't experience mathematical insight the way humans do. Instead, it operates a ruthlessly efficient elimination pipeline that would feel right at home in a Silicon Valley growth-hacking playbook.

Here's how the pipeline works:

  • Stage 1 — Ingestion: All 700 known Erdős conjectures are loaded into the system as the initial candidate pool
  • Stage 2 — Deep Think Generation: Gemini's Deep Think mode burns through massive compute resources, generating approximately 200 candidate solution approaches
  • Stage 3 — Natural Language Verification: An automated verifier checks logical consistency, eliminating solutions with broken reasoning chains, reducing the pool to 63
  • Stage 4 — Formal Verification: Remaining candidates undergo more rigorous automated checking
  • Stage 5 — Human Review: Professional mathematicians examine the final outputs, confirming 13 as valid solutions

The sheer attrition rate tells its own story. From 700 problems, only 13 survived—a success rate of roughly 1.9%. This is not an AI that 'understands' mathematics. It's a system that generates enormous volumes of candidate reasoning and then filters aggressively.
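
The attrition figures above can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in this article (700, 200, 63, 13). The article gives no count for what survived Stage 4, so Stages 4 and 5 are folded into a single step rather than guessed at. The names FUNNEL, stage_pass_rates and overall_rate are illustrative, not taken from the paper.

```python
# Funnel counts as quoted in the article. No figure is reported for the
# output of Stage 4, so Stages 4 and 5 are treated as one combined step.
FUNNEL = [
    ("Stage 1: ingestion", 700),
    ("Stage 2: Deep Think generation", 200),
    ("Stage 3: natural-language verification", 63),
    ("Stages 4-5: formal verification + human review", 13),
]


def stage_pass_rates(funnel):
    """Return (stage name, survivors, fraction kept from the previous stage)."""
    rates = []
    for (_, prev_count), (name, count) in zip(funnel, funnel[1:]):
        rates.append((name, count, count / prev_count))
    return rates


def overall_rate(funnel):
    """Fraction of the initial pool that survives every stage."""
    return funnel[-1][1] / funnel[0][1]


if __name__ == "__main__":
    for name, count, rate in stage_pass_rates(FUNNEL):
        print(f"{name}: {count} left ({rate:.1%} of previous stage)")
    print(f"Overall: {overall_rate(FUNNEL):.2%} of the initial pool")
```

Seen stage by stage, each filter keeps somewhere between a fifth and a third of what it receives. The 1.9% headline figure is the product of those steps, not the work of any single harsh cut.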

When AI Writes 30 Pages—Then Discovers the Problem Was Wrong

Perhaps the most revealing moment in the entire project wasn't a triumph—it was a blunder. In multiple instances, Aletheia produced extensive derivations spanning dozens of pages, following impeccable logical chains, only to arrive at a devastating conclusion: the problem statement itself contained errors or was based on flawed premises.

This is the AI equivalent of writing a perfect 30-page essay answering the wrong exam question. A human mathematician would typically catch such issues early through intuition, domain experience, or a simple gut check. Aletheia, lacking any such metacognitive ability, plowed ahead with industrial determination.

The incident highlights a critical limitation of current AI reasoning systems. They can follow logical chains with superhuman endurance and precision, but they lack the higher-order judgment to step back and ask: 'Does this problem even make sense?'
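
One way to picture that kind of "gut check" is a cheap small-case test run before any long derivation is attempted. The sketch below is a toy illustration only: it is not how Aletheia works, and the claim tested (Euler's polynomial n² + n + 41 always being prime) is a textbook example chosen for this article, not one of the Erdős problems. The helper names is_prime, first_counterexample and toy_claim are hypothetical.

```python
def is_prime(n):
    """Trial-division primality test; adequate for the small values used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def first_counterexample(claim, candidates):
    """Return the first candidate for which the claim fails, or None if all pass."""
    for n in candidates:
        if not claim(n):
            return n
    return None


def toy_claim(n):
    """Toy statement: n^2 + n + 41 is prime for every n >= 0 (false in general)."""
    return is_prime(n * n + n + 41)


if __name__ == "__main__":
    # The statement holds for n = 0..39 and first fails at n = 40,
    # where 40^2 + 40 + 41 = 1681 = 41^2.
    print(first_counterexample(toy_claim, range(100)))
```

A check like this only catches statements that fail on small inputs. It says nothing about premises that are ill-posed in subtler ways, which is where human judgment still comes in.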

This blind spot isn't unique to Aletheia. OpenAI's o1 and o3 reasoning models, Anthropic's Claude with extended thinking, and other frontier systems all share this vulnerability. They optimize for forward reasoning without the capacity for the kind of skeptical, self-reflective questioning that characterizes expert human thought.

13 Out of 700: Is AI Really Accelerating Science?

The headline '13 Erdős conjectures solved' sounds transformative. But context matters enormously here.

The 13 solved problems, while genuinely impressive, were not necessarily the hardest or most consequential in the collection. The system naturally gravitated toward problems amenable to its particular strengths—those where brute-force exploration of solution spaces could yield results. The truly deep, conceptually challenging conjectures remained untouched.

Compare this to AlphaProof, DeepMind's earlier mathematical AI, which together with AlphaGeometry 2 reached silver-medal standard at the 2024 International Mathematical Olympiad. That system also demonstrated impressive problem-solving but operated in a constrained competition format with well-defined problem types.

Several critical questions emerge:

  • Are these 13 solutions genuinely novel, or do they combine and reconfigure existing mathematical techniques in ways that a sufficiently dedicated human team could have achieved?
  • Does the 1.9% success rate justify the enormous computational cost involved?
  • How many of the 'failed' attempts produced misleading or subtly incorrect results that could waste human researchers' time?
  • Can this approach scale to problems that require fundamentally new mathematical concepts rather than recombination of existing ones?

The honest answer is that AI is not yet 'doing mathematics' in any meaningful sense. It is performing extraordinarily sophisticated pattern matching and logical chain construction at scale. This is valuable, but it is categorically different from mathematical creativity.

The 'Logic Laundering' Problem in AI Research

Critics have described Aletheia's approach as 'logic laundering'—taking existing mathematical knowledge, running it through a massive computational pipeline, and producing outputs that look like original research but are fundamentally derivative.

This criticism has teeth. The system's training data includes virtually the entire corpus of published mathematics. Its 'solutions' inevitably draw on patterns, techniques, and partial results that human mathematicians have developed over decades. The question of genuine novelty versus sophisticated recombination is not easily resolved.

This pattern extends beyond mathematics. Across scientific disciplines, AI systems are increasingly being deployed to 'solve' open problems. In protein folding, AlphaFold genuinely transformed the field. In drug discovery, results have been more mixed. In mathematics, the jury is still very much out.

The risk is that AI-generated solutions create a false sense of progress. If the mathematical community begins accepting AI outputs without sufficiently rigorous human verification, subtle errors could propagate through the literature. The Aletheia team's insistence on human review for all final results suggests they are aware of this danger.

What This Means for Mathematicians and AI Researchers

For working mathematicians, Aletheia represents both a powerful tool and a cautionary tale. The system can serve as an incredibly productive 'research assistant'—generating candidate approaches, checking logical consistency, and exploring solution spaces that would take humans months or years.

But it cannot replace mathematical judgment. The embarrassing blunder of pursuing flawed problem statements illustrates that human oversight remains essential at every stage of the process.

For AI researchers, the project offers important lessons:

  • Scale alone is insufficient: Throwing more compute at mathematical reasoning produces diminishing returns without better metacognitive capabilities
  • Verification is harder than generation: The multi-stage filtering pipeline reflects the fundamental asymmetry between producing candidate solutions and confirming their validity
  • Domain expertise matters: The system's inability to detect flawed premises shows that genuine mathematical reasoning requires more than logical chain construction
  • Transparency is essential: Publishing the full pipeline, including failures, builds more trust than cherry-picked success stories

Looking Ahead: The Race to Build Mathematical AI

DeepMind is not alone in pursuing mathematical AI. OpenAI has invested heavily in formal reasoning capabilities for its o-series models. Llemma, an open math-focused model built on Meta's Code Llama, came out of EleutherAI and academic collaborators. Startups like Harmonic are building dedicated theorem-proving systems.

The next major milestone will likely be an AI system that can not only solve existing open problems but formulate genuinely new conjectures—problems that no human has thought to ask. That capability remains firmly out of reach for current architectures.

In the meantime, Aletheia's 13 solutions represent a meaningful but modest step forward. They demonstrate that AI can contribute to mathematical research at the frontier. They also demonstrate, with uncomfortable clarity, that the gap between computational power and genuine mathematical understanding remains vast.

The $50 to $5,000 bounties on Erdős's conjectures were never really about the money. They were about the glory of solving problems that pushed the boundaries of human thought. Whether an AI system deserves that glory—or merely the computational equivalent of a participation trophy—remains an open question that no amount of compute can resolve.