AI Aces STEM but Fails This One Liberal Arts Question

📁 Opinion · ⏱️ 12 min read
💡 AI models now outscore top human students in science and math, yet a simple humanities question exposes a critical weakness.

AI Crushes Human Top Scorers in Science — Then Hits a Wall

Artificial intelligence models from OpenAI, Google, and Anthropic now routinely outscore the highest-performing human students on standardized science and math exams. Yet a seemingly simple liberal arts question — one requiring genuine lived experience, emotional nuance, and cultural self-awareness — has exposed what may be AI's most fundamental limitation.

The revelation has sparked a global conversation about what incoming college students should actually study in the age of AI, and where human intelligence still holds an unassailable advantage. As models like GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro achieve near-perfect scores on physics, chemistry, and calculus exams, the 'soft skills' gap is becoming the most important story in education and workforce planning.

Key Takeaways

  • Leading AI models score in the 95th–99th percentile on STEM standardized tests, ahead of nearly all human test-takers
  • A single humanities-style prompt — asking for personal reflection rooted in lived experience — caused multiple top models to produce generic, unconvincing responses
  • The gap highlights AI's core weakness: it can compute but cannot truly experience
  • Students entering college in 2025 may benefit more from doubling down on humanities, ethics, and creative disciplines than from competing with AI on pure technical skills
  • Employers are increasingly valuing 'human-proof' competencies like empathy, cultural fluency, and narrative reasoning
  • The finding aligns with broader research showing LLMs struggle with subjective judgment, ambiguity, and authentic self-expression

AI's STEM Dominance Is Now Undeniable

The numbers are staggering. OpenAI's GPT-4o scores above the 90th percentile on the AP Physics, AP Calculus, and AP Chemistry exams. Google's Gemini 1.5 Pro has demonstrated similar dominance on international math olympiad problems. Anthropic's Claude 3.5 Sonnet consistently aces graduate-level science benchmarks.

In China, where the annual Gaokao college entrance exam is a high-stakes national event, AI models have been tested against provincial top scorers — the equivalent of valedictorians. The results are sobering for human competitors. Multiple AI systems scored higher than the top 0.1% of human test-takers in mathematics, physics, and chemistry sections.

This isn't a marginal advantage. On many quantitative benchmarks, the gap between AI and the best human performers is now measured in double-digit percentage points. The era of AI as a 'pretty good calculator' is over. These systems genuinely outperform elite human minds on structured STEM problems.

The Liberal Arts Question That Broke AI

Then came the humanities prompt. The question, which has circulated widely on Chinese social media and education forums, asked test-takers to write a reflective essay drawing on personal experience, emotional growth, and cultural identity. It required the writer to connect a specific life event to a broader philosophical or social insight — the kind of task that rewards authenticity, vulnerability, and genuine self-knowledge.

Every major AI model produced responses that were technically fluent but emotionally hollow. The essays read like well-structured Wikipedia summaries of what a human might feel, rather than genuine expressions of lived experience. Evaluators — both human graders and AI researchers — noted several consistent failures:

  • No authentic personal narrative: AI fabricated plausible-sounding but generic 'experiences' that lacked specificity and emotional weight
  • Surface-level cultural engagement: Models referenced cultural concepts without demonstrating genuine understanding or personal connection
  • Formulaic emotional arcs: Every AI response followed a predictable structure — challenge, reflection, growth — without the messiness of real human development
  • Inability to surprise: Human top scorers produced unexpected insights and unique perspectives; AI responses were competent but predictable
  • Lack of genuine voice: The essays could have been written by any model, about any person, in any context — they had no distinctive authorial presence

This failure isn't a bug that can be patched with more training data. It reflects a structural limitation of large language models: they predict statistically likely next tokens based on patterns in training data. They have never experienced joy, loss, confusion, or wonder. They can describe these states with impressive accuracy, but they cannot write from within them.

Why This Matters More Than Any Benchmark Score

The implications extend far beyond exam scores. As AI continues to automate technical and analytical tasks across industries, the skills that remain uniquely human become considerably more valuable. This is not a theoretical argument: it is already reshaping hiring practices at major companies.

McKinsey's 2024 workforce report estimated that demand for social-emotional skills in the workplace will grow by 26% by 2030, while demand for basic cognitive skills (the kind AI excels at) will decline by 14%. Similarly, the World Economic Forum's Future of Jobs Report ranks analytical thinking, creative thinking, and curiosity among the top skills for 2025, placing human-centric skills on equal footing with technical ones.

The liberal arts question didn't just expose an AI weakness. It illuminated what might be the most important career insight for the class of 2025: in a world where machines can solve differential equations faster than any human, the ability to write a genuinely moving essay about your grandmother's garden may be the more valuable skill.

The Incoming Freshman's 'Counter-Attack Guide'

Education commentators have begun circulating what they call a 'counter-attack guide' for students entering college this fall. The core advice represents a dramatic inversion of the conventional wisdom that STEM majors guarantee career security. Key recommendations include:

  • Double major strategically: Pair a technical field with humanities, philosophy, or creative writing to build a skill set AI cannot replicate
  • Develop your authentic voice: Practice personal essay writing, journaling, and reflective thinking — these are the competencies AI fails at most consistently
  • Study ethics and philosophy: As AI deployment accelerates, organizations desperately need people who can reason about values, fairness, and unintended consequences
  • Build cross-cultural fluency: Deep understanding of specific cultural contexts — not surface-level 'diversity awareness' — remains a uniquely human strength
  • Learn to collaborate with AI, not compete against it: Use AI tools for computation, research, and drafting, while focusing your own development on judgment, taste, and emotional intelligence
  • Invest in embodied experiences: Travel, volunteer work, artistic practice, and community engagement build the kind of knowledge that no dataset can provide

This guidance aligns with recent moves by major universities. Stanford expanded its undergraduate humanities requirements in 2024. MIT now requires all engineering students to complete coursework in ethics and social science. Harvard's most popular course remains a psychology class focused on happiness and human flourishing — not a computer science offering.

The Technical Explanation: Why LLMs Struggle With Subjectivity

From a technical perspective, the failure is well understood. Large language models are trained on vast corpora of text using next-token prediction. They learn statistical patterns: which words, phrases, and structures tend to follow one another. This makes them extraordinarily good at tasks with clear patterns and objective answers, such as math, coding, scientific reasoning, and factual recall.

But subjective, experience-based writing has no 'correct' answer to converge on. The best human essays on personal topics are great precisely because they are unexpected: they break patterns rather than follow them. A model optimized to produce statistically likely continuations will tend toward the typical, and typical is the opposite of what such an essay rewards.
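To make the 'most likely next token' idea concrete, here is a minimal sketch in Python. It is a toy, not how any production model works: real LLMs use neural networks over huge vocabularies and usually sample with some randomness rather than always taking the top choice. The four-sentence corpus and the helper names (train_bigrams, greedy_decode, next_token_probs) are invented purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: three "lessons learned", one of them repeated.
CORPUS = [
    "i learned that hard work pays off",
    "i learned that hard work pays off",
    "i learned that family matters most",
    "i learned that silence can be loud",
]

def train_bigrams(sentences):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn raw follow-up counts into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: n / total for tok, n in counts[prev].items()}

def greedy_decode(counts, max_len=12):
    """Always pick the single most frequent next token."""
    token, out = "<s>", []
    for _ in range(max_len):
        token = counts[token].most_common(1)[0][0]
        if token == "</s>":
            break
        out.append(token)
    return " ".join(out)

counts = train_bigrams(CORPUS)
print(next_token_probs(counts, "that"))
# {'hard': 0.5, 'family': 0.25, 'silence': 0.25}
print(greedy_decode(counts))
# i learned that hard work pays off
```

In this toy setup the greedy decoder reproduces the most common sentence and never reaches the rarer "silence can be loud" continuation, even though that line is in the data. Sampling from the distribution would surface it some of the time, but the common phrasing would still dominate.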

Researchers at DeepMind and Anthropic have published papers noting this limitation. Even techniques like RLHF (Reinforcement Learning from Human Feedback) and constitutional AI can improve tone and safety but cannot inject genuine experience into a model's outputs. The problem is not alignment — it is ontological. The model has no self to express.

On benchmarks like MMLU or HumanEval, GPT-4-class models score above 85%. Performance on subjective writing evaluation, by contrast, remains stubbornly inconsistent, and there is no clear path to solving this with scale alone.

Looking Ahead: The Human Premium in an AI-Saturated World

The coming decade will likely see AI's STEM capabilities continue to accelerate. OpenAI's o3 model and its successors are expected to achieve near-perfect scores on virtually all standardized scientific assessments by 2026. Google's Gemini 2.0 roadmap includes explicit goals for mathematical reasoning at the research level.

But the liberal arts gap shows no signs of closing. If anything, as AI-generated content floods the internet, authentically human writing and thinking will become rarer — and therefore more valuable. The premium on genuine human insight, cultural depth, and emotional authenticity is set to rise, not fall.

For students, professionals, and organizations, the strategic implication is clear. The question is no longer 'Can AI do this task?' — for most technical tasks, the answer is increasingly yes. The question is 'What can I do that AI cannot?' And right now, the answer looks a lot like the humanities.

The AI that aced quantum physics stumbled on a question about what it means to be human. That may be the most important exam result of 2025.