When Code Is Generated by AI, How Do We Test It?

💡 SmartBear VP Fitz Nowlan explores the fundamental challenges facing traditional software testing paradigms after AI agents introduce non-determinism, and why data locality and data construction are emerging as new core competitive advantages.

Introduction: A Disruptive Question

"When you don't know what's in the code, how do you test it?" This seemingly absurd question is becoming a real challenge that the entire software engineering field must confront.

In a recent technical conversation hosted by SmartBear, host Ryan sat down with Fitz Nowlan, the company's VP of AI and Architecture. Their discussion centered on a fundamental dilemma of software testing in the AI era: as LLM-driven agents begin to generate and execute code autonomously, the basic assumptions underlying traditional testing methodologies are being dismantled one by one.

The Collapse of Traditional Testing Paradigms

The classic logic of software testing is built on a straightforward premise: developers know what the code is supposed to do, and testers verify whether the code actually does it. Inputs are deterministic, outputs are predictable, and test cases can be enumerated, reproduced, and rerun as regression tests.

However, LLM-driven AI agents break this premise. Fitz Nowlan points out that when AI agents are integrated into software systems, they introduce "non-determinism": the same input, processed through a large language model, may produce different outputs, and the same task instruction may lead an agent to choose entirely different execution paths. As a result, assert-based testing, regression testing, and even integration testing all risk becoming ineffective in their traditional form.

Even more challenging is the testing problem posed by MCP (Model Context Protocol) servers. MCP, an open protocol for connecting large language models with external tools and data sources, is rapidly gaining adoption. But when an MCP server's behavior is driven by an LLM's reasoning, testers are no longer dealing with a deterministic API endpoint but with a "thinking interface": its responses depend on the model's understanding, on how the context is assembled, and even on sampling settings such as temperature.

From "Validating Code" to "Validating Behavior"

Facing this paradigm shift, Nowlan offers a key insight: we need to move from "testing code correctness" to "testing the reasonableness of system behavior."

This implies transformations on several levels:

First, redefining testing objectives. Traditional testing pursues "perfectly consistent results," while testing AI systems requires accepting "results within a reasonable range." This demands that testing frameworks introduce entirely new mechanisms such as fuzzy matching, semantic equivalence judgment, and boundary tolerance.
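A minimal sketch of what "results within a reasonable range" could look like as an assertion, assuming character-level similarity from Python's standard difflib as a crude stand-in for semantic equivalence. The function names and the 0.8 threshold are illustrative assumptions, not anything described in the conversation; a real system would more likely compare embeddings or use a judge model.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def similarity(expected: str, actual: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(expected), normalize(actual)).ratio()


def within_tolerance(expected: str, actual: str, threshold: float = 0.8) -> bool:
    """Pass when the output is close enough, instead of requiring equality.

    The threshold is an assumed, illustrative value.
    """
    return similarity(expected, actual) >= threshold


# Two phrasings that differ only in case and spacing are accepted,
# while an unrelated answer is rejected.
reference = "The order was shipped on Monday."
print(within_tolerance(reference, "the order  was shipped on monday."))
print(within_tolerance(reference, "Payment failed due to insufficient funds."))
```

The design choice here is that the test states a tolerance explicitly, so the team can argue about the threshold rather than about individual string differences.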

Second, fundamentally reinventing testing methods. When the system under test is itself non-deterministic, the significance of a single test run is greatly diminished. It is replaced by statistical testing — observing whether output distributions conform to expected patterns through large numbers of repeated executions. This is closer to hypothesis testing in scientific experiments than quality inspection in engineering.
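As a rough illustration of statistical testing, the sketch below repeats a check many times and accepts the system only if the lower end of a Wilson score interval for the pass rate clears a minimum. The helper names, the simulated agent, and the thresholds are all assumptions made for this example; the conversation does not prescribe a particular statistic.

```python
import math
import random
from typing import Callable


def wilson_lower_bound(passes: int, runs: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a pass rate.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if runs == 0:
        return 0.0
    p = passes / runs
    denom = 1 + z * z / runs
    centre = p + z * z / (2 * runs)
    margin = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs))
    return (centre - margin) / denom


def pass_rate_check(run_once: Callable[[], bool], runs: int, min_rate: float) -> bool:
    """Execute the check `runs` times and compare a conservative pass-rate
    estimate against `min_rate`, rather than trusting any single run."""
    passes = sum(1 for _ in range(runs) if run_once())
    return wilson_lower_bound(passes, runs) >= min_rate


# Hypothetical stand-in for one agent run plus its validity check:
# it succeeds about 97% of the time.
rng = random.Random(7)


def flaky_agent_run() -> bool:
    return rng.random() < 0.97


print(pass_rate_check(flaky_agent_run, runs=500, min_rate=0.90))
```

Using the interval's lower bound instead of the raw pass rate means a small number of runs cannot pass by luck alone; more runs tighten the estimate.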

Third, restructuring testing infrastructure. Testing MCP servers requires simulating complete LLM reasoning environments, including context management, tool invocation chains, and state tracking across multi-turn interactions. This places unprecedented demands on the testing toolchain.
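One way to picture such infrastructure is a small in-memory harness that records every tool call an agent makes, so a test can assert on the invocation chain rather than on exact wording. This sketch deliberately does not use any real MCP SDK: Harness, ToolCall, scripted_agent, and the tool names are all hypothetical, and the "agent" is a fixed script standing in for a model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Harness:
    """Hypothetical in-memory harness: holds stub tools and records each call."""
    tools: dict[str, Callable[..., object]] = field(default_factory=dict)
    trace: list[ToolCall] = field(default_factory=list)

    def call(self, name: str, **args: object) -> object:
        self.trace.append(ToolCall(name, dict(args)))
        return self.tools[name](**args)


def scripted_agent(harness: Harness, order_id: str) -> str:
    """Stand-in for an LLM-driven agent: a fixed script instead of a model."""
    status = harness.call("lookup_order", order_id=order_id)
    if status == "delayed":
        harness.call("notify_customer", order_id=order_id,
                     message="Your order is delayed.")
    return f"Order {order_id} is {status}."


def called_in_order(trace: list[ToolCall], *names: str) -> bool:
    """True if the named tools appear in the trace in this relative order."""
    remaining = iter(call.name for call in trace)
    return all(name in remaining for name in names)


# Stub tools replace the real back ends for the duration of the test.
harness = Harness(tools={
    "lookup_order": lambda order_id: "delayed",
    "notify_customer": lambda order_id, message: True,
})
reply = scripted_agent(harness, "A-1")
print(reply)
print(called_in_order(harness.trace, "lookup_order", "notify_customer"))
```

Asserting on the recorded trace (which tools were called, in what order, with which arguments) stays meaningful even when the model's final text varies from run to run.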

Data Locality: The Underestimated New Moat

Another thought-provoking point from the conversation is this: as source code becomes extremely easy to generate, real value is shifting toward "data locality" and "data construction."

Nowlan's logic is clear — if anyone can generate a piece of functional code in minutes with the help of AI, then the scarcity of code itself drops dramatically. What truly creates differentiation is the kind of data you possess, how closely that data aligns with business scenarios, and how you construct and organize that data for AI systems to use.

This perspective has far-reaching implications for the software industry:

  • For enterprises, the strategic value of proprietary data assets is further amplified. The ability to build high-quality, structured data pipelines deeply coupled with business operations will become a core competitive advantage in the AI era.
  • For the testing field, generating and managing test data is becoming more critical than testing the code itself. How to construct effective test datasets for non-deterministic AI systems, and how to ensure test data covers edge cases and long-tail distributions — these questions urgently need new methodologies.
  • For developers, understanding data flows and mastering data engineering skills may be more important than being proficient in any particular programming language.

Industry Implications: AI-Driven Reshaping of the Testing Toolchain

SmartBear builds software testing tools, so its focus on this topic is no coincidence. In fact, the entire testing tool market is facing a structural transformation driven by AI.

Exploration is already underway in several directions: LLM-based automatic test case generation, AI-assisted evaluation of test results, and specialized testing frameworks for AI systems. But as Nowlan points out, these efforts are still in their early stages, and the industry has yet to converge on best practices.

One noteworthy trend is that testing itself may also become "non-deterministic." Future testing systems may no longer output simple binary "pass/fail" conclusions, but instead provide confidence scores, risk assessments, and behavioral profiles, helping development teams make smarter decisions amid uncertainty.
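To make the idea concrete, here is a small sketch of a test result that carries a confidence score and a risk level instead of a single pass/fail bit. The BehaviorReport type and the 0.99 and 0.90 cut-offs are illustrative assumptions, not a format proposed in the conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BehaviorReport:
    """Summary of many runs of one behavioral check."""
    runs: int
    passes: int

    @property
    def confidence(self) -> float:
        """Observed fraction of runs whose behavior was judged acceptable."""
        return self.passes / self.runs if self.runs else 0.0

    @property
    def risk(self) -> str:
        """Coarse risk label; the cut-offs are assumed for illustration."""
        if self.confidence >= 0.99:
            return "low"
        if self.confidence >= 0.90:
            return "medium"
        return "high"


def summarize(outcomes: list[bool]) -> BehaviorReport:
    """Collapse per-run outcomes into a report a team can act on."""
    return BehaviorReport(runs=len(outcomes), passes=sum(outcomes))


report = summarize([True] * 95 + [False] * 5)
print(f"confidence={report.confidence:.2f} risk={report.risk}")
```

A report like this leaves the release decision to the team: a "medium" result might be acceptable for an internal tool and unacceptable for a payment flow.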

Outlook: Embracing Uncertainty in Software Engineering

We are witnessing a fundamental turning point in software engineering. The deterministic development paradigm built over the past half century — write code, write tests, verify, deploy — is being shaken by the rise of AI agents.

This doesn't mean testing is becoming less important. Quite the opposite — it is becoming more important and more difficult. Future software quality assurance systems will need to find a new equilibrium between determinism and non-determinism, requiring new theoretical frameworks, new toolchains, and a new engineering culture.

As Nowlan hinted in the conversation, teams and companies that can be the first to solve the challenge of "how to test code you don't understand" will gain a competitive edge in the AI era of software development. And for the industry as a whole, this revolution in testing paradigms has only just begun.