Your LLM Problem Is Fundamentally a Data Problem

💡 Open Metadata co-creator Harsha Chintalapani argues that most challenges enterprises face deploying LLMs stem not from insufficient model capabilities, but from a lack of underlying data governance. Real-time, structured production data is the true bottleneck for AI implementation.

When Enterprise AI Keeps Failing, Where Does the Problem Lie?

Large language models (LLMs) are penetrating every corner of the enterprise at an unprecedented pace — from intelligent customer service to code generation, from data analysis to decision support. Yet a puzzling phenomenon is emerging: even when adopting the most advanced models, many enterprises' AI projects still underperform or fail outright.

So where exactly is the problem? Harsha Chintalapani, co-founder and CTO of Collate and co-creator of Open Metadata, offers a direct answer: "Your LLM problem is fundamentally a data problem."

In a recent in-depth interview, Chintalapani systematically explained why AI and large language models struggle with real-time, structured production data, and how enterprises should break through the bottleneck at the data governance level.

Models Aren't Omnipotent: The LLM's 'Data Blind Spots'

There is a widespread misconception in the industry that a sufficiently powerful model can solve any problem. The capabilities of top-tier models like GPT-4, Claude, and Qwen are indeed impressive, but the shortcomings they expose in enterprise-grade applications cannot be ignored.

Chintalapani pointed out that LLMs often fall short in the following scenarios:

  • Real-time data: Large models are trained on a snapshot of data with an inherent time lag, so they cannot see production data that changes moment by moment within enterprise systems. Even when external data sources are connected through a RAG (Retrieval-Augmented Generation) architecture, retrieval over chaotic underlying data returns results that are just as chaotic.

  • Structured data: Core enterprise business data is mostly stored in relational databases, data warehouses, and data lakes, with strict schemas and complex inter-table relationships. LLMs excel at processing natural language text but lack an innate ability to understand this type of highly structured data.

  • Data context: Does a field named "revenue" include tax or not? Is it monthly or annual data? How is it calculated? Unless this metadata is supplied explicitly, the model has no way of knowing.
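
To make the data-context point concrete, here is a minimal Python sketch of how column metadata could be rendered into text that accompanies a question sent to an LLM. The ColumnMeta class, the render_schema_context function, and the sales_summary table are hypothetical names chosen for illustration; they are not part of Open Metadata or any specific library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMeta:
    """Hypothetical metadata record for one column (illustrative only)."""
    name: str
    data_type: str
    description: str
    unit: str = ""
    grain: str = ""


def render_schema_context(table: str, columns: list[ColumnMeta]) -> str:
    """Turn column metadata into plain text that can be prepended to a prompt."""
    lines = [f"Table: {table}"]
    for col in columns:
        details = [col.description]
        if col.unit:
            details.append(f"unit: {col.unit}")
        if col.grain:
            details.append(f"grain: {col.grain}")
        lines.append(f"- {col.name} ({col.data_type}): " + "; ".join(details))
    return "\n".join(lines)


# Assumed example metadata; a real catalog would supply these values.
columns = [
    ColumnMeta("revenue", "DECIMAL", "net revenue after refunds",
               unit="RMB", grain="monthly"),
    ColumnMeta("region", "VARCHAR", "sales region code"),
]
context = render_schema_context("sales_summary", columns)
print(context)
```

With this text in the prompt, the questions about tax, period, and calculation method are answered by the metadata rather than guessed by the model.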

"The upper limit of a model's capability is often determined not by its parameter count, but by the quality of data it can access." This view is gaining ground among AI practitioners.

The Data Governance Deficit: The Overlooked 'Foundation Work'

If AI applications are a building, then data is the foundation. Over the past few years, industry attention has focused almost entirely on the model layer: larger parameter counts, longer context windows, stronger reasoning capabilities. Few have seriously examined whether the foundation beneath their feet is solid.

In practice, Chintalapani has observed that enterprises commonly face the following data-level issues:

1. Missing or Fragmented Metadata

A vast number of enterprises lack unified metadata management for their data assets. Critical information such as the meaning of data tables, field definitions, data lineage, and quality status is scattered across different teams' documents, wikis, and even personal notes. When LLMs need to understand this data, there is simply no reliable "instruction manual" to be found.

2. Inconsistent Data Quality

Dirty data, duplicate data, and stale data pervade production environments. An AI insight generated from erroneous data can be far more harmful than no insight at all. Chintalapani emphasized that without resolving data quality issues, even the most powerful model is just "garbage in, garbage out."
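
The "garbage in, garbage out" risk can be reduced by checking a batch before it reaches a model. The following Python sketch counts duplicate, incomplete, and stale rows and applies a simple threshold. The function names, field names, and the 30-day freshness window are assumptions made for illustration, not a prescribed standard.

```python
from datetime import datetime, timedelta


def quality_report(rows, key, required, updated_field, now, max_age):
    """Count duplicate keys, rows with missing required values, and stale rows."""
    seen = set()
    duplicates = missing = stale = 0
    for row in rows:
        if row[key] in seen:
            duplicates += 1
        seen.add(row[key])
        if any(row.get(field) is None for field in required):
            missing += 1
        if now - row[updated_field] > max_age:
            stale += 1
    return {"rows": len(rows), "duplicates": duplicates,
            "missing": missing, "stale": stale}


def passes_gate(report, max_ratio=0.0):
    """Pass only if each problem type affects at most max_ratio of the rows."""
    if report["rows"] == 0:
        return False  # an empty batch gives the model nothing to work with
    return all(report[name] / report["rows"] <= max_ratio
               for name in ("duplicates", "missing", "stale"))


# Assumed sample data with one duplicate, one missing value, one stale row.
now = datetime(2025, 1, 10)
rows = [
    {"order_id": 1, "amount": 120.0, "updated_at": datetime(2025, 1, 9)},
    {"order_id": 1, "amount": 120.0, "updated_at": datetime(2025, 1, 9)},
    {"order_id": 2, "amount": None, "updated_at": datetime(2025, 1, 9)},
    {"order_id": 3, "amount": 75.5, "updated_at": datetime(2024, 11, 1)},
]
report = quality_report(rows, "order_id", ["amount"], "updated_at",
                        now, timedelta(days=30))
print(report, passes_gate(report))
```

A strict default of zero tolerance is used here so that a failing batch is blocked rather than silently passed to the model; a real deployment would tune the threshold per dataset.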

3. Persistent Data Silos

Although centralized data platforms have been a popular concept for years, most enterprises' data remains scattered across various business systems, with no unified mechanism for discovery and access. LLMs need to integrate data across systems to answer business questions, and data silos make that difficult.

4. Insufficient Data Access and Security Governance

Connecting LLMs to production data introduces serious security challenges. Which data can the model access? Which fields contain sensitive information requiring masking? Without a robust data access governance framework, enterprises either refuse to let AI touch core data out of security concerns, or inadvertently cause data breaches.
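
As a rough illustration of field-level masking, the Python sketch below replaces sensitive values before a row is handed to a model. The tag names ("PII", "FINANCIAL"), the mask_row function, and the sample columns are hypothetical; the sketch also assumes that columns with no catalog entry should be masked by default.

```python
MASK = "***MASKED***"
SENSITIVE_TAGS = frozenset({"PII", "FINANCIAL"})  # hypothetical tag names


def mask_row(row, column_tags, allowed_tags=frozenset()):
    """Return a copy of row in which sensitive or untagged columns are masked."""
    masked = {}
    for column, value in row.items():
        if column not in column_tags:
            # Fail closed: a column the catalog does not know is hidden.
            masked[column] = MASK
            continue
        blocked = (set(column_tags[column]) & SENSITIVE_TAGS) - set(allowed_tags)
        masked[column] = MASK if blocked else value
    return masked


# Assumed catalog tags and sample row; "notes" is deliberately untagged.
column_tags = {"email": {"PII"}, "salary": {"FINANCIAL"}, "department": set()}
row = {"email": "user@example.com", "salary": 42000,
       "department": "ops", "notes": "free text"}

default_view = mask_row(row, column_tags)
finance_view = mask_row(row, column_tags, allowed_tags={"FINANCIAL"})
print(default_view)
print(finance_view)
```

Masking unknown columns by default reflects the dilemma described above: the model can still be connected to core data, while fields nobody has classified stay out of its reach.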

Open Metadata's Approach: Making Data 'Self-Describing'

As the co-creator of Open Metadata, Chintalapani isn't just raising problems — he's also working on solutions. Open Metadata is an open-source metadata platform designed to provide enterprises with unified data discovery, governance, and observability capabilities.

Its core philosophy can be summarized as: Make data capable of describing itself, so AI can truly understand it.

Specifically, this approach involves several key components:

  • Unified metadata layer: Aggregating data asset information scattered across various locations into a centralized platform, including table structures, field descriptions, data lineage, data quality metrics, and usage frequency. This provides LLMs with a "knowledge graph" for understanding enterprise data.

  • Automated data profiling: Continuously monitoring data quality through automated tools, promptly detecting anomalies, missing values, and formatting errors to ensure that data fed into AI systems is trustworthy.

  • Semantic layer construction: Adding business semantic annotations to raw data fields so that LLMs understand not only what the data "is" but also what it "means." For example, clearly annotating that a particular revenue field represents "net revenue after refunds, calculated monthly, denominated in RMB."

  • Fine-grained access control: Role- and policy-based data access controls to ensure LLMs comply with enterprise security and compliance requirements when querying data.
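
One way a unified metadata layer could serve an LLM application is by combining lineage with quality status, so that an answer built on a table can be flagged when any upstream source is untrusted. The Python sketch below uses invented lineage edges and status values; the dictionaries and function names are assumptions for illustration and do not represent the Open Metadata API.

```python
from collections import deque

# Hypothetical lineage: each table maps to the tables it is built from.
LINEAGE = {
    "revenue_dashboard": ["monthly_revenue"],
    "monthly_revenue": ["orders", "refunds"],
    "orders": [],
    "refunds": [],
}

# Hypothetical latest quality-check result per table.
QUALITY_STATUS = {
    "revenue_dashboard": "passed",
    "monthly_revenue": "passed",
    "orders": "passed",
    "refunds": "failed",
}


def upstream_tables(table, lineage):
    """Breadth-first walk returning every direct and indirect source of table."""
    found, seen = [], set()
    queue = deque(lineage.get(table, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        found.append(current)
        queue.extend(lineage.get(current, []))
    return found


def untrusted_sources(table, lineage, status):
    """List upstream tables whose latest quality check did not pass."""
    return [name for name in upstream_tables(table, lineage)
            if status.get(name) != "passed"]


print(upstream_tables("revenue_dashboard", LINEAGE))
print(untrusted_sources("revenue_dashboard", LINEAGE, QUALITY_STATUS))
```

In this sample, the dashboard itself passes its checks, yet the walk surfaces the failing refunds table two hops upstream, which is the kind of context a model cannot infer from the data alone.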

Industry Implications: From the 'Model Arms Race' to the 'Data Infrastructure Race'

Chintalapani's perspective is not an isolated one. Recently, an increasing number of industry voices have begun calling for attention to data issues in AI implementation.

Data platform giants such as Snowflake and Databricks have significantly increased their investments in metadata management and data governance. Gartner's latest report also indicates that by the end of 2025, more than 70% of enterprise AI project failures will be attributable to data quality and data governance issues rather than insufficient model capabilities.

The implications of this trend are profound:

First, when planning AI strategy, enterprises should place data governance on equal footing with model selection. Pouring resources into model fine-tuning or elaborate prompt engineering pays off less than first solidifying the data foundation.

Second, data teams and AI teams need closer collaboration. Traditionally, data engineers handle data pipelines while ML engineers handle model training, with the two teams operating independently. In the LLM era, data quality directly determines AI output quality, making deep integration between the two essential.

Third, the value of open-source data governance tools will become even more prominent. Projects like Open Metadata lower the barrier for enterprises to build metadata management capabilities, enabling small and medium-sized businesses to establish the data infrastructure needed to support AI applications.

Outlook: Data Readiness Will Become a Core Metric of AI Competitiveness

As large model capabilities become increasingly homogenized, the models themselves are becoming a form of "infrastructure" — the focus of competitive differentiation is shifting from "who has the stronger model" to "who has the better data."