Hidden Risks in Frontier AI Companies' Internal Model Use: New Study Calls for Risk Reporting Mechanisms
An Overlooked Safety Blind Spot: Internal Model Use Risks
Discussions of AI safety tend to center on the public deployment of models. A new paper on arXiv (arXiv:2604.24966v1), however, points to a long-overlooked blind spot: the period in which frontier AI companies use a model internally, before its official release, carries risks of its own.
The paper notes that frontier AI companies typically deploy their most advanced models internally for weeks or even months before public release, using them for safety testing, evaluation, and iterative optimization. Take Anthropic as an example: the company recently developed a new model class called "Mythos Preview" with advanced cybersecurity-related capabilities. This model had been in internal use for at least six weeks before being publicly announced.
Why Does Internal Use Pose Unique Risks?
The researchers argue that the risks arising during this internal use phase are fundamentally different from those of external deployment, and that they should not be underestimated.
First, the earliest exposure to capability frontiers. Internal developers are the first group in the world to access the most powerful model capabilities. During a phase when safety measures have yet to be fully established, these models may be used to explore boundary capabilities, including assessments in high-risk domains such as cyberattacks, biosecurity, and autonomous action. Even when conducted for testing purposes, this early exposure itself constitutes a form of risk.
Second, the absence of internal safety guardrails. Publicly deployed models typically undergo multiple layers of safety alignment and usage restrictions, but internal test versions often lack these protections. Developers may interact with models in their "raw state," obtaining outputs that would be impossible after safety filtering.
Third, information asymmetry. External regulators, independent auditors, and the public know virtually nothing about this internal use phase. The model's performance in internal environments, the scope of capabilities tested, and any unexpected behaviors that may arise all remain in an information black box.
Risk Reporting Framework: From Black Box to Transparency
To address these issues, the paper proposes a risk reporting framework for developers' internal AI model use. The core elements of this framework include:
- Systematic documentation of internal use activities: Requiring frontier AI companies to maintain structured records of internal model use scenarios, testing scope, access permissions, and more, creating traceable usage logs.
- Tiered risk event reporting: Establishing a tiered reporting mechanism so that when internal testing reveals models exhibiting dangerous capabilities or anomalous behaviors beyond expectations, these are reported and assessed according to established protocols.
- Introducing external audit perspectives: Allowing independent third parties, subject to trade-secret protections, to review risk management practices during the internal use phase, so that this phase is no longer entirely closed to outside scrutiny.
- Linking to public deployment risk assessments: Ensuring that risk signals identified during internal testing phases are effectively transmitted to safety decision-making processes before public release.
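To make the four elements above more concrete, here is a minimal Python sketch of how a usage log entry and a tiered reporting rule might fit together. It is an illustration only, not the paper's specification: the tier names, the record fields, and the routing destinations (such as `external_auditor` and `release_review`) are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical severity tiers (not taken from the paper)."""
    ROUTINE = 1    # expected behavior: logged only
    ELEVATED = 2   # anomalous behavior: internal safety review
    CRITICAL = 3   # dangerous capability: escalate beyond the team


@dataclass
class UsageRecord:
    """One entry in a traceable internal-use log (fields are illustrative)."""
    model_id: str
    user_role: str               # who had access
    scenario: str                # what the model was used for
    domains_tested: list[str]    # e.g. ["cybersecurity"]
    anomaly_observed: bool = False
    dangerous_capability: bool = False
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


# Assumed routing table: where a report of each tier is sent.
ROUTING = {
    Tier.ROUTINE: ["usage_log"],
    Tier.ELEVATED: ["usage_log", "internal_safety_team"],
    Tier.CRITICAL: [
        "usage_log",
        "internal_safety_team",
        "external_auditor",   # external audit perspective
        "release_review",     # link to pre-release risk assessment
    ],
}


def classify(record: UsageRecord) -> Tier:
    """Assign a tier; dangerous capabilities outrank plain anomalies."""
    if record.dangerous_capability:
        return Tier.CRITICAL
    if record.anomaly_observed:
        return Tier.ELEVATED
    return Tier.ROUTINE


def report(record: UsageRecord) -> tuple[Tier, list[str]]:
    """Return the record's tier and the destinations it is routed to."""
    tier = classify(record)
    return tier, ROUTING[tier]
```

In this sketch every record reaches the usage log regardless of tier, which corresponds to the documentation element, while only the highest tier is routed to an outside reviewer and to the release decision. A real scheme would need finer-grained criteria than two boolean flags.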
Current Industry Practices and Gaps
Currently, major frontier AI companies including Anthropic, OpenAI, and Google DeepMind have all established varying degrees of internal safety evaluation processes. Anthropic's Responsible Scaling Policy (RSP), OpenAI's Preparedness Framework, and similar initiatives all involve capability assessments before model release.
However, the paper identifies significant shortcomings in existing frameworks: these policies primarily focus on the final judgment of whether a model is "fit for release," while devoting little attention to risk management of the internal use process itself. In other words, the industry has invested enormous effort in "threshold assessments" while neglecting the equally critical period "before the threshold."
The "Mythos Preview" case is particularly illustrative: a model with advanced cybersecurity capabilities ran internally for at least six weeks. During that time, who was using it? Which capabilities were accessed? Did it demonstrate anything concerning? Existing frameworks offer no systematic mechanism for answering these questions.
Far-Reaching Implications for AI Governance
This research carries significant implications for global AI governance discussions.
From a regulatory standpoint, AI legislation being developed by various countries mostly focuses on public deployment and commercial applications of models, lacking clear regulations for internal use during the development phase. Although the EU AI Act imposes full-lifecycle management requirements on high-risk AI systems, specific enforcement standards for the internal testing phase remain incomplete.
From an industry self-regulation perspective, this research holds up a mirror to frontier AI companies: while they emphasize "responsible AI" externally, can their internal practices withstand the same scrutiny? Establishing robust internal risk reporting mechanisms is not merely a response to external regulation but an intrinsic need of corporate risk management.
Outlook: Safety Governance Must Cover the Full Lifecycle
As the capabilities of frontier AI models continue to escalate, risks during the internal use phase will become increasingly significant. For artificial general intelligence (AGI) or AI systems with autonomous action capabilities that may emerge in the future, risk management during internal testing phases will become a matter of existential importance.
The value of this paper lies in expanding the scope of AI safety attention from "post-release" to "pre-release." It is a reminder that truly responsible AI development must cover the model's complete lifecycle, from the first internal test to final decommissioning. Along this extended chain, negligence at any single link could become the point where risk breaks through.
📌 Source: GogoAI News (www.gogoai.xin)
🔗 Original: https://www.gogoai.xin/article/frontier-ai-internal-model-use-risks-reporting-framework