Multimodal Agents Aim to Transform Smartphone UX

💡 OPPO is set to reveal a multimodal companion agent that continuously perceives screen activity, builds personalized memory, and proactively assists users throughout their phone sessions.

OPPO Unveils Always-On Multimodal Agent for Smartphones

OPPO is set to reveal a new multimodal companion agent that fundamentally rethinks how users interact with their smartphones — shifting from passive, wake-word-triggered assistants to an always-on AI that continuously perceives, remembers, and proactively acts on behalf of the user. The technology will be detailed by OPPO senior algorithm engineer Liu Peng at the upcoming AICon Global AI Development and Application Conference in Shanghai on June 26–27, 2025.

Unlike conventional smartphone assistants such as Apple's Siri or Google Assistant — which respond only when explicitly summoned — OPPO's system ingests a real-time temporal video stream of the phone's screen to build a persistent understanding of user behavior. The result is an agent that doesn't just answer questions but genuinely 'accompanies' users throughout their entire mobile experience.

Key Takeaways

  • Always-on perception: The agent uses the phone's screen video stream as its primary input, enabling continuous awareness of user activity without explicit wake commands.
  • Personalized memory: A dedicated memory architecture allows the agent to accumulate user preferences, habits, and context over time.
  • Proactive execution: Rather than waiting for instructions, the agent can anticipate needs and take actions autonomously.
  • Three core algorithms: Screen multimodal understanding, persistent contextual memory, and intelligent action planning form the technical backbone.
  • Engineering-ready: The presentation focuses on moving agent systems from demo prototypes to production-grade deployments.
  • Industry momentum: Over 50 leading Chinese tech companies — including Tencent, Alibaba, Huawei, and Kuaishou — are converging at AICon to discuss real-world agent deployment.

From Wake-Word to Companion: Why the Paradigm Shift Matters

Smartphones are among the most personal computing devices people own. By commonly cited estimates, the average user spends more than 4 hours per day on their phone, generating a rich stream of behavioral data across messaging, browsing, shopping, and entertainment. Yet today's AI assistants remain fundamentally reactive: they activate only when a user says 'Hey Siri' or 'OK Google,' handle a single query, and then go dormant.

This single-turn interaction model represents a massive missed opportunity. Users constantly switch between apps, consume content, and make decisions that an intelligent agent could enhance — if only it were paying attention. OPPO's approach flips this model entirely by treating the screen itself as a continuous sensor.

The implications are significant for both user experience and the competitive landscape. As Apple Intelligence, Google Gemini, and Samsung Galaxy AI race to embed large language models into smartphones, the company that cracks persistent, context-aware assistance could gain a decisive edge in the next generation of mobile computing.

How the System Works: Screen-First Multimodal Architecture

At the heart of OPPO's approach is a technical architecture that processes the phone's screen as a temporal video stream. Rather than relying on app-level APIs or accessibility hooks, the system 'watches' what the user sees — much like a human looking over someone's shoulder, but with machine-level precision and memory.
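
To make the idea of treating the screen as a temporal video stream concrete, here is a minimal Python sketch of one plausible preprocessing step: dropping near-duplicate frames so that a downstream model only sees screens that actually changed. This is an illustrative assumption, not OPPO's published method. The `Frame` type, `changed_fraction`, `select_keyframes`, and both thresholds are hypothetical names and values chosen for the example.

```python
from typing import Iterable, Iterator, List, Optional

# Hypothetical representation: a flattened grayscale frame, one int (0-255) per pixel.
Frame = List[int]


def changed_fraction(prev: Frame, curr: Frame, pixel_tol: int = 8) -> float:
    """Return the fraction of pixels whose value moved by more than pixel_tol."""
    if len(prev) != len(curr):
        return 1.0  # treat a resolution change as a full change
    if not curr:
        return 0.0
    changed = sum(1 for a, b in zip(prev, curr) if abs(a - b) > pixel_tol)
    return changed / len(curr)


def select_keyframes(frames: Iterable[Frame], min_change: float = 0.05) -> Iterator[int]:
    """Yield indices of frames that differ enough from the last kept frame."""
    last_kept: Optional[Frame] = None
    for index, frame in enumerate(frames):
        if last_kept is None or changed_fraction(last_kept, frame) >= min_change:
            last_kept = frame
            yield index


stream = [
    [0, 0, 0, 0],      # initial screen
    [0, 0, 0, 2],      # tiny flicker, below the per-pixel tolerance
    [255, 255, 0, 0],  # user switched screens
    [255, 255, 0, 0],  # static screen, nothing new to analyze
]
keyframes = list(select_keyframes(stream))
```

In this toy stream only the first frame and the screen switch are kept. A real pipeline would presumably tune such thresholds against battery and latency budgets.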

Liu Peng's presentation will detail three core algorithmic pillars:

  • Screen multimodal understanding: The system parses visual, textual, and interactive elements on the screen in real time, building a semantic understanding of what the user is doing at any given moment. This goes beyond simple OCR — it involves understanding UI layouts, content hierarchies, and interaction patterns.
  • Personalized memory and context accumulation: Unlike stateless assistants that forget everything after each interaction, this agent maintains a persistent memory graph. It learns user preferences, tracks recurring behaviors, and builds a longitudinal profile that improves over time.
  • Proactive action planning and execution: Armed with continuous perception and rich memory, the agent can identify opportune moments to intervene — suggesting relevant information, automating repetitive tasks, or surfacing contextual recommendations without being asked.

This architecture draws on advances in vision-language models (VLMs), temporal sequence modeling, and on-device inference optimization. Running such a system on a mobile device, with limited compute and battery capacity and tight thermal constraints, makes the engineering dimension particularly compelling.
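
The three pillars described above (perceive, remember, act) can be sketched as a simple loop. The Python below is a toy illustration under stated assumptions, not the architecture OPPO will present: `Observation` stands in for the output of a screen-understanding model, `Memory` is a plain counter rather than a memory graph, and `habit_threshold` is an invented parameter.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Observation:
    """Hypothetical output of the screen-understanding stage."""
    app: str
    activity: str


@dataclass
class Memory:
    """Toy persistent memory: counts how often each (app, activity) pair occurs."""
    counts: Counter = field(default_factory=Counter)

    def update(self, obs: Observation) -> None:
        self.counts[(obs.app, obs.activity)] += 1

    def seen(self, obs: Observation) -> int:
        return self.counts[(obs.app, obs.activity)]


def plan(obs: Observation, memory: Memory, habit_threshold: int = 3) -> Optional[str]:
    """Suggest help only once an activity has recurred often enough to look like a habit."""
    if memory.seen(obs) >= habit_threshold:
        return f"Offer a shortcut for '{obs.activity}' in {obs.app}"
    return None


def run(observations: List[Observation]) -> List[Optional[str]]:
    """Perceive-remember-act loop: update memory first, then decide whether to act."""
    memory = Memory()
    suggestions: List[Optional[str]] = []
    for obs in observations:
        memory.update(obs)
        suggestions.append(plan(obs, memory))
    return suggestions


session = [Observation("Maps", "search commute route")] * 3 + [
    Observation("Mail", "read inbox")
]
suggestions = run(session)
```

Here the agent stays silent for the first two route searches and only proposes a shortcut on the third, which is the basic shape of "proactive but not intrusive" behavior.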

The Engineering Gap: Moving Agents from Demo to Production

One of the central themes at AICon Shanghai is the stark gap between impressive agent demos and reliable production systems. Liu Peng's talk falls under the conference's 'Agent System Architecture and Engineering Practice' track, which directly addresses this challenge.

The agent community has seen no shortage of viral demos, from AutoGPT to various autonomous browsing agents, but few have met the reliability, latency, and safety requirements of consumer deployment. Smartphone agents face an even higher bar:

  • Latency: Users expect sub-second responses; any perceptible delay breaks the companion illusion.
  • Privacy: Continuously processing screen content raises significant data protection concerns, especially under regulations like GDPR and China's PIPL.
  • Battery and compute: Running multimodal models on-device demands aggressive optimization — quantization, model distillation, and efficient attention mechanisms are all critical.
  • Safety and trust: An agent that can proactively execute actions must have robust guardrails to prevent unintended consequences.

These engineering challenges help explain why most smartphone AI features today remain limited to single-turn, often cloud-backed interactions. OPPO's willingness to share production-level insights suggests the company has made meaningful progress on these fronts.
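
As a rough illustration of the quantization mentioned in the list above, the following Python sketch applies symmetric per-tensor 8-bit quantization to a small list of weights. Storing weights as 8-bit integers instead of 32-bit floats cuts weight storage roughly fourfold, at the cost of a bounded rounding error. The function names and sample weights are assumptions for illustration; the article does not say which scheme OPPO uses.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Production toolchains typically add per-channel scales and calibration data, but the trade-off is the same: smaller, faster models in exchange for a small, controlled loss of precision.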

Industry Context: The Global Race for On-Device AI Agents

OPPO's work sits within a rapidly accelerating global trend. Apple introduced Apple Intelligence at WWDC 2024, emphasizing on-device processing and privacy-first AI. Google has been embedding Gemini Nano directly into Pixel devices for on-device summarization and smart replies. Samsung's Galaxy AI suite, powered in part by Google's models, offers real-time translation and photo editing.

However, none of these implementations has achieved the 'always-on companion' paradigm that OPPO describes. Most of these features remain tool-like: powerful but episodic. The concept of a persistent agent that continuously monitors screen activity and builds long-term user memory represents a qualitative leap beyond current offerings.

Chinese tech companies have been particularly aggressive in pushing the boundaries of on-device AI. Huawei's HarmonyOS has integrated increasingly sophisticated AI capabilities, while Xiaomi and Vivo have announced their own large-model strategies for smartphones. The presence of executives from Tencent, Alibaba, Huawei, Kuaishou, and Fliggy at AICon underscores the breadth of industry investment in agent technologies across China's tech ecosystem.

Compared to the Western approach — where Apple and Google emphasize privacy and restraint — Chinese manufacturers appear more willing to experiment with deeply integrated, proactive AI systems. This divergence could produce meaningfully different user experiences across markets.

What This Means for Developers and the Mobile Ecosystem

For mobile developers, the shift toward companion agents has profound implications. If the phone's OS-level agent can perceive and understand app content through screen analysis, it effectively creates a new interaction layer above individual apps. This could disrupt traditional app distribution models, reduce time spent in specific apps, and create new opportunities for developers who build agent-compatible experiences.

For AI engineers, the technical stack required — real-time VLMs, on-device inference, persistent memory systems, and proactive planning — represents a convergence of multiple research frontiers. The demand for engineers who can bridge multimodal AI research and mobile systems engineering is likely to surge.

Key implications include:

  • New UX paradigms: Designers will need to rethink notification systems, permission models, and user control mechanisms for always-on agents.
  • Privacy architecture: On-device processing becomes non-negotiable; cloud-dependent approaches will face user and regulatory resistance.
  • Model efficiency: The premium on small, fast, accurate multimodal models will intensify, driving investment in quantization, pruning, and architecture search.
  • Platform power: OS-level agents could shift power from app developers to platform owners, echoing historical patterns seen with app stores and browser defaults.
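
One way to picture the permission and user-control mechanisms raised in the list above is a small gate that every proposed action must pass before it runs. The Python sketch below is a hypothetical design, not a description of any shipping system: the `Risk` levels, the `Decision` outcomes, and the per-app allow list are all assumptions made for the example.

```python
from enum import Enum
from typing import Set


class Risk(Enum):
    LOW = 1   # e.g. surfacing a suggestion card
    HIGH = 2  # e.g. sending a message or making a payment


class Decision(Enum):
    EXECUTE = "execute"
    ASK_USER = "ask_user"
    BLOCK = "block"


def gate(risk: Risk, app: str, allowed_apps: Set[str]) -> Decision:
    """Decide whether a proposed agent action may run, needs confirmation, or is refused."""
    if app not in allowed_apps:
        return Decision.BLOCK      # the user never granted the agent access to this app
    if risk is Risk.LOW:
        return Decision.EXECUTE    # low-risk actions can run without interrupting the user
    return Decision.ASK_USER       # anything riskier requires explicit confirmation


user_allowed = {"Calendar", "Maps"}
low_risk_decision = gate(Risk.LOW, "Calendar", user_allowed)
high_risk_decision = gate(Risk.HIGH, "Maps", user_allowed)
blocked_decision = gate(Risk.LOW, "Banking", user_allowed)
```

Keeping the gate separate from the planner means the rules for what the agent may do are easy to inspect and to expose in settings, whatever model produces the proposals.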

Looking Ahead: The Companion Agent Era

OPPO's presentation at AICon Shanghai offers a preview of what could become the dominant smartphone interaction model within the next 2–3 years. The trajectory from reactive assistants to proactive companions mirrors the broader evolution of AI systems — from tools that respond to prompts toward agents that understand context, maintain memory, and take autonomous action.

The key question is whether users will embrace this level of AI integration into their most personal device. Trust, transparency, and control will be decisive factors. An agent that proactively helps without feeling intrusive — and that earns user confidence through consistent, safe behavior — could become indispensable. One that oversteps will face swift rejection.

AICon Shanghai, with its focus on moving agents from prototype to production, arrives at a critical inflection point for the industry. The conference runs June 26–27 and features over 50 speakers from leading technology companies and research institutions. For those tracking the future of AI-powered mobile experiences, OPPO's session on multimodal companion agents is one to watch closely.

The smartphone — already the most ubiquitous computing platform in history — may be on the verge of its most significant interaction redesign since the introduction of the touchscreen.