
CodeGraphVLP: Breaking Through Long-Horizon Robot Manipulation Bottlenecks with Code Planning and Semantic Graphs

💡 A research team proposes the CodeGraphVLP framework, combining a Code-as-Planner module with semantic graph state representation to address critical challenges faced by Vision-Language-Action models in non-Markovian long-horizon tasks, significantly improving robustness and generalization in robot manipulation.

Introduction: The 'Shortsightedness' Dilemma of VLA Models

Vision-Language-Action (VLA) models are widely regarded as a promising direction for general-purpose robot manipulation. However, current mainstream approaches share a fundamental limitation: they are typically trained and deployed as short-horizon policies that assume the latest observation frame is sufficient to infer the next action. This Markovian assumption may hold in simple scenarios, but it frequently breaks down in long-horizon real-world tasks.

A recent arXiv paper, "CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models", introduces a framework that targets the core challenges VLA models face in non-Markovian long-horizon tasks by combining a Code-as-Planner module with a Semantic-Graph State representation.

The Core Problem: Why Do Current VLA Models 'Fail' in Long-Horizon Tasks?

To understand the value of CodeGraphVLP, we first need to clarify the three major pain points facing existing VLA models:

First, temporal occlusion of critical information. In long-horizon manipulation tasks, task-relevant evidence may appear only in the early stages of the trajectory, or become occluded during subsequent operations. For example, a robot may need to first open a drawer to observe an object's position, then close the drawer to perform other steps, and finally return to retrieve the object. If the model relies solely on the current observation, it no longer has access to the critical information it gathered earlier.

Second, the challenge of visual distractors. Real-world scenes are filled with clutter and distractors, making fine-grained visual grounding extremely fragile. A simple tabletop manipulation task may contain dozens of irrelevant objects, and the VLA model's visual encoder can easily be led astray.

Third, the planning complexity of long-horizon tasks. Decomposing a high-level instruction into dozens of sub-steps while maintaining global consistency at each step places an enormous burden on end-to-end VLA models.

Technical Approach: CodeGraphVLP's Dual-Engine Architecture

The core innovation of the CodeGraphVLP framework lies in its elegant integration of two complementary modules into a unified system:

Code-as-Planner

Unlike traditional natural language planning, CodeGraphVLP leverages large language models (LLMs) to generate task plans in the form of program code. Code is structured, executable, and logically precise, and it can express control flow such as conditional branches, loops, and exception handling.

The advantage of this approach is clear: the code planner can explicitly define preconditions and completion criteria for each subtask, naturally handle task dependencies, and perform state checks and dynamic adjustments through program logic during execution. Compared to the ambiguity of natural language descriptions, code provides unambiguous task decomposition.
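To make the idea of explicit preconditions and completion criteria concrete, here is a minimal Python sketch. It is not the paper's implementation: the `Subtask` class, the `run_plan` function, and the dictionary-based symbolic state are all hypothetical names chosen for illustration, and the `act` callables stand in for calls to a low-level policy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, bool]  # hypothetical symbolic state: fact name -> truth value


@dataclass
class Subtask:
    name: str
    precondition: Callable[[State], bool]  # must hold before the subtask starts
    is_done: Callable[[State], bool]       # completion criterion, checked after acting
    act: Callable[[State], None]           # stand-in for a low-level policy call


def run_plan(plan: List[Subtask], state: State, max_retries: int = 2) -> List[str]:
    """Run subtasks in order, checking preconditions and completion criteria."""
    log: List[str] = []
    for task in plan:
        if task.is_done(state):
            log.append(f"skip {task.name}")  # already satisfied, nothing to do
            continue
        if not task.precondition(state):
            raise RuntimeError(f"precondition failed for {task.name}")
        for _ in range(max_retries + 1):
            task.act(state)
            if task.is_done(state):
                log.append(f"done {task.name}")
                break
        else:
            # The loop never hit `break`: every attempt failed the completion check.
            raise RuntimeError(f"{task.name} did not complete")
    return log


def open_drawer(s: State) -> None:
    s["drawer_open"] = True  # simulated effect of the action


def pick_cup(s: State) -> None:
    s["holding_cup"] = True  # simulated effect of the action


plan = [
    Subtask("open_drawer", lambda s: True, lambda s: s["drawer_open"], open_drawer),
    Subtask("pick_cup", lambda s: s["drawer_open"], lambda s: s["holding_cup"], pick_cup),
]

state = {"drawer_open": False, "holding_cup": False}
log = run_plan(plan, state)
print(log)
```

The dependency between the two subtasks is carried by the precondition of `pick_cup`, so reordering the plan would fail loudly instead of silently producing a wrong action sequence.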

Semantic-Graph State Representation

This is the key to how CodeGraphVLP solves the non-Markovian problem. The research team designed a scene state representation based on a semantic graph, which encodes the objects in the environment, their attributes, and the spatial and semantic relationships between them.

Compared to raw pixels or feature vectors, semantic graphs offer the following significant advantages:

  • Information compression and persistence: Visual observations are compressed into structured semantic information that can be efficiently stored and updated across time steps, thereby breaking the Markovian constraint
  • Distractor resilience: Graph representations naturally filter out visual noise and irrelevant information, retaining only task-relevant semantic entities
  • Relational reasoning capability: Graph structures support explicit modeling of inter-object relationships, enabling compositional reasoning
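The properties listed above can be illustrated with a small data structure. The sketch below is an assumption about what such a graph could look like, not the representation used in the paper: `Node`, `SemanticGraph`, `observe`, and `related` are hypothetical names, and detections are passed in by hand where a real system would use a perception model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Set, Tuple


@dataclass
class Node:
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    last_seen: int = -1  # time step of the most recent observation


@dataclass
class SemanticGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    # Each edge is a (subject, relation, object) triple.
    edges: Set[Tuple[str, str, str]] = field(default_factory=set)

    def observe(
        self,
        step: int,
        name: str,
        attributes: Optional[Dict[str, str]] = None,
        relations: Sequence[Tuple[str, str]] = (),
    ) -> None:
        """Merge one detection; objects that are not observed keep their old state."""
        node = self.nodes.setdefault(name, Node(name))
        node.attributes.update(attributes or {})
        node.last_seen = step
        if relations:
            # Newly observed relations replace older ones for the same subject.
            self.edges = {e for e in self.edges if e[0] != name}
            self.edges.update((name, rel, obj) for rel, obj in relations)

    def related(self, name: str, relation: str) -> List[str]:
        """Return the objects linked to `name` by `relation`."""
        return sorted(o for s, r, o in self.edges if s == name and r == relation)


graph = SemanticGraph()
# Step 0: the drawer is open and the key inside it is visible.
graph.observe(0, "drawer", {"state": "open"})
graph.observe(0, "key", {"color": "red"}, [("inside", "drawer")])
# Step 1: the drawer has been closed, so only the drawer is detected.
graph.observe(1, "drawer", {"state": "closed"})

print(graph.related("key", "inside"))
print(graph.nodes["key"].last_seen)
```

After the second step the key is occluded, yet the graph still records that it is inside the drawer and that it was last seen at step 0. This is the persistence property in miniature.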

Synergy Between the Two Modules

In practice, the code planner handles high-level task decomposition and execution flow control, while the semantic graph state serves as a "memory system," providing complete scene context for each decision step. When the code planner needs to determine whether a subtask is complete or decide on the next operation, it can query the semantic graph for current and historical scene state information.

This architecture achieves a decoupling of planning and perception: the planning layer operates in an abstract semantic space, unaffected by low-level visual noise; the perception layer focuses on converting raw visual input into structured semantic representations, providing a reliable information foundation for the planning layer.
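The division of labor can be sketched with the drawer scenario from earlier. Everything here is hypothetical and simplified: `perceive`, `next_action`, `reactive_action`, and the frame format are illustrative assumptions, and a flat dictionary stands in for the semantic graph.

```python
from typing import Dict, List

Frame = List[Dict[str, str]]  # hypothetical detector output: one dict per detected object
Memory = Dict[str, str]       # object name -> last known location

TASK_OBJECTS = {"key", "drawer"}  # entities the plan refers to


def perceive(frame: Frame, memory: Memory) -> Memory:
    """Perception layer: keep task-relevant detections, ignore distractors."""
    for det in frame:
        if det["name"] in TASK_OBJECTS:
            memory[det["name"]] = det["location"]
    return memory


def next_action(memory: Memory) -> str:
    """Planning layer: decides from the symbolic memory only, never from pixels."""
    key_location = memory.get("key")
    if key_location is None:
        return "search"
    return f"retrieve key from {key_location}"


def reactive_action(frame: Frame) -> str:
    """Baseline that sees only the current frame (the Markovian assumption)."""
    for det in frame:
        if det["name"] == "key":
            return f"retrieve key from {det['location']}"
    return "search"


frames = [
    # Drawer open: the key is visible, along with a distractor mug.
    [{"name": "key", "location": "drawer"}, {"name": "mug", "location": "table"}],
    # Drawer closed: the key is occluded.
    [{"name": "mug", "location": "table"}, {"name": "drawer", "location": "cabinet"}],
]

memory: Memory = {}
for frame in frames:
    perceive(frame, memory)

print(reactive_action(frames[-1]))
print(next_action(memory))
```

On the final frame the reactive baseline can only fall back to searching, while the planner reads the key's last known location from memory. The distractor mug never enters the memory at all.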

Technical Analysis: Why This Approach Deserves Attention

A Systematic Response to the Non-Markovian Problem

In robotics and reinforcement learning, handling non-Markovian tasks is a long-standing challenge. Previous solutions typically relied on recurrent neural networks (such as LSTMs) or Transformer attention mechanisms to model historical dependencies implicitly. However, on long sequences these methods often suffer from vanishing gradients, excessive computational overhead, or attention dilution.

CodeGraphVLP adopts the semantic graph as an explicit state representation, offering a more elegant and interpretable solution path. The semantic graph can be viewed as a form of "external memory" whose updates and queries carry clear semantic meaning, facilitating debugging and verification.
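One way to state the contrast more precisely, using notation that is assumed here and not taken from the paper: let $o_t$ be the observation, $a_t$ the action, and $G_t$ the semantic graph at step $t$, with $f$ a hypothetical graph-update function.

```latex
\begin{align*}
\text{Markovian policy:} \quad & a_t \sim \pi(a_t \mid o_t) \\
\text{Implicit history:} \quad & a_t \sim \pi(a_t \mid o_{1:t}) \\
\text{Explicit graph memory:} \quad & G_t = f(G_{t-1}, o_t), \qquad a_t \sim \pi(a_t \mid o_t, G_t)
\end{align*}
```

In the third form the policy input stays bounded in size as $t$ grows, while $G_t$ summarizes what earlier observations revealed in a form that can be inspected directly.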

Reliability and Scalability of Code Planning

Using code as a planning medium is not entirely new: prior work such as Google's Code as Policies has explored this direction. However, CodeGraphVLP's innovation lies in deeply integrating code planning with structured scene representation, enabling the code not only to call low-level action primitives but also to directly manipulate and query the semantic graph, achieving a true "perception-planning closed loop."

Moreover, the modular nature of code means that new manipulation skills can be encapsulated as functions and reused, providing a natural interface for continuous learning and capability expansion.
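As a rough illustration of skills as reusable functions, the sketch below registers skills in a dictionary and composes a new skill from existing ones. The `SKILLS` registry, the `skill` decorator, and the string-returning skill bodies are hypothetical; a real system would dispatch to robot controllers instead of returning strings.

```python
from typing import Callable, Dict

SKILLS: Dict[str, Callable[..., str]] = {}  # hypothetical skill registry


def skill(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function as a reusable skill under its own name."""
    SKILLS[fn.__name__] = fn
    return fn


@skill
def pick(obj: str) -> str:
    return f"pick({obj})"


@skill
def place(obj: str, target: str) -> str:
    return f"place({obj}, {target})"


@skill
def move(obj: str, target: str) -> str:
    # A composite skill built from two skills that are already registered.
    return " -> ".join([SKILLS["pick"](obj), SKILLS["place"](obj, target)])


trace = SKILLS["move"]("cup", "shelf")
print(trace)
```

Because generated plan code only needs the skill names and signatures, adding a capability amounts to registering one more function.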

Research interest in VLA models has been surging in both academia and industry. From Google DeepMind's RT series to RoboFlamingo from ByteDance Research and its academic collaborators, various groups are exploring how to make large models truly drive robot manipulation. CodeGraphVLP's work points to an important insight: simply scaling up model size or training data may not be sufficient to solve long-horizon manipulation problems, and introducing structured intermediate representations and explicit planning mechanisms is equally critical.

Potential Impact and Limitations

From an application perspective, CodeGraphVLP's framework design holds significant reference value for scenarios requiring long-horizon complex manipulation, such as home service robots, industrial assembly, and warehouse logistics. In these scenarios, tasks often involve dozens of steps, object states change frequently, and environments contain numerous distractors.

Of course, the approach also faces some unresolved challenges. The construction and maintenance of semantic graphs depend on reliable object detection and tracking systems, which remain an active research topic in open-world environments. Additionally, the quality of the code planner's output depends on the capabilities of the underlying LLM, and its generalization ability when facing out-of-distribution tasks still requires validation.

Outlook: Toward Truly General-Purpose Robot Intelligence

CodeGraphVLP's research provides a clear technical roadmap for the evolution of VLA models: from "shortsighted reactive policies" to "cognitive systems with memory and planning capabilities." As LLMs continue to improve in code generation and 3D scene understanding technologies advance, this "code planning + semantic graph state" paradigm is poised to become a critical building block for next-generation robot operating systems.

At the intersection of Artificial General Intelligence (AGI) and Embodied AI, enabling robots to not only "see" but also "remember" and "think ahead" will be the key to determining whether robots can truly enter everyday life. CodeGraphVLP has taken a solid step in this direction.