
DeepSeek Gave AI a Cyber Finger — And Now It Can Truly See

💡 While OpenAI, Google, and Anthropic compete on visual resolution, DeepSeek has taken a different path. Through visual grounding and pointing-based understanding mechanisms, DeepSeek enables AI not just to "see clearly" but to "see with understanding" — redefining the boundaries of multimodal AI.

An Arms Race Over "Seeing"

The competition in multimodal AI is reaching a fever pitch. OpenAI's GPT-4o continues to improve image comprehension accuracy, Google's Gemini series touts its native multimodal architecture, and Anthropic's Claude keeps doubling down on visual capabilities. The three giants share a remarkably similar playbook — make AI "see more clearly." Higher resolution, larger visual encoders, more image-text training data — as if sharper eyes alone could enable AI to understand the world.

But DeepSeek posed a different question: Does seeing clearly actually mean seeing with understanding?

The answer is clearly no. A person with 20/20 vision standing before an abstract painting doesn't necessarily understand it better than a bespectacled art critic. DeepSeek's research team started from this simple intuition and embarked on a technical path radically different from the mainstream.

From "Eyes" to "Fingers": A Critical Metaphorical Shift

In the process of learning to perceive the world, human infants exhibit a key behavior that developmental psychology has documented repeatedly: pointing. At around 9 to 12 months of age, infants begin pointing at objects. This is not merely a means of communication but a hallmark of a cognitive leap. Through the act of pointing, infants anchor their attention on specific objects, building a bridge between language and vision.

DeepSeek's multimodal research is essentially about equipping AI with this "cyber finger."

In its Janus series of multimodal models and related research, DeepSeek has explored a core capability known as "Visual Grounding." Traditional multimodal models encode an entire image into features and hand them to a language model without tying any word in the answer to a specific region. DeepSeek's approach instead enables AI to "point out" the object it's discussing within an image, using bounding box coordinates, region segmentation annotations, and spatial relationship descriptions.

Put simply, when you ask AI "who is smiling in this picture," a traditional model might say "the woman on the left is smiling." DeepSeek's model not only provides the answer but can precisely "show you" — highlighting that person's position in the image and even annotating the facial region where the smile appears.
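To make the difference concrete, here is a minimal sketch of what a grounded answer could look like as data. The `Box` and `GroundedAnswer` types and the pixel values are hypothetical, chosen only for illustration; they are not DeepSeek's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixels: (x1, y1) is top-left, (x2, y2) is bottom-right."""
    x1: int
    y1: int
    x2: int
    y2: int


@dataclass(frozen=True)
class GroundedAnswer:
    """Hypothetical schema: a text answer plus the image regions that support it."""
    text: str
    regions: dict[str, Box]  # label -> region in the image


def describe(answer: GroundedAnswer) -> str:
    """Render the answer followed by one line per referenced region."""
    lines = [answer.text]
    for label, b in answer.regions.items():
        lines.append(f"{label}: [{b.x1}, {b.y1}, {b.x2}, {b.y2}]")
    return "\n".join(lines)


# Made-up coordinates for the "who is smiling" example.
answer = GroundedAnswer(
    text="The woman on the left is smiling.",
    regions={
        "person": Box(40, 60, 220, 400),
        "smile": Box(110, 150, 170, 185),
    },
)
print(describe(answer))
```

A text-only model would return just the first line; the grounded version also carries the regions a viewer could highlight.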

Technical Breakdown: How DeepSeek Does It

Decoupled Visual Encoding: Separating Understanding from Generation

DeepSeek made a bold design decision in the Janus architecture: completely decoupling the encoding paths for visual understanding and visual generation. Traditional approaches typically share a single visual encoder for both "image captioning" and "text-to-image" tasks, but DeepSeek argues these two tasks have fundamentally different demands on visual information.

Understanding tasks require high-level semantic features — what is this object, what is it doing, how does it relate to its surroundings. Generation tasks require low-level detail features — texture, lighting, color distribution. Mixing the two actually causes mutual interference.

This decoupled design allows the understanding pathway to focus purely on "comprehension" without being constrained by the need to "render accurately."
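The routing idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the Janus implementation: the two "encoders" below are stand-in functions over a tiny grayscale image, and the names `understanding_encoder`, `generation_encoder`, and `encode` are hypothetical.

```python
from typing import Callable, Sequence

Image = Sequence[Sequence[float]]  # toy grayscale image: rows of intensities


def understanding_encoder(image: Image) -> list[float]:
    """Toy 'semantic' path: coarse global statistics (mean and max intensity)."""
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def generation_encoder(image: Image) -> list[float]:
    """Toy 'detail' path: keeps every pixel so fine texture is preserved."""
    return [p for row in image for p in row]


# Each task gets its own encoder; nothing is shared between the two paths.
ENCODERS: dict[str, Callable[[Image], list[float]]] = {
    "understand": understanding_encoder,
    "generate": generation_encoder,
}


def encode(image: Image, task: str) -> list[float]:
    """Route the image to the encoder that matches the task."""
    if task not in ENCODERS:
        raise ValueError(f"unknown task: {task!r}")
    return ENCODERS[task](image)


image = [[0.0, 1.0], [0.5, 0.5]]
print(encode(image, "understand"))  # coarse summary
print(encode(image, "generate"))    # full detail
```

The point of the sketch is only the separation: changing one encoder cannot degrade the other task, which is the interference the decoupled design is meant to avoid.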

Coordinate Linguification: Making AI Speak in Math

Another key innovation from DeepSeek is integrating spatial coordinate information into language expression. When answering questions, the model can naturally output structured information like "Object A is located in the [x1, y1, x2, y2] region." This isn't simple post-processing — the model learns to align visual space with language space during the training phase itself.

This means AI no longer makes vague comments about an image. Instead, it can precisely anchor every object under discussion, much like a human pointing at a blueprint while discussing specifics.
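One common way to express coordinates as text is to quantize them onto a fixed grid and write the resulting integers inline. The sketch below assumes a 1000-bin grid and the sentence template quoted above; both are illustrative assumptions, and `box_to_text` and `text_to_box` are hypothetical helper names rather than anything from DeepSeek's code.

```python
import re

BINS = 1000  # assumed grid: each axis is mapped to integers 0..999


def quantize(value: int, size: int) -> int:
    """Map a pixel coordinate to a grid bin, clipped to the last bin."""
    return min(BINS - 1, value * BINS // size)


def box_to_text(label: str, box: tuple[int, int, int, int],
                width: int, height: int) -> str:
    """Write a pixel box as resolution-independent integers inside a sentence."""
    x1, y1, x2, y2 = box
    coords = [quantize(x1, width), quantize(y1, height),
              quantize(x2, width), quantize(y2, height)]
    return f"{label} is located in the {coords} region."


BOX_PATTERN = re.compile(r"\[(\d+), (\d+), (\d+), (\d+)\]")


def text_to_box(text: str, width: int, height: int):
    """Recover an approximate pixel box from generated text, or None if absent."""
    match = BOX_PATTERN.search(text)
    if match is None:
        return None
    qx1, qy1, qx2, qy2 = map(int, match.groups())
    return (qx1 * width / BINS, qy1 * height / BINS,
            qx2 * width / BINS, qy2 * height / BINS)


sentence = box_to_text("Object A", (64, 48, 320, 240), width=640, height=480)
print(sentence)                          # Object A is located in the [100, 100, 500, 500] region.
print(text_to_box(sentence, 640, 480))   # (64.0, 48.0, 320.0, 240.0)
```

Because the coordinates are ordinary text, the same sentence can be produced by a language model and parsed back by downstream code without a separate detection head.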

Fusing Chain-of-Thought with Visual Reasoning

DeepSeek has extended its Chain-of-Thought advantages accumulated in language models into the visual domain. When handling complex visual problems, the model first performs spatial localization, then logical reasoning, and finally delivers conclusions. This "point first, think next, then speak" workflow simulates the cognitive process humans use when observing complex scenes.

For example, when presented with a screenshot of a geometry problem, the model first identifies and locates each geometric element, then annotates known conditions, and proceeds with step-by-step derivation — rather than attempting to "see the answer at a glance" like traditional models.
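The "point first, think next, then speak" order can be sketched as a three-stage pipeline. Everything here is a toy under stated assumptions: the `locate` stage is stubbed with hard-coded detections for a right triangle (a real model would predict them), and the function and field names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    name: str
    box: tuple[int, int, int, int]   # where the element sits in the image
    value: Optional[float] = None    # length read from the figure, if labeled


def locate(_image: object) -> list[Element]:
    """Stage 1, point: stubbed detections; the values below are made up."""
    return [
        Element("leg a", (20, 200, 180, 210), 3.0),
        Element("leg b", (180, 40, 190, 200), 4.0),
        Element("hypotenuse c", (20, 40, 190, 210), None),
    ]


def reason(elements: list[Element]) -> tuple[list[str], float]:
    """Stage 2, think: list known conditions, then derive the unknown side."""
    known = {e.name: e.value for e in elements if e.value is not None}
    steps = [f"Known: {name} = {value}" for name, value in known.items()]
    c = math.hypot(known["leg a"], known["leg b"])
    steps.append(f"c = sqrt(a^2 + b^2) = {c}")
    return steps, c


def answer(image: object) -> dict:
    """Stage 3, speak: return the conclusion with the regions and steps behind it."""
    elements = locate(image)
    steps, c = reason(elements)
    return {
        "located": {e.name: e.box for e in elements},
        "steps": steps,
        "conclusion": f"hypotenuse c = {c}",
    }


result = answer(image=None)
print(result["conclusion"])  # hypotenuse c = 5.0
```

The conclusion arrives together with the located elements and the intermediate steps, so each claim can be traced back to a region of the image.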

Why This Path May Matter More

The Ceiling of the Resolution Race

The visual resolution race among OpenAI, Google, and Anthropic is essentially a form of "brute-force aesthetics": using more compute, larger models, and more data to improve perceptual accuracy. But this path is likely to face diminishing returns. Moving from 720p to 1080p may yield significant comprehension improvements, while going from 4K to 8K probably adds far less, even though it costs much more to process.
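The cost side of this argument is simple arithmetic. The sketch below counts pixels and image patches at each resolution, assuming a patch-based encoder with 16-pixel square patches; the patch size is an assumption for illustration, and the numbers say nothing about accuracy, only about how much input the model must process.

```python
import math

PATCH = 16  # assumed patch edge length in pixels


def patch_tokens(width: int, height: int, patch: int = PATCH) -> int:
    """Number of patches needed to tile the image (partial patches count)."""
    return math.ceil(width / patch) * math.ceil(height / patch)


RESOLUTIONS = {
    "720p": (1280, 720),
    "1080p": (1920, 1080),
    "4K": (3840, 2160),
    "8K": (7680, 4320),
}

for name, (w, h) in RESOLUTIONS.items():
    print(f"{name:>5}: {w * h:>10,} pixels -> {patch_tokens(w, h):>7,} patch tokens")
```

Under this assumption, the step from 720p to 1080p adds 4,560 patch tokens, while the step from 4K to 8K adds 97,200. Each jump in resolution therefore has to buy its comprehension gains at a steeply rising input cost.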

More critically, the bottleneck for many visual understanding tasks isn't resolution at all. Humans can extract rich information from a blurry old photograph. What AI lacks isn't pixels — it's understanding.

Real-World Needs in Practical Scenarios

In real-world applications, what users typically need is not what AI "sees" but what it "understands."

  • Medical imaging: Doctors need AI to pinpoint the exact location of lesions, not just say "there may be an abnormality"
  • Autonomous driving: Systems need to locate every traffic participant's coordinates and trajectory, not just recognize "there's a car ahead"
  • Industrial quality inspection: Engineers need AI to annotate the specific position and type of defects, not just deliver a "pass/fail" verdict
  • Document understanding: Users need AI to point out where information sits in a document, not just extract text

In all these scenarios, "pointing accurately" matters more than "seeing clearly." DeepSeek's technical approach aligns closely with these real-world demands.
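"Pointing accurately" is usually measured with intersection over union (IoU) between a predicted box and a reference box. The sketch below is the standard IoU formula; the example boxes are made up, and the article does not say which metric DeepSeek reports.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 < x2 and y1 < y2


def area(box: Box) -> float:
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union: 1.0 for identical boxes, 0.0 for disjoint ones."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


predicted = (10, 10, 50, 50)   # made-up model output
reference = (30, 30, 70, 70)   # made-up annotation
print(round(iou(predicted, reference), 3))  # 0.143
```

A prediction is often counted as correct only when its IoU with the reference exceeds a chosen threshold such as 0.5, which gives "pointing accurately" a concrete pass or fail criterion for use cases like lesion localization or defect annotation.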

Big Opportunities for Smaller Models

DeepSeek's consistent philosophy is achieving better results with smaller models. In the multimodal domain, visual grounding offers a way to "win through ingenuity": aiming for comparable or even superior real-world performance with far fewer parameters than competitors, thanks to more refined understanding mechanisms.

This is consistent with DeepSeek's strategy in language models: don't compete on who has the biggest model — compete on who has the smartest one.

Industry Landscape: Two Diverging Paths

Two clear technical trajectories are forming in the multimodal AI space:

The Perception School (represented by OpenAI, Google, and Anthropic): The core logic is to enhance perceptual capabilities, enabling AI to capture more and clearer visual information, believing that "seeing enough leads to understanding."

The Cognition School (represented by DeepSeek): The core logic is to enhance cognitive capabilities, enabling AI to build deeper understanding from limited visual input, believing that "comprehension matters more than perception."

These two paths are not entirely opposed, and the best solution likely lies in their convergence. But at the current stage, DeepSeek has chosen a path overlooked by most, and that in itself carries significant industry implications: it suggests that in the multimodal AI space, Chinese teams are not merely following but also defining the questions.

Looking Ahead: When AI Learns to Point Things Out

Equipping AI with a "cyber finger" may sound like a minor tweak, but the paradigm shift behind it is profound. It means visual interaction between AI and humans is evolving from "descriptive" to "interactive" — AI is no longer a passive image narrator but a collaborative partner who can "point at the picture and discuss problems" with you.

Imagine a future scenario: you send an architectural blueprint to AI, and instead of vaguely commenting "nice design," it points its finger (bounding box) at a corner of the drawing and says "the load-bearing structure here may have an issue," then points to another location and says "this pipeline layout conflicts with the fire escape over there."

This isn't science fiction — this is the technical path DeepSeek is paving.

While everyone else is competing over who has the biggest eyes, DeepSeek chose to first teach AI how to use its finger. This cyber finger may just be the key to unlocking the next door in multimodal AI.