
YOSE: Selecting Only Essential Tokens for Faster Video Object Removal

💡 Researchers propose YOSE, a method that speeds up Diffusion Transformer-based video object removal by running the heavy computation only on essential tokens, targeting the high inference latency that limits current approaches.

Video Object Removal Is Impressive, but Too Slow

Diffusion Transformer (DiT)-based video generation has made remarkable strides in recent years and delivers strong visual results, particularly in video object removal. However, these methods share a common problem: high inference latency. MiniMax Remover, currently the best-performing solution, runs at only about 10 FPS, which is still well short of real-time use.

The bottleneck arises because existing methods perform dense computation across the entire spatiotemporal token space. Even when the masked region that needs inpainting occupies only a small portion of the frame, the model still processes every token in full. This wasted computation is the core problem YOSE aims to solve.

YOSE: Select Only the Essential Tokens

A recent arXiv paper introduces a method called "YOSE" (You Only Select Essential Tokens), built on a simple core idea: since video object removal only needs new content in localized regions, computation can be concentrated on the "essential tokens" that cover those regions.

YOSE's design stems from an intuitive observation: in video object removal, the masked regions typically occupy only a small proportion of the overall frame. Standard DiT methods compute self-attention across all spatiotemporal tokens, and the cost of that attention grows quadratically with the number of tokens. The overhead becomes very large for high-resolution or long videos.
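To make the quadratic scaling concrete, the following back-of-the-envelope sketch counts query-key pairs in dense self-attention and compares that with attention restricted to a subset of tokens. The token grid size and the helper names are illustrative assumptions, not figures or code from the YOSE paper.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Number of query-key pairs in dense self-attention (grows as N^2)."""
    return num_tokens * num_tokens


def relative_attention_cost(total_tokens: int, selected_fraction: float) -> float:
    """Attention cost on a token subset, relative to dense attention over all tokens.

    Assumes the selected tokens attend only to each other.
    """
    selected = round(total_tokens * selected_fraction)
    return attention_pair_count(selected) / attention_pair_count(total_tokens)


# Hypothetical latent grid: 16 frames x 30 x 52 spatial tokens (illustrative only).
total = 16 * 30 * 52
for fraction in (1.0, 0.5, 0.2, 0.1):
    ratio = relative_attention_cost(total, fraction)
    print(f"{fraction:>4.0%} of tokens -> {ratio:.2%} of dense attention cost")
```

Under this simplified model, keeping 10% of the tokens leaves about 1% of the attention pairs. If the selected queries instead attend to all tokens as keys, the saving is only linear in the selected fraction, so the real gain depends on how the method treats unselected tokens.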

Using a selection mechanism, YOSE identifies the subset of tokens that are truly "essential" to the generated result and runs the core Transformer computations only on them. Unselected tokens receive lightweight processing or simply reuse existing information, which sharply reduces the computational load while maintaining generation quality.
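The select, compute, and reuse pattern described here can be sketched as a gather-process-scatter routine. This is a minimal illustration, not the paper's implementation: the mask-based selection rule, the function names, and the toy block standing in for a Transformer layer are all assumptions made for the example.

```python
from typing import Callable, List, Sequence

Token = List[float]


def select_essential_indices(mask: Sequence[bool]) -> List[int]:
    """Hypothetical selection rule: keep tokens whose patch lies in the removal mask."""
    return [i for i, inside in enumerate(mask) if inside]


def process_selected_tokens(
    tokens: Sequence[Token],
    mask: Sequence[bool],
    heavy_block: Callable[[List[Token]], List[Token]],
) -> List[Token]:
    """Run the expensive block only on selected tokens and reuse the rest."""
    indices = select_essential_indices(mask)
    selected = [tokens[i] for i in indices]       # gather
    updated = heavy_block(selected)               # expensive compute on the subset
    output = [list(token) for token in tokens]    # unselected tokens are copied as-is
    for i, new_token in zip(indices, updated):    # scatter results back in place
        output[i] = new_token
    return output


def toy_block(selected: List[Token]) -> List[Token]:
    """Stand-in for a Transformer block: adds 1.0 to every feature."""
    return [[value + 1.0 for value in token] for token in selected]


tokens = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
mask = [False, True, True, False]
print(process_selected_tokens(tokens, mask, toy_block))
# [[0.0, 0.0], [2.0, 2.0], [3.0, 3.0], [3.0, 3.0]]
```

Only the two masked tokens pass through the block, and the other two are returned unchanged. A real system would also need selected tokens to see enough surrounding context to inpaint plausibly, which this sketch does not model.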

Technical Significance and Industry Impact

From a technical perspective, YOSE's contributions are reflected across several dimensions:

Balancing Efficiency and Quality: The video generation field has long faced the dilemma of "great results but too slow to run." YOSE's token selection strategy provides a structured acceleration pathway for DiT architectures. Unlike generic compression methods such as distillation or quantization, it exploits the sparsity inherent in the video object removal task itself.

Task-Aware Computation Allocation: YOSE represents a "task-aware" computational paradigm that allocates compute dynamically according to the characteristics of the task at hand. The idea applies not only to video object removal but could also inspire acceleration research for other localized editing tasks, such as video inpainting and localized style transfer.

Advancing Practical Deployment: Video object removal is in high demand in film and television post-production, short-form video creation, and privacy protection. At the current speed of about 10 FPS, a 10-second clip recorded at 30 FPS (300 frames) takes roughly 30 seconds to process, about three times the length of the clip itself. YOSE's acceleration approach has the potential to push this technology toward near-real-time use.
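The processing-time estimate follows from simple arithmetic, sketched below. The 10 FPS throughput comes from the figure quoted above. The clip frame rates and the function name are assumptions chosen for illustration.

```python
def processing_seconds(duration_s: float, video_fps: float, throughput_fps: float) -> float:
    """Time needed to process a clip at a given model throughput (frames per second)."""
    total_frames = duration_s * video_fps
    return total_frames / throughput_fps


# A 10-second clip at common frame rates, processed at about 10 FPS.
for video_fps in (24, 30, 60):
    seconds = processing_seconds(10, video_fps, throughput_fps=10)
    print(f"{video_fps} FPS clip: {seconds:.0f} s to process")
```

Real-time processing requires the model's throughput to match the clip's frame rate, so a 30 FPS clip would need roughly a threefold speedup over 10 FPS.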

Outlook

As video generation models continue to scale up, inference efficiency is becoming a critical factor constraining real-world deployment. The "selective computation" philosophy represented by YOSE aligns with recent academic explorations in dynamic token pruning, sparse attention, and related directions, collectively pointing toward a clear trend: future video generation models must not only "generate well" but also "compute smartly."

The research is currently available on arXiv, and the subsequent release of complete experimental data and code is well worth following.