MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding

Introducing MARC: Faster, Lighter, Smarter Video AI

Large Language Models (LLMs) are rapidly evolving into powerful vision-language models (VLMs), but switching from a single image to a long, high-frame-rate video makes the computational cost explode. This overhead, driven by the sheer number of visual tokens, creates significant latency and memory bottlenecks, making it nearly impossible to deploy high-performing VLMs in real-time, resource-constrained applications such as autonomous driving, surveillance, and real-time video Q&A.

Current solutions often rely on training-free token merging, but this inevitably leads to significant information loss and a drop in performance.

We are excited to introduce MARC: Memory-Augmented Reinforcement Learning-based Token Compression. MARC is a novel, two-stage framework that slashes the visual token load by an astonishing 95% while maintaining the reasoning ability of a full-frame model.

How MARC Works: Retrieve, Then Compress

MARC is a retrieve-then-compress method that integrates a visual memory system with a novel reinforcement learning (RL) training strategy to achieve its exceptional efficiency and performance.

  1. The Visual Memory Retriever (VMR)

The first core component, the Visual Memory Retriever (VMR), is inspired by how the human brain processes and recalls memories. Humans don't process continuous visual streams uniformly; instead, we segment experiences into discrete, semantically coherent events.


Mirroring this, the VMR segments the incoming video into coherent events and retrieves only the segments relevant to the query before any compression takes place. This "memory-first" strategy dramatically reduces the search space, discards redundant information, and mitigates the negative effects of naive compression.
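The retrieve step can be pictured as follows. This is a minimal sketch, not MARC's actual implementation: it assumes frames and the query are already encoded as embedding vectors, and the function names (`segment_video`, `retrieve_segments`), the cosine-similarity cut rule, and the top-k selection are all illustrative choices.

```python
import numpy as np

def segment_video(frame_feats, threshold=0.8):
    """Split a (num_frames, dim) feature sequence into event segments by
    cutting wherever consecutive frames drop below a cosine-similarity
    threshold. (Illustrative; MARC's actual segmentation may differ.)"""
    cuts = [0]
    for i in range(1, len(frame_feats)):
        a, b = frame_feats[i - 1], frame_feats[i]
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if sim < threshold:
            cuts.append(i)
    cuts.append(len(frame_feats))
    return [frame_feats[s:e] for s, e in zip(cuts[:-1], cuts[1:])]

def retrieve_segments(segments, query_feat, top_k=2):
    """Score each segment by the similarity of its mean-pooled features
    to the query, and keep only the top-k most relevant segments."""
    scores = []
    for seg in segments:
        pooled = seg.mean(axis=0)
        scores.append(pooled @ query_feat /
                      (np.linalg.norm(pooled) * np.linalg.norm(query_feat) + 1e-8))
    order = np.argsort(scores)[::-1][:top_k]
    return [segments[i] for i in sorted(order)]  # preserve temporal order
```

Only the retrieved segments are then handed to the compression stage, which is what shrinks the search space before any tokens are dropped.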

  2. Compression Group Relative Policy Optimization (C-GRPO)

The second component is the Compression Group Relative Policy Optimization (C-GRPO), a novel RL-based distillation technique designed specifically for aggressive token compression.
After the VMR retrieves the relevant segments, C-GRPO takes over to distill the reasoning capability from a full-frame Teacher Network to an aggressively compressed Student Network.

The key innovation is the introduction of a retention alignment reward (r_c). Unlike standard RL methods that reward only correct answers, C-GRPO directly encourages the compressed input's answer accuracy (a_comp) to match the full-frame teacher's accuracy (a_full).
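To make the reward structure concrete, here is a hedged sketch of how a retention alignment term could be combined with a standard correctness reward and a GRPO-style group-relative advantage. The binary correctness signals, the additive form, and the weight `alpha` are assumptions for illustration, not MARC's published formulation.

```python
import numpy as np

def cgrpo_reward(correct_comp, correct_full, alpha=0.5):
    """Per-rollout reward for a group of sampled answers.

    correct_comp : 1/0 correctness of student answers on compressed input
    correct_full : 1/0 correctness of teacher answers on full frames
    The retention-alignment term r_c rewards the compressed run for
    matching the full-frame outcome, on top of the usual answer reward.
    (Illustrative decomposition; the exact reward may differ.)"""
    r_acc = correct_comp.astype(float)                   # standard answer reward
    r_c = (correct_comp == correct_full).astype(float)   # retention alignment
    return r_acc + alpha * r_c

def group_relative_advantage(rewards):
    """GRPO-style advantage: normalize each rollout's reward against the
    group mean and standard deviation, with no learned value function."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

The point of r_c is visible in the sketch: a compressed rollout that agrees with the full-frame teacher earns extra reward even relative to rollouts that happen to be correct for different reasons, which is what pushes the student toward the teacher's behavior rather than just toward correct answers.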


Impressive Results and Efficiency

We conducted extensive experiments on six video benchmarks covering video reasoning and general video understanding. The results demonstrate that MARC sets a new standard for efficiency without sacrificing accuracy.


  • Matched Mean Performance: MARC-3B achieves a mean score of 42.20, virtually identical to the 64-frame Qwen2.5-VL-3B baseline's 42.21.
  • Extreme Compression: This performance is achieved while using only a single frame's worth of tokens as input, a 95% reduction in visual tokens.
  • Outperforming Competitors: MARC-3B's average score of 42.20 also surpasses larger models such as InternVL3.5-4B and Gemma-3-4B.

Major Gains in Efficiency


The drastic reduction in input tokens translates directly into massive real-world efficiency gains:

  • GPU Memory Reduction: MARC reduces peak GPU memory usage by 72% (from 41.63 GB to 11.48 GB). This enables deployment on much smaller, more common GPUs.
  • Latency Improvement: Generation latency is reduced by 23.9%. This leads to an 11.1% reduction in overall end-to-end latency for processing a single video sample.
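The quoted figures are easy to sanity-check from the numbers above; the implied share of end-to-end time spent in generation is our back-of-the-envelope inference, not a number from the experiments.

```python
# Peak GPU memory figures reported above (GB).
full_mem, marc_mem = 41.63, 11.48
mem_reduction = 1 - marc_mem / full_mem
print(f"memory reduction: {mem_reduction:.1%}")  # ~72.4%, matching the stated 72%

# If a 23.9% cut in generation latency yields an 11.1% cut end-to-end,
# generation plausibly accounts for roughly 46% of end-to-end time
# (assuming the rest of the pipeline is unchanged).
gen_share = 0.111 / 0.239
print(f"implied generation share of end-to-end time: {gen_share:.1%}")
```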

Conclusion and Future Impact

MARC provides a robust and practical solution to the computational challenges of video language models. By deftly integrating a structured visual retrieval mechanism (VMR) with a powerful reinforcement learning-based compression algorithm (C-GRPO), we have demonstrated that it is possible to achieve both superior performance and exceptional efficiency. This breakthrough enables the practical deployment of high-performing VLMs in real-world, latency-sensitive, and resource-constrained environments, making a true impact on applications such as real-time video question answering, surveillance, and autonomous driving.