Introducing MARC: Faster, Lighter, Smarter Video AI
Large Language Models (LLMs) are rapidly evolving into powerful vision-language models (VLMs), but when you switch from a single image to a long, high-frame-rate video, the computational cost explodes. This overhead, driven by the sheer number of visual tokens, creates significant latency and memory bottlenecks, making it nearly impossible to deploy high-performing VLMs in real-time, resource-constrained applications like autonomous driving, surveillance, and real-time video Q&A.
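To get a feel for the scale of the problem, here is a rough, back-of-the-envelope calculation; the per-frame and prompt token counts are assumptions chosen for illustration, not figures from the paper.

```python
# Rough illustration of how visual tokens dominate the context for long videos.
# TOKENS_PER_FRAME and TEXT_PROMPT_TOKENS are assumed values, not numbers from the paper.
TOKENS_PER_FRAME = 256      # assumed visual tokens emitted per frame by the vision encoder
TEXT_PROMPT_TOKENS = 100    # assumed length of the user's question

for frames in (1, 16, 64, 256):
    visual = frames * TOKENS_PER_FRAME
    total = visual + TEXT_PROMPT_TOKENS
    print(f"{frames:>4} frames -> {visual:>6} visual tokens "
          f"({visual / total:.0%} of a {total}-token context)")
```

Since attention cost grows quadratically with sequence length, even a modest reduction in visual tokens yields outsized savings in latency and memory.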
Current solutions often rely on training-free token merging, but this inevitably leads to significant information loss and a drop in performance.
We are excited to introduce MARC: Memory-Augmented Reinforcement Learning-based Token Compression. MARC is a novel, two-stage framework that slashes the visual token load by an astonishing 95% while maintaining the reasoning ability of a full-frame model.
How MARC Works: Retrieve, Then Compress
MARC is a retrieve-then-compress method that integrates a visual memory system with a novel reinforcement learning (RL) training strategy to achieve its exceptional efficiency and performance.
- The Visual Memory Retriever (VMR)
The first core component, the Visual Memory Retriever (VMR), is inspired by how the human brain processes and recalls memories. Humans don't process continuous visual streams uniformly; instead, we segment experiences into discrete, semantically coherent events. The VMR applies the same principle to video: it organizes the frame stream into event-level segments and retrieves only those segments relevant to the current query.

This "memory-first" strategy dramatically reduces the search space, removes unnecessary redundant information, and mitigates the negative effects of naive compression.
- Compression Group Relative Policy Optimization (C-GRPO)
The second component is the Compression Group Relative Policy Optimization (C-GRPO), a novel RL-based distillation technique designed specifically for aggressive token compression.
After the VMR retrieves the relevant segments, C-GRPO takes over to distill the reasoning capability from a full-frame Teacher Network to an aggressively compressed Student Network.
The key innovation is the introduction of a retention alignment reward, *r_c*. Unlike standard RL methods that reward only correct answers, C-GRPO directly encourages the performance achieved with the compressed input (*a_comp*) to match that of the full-frame teacher (*a_full*), as sketched below.
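The exact reward is not given in this overview, so the following is a minimal sketch of the idea, assuming the retention term simply penalizes the gap between the compressed and full-frame scores and that the two terms are combined with tunable weights; the function names, coefficients, and the group-normalization helper are assumptions.

```python
def c_grpo_reward(answer_correct: bool, a_comp: float, a_full: float,
                  alpha: float = 1.0, beta: float = 1.0) -> float:
    """Sketch of a C-GRPO-style reward (functional form and weights are assumed).

    answer_correct : whether the compressed student's answer is correct
    a_comp         : score achieved with the aggressively compressed input
    a_full         : score achieved by the full-frame teacher on the same sample
    """
    task_reward = 1.0 if answer_correct else 0.0   # standard correctness term
    r_c = 1.0 - abs(a_full - a_comp)               # retention alignment: match the teacher
    return alpha * task_reward + beta * r_c

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize rewards within a group of sampled rollouts."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

The group-relative normalization is standard GRPO; the retention term *r_c* is what pushes the compressed student toward the full-frame teacher's behavior rather than merely toward correct answers.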

Impressive Results and Efficiency
We conducted extensive experiments on six video benchmarks covering video reasoning and general video understanding. The results demonstrate that MARC sets a new standard for efficiency without sacrificing accuracy.

- Near-Identical Mean Performance: MARC-3B achieves a mean score of 42.20, virtually matching the 42.21 of the 64-frame Qwen2.5-VL-3B baseline.
- Extreme Compression: This performance is achieved while using only a single frame's worth of tokens as input, a 95% reduction in visual tokens.
- Outperforming Competitors: MARC-3B's average score of 42.20 also surpasses that of larger models such as InternVL3.5-4B and Gemma-3-4B.
Major Gains in Efficiency

The drastic reduction in input tokens translates directly into massive real-world efficiency gains:
- GPU Memory Reduction: MARC reduces peak GPU memory usage by 72% (from 41.63 GB to 11.48 GB). This enables deployment on much smaller, more common GPUs.
- Latency Improvement: Generation latency is reduced by 23.9%. This leads to an 11.1% reduction in overall end-to-end latency for processing a single video sample.
Conclusion and Future Impact
MARC provides a robust and practical solution to the computational challenges of video language models. By deftly integrating a structured visual retrieval mechanism (VMR) with a powerful reinforcement learning-based compression algorithm (C-GRPO), we have demonstrated that it is possible to achieve both superior performance and exceptional efficiency. This breakthrough enables the practical deployment of high-performing VLMs in real-world, latency-sensitive, and resource-constrained environments, making a true impact on applications such as real-time video question answering, surveillance, and autonomous driving.

UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks
User-generated content (UGC) videos—like TikToks—combine fast-paced visuals, music, and speech, requiring deep multimodal understanding. Existing benchmarks are mostly visual-only, neglecting the crucial role of audio. To address this, researchers introduce **UGC-VideoCap**—a new benchmark for detailed audio-visual captioning—and **UGC-VideoCaptioner-3B**, a compact model distilled from Gemini-2.5 Flash using supervised fine-tuning (SFT) and GRPO reinforcement learning.

Prompt-Driven Agentic Video Editing System: Autonomous Comprehension of Long-Form, Story-Driven Media
Editing long-form, story-driven videos is one of the most cognitively demanding creative tasks. This paper presents an agentic video editing system that autonomously understands and reconstructs narrative media via natural-language prompts. Instead of cutting manually, users can simply request: *“Summarize this documentary as a 3-minute cinematic recap.”* The system produces an edited, narrated video, integrating comprehension, retrieval, and rendering.

Memories-S0: An Efficient and Accurate Framework for Security Video Understanding
Security video understanding plays a pivotal role in fields such as smart cities and public safety. However, its development has long been constrained by the scarcity of high-quality, large-scale surveillance-domain data.