VideoMAP: Toward Scalable Mamba-based Video Autoregressive Pretraining

The field of video understanding faces a scalability bottleneck. While Transformers dominate video modeling, their quadratic complexity hinders efficiency, and pure Mamba architectures often overfit and fail to generalize. To bridge this gap, we introduce VideoMAP, a hybrid Mamba-Transformer framework with frame-wise autoregressive pretraining. As shown in the figure below, VideoMAP combines Mamba's sequential efficiency with the Transformer's global reasoning, making it a scalable foundation for long-horizon video analysis.

[Figure: Overview of the VideoMAP framework]

Methodology

Hybrid Mamba–Transformer Encoder

VideoMAP’s hybrid design strikes a balance between efficiency and context modeling. The architecture integrates Mamba and Transformer blocks at a 4:1 ratio, forming the VideoHybrid backbone. This hybridization mitigates Mamba’s tendency to overfit while retaining its linear-time advantages.

[Figure: Hybrid Mamba–Transformer (VideoHybrid) encoder architecture]
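The 4:1 interleaving pattern can be sketched in a few lines of PyTorch. Everything below is illustrative rather than the paper's implementation: `MambaBlockStub` is a runnable stand-in for a real Mamba/SSM block (e.g., one from the `mamba_ssm` package), and the dimensions, depth, and `HybridBackbone` name are placeholder choices.

```python
# Minimal sketch of a 4:1 Mamba/Transformer interleaving pattern (illustrative only).
import torch
import torch.nn as nn


class MambaBlockStub(nn.Module):
    """Stand-in for a real Mamba/SSM block.

    Uses a depthwise causal conv + gated MLP purely to keep the sketch
    runnable; it is NOT an actual selective state-space layer.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=4, padding=3, groups=dim)
        self.gate = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (batch, seq, dim)
        h = self.norm(x)
        h = self.conv(h.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        a, b = self.gate(h).chunk(2, dim=-1)
        return x + self.proj(a * torch.sigmoid(b))          # residual connection


class HybridBackbone(nn.Module):
    """Interleaves Mamba-style and Transformer blocks at a 4:1 ratio."""

    def __init__(self, dim=512, heads=8, num_groups=4):
        super().__init__()
        blocks = []
        for _ in range(num_groups):
            blocks += [MambaBlockStub(dim) for _ in range(4)]      # 4 Mamba-style blocks
            blocks.append(nn.TransformerEncoderLayer(              # 1 Transformer block
                d_model=dim, nhead=heads, batch_first=True, norm_first=True))
        self.blocks = nn.ModuleList(blocks)

    def forward(self, tokens):                               # tokens: (batch, seq, dim)
        for blk in self.blocks:
            tokens = blk(tokens)
        return tokens


if __name__ == "__main__":
    x = torch.randn(2, 196 * 8, 512)     # e.g., 8 frames of 14x14 patch tokens
    print(HybridBackbone()(x).shape)     # torch.Size([2, 1568, 512])
```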

Frame-wise Autoregressive Pretraining

Instead of reconstructing masked frames, VideoMAP predicts the next frame’s semantic embeddings using CLIP-based representations. This teaches temporal continuity, allowing the model to understand how actions and context evolve. In simple terms, it learns to anticipate ‘what happens next’ rather than fill in missing pixels.

[Figure: Frame-wise autoregressive pretraining with CLIP-based targets]
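A minimal sketch of this objective, assuming per-frame encoder features and precomputed per-frame CLIP embeddings are already available. The `next_frame_prediction_loss` helper, the linear prediction head, and the cosine-similarity loss are illustrative choices, not necessarily the paper's exact formulation.

```python
# Sketch of a frame-wise next-frame prediction objective (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F


def next_frame_prediction_loss(frame_feats, clip_targets, head):
    """frame_feats  : (B, T, D) per-frame features from the video encoder
    clip_targets : (B, T, C) per-frame CLIP embeddings (teacher targets)
    head         : module mapping D -> C (prediction head)

    Frames 0..T-2 predict the CLIP embeddings of frames 1..T-1.
    """
    pred = head(frame_feats[:, :-1])           # predictions for frames 1..T-1
    target = clip_targets[:, 1:].detach()      # frozen teacher targets
    # Cosine-similarity loss in the CLIP embedding space (one common choice;
    # the paper's exact objective may differ).
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    return (1.0 - (pred * target).sum(dim=-1)).mean()


if __name__ == "__main__":
    B, T, D, C = 2, 8, 512, 768
    head = nn.Linear(D, C)                      # hypothetical prediction head
    feats = torch.randn(B, T, D)
    targets = torch.randn(B, T, C)
    print(next_frame_prediction_loss(feats, targets, head).item())
```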

VideoLLM Based on VideoMAP

We extend VideoMAP into a VideoLLM using the LLaVA framework, combining image-based MAP and video-based VideoMAP within a shared vision encoder. Through three stages — visual alignment, multimodal understanding, and instruction tuning — the system unifies image and video representations for the LLM. This hybrid design reduces the gap between modalities, improves multimodal reasoning, and lowers GPU memory usage compared with Transformer- or Mamba-only models.

[Figure: VideoLLM built on VideoMAP with the LLaVA framework]
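The staged recipe can be sketched as a vision-to-LLM projector plus a freeze/unfreeze schedule. The `VisionProjector` and `set_stage` helpers below are hypothetical, and the per-stage choice of trainable modules follows common LLaVA-style practice rather than the paper's exact schedule.

```python
# Sketch of a LLaVA-style projector and staged freezing (illustrative only).
import torch.nn as nn


class VisionProjector(nn.Module):
    """Two-layer MLP mapping vision tokens into the LLM embedding space."""

    def __init__(self, vision_dim=512, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, vision_tokens):       # (B, N, vision_dim) -> (B, N, llm_dim)
        return self.mlp(vision_tokens)


def set_stage(stage, vision_encoder, projector, llm):
    """Freeze/unfreeze modules per training stage.

    The schedule is an assumption following common LLaVA-style recipes:
      stage 1 (visual alignment):         train projector only
      stage 2 (multimodal understanding): train projector + vision encoder
      stage 3 (instruction tuning):       train projector + LLM
    """
    trainable = {
        1: [projector],
        2: [projector, vision_encoder],
        3: [projector, llm],
    }[stage]
    for module in (vision_encoder, projector, llm):
        for p in module.parameters():
            p.requires_grad = module in trainable
```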

Benchmark Results

VideoMAP achieves state-of-the-art results on multiple benchmarks. Across datasets like Kinetics-400, Something-Something V2, and COIN, the hybrid autoregressive design significantly outperforms previous methods while using less computation.


[Table-1: Comparison with the state-of-the-art methods on Kinetics-400]


[Table-2: Comparison with the state-of-the-art methods on Something-Something V2]

Key Advantages

  • In the benchmark results above, VideoMAP scales effectively up to 300M parameters without overfitting, outperforming pure VideoMamba models.
  • Furthermore, the table below demonstrates a +2.6% accuracy gain on K400-small compared with VideoHybrid+UMT, showing strong data efficiency.
[Table: Data-efficiency comparison on K400-small]
  • In terms of memory efficiency, the VideoMAP-LLaVA experiments show a 40% reduction in GPU memory usage, as illustrated in the figure below.
[Figure: GPU memory usage of VideoMAP-LLaVA]

Conclusion

This work addresses critical challenges in efficient video understanding, with a focus on scalability and data efficiency. We introduce VideoMAP, a hybrid Mamba-Transformer framework with a specialized pretraining approach designed to significantly mitigate overfitting. Experimental results demonstrate that VideoMAP outperforms existing models in both performance and scalability across various datasets. They also showcase its potential as a visual encoder for multimodal large language models, reducing memory usage and enabling the processing of longer video sequences.
