User-generated content (UGC) videos—like TikToks—combine fast-paced visuals, music, and speech, requiring deep multimodal understanding. Existing
benchmarks are mostly visual-only, neglecting the crucial role of audio.
To address this, researchers introduce UGC-VideoCap—a new benchmark for detailed audio-visual
captioning—and UGC-VideoCaptioner-3B, a compact model distilled from Gemini-2.5 Flash using supervised fine-tuning (SFT) and GRPO reinforcement
learning.
UGC-VideoCap Benchmark Design
The benchmark contains 1,000 short TikTok videos annotated in three stages (audio, visual, and audio-visual captions), plus 4,000 QA pairs for unimodal and cross-modal testing. The figure below depicts the annotation pipeline, which integrates raw video, sound, and scene descriptions verified through human-in-the-loop QA.

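For a concrete picture of what each annotated sample carries, here is a minimal sketch of one plausible record layout in Python; the class and field names are illustrative assumptions, not the released dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical layout of one UGC-VideoCap record; field names are
# assumptions for illustration, not the released dataset schema.
@dataclass
class QAPair:
    question: str
    answer: str
    modality: str           # e.g. "audio", "visual", or "audio-visual"

@dataclass
class UGCVideoCapSample:
    video_id: str
    audio_caption: str      # stage 1: what is heard (speech, music, sound events)
    visual_caption: str     # stage 2: what is seen (scenes, actions, on-screen text)
    av_caption: str         # stage 3: unified audio-visual description
    qa_pairs: List[QAPair] = field(default_factory=list)

sample = UGCVideoCapSample(
    video_id="tiktok_000123",
    audio_caption="Upbeat pop track with a voice-over explaining a recipe.",
    visual_caption="A hand mixes batter in a glass bowl; overlay text reads 'easy pancakes'.",
    av_caption="A cooking tutorial where the narrator describes pancake steps over upbeat music.",
    qa_pairs=[QAPair("What genre of music plays?", "Upbeat pop", "audio")],
)
```

The three caption fields mirror the three annotation stages, and the QA pairs (roughly four per video on average, given 4,000 pairs over 1,000 videos) cover both unimodal and cross-modal questions.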
Evaluation and Findings
UGC-VideoCap extends beyond visual-only datasets such as MSR-VTT or VidCapBench. Table-1 outlines its multimodal coverage, while Table-2 compares model performance across commercial and open models.
- Gemini-2.5 Flash: Highest overall score (76.73%), excelling at detailed multimodal reasoning.
- Gemini-2.5 Pro: Achieves 73.78%, slightly weaker on OCR and sound event interpretation.
- Open-source Models: Qwen2.5-Omni and MiniCPM-o-2.6 trail behind, showing fragmented multimodal integration.
- Key Insight: Most models still process audio and visuals separately, limiting cohesive understanding.

[Table-1 Benchmark comparison for video caption evaluation]

[Table-2 Evaluation on UGC-VideoCap Benchmark]
The UGC-VideoCaptioner Model
UGC-VideoCaptioner-3B is trained via two stages as shown in the figure below:
- Stage 1 – Distillation: Gemini-2.5 Flash auto-labels 20k TikTok clips, transferring knowledge to Qwen2.5-Omni-3B.
- Stage 2 – Reinforcement Learning: GRPO fine-tunes caption fluency and factual precision using LLM-based scoring. Instead of relying on a learned critic model, GRPO ranks multiple generated captions per sample and rewards human-like coherence and audio-visual consistency, as depicted below; a short code sketch of the group-relative reward idea follows.

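To make the group-relative idea concrete, here is a minimal sketch, assuming the LLM judge is wrapped in a placeholder `judge_score` function: several candidate captions for one video are scored, rewards are standardized within the group to form advantages, and the advantages weight a simple policy-gradient loss. This is not the authors' training code; the real pipeline additionally uses the clipped importance ratio and a KL penalty against a reference model.

```python
import torch

def judge_score(caption: str, reference: str) -> float:
    """Placeholder for the LLM-based judge; returns a reward in [0, 1].
    Toy heuristic: word overlap with a reference caption (illustration only)."""
    ref_words = set(reference.lower().split())
    cap_words = set(caption.lower().split())
    return len(ref_words & cap_words) / max(len(ref_words), 1)

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: standardize rewards within one sample's group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# One video, a group of candidate captions sampled from the policy model.
candidates = [
    "A person mixes pancake batter while upbeat music plays.",
    "Someone cooks food.",
    "A narrator explains a pancake recipe over pop music, with on-screen text.",
]
reference = "A cooking tutorial: pancake batter is mixed while a narrator speaks over upbeat pop music."

rewards = torch.tensor([judge_score(c, reference) for c in candidates])
advantages = grpo_advantages(rewards)

# Sequence log-probabilities would come from the 3B policy model;
# random values stand in here so the snippet runs on its own.
logprobs = torch.randn(len(candidates), requires_grad=True)
loss = -(advantages.detach() * logprobs).mean()
loss.backward()
print(advantages, loss.item())
```

Because each advantage is computed relative to the group mean, a caption only receives a positive learning signal when the judge prefers it to its siblings, which is what nudges the model toward richer, more consistent audio-visual detail without training a separate critic.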
Quantitative Results
Table-3 reports the main outcomes. The 1k SFT + 1k RL setup improves performance by 7.83 points, approaching the accuracy of the 20k SFT model. GRPO enhances caption richness and reduces hallucinations, reaching a 60.01 average score, compared with Gemini-2.5 Flash's 76.73 on the benchmark.

[Table-3 Performance comparison on audio-visual captioning]
Key Insights
- Audio is essential: Ignoring sound cues causes up to a 25-point accuracy loss.
- Efficient small-data training: With just 2k samples, distillation can rival large-scale training.
- Human-in-the-loop QA: Ensures reliable, semantically rich multimodal annotations.
Conclusion
UGC-VideoCaptioner and the UGC-VideoCap benchmark redefine multimodal video captioning by merging detailed sound, text, and visual understanding. Through Gemini-guided distillation and GRPO training, UGC-VideoCaptioner delivers rich, human-like captions efficiently. This work marks a significant advance toward AI systems that can both see and hear the fast-evolving world of UGC content.
Read more

MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
Large Language Models (LLMs) are rapidly evolving into powerful multimodal systems (VLMs), but when you switch from a single image to a long, high-frame-rate **video**, the computational cost explodes. This immense computational overhead—due to the sheer number of visual tokens—creates significant latency and memory bottlenecks, making it nearly impossible to deploy high-performing VLMs in real-time, resource-constrained applications like autonomous driving, surveillance, and real-time video Q&A.

Prompt-Driven Agentic Video Editing System: Autonomous Comprehension of Long-Form, Story-Driven Media
Editing long-form, story-driven videos is one of the most cognitively demanding creative tasks. This paper presents an agentic video editing system that autonomously understands and reconstructs narrative media via natural-language prompts. Instead of cutting manually, users can request: *“Summarize this documentary as a 3-minute cinematic recap.”* The system produces an edited, narrated video, integrating comprehension, retrieval, and rendering, as shown in this figure:

Memories-S0: An Efficient and Accurate Framework for Security Video Understanding
Security video understanding plays a pivotal role in fields such as smart cities and public safety. However, its development has long been constrained by the scarcity of high-quality, large-scale surveillance-domain data.