Editing long-form, story-driven videos is one of the most cognitively demanding creative tasks. Traditional editors such as Adobe Premiere or DaVinci Resolve require manually constructing the story across hours of footage. This paper presents an agentic video editing system that autonomously understands and reconstructs narrative media through natural-language prompts. Instead of cutting manually, users can simply request: “Summarize this documentary as a 3-minute cinematic recap.” The system produces an edited, narrated video, integrating comprehension, retrieval, and rendering into a single pipeline (illustrated in the paper’s system-overview figure).

Core contributions of this work include:
- A modular prompt-driven framework for coherent, story-centric video summarization.
- A semantic indexing pipeline linking timestamps with plot, emotion, and dialogue metadata.
- An interpretable agentic workflow that produces reusable intermediates like storyboards and scripts.
System Architecture
The system consists of three modular layers:
1. Video Comprehension & Semantic Indexing
To enable robust interaction with narrative-rich video, the system transforms multimodal input into a compact, interpretable textual representation. This occurs in three phases:
- Semantic Indexing: Videos are segmented into 5–15-minute overlapping windows. Gemini 2.0 Flash extracts hierarchical summaries and emotional tone.
- Refinement & Alignment: A refinement stage corrects entity drift, enforces causality, and aligns summaries into a narrative graph.
- Video Index: Outputs a timestamped JSON index of key scenes, dialogue, and semantic embeddings (a minimal sketch of one such entry follows this list).
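The paper does not publish the exact schema, but a minimal sketch of what one entry in such a timestamped index could look like is shown below; the field names (scene_id, start, end, summary, emotion, dialogue, embedding) are illustrative assumptions, not the paper’s format.

```python
# Minimal sketch of one entry in a timestamped video index.
# All field names are illustrative assumptions, not the paper's schema.
import json
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class SceneIndexEntry:
    scene_id: str                      # stable identifier for the scene
    start: float                       # start time in seconds
    end: float                         # end time in seconds
    summary: str                       # hierarchical summary text
    emotion: str                       # dominant emotional tone
    dialogue: List[str]                # key dialogue lines
    embedding: List[float] = field(default_factory=list)  # semantic embedding

entry = SceneIndexEntry(
    scene_id="scene_042",
    start=1283.5,
    end=1342.0,
    summary="The protagonist confronts her mentor about the stolen letters.",
    emotion="tense",
    dialogue=["MENTOR: You were never meant to read those."],
    embedding=[0.12, -0.08, 0.33],     # truncated for illustration
)
print(json.dumps(asdict(entry), indent=2))  # serialize into the JSON index
```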
2. Video-Centric QA
The system answers questions by converting its video index into a structured, memory-efficient text prompt combined with the user’s query. An auxiliary agent decides if visual evidence is required; if so, it retrieves relevant clips by timestamp and semantic similarity, enabling accurate, context-grounded answers to complex prompts.
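As a rough illustration of this routing step, the sketch below assumes a caller-supplied `needs_visual_evidence` classifier, an `embed` function, and an `llm` completion callable, plus cosine similarity over the index embeddings; none of these names come from the paper.

```python
# Illustrative QA routing: answer from the text index alone, or first
# retrieve supporting clip spans by semantic similarity (hypothetical helpers).
import numpy as np

def cosine_sim(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_clips(query_embedding, index_entries, top_k=3):
    """Rank indexed scenes by embedding similarity and return their time spans."""
    ranked = sorted(index_entries,
                    key=lambda e: cosine_sim(query_embedding, e["embedding"]),
                    reverse=True)
    return [(e["start"], e["end"]) for e in ranked[:top_k]]

def answer_question(question, index_entries, needs_visual_evidence, embed, llm):
    """Build a compact text prompt from the index; optionally ground it in clips."""
    context = "\n".join(e["summary"] for e in index_entries)
    if needs_visual_evidence(question):                 # auxiliary routing agent
        spans = retrieve_clips(embed(question), index_entries)
        context += f"\nRelevant clip spans (seconds): {spans}"
    return llm(f"Context:\n{context}\n\nQuestion: {question}")
```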
3. Prompt-Driven Editing Workflow
After indexing, users issue freeform prompts for creative editing, such as: “Create a trailer highlighting the antagonist’s transformation.” The system delegates tasks across specialized editing agents that plan, retrieve, and render video content autonomously.
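The division of labor among those agents might look roughly like the sketch below, where a planner drafts a storyboard from the index, a retriever maps storyboard beats back to timestamped spans, and a renderer assembles the cut; the function names, the `llm` callable, and the prompt format are all assumptions for illustration.

```python
# Hypothetical orchestration of the editing agents: plan -> retrieve -> render.
# Function names and prompt format are illustrative, not the paper's API.

def plan_storyboard(prompt, index_entries, llm):
    """Ask an LLM to select and order scenes that satisfy the user's prompt."""
    scene_list = "\n".join(f"{e['scene_id']}: {e['summary']}" for e in index_entries)
    reply = llm(
        f"Editing request: {prompt}\nAvailable scenes:\n{scene_list}\n"
        "Return the scene_ids to include, in narrative order, comma-separated."
    )
    return [s.strip() for s in reply.split(",") if s.strip()]

def retrieve_spans(storyboard, index_entries):
    """Map selected scene_ids back to timestamped spans in the source video."""
    by_id = {e["scene_id"]: e for e in index_entries}
    return [(by_id[sid]["start"], by_id[sid]["end"]) for sid in storyboard if sid in by_id]

def render_cut(spans, source_path, output_path):
    """Stand-in for the rendering agent; a real system would cut and concatenate
    the spans (e.g. with ffmpeg) and add generated narration."""
    print(f"Rendering {len(spans)} clips from {source_path} -> {output_path}")

def edit_video(prompt, index_entries, source_path, llm, output_path="recap.mp4"):
    storyboard = plan_storyboard(prompt, index_entries, llm)  # reusable intermediate
    spans = retrieve_spans(storyboard, index_entries)
    render_cut(spans, source_path, output_path)
    return output_path
```

The storyboard produced by the planner is the kind of reusable, inspectable intermediate the paper highlights as a contribution.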


Evaluation and Studies
The authors conducted four human-centered user studies to evaluate the system’s reliability, narrative comprehension, and creative quality across 400+ long-form videos:
- Study 1 – Reliability & Usability: Rated 4.55/5 in quality and 4.45/5 in usability, 20× higher than baseline Gemini models.
- Study 2 – Narrative Retention: Achieved >4.7/5 accuracy in timestamp and causal alignment; the refinement stage proved crucial for maintaining narrative continuity.
- Study 3 – Video Quality: The AI-generated recaps were comparable to human-edited summaries in genre consistency and overall watchability.
- Study 4 – User Experience: Compared with Descript and OpusClip, the system achieved higher task completion rates and was rated easier to use.
Conclusion
The paper introduces a prompt-driven, agentic video editing system that enables natural-language interaction with complex, long-form media. By building a structured semantic index, it automates tasks such as highlight extraction and narrative summarization. Evaluations show that its modular, agent-based design improves usability, factual grounding, and reliability over existing editors. While challenges such as latency and limited fine-grained control remain, the system lays a foundation for transparent, interpretable, and scalable AI-assisted video editing.

MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
Large Language Models (LLMs) are rapidly evolving into powerful multimodal systems (VLMs), but when you switch from a single image to a long, high-frame-rate **video**, the computational cost explodes. This immense computational overhead—due to the sheer number of visual tokens—creates significant latency and memory bottlenecks, making it nearly impossible to deploy high-performing VLMs in real-time, resource-constrained applications like autonomous driving, surveillance, and real-time video Q&A.

UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks
User-generated content (UGC) videos—like TikToks—combine fast-paced visuals, music, and speech, requiring deep multimodal understanding. Existing benchmarks are mostly visual-only, neglecting the crucial role of audio. To address this, researchers introduce **UGC-VideoCap**—a new benchmark for detailed audio-visual captioning—and **UGC-VideoCaptioner-3B**, a compact model distilled from Gemini-2.5 Flash using supervised fine-tuning (SFT) and GRPO reinforcement learning.

Memories-S0: An Efficient and Accurate Framework for Security Video Understanding
Security video understanding plays a pivotal role in fields such as smart cities and public safety. However, its development has long been constrained by the scarcity of high-quality, large-scale surveillance-domain data.