X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding

Understanding human behavior from first-person video has long been a challenge in AI research. Most benchmarks today focus on short clips or moderately long videos, rarely exceeding an hour. But in reality, our lives unfold over many hours, with rich temporal dependencies and contextual continuity. X-LeBench introduces the first benchmark designed specifically for extremely long egocentric videos, spanning from 23 minutes up to 16.4 hours, to push AI models toward reasoning over long time horizons.

Dataset and Life-Logging Simulation Pipeline

To overcome the difficulty of collecting continuous day-long recordings, the researchers created a life-logging simulation pipeline. This pipeline combines synthetic daily plans with real-world footage from Ego4D, generating coherent multi-hour video life logs. The process unfolds in three stages:

  • Persona Generation: Create diverse character profiles (location + MBTI type), daily agendas, and activity chunks.
  • Video Extraction: Select from 7,852 Ego4D clips, extracting time, scene, and activity information.
  • Matching & Simulation: Align daily-plan chunks with real video clips to build continuous simulated life logs (a sketch of this matching step follows below).

The result is 432 video life logs divided into short (~2.3 h), medium (~5.3 h), and long (~8.6 h) categories, covering 135 daily scenarios such as cooking, shopping, commuting, and watching TV.
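For intuition, here is a minimal sketch of how the matching stage could work, assuming each plan chunk and each Ego4D clip carries free-text scene/activity tags and an approximate time of day. The field names, the Jaccard text scorer, and the greedy selection strategy are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of plan-to-clip matching (assumed data fields, not the paper's code).
from dataclasses import dataclass

@dataclass
class PlanChunk:
    start_hour: float          # scheduled time of day for this activity
    scene: str                 # e.g. "kitchen"
    activity: str              # e.g. "cooking dinner"

@dataclass
class Clip:
    clip_id: str
    hour: float                # approximate recording time of day
    scene: str
    activity: str

def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over whitespace tokens (simple illustrative matcher)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def match_plan_to_clips(plan: list[PlanChunk], clips: list[Clip],
                        max_hour_gap: float = 2.0) -> list[Clip]:
    """Greedily pick, for each plan chunk, the unused clip whose scene/activity
    text best matches and whose recording time is close to the scheduled slot."""
    used, log = set(), []
    for chunk in plan:
        candidates = [c for c in clips
                      if c.clip_id not in used
                      and abs(c.hour - chunk.start_hour) <= max_hour_gap]
        if not candidates:
            continue  # leave a gap in the simulated day if nothing fits
        best = max(candidates,
                   key=lambda c: token_overlap(c.scene, chunk.scene)
                               + token_overlap(c.activity, chunk.activity))
        used.add(best.clip_id)
        log.append(best)
    return log
```

Concatenating the matched clips in chronological order then yields one continuous, multi-hour simulated life log.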
[Overview of the life-logging simulation pipeline]

[Dataset Statistics]

Benchmark Tasks

X-LeBench introduces a suite of tasks designed to evaluate long-form video understanding. These tasks challenge models not only to recognize what happens in short clips but also to track, summarize, and reason across hours of activity (a recall-style scoring sketch for temporal localization follows the list):

  • Temporal Localization: Identify when specific objects, people, or actions occur.
  • Summarization: Generate single-video, multi-video, and holistic summaries.
  • Action Counting: Count occurrences of specific actions.
  • Summary Ordering: Reconstruct the chronological order of shuffled summaries.
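The article does not spell out the exact scoring protocols, but since the results below report recall for temporal localization, a plausible recall computation looks like the sketch below. The temporal-IoU matching rule and the 0.5 threshold are assumptions, not the benchmark's defined metric.

```python
# Assumed recall-style scoring for temporal localization over (start, end) windows in seconds.
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def localization_recall(preds: list[tuple[float, float]],
                        gts: list[tuple[float, float]],
                        iou_thresh: float = 0.5) -> float:
    """Fraction of ground-truth moments covered by at least one prediction."""
    if not gts:
        return 0.0
    hits = sum(any(temporal_iou(p, g) >= iou_thresh for p in preds) for g in gts)
    return hits / len(gts)
```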
[Tasks in this benchmark and their corresponding examples. Q and A denote query and answer examples, respectively.]

Performance of Existing MLLM Approaches

  • Poor overall performance: Across all tasks, existing MLLMs struggled significantly on long videos.
  • Retrieve-Socratic: Best in temporal localization, outperforming Gemini-1.5 Flash by +8.26% recall (one way such a retrieve-then-reason pipeline could work is sketched after this list).
  • Gemini-1.5 Flash: Strongest in summarization, but limited by token constraints, leading to information loss on long inputs.
  • Ordering: Accuracy above 85% on short videos, but below 25% on long ones, highlighting difficulty in temporal reasoning.
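The article does not detail how Retrieve-Socratic works internally, but Socratic-style baselines typically convert video into timestamped text captions and let an LLM reason over retrieved snippets. The sketch below is written under that assumption; the keyword retrieval, prompt format, and `ask_llm` callable are hypothetical placeholders, not the baseline's actual code.

```python
# Hypothetical retrieve-then-reason pipeline over pre-computed clip captions.
from typing import Callable

def retrieve(captions: list[tuple[float, str]], query: str, k: int = 20) -> list[tuple[float, str]]:
    """Keep the k captions sharing the most words with the query, in chronological order."""
    q = set(query.lower().split())
    scored = sorted(captions, key=lambda c: len(q & set(c[1].lower().split())), reverse=True)
    return sorted(scored[:k])

def answer_query(captions: list[tuple[float, str]], query: str,
                 ask_llm: Callable[[str], str]) -> str:
    """Build a text-only prompt from retrieved captions and delegate reasoning to an LLM."""
    context = "\n".join(f"[{t / 3600:.2f} h] {text}" for t, text in retrieve(captions, query))
    prompt = (f"Video log captions:\n{context}\n\n"
              f"Question: {query}\nAnswer with the relevant time span.")
    return ask_llm(prompt)
```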
[Evaluation Results on X-LeBench]

[Performance comparison on X-LeBench. (a): The performance comparison of each method across tasks for different data categories. (b): The performance comparison across different data categories for different tasks.]

Conclusion

X-LeBench sets a new milestone for egocentric video research, redefining “long video” as multi-hour, continuous, and context-rich recordings. It reveals that today’s multimodal large language models remain far from mastering such inputs. By providing 432 synthetic yet realistic video life logs and a comprehensive suite of tasks, X-LeBench lays the foundation for advancing AI toward robust long-term video understanding and real-world memory systems.

Read more

MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding

Large Language Models (LLMs) are rapidly evolving into powerful multimodal systems (VLMs), but when you switch from a single image to a long, high-frame-rate **video**, the computational cost explodes. The sheer number of visual tokens creates significant latency and memory bottlenecks, making it nearly impossible to deploy high-performing VLMs in real-time, resource-constrained applications like autonomous driving, surveillance, and real-time video Q&A.

UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks

User-generated content (UGC) videos—like TikToks—combine fast-paced visuals, music, and speech, requiring deep multimodal understanding. Existing benchmarks are mostly visual-only, neglecting the crucial role of audio. To address this, researchers introduce **UGC-VideoCap**—a new benchmark for detailed audio-visual captioning—and **UGC-VideoCaptioner-3B**, a compact model distilled from Gemini-2.5 Flash using supervised fine-tuning (SFT) and GRPO reinforcement learning.

Prompt-Driven Agentic Video Editing System: Autonomous Comprehension of Long-Form, Story-Driven Media

Editing long-form, story-driven videos is one of the most cognitively demanding creative tasks. This paper presents an agentic video editing system that autonomously understands and reconstructs narrative media via natural-language prompts. Instead of cutting manually, users can request: *“Summarize this documentary as a 3-minute cinematic recap.”* The system produces an edited, narrated video, integrating comprehension, retrieval, and rendering, as shown in the accompanying figure.