How well can AI reason about the dynamic, four-dimensional world we live in? Humans effortlessly recall routes, reverse mental timelines, and describe how actions unfold over time. For multimodal large language models (MLLMs), however, this kind of spatio-temporal reasoning remains a major challenge. The ST-Think project introduces new datasets and training methods designed to push MLLMs closer to human-like reasoning about time and space.
Background and Challenge
Most benchmarks today only measure spatial reasoning, such as recognizing objects or static layouts. They fail to test whether AI can reason across time—tracking paths, understanding direction changes, or inferring what happened before or after an event. Without this, applications like self-driving, robotics, and AR/VR assistants remain limited in reliability.
How ST-Think Tackles the Challenge
- Ego-ST Bench: A new dataset of 789 egocentric videos and more than 5,000 QA pairs covering path descriptions, direction shifts, landmark changes, and action changes. Each task type includes both forward and backward reasoning questions; an illustrative sample record is sketched after this list.

Demonstration of tasks in Ego-ST Bench. It includes eight QAs across four task types: route description, direction description, landmark description, and action description.
- ST-R1 Training: A two-stage method that combines supervised chain-of-thought (CoT) training with GRPO reinforcement learning. This allows models to learn step-by-step reasoning and refine it via reward optimization.

[Spatio-Temporal Reasoning Model. Our model is trained in two stages: (1) creating Chain-of-Thought (CoT) data for supervised fine-tuning (SFT), and (2) enhancing the model with the rule-based GRPO reinforcement learning algorithm.]
- Reverse Reasoning: Unique focus on teaching models to 'think backwards,' mimicking human retrospective reasoning.
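
To make the benchmark structure concrete, here is a minimal sketch of what a single Ego-ST Bench item might look like, pairing a forward question with its backward counterpart. The field names (`video_id`, `task_type`, `forward_qa`, `backward_qa`) and the example route are illustrative assumptions, not the released schema.

```python
# Illustrative sketch of a single Ego-ST Bench item (field names and the
# example route are assumptions, not the released schema). Each item pairs a
# forward question about how the route unfolds with a backward question that
# asks the model to reason over the same clip in reverse.
sample_item = {
    "video_id": "ego_st_0042",   # hypothetical egocentric clip ID
    "task_type": "direction",    # one of: route, direction, landmark, action
    "forward_qa": {
        "question": "Starting from the entrance, which turns lead to the stairs?",
        "answer": "Left at the lobby, then right after the vending machine.",
    },
    "backward_qa": {
        "question": "Walking back from the stairs to the entrance, which turns do you take?",
        "answer": "Left after the vending machine, then right at the lobby.",
    },
}

def iter_qa(item):
    """Yield (direction, question, answer) for both reasoning directions."""
    for key in ("forward_qa", "backward_qa"):
        yield key, item[key]["question"], item[key]["answer"]

for key, question, answer in iter_qa(sample_item):
    print(f"[{key}] Q: {question}")
    print(f"[{key}] A: {answer}")
```

The backward question mirrors the forward one over the same clip, which is exactly what the reverse-reasoning component of the benchmark targets.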
Results
- Benchmark Tests: On Ego-ST Bench, both open-source (Qwen, LLaVA, InternVL) and closed-source (o3-mini, Gemini) models were evaluated. Results showed reasonable performance on static tasks like landmark recognition, but poor performance on dynamic path and direction reasoning.

- Model Gaps: Most models scored highest on landmarks, lowest on direction reasoning—highlighting a key limitation in temporal understanding.

- ST-R1 Gains: With only a limited amount of high-quality CoT data plus GRPO, ST-R1 significantly outperformed traditional supervised fine-tuning, and in cross-dataset generalization it achieved over 30% improvement compared to baselines. A minimal sketch of a rule-based GRPO-style reward is shown below.
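
As context for the GRPO stage described above, the sketch below shows one plausible way a rule-based reward and group-relative advantages could be computed for a group of sampled answers. The reward terms (a format check on `<think>`/`<answer>` tags plus an exact-match answer check) and all names are assumptions for illustration, not the authors' exact implementation.

```python
import re
from statistics import mean, pstdev

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Illustrative rule-based reward: a format term plus an accuracy term.
    (The exact reward rules used by ST-R1 may differ.)"""
    # Format term: reasoning must appear inside <think>...</think> followed by
    # a final answer inside <answer>...</answer>.
    format_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                               completion, flags=re.DOTALL))
    # Accuracy term: compare the extracted answer with the reference answer.
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    answer = match.group(1).strip().lower() if match else ""
    correct = answer == reference_answer.strip().lower()
    return 0.5 * format_ok + 1.0 * correct

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: standardize each reward against its own group."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Example: a group of sampled completions for one backward-reasoning question.
reference = "Left after the vending machine, then right at the lobby."
group = [
    "<think>Reverse the route step by step.</think>"
    "<answer>Left after the vending machine, then right at the lobby.</answer>",
    "<think>Go straight, then turn.</think><answer>Right, then left.</answer>",
    "Left after the vending machine.",  # missing the required format
]
rewards = [rule_based_reward(c, reference) for c in group]
print(rewards, group_relative_advantages(rewards))
```

Standardizing rewards within each sampled group is what lets GRPO skip a learned value network, keeping the second training stage lightweight.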


Future Directions
Looking ahead, ST-Think points toward new horizons for AI reasoning. Promising directions include expanding beyond video to incorporate multimodal signals like audio and haptics, improving long-term temporal modeling to handle extended sequences, and developing more efficient algorithms that can scale to real-world deployment in robotics, AR/VR, and assistive technologies.
Conclusion
ST-Think advances the frontier of spatio-temporal reasoning for multimodal large language models. By introducing Ego-ST Bench and the ST-R1 training method, it challenges models not just to recognize what they see, but to reason about how events unfold and even reverse in time. This work brings AI closer to understanding our 4D world, with wide-reaching implications for navigation, robotics, and human-AI interaction.