
We propose Memories-S0, a framework designed for security video understanding. At the data level, we leverage powerful video generation models (such as Veo 3) to create a large, diverse set of synthetic surveillance videos, easing the difficulty of acquiring real surveillance data.
Tech Report
2025-10-10

Large Language Models (LLMs) are rapidly evolving into powerful multimodal vision-language models (VLMs), but moving from a single image to a long, high-frame-rate video makes the computational cost explode. MARC introduces memory-augmented reinforcement-learning (RL) token compression for efficient video understanding; a toy sketch of the token-selection idea appears after this entry.
Publication
2025-10-09
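
MARC's actual compression policy is not reproduced here; the snippet below is only a minimal sketch of the generic idea of learned token selection, where a small scorer ranks per-frame visual tokens and only the top fraction is kept. The class and parameter names (FrameTokenCompressor, keep_ratio) are illustrative, not the paper's.

    import torch
    import torch.nn as nn


    class FrameTokenCompressor(nn.Module):
        """Toy token selector: score visual tokens and keep only the top fraction.

        This is not MARC's method; it only illustrates the generic idea of
        learned token compression for long videos.
        """

        def __init__(self, dim: int, keep_ratio: float = 0.25):
            super().__init__()
            self.scorer = nn.Linear(dim, 1)  # per-token importance score
            self.keep_ratio = keep_ratio

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # tokens: (batch, num_tokens, dim) visual tokens from a video encoder
            scores = self.scorer(tokens).squeeze(-1)              # (batch, num_tokens)
            k = max(1, int(tokens.shape[1] * self.keep_ratio))
            keep = scores.topk(k, dim=1).indices                  # indices of the kept tokens
            keep, _ = keep.sort(dim=1)                            # restore temporal order
            return torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))


    # e.g. 64 frames x 16 tokens per frame, 768-dim features
    compressed = FrameTokenCompressor(dim=768)(torch.randn(2, 1024, 768))
    print(compressed.shape)  # torch.Size([2, 256, 768])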

User-generated content (UGC) videos—like TikToks—combine fast-paced visuals, music, and speech, requiring deep multimodal understanding. We introduce UGC-VideoCap—a new benchmark for detailed audio-visual captioning—and UGC-VideoCaptioner-3B, a compact model distilled from Gemini-2.5 Flash.
Publication
2025-10-05

Editing long-form, story-driven videos is one of the most cognitively demanding creative tasks. This paper presents an agentic video editing system that autonomously understands and reconstructs narrative media via natural-language prompts.
Publication
2025-09-28

We introduce CULTURE3D, a large-scale, high-fidelity dataset for cultural-heritage 3D reconstruction, comprising 41,006 drone-captured images across 20 diverse scenes, enabling rigorous evaluation of modern Gaussian-based scene rendering.
Publication
2025-07-09

We propose ST-R1, a video-based reasoning model trained with a paradigm that incorporates reverse thinking into the reinforcement learning process, significantly improving performance. We combine long chain-of-thought (long-CoT) supervised fine-tuning with Group Relative Policy Optimization (GRPO) reinforcement learning; a toy sketch of GRPO's group-relative advantages follows this entry.
Publication
2025-04-23
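
GRPO's central ingredient, sampling a group of responses per prompt and normalizing each reward against the group's own statistics, fits in a few lines. The snippet below is a simplified sketch of that advantage computation only, not the full ST-R1 training pipeline, and the reward values are made up for illustration.

    import numpy as np

    def group_relative_advantages(rewards):
        """GRPO-style advantages: normalize each reward by its group's mean and std.

        `rewards` holds the scalar rewards of G responses sampled for the same
        prompt; the returned advantage of a response is broadcast over all of
        its tokens during the policy-gradient update.
        """
        r = np.asarray(rewards, dtype=np.float64)
        return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids division by zero

    # Example: four sampled chains of thought for one video question, scored
    # 1.0 if the final answer is correct and 0.0 otherwise (illustrative rewards).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # approx. [ 1. -1. -1.  1.]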

The field of video understanding faces a scalability bottleneck. We introduce VideoMAP, a hybrid Mamba-Transformer framework with frame-wise autoregressive pretraining that combines Mamba's sequential efficiency with the Transformer's global reasoning; a toy version of the frame-wise autoregressive objective is sketched after this entry.
Publication
2025-03-16
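
VideoMAP's hybrid Mamba-Transformer stack is not reproduced here; the sketch below only illustrates what a frame-wise autoregressive objective looks like, with a small causal Transformer standing in for the real backbone and random per-frame features standing in for encoder outputs.

    import torch
    import torch.nn as nn

    # Toy frame-wise autoregressive objective: every frame embedding is predicted
    # from the frames before it. A small causal Transformer stands in for the
    # backbone; VideoMAP's actual hybrid Mamba-Transformer stack is not reproduced.
    dim, num_frames, batch = 256, 16, 4
    backbone = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
        num_layers=2,
    )
    head = nn.Linear(dim, dim)                    # predicts the next frame's embedding

    frames = torch.randn(batch, num_frames, dim)  # per-frame features from a visual encoder
    causal_mask = nn.Transformer.generate_square_subsequent_mask(num_frames)
    context = backbone(frames, mask=causal_mask)  # each position only sees earlier frames

    pred = head(context[:, :-1])                  # predictions for frames 1..T-1
    target = frames[:, 1:]                        # next-frame features to reconstruct
    loss = nn.functional.mse_loss(pred, target)
    print(loss.item())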

Understanding human behavior from first-person video has long been a challenge in AI research. X-LeBench introduces the first benchmark designed specifically for extremely long egocentric videos, spanning from 23 minutes up to 16.4 hours, to push AI models toward reasoning over long time horizons.
Publication
2025-01-12

We have developed a system that acts like an external memory for humans. By combining Augmented Reality (AR) glasses with advanced AI, the project explores how everyday experiences captured in first-person video can be encoded into language, stored in a database, and later retrieved when we need them most; a toy encode-store-retrieve loop is sketched after this entry.
Publication
2024-10-18
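
The encode-store-retrieve loop described above can be illustrated with a tiny self-contained example: each captured moment is stored as a text caption, and a later query is matched against the stored captions by bag-of-words cosine similarity. The captions, query, and similarity measure are stand-ins for the real captioner, database, and retriever.

    import math
    from collections import Counter

    # Toy episodic-memory store: first-person moments are saved as text captions
    # (in the real system these would come from a video captioner) and retrieved
    # later with a simple bag-of-words cosine similarity. Everything here is an
    # illustrative stand-in, not the paper's actual pipeline.
    memory = [
        ("2024-10-18 08:12", "left my keys on the kitchen counter next to the kettle"),
        ("2024-10-18 09:30", "discussed the quarterly report with Sam in meeting room B"),
        ("2024-10-18 13:05", "parked the car on level 3 of the office garage"),
    ]

    def bow(text):
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def recall(query, top_k=1):
        q = bow(query)
        ranked = sorted(memory, key=lambda m: cosine(q, bow(m[1])), reverse=True)
        return ranked[:top_k]

    print(recall("where did I leave my keys"))
    # [('2024-10-18 08:12', 'left my keys on the kitchen counter next to the kettle')]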