Memories.ai Research

Where innovation meets memory intelligence.

Memories-S0: An Efficient and Accurate Framework for Security Video Understanding

We propose Memories-S0, an innovative framework designed specifically for security video understanding. At the data level, we leverage powerful video generation models (such as Veo 3) to create a large, diverse set of synthetic surveillance videos, alleviating the difficulty of acquiring real surveillance data.

Tech Report

2025-10-10

MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding

Large Language Models (LLMs) are rapidly evolving into powerful multimodal vision-language models (VLMs), but moving from a single image to a long, high-frame-rate video makes the computational cost explode. MARC addresses this with memory-augmented, reinforcement-learning-driven token compression for efficient video understanding (a generic token-pruning sketch follows this entry).

Publication

2025-10-09
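
The blurb above names token compression only at a high level. As a rough illustration, not MARC's actual method, the sketch below prunes video patch tokens to a fixed fraction using a learned importance score; the module name, the 25% keep ratio, and the tensor shapes are assumptions made for the example.

```python
import torch
import torch.nn as nn

class TopKTokenCompressor(nn.Module):
    """Generic top-k visual-token pruning (illustrative only, not MARC's method)."""
    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)   # learned per-token importance score
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim), e.g. all patch tokens from a long video
        scores = self.scorer(tokens).squeeze(-1)          # (batch, num_tokens)
        k = max(1, int(tokens.size(1) * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices               # indices of kept tokens
        idx = idx.sort(dim=1).values                      # preserve temporal order
        return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

# Example: compress 16 frames x 256 patch tokens to 25% before they reach the LLM.
compressor = TopKTokenCompressor(dim=1024, keep_ratio=0.25)
video_tokens = torch.randn(2, 16 * 256, 1024)
print(compressor(video_tokens).shape)  # torch.Size([2, 1024, 1024])
```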

UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks

User-generated content (UGC) videos—like TikToks—combine fast-paced visuals, music, and speech, requiring deep multimodal understanding. We introduce UGC-VideoCap—a new benchmark for detailed audio-visual captioning—and UGC-VideoCaptioner-3B, a compact model distilled from Gemini-2.5 Flash.

Publication

2025-10-05

Prompt-Driven Agentic Video Editing System: Autonomous Comprehension of Long-Form, Story-Driven Media

Editing long-form, story-driven videos is one of the most cognitively demanding creative tasks. This paper presents an agentic video editing system that autonomously understands and reconstructs narrative media via natural-language prompts.

Publication

2025-09-28

CULTURE3D: A Large-Scale and Diverse Dataset of Cultural Landmarks and Terrains for Gaussian-Based Scene Rendering

We introduce CULTURE3D, a large-scale, high-fidelity dataset for cultural-heritage 3D reconstruction, comprising 41,006 drone-captured images across 20 diverse scenes, enabling rigorous evaluation of modern Gaussian-based scene rendering.

Publication

2025-07-09

ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos

We propose ST-R1, a video-based reasoning model whose training paradigm incorporates reverse thinking into the reinforcement learning process, significantly enhancing performance. Training combines long-chain-of-thought (long-CoT) supervised fine-tuning with Group Relative Policy Optimization (GRPO) reinforcement learning (see the GRPO advantage sketch after this entry).

Publication

2025-04-23
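
For context on the GRPO component mentioned above: GRPO replaces a learned critic with group-relative advantages, standardizing each sampled response's reward against the mean and standard deviation of its sampling group. A minimal sketch of that advantage computation, not the paper's training code, is below.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages in the GRPO style.

    rewards: (num_prompts, group_size), one scalar reward per sampled response.
    Each response's advantage is its reward standardized within its own group,
    so no separate value network (critic) is needed.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each.
rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0],
                        [0.2, 0.2, 0.8, 0.2]])
print(grpo_advantages(rewards))
```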

VideoMAP: Toward Scalable Mamba-based Video Autoregressive Pretraining

The field of video understanding faces a scalability bottleneck. We introduce VideoMAP, a hybrid Mamba-Transformer framework with frame-wise autoregressive pretraining that effectively combines Mamba's sequential efficiency with Transformer's global reasoning.

Publication

2025-03-16

X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding

Understanding human behavior from first-person video has long been a challenge in AI research. X-LeBench introduces the first benchmark designed specifically for extremely long egocentric videos, with recordings spanning 23 minutes to 16.4 hours, to push AI models toward reasoning over long time horizons.

Publication

2025-01-12

Encode-Store-Retrieve: Augmenting Human Memory through Language-Encoded Egocentric Perception

We have developed a system that acts as an external memory for humans. By combining Augmented Reality (AR) glasses with advanced AI, the project explores how everyday experiences captured in first-person video can be encoded into language, stored in a database, and later retrieved when we need them most (a generic caption-retrieval sketch follows this entry).

Publication

2024-10-18
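
The encode-store-retrieve pipeline described above can be illustrated with a generic caption-embedding store queried in natural language. The sentence-transformers backend, the model name, and the example captions below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

# Captions produced from egocentric video clips (illustrative data).
captions = [
    "09:14 - placed car keys on the kitchen counter",
    "12:30 - took medication with a glass of water",
    "18:05 - left umbrella by the front door",
]

# Encode: turn language-encoded experiences into vectors; Store: keep them as the "memory".
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
store = model.encode(captions, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 1) -> list[str]:
    """Retrieve: return the stored caption(s) most similar to a natural-language query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = store @ q                    # cosine similarity (unit vectors)
    best = np.argsort(-scores)[:top_k]
    return [captions[i] for i in best]

print(retrieve("where did I leave my keys?"))
```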