We have developed a system that acts like an external memory for humans. By combining Augmented Reality (AR) glasses with advanced AI, this work explores how everyday experiences captured in first-person video can be encoded into language, stored in databases, and later retrieved when we need them most. This “Encode-Store-Retrieve” framework points toward a future where AR devices serve as practical memory augmentation assistants.
The Memory Challenge
Human memory is fallible. We forget where we placed our belongings, overlook details of past events, or struggle to recall information during critical moments. Lifelogging with AR devices offers a way to record everything we see, but raw video data is massive and impractical to search. Traditional video storage consumes terabytes per year, and existing retrieval methods are inefficient. The challenge is clear: How can we capture, store, and recall experiences in a way that is both lightweight and useful?
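To make the scale of the problem concrete, here is a back-of-envelope comparison of raw video versus text storage. The wear time, bitrate, and caption size below are illustrative assumptions, not figures from the paper:

```python
# Back-of-envelope comparison of raw video vs. text storage.
# The daily wear time, video bitrate, and caption size are assumed values
# chosen only to illustrate the orders of magnitude involved.

HOURS_PER_DAY = 8                      # assumed daily wear time of AR glasses
VIDEO_MBPS = 5                         # assumed egocentric video bitrate (~1080p)
CAPTION_BYTES_PER_MIN = 1_000          # assumed ~1 KB of text per minute of video

seconds_per_year = HOURS_PER_DAY * 3600 * 365
video_tb_per_year = VIDEO_MBPS * 1e6 / 8 * seconds_per_year / 1e12
text_gb_per_year = CAPTION_BYTES_PER_MIN * (seconds_per_year / 60) / 1e9

print(f"Raw video:     ~{video_tb_per_year:.1f} TB/year")   # ~6.6 TB/year
print(f"Text captions: ~{text_gb_per_year:.2f} GB/year")    # ~0.18 GB/year
```

Under these assumptions, a year of continuous capture lands in the multi-terabyte range, while language encodings of the same footage fit in well under a gigabyte.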
The Encode-Store-Retrieve Approach
We designed a memory augmentation agent inspired by the three stages of human memory: encoding, storage, and retrieval. The workflow unfolds as follows (a minimal code sketch follows the list):
- Encode: Egocentric videos are transformed into detailed text descriptions using Ego-LLaVA, a fine-tuned vision-language model.
- Store: These text encodings are converted into vector embeddings and stored in a Chroma database for efficient search.
- Retrieve: When the user asks a question like “Where did I leave my keys?”, the system retrieves relevant memory chunks and uses GPT-4 to generate an answer.
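The following is a minimal sketch of this pipeline, assuming Python with the `chromadb` and `openai` client libraries; `ego_llava_caption` is a placeholder standing in for the fine-tuned Ego-LLaVA model, and all names here are illustrative rather than the authors' code:

```python
# Minimal Encode-Store-Retrieve sketch (illustrative only).
# Assumes chromadb and openai are installed; ego_llava_caption() stands in
# for the fine-tuned Ego-LLaVA vision-language model.

import chromadb
from openai import OpenAI

client = OpenAI()                       # GPT-4 access for the Retrieve step
db = chromadb.Client()                  # in-memory Chroma instance
memory = db.create_collection("egocentric_memory")

def ego_llava_caption(video_clip_path: str) -> str:
    """Placeholder: return a detailed text description of an egocentric clip."""
    raise NotImplementedError("Swap in the fine-tuned Ego-LLaVA model here.")

# --- Encode + Store -------------------------------------------------------
def store_clip(clip_id: str, video_clip_path: str) -> None:
    caption = ego_llava_caption(video_clip_path)   # Encode: video -> language
    # Store: Chroma embeds the caption with its default embedding function
    memory.add(ids=[clip_id], documents=[caption],
               metadatas=[{"source": video_clip_path}])

# --- Retrieve -------------------------------------------------------------
def answer(question: str, k: int = 5) -> str:
    # Retrieve the k most relevant memory chunks for the question
    hits = memory.query(query_texts=[question], n_results=k)
    context = "\n".join(hits["documents"][0])
    # Ask GPT-4 to answer using only the retrieved memories
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Answer using only the retrieved first-person memories."},
            {"role": "user",
             "content": f"Memories:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

# Example usage (after clips have been stored):
# print(answer("Where did I leave my keys?"))
```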

Results That Outperform Human Memory
The Encode-Store-Retrieve agent was evaluated on a public benchmark and in a user study:
- Benchmark Performance: On the QA-Ego4D dataset, the system achieved a BLEU score of 8.3, outperforming traditional models that scored between 3.4 and 5.8 (a scoring sketch follows this list).
- User Study: Using HoloLens 2, participants compared their own recollection with the agent's answers. The AI significantly outperformed humans on episodic memory questions (average score 4.1/5 vs 2.5/5 for humans).
- User Feedback: Participants valued the system's accuracy and detail, while noting concerns about privacy, constant camera use, and social acceptance.
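As a rough illustration of how such BLEU numbers can be computed, here is a sketch using the sacrebleu library. The answer strings are made up; in the paper the references come from QA-Ego4D and the hypotheses from the agent:

```python
# Sketch of BLEU scoring for generated answers against reference answers.
# The example strings below are invented for illustration only.
import sacrebleu

hypotheses = ["You left the keys on the kitchen counter next to the kettle."]
references = ["The keys are on the kitchen counter, beside the kettle."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")   # higher is better; the paper reports 8.3 on QA-Ego4D
```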


Applications
This memory augmentation system opens doors to practical use cases:
- Finding misplaced objects at home.
- Supporting students or professionals during learning and meetings.
- Helping researchers recall details in experiments or fieldwork.
- Offering memory assistance for individuals with cognitive challenges.
Key Takeaways
- Language encoding reduces storage demands dramatically compared to raw video.
- Ego-LLaVA fine-tuning improves memory recall performance significantly.
- Combining AR and AI makes practical external memory assistants possible.
- Privacy and social comfort are key challenges that must be addressed for adoption.
Conclusion
Encode-Store-Retrieve demonstrates how AR and AI can combine to extend human memory. By transforming lifelogging videos into searchable language representations, this approach makes memory augmentation practical, efficient, and user-friendly. While limitations remain, the work represents a step toward a future where technology can act as a reliable external memory for our daily lives.