Security video understanding plays a pivotal role in fields such as smart cities and public safety, yet its development has long been constrained by the scarcity of high-quality, large-scale surveillance data. Moreover, deploying complex video understanding models to resource-constrained edge and mobile devices for real-time, high-efficiency on-device processing remains a significant challenge. To address these challenges, we propose Memories-S0, a framework designed specifically for security video understanding. At the data level, we leverage powerful video generation models (such as Veo 3) to create a massive and diverse set of synthetic surveillance videos, effectively alleviating the difficulty of data acquisition. At the model level, we employ a 3-billion-parameter (3B) video understanding model and introduce a series of aggressive optimization strategies: novel input token compression and model compression algorithms that significantly reduce computational complexity and memory footprint, enabling the model to run efficiently on mobile devices and provide real-time feedback. For training, we propose an efficient post-training strategy that substantially enhances model performance at very low training cost. Taken together, Memories-S0 introduces innovations at the data, model, and training-strategy levels, offering a powerful and practical solution for real-world security video understanding. Experimental results demonstrate that our framework achieves state-of-the-art performance on multiple security video understanding tasks while maintaining very high operational efficiency.
1. Introduction
The field of security video understanding has become a cornerstone of modern smart cities, public safety, and intelligent traffic management systems. By automatically analyzing and interpreting vast amounts of video data from surveillance cameras, these systems can proactively identify anomalies, track suspicious activities, and provide critical insights for law enforcement and urban planners. However, despite its immense potential, the advancement of security video understanding is hindered by two fundamental challenges. First, the acquisition of high-quality, large-scale video datasets in the surveillance domain is exceptionally difficult due to privacy concerns, legal restrictions, and the sheer logistical complexity of data collection and annotation. This data scarcity severely limits the performance and generalizability of deep learning models. Second, the deployment of powerful video understanding models on resource-constrained edge and mobile devices, which is crucial for real-time, on-site applications, remains a significant hurdle. Existing state-of-the-art models are often computationally expensive and memory-intensive, making them unsuitable for efficient real-world deployment.

Existing approaches have primarily focused on improving model architectures or fine-tuning existing models on limited, publicly available datasets. While these methods have shown progress, they fail to address the core challenges of data scarcity and deployment feasibility. Data augmentation techniques are often insufficient to create the diversity needed for robust model training. Moreover, existing model compression techniques, such as pruning and quantization, often lead to a non-trivial drop in performance, compromising accuracy for efficiency. Consequently, a comprehensive framework that addresses both data-level and model-level constraints is critically needed to bridge the gap between academic research and practical security applications.
To overcome these limitations, we introduce Memories-S0, an efficient and accurate framework for security video understanding. Our contributions are threefold:
- Data Innovation. We pioneer a novel data generation paradigm by leveraging advanced video generation models, such as Veo 3, to synthesize a massive, diverse, and high-quality dataset tailored specifically for the surveillance domain. This approach effectively mitigates the data scarcity problem and provides a rich training ground for robust models.
- Model Efficiency. We design a highly efficient 3-billion-parameter (3B) model architecture specifically for edge-side deployment. To achieve this, we propose innovative input token compression and model compression algorithms that drastically reduce computational complexity and memory footprint without sacrificing performance.
- Efficient Training Strategy. We develop a sophisticated and highly efficient post-training strategy that enables significant performance boosts with minimal training overhead. This approach bypasses the need for costly and time-consuming full-scale fine-tuning, making our framework highly practical.
2. The Memories-S0 Framework
2.1. Framework Overview
Our proposed Memories-S0 framework is an end-to-end solution designed to address the data scarcity and computational constraints inherent in security video understanding. The framework integrates three core components: a sophisticated data generation module, a highly efficient model architecture, and a novel post-training strategy. This synergy enables us to train powerful models on a large, diverse, synthetic dataset and deploy them efficiently on edge devices, all while maintaining high performance.
2.2. Data Generation and Processing
To overcome the pervasive issue of data scarcity, we leverage advanced video generation models to synthesize a domain-specific, large-scale dataset. We utilize cutting-edge models like Veo 3 to generate realistic and diverse surveillance footage. The data generation process is as follows: we define a set of prompts describing various security scenarios (e.g., "a person walking in a dimly lit hallway," "a vehicle passing a surveillance camera at night," "an unattended package left in a public space"). These prompts guide the model to create videos with diverse backgrounds, lighting conditions, and object interactions, which are critical for robust model training.
A key advantage of this synthetic data is the ability to obtain pixel-perfect annotations for tasks such as object detection, tracking, and action recognition. We employ an automated annotation pipeline that leverages the generative process to produce ground truth labels, including bounding boxes, trajectories, and temporal action segments. This process ensures the high quality and accuracy of our training data, a feature often lacking in real-world surveillance datasets.
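To make the pipeline concrete, here is a minimal sketch of such a prompt-driven generation loop; the scenario catalog, the `generate_video` stub, and the metadata schema are placeholders for whichever generation backend (e.g., Veo 3) and annotation pipeline are actually used, not the Memories-S0 implementation itself.

```python
"""Minimal sketch of a prompt-driven synthetic-surveillance data pipeline.
`generate_video` is a placeholder for the chosen video-generation backend
(e.g., Veo 3); the scenario catalog is illustrative only."""
import itertools
import json
from pathlib import Path

SCENES = ["a dimly lit hallway", "a parking lot at night", "a crowded station entrance"]
EVENTS = ["a person walking", "an unattended package left behind", "a vehicle passing the camera"]

def generate_video(prompt: str, out_path: Path) -> dict:
    # Placeholder: call the generation model, write the clip to out_path, and
    # return generation metadata from which ground-truth labels can be derived.
    out_path.write_bytes(b"")  # stand-in for the rendered clip
    return {"prompt": prompt, "boxes": [], "tracks": [], "action_segments": []}

def build_dataset(out_dir: str = "synthetic_surveillance") -> None:
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for i, (event, scene) in enumerate(itertools.product(EVENTS, SCENES)):
        prompt = f"Surveillance footage of {event} in {scene}."
        meta = generate_video(prompt, out / f"clip_{i:05d}.mp4")
        # Each clip is stored together with its automatically derived annotations.
        (out / f"clip_{i:05d}.json").write_text(json.dumps(meta, indent=2))

if __name__ == "__main__":
    build_dataset()
```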
2.3. Model Architecture and Optimization
Our framework is built upon a 3-billion-parameter (3B) video understanding model, chosen for its powerful representational capacity. To make this large model suitable for resource-constrained edge devices, we introduce two critical optimization strategies:
2.3.1. Extreme Input Token Compression
Conventional video models process long sequences of input tokens (e.g., image patches or frames), leading to high computational costs. We propose an innovative input token compression algorithm that dramatically reduces the number of tokens while preserving salient information. Our algorithm analyzes the spatiotemporal redundancy within the video frames and dynamically prunes or merges redundant tokens. For instance, in a static surveillance scene, our algorithm discards tokens from background regions, focusing the model's attention on foreground objects and their motion. This approach significantly reduces the input sequence length, thereby accelerating inference without a substantial loss in semantic information.
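A minimal sketch of this idea is shown below, assuming patch tokens accompanied by a per-patch temporal-change score derived from frame differences; the actual scoring and merging rules used in Memories-S0 may differ.

```python
"""Sketch of spatiotemporal token pruning: keep only patches that change across
frames, discarding tokens from static background regions. This is an assumed
illustration, not the paper's exact algorithm."""
import torch

def prune_static_tokens(tokens: torch.Tensor, patches: torch.Tensor, keep_ratio: float = 0.25):
    """tokens:  (T, N, D) patch embeddings for T frames with N patches each.
    patches: (T, N, P) raw pixel patches used to measure temporal change."""
    T, N, D = tokens.shape
    # Per-patch motion score: mean absolute difference w.r.t. the previous frame.
    diff = (patches[1:] - patches[:-1]).abs().mean(dim=-1)   # (T-1, N)
    motion = torch.cat([diff[:1], diff], dim=0)              # (T, N); frame 0 reuses the first diff
    k = max(1, int(N * keep_ratio))
    keep_idx = motion.topk(k, dim=1).indices                 # (T, k) most dynamic patches per frame
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(T, k, D))
    return kept, keep_idx

# 16 frames, 196 patches, 768-dim tokens, 16x16x3 raw patches -> 4x shorter sequence.
tok, pix = torch.randn(16, 196, 768), torch.randn(16, 196, 16 * 16 * 3)
kept, idx = prune_static_tokens(tok, pix)
print(kept.shape)  # torch.Size([16, 49, 768])
```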
2.3.2. Efficient Model Compression Algorithms
To further reduce the model's footprint and latency, we apply a combination of model compression techniques. Specifically, we employ structured pruning to remove entire heads or layers that contribute minimally to performance, and utilize low-rank factorization to approximate weight matrices with smaller, more efficient representations. These methods are carefully applied to achieve a high compression ratio while maintaining model accuracy, ensuring that our 3B model can run efficiently on mobile hardware.
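As an illustration of the low-rank factorization step, the sketch below replaces a linear layer's weight matrix with a truncated SVD factorization; the ranks, target layers, and pruning criteria used in practice are not specified in this section and are assumptions here.

```python
"""Sketch of low-rank factorization of a linear layer via truncated SVD."""
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace W (out x in) with B (out x rank) @ A (rank x in)."""
    W = layer.weight.data                                   # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = nn.Linear(layer.in_features, rank, bias=False)
    B = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    A.weight.data = Vh[:rank, :].clone()                    # (rank, in_features)
    B.weight.data = (U[:, :rank] * S[:rank]).clone()        # (out_features, rank)
    if layer.bias is not None:
        B.bias.data = layer.bias.data.clone()
    return nn.Sequential(A, B)

layer = nn.Linear(1024, 1024)
compressed = factorize_linear(layer, rank=128)   # ~4x fewer weights for this layer
x = torch.randn(2, 1024)
print(compressed(x).shape)                       # (2, 1024); approximates layer(x)
```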
2.4. The Post-training Strategy
2.4.1. Motivation and Challenges
Training large-scale video models from scratch is an incredibly resource-intensive process, requiring significant computational power and time. Fine-tuning on new datasets can also be costly, especially when data is limited. We argue that for security video applications, an alternative, cost-effective strategy is needed to bridge the gap between a pre-trained general model and a high-performance, domain-specific model. Our post-training strategy is motivated by the need to achieve state-of-the-art performance with minimal training effort and resources.
2.4.2. Post-training Strategy Design
Our post-training strategy is a two-step process that leverages a small amount of domain-specific data to significantly boost model performance:
- We employ an event-based temporal shuffling strategy combined with Reinforcement Learning (RL) algorithms to improve the model's sequential understanding (see the sketch below). This approach allows the model to better grasp the temporal relationships within event sequences, leading to enhanced performance.
- We design an efficient and effective training recipe for both Supervised Fine-Tuning (SFT) and RL. This recipe allows us to substantially improve the model's capabilities with a minimal computational footprint.
By focusing on these key strategies, our method enables rapid performance gains without the need for a large-scale fine-tuning dataset or extensive computational resources.
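To make the first ingredient concrete, the sketch below shows one way an event-ordering objective can be constructed from shuffled event segments, with a simple exact-position reward usable by an RL algorithm; the segment definition and reward shape are illustrative assumptions rather than the exact Memories-S0 recipe.

```python
"""Sketch of an event-based temporal-shuffling task: shuffle event segments and
reward the model for recovering the original order (assumed reward design)."""
import random

def make_ordering_example(event_segments: list[str], seed: int = 0):
    """Return (shuffled_segments, target_order); target_order[i] is the position
    of the i-th original event inside the shuffled list."""
    rng = random.Random(seed)
    order = list(range(len(event_segments)))
    rng.shuffle(order)
    shuffled = [event_segments[i] for i in order]
    target = [order.index(i) for i in range(len(event_segments))]
    return shuffled, target

def ordering_reward(predicted: list[int], target: list[int]) -> float:
    """Fraction of positions recovered correctly -- a simple dense RL reward."""
    return sum(p == t for p, t in zip(predicted, target)) / len(target)

events = ["person enters hallway", "person drops a bag", "person leaves the frame"]
shuffled, target = make_ordering_example(events, seed=3)
print(shuffled, target, ordering_reward([0, 2, 1], target))
```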
3. Experiments
Table 1 | Anomaly detection performance of different MLLMs with various prompting methods.

To systematically and rigorously evaluate our model's performance, we adopted SmartHomeBench (Zhao et al., 2025) as our benchmark. This dataset is widely recognized for its realistic scenarios, accurate annotations, and diverse tasks, particularly in the field of smart home video understanding. By comparing our results with the latest results reported on this benchmark, we can comprehensively and fairly demonstrate the strengths of our model across key performance indicators. Table 1 presents the video anomaly detection (VAD) results.
3.1. Experimental Setup
We evaluated the anomaly detection performance of various large language models on a benchmark dataset. The models were tested using several prompting methods to assess their effectiveness under different conditions. The performance metrics used for evaluation were Accuracy, Precision, Recall, and F1-score.
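For reference, the sketch below shows how these four metrics can be computed for binary anomaly detection, assuming "anomalous" is treated as the positive class.

```python
"""Accuracy, precision, recall, and F1 for binary anomaly detection (1 = anomalous)."""
def detection_metrics(y_true: list[int], y_pred: list[int]) -> dict:
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1}

print(detection_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```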
3.2. Prompting Methods
The following prompting methods were employed to evaluate the performance of the MLLMs (a sketch of how such prompts can be assembled is given after the list):
- Zero-shot: In this method, the model is provided with a task description and an input, but no examples. It must perform the task based solely on its pre-trained knowledge.
- Chain-of-Thought (CoT): This technique guides the model to reason step-by-step before arriving at a final answer. The prompt includes instructions to think through the problem, which often leads to improved performance on complex reasoning tasks.
- Few-shot CoT: This is an extension of the CoT method where the prompt includes a few examples of input-output pairs, with each example demonstrating the step-by-step reasoning process.
- In-Context Learning (ICL): This method provides the model with a few examples of correct input-output pairs in the prompt, allowing it to learn the pattern and apply it to a new input without any explicit fine-tuning.
- Taxonomy-Driven Reflective LLM Chain (TRLC; Zhao et al., 2025): This advanced prompting strategy leverages a pre-defined taxonomy to guide the model's reasoning and enhance its understanding of the problem space. By using a structured knowledge base (the taxonomy), the model can perform more accurate and contextually relevant analysis, often reflecting on its own outputs to refine the result.
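As a reference for how these styles differ in practice, the sketch below assembles zero-shot, CoT, and few-shot CoT prompts for video anomaly detection; the task wording and demonstrations are illustrative, not the benchmark's actual prompts.

```python
"""Sketch of zero-shot, CoT, and few-shot CoT prompt construction (illustrative wording)."""
TASK = ("Decide whether the surveillance clip described below contains an anomaly. "
        "Answer 'normal' or 'anomalous'.")

def zero_shot(clip: str) -> str:
    return f"{TASK}\n\nClip: {clip}\nAnswer:"

def chain_of_thought(clip: str) -> str:
    return (f"{TASK}\n\nClip: {clip}\n"
            "Think step by step about the objects, actions, and context before answering.\n"
            "Reasoning:")

def few_shot_cot(clip: str, examples: list[tuple[str, str, str]]) -> str:
    """examples: (clip, reasoning, answer) triples demonstrating the reasoning process."""
    demos = "\n\n".join(f"Clip: {c}\nReasoning: {r}\nAnswer: {a}" for c, r, a in examples)
    return f"{TASK}\n\n{demos}\n\nClip: {clip}\nReasoning:"

print(zero_shot("a person loiters by a locked door for ten minutes at night"))
```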
3.3. Results and Analysis
As shown in Table 1, the performance of the MLLMs varies significantly across different prompting methods. Overall, more advanced prompting techniques, such as CoT, Few-shot CoT, and TRLC, generally lead to improved performance compared to the Zero-shot baseline, demonstrating the importance of structured reasoning and contextual guidance.
The Gemini models, particularly Gemini-1.5-pro, showed strong performance, with the TRLC method achieving an impressive F1-score of 76.36. Similarly, GPT-4o and GPT-4o-mini also benefited from these methods, with GPT-4o achieving its highest F1-score of 79.04 with TRLC. The Claude-3.5-sonnet model also demonstrated competitive results, achieving its peak F1-score of 80.88 with the TRLC method, highlighting the effectiveness of providing a well-structured taxonomy to the model.
A notable observation is the strong performance of our Memories-S1(3B) model. Despite its smaller size (3B parameters) compared to the other models, it achieved a remarkable F1-score of 79.21 using a simple Zero-shot prompting method. This result is highly competitive and, in some cases, surpasses the performance of much larger models that utilize more complex prompting strategies like CoT and ICL. This indicates that the framework design of Memories-S1(3B) is exceptionally effective for anomaly detection, allowing it to perform robustly without requiring complex in-context examples or reasoning chains. This finding is significant and suggests that smaller, highly-optimized models can be a viable alternative to large-scale general-purpose models for specific tasks like anomaly detection.
4. Conclusion and Future Work
In this paper, we introduced Memories-S0, a novel and effective framework that addresses the two primary challenges in security video understanding: data scarcity and efficient on-device deployment. Our work pioneers a data-level innovation by leveraging advanced video generation models to synthesize a large-scale, high-quality surveillance dataset, thereby overcoming the critical hurdle of data acquisition. Furthermore, we designed a highly efficient 3-billion-parameter model tailored for edge-side applications, utilizing novel input token and model compression algorithms to achieve a substantial reduction in computational complexity and memory footprint. Finally, our proposed efficient post-training strategy enables significant performance gains with minimal training overhead.
Collectively, Memories-S0 represents a significant step forward, offering a practical and powerful solution that bridges the gap between research and real-world security applications. Our experimental results validate that our framework not only achieves state-of-the-art performance but also maintains the operational efficiency required for practical deployment.