We introduce CULTURE3D, a large-scale, high-fidelity dataset for cultural-heritage 3D reconstruction, comprising 41,006 drone-captured images (48MP) across 20 diverse scenes and ~10B points. Compared with prior datasets, CULTURE3D offers greater scale, diversity (indoor + outdoor), and detail, enabling rigorous evaluation of modern Gaussian-based scene rendering. We benchmark 3DGS, SuGaR, GOF, Wild Gaussian, HoGS, and City GS using a unified pipeline and common ground truth, reporting PSNR, SSIM, and LPIPS along with qualitative comparisons and failure analyses (e.g., out-of-memory). Results highlight significant scalability challenges on large, highly detailed cultural scenes, while City GS shows superior robustness. We release raw images, COLMAP assets, and textured 3D models to support future research in rendering, reconstruction, and AR/VR applications.

Introduction
Large-scale 3D mapping of cultural sites matters for preservation, navigation, and XR. Prior benchmarks are often small, synthetic, or domain-limited, which masks scalability issues. CULTURE3D focuses on realism, diversity, and resolution to expose practical bottlenecks and guide method design.
Contributions
- A large, culturally diverse dataset with consistent packaging for research
- An end-to-end pipeline releasing images, camera poses, point clouds, and textured meshes
- A unified benchmark of Gaussian-based methods with quantitative and qualitative analyses
Dataset (collection & assets)
We captured each site using a combination of structured aerial and handheld passes to ensure complete geometry and high-quality textures. Viewpoints and lighting were varied, and flight routes were planned in advance. We release the full asset pack (raw images, camera intrinsics/extrinsics, sparse and dense point clouds, and textured meshes) organized in standardized scene folders for straightforward loading. We paid particular attention to ornamented surfaces and thin structures to preserve fine detail, and, to support reproducibility, we maintained consistent naming and directory layout throughout.
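As a rough illustration of how a released scene folder might be consumed, the sketch below lists raw images and parses camera poses from a COLMAP text model. The folder layout shown (`images/`, `sparse/0/images.txt`) is an assumption for illustration, not the documented release structure.

```python
from pathlib import Path
import numpy as np

def load_colmap_poses(images_txt: Path) -> dict:
    """Parse world-to-camera poses from a COLMAP text model (images.txt).

    Each registered image occupies two lines: the first holds
    IMAGE_ID QW QX QY QZ TX TY TZ CAMERA_ID NAME, the second its 2D points.
    """
    data_lines = [l for l in images_txt.read_text().splitlines()
                  if not l.startswith("#")]
    poses = {}
    for header in data_lines[0::2]:          # every other line is a pose header
        fields = header.split()
        if len(fields) < 10:                 # skip stray blank lines
            continue
        name = fields[9]
        poses[name] = {
            "quat_wxyz": np.array(list(map(float, fields[1:5]))),
            "t_xyz": np.array(list(map(float, fields[5:8]))),
        }
    return poses

# Hypothetical scene layout: <scene>/images/*.jpg and <scene>/sparse/0/images.txt
scene = Path("CULTURE3D/petra")
image_paths = sorted((scene / "images").glob("*.[jJ][pP][gG]"))
poses = load_colmap_poses(scene / "sparse" / "0" / "images.txt")
print(f"{len(image_paths)} raw images, {len(poses)} registered camera poses")
```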
[FIG-D1] Scene Montage: representative cultural/urban environments.

Representative CULTURE3D scenes spanning indoor and outdoor heritage sites: Petra (a), Italy Cathedral / Pisa complex (b), Forbidden City main palace (c), Pyramids and Sphinx (d), National Gallery interiors (e.1–e.2), Longmen Grotto (f), the Louvre (g), Buckingham Palace (h), Cambridge Graduation Square (i.1–i.2), Trafalgar Square (j), and Stonehenge (k). Together, these panels illustrate the dataset’s breadth and high-detail reconstructions intended for VR, architectural modeling, and heritage research.
[FIG-D2] Dataset Pipeline: acquisition → SfM/MVS → assets packaging.
Overview of the CULTURE3D pipeline: starting from raw image acquisition (a.1), we perform feature matching and sparse SfM (a.2), then dense reconstruction and mesh generation (a.3). These assets feed benchmark renderers (b; e.g., 3DGS, RealityCapture, GOF) to produce a 3D model for evaluation (c). The bottom panels illustrate model generation with map alignment (c.1) and camera pose estimation with bundle adjustment (c.2), showing recovered viewpoints and a geo-referenced top-down view.
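For readers who want to reproduce the SfM/MVS stages of this pipeline on their own captures, here is a minimal sketch that drives the standard COLMAP command-line tools from Python. The paths, matcher choice, and mesher settings are illustrative assumptions; the paper's exact configuration is not specified here.

```python
import subprocess
from pathlib import Path

def run(*args):
    """Run one COLMAP stage and fail loudly if it errors."""
    subprocess.run(list(map(str, args)), check=True)

# Illustrative paths; the released scene folders may be laid out differently.
scene = Path("CULTURE3D/petra")
images, db = scene / "images", scene / "database.db"
sparse, dense = scene / "sparse", scene / "dense"
sparse.mkdir(exist_ok=True); dense.mkdir(exist_ok=True)

# (a.2) feature extraction, matching, and sparse SfM with bundle adjustment
run("colmap", "feature_extractor", "--database_path", db, "--image_path", images)
run("colmap", "exhaustive_matcher", "--database_path", db)  # for very large scenes, prefer sequential or vocab-tree matching
run("colmap", "mapper", "--database_path", db, "--image_path", images,
    "--output_path", sparse)

# (a.3) dense MVS reconstruction and mesh generation
run("colmap", "image_undistorter", "--image_path", images,
    "--input_path", sparse / "0", "--output_path", dense, "--output_type", "COLMAP")
run("colmap", "patch_match_stereo", "--workspace_path", dense)   # requires a CUDA build
run("colmap", "stereo_fusion", "--workspace_path", dense,
    "--output_path", dense / "fused.ply")
run("colmap", "poisson_mesher", "--input_path", dense / "fused.ply",
    "--output_path", dense / "mesh.ply")
```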

Compact analysis
Using shared inputs/poses and PSNR↑/SSIM↑/LPIPS↓, the table shows clear scale effects.
- City GS is the most stable (no FAIL/OOM) and usually tops SSIM and often PSNR (Cambridge, Petra; strong at Trinity), though it tends to have higher (worse) LPIPS and moderate runtimes (≈6–12 h).
- GOF delivers the best perceptual quality when it runs (e.g., Trinity LPIPS = 0.0260; Gallery PSNR = 18.36) but is heavy (≈10–23 h) and fails on Pyramid.
- 3DGS is by far the fastest (<1 h) and achieves competitive LPIPS on some scenes (e.g., Petra) and the best PSNR on Pyramid, but its SSIM is lower overall on large, complex sites.
- SuGaR, Wild Gaussian, and HoGS frequently hit OOM/FAIL, underscoring the memory pressure from CULTURE3D's scale and fine ornamentation.
Takeaway: for broad deployment across large cultural scenes, prefer memory-robust pipelines (e.g., City GS or tiled/partitioned variants); for mid-size scenes prioritizing perceptual crispness, GOF and 3DGS are attractive when resources and stability allow.
[TAB-M1] Quantitative results on CULTURE3D across representative scenes (SSIM↑/PSNR↑/LPIPS↓; time in hours). “OOM”/“FAIL” denote memory/runtime failures.
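To make the evaluation protocol concrete, here is a minimal sketch of how PSNR, SSIM, and LPIPS can be computed on matched render/ground-truth pairs using scikit-image and the `lpips` package; the benchmark's exact evaluation code, crops, and settings are assumptions, not taken from the paper.

```python
import numpy as np
import torch
import lpips                                   # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")             # perceptual metric (lower is better)

def evaluate_pair(render: np.ndarray, gt: np.ndarray) -> dict:
    """Compare one rendered image against ground truth.

    Both inputs are HxWx3 float arrays scaled to [0, 1].
    """
    psnr = peak_signal_noise_ratio(gt, render, data_range=1.0)
    ssim = structural_similarity(gt, render, channel_axis=-1, data_range=1.0)
    # lpips expects NCHW tensors scaled to [-1, 1]
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    with torch.no_grad():
        lp = lpips_fn(to_t(render), to_t(gt)).item()
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}

# Toy usage with random images; replace with matched render / ground-truth pairs.
render = np.random.rand(256, 256, 3).astype(np.float32)
gt = np.clip(render + 0.01 * np.random.randn(*render.shape), 0, 1).astype(np.float32)
print(evaluate_pair(render, gt))
```

Scene-level numbers would then be obtained by averaging these per-image scores over each scene's held-out test views.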

Limitations
The dataset covers primarily static scenes; photometric consistency degrades under extreme lighting; and privacy/permit constraints limited capture at a subset of locations. Future releases can expand dynamic elements, weather/time-of-day diversity, and semantic labels.
Conclusion
CULTURE3D provides a realistic foundation for evaluating scalable Gaussian-based rendering. The benchmark highlights current limits and suggests directions: memory-aware splat scheduling, hybrid neural-geometric representations, and better handling of fine cultural details for preservation-grade fidelity.