On-the-fly 3D reconstruction from monocular image sequences is a
long-standing challenge in computer vision, critical for applications such as
real-to-sim, AR/VR, and robotics. Existing methods face a major tradeoff:
per-scene optimization yields high fidelity but is computationally expensive,
whereas feed-forward foundation models enable real-time inference but struggle
with accuracy and robustness. In this work, we propose ARTDECO, a unified
framework that combines the efficiency of feed-forward models with the
reliability of SLAM-based pipelines. ARTDECO uses 3D foundation models for pose
estimation and point prediction, coupled with a Gaussian decoder that
transforms multi-scale features into structured 3D Gaussians. To sustain both
fidelity and efficiency at scale, we design a hierarchical Gaussian
representation with an LoD-aware rendering strategy, which improves rendering
fidelity while reducing redundancy. Experiments on eight diverse indoor and
outdoor benchmarks show that ARTDECO delivers interactive performance
comparable to SLAM, robustness similar to feed-forward systems, and
reconstruction quality close to per-scene optimization, providing a practical
path toward on-the-fly digitization of real-world environments with both
accurate geometry and high visual fidelity. Explore more demos on our project
page: https://city-super.github.io/artdeco/.