We introduce the first data-driven multi-view 3D point tracker, designed to
track arbitrary points in dynamic scenes using multiple camera views. Unlike
existing monocular trackers, which struggle with depth ambiguities and
occlusion, or prior multi-camera methods that require over 20 cameras and
tedious per-sequence optimization, our feed-forward model directly predicts 3D
correspondences using a practical number of cameras (e.g., four), enabling
robust and accurate online tracking. Given known camera poses and either
sensor-based or estimated multi-view depth, our tracker fuses multi-view
features into a unified point cloud and applies k-nearest-neighbors correlation
alongside a transformer-based update to reliably estimate long-range 3D
correspondences, even under occlusion. We train on 5K synthetic multi-view
Kubric sequences and evaluate on two real-world benchmarks: Panoptic Studio and
DexYCB, achieving median trajectory errors of 3.1 cm and 2.0 cm, respectively.
Our method generalizes well to diverse camera setups of 1-8 views with varying
vantage points and video lengths of 24-150 frames. By releasing our tracker
alongside training and evaluation datasets, we aim to set a new standard for
multi-view 3D tracking research and provide a practical tool for real-world
applications. Project page available at https://ethz-vlg.github.io/mvtracker.