VGGSfM: Visual Geometry Grounded Deep Structure From Motion

Jianyuan Wang1,2,     Nikita Karaev1,2,     Christian Rupprecht1,     David Novotny2
1Visual Geometry Group, University of Oxford, 2Meta AI

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024, Highlight

Visualization of VGGSfM reconstructions from multiple viewpoints.

Abstract

Structure-from-motion (SfM) is a long-standing problem in computer vision that aims to reconstruct the camera poses and 3D structure of a scene from a set of unconstrained 2D images. Classical frameworks solve this problem incrementally by detecting and matching keypoints, registering images, triangulating 3D points, and conducting bundle adjustment. Recent research has predominantly focused on harnessing deep learning to enhance specific elements of this pipeline (e.g., keypoint matching), but still builds on the original, non-differentiable design. Instead, we propose VGGSfM, a new deep SfM pipeline in which each component is fully differentiable and can therefore be trained end-to-end. To this end, we introduce new mechanisms and simplifications. First, we build on recent advances in deep 2D point tracking to extract reliable pixel-accurate tracks, which eliminates the need for chaining pairwise matches. Furthermore, we recover all cameras simultaneously from image and track features instead of registering cameras one by one. Finally, we optimise the cameras and triangulate 3D points via a differentiable bundle adjustment layer. We attain state-of-the-art performance on three popular datasets: CO3D, IMC Phototourism, and ETH3D.
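
To make the last step concrete, below is a deliberately simplified sketch of what a differentiable bundle adjustment looks like in PyTorch: camera translations and 3D points are refined by minimising the reprojection error of the tracks with plain autograd, so every operation stays differentiable. All names and shapes here are illustrative, not the paper's actual interface; a production BA would use a second-order solver and also refine rotations and intrinsics, which this toy keeps fixed for brevity.

import torch

def reproject(points3d, rot, trans, focal=1.0):
    # points3d: (P, 3) world points; rot: (N, 3, 3) rotations; trans: (N, 3).
    # Pinhole projection of every point into every camera, shape (N, P, 2).
    cam = torch.einsum("nij,pj->npi", rot, points3d) + trans[:, None, :]
    return focal * cam[..., :2] / cam[..., 2:3].clamp(min=1e-6)

def ba_refine(points3d, rot, trans, tracks2d, iters=200, lr=1e-2):
    # tracks2d: (N, P, 2) predicted 2D tracks, used as reprojection targets.
    points3d = points3d.clone().requires_grad_(True)
    trans = trans.clone().requires_grad_(True)
    opt = torch.optim.Adam([points3d, trans], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = ((reproject(points3d, rot, trans) - tracks2d) ** 2).mean()
        loss.backward()
        opt.step()
    return points3d.detach(), trans.detach()

Because every operation above is differentiable, a training loss on the refined outputs could also update whatever network produced tracks2d, provided the refinement is unrolled or implicitly differentiated rather than detached as in this standalone sketch.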


Method

Our method extracts 2D tracks from input images, reconstructs cameras using image and track features, initializes a point cloud based on these tracks and camera parameters, and applies a bundle adjustment layer to refine the reconstruction. The whole framework is fully differentiable.
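
For orientation, the data flow can be summarised in a few stub functions. Everything below is a placeholder with invented names and toy outputs, intended only to show the order of the four stages and that each stage passes tensors (and hence gradients) to the next; it does not match the released code.

import torch

def track_points(images, queries):
    # Stage 1 (stub): a deep tracker would map query points to pixel-accurate
    # 2D tracks across all N frames; here we just repeat the queries.
    N, P = images.shape[0], queries.shape[0]
    return queries[None].expand(N, P, 2).clone()

def predict_cameras(images, tracks):
    # Stage 2 (stub): all cameras are recovered jointly from image and track
    # features, rather than registered one by one.
    N = images.shape[0]
    return torch.eye(3).expand(N, 3, 3).clone(), torch.zeros(N, 3)

def triangulate(tracks, rot, trans):
    # Stage 3 (stub): lift each track to an initial 3D point.
    return torch.zeros(tracks.shape[1], 3)

def bundle_adjust(points3d, rot, trans, tracks):
    # Stage 4 (stub): jointly refine points and cameras (see the BA sketch above).
    return points3d, rot, trans

def vggsfm_sketch(images, queries):
    tracks = track_points(images, queries)
    rot, trans = predict_cameras(images, tracks)
    points3d = triangulate(tracks, rot, trans)
    return bundle_adjust(points3d, rot, trans, tracks)

# Example call on dummy data:
# vggsfm_sketch(torch.zeros(4, 3, 224, 224), torch.rand(128, 2))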


In-the-wild Application

Reconstructions of in-the-wild photos with VGGSfM, showing estimated point clouds (blue) and cameras (orange).


Qualitative Visualization

Camera and point-cloud reconstructions of VGGSfM on CO3D (left) and IMC Phototourism (right).


Tracking

In each row, the leftmost frame shows the query image with query points (crosses); the predicted track points (dots) are shown in the frames to the right. The top-right part also highlights our track-point confidence predictions, illustrated as ellipses whose extent is proportional to the predicted variance.
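
If one wanted to reproduce this kind of visualization, a simple approach is to treat each predicted variance as an axis-aligned 2D covariance and draw an ellipse whose half-axes scale with the standard deviations. The snippet below is only an illustration of that idea with matplotlib, not the project's plotting code; it assumes per-axis variances (a full 2x2 covariance would additionally set the ellipse angle from its eigenvectors).

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def draw_tracks_with_confidence(ax, points, variances, scale=2.0):
    # points: (P, 2) predicted track points; variances: (P, 2) per-axis variance.
    # Each ellipse spans `scale` standard deviations on either side of the point.
    std = np.sqrt(variances)
    ax.scatter(points[:, 0], points[:, 1], s=10, c="tab:orange", zorder=2)
    for (x, y), (sx, sy) in zip(points, std):
        ax.add_patch(Ellipse((x, y), width=scale * 2 * sx, height=scale * 2 * sy,
                             fill=False, color="tab:orange", alpha=0.6))

# Example usage on random data:
fig, ax = plt.subplots()
pts = np.random.rand(20, 2) * 100
var = np.random.rand(20, 2) * 20
draw_tracks_with_confidence(ax, pts, var)
ax.set_aspect("equal")
plt.show()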


BibTeX

@InProceedings{wang2024vggsfm,
  author    = {Jianyuan Wang and Nikita Karaev and Christian Rupprecht and David Novotny},
  title     = {VGGSfM: Visual Geometry Grounded Deep Structure From Motion},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024}
}

Acknowledgements

Jianyuan Wang is supported by Facebook Research. We are deeply grateful for the insightful discussions and invaluable support provided by Vadim Tschernezki, Stanislaw Szymanowicz, Luke Melas-Kyriazi, Shangzhe Wu, Junyu Xie, Minghao Chen, Chuhan Zhang, Tengda Han, Chuanxia Zheng, Yash Bhalgat, Laurynas Karazija, Yuchao Dai, Yiran Zhong, Hongdong Li, Benjamin Graham, Junlin Han, Xingchen Liu, Ang Cao, Roman Shapovalov, João Henriques, Andrea Vedaldi, and Andrew Zisserman.