Dropping the D: RGB-D SLAM Without the Depth Sensor

Mert Kiray¹˒²˒³, Alican Karaomer¹, Benjamin Busam¹˒²˒³
¹Technical University of Munich (TUM),   ²3dwe.ai,   ³Munich Center for Machine Learning
Figure: Conceptual comparison between traditional RGB-D SLAM and our proposed DropD-SLAM. Conventional pipelines require active depth sensing for scale and robustness, while DropD-SLAM achieves comparable performance from a single RGB input by leveraging pretrained modules for depth, features, and instance segmentation.

Pipeline Overview

Figure: DropD-SLAM pipeline overview.

Abstract

We present DropD-SLAM, a real-time monocular SLAM system that achieves RGB-D-level accuracy without relying on depth sensors. The system replaces active depth input with three pretrained vision modules: a monocular metric depth estimator, a learned keypoint detector, and an instance segmentation network. Dynamic objects are suppressed using dilated instance masks, while static keypoints are assigned predicted depth values and backprojected into 3D to form metrically scaled features. These are processed by an unmodified RGB-D SLAM back end for tracking and mapping. On the TUM RGB-D benchmark, DropD-SLAM attains 7.4 cm mean ATE on static sequences and 1.8 cm on dynamic sequences, matching or surpassing state-of-the-art RGB-D methods while operating at 22 FPS on a single GPU.
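To make the front end concrete, below is a minimal sketch (not the authors' released code) of the steps the abstract describes: suppress dynamic objects with dilated instance masks, keep the remaining keypoints, assign each one a predicted metric depth, and backproject it into 3D with the camera intrinsics. The helpers `predict_depth`, `detect_keypoints`, and `segment_instances` stand in for the pretrained depth, keypoint, and segmentation modules; their names, signatures, the dynamic-class list, and the dilation size are assumptions for illustration only.

```python
import numpy as np
import cv2

DYNAMIC_CLASSES = {"person"}                      # assumed set of classes treated as dynamic
DILATE_KERNEL = np.ones((15, 15), np.uint8)       # assumed mask dilation size in pixels


def build_dynamic_mask(instances, image_shape):
    """Union of dilated instance masks belonging to dynamic classes."""
    mask = np.zeros(image_shape[:2], np.uint8)
    for inst in instances:                        # each inst: {"class_name": str, "mask": HxW bool}
        if inst["class_name"] in DYNAMIC_CLASSES:
            mask |= inst["mask"].astype(np.uint8)
    return cv2.dilate(mask, DILATE_KERNEL) > 0


def backproject_static_keypoints(rgb, K, predict_depth, detect_keypoints, segment_instances):
    """Return pixel coords and metrically scaled 3D points for static keypoints."""
    depth = predict_depth(rgb)                    # HxW metric depth map (meters), assumed interface
    keypoints = detect_keypoints(rgb)             # Nx2 array of (u, v) pixel coords, assumed interface
    dynamic = build_dynamic_mask(segment_instances(rgb), rgb.shape)

    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    pixels, points = [], []
    for u, v in keypoints.astype(int):
        if dynamic[v, u]:                         # drop keypoints falling on dilated dynamic masks
            continue
        z = float(depth[v, u])
        if z <= 0:                                # skip invalid depth predictions
            continue
        points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
        pixels.append((u, v))
    return np.array(pixels), np.array(points)
```

The resulting 3D keypoints can then be handed to a standard RGB-D SLAM back end exactly as if they had come from an active depth sensor, which is the substitution the pipeline relies on.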