Swin3D: A Pretrained Transformer-based Backbone for Indoor Scene Understanding

Yu-Qi Yang 1,2    Yu-Xiao Guo 2    Jian-Yu Xiong 2    Yang Liu 2
Hao Pan 2    Peng-Shuai Wang 3    Xin Tong 2    Baining Guo 2

1 Tsinghua University    2 Microsoft Research Asia    3 Peking University


Pretrained backbones with fine-tuning have been widely adopted in 2D vision and natural language processing tasks and have demonstrated significant advantages over task-specific networks. In this paper, we present a pretrained 3D backbone, named Swin3D, which for the first time outperforms all state-of-the-art methods on downstream 3D indoor scene understanding tasks. Our backbone network is based on a 3D Swin transformer and is carefully designed to conduct self-attention efficiently on sparse voxels with linear memory complexity and to capture the irregularity of point signals via a generalized contextual relative positional embedding. Based on this backbone design, we pretrained a large Swin3D model on the synthetic Structured3D dataset, which is 10 times larger than the ScanNet dataset, and fine-tuned the pretrained model on various downstream real-world indoor scene understanding tasks. The results demonstrate that our model pretrained on the synthetic dataset not only generalizes well to both downstream segmentation and detection on real 3D point datasets, but also surpasses the state-of-the-art methods on downstream tasks after fine-tuning, with +2.3 mIoU and +2.2 mIoU on S3DIS Area 5 and 6-fold semantic segmentation, respectively, +2.1 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. Our method demonstrates the great potential of pretrained 3D backbones with fine-tuning for 3D understanding tasks.
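To make the idea of window-restricted self-attention on sparse voxels more concrete, the following is a minimal sketch of how occupied voxels could be grouped into regular and shifted 3D windows, in the general style of a Swin transformer. It is an illustration under stated assumptions, not the Swin3D implementation: the function names, the `window_size` and `shift` parameters, and the toy coordinates are all hypothetical, and the sketch covers neither the paper's memory-efficient attention nor its contextual relative positional embedding.

```python
from collections import defaultdict


def partition_into_windows(voxel_coords, window_size, shift=0):
    """Group occupied voxel indices by the cubic window that contains them.

    voxel_coords: list of (x, y, z) integer coordinates of occupied voxels.
    window_size:  edge length of a window in voxels (hypothetical name).
    shift:        offset added to every axis before bucketing; a shifted
                  layout would typically use window_size // 2.
    """
    windows = defaultdict(list)
    for index, (x, y, z) in enumerate(voxel_coords):
        key = (
            (x + shift) // window_size,
            (y + shift) // window_size,
            (z + shift) // window_size,
        )
        windows[key].append(index)
    return dict(windows)


def attention_pair_count(windows):
    """Count query-key pairs when attention is restricted to each window."""
    return sum(len(members) ** 2 for members in windows.values())


# Toy scene: six occupied voxels in an otherwise empty 8 x 8 x 8 grid.
voxels = [(0, 0, 0), (1, 0, 0), (3, 3, 3), (4, 0, 0), (5, 1, 0), (7, 7, 7)]

regular = partition_into_windows(voxels, window_size=4)
shifted = partition_into_windows(voxels, window_size=4, shift=2)

print(regular)                        # three non-empty windows
print(attention_pair_count(regular))  # 14 pairs, versus 36 for global attention
print(attention_pair_count(shifted))  # 10 pairs
```

Only non-empty windows are stored, so the cost follows the number of occupied voxels rather than the full grid volume. Alternating the regular and shifted layouts lets voxels that sit in different windows in one layer attend to each other in the next.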

Results and models

ScanNet Segmentation

Model       Pretrained    mIoU (val)    mIoU (test)
Swin3D-S                  75.2          -
Swin3D-S                  75.6(76.8)    -
Swin3D-L                  76.2(77.5)    77.9

S3DIS Segmentation

Model       Pretrained    Area 5 mIoU    6-fold mIoU
Swin3D-S                  72.5           76.9
Swin3D-S                  73.0           78.2
Swin3D-L                  74.5           79.8
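For readers unfamiliar with the segmentation metric reported above, here is a minimal sketch of mean intersection-over-union (mIoU) computed from per-point predicted and ground-truth class labels. It assumes integer labels in the range 0 to `num_classes - 1` and skips classes that never occur. The official ScanNet and S3DIS evaluation protocols add details such as ignored labels and dataset-wide accumulation that are not reproduced here, and the function name and toy labels are illustrative only.

```python
def mean_iou(pred_labels, true_labels, num_classes):
    """Return mIoU in percent over the classes present in the data."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred, true in zip(pred_labels, true_labels):
        if pred == true:
            # A correct point counts once toward both intersection and union.
            intersection[true] += 1
            union[true] += 1
        else:
            # A wrong point enlarges the union of both classes involved.
            union[pred] += 1
            union[true] += 1
    ious = [intersection[c] / union[c] for c in range(num_classes) if union[c] > 0]
    if not ious:
        raise ValueError("no labelled points to evaluate")
    return 100.0 * sum(ious) / len(ious)


# Toy example with six points and three classes.
pred = [0, 0, 1, 1, 2, 2]
true = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, true, num_classes=3)
print(round(score, 1))  # per-class IoUs are 1/3, 2/3 and 1/2, so this prints 50.0
```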

ScanNet 3D Detection

Model                 Pretrained    mAP@0.25    mAP@0.50
Swin3D-S+FCAF3D                     74.2        59.5
Swin3D-L+FCAF3D                     74.2        58.6
Swin3D-S+CAGroup3D                  76.4        62.7
Swin3D-L+CAGroup3D                  76.4        63.2

S3DIS 3D Detection

Model              Pretrained    mAP@0.25    mAP@0.50
Swin3D-S+FCAF3D                  69.9        50.2
Swin3D-L+FCAF3D                  72.1        54.0
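The detection tables report mAP at 3D IoU thresholds of 0.25 and 0.50: a predicted box counts as a match only if its IoU with a ground-truth box reaches the threshold. The sketch below shows that IoU test for a single pair of boxes, assuming axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)`. The box format, the function name, and the example boxes are assumptions for illustration, and the full mAP computation (score ranking and precision-recall integration) is omitted.

```python
def box_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    intersection = 1.0
    for axis in range(3):
        low = max(box_a[axis], box_b[axis])
        high = min(box_a[axis + 3], box_b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis
        intersection *= high - low

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return intersection / (volume(box_a) + volume(box_b) - intersection)


prediction = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
close_truth = (0.5, 0.0, 0.0, 2.5, 2.0, 2.0)  # shifted by 0.5 along x
loose_truth = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)  # shifted by 1.0 along x

iou_close = box_iou_3d(prediction, close_truth)  # 6 / 10 = 0.6
iou_loose = box_iou_3d(prediction, loose_truth)  # 4 / 12, about 0.33

for threshold in (0.25, 0.50):
    print(threshold, iou_close >= threshold, iou_loose >= threshold)
```

The second box passes at 0.25 but fails at 0.50, which is why mAP@0.50 is the stricter of the two columns and is consistently lower in the tables.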

Visual Results

Paper [arXiv]

Code [GitHub]