
# 3D Visual Grounding with Transformers

## Introduction

3D visual grounding is the task of localizing a target object in a 3D scene given a natural language description. This work develops a transformer-based architecture that predicts a bounding box around the object referred to by the query sentence.

For additional details, please see our paper:
"3D Visual Grounding with Transformers"
by Stefan Frisch and Florian Stilz from the Technical University of Munich.

## Setup + Dataset

For the setup and dataset preparation, please check the ScanRefer GitHub page.

## Architecture

In our architecture we replace the VoteNet detection backbone with 3DETR-m and add a vanilla transformer encoder to the fusion module.
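As a rough illustration of how such a fusion module could look, the PyTorch sketch below combines the detector's object proposal features with a GRU language embedding and refines them with a vanilla transformer encoder before scoring each proposal. It is only a sketch: the module names, feature dimensions, layer counts, and the additive fusion are our own placeholder assumptions, not the exact design used in this repository.

```python
import torch
import torch.nn as nn

class FusionModuleSketch(nn.Module):
    """Illustrative fusion module (assumed structure, not the repo's code):
    object proposal features are conditioned on the language embedding and
    refined by a vanilla transformer encoder before per-proposal scoring."""

    def __init__(self, proposal_dim=256, lang_dim=256, hidden_dim=256,
                 num_layers=2, num_heads=4):
        super().__init__()
        self.proposal_proj = nn.Linear(proposal_dim, hidden_dim)
        self.lang_proj = nn.Linear(lang_dim, hidden_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.confidence_head = nn.Linear(hidden_dim, 1)

    def forward(self, proposal_feats, lang_feat):
        # proposal_feats: (B, num_proposals, proposal_dim) from the detector
        # lang_feat:      (B, lang_dim), e.g. the last GRU hidden state
        fused = self.proposal_proj(proposal_feats) \
            + self.lang_proj(lang_feat).unsqueeze(1)
        fused = self.encoder(fused)                     # self-attention over proposals
        return self.confidence_head(fused).squeeze(-1)  # (B, num_proposals) scores


# Toy usage: 8 scenes, 128 object proposals each
scores = FusionModuleSketch()(torch.randn(8, 128, 256), torch.randn(8, 256))
print(scores.shape)  # torch.Size([8, 128])
```

With `batch_first=True` the encoder works on `(batch, proposals, features)` tensors, so the language-conditioned proposals attend to each other before a grounding score is predicted for each one.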

## Results

To reproduce our results we provide the following commands along with the corresponding numbers. The pretrained models are in the `outputs` folder. We implemented a chunking mechanism that significantly reduces training time compared to the original ScanRefer code (a sketch of the idea follows the table below). Training the baseline model takes around 4 hours on a current GPU (NVIDIA Tesla T4).

| Name | Command | Overall Acc@0.25IoU | Overall Acc@0.5IoU | Comments |
|------|---------|---------------------|--------------------|----------|
| ScanRefer (Baseline) | `python scripts/train.py --use_color --lr 1e-3 --batch_size 14` | 37.05 | 23.93 | xyz + color + height |
| ScanRefer with pretrained VoteNet (optimized Baseline) | `python scripts/train.py --use_color --use_chunking --use_pretrained "pretrained_VoteNet" --lr 1e-3 --batch_size 14` | 37.11 | 25.21 | xyz + color + height |
| Ours (pretrained 3DETR-m + GRU + vTransformer) | `python scripts/train.py --use_color --use_chunking --detection_module 3detr --match_module transformer --use_pretrained "pretrained_3DETR" --no_detection` | 37.08 | 26.56 | xyz + color + height |
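As a rough intuition for the chunking mentioned above, the sketch below groups several descriptions of the same scene into one training sample, so each point cloud only has to be processed once per chunk rather than once per description. The function name, sample fields, and chunk size are illustrative assumptions; the project's actual mechanism is the one enabled by the `--use_chunking` flag in the commands above.

```python
import random
from collections import defaultdict

def build_chunks(samples, chunk_size=4):
    """Group (scene_id, description) samples into chunks that share a scene
    (illustrative sketch of the chunking idea, not the repo's implementation)."""
    by_scene = defaultdict(list)
    for sample in samples:
        by_scene[sample["scene_id"]].append(sample)

    chunks = []
    for scene_id, scene_samples in by_scene.items():
        random.shuffle(scene_samples)
        for i in range(0, len(scene_samples), chunk_size):
            chunks.append({
                "scene_id": scene_id,
                "descriptions": [s["description"]
                                 for s in scene_samples[i:i + chunk_size]],
            })
    return chunks


# Toy usage: three descriptions of scene0000_00 end up in a single chunk
samples = [
    {"scene_id": "scene0000_00", "description": "the brown chair next to the desk"},
    {"scene_id": "scene0000_00", "description": "the trash can by the door"},
    {"scene_id": "scene0000_00", "description": "the monitor on the desk"},
    {"scene_id": "scene0001_00", "description": "the white cabinet"},
]
print(len(build_chunks(samples, chunk_size=4)))  # 2
```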