
# 3D Visual Grounding with Transformers

## Introduction

3D visual grounding is the task of localizing a target object in a 3D scene given a natural language description. This work develops a transformer-based architecture that predicts a bounding box around the object referred to by the query sentence.

For additional details, please see our paper:
"3D Visual Grounding with Transformers"
by Stefan Frisch and Florian Stilz from the Technical University of Munich.

## Setup + Dataset

For the setup and dataset preparation, please check the ScanRefer GitHub page.

## Architecture

In our architecture we replace the VoteNet detection backbone with 3DETR-m and add a vanilla transformer encoder to the fusion module.
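As a rough illustration of how such a fusion module could look, the PyTorch sketch below combines the detector's object proposal features with a GRU language embedding and refines them with a vanilla transformer encoder before scoring each proposal. It is only a sketch: the module names, feature dimensions, layer counts, and the additive fusion are our own placeholder assumptions, not the exact design used in this repository.

```python
import torch
import torch.nn as nn

class FusionModuleSketch(nn.Module):
    """Illustrative fusion module (assumed structure, not the repo's code):
    object proposal features are conditioned on the language embedding and
    refined by a vanilla transformer encoder before per-proposal scoring."""

    def __init__(self, proposal_dim=256, lang_dim=256, hidden_dim=256,
                 num_layers=2, num_heads=4):
        super().__init__()
        self.proposal_proj = nn.Linear(proposal_dim, hidden_dim)
        self.lang_proj = nn.Linear(lang_dim, hidden_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.confidence_head = nn.Linear(hidden_dim, 1)

    def forward(self, proposal_feats, lang_feat):
        # proposal_feats: (B, num_proposals, proposal_dim) from the detector
        # lang_feat:      (B, lang_dim), e.g. the last GRU hidden state
        fused = self.proposal_proj(proposal_feats) \
            + self.lang_proj(lang_feat).unsqueeze(1)
        fused = self.encoder(fused)                     # self-attention over proposals
        return self.confidence_head(fused).squeeze(-1)  # (B, num_proposals) scores


# Toy usage: 8 scenes, 128 object proposals each
scores = FusionModuleSketch()(torch.randn(8, 128, 256), torch.randn(8, 256))
print(scores.shape)  # torch.Size([8, 128])
```

With `batch_first=True` the encoder works on `(batch, proposals, features)` tensors, so the language-conditioned proposals attend to each other before a grounding score is predicted for each one.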

## Results

To reproduce our results we provide the following commands along with the corresponding numbers. The pretrained models are in the `outputs` folder. We implemented a chunking mechanism that significantly reduces training time compared to the original ScanRefer code (a sketch of the idea follows the table below). Training the baseline model takes around 4 hours on a current GPU (NVIDIA Tesla T4).

| Name | Command | Overall Acc@0.25IoU | Overall Acc@0.5IoU | Comments |
|------|---------|---------------------|--------------------|----------|
| ScanRefer (Baseline) | `python scripts/train.py --use_color --lr 1e-3 --batch_size 14` | 37.05 | 23.93 | xyz + color + height |
| ScanRefer with pretrained VoteNet (optimized Baseline) | `python scripts/train.py --use_color --use_chunking --use_pretrained "pretrained_VoteNet" --lr 1e-3 --batch_size 14` | 37.11 | 25.21 | xyz + color + height |
| Ours (pretrained 3DETR-m + GRU + vTransformer) | `python scripts/train.py --use_color --use_chunking --detection_module 3detr --match_module transformer --use_pretrained "pretrained_3DETR" --no_detection` | 37.08 | 26.56 | xyz + color + height |
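As a rough intuition for the chunking mentioned above, the sketch below groups several descriptions of the same scene into one training sample, so each point cloud only has to be processed once per chunk rather than once per description. The function name, sample fields, and chunk size are illustrative assumptions; the project's actual mechanism is the one enabled by the `--use_chunking` flag in the commands above.

```python
import random
from collections import defaultdict

def build_chunks(samples, chunk_size=4):
    """Group (scene_id, description) samples into chunks that share a scene
    (illustrative sketch of the chunking idea, not the repo's implementation)."""
    by_scene = defaultdict(list)
    for sample in samples:
        by_scene[sample["scene_id"]].append(sample)

    chunks = []
    for scene_id, scene_samples in by_scene.items():
        random.shuffle(scene_samples)
        for i in range(0, len(scene_samples), chunk_size):
            chunks.append({
                "scene_id": scene_id,
                "descriptions": [s["description"]
                                 for s in scene_samples[i:i + chunk_size]],
            })
    return chunks


# Toy usage: three descriptions of scene0000_00 end up in a single chunk
samples = [
    {"scene_id": "scene0000_00", "description": "the brown chair next to the desk"},
    {"scene_id": "scene0000_00", "description": "the trash can by the door"},
    {"scene_id": "scene0000_00", "description": "the monitor on the desk"},
    {"scene_id": "scene0001_00", "description": "the white cabinet"},
]
print(len(build_chunks(samples, chunk_size=4)))  # 2
```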