3D visual grounding is the task of localizing a target object in a 3D scene given a natural language description. This work develops a transformer-based architecture that predicts a bounding box around the object referred to by such a description.
For additional details, please see our paper:
"3D Visual Grounding with Transformers"
by Stefan Frisch and Florian Stilz
from the Technical University of Munich.
For setup and dataset preparation, please check the ScanRefer GitHub page.
In our architecture we replaced VoteNet with 3DETR-m and added a vanilla transformer encoder to the fusion module.
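As a rough, hypothetical illustration of that fusion step (not the actual implementation; module names, dimensions, and the additive fusion strategy below are assumptions), a vanilla transformer encoder over language-conditioned object proposals could look like this:

```python
# Hypothetical sketch of a fusion module that runs a vanilla transformer encoder
# over fused language/proposal features; names and dimensions are illustrative only.
import torch
import torch.nn as nn

class TransformerFusionSketch(nn.Module):
    def __init__(self, feat_dim=256, num_heads=4, num_layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.score_head = nn.Linear(feat_dim, 1)  # per-proposal grounding score

    def forward(self, proposal_feats, lang_feat):
        # proposal_feats: (B, num_proposals, feat_dim) from the 3D detector (e.g. 3DETR-m)
        # lang_feat:      (B, feat_dim) sentence embedding (e.g. from a GRU)
        fused = proposal_feats + lang_feat.unsqueeze(1)  # simple additive fusion
        fused = self.encoder(fused)                      # self-attention over proposals
        return self.score_head(fused).squeeze(-1)        # (B, num_proposals) scores

# Example usage with random tensors:
# scores = TransformerFusionSketch()(torch.randn(2, 256, 256), torch.randn(2, 256))
```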
To reproduce our results, use the commands in the table below; the pretrained models are in the outputs folder. We have implemented a chunking mechanism that significantly reduces training time compared to the original ScanRefer code (a minimal sketch of the chunking idea follows the results table). Training the baseline model takes around 4 hours on a recent GPU (NVIDIA Tesla T4).
Name | Command | Overall Acc@0.25IoU | Overall Acc@0.5IoU | Comments
---|---|---|---|---
ScanRefer (Baseline) | `python scripts/train.py --use_color --lr 1e-3 --batch_size 14` | 37.05 | 23.93 | xyz + color + height
ScanRefer with pretrained VoteNet (optimized Baseline) | `python scripts/train.py --use_color --use_chunking --use_pretrained "pretrained_VoteNet" --lr 1e-3 --batch_size 14` | 37.11 | 25.21 | xyz + color + height
Ours (pretrained 3DETR-m + GRU + vTransformer) | `python scripts/train.py --use_color --use_chunking --detection_module 3detr --match_module transformer --use_pretrained "pretrained_3DETR" --no_detection` | 37.08 | 26.56 | xyz + color + height
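
One common way such a chunking mechanism saves time is by grouping several language descriptions of the same scene into one sample, so the expensive point cloud processing runs once per chunk instead of once per description. The snippet below is a minimal, hypothetical sketch of that grouping idea under this assumption; the names (`chunk_size`, `scene_id`, `description`) are illustrative and do not reflect the actual dataset code.

```python
# Hypothetical sketch of description chunking: group descriptions by scene so
# the 3D backbone processes each scene's point cloud once per chunk.
from collections import defaultdict

def chunk_descriptions(samples, chunk_size=8):
    """samples: list of dicts with 'scene_id' and 'description' keys (illustrative)."""
    by_scene = defaultdict(list)
    for sample in samples:
        by_scene[sample["scene_id"]].append(sample["description"])

    chunks = []
    for scene_id, descs in by_scene.items():
        for i in range(0, len(descs), chunk_size):
            # All descriptions in a chunk share one scene, so its point cloud
            # only needs to be loaded and encoded once for the whole chunk.
            chunks.append({"scene_id": scene_id,
                           "descriptions": descs[i:i + chunk_size]})
    return chunks

# Example usage:
# chunks = chunk_descriptions([{"scene_id": "scene0000_00", "description": "the brown chair"}])
```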