arXiv | IEEE Xplore | Website | Video
This repository is the official implementation of the paper:
Few-Shot Panoptic Segmentation With Foundation Models
Markus Käppeler*, Kürsat Petek*, Niclas Vödisch*, Wolfram Burgard, and Abhinav Valada.
*Equal contribution.
IEEE International Conference on Robotics and Automation (ICRA), 2024
If you find our work useful, please consider citing our paper:
@inproceedings{kaeppeler2024spino,
title={Few-Shot Panoptic Segmentation With Foundation Models},
author={Käppeler, Markus and Petek, Kürsat and Vödisch, Niclas and Burgard, Wolfram and Valada, Abhinav},
booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
year={2024},
pages={7718-7724}
}
Current state-of-the-art methods for panoptic segmentation require an immense amount of annotated training data that is both arduous and expensive to obtain, posing a significant challenge for their widespread adoption. Concurrently, recent breakthroughs in visual representation learning have sparked a paradigm shift, leading to the advent of large foundation models that can be trained with completely unlabeled images. In this work, we propose to leverage such task-agnostic image features to enable few-shot panoptic segmentation by presenting Segmenting Panoptic Information with Nearly 0 labels (SPINO). In detail, our method combines a DINOv2 backbone with lightweight network heads for semantic segmentation and boundary estimation. We show that our approach, albeit being trained with only ten annotated images, predicts high-quality pseudo-labels that can be used with any existing panoptic segmentation method. Notably, we demonstrate that SPINO achieves competitive results compared to fully supervised baselines while using less than 0.3% of the ground truth labels, paving the way for learning complex visual recognition tasks leveraging foundation models. To illustrate its general applicability, we further deploy SPINO on real-world robotic vision systems for both outdoor and indoor environments.
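As a rough illustration of this idea, the sketch below pairs a frozen DINOv2 backbone (loaded via torch.hub) with lightweight convolutional heads for semantic segmentation and boundary estimation. This is a minimal sketch only: the ViT-B/14 variant, the head shapes, and the number of classes are assumptions and do not reflect the exact architecture used in this repository.

```python
# Illustrative sketch: frozen DINOv2 features + lightweight task heads.
# The backbone variant, head design, and class count are assumptions.
import torch
import torch.nn as nn


class LightweightHeads(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int = 768):
        super().__init__()
        # Small heads operating on frozen DINOv2 patch-token features.
        self.semantic_head = nn.Conv2d(feat_dim, num_classes, kernel_size=1)
        self.boundary_head = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor):
        return self.semantic_head(feats), self.boundary_head(feats)


# Frozen, task-agnostic backbone (ViT-B/14 chosen here as an example).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
heads = LightweightHeads(num_classes=19)  # e.g., the 19 Cityscapes classes

image = torch.randn(1, 3, 518, 518)  # H and W must be multiples of 14
with torch.no_grad():
    # Returns a (B, C, H/14, W/14) patch-token feature map.
    feats = backbone.get_intermediate_layers(image, n=1, reshape=True)[0]
semantic_logits, boundary_logits = heads(feats)
```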
- Create conda environment:
conda create --name spino python=3.8
- Activate environment:
conda activate spino
- Install dependencies:
pip install -r requirements.txt
- Install torch, torchvision and cuda:
pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/cu111/torch_stable.html
- Compile deformable attention:
cd panoptic_segmentation_model/external/ms_deformable_attention && sh make.sh
- Install pre-commit githook scripts:
pre-commit install
- Upgrade isort to 5.12.0:
pip install isort==5.12.0
- Update pre-commit:
pre-commit autoupdate
- Linter (pylint) and formatter (yapf, isort) settings can be configured in pyproject.toml.
To generate pseudo-labels for the Cityscapes dataset, please set the path to the dataset in the configuration files (see list below).
Then execute run_cityscapes.sh from the root of the panoptic_label_generator folder.
This script will perform the following steps:
- Train the semantic segmentation module using the configuration file configs/semantic_cityscapes.yaml.
- Train the boundary estimation module using the configuration file configs/boundary_cityscapes.yaml.
- Generate the panoptic pseudo-labels using the configuration file configs/instance_cityscapes.yaml.
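The third step fuses the semantic and boundary predictions into panoptic pseudo-labels. The snippet below is a heavily simplified sketch of one such fusion strategy, labelling connected components of boundary-free "thing" regions; the thing-class IDs, the boundary threshold, and the overall procedure are assumptions here and not necessarily what instance_cityscapes.yaml configures.

```python
# Heavily simplified sketch of fusing semantic and boundary predictions into a
# panoptic pseudo-label (Cityscapes-style class_id * 1000 + instance_id encoding).
# NOT the exact procedure of this repository; the thing-class IDs and the
# boundary threshold below are assumptions.
import numpy as np
from scipy import ndimage


def fuse_panoptic(semantic: np.ndarray, boundary: np.ndarray,
                  thing_ids=(11, 12, 13, 14, 15, 16, 17, 18),
                  boundary_thresh: float = 0.5) -> np.ndarray:
    """semantic: (H, W) class IDs; boundary: (H, W) probabilities in [0, 1]."""
    # Stuff pixels keep instance ID 0.
    panoptic = semantic.astype(np.int64) * 1000
    for cls in thing_ids:
        # Remove boundary pixels, then split the class mask into connected components.
        mask = (semantic == cls) & (boundary < boundary_thresh)
        components, num_instances = ndimage.label(mask)
        for instance_id in range(1, num_instances + 1):
            panoptic[components == instance_id] = cls * 1000 + instance_id
    return panoptic
```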
We also support the KITTI-360 dataset. To generate pseudo-labels for KITTI-360, please adapt the corresponding configuration files.
Instead of training the modules from scratch, you can also use the pretrained weights provided at these links:
- Cityscapes: https://drive.google.com/file/d/1FjJYpkEO9enpsahevD8PMn3nP_O0sNnT/view?usp=sharing
- KITTI-360: https://drive.google.com/file/d/1Eod444VoRLKw6dOeDSLuvfUQlJ5FAwM_/view?usp=sharing
To train a panoptic segmentation model on a given dataset, e.g., the generated pseudo-labels, execute train.sh.
Before running the code, specify all settings:
- python_env: Set the name of the conda environment (e.g. "spino")
- alias_python: Set the path of the python binary to be used
- WANDB_API_KEY: Set the wandb API key of your account
- CUDA_VISIBLE_DEVICES: Specifies the device IDs of the available GPUs
- Set all remaining arguments:
- nproc_per_node: Number of processes per node (usually one node corresponds to one GPU server); this should equal the number of devices specified in CUDA_VISIBLE_DEVICES
- master_addr: IP address of GPU server to run the code on
- master_port: Port to be used for server access
- run_name: Name of the current run; a folder with this name will be created containing all generated files (pretrained weights, config file, etc.), and this name will also appear on wandb
- project_root_dir: Path to where the folder with the run name will be created
- mode: Mode of the training, can be "train" or "eval"
- resume: If specified, the training will be resumed from the specified checkpoint
- pre_train: Only load the specified modules from the checkpoint
- freeze_modules: Freeze the specified modules during training
- filename_defaults_config: Filename of the default configuration file with all configuration parameters
- filename_config: Filename of the configuration file whose values are applied on top of the default configuration file
- comment: An arbitrary free-text comment string
- seed: Seed to initialize "torch", "random", and "numpy" (see the seeding sketch after this list)
- Set available flags:
- eval: Only evaluate the model specified by resume
- debug: Start the training in debug mode
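For reference, the seed argument is typically applied to all three libraries at once. The helper below is a minimal sketch of that pattern; the repository's own seeding code may differ.

```python
# Minimal sketch of seeding "torch", "random", and "numpy" as described for the
# seed argument; not necessarily the exact helper used in this repository.
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op if CUDA is unavailable


seed_everything(42)
```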
Additionally,
- ensure that the dataset path is set correctly in the corresponding config file, e.g., train_cityscapes_dino_adapter.yaml.
- set the entity and project parameters for wandb.init(...) in misc/train_utils.py.
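A hypothetical example of how that call in misc/train_utils.py could be configured (the entity, project, and run name below are placeholders):

```python
# Hypothetical wandb.init(...) configuration; replace the placeholders with
# your own wandb account/team, project, and run name.
import wandb

run = wandb.init(
    entity="your-wandb-entity",  # placeholder: your wandb account or team
    project="spino",             # placeholder: your wandb project name
    name="my_first_run",         # e.g., the run_name passed to train.sh
)
```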
Download the following files:
- leftImg8bit_sequence_trainvaltest.zip (324GB)
- gtFine_trainvaltest.zip (241MB)
- camera_trainvaltest.zip (2MB)
After extraction, one should obtain the following file structure:
── cityscapes
├── camera
│ └── ...
├── gtFine
│ └── ...
└── leftImg8bit_sequence
└── ...
Download the following files:
- Perspective Images for Train & Val (128G): You can remove "01" in line 12 of download_2d_perspective.sh to only download the relevant images.
- Test Semantic (1.5G)
- Semantics (1.8G)
- Calibrations (3K)
After extraction and copying of the perspective images, one should obtain the following file structure:
── kitti_360
├── calibration
│ ├── calib_cam_to_pose.txt
│ └── ...
├── data_2d_raw
│ ├── 2013_05_28_drive_0000_sync
│ └── ...
├── data_2d_semantics
│ └── train
│ ├── 2013_05_28_drive_0000_sync
│ └── ...
└── data_2d_test
├── 2013_05_28_drive_0008_sync
└── 2013_05_28_drive_0018_sync
For academic usage, the code is released under the GPLv3 license. For any commercial purpose, please contact the authors.
This work was funded by the German Research Foundation (DFG) Emmy Noether Program grant No 468878300 and the European Union’s Horizon 2020 research and innovation program grant No 871449-OpenDR.