[ Install | Datasets | Training | Models | Evaluation | Demo | References | License ]
This is the official PyTorch implementation of the system proposed in the paper:
Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency
⟹ Unified Visual Odometry: our holistic visualization of depth and motion estimation from self-supervised monocular training.
@inproceedings{lee2021learning,
title={Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency},
author={Lee, Seokju and Im, Sunghoon and Lin, Stephen and Kweon, In So},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)},
year={2021}
}
Our code is tested with CUDA 10.2/11.0, Python 3.7.x (conda environment), and PyTorch 1.4.0/1.7.0.
At least 2 GPUs (each 12 GB) are required to train the models with batch_size=4 and maximum_number_of_instances_per_frame=3.
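As a quick sanity check before training, the GPU requirement above can be tested from Python. This is a minimal sketch and not part of the repository: the helper names and the 11 GiB threshold (chosen because a nominal 12 GB card usually reports slightly less total memory) are assumptions.

```python
def meets_gpu_requirement(memories_gib, min_gpus=2, min_mem_gib=11.0):
    """Return True if at least `min_gpus` devices have `min_mem_gib` or more memory.

    The 11 GiB default is an assumed tolerance for nominal 12 GB cards.
    """
    big_enough = [m for m in memories_gib if m >= min_mem_gib]
    return len(big_enough) >= min_gpus


def query_gpu_memories_gib():
    """Return the total memory of each visible CUDA device in GiB (empty if none)."""
    try:
        import torch  # imported lazily so the pure check above works without PyTorch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_properties(i).total_memory / 1024 ** 3
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    memories = query_gpu_memories_gib()
    print("GPU memory (GiB):", [round(m, 1) for m in memories])
    print("Enough for batch_size=4:", meets_gpu_requirement(memories))
```

If the check fails, reducing the batch size or the number of instances per frame in the training script is the likely workaround, at the cost of departing from the tested configuration.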
Create a conda environment with the PyTorch library as follows:
conda create -n my_env python=3.7.4 pytorch=1.7.0 torchvision torchaudio cudatoolkit=11.0 -c pytorch
conda activate my_env
Install the prerequisite packages listed in requirements.txt:
pip3 install -r requirements.txt
or manually install the following packages:
opencv-python
imageio
matplotlib
scipy==1.1.0
scikit-image
argparse
tensorboardX
blessings
progressbar2
path
tqdm
pypng
open3d==0.8.0.0
Please install torch-scatter and torch-sparse by following this link.
pip3 install torch-scatter torch-sparse -f https://pytorch-geometric.com/whl/torch-1.7.0+cu110.html
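The wheel index in the command above encodes the PyTorch and CUDA versions. The sketch below rebuilds that URL from version strings. The helper name `wheel_index_url` is hypothetical, and the naming pattern is inferred only from the single URL shown above, so an index for other version combinations may not exist.

```python
def wheel_index_url(torch_version, cuda_version):
    """Build the wheel index URL for a PyTorch/CUDA pair, e.g. ("1.7.0", "11.0").

    Assumption: the index follows the torch-<version>+cu<digits>.html pattern
    seen in the install command above.
    """
    cuda_tag = "cu" + cuda_version.replace(".", "")
    return f"https://pytorch-geometric.com/whl/torch-{torch_version}+{cuda_tag}.html"


if __name__ == "__main__":
    url = wheel_index_url("1.7.0", "11.0")
    print(f"pip3 install torch-scatter torch-sparse -f {url}")
```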
We provide our KITTI-VIS and Cityscapes-VIS datasets (download link), which are composed of pre-processed images, auto-annotated instance segmentation, and optical flow.
- Images are pre-processed with SC-SfMLearner.
- Instance segmentation is pre-processed with PANet.
- Optical flow is pre-processed with PWC-Net.
We associate them to perform video instance segmentation, as implemented in datasets/sequence_folders.py.
Please arrange the dataset in the following file structure:
kitti_256 (or cityscapes_256)
  └ image
      └ $SCENE_DIR
  └ segmentation
      └ $SCENE_DIR
  └ flow_f
      └ $SCENE_DIR
  └ flow_b
      └ $SCENE_DIR
  ├ train.txt
  └ val.txt
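To confirm that a downloaded dataset matches this layout, a small check like the following can help. It is a sketch, not part of the repository: the function name `missing_entries` is hypothetical, and it only tests that the expected paths exist, not their contents.

```python
from pathlib import Path

# Sub-directories that the layout above places under the dataset root.
REQUIRED_SUBDIRS = ("image", "segmentation", "flow_f", "flow_b")


def missing_entries(root, scene):
    """List the paths that the layout expects for `scene` but that do not exist."""
    root = Path(root)
    expected = [root / sub / scene for sub in REQUIRED_SUBDIRS]
    expected += [root / "train.txt", root / "val.txt"]
    return [p.relative_to(root).as_posix() for p in expected if not p.exists()]


if __name__ == "__main__":
    # "kitti_256" and "some_scene" are placeholders for your own paths.
    print(missing_entries("kitti_256", "some_scene"))
```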
Training and validation scenes can be randomly generated in train.txt and val.txt.
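One way to generate the two split files is sketched below. It is not the authors' script: the function names, the 10% validation ratio, and the file format (one scene directory name per line, each terminated by a newline) are assumptions, so please compare the format with datasets/sequence_folders.py before relying on it.

```python
import random
from pathlib import Path


def split_scenes(scenes, val_ratio=0.1, seed=0):
    """Shuffle scene names reproducibly and split them into (train, val) lists."""
    scenes = sorted(scenes)
    random.Random(seed).shuffle(scenes)
    # Keep at least one validation scene whenever there is more than one scene.
    n_val = max(1, int(len(scenes) * val_ratio)) if len(scenes) > 1 else 0
    return sorted(scenes[n_val:]), sorted(scenes[:n_val])


def write_split_files(root, val_ratio=0.1, seed=0):
    """Write train.txt and val.txt under `root`, one scene directory name per line."""
    root = Path(root)
    scenes = [p.name for p in (root / "image").iterdir() if p.is_dir()]
    train, val = split_scenes(scenes, val_ratio, seed)
    # Each name ends with a newline (assumed to be the format the data loader reads).
    (root / "train.txt").write_text("".join(name + "\n" for name in train))
    (root / "val.txt").write_text("".join(name + "\n" for name in val))
    return train, val
```

Calling `write_split_files("kitti_256")` would then overwrite any existing train.txt and val.txt in that directory, so back them up first if you want to keep the provided splits.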
You can train the models on KITTI-VIS by running:
sh scripts/train_resnet_256_kt.sh
You can train the models on Cityscapes-VIS by running:
sh scripts/train_resnet_256_cs.sh
Please indicate the location of the dataset with $TRAIN_SET.
The hyperparameters (batch size, learning rate, loss weight, etc.) are defined in each script file and in the default arguments of train.py. Please also check our main paper.
During training, checkpoints will be saved in checkpoints/.
You can also start a tensorboard session by running:
tensorboard --logdir=checkpoints/ --port 8080 --bind_all
and visualize the training progress by opening http://localhost:8080 in your browser.
For convenience, we provide two breakpoints (supported with pdb), commented as BREAKPOINT in train.py.
Each breakpoint represents an important point in projecting the object.
- BREAKPOINT-1: Breakpoint after the 1st projection with camera motion. Visualize ego-warped images.
- BREAKPOINT-2: Breakpoint after the 2nd projection with each object motion. Visualize fully-warped images and motion fields.
You can visualize the intermediate outputs with the commented code, which makes the projection steps easier to inspect while debugging.
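To make the two projection stages concrete, here is a single-pixel sketch using a pinhole camera model. It is an illustration only and not the code in train.py, which works on batched tensors and instance masks. All function names are hypothetical, and composing the object motion on top of the ego-motion in this order is an assumption.

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with the given depth to a 3D point in the camera frame."""
    return [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]


def rigid_transform(point, rotation, translation):
    """Apply a rigid motion: rotation (3x3 nested list) followed by translation."""
    return [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]


def project(point, fx, fy, cx, cy):
    """Project a 3D camera-frame point back to pixel coordinates."""
    x, y, z = point
    return [fx * x / z + cx, fy * y / z + cy]


def warp_pixel(u, v, depth, intrinsics, ego_motion, object_motion=None):
    """Return (ego-warped, fully-warped) pixel coordinates for one pixel.

    `intrinsics` is (fx, fy, cx, cy); `ego_motion` and `object_motion` are
    (rotation, translation) pairs. `object_motion` is None for pixels that
    lie outside every instance mask.
    """
    point = back_project(u, v, depth, *intrinsics)
    # 1st projection: camera motion only (the stage described for BREAKPOINT-1).
    point_ego = rigid_transform(point, *ego_motion)
    ego_warped = project(point_ego, *intrinsics)
    if object_motion is None:
        return ego_warped, ego_warped
    # 2nd projection: per-object motion applied on top of the ego-motion
    # (the stage described for BREAKPOINT-2).
    point_full = rigid_transform(point_ego, *object_motion)
    return ego_warped, project(point_full, *intrinsics)
```

For a static background pixel the two outputs coincide, which is why the ego-warped and fully-warped images should differ only inside moving instances.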
We provide KITTI-VIS and Cityscapes-VIS pretrained models (download link).
The architectures are based on the ResNet18 encoder. Please see the details in models/.
Models trained under three different conditions are released:
- KITTI: Trained on KITTI-VIS using an ImageNet-pretrained (ResNet18) model.
- CS: Trained on Cityscapes-VIS using an ImageNet-pretrained (ResNet18) model. This model is intended only for pretraining and the demo.
- CS+KITTI: Pretrained on Cityscapes-VIS and finetuned on KITTI-VIS.
We evaluate our depth estimation following the KITTI Eigen split.
Evaluation requires the KITTI raw dataset, which is provided on the official website.
Tested scenes are listed in kitti_eval/test_files_eigen.txt.
You can evaluate the models by running:
sh scripts/run_eigen_test.sh
Please indicate the location of the raw dataset with $DATA_ROOT, and the models with $DISP_NET.
Our results are as follows:
Models | Abs Rel | Sq Rel | RMSE | RMSE log | Acc 1 | Acc 2 | Acc 3 |
---|---|---|---|---|---|---|---|
ResNet18, 832x256, ImageNet → KITTI | 0.112 | 0.777 | 4.772 | 0.191 | 0.872 | 0.959 | 0.982 |
ResNet18, 832x256, Cityscapes → KITTI | 0.109 | 0.740 | 4.547 | 0.184 | 0.883 | 0.962 | 0.983 |
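For readers unfamiliar with the column names, the sketch below computes the depth metrics as they are commonly defined for the Eigen split. It assumes that Acc 1, Acc 2, and Acc 3 denote the fraction of pixels whose ratio max(gt/pred, pred/gt) is below 1.25, 1.25², and 1.25³. It is not the repository's evaluation script and omits any depth capping, cropping, or scale alignment that script may apply.

```python
import math


def depth_metrics(gt, pred):
    """Compute common depth errors and accuracies for paired positive depths."""
    n = len(gt)
    pairs = list(zip(gt, pred))
    abs_rel = sum(abs(g - p) / g for g, p in pairs) / n
    sq_rel = sum((g - p) ** 2 / g for g, p in pairs) / n
    rmse = math.sqrt(sum((g - p) ** 2 for g, p in pairs) / n)
    rmse_log = math.sqrt(sum((math.log(g) - math.log(p)) ** 2 for g, p in pairs) / n)
    # Accuracy under threshold: share of pixels with ratio below 1.25 ** k.
    ratios = [max(g / p, p / g) for g, p in pairs]
    acc1, acc2, acc3 = (sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3))
    return {
        "abs_rel": abs_rel,
        "sq_rel": sq_rel,
        "rmse": rmse,
        "rmse_log": rmse_log,
        "acc1": acc1,
        "acc2": acc2,
        "acc3": acc3,
    }
```

Lower is better for the four error columns, and higher is better for the three accuracy columns.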
For convenience, we also provide precomputed depth maps in this link.
We demonstrate Unified Visual Odometry, which shows the results of depth, ego-motion, and object motion holistically.
You can visualize them by running:
sh scripts/run_demo.sh
Please indicate the location of the image samples with $SCENE. We recommend visualizing Cityscapes scenes, since they contain more dynamic objects than KITTI.
More results are demonstrated in this link.
- SC-SfMLearner (NeurIPS 2019, our baseline framework)
- PANet (CVPR 2018, instance segmentation for data pre-processing)
- PWC-Net (CVPR 2018, optical flow for data pre-processing)
- PyTorch-Sparse (PyTorch library for sparse tensor representation)
- Struct2Depth (AAAI 2019, object scale loss)
- Depth from Video in the Wild (ICCV 2019, motion field representation)
The source code is released under the MIT license.