This repository contains a trained model and the code for my research project in deep learning.
The goal of this project is to develop a deep learning model that computes depth from a single image. The main constraint is a lightweight architecture that enables real-time inference on the NVIDIA Jetson TX2 CPU.
This work is based on FastDepth, combined with a distillation strategy that uses MiDaS as the teacher.
The FastDepth model used is MobileNet-NNConv5 (depthwise), with additive skip connections. The main modification is the replacement of the encoder with MobileNetV3-Large. The skip connections are also revised to account for the different encoder architecture. Moreover, feature upsampling in the decoder has been switched from the nearest-neighbour operator to bilinear interpolation.
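A minimal PyTorch sketch of this kind of architecture is shown below. The skip tap points, decoder channel widths, and the 1x1 projections used to match channels for the additive skips are illustrative assumptions, not the exact configuration of this repository:

```python
# Illustrative sketch: MobileNetV3-Large encoder (torchvision), an
# NNConv5-style depthwise-separable decoder, additive skip connections,
# and bilinear (instead of nearest-neighbour) upsampling.
# Skip tap points and decoder widths are assumptions, not the repo's exact config.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v3_large


def nnconv5(in_ch, out_ch):
    """Depthwise 5x5 conv followed by pointwise 1x1 conv (NNConv5 style)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 5, padding=2, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )


class MobileNetV3Depth(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = mobilenet_v3_large(weights="DEFAULT").features
        # Five decoder stages: 1/32 back to full resolution, halving channels.
        chans = [960, 256, 128, 64, 32, 16]
        self.decoder = nn.ModuleList(
            [nnconv5(chans[i], chans[i + 1]) for i in range(5)]
        )
        # Encoder blocks whose outputs feed a skip (encoder index -> decoder stage):
        # features[12] (112 ch, 1/16), features[6] (40 ch, 1/8), features[3] (24 ch, 1/4).
        self.skip_from = {12: "0", 6: "1", 3: "2"}
        # 1x1 projections so encoder features can be *added* to decoder features.
        self.proj = nn.ModuleDict({
            "0": nn.Conv2d(112, 256, 1),
            "1": nn.Conv2d(40, 128, 1),
            "2": nn.Conv2d(24, 64, 1),
        })
        self.head = nn.Conv2d(16, 1, 1)  # final depth regression layer

    def forward(self, x):
        skips = {}
        for i, layer in enumerate(self.encoder):
            x = layer(x)
            if i in self.skip_from:
                skips[self.skip_from[i]] = x
        for stage, block in enumerate(self.decoder):
            # Bilinear interpolation replaces nearest-neighbour upsampling.
            x = F.interpolate(x, scale_factor=2, mode="bilinear",
                              align_corners=False)
            x = block(x)
            key = str(stage)
            if key in skips:
                x = x + self.proj[key](skips[key])  # additive skip connection
        return self.head(x)


model = MobileNetV3Depth().eval()
depth = model(torch.randn(1, 3, 192, 256))  # 256x192 input -> (1, 1, 192, 256)
```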
The distillation strategy consists in using a large collection of images to generate proxy labels with a robust model such as MiDaS. The Places dataset is an appropriate image collection because of its multitude of environment categories (indoor and outdoor) and its sheer size (more than 10 million images). For training, 450,000 images were selected at random. The images were first resized and then randomly cropped to a resolution of 512x384. Once the depth maps had been generated with MiDaS, a further resize brought both the images and the depth maps to 256x192, the input size of the lightweight model. Generating the depth maps took over 20 hours on an NVIDIA Tesla P100 GPU.
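A sketch of the proxy-label generation step follows, using the public torch.hub entry point of MiDaS; the directory names (`places_subset/`, `dataset/`), the `DPT_Large` variant, and the normalization constants are hypothetical placeholders for this illustration:

```python
# Illustrative proxy-label generation: run MiDaS over the selected Places
# images and store the predicted (relative inverse) depth maps.
# Paths and the "DPT_Large" variant are assumptions for this sketch.
import glob
import os

import numpy as np
import torch
import torch.nn.functional as F
import torchvision.transforms as T
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large").to(device).eval()

# Resize the shorter side, then random-crop to the teacher resolution 512x384.
crop = T.Compose([T.Resize(512), T.RandomCrop((384, 512))])
normalize = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # DPT-style norm
])

with torch.no_grad():
    for path in glob.glob("places_subset/**/*.jpg", recursive=True):
        img = crop(Image.open(path).convert("RGB"))
        pred = midas(normalize(img).unsqueeze(0).to(device))  # (1, 384, 512)
        # Bring both image and proxy label down to the student input, 256x192.
        label = F.interpolate(pred.unsqueeze(1), size=(192, 256),
                              mode="bilinear", align_corners=False)
        stem = os.path.splitext(os.path.basename(path))[0]
        img.resize((256, 192), Image.BILINEAR).save(f"dataset/images/{stem}.png")
        np.save(f"dataset/depths/{stem}.npy", label.squeeze().cpu().numpy())
```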
Comparison against prior work and the teacher (MiDaS) on the NYU Depth v2 test split:
Network | Input size | Training dataset | Abs Rel | Sq Rel | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³
---|---|---|---|---|---|---|---|---|---
MiDaS | 512x384 | MIX | 0.091 | 0.058 | 0.391 | 0.121 | 0.919 | 0.987 | 0.997 |
FastDepth | 224x224 | NYU | 0.165 | 0.154 | 0.604 | 0.208 | 0.771 | 0.937 | 0.980 |
PydNet | 640x320 | WILD | 0.130 | 0.091 | 0.493 | 0.168 | 0.827 | 0.969 | 0.994 |
FastDepth | 640x320 | WILD | 0.129 | 0.090 | 0.492 | 0.167 | 0.833 | 0.971 | 0.994 |
This Work | 256x192 | Places | 0.115 | 0.077 | 0.449 | 0.150 | 0.867 | 0.976 | 0.995 |
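For reference, the metrics above can be computed as sketched below. Since the predictions of relative-depth models are only defined up to scale and shift, each prediction is assumed to be aligned to the ground truth with a per-image least-squares scale and shift before scoring, as in the MiDaS evaluation protocol:

```python
# Sketch of the table's metrics. Predictions are assumed to be aligned to
# ground truth with a per-image least-squares scale and shift before scoring.
import numpy as np


def align(pred, gt):
    """Least-squares scale s and shift t minimizing ||s*pred + t - gt||^2."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return s * pred + t


def metrics(pred, gt):
    """pred/gt: flat arrays of strictly positive depths for one image."""
    gt = gt.ravel()
    pred = np.clip(align(pred.ravel(), gt), 1e-3, None)
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    d1, d2, d3 = (np.mean(ratio < 1.25 ** k) for k in (1, 2, 3))
    return abs_rel, sq_rel, rmse, rmse_log, d1, d2, d3
```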
- Download the NYU Depth v2 test split (loaded as sketched after these steps):

  ```bash
  mkdir -p nyu_depth_v2 && cd nyu_depth_v2
  wget http://horatio.cs.nyu.edu/mit/silberman/nyu_depth_v2/nyu_depth_v2_labeled.mat
  wget http://horatio.cs.nyu.edu/mit/silberman/indoor_seg_sup/splits.mat
  ```
- To evaluate, run:

  ```bash
  python main.py --config config/config.yaml
  ```
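The labelled `.mat` file is stored in MATLAB v7.3 (HDF5) format, so it can be read with h5py, while `splits.mat` is an older format readable with scipy. A sketch of loading the official 654-image test split (field names `testNdxs`, `images`, `depths` come from the NYU release):

```python
# Sketch: load the official NYU Depth v2 test split from the downloaded files.
# nyu_depth_v2_labeled.mat is MATLAB v7.3 (HDF5) -> h5py; splits.mat -> scipy.
import h5py
import numpy as np
from scipy.io import loadmat

test_idx = loadmat("nyu_depth_v2/splits.mat")["testNdxs"].ravel() - 1  # 1-based
with h5py.File("nyu_depth_v2/nyu_depth_v2_labeled.mat", "r") as f:
    # HDF5 stores MATLAB arrays with reversed axes: images come out as
    # (1449, 3, 640, 480) and depths as (1449, 640, 480).
    images = np.transpose(f["images"][...], (0, 3, 2, 1))[test_idx]  # (654, 480, 640, 3)
    depths = np.transpose(f["depths"][...], (0, 2, 1))[test_idx]     # (654, 480, 640), metres
```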
- Wofk, D., Ma, F., Yang, T.-J., Karaman, S., & Sze, V. (2019). FastDepth: Fast Monocular Depth Estimation on Embedded Systems. In IEEE International Conference on Robotics and Automation (ICRA).
- Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., & Koltun, V. (2022). Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3).
- Ranftl, R., Bochkovskiy, A., & Koltun, V. (2021). Vision Transformers for Dense Prediction. In IEEE/CVF International Conference on Computer Vision (ICCV).
- Aleotti, F., Zaccaroni, G., Bartolomei, L., Poggi, M., Tosi, F., & Mattoccia, S. (2021). Real-Time Single Image Depth Perception in the Wild with Handheld Devices. Sensors, 21.
- Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2017). Places: A 10 Million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.