Visual-Transformers-Mono-Odometry

In this project, we aim to develop a transformer-based model architecture for visual odometry (VO) that accurately estimates the position and orientation of a robot from camera images. To achieve this, we first collect the dataset and preprocess the image sequences, then implement and train the proposed architecture. We evaluate the model's performance and, based on the shortcomings of the pre-trained model, fine-tune it on other scenes. This deep-learning approach is intended to improve the accuracy and reliability of visual odometry systems across various applications.

Team Members

Hritvik Choudhari
Sumedh Reddy Koppula
Ashutosh Reddy Atimyala
Mohammed Maaruf Vazifdar
Venkata Sairam Polina

Approach

In our research, we aim to address the challenge of scale recovery in monocular systems. To do so, we will leverage the depth map estimated by a deep learning technique, specifically a transformer-based network.

Architecture

Part 1 - Dense Prediction

Embedding Phase: We begin by extracting non-overlapping patches from the input image, using a ResNeXT101 feature extractor to generate tokens.
Token Processing: The tokens are augmented with positional and readout embeddings and passed through several transformer stages.
Reassemble Phase: Tokens from several stages are reassembled into image-like representations at multiple resolutions and merged by fusion modules, which progressively build a fine-grained prediction.
Fusing Feature Maps: In the fusion blocks, the feature maps are processed with residual convolutional units and upsampled. Our architecture uses a modified version of the hybrid Dense Prediction Transformer (DPT) model, fine-tuned on the KITTI odometry dataset.
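
The token flow described above can be illustrated with a small NumPy sketch. It is not the project's model: the ResNeXT101 extractor and the learned weights are replaced by raw pixel patches and random matrices, and the names `patchify`, `embed`, and `reassemble` are hypothetical. It only shows how an image becomes a token sequence with a readout token and positional embeddings, and how tokens are folded back into an image-like grid.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles, flattened."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    tiles = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return tiles.reshape(gh * gw, patch * patch * c), (gh, gw)


def embed(patches, dim, rng):
    """Project patches to tokens, prepend a readout token, add positional embeddings."""
    n, d_in = patches.shape
    proj = rng.normal(scale=0.02, size=(d_in, dim))  # stand-in for a learned projection
    readout = np.zeros((1, dim))                     # stand-in for a learned readout token
    tokens = np.concatenate([readout, patches @ proj], axis=0)
    pos = rng.normal(scale=0.02, size=(n + 1, dim))  # stand-in for learned positions
    return tokens + pos


def reassemble(tokens, grid):
    """Drop the readout token and fold the remaining tokens into (gh, gw, dim)."""
    gh, gw = grid
    return tokens[1:].reshape(gh, gw, -1)


rng = np.random.default_rng(0)
image = rng.random((64, 96, 3))
patches, grid = patchify(image, patch=16)
tokens = embed(patches, dim=32, rng=rng)
feature_map = reassemble(tokens, grid)
```

Dropping the readout token is only the simplest way to handle it during reassembly; the transformer stages and the fusion blocks are omitted here entirely.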

Part 2 - Visual Odometry using scale estimation

Feature detection and matching: We use the Features from Accelerated Segment Test (FAST) corner detector for feature detection and the iterative Lucas-Kanade method to match features across frames.
Scale: We estimate the relative scale in monocular visual odometry (MVO) by aligning the depth predicted in Part 1 with the triangulated depth. The scale is fitted by a RANSAC regressor that takes the vector of depth ratios as input.
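
The idea behind the scale step can be sketched as follows. This is a hand-rolled, one-parameter RANSAC over the depth ratios, written for illustration only; it is not necessarily the regressor used in the notebook, and the function name `ransac_scale`, the tolerance, and the iteration count are assumptions.

```python
import numpy as np


def ransac_scale(predicted_depth, triangulated_depth, n_iters=200, tol=0.05, seed=0):
    """Estimate a single scale from per-point depth ratios with a simple RANSAC loop.

    predicted_depth:    depths from the dense prediction network at tracked points
    triangulated_depth: up-to-scale depths of the same points from triangulation
    Returns (scale, number_of_inliers).
    """
    pred = np.asarray(predicted_depth, dtype=float)
    tri = np.asarray(triangulated_depth, dtype=float)
    valid = np.isfinite(pred) & np.isfinite(tri) & (pred > 0) & (tri > 0)
    ratios = pred[valid] / tri[valid]  # the depth ratio vector
    if ratios.size == 0:
        raise ValueError("no valid depth pairs")

    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_iters):
        candidate = ratios[rng.integers(ratios.size)]         # hypothesis from one sample
        inliers = np.abs(ratios - candidate) < tol * candidate
        if best is None or inliers.sum() > best.sum():
            best = inliers

    # Refine the scale on the largest consensus set.
    return float(np.median(ratios[best])), int(best.sum())
```

The resulting scale would multiply the unit-norm translation recovered between two frames, so that the trajectory is expressed in the metric units of the predicted depth.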

ViT-DPT architecture

VO result plot

Results

Overall, our visual odometry model achieved good accuracy on the KITTI dataset, with low errors on all evaluation metrics. This demonstrates the effectiveness of our approach and the importance of accurate depth maps for visual odometry estimation.

Depth estimation using DPT

Visual odometry metrics obtained
