Summary: In this project, we transfer the target object from the first video into the second one, alter the characteristics of the source audio to match those of the target audio, and then blend these two pipelines into a single application.
Please read our research paper (Project_latex.pdf) for a complete explanation of this repo; this GitHub README only provides a summary.
View the videos below to gain a comprehensive understanding of our project (the longer video is low quality due to GitHub's file size limits).
video_2023-12-07_18-47-34.mp4
video_2023-12-06_14-54-28.mp4
This project goes beyond simply finding and evaluating models; it combines the selected models into an interesting application. Our chosen model for image segmentation is SAM, and we have successfully applied it to video (Approaches 2 and 3). We also utilized the DETR (End-to-End Object Detection) model with a ResNet-50 backbone, and pushed the dockerized application to Docker Hub for use. For voice conversion, we selected the so-vits-svc-fork model, which enables us to change a singer's voice to any desired target voice. Finally, we integrated these two applications into one: Combination of vi & vc.
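To make the detection piece concrete, here is a minimal sketch of loading DETR with a ResNet-50 backbone through Hugging Face transformers and running it on one frame. The public checkpoint name `facebook/detr-resnet-50`, the input file, and the 0.9 threshold are illustrative assumptions; the repo's notebooks may load and use the model differently.

```python
import torch
from PIL import Image
from transformers import DetrForObjectDetection, DetrImageProcessor

# Public DETR checkpoint with a ResNet-50 backbone (illustrative choice)
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("frame.jpg")  # placeholder input frame
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to per-object detections above a confidence cutoff
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]
for score, label, box in zip(
    detections["scores"], detections["labels"], detections["boxes"]
):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```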
To use the video segmentation part, navigate to the video inpainting directory. To segment an image, go to the image segmentation folder, where you will find notebooks with the relevant code. To apply image inpainting, see the notebook called image inpainting, which combines segmentation and inpainting for Photoshop-style edits (this section may become a separate project in the future). Lastly, to apply image segmentation to a video, go to the applying Is to video directory, which contains three notebooks about SAM; we recommend Approach 3 (sketched below). The last notebook implements the DETR model with a ResNet-50 backbone. In addition, all Docker-related files are in the video inpainting directory for future modifications.
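For intuition, below is a simplified frame-by-frame sketch of running SAM over a video with a single point prompt. The checkpoint file, input video, and prompt coordinates are placeholder assumptions, and Approach 3 in the notebooks may choose prompts and track the target differently.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM checkpoint (ViT-H weights here are an illustrative choice)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

cap = cv2.VideoCapture("input.mp4")  # placeholder input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # SAM expects RGB images; OpenCV decodes frames as BGR
    predictor.set_image(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    # Prompt with a single foreground point (placeholder coordinates)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[320, 240]]),
        point_labels=np.array([1]),
        multimask_output=False,
    )
    mask = masks[0]  # boolean mask of the target in this frame
cap.release()
```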
The voice cloning directory contains a single notebook that covers model training, testing, and usage. For more information, please refer to the paper.
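so-vits-svc-fork is normally driven through its `svc` command line. The sketch below wraps a typical preprocess/train/infer sequence in Python; the dataset layout, checkpoint path, and file names are illustrative assumptions, and the voice cloning notebook documents the exact pipeline we used.

```python
import subprocess

# Preprocess training clips placed under dataset_raw/<speaker>/ (so-vits-svc-fork layout)
subprocess.run(["svc", "pre-resample"], check=True)
subprocess.run(["svc", "pre-config"], check=True)
subprocess.run(["svc", "pre-hubert"], check=True)

# Train the voice model (checkpoints land under logs/44k with the default config)
subprocess.run(["svc", "train"], check=True)

# Convert a source vocal track into the trained voice
subprocess.run(
    ["svc", "infer", "source_vocals.wav",
     "-m", "logs/44k/G_10000.pth",      # illustrative generator checkpoint
     "-c", "configs/44k/config.json"],  # illustrative config path
    check=True,
)
```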
Finally, to use the combination of these two applications (video inpainting and voice conversion), which uses SAM for the video part, go to the combination of vi & vc directory.
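As a rough picture of what the combination step does, here is a minimal sketch (assuming moviepy 1.x, and assuming the inpainted video and converted vocals already exist under the placeholder names below); the notebook in this directory runs the full pipeline end to end.

```python
from moviepy.editor import AudioFileClip, VideoFileClip

video = VideoFileClip("inpainted_video.mp4")  # output of the SAM-based video stage
voice = AudioFileClip("converted_voice.wav")  # output of so-vits-svc-fork

# Replace the original soundtrack with the cloned voice and export the result
final = video.set_audio(voice)
final.write_videofile("final_output.mp4", codec="libx264", audio_codec="aac")
```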
- Python 3.x
- Google Colab or Jupyter Notebook
- Docker
- Linux system (Ubuntu 22.04)
- Nvidia GPU
- CUDA Toolkit
Clone the repository:

```bash
git clone https://github.com/Amirrezahmi/Video-Inpainting-and-Voice-Cloning.git
```
Detailed usage instructions are provided in each directory's notebooks. Please refer to them for specific steps to run the models.
For some samples, please visit our Drive; for more examples, see our paper.
Contributions are welcome. Please open an issue to discuss the change or improvement you want to make, or create a pull request to propose your changes.