This is my project from an internship at ELEKS. My task was to predict 3D bounding boxes for objects in photos. To this end, I chose Google's Objectron dataset as my primary data. It covers several object classes; for simplicity I chose only one, the cup class. I also trained my model only on photos containing a single cup, omitting pictures with two or more cups.
Google's annotations provide the bounding box coordinates for each frame. Each box is described by 9 points in a 3D coordinate system (the 8 corners of the box and 1 point at its center), so 27 values in total.
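To make the 27-value target concrete, here is a minimal sketch of turning one box into a regression target and back. It assumes the 9 points are already loaded as a 9x3 NumPy array; `flatten_box` and `unflatten_box` are hypothetical helper names, not part of the Objectron tooling, and the point order is simply whatever the annotation uses.

```python
import numpy as np

def flatten_box(points):
    """Turn 9 (x, y, z) points into a flat vector of 27 values."""
    points = np.asarray(points, dtype=np.float32)
    if points.shape != (9, 3):
        raise ValueError(f"expected shape (9, 3), got {points.shape}")
    # Row-major flattening: x0, y0, z0, x1, y1, z1, ...
    return points.reshape(-1)

def unflatten_box(target):
    """Turn a flat vector of 27 values back into 9 (x, y, z) points."""
    return np.asarray(target, dtype=np.float32).reshape(9, 3)
```

The same reshape can be applied to a model prediction to recover the 9 points for drawing.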
First, I built a simple custom model that predicts the 27 values to get a baseline.
This model reached an MSE loss of about 0.1805.
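For reference, the MSE reported here is the mean of the squared differences between predicted and annotated values. The sketch below shows that computation in plain NumPy; it assumes the mean is taken over all 27 values and all samples, which is the usual default but is not confirmed by this write-up. The function name `mse` is just for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over every value in the batch."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.mean((y_true - y_pred) ** 2))
```

With this definition, a prediction that is off by 0.5 on every coordinate gives an MSE of 0.25.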
Then I tried a VGG16 model with frozen convolutional layers.
The result was much better than the baseline: an MSE of 0.022.
Then I unfroze the weights and trained the model again.
Unfortunately, it didn't help a bit.
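The two VGG16 experiments can be sketched roughly as follows. This write-up does not name the framework, so Keras (via TensorFlow) is an assumption, as are the input size, the regression head, and the learning rates; `build_vgg16_regressor` and `unfreeze` are hypothetical helper names.

```python
import tensorflow as tf

def build_vgg16_regressor(input_shape=(224, 224, 3), weights="imagenet"):
    """Return (model, backbone): frozen VGG16 conv layers plus a 27-value head."""
    backbone = tf.keras.applications.VGG16(
        include_top=False, weights=weights, input_shape=input_shape
    )
    backbone.trainable = False  # freeze all convolutional layers

    inputs = tf.keras.Input(shape=input_shape)
    x = backbone(inputs, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dense(256, activation="relu")(x)  # head size is a guess
    outputs = tf.keras.layers.Dense(27)(x)  # linear output: 9 points x 3 coords
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    return model, backbone

def unfreeze(model, backbone, learning_rate=1e-5):
    """Make the backbone trainable and recompile so the change takes effect."""
    backbone.trainable = True
    # A small learning rate is a common choice for fine-tuning (assumed here).
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate), loss="mse")
    return model
```

In this sketch, the first experiment corresponds to calling `fit` on the model returned by `build_vgg16_regressor`, and the second to calling `unfreeze` and then `fit` again. Recompiling after changing `trainable` matters in Keras, because the change is only picked up at compile time.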
In the end, I got something like this: