- Use a convolutional network for feature extraction.
- Use fully connected layers to predict output probabilities and coordinates.
In the YOLO paper, the authors pretrain the first 20 convolutional layers on the ImageNet dataset. These convolutional layers serve as the feature extractor.
Obviously, I'm not going to pretrain on ImageNet, since that would take a lot of time (possibly weeks). Nonetheless, you can use any pretrained CNN, such as ResNet or the others mentioned in the paper.
Instead, I've used a custom architecture (Archi.config in the repo) to train an image classifier on a custom dataset (pizza vs. sandwich).
If you've looked at the architecture, you'll see that convolutions with many channels are followed by convolutions with fewer channels; this reduces the amount of computation and adds extra non-linearity to the model (as sketched below).
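A rough sketch of this reduction pattern in PyTorch (the channel sizes here are illustrative, not the exact values from Archi.config):

```python
import torch
import torch.nn as nn

# Illustrative reduction block: a 1x1 convolution shrinks the channel count
# before the more expensive 3x3 convolution, cutting computation while the
# extra activation adds non-linearity.
block = nn.Sequential(
    nn.Conv2d(512, 256, kernel_size=1),             # reduce channels: 512 -> 256
    nn.LeakyReLU(0.1),
    nn.Conv2d(256, 512, kernel_size=3, padding=1),  # expand back: 256 -> 512
    nn.LeakyReLU(0.1),
)

x = torch.randn(1, 512, 14, 14)
print(block(x).shape)  # torch.Size([1, 512, 14, 14])
```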
Result of the image classifier:

Optimizer | Epochs | Learning Rate | Training Accuracy | Testing Accuracy |
---|---|---|---|---|
SGD | 50 | 0.0001 | 93.50% | 91.41% |
The trained model is in the ./Saved Models/ folder. You can pretrain the CNN on your own dataset.
The training and testing code can be found in the classifer.ipynb file within the repo.
Also, you can check out this repo.
YOLOv1 frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
The last convolutional layer outputs a tensor of shape (7, 7, 1024). This tensor is flattened and passed through 2 fully connected layers, acting as a form of linear regression; the resulting parameters are then reshaped into (7, 7, 30).
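A minimal PyTorch sketch of that head (the 4096-unit hidden layer matches the paper; everything else here is illustrative):

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes

# Detection head sketch: flatten the (7, 7, 1024) feature map, regress with
# two fully connected layers, then reshape into the (7, 7, 30) output grid.
head = nn.Sequential(
    nn.Flatten(),                          # (N, 1024, 7, 7) -> (N, 50176)
    nn.Linear(1024 * S * S, 4096),
    nn.LeakyReLU(0.1),
    nn.Linear(4096, S * S * (5 * B + C)),  # (N, 1470)
)

features = torch.randn(1, 1024, S, S)           # last conv layer's output
out = head(features).view(-1, S, S, 5 * B + C)  # (1, 7, 7, 30)
print(out.shape)
```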
The output is S × S × (5B + C).
Since S = 7, the image is divided into a 7 × 7 grid.
So, for each grid cell, the size of the output is 5B + C.
Terms:
B = number of bounding boxes predicted per grid cell
C = probabilities of each class (one per class)
If we consider B = 1 and C = n classes, then for each grid cell 1 bounding box is predicted. It looks something like this.
If we consider B = 2, then each grid cell predicts 2 bounding boxes, each defined by (x, y, w, h): the center coordinates, width, and height of the bounding box (plus a confidence score, which is where the 5 in 5B comes from).
Therefore, the output is flattened into a vector of size S × S × (5B + C).
The 30 in the fully connected layer is (5B + C), where the authors use B = 2 and C = 20 classes (so the network can predict up to 20 classes): 5 × 2 + 20 = 30.
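To make the layout concrete, here is one possible decoding of a single grid cell's 30 values (the ordering of boxes and class scores within the vector is a convention you pick when building the targets, not something fixed by the paper):

```python
import torch

S, B, C = 7, 2, 20
cell = torch.randn(5 * B + C)  # the 30 values for one grid cell

# Assumed layout: B blocks of (x, y, w, h, confidence), then C class scores.
boxes = cell[: 5 * B].view(B, 5)  # (2, 5): two (x, y, w, h, conf) tuples
class_probs = cell[5 * B:]        # (20,): class probabilities for this cell
print(boxes.shape, class_probs.shape)
```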
However, in this repo I've used a YOLOv3-style design (though it doesn't contain the pipeline for three detection scales). Unlike YOLOv1, where the final layer is a fully connected regressor, here a convolutional layer is used as the final layer.
- The classifier is used for feature selection/feature extraction.
- An extra layer is added to the classifier to produce the CNN output.
- The CNN output should have dimensions S × S × C, where S is the grid size and C is the number of channels.
- Here S = 13, meaning the image is divided into a 13 × 13 grid and the output has a height and width of 13 with C channels.
- Here C = 7, meaning the output has 7 channels. [1st channel: confidence score; 2nd to 5th channels: x, y, w, h; 6th and 7th channels: probability scores that the given object falls in a particular class.]
- Here (x, y) is the center of the bounding box and (w, h) its width and height.
- Since I've considered only two classes here, there are only two channels after x, y, w, and h (see the sketch after this list).
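One way to realize such a head is a final 1×1 convolution; here's a minimal PyTorch sketch (the 1024-channel backbone output and the 1×1 kernel are assumptions for illustration, not read from the repo):

```python
import torch
import torch.nn as nn

S, C = 13, 7  # grid size, output channels

# Final 1x1 convolution maps the backbone features to 7 output channels:
# [confidence, x, y, w, h, p(class 1), p(class 2)] for every grid cell.
head = nn.Conv2d(1024, C, kernel_size=1)

features = torch.randn(1, 1024, S, S)  # assumed backbone output
out = head(features)                   # (1, 7, 13, 13)
conf = out[:, 0]                       # channel 1: objectness score
box = out[:, 1:5]                      # channels 2-5: x, y, w, h
class_scores = out[:, 5:7]             # channels 6-7: two-class scores
print(out.shape)
```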
The YOLOv1 loss function is used in this implementation.
Let's break down each of its terms.
- Bounding Box Coordinate Loss / Regression Loss
YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object.
We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth.
However, I've only used one bounding box per grid cell in this implementation.
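With one box per cell this assignment is trivial, but since IoU also appears in the confidence target below, here is a small sketch of IoU for (x, y, w, h) boxes (`iou_xywh` is an illustrative helper, not code from this repo):

```python
def iou_xywh(box_a, box_b):
    """IoU of two boxes given as (x, y, w, h), with (x, y) the box center."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Intersection rectangle, clamped so disjoint boxes give zero area.
    iw = max(min(ax2, bx2) - max(ax1, bx1), 0.0)
    ih = max(min(ay2, by2) - max(ay1, by1), 0.0)
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

# With B > 1, the predictor with the highest IoU would be "responsible":
preds = [(0.5, 0.5, 0.4, 0.4), (0.6, 0.5, 0.2, 0.3)]
gt = (0.5, 0.5, 0.3, 0.3)
responsible = max(range(len(preds)), key=lambda j: iou_xywh(preds[j], gt))
print(responsible)  # 0
```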
(x_i, y_i, w_i, h_i): the network's prediction of the center, width, and height of the responsible bounding box in the i-th grid cell.
(x̂_i, ŷ_i, ŵ_i, ĥ_i): the corresponding ground truth.
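For reference, the coordinate-loss term from the YOLOv1 paper, in the notation above (the square roots damp the penalty for size errors on large boxes; λ_coord = 5 in the paper):

```math
\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]
+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right]
```

Here 𝟙^obj_ij is 1 when predictor j in cell i is responsible for an object, and 0 otherwise.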
- Confidence Loss
C_i: the true objectness score. Ĉ_i: the network's predicted confidence, multiplied by the IoU between the ground truth and the predicted box.
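The corresponding term from the paper, where λ_noobj = 0.5 down-weights the many cells that contain no object:

```math
\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left( C_i - \hat{C}_i \right)^2
+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{noobj}_{ij} \left( C_i - \hat{C}_i \right)^2
```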
- Classification Loss
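This is a sum-squared error over the class probabilities, applied only in cells that contain an object:

```math
\sum_{i=0}^{S^2} \mathbb{1}^{obj}_{i} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
```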
- Comparatively lower recall and more localization errors than Faster R-CNN.
- Struggles to detect objects that are close together, because each grid cell can propose only 2 bounding boxes.
- Struggles to detect small objects.