\chapter{Results and Discussions}
\label{chap:results}
%
In this chapter, we discuss the experimental procedures adopted.
First, we go through how we assess the performance of object detection.
After that, we discuss the implementation details, including the network architectures, as well as the training and inference hyper-parameters.
Finally, we examine the obtained results both numerically and visually.
\section{Evaluation}
\label{sec:eval}
% https://medium.com/@timothycarlen/understanding-the-map-evaluation-metric-for-object-detection-a07fe6962cf3
% https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173
% https://github.com/rafaelpadilla/Object-Detection-Metrics
Assessing the performance of object detectors
% may seem easy at first sight. However, it
is a complex task since the models must be evaluated for both image classification (whether an object is on the image) and localization (where the object appears on the image, \ie bounding box regression).
Moreover, typical datasets have many classes with significantly nonuniform prior distribution over classes.
Thus, a simple accuracy-based metric is not appropriate as it will introduce biases.
Another aspect to be taken into consideration is the risk of misclassification.
Since this risk is not known a priori, it is necessary to associate a ``confidence score'' (or ``model score'') with each bounding box prediction.
This allows evaluating the model at different levels of confidence, \ie
regulating the trade-off between different types of classification error.
With these needs in mind, the mean Average Precision~(mAP)
\abbrev{\id{mAP}mAP}{mean Average Precision}~\cite{Everingham10}
metric was introduced and it is widely used to evaluate models in object detection and segmentation online challenges.
Before talking about the mAP, it is reasonable to understand the precision and recall concepts of a classifier and then
discuss the Average Precision~(AP)~\cite{Everingham10}.
%
\abbrev{\id{AP}AP}{Average Precision}
%
Before all of that, we need to consider the Intersection over Union (IoU) \abbrev{\id{IoU}IoU}{Intersection over Union} concept applied in object localization evaluation.
\subsection{Object localization evaluation}
%
As aforementioned, we need to evaluate our model regarding object localization.
In other words, we need to know how closely the object boundary output overlaps the ground truth.
To measure that, we use the IoU,
\abbrev{\id{IoU}IoU}{Intersection over Union}
which may be used to evaluate any algorithm that outputs bounding boxes.
The IoU is based on the Jaccard index~\cite{jaccard1912distribution}, also referred to as the Jaccard similarity coefficient, which is a standard for evaluating the similarity between finite sample sets.
One may compute the IoU as follows: given a predicted bounding box $B_{p}$ associated with a ground truth bounding box $B_{gt}$, the IoU is the ratio between the area of their intersection and the area of their union,
% ; as shown in Equation~\eqref{eq:IoU},
%
\begin{equation}
IoU = \frac{\textrm{area}~(B_p \cap B_{gt})}{\textrm{area}~(B_p \cup B_{gt})}
\label{eq:IoU}.
\end{equation}
\symbl{\id{IoU }$IoU$}{Intersection over Union}
\symbl{\id{intersection }$\cap$}{Intersection sets operator}
\symbl{\id{IoU }$\cup$}{Union sets operator}
%
Figure~\ref{fig:IoU} illustrates the IoU of a ground truth box (in green) and a prediction box (in red).
The area (in blue) is given by the number of pixels inside the bounding boxes.
%
\begin{figure}[htb]
\centering
\includegraphics[width=.6\linewidth]{IoU.pdf}
\caption{Intersection over Union (IoU).}
\label{fig:IoU}
\end{figure}
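%
To make Equation~\eqref{eq:IoU} concrete, the snippet below is a minimal Python sketch (for illustration only, not the evaluation code used in this work) that computes the IoU of two axis-aligned boxes given by their $(x_1, y_1, x_2, y_2)$ corner coordinates.
\begin{verbatim}
def iou(box_p, box_gt):
    """IoU of two boxes given as (x1, y1, x2, y2) corners."""
    # Corners of the intersection rectangle.
    x1 = max(box_p[0], box_gt[0])
    y1 = max(box_p[1], box_gt[1])
    x2 = min(box_p[2], box_gt[2])
    y2 = min(box_p[3], box_gt[3])
    # Intersection area (zero if the boxes do not overlap).
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    return inter / (area_p + area_gt - inter)

# Example: two partially overlapping boxes.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25/175 ~= 0.143
\end{verbatim}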
Detections output by the model are considered as true positive (TP) \abbrev{\id{TP}TP}{True Positive}
if the IoU between a proposal and the ground truth of the object is equal to or greater than a certain threshold and as false positive (FP) \abbrev{\id{FP}FP}{False Positive}
otherwise. Typically, this threshold is set to $50\%$; however, evaluations with thresholds up to 95\% are found in the literature.
A false negative (FN) \abbrev{\id{FN}FN}{False Negative} is a ground truth object that is not detected.
In the object detection context, a true negative (TN) \abbrev{\id{TN}TN}{True Negative} does not make sense since it would be all image regions that were
correctly considered as background, which would amount to a large number of possible bounding boxes.
Although in this work we only consider bounding boxes, the IoU is also applicable to pixel-wise segmentation~\cite{He2017mask}.
\subsection{Precision \& recall}
%
The precision $P$ measures the ratio of true positives to the total number of {\bf predicted detections},
%
\symbl{\id{TP }$TP$}{True Positive}
\symbl{\id{FP }$FP$}{False Positive}
\symbl{\id{P }$P$}{Precision}
\begin{equation}
P = \frac{TP}{TP + FP} = \frac{TP}{\textrm{all detections}},
\label{eq:precision}
\end{equation}
%
% Equation~\eqref{eq:precision},
\ie how accurate the predictions are.
The closer the precision score is to $1.0$, the more likely it is that a detection output by the model is correct.
The recall $R$, also referred to as sensitivity, measures the ratio of true positive detections to the total number of {\bf objects in the dataset},
%
\symbl{\id{R }$R$}{Recall}
\symbl{\id{FN }$FN$}{False Negative}
\begin{equation}
R = \frac{TP}{TP + FN} = \frac{TP}{\textrm{all ground truths}},
\label{eq:recall}
\end{equation}
% Equation~\eqref{eq:recall}
\ie how well the model retrieves the objects in the dataset.
The closer the recall score is to $1.0$, the more likely it is that the objects in the dataset are detected.
%
It is worth mentioning that there is a trade-off between these two metrics: lowering the confidence threshold tends to increase recall at the expense of precision. Both metrics also depend on the IoU threshold previously set.
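%
As a simple illustration of Equations~\eqref{eq:precision} and~\eqref{eq:recall}, the hypothetical Python sketch below computes precision and recall from raw counts of TPs, FPs, and FNs.
\begin{verbatim}
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts."""
    precision = tp / (tp + fp)  # fraction of detections that are correct
    recall = tp / (tp + fn)     # fraction of ground truths that are found
    return precision, recall

# Example: 8 correct detections, 2 spurious ones, 4 missed objects.
print(precision_recall(8, 2, 4))  # (0.8, 0.666...)
\end{verbatim}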
\subsection{(Mean) Average Precision for object detection}
%
The procedure to compute the AP follows.
For a given class, we rank all predictions by the model score, from highest to lowest.
Then, for each rank position, we compute the precision and recall that would be obtained if all predictions down to that position were considered positive by the model.
This is equivalent to varying the model score threshold that determines what is counted as a model-predicted positive detection of the class.
To calculate the AP score, we then average the precision across a set of recall values, as follows:
% Equation~\eqref{eq:AP},
\symbl{\id{r }$AP$}{Average Precision}
\symbl{\id{card }$\card(\cdot)$}{Cardinality of a set}
\symbl{\id{sum }$\sum$}{Summation operator}
\symbl{\id{stepR }$R_s$}{Recall step size}
\symbl{\id{setR }$\Rcal$}{Set of recall values}
%
\begin{equation}
AP = \frac{1}{\card(\Rcal)} \sum_{R\,\in\,\Rcal}P_{\rm interp}(R),
\label{eq:AP}
\end{equation}
%
where $\Rcal$ is the set of recall values from $0$ to $1$ with step size $R_s$;
$\card(\Rcal)$ is the cardinality of the set $\Rcal$; and
$P_{\rm interp}$ is defined as
%
%
\symbl{\id{p_interp }$P_{\rm interp}(R)$}{Interpolated precision at recall $R$}
\symbl{\id{r }$AP$}{Average Precision}
\symbl{\id{prtilde }${P(\tilde{R})}$}{Measured precision at recall ${\tilde{R}}$.}
%
\begin{equation}
P_{\rm interp}(R) = \max_{\tilde{R} \geq R}\,{P(\tilde{R})},
\label{eq:p_interp}
\end{equation}
%
%
where ${P(\tilde{R})}$ is the measured precision at recall ${\tilde{R}}$.
We execute this interpolation in order to smooth the oscillations caused by small variations in the precision computations.
%
%
One may view the AP as the area under the curve (AUC) of the precision-recall graph.
We approximate this computation by interpolating the precision at each recall level $R$, taking the maximum precision measured at any recall that equals or exceeds $R$, as shown in Equation~\eqref{eq:p_interp}.
In~\cite{Everingham10}, the authors vary the recall from 0 to 1, with step size $R_s=0.1$, so that $\card(\Rcal)=11$.
In this work, following~\cite{He2017mask}, we employ step size $R_s=0.01$, so that $\card(\Rcal)=101$.
By lowering $R_s$, we aim to better approximate the AUC \abbrev{\id{AUC}AUC}{Area under Curve}
of the precision-recall graph.
One may notice that to obtain a high score, a method must have high precision at all recall levels -- this penalizes methods which retrieve only a subset of examples with high precision (\eg an object in a certain position)~\cite{Everingham10}.
Also, remember that the IoU has a direct impact on AP since it determines if a detection is a TP or FP.
% It all sounds complicated but gets more comfortable as we illustrated this procedure with an example, as follows.
This computation is exemplified next.
% Let us say we have $5$ instances of a given class in our dataset.
Consider a dataset with $5$ instances of a given class.
We first rank all the model's predictions for that class according to the predicted confidence level (from the highest to the lowest), irrespective of correctness.
Table~\ref{tab:ex_rank_detections} shows an example of hypothetical predictions for those 5 instances ranked by their confidence level.
%
\begin{table}[th]
\centering
\caption{Example of ranked hypothetical detections.}
\label{tab:ex_rank_detections}
\begin{tabular}{ccccc}
\toprule
Rank & Confidence & Correct? & Precision & Recall \\
\midrule
1 & 0.99 & True & 1.00 & 0.2 \\
2 & 0.95 & True & 1.00 & 0.4 \\
3 & 0.82 & False & 0.67 & 0.4 \\
4 & 0.81 & False & 0.50 & 0.4 \\
5 & 0.79 & True & 0.60 & 0.6 \\
6 & 0.78 & False & 0.50 & 0.6 \\
7 & 0.74 & True & 0.57 & 0.8 \\
8 & 0.73 & False & 0.50 & 0.8 \\
9 & 0.63 & False & 0.44 & 0.8 \\
10 & 0.62 & True & 0.50 & 1.0 \\
\bottomrule
\end{tabular}
\end{table}
%
The column ``Correct?'' shows whether the detection matches a ground truth with an IoU equal to or higher than a threshold of, say, 50\%~\cite{Everingham10}.
%
Let us consider the row with rank \#3.
The precision for that row is the proportion of TPs among the detections ranked so far, $P=2/3=0.67$;
and the recall is the ratio of TPs to the total number of ground truth instances, $R = 2/5 = 0.4$.
We can notice that the recall keeps increasing as we include more predictions (\ie lower the model confidence threshold), while the precision goes up and down.
Figure~\ref{fig:prec-rec_curve} shows the precision-recall curve, obtained by computing $P$ and $R$ for all rows.
%
\begin{figure}[th]
\centering
\includegraphics[width=.63\linewidth]{precision.pdf}
\caption{Precision-recall curve.}
\label{fig:prec-rec_curve}
\end{figure}
Again, one may view AP as the AUC of the precision-recall curve.
Remember that we approximate the computation by first smoothing the precision oscillations according to Equation~\eqref{eq:p_interp}, which is easier to visualize in Figure~\ref{fig:prec-rec_curve_interp_ex},
where we give an example of computing $P_{\rm interp}(0.7)$.
%
%
\begin{figure}[th!]
\centering
\includegraphics[width=.73\linewidth]{precision_interp_r.pdf}
\caption[Example of computing $P_{\rm interp}$]{Example of computing $P_{\rm interp}$. In this case, $P_{\rm interp}(0.7) = 0.57$.}
\label{fig:prec-rec_curve_interp_ex}
\end{figure}
%
Figure~\ref{fig:prec-rec_curve_interp} presents the curve of $P_{\rm interp}$ computed across all recall values.
%
%
\begin{figure}[bh!]
\centering
\includegraphics[width=.63\linewidth]{precision_interp.pdf}
\caption{Precision-recall curve with $P_{\rm interp}$.}
\label{fig:prec-rec_curve_interp}
\end{figure}
%
Finally, we may compute the AP of our example, using Equation~\eqref{eq:AP}. Since we varied the recall from $0$ to $1$ with $R_s=0.1$, $\card(\Rcal)=11$.
\begin{equation*}
\begin{split}
AP &= \frac{1}{11}\bigg(P_{\rm interp}(0.0)+P_{\rm interp}(0.1)+...+P_{\rm interp}(1.0)\bigg) \\
&= \frac{1}{11}\bigg(1.00 + 1.00 + 1.00 + 1.00 + 1.00 + 0.60 + 0.60 + 0.57 + 0.57 + 0.50 + 0.50 \bigg) \\
&= 0.7582.
\end{split}
\end{equation*}
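For reference, this computation can be reproduced with the short Python sketch below, which recomputes the precision and recall of Table~\ref{tab:ex_rank_detections} from the correctness flags and then applies Equations~\eqref{eq:p_interp} and~\eqref{eq:AP} with 11 recall points. It yields $\approx 0.7584$; the small difference with respect to the value above comes from the two-decimal rounding of the precisions in the table.
\begin{verbatim}
# Ranked detections of the example (True = TP, False = FP); 5 ground truths.
correct = [True, True, False, False, True, False, True, False, False, True]
num_gt = 5

# Cumulative precision and recall down each rank position.
precisions, recalls, tp = [], [], 0
for k, is_tp in enumerate(correct, start=1):
    tp += is_tp
    precisions.append(tp / k)
    recalls.append(tp / num_gt)

# Interpolated precision: maximum precision at any recall >= R.
def p_interp(R):
    candidates = [p for p, r in zip(precisions, recalls) if r >= R]
    return max(candidates) if candidates else 0.0

# AP with 11 recall points (R_s = 0.1).
R_set = [i / 10 for i in range(11)]
ap = sum(p_interp(R) for R in R_set) / len(R_set)
print(round(ap, 4))  # 0.7584
\end{verbatim}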
So far, we have defined the AP and seen the impact of the IoU threshold in its computation.
We may now calculate the mAP by computing the AP for each of the $M$ classes in the dataset (and/or each IoU threshold) and taking the average over them, as follows:
%
\begin{equation}
mAP = \frac{1}{M}\sum_{m=1}^{M}AP_m,
\label{eq:mAP}
\end{equation}
%
where $AP_m$ is the Average Precision computed for the $m$-th class and/or IoU threshold.
%
Depending on the competition, the procedure for computing the mAP may differ.
In the next section, we discuss two well-known online object detection competitions.
\subsection{Online challenges}
%
The PASCAL Visual Object Classes (VOC) is an object detection dataset and challenge~\cite{Everingham10}.
\abbrev{\id{VOC}VOC}{PASCAL Visual Object Classes dataset}
In this challenge, a prediction is considered a TP if $IoU \geq 0.5$.
In the case of multiple detections of the same object, the first one is counted as a positive and the others as false positives.
Thus, it is the responsibility of the competitor to deal with multiple detections of the same object.
The mAP in PASCAL VOC is calculated by computing the AP as discussed previously, considering $IoU \geq0.5$, and averaging over all 20 object categories in the dataset.
%
%http://cocodataset.org/#detection-eval
Recent works~\cite{He2017mask} tend to report results only for the Microsoft Common Objects in Context (MSCOCO) dataset~\cite{MSCOCO2014}.
\abbrev{\id{MSCOCO}MSCOCO}{Microsoft Common Objects in Context dataset}
There are 12 metrics to assess the performance of an object detector on MSCOCO.
Nevertheless, we only focus on the 6 metrics based on AP.
The primary challenge metric for this competition averages the AP over IoU thresholds from $0.5$ to $0.95$ with a step size of $0.05$ (AP at $[0.5:0.05:0.95]$).
Averaging over higher IoU thresholds, instead of considering only one more tolerant threshold such as $IoU \geq 0.5$, tends to reward detectors with better localization.
Two other MSCOCO challenge metrics consider a single IoU threshold each: $IoU \geq 0.5$ (just like in PASCAL VOC) and $IoU \geq 0.75$.
For the MSCOCO challenge, the AP is averaged over all 80 object categories to compute the mAP.
In the MSCOCO dataset, $41\%$ of the objects are considered small (area $< 32^2$ pixels), $34\%$ medium ($32^2 < $ area $< 96^2$), and $24\%$ large (area $> 96^2$)~\cite{MSCOCO2014}.
The object size affects the model accuracy substantially~\cite{Everingham10}.
In~\cite{Everingham10, He2017mask}, it is possible to observe the performance of the methods increasing as the object size increases.
The MSCOCO challenge presents three metrics that consider object size, namely mAP$_{\textrm{S}}$, mAP$_{\textrm{M}}$, and mAP$_{\textrm{L}}$, to evaluate the detections of small, medium, and large objects, respectively.
For those metrics, detections of objects with an area outside the corresponding range are not considered.
The area is computed as the number of pixels of the ground truth bounding box.
%
% \red{
% Yet, MSCOCO computes the average recall (AR) which is the maximum recall given a fixed number of detections per image, averaged over categories and IoUs~\cite{Hosang2016}.
% AR is related to the metric of the same name used in proposal evaluation but is computed on a per-category basis.
% They also compute the AR for small, medium and large objects.
% Although we mention these last 6 metrics, we do not use them to evaluate our models.
% }
From now on, unless otherwise noted, the (m)AP is averaged over the multiple IoU values $[0.50: 0.05: 0.95]$, for simplicity.
We summarize the MSCOCO metrics based on AP in Table~\ref{tab:metrics}.
%
\begin{table}[h]
\caption{Summary of MSCOCO metrics based on AP.}
\label{tab:metrics}
\centering
\begin{tabular}{@{}ll@{}}
\toprule
\multicolumn{1}{c}{{\bf Metric}} & \multicolumn{1}{c}{{\bf Description}} \\ \midrule
mAP & mAP averaged over $IoU$ thresholds $0.5:0.05:0.95$\\
mAP$_{50}$ & mAP at $IoU \geq 0.50$\\
mAP$_{75}$ & mAP at $IoU \geq 0.75$\\
\midrule
mAP$_{\textrm{S}}$ & mAP for small objects (area $ < 32^2$)\\
mAP$_{\textrm{M}}$ & mAP for medium objects ($32^2 <$ area $< 96^2$)\\
mAP$_{\textrm{L}}$ & mAP for large objects (area $ > 96^2$)\\
% \midrule
% AR$^{\textrm{max}=1}$ & AR given 1 detection per image\\
% AR$^{\textrm{max}=10}$ & AR given 10 detections per image\\
% AR$^{\textrm{max}=100}$ & AR given 100 detections per image\\
% \midrule
% AR$_{\textrm{S}}$ & AR for small objects (area $ < 32^2$)\\
% AR$_{\textrm{M}}$ & AR for medium objects ($32^2 <$ area $< 96^2$)\\
% AR$_{\textrm{L}}$ & AR for large objects (area $ > 96^2$)\\
\bottomrule
\end{tabular}
\end{table}
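%
In practice, these metrics are rarely computed by hand. The sketch below illustrates how COCO-style AP metrics can be obtained with the \texttt{pycocotools} package; the annotation and detection file names are hypothetical, and the evaluation pipeline of the benchmark code we use may differ in its details.
\begin{verbatim}
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Hypothetical COCO-format ground truth and detection files.
coco_gt = COCO("annotations_test.json")
coco_dt = coco_gt.loadRes("detections_test.json")

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()    # match detections to ground truths at each IoU
coco_eval.accumulate()  # build the precision-recall data
coco_eval.summarize()   # prints the COCO AP (and AR) metrics
\end{verbatim}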
%
\section{Implementation details}
%
Next, we discuss the implementation details, including network architectures and the hyper-parameters used in the training and test phases.
We use the Mask R-CNN benchmark implementation, available under the MIT license~\cite{massa2018mrcnn}.
It is worth mentioning that we do not tune any hyper-parameter since we do not have a validation set.
All of them were chosen based on~\cite{He2017mask} and on the characteristics of the dataset used in our experiments.
The experiments were performed on a machine equipped with 4 GTX 1080 GPUs, 64~GB of DDR4 2133~MHz RAM, an Intel\textsuperscript{TM} Core i7-6850K 3.6~GHz processor, and Ubuntu 16.04 as the operating system.
\subsection{Dataset}
%
% Unfortunately, we do not use our MBG dataset here, since it is still under labeling process.
Since the MBG dataset described in Section~\ref{sec:new_data} is still being labeled, we use the publicly available CEFET dataset\footnote{from: \url{https://drive.google.com/open?id=1tDOVdb_vALUnD_cY3lQf0ggoiM1F63Jl}.} to train and evaluate our models.
% only the videos recorded at $5m$ are public available.
Again, the train-test split of this dataset is included in the annotation files.
% We noticed that they cut the videos in parts, which they call ``Tomada''.
In the CEFET dataset, each video is cut into several parts.
Two parts of the same video may appear, for example, one in the training set and the other in the test set.
As we are using an approach based on isolated images instead of videos, this split may unduly facilitate the task of our detector, since two takes of the same video share the same background and objects placed in the same positions.
%
Therefore, in this work, we adopted a train-test split based on the videos, \ie all the parts of a video go either to the training set or to the test set.
Having split the videos, we extract one image every 30 frames.
In total, there are 419 training images, containing 374 tires, and 144 test images, containing 449 tires.
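The frame extraction step mentioned above can be sketched as below using OpenCV; the video file name and output folder are hypothetical, and this is not necessarily the exact script we used.
\begin{verbatim}
import os
import cv2  # OpenCV

def extract_frames(video_path, out_dir, step=30):
    """Save one image every `step` frames of a video."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of the video
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{idx:06d}.png"), frame)
        idx += 1
    cap.release()

# Hypothetical usage for one of the CEFET videos.
extract_frames("video_777.mp4", "frames/video_777", step=30)
\end{verbatim}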
% how many annoted objects?
%Train: 777, 800, 807 (419 imgs)
%Test: 804, 806, 810 (144 imgs)
%every 30 frames
Although we do not have the ground truth bounding boxes for the MBG dataset yet, we run the videos through our trained models and visually analyze some of the obtained results.
%
\subsection{Network architectures}
We instantiate Faster R-CNN with different network architectures.
We define the Faster R-CNN as composed of two parts:
(i) the network backbone: the convolutional network architecture (\eg VGG~\cite{Simonyan2015VGG}, ResNet~\cite{He2016deep}) responsible for the feature extraction task over images; and
(ii) the network head: responsible for image classification and bounding box regression tasks~\cite{He2017mask}.
We use the network depth (number of stacked layers) nomenclature to denote the backbone architecture.
We perform experiments using ResNet~\cite{He2016deep}
% \red{and ResNeXt~\cite{RESNEXT}}
with depths of 50 and 101.
Following the original implementation of Faster R-CNN with ResNets,
we extract features from the final convolutional layer of the 4th stage, which we call C4.
This is widely used in the literature~\cite{He2016deep, Huang2017,Shrivastava2016skip}.
We denote this backbone by \mbox{R-$<50, 101>$-C4}.
%
The network head follows the architectures presented in Faster R-CNN~\cite{Ren2017fasterpami}.
% \red{
% %Specifically, we extend the Faster R-CNN box heads from the ResNet [19] and FPN [27] papers. Details are shown in Figure 4.
% The head on the ResNet-C4 backbone includes the 5th stage of ResNet (namely, the 9-layer ‘res5’~\cite{He2016deep}), which is compute-intensive.
% % For FPN, the backbone already includes res5 and thus allows for a more efficient head that uses fewer filters.
% }
\subsection{Training and inference}
%
% We follow the hyperparameters as in~\cite{He2017mask}.
During training, the positive samples are the RoIs that have an $IoU \geq 0.5$ with a ground truth box, and all other RoIs are considered negative samples, as in~\cite{Girshick2015}.
% The mask target is the intersection between an RoI and its associated ground truth mask.
We adopt image-centric training~\cite{Girshick2015}.
We resize the images so that the shorter edge has at most 800 pixels, while keeping the aspect ratio~\cite{He2017mask}.
%and larger ... and 1,333 ... respectively
Our mini-batch has $4$ images (we train on 2 GPUs, 2 images per GPU).
From each image, we sample $N=64$ RoIs with a 1:3 ratio of positives to negatives~\cite{Girshick2015, Ren2017fasterpami}.
% For the C4 backbone $N$ is $64$~\cite{Girshick2015, Ren2017fasterpami}.
% and 512 for FPN (as in~\cite{Lin2017pyramid}).
We train our models for $18$k iterations, with a learning rate of $0.005$, which is decreased by a factor of $10$ at the $12$k-th and $16$k-th iterations.
We use a weight decay of 0.0001 and a momentum of 0.9.
%With ResNeXt~\cite{Xie2017ResneXt}, we train with 1 image per GPU and the same number of iterations, with a starting learning rate of 0.01.
% To train RPN, we consider as positive the anchors with $IoU\geq 0.7$ and negative with $IoU\leq 0.3$~\cite{Ren2017fasterpami}.
We use RPN anchors at 5 scales (32, 64, 128, 256, and 512) and 3 aspect ratios (1:2, 1:1, and 2:1) with respect to the resized input images~\cite{Lin2017pyramid}.
We set the number of proposals output by the RPN to $600$.
These boxes highly overlap.
In order to reduce the redundancy caused by these overlaps,
we apply NMS with an IoU threshold of $0.7$ and keep only the top-$50$ ranked proposals, based on the $cls$ score (see Section~\ref{sec:rpn}), for Fast R-CNN training~\cite{Ren2017fasterpami}.
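To make the anchor configuration above concrete, the sketch below enumerates the 15 anchor shapes (5 scales $\times$ 3 aspect ratios) generated at each spatial location; the actual anchor generation in the benchmark implementation may differ in details such as rounding and the exact ratio convention.
\begin{verbatim}
# 5 scales and 3 aspect ratios -> 15 anchor shapes per location.
scales = [32, 64, 128, 256, 512]  # anchor sizes in pixels
aspect_ratios = [0.5, 1.0, 2.0]   # interpreted here as height/width

anchor_shapes = []
for s in scales:
    for ar in aspect_ratios:
        # Keep the anchor area close to s*s while varying its shape.
        w = s / ar ** 0.5
        h = s * ar ** 0.5
        anchor_shapes.append((round(w), round(h)))

print(len(anchor_shapes))  # 15
print(anchor_shapes[:3])   # the three shapes for the 32-pixel scale
\end{verbatim}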
% \red{
% % The RPN anchors span 5 scales and 3 aspect ratios, following~\cite{Lin2017pyramid}.
% For convenient ablation, RPN is trained separately and does not share features with Mask R-CNN, unless specified.
% For every entry in this paper, RPN and Mask R-CNN have the same backbones and so they are shareable.
% }
% \subsection{Inference}
We perform batch inference using 2 GPUs and 4 images per batch.
% and 1000 for FPN~\cite{Lin2017pyramid}.
We set the number of proposals to 300, run predictions on them, and suppress duplicate detections by applying NMS~\cite{Girshick2015DPM} at $IoU=0.5$.
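Since NMS is used both to filter the RPN proposals and to post-process the final detections, we include below a simplified sketch of greedy NMS in plain Python; the implementation in the benchmark code is vectorized and runs on the GPU, so this sketch is for illustration only.
\begin{verbatim}
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes, as sketched earlier."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Example: two near-duplicate detections and one distinct detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, iou_thresh=0.5))  # [0, 2]
\end{verbatim}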
\section{Results}
%
In this section, we analyze our results both quantitatively and qualitatively.
For the former, we plot the precision-recall curves for the results obtained on the test set.
Also, we summarize these curves into single numbers as discussed in Section~\ref{sec:eval}.
Moreover, we make a visual analysis of our results by looking at the detection outputs from our models.
\subsection{Quantitative results}
%
It took about $2$~h to train the models with the R-50-C4 backbone, while the models with the R-101-C4 backbone took about $3.5$~h.
For inference, the models with the R-50-C4 backbone took about $90$~ms per image against $140$~ms for R-101-C4.
%
Table~\ref{tab:results_CEFET} shows the results for the CEFET dataset.
We report the mAP (averaged over multiple IoUs), mAP$_{50}$, mAP$_{75}$, mAP$_{\textrm{M}}$, mAP$_{\textrm{L}}$ metrics, summarized in Table~\ref{tab:metrics}.
We can notice an improvement of almost 5 points in mAP by simply adopting a random horizontal flip, with probability of 50\%, as the data augmentation strategy with the R-50-C4 backbone.
Unless mentioned otherwise, we keep this data augmentation method in the other experiments.
%~
\begin{table}[b!]
\centering
\caption{Main results for CEFET dataset.}
\label{tab:results_CEFET}
\resizebox{\textwidth}{!}{%
\begin{tabular}{@{}c|l|c|ccc|cc@{}}
\toprule
& & backbone & mAP & mAP$_{50}$ & mAP$_{75}$ & mAP$_\textrm{M}$ & mAP$_\textrm{L}$ \\
\hline
\multirow{3}{*}{train} & Faster R-CNN (no aug.) & R-50-C4 & 89.86 & 90.19 & 90.19 & 86.15 & 92.16 \\
& Faster R-CNN & R-50-C4 & 89.11 & 89.84 & 89.84 & 87.32 & 90.34 \\
& Faster R-CNN & R-101-C4 & 88.95 & 89.48 & 89.48 & 87.52 & 91.02 \\
\hline
\multirow{3}{*}{test} & Faster R-CNN (no aug.) & R-50-C4 & 43.81 & 62.25 & 53.64 & 34.46 & 59.56 \\
& Faster R-CNN & R-50-C4 & 47.38 & 64.16 & 57.70 & 38.42 & 61.85\\
& Faster R-CNN & R-101-C4 & 49.31 & 66.68 & 62.61 & 39.46 & 65.21 \\
\bottomrule
\end{tabular}
}
\end{table}
We also observe that Faster R-CNN with the R-101-C4 backbone obtained the best result.
This is due to deeper networks being better feature extractors.
% However, for these networks we may be caution with overfitting since the deeper the network more parameters it has.
% That was not the case here, since the train results are compared to the other architecture.
While deeper networks are more prone to overfitting due to the larger number of parameters, we did not observe any overfitting in our training.
We also evaluate the impact of object size in the results.
As expected, we notice better results for large objects than for medium objects.
Since we have no objects with area $< 32^2$ in the CEFET dataset, mAP$_{\textrm{S}}$ does not apply.
%
%
Moreover, we plot the precision-recall curve for the trained model with R-101-C4 backbone %(Figures~\ref{fig:pr_R50C4_noaug},~\ref{fig:pr_R50C4}, and~\ref{fig:pr_R101C4}) at IoU from $0.5$ to $0.95$
(Figure~\ref{fig:pr_R101C4}) at IoU varying from $0.5$ to $0.95$ with $0.05$ step.
As expected, the higher the IoU threshold is, the worse the results are.
That happens because, as we increase the IoU threshold, we only accept more accurate detections as TPs;
as a consequence, the precision and recall drop drastically.
We can observe that we can still achieve satisfactory results at IoU thresholds up to $0.75$, for which all models achieve precision higher than $0.9$.
Unfortunately, none of the models achieves high precision at high recall.
By analyzing Equation~\eqref{eq:recall}, this result may be due to a high rate of FNs in our results.
We show Figures~\ref{fig:prec-rec_curve_50} and~\ref{fig:prec-rec_curve_75} to better analyze the models at single IoU thresholds.
As can be noticed, R-101-C4 keeps higher precision at higher recall when compared to the models with the R-50-C4 backbone.
% %
% \begin{figure*}[htb!]
% \centering
% \includegraphics[width=.5\linewidth, trim={0 0 0 0},clip]{pr_curve_e2e_faster_rcnn_R_50_C4_1x_cocostyle_NO-AUG_zoom_1.pdf}~
% \includegraphics[width=.5\linewidth, trim={0 0 0 0},clip]{pr_curve_e2e_faster_rcnn_R_50_C4_1x_cocostyle_NO-AUG_zoom_2.pdf}\\
% \vspace{2mm}
% \includegraphics[width=\linewidth, trim={0 0 0 0},clip]{pr_curve_e2e_faster_rcnn_R_50_C4_1x_cocostyle_NO-AUG.pdf}
% % \includegraphics[width=.5\linewidth]{base4.png}
% \caption{Precision-recall curve for R-50-C4 (no aug.) at various IoUs.}
% \label{fig:pr_R50C4_noaug}
% \end{figure*}
% %
% \begin{figure*}[htb!]
% \centering
% \includegraphics[width=.5\linewidth, trim={0 0 0 0},clip]{pr_curve_e2e_faster_rcnn_R_50_C4_1x_cocostyle_zoom_1.pdf}~
% \includegraphics[width=.5\linewidth, trim={0 0 0 0},clip]{pr_curve_e2e_faster_rcnn_R_50_C4_1x_cocostyle_zoom_2.pdf}\\
% \vspace{2mm}
% \includegraphics[width=\linewidth, trim={0 0 0 0},clip]{pr_curve_e2e_faster_rcnn_R_50_C4_1x_cocostyle.pdf}
% % \includegraphics[width=.5\linewidth]{base4.png}
% \caption{Precision-recall curve for R-50-C4 at various IoUs.}
% \label{fig:pr_R50C4}
% \end{figure*}
%
\begin{figure*}[htb!]
\centering
\includegraphics[width=.5\linewidth, trim={0 0 0 0},clip]{pr_curve_e2e_faster_rcnn_R_101_C4_1x_cocostyle_zoom_1.pdf}~
\includegraphics[width=.5\linewidth, trim={0 0 0 0},clip]{pr_curve_e2e_faster_rcnn_R_101_C4_1x_cocostyle_zoom_2.pdf}\\
\vspace{2mm}
\includegraphics[width=\linewidth, trim={0 0 0 0},clip]{pr_curve_e2e_faster_rcnn_R_101_C4_1x_cocostyle.pdf}
% \includegraphics[width=.5\linewidth]{base4.png}
\caption{Precision-recall curve for R-101-C4 at various IoUs.}
\label{fig:pr_R101C4}
\end{figure*}
%
%
\begin{figure}[htb!]
\centering
\includegraphics[width=.85\linewidth]{pr_curve_comp_50.pdf}
\caption{Precision-recall curve at IoU = 0.50.}
\label{fig:prec-rec_curve_50}
\end{figure}
%
\begin{figure}[htb!]
\centering
\includegraphics[width=.85\linewidth]{pr_curve_comp_75.pdf}
\caption{Precision-recall curve at IoU = 0.75.}
\label{fig:prec-rec_curve_75}
\end{figure}
%
% FIRST ROUND!
% %
% \begin{table}[h]
% \centering
% \resizebox{\textwidth}{!}{%
% \begin{tabular}{@{}c|l|c|ccc|ccc@{}}
% \toprule
% & & backbone & mAP & mAP$_{50}$ & mAP$_{75}$ & mAP$_\textrm{S}$ & mAP$_\textrm{M}$ & mAP$_\textrm{L}$ \\
% \hline
% \multirow{3}{*}{train} & Faster R-CNN (no aug.) & R-50-C4 & 89.30 & 89.95 & 89.95 & -- & 88.78 & 90.88 \\
% & Faster R-CNN & R-50-C4 & 89.17 & 90.42 & 90.42 & -- & 88.14 & 90.88 \\
% & Faster R-CNN & R-101-C4 & 89.21 & 89.95 & 89.95 & -- & 85.74 & 91.66 \\
% \hline
% \multirow{3}{*}{test} & Faster R-CNN (no aug.) & R-50-C4 & 42.71 & 61.37 & 52.33 & -- & 33.75 & 57.75 \\
% & Faster R-CNN & R-50-C4 & 47.49 & 63.04 & 56.60 & -- & 37.68 & 63.52 \\
% & Faster R-CNN & R-101-C4 & 47.22 & 61.88 & 56.52 & -- & 36.92 & 64.21 \\
% \bottomrule
% \end{tabular}
% }
% \caption{Main results for CEFET dataset.}
% \label{tab:res}
% \end{table}
% \begin{table}[h]
% \centering
% \resizebox{\textwidth}{!}{%
% \begin{tabular}{@{}l|c|ccc|ccc@{}}
% \toprule
% & backbone & mAP & mAP$_{50}$ & mAP$_{75}$ & mAP$_{\textrm{S}}$ & mAP$_{\textrm{M}}$ & mAP$_{\textrm{L}}$ \\ \hline
% Faster R-CNN (no aug.) & R-50-C4 & 42.71 & 61.37 & 52.33 & -- & 33.75 & 57.75 \\
% Faster R-CNN & R-50-C4 & 47.49 & 63.04 & 56.60 & -- & 37.68 & 63.52 \\
% Faster R-CNN & R-101-C4 & 47.22 & 61.88 & 56.52 & -- & 36.92 & 64.21 \\
% \bottomrule
% \end{tabular}
% }
% \caption{Results for CEFET dataset.}
% \label{tab:results_CEFET}
% \end{table}
% %
% %
% %
% \begin{table}[h]
% \centering
% \resizebox{\textwidth}{!}{%
% \begin{tabular}{@{}l|c|ccc|ccc@{}}
% \toprule
% & backbone & mAP & mAP$_{50}$ & mAP$_{75}$ & mAP$_{\textrm{S}}$ & mAP$_{\textrm{M}}$ & mAP$_{\textrm{L}}$ \\ \hline
% Faster R-CNN (no aug.) & R-50-C4 & 89.30 & 89.95 & 89.95 & -- & 88.78 & 90.88 \\
% Faster R-CNN & R-50-C4 & 89.17 & 90.42 & 90.42 & -- & 88.14 & 90.88 \\
% Faster R-CNN & R-101-C4 & 89.21 & 89.95 & 89.95 & -- & 85.74 & 91.66 \\
% \bottomrule
% \end{tabular}
% }
% \caption{Results for CEFET TRAIN dataset.}
% \label{tab:results_CEFET}
% \end{table}
%
%
% %
% \begin{table}[h]
% \centering
% \resizebox{\textwidth}{!}{%
% \begin{tabular}{@{}c|l|c|ccc|ccc@{}}
% \toprule
% & & backbone & mAP & mAP$_{50}$ & mAP$_{75}$ & mAP$_\textrm{S}$ & mAP$_\textrm{M}$ & mAP$_\textrm{L}$ \\
% \hline
% \multirow{3}{*}{train} & Faster R-CNN (no aug.) & R-50-C4 & 89.55 & 89.96 & 89.96 & -- & 88.37 & 91.18 \\
% & Faster R-CNN & R-50-C4 & 89.36 & 90.43 & 90.17 & -- & 87.29 & 91.24 \\
% & Faster R-CNN & R-101-C4 & 89.45 & 90.19 & 90.19 & -- & 86.90 & 91.57 \\
% \hline
% \multirow{3}{*}{test} & Faster R-CNN (no aug.) & R-50-C4 & 46.31 &62.10 &58.18 & -- & 35.06 & 64.76 \\
% & Faster R-CNN & R-50-C4 & 46.60 & 61.47 & 58.26 & -- & 36.43 & 63.63 \\
% & Faster R-CNN & R-101-C4 & 49.62 & 65.04 & 61.15 & -- & 38.52 & 67.95 \\
% \bottomrule
% \end{tabular}
% }
% \end{table}
% %
% \begin{table}[h]
% \centering
% \resizebox{\textwidth}{!}{%
% \begin{tabular}{@{}l|c|ccc|ccc@{}}
% \toprule
% & backbone & mAP & mAP$_{50}$ & mAP$_{75}$ & mAP$_{\textrm{S}}$ & mAP$_{\textrm{M}}$ & mAP$_{\textrm{L}}$ \\ \hline
% Faster R-CNN (no aug.) & R-50-C4 & 46.31 &62.10 &58.18 & -- & 35.06 & 64.76 \\
% Faster R-CNN & R-50-C4 & 46.60 & 61.47 & 58.26 & -- & 36.43 & 63.63 \\
% Faster R-CNN & R-101-C4 & 49.62 & 65.04 & 61.15 & -- & 38.52 & 67.95 \\
% \bottomrule
% \end{tabular}
% }
% \caption{NEW Results for CEFET dataset.}
% \label{tab:results_CEFET}
% \end{table}
% %
% %
% %
% \begin{table}[h]
% \centering
% \resizebox{\textwidth}{!}{%
% \begin{tabular}{@{}l|c|ccc|ccc@{}}
% \toprule
% & backbone & mAP & mAP$_{50}$ & mAP$_{75}$ & mAP$_{\textrm{S}}$ & mAP$_{\textrm{M}}$ & mAP$_{\textrm{L}}$ \\ \hline
% Faster R-CNN (no aug.) & R-50-C4 & 89.55 & 89.96 & 89.96 & -- & 88.37 & 91.18 \\
% Faster R-CNN & R-50-C4 & 89.36 & 90.43 & 90.17 & -- & 87.29 & 91.24 \\
% Faster R-CNN & R-101-C4 & 89.45 & 90.19 & 90.19 & -- & 86.90 & 91.57 \\
% \bottomrule
% \end{tabular}
% }
% \caption{NEW Results for CEFET TRAIN dataset.}
% \label{tab:results_CEFET}
% \end{table}
% %
% \begin{table}[h]
% \centering
% \resizebox{\textwidth}{!}{%
% \begin{tabular}{@{}c|l|c|ccc|ccc@{}}
% \toprule
% & & backbone & mAP & mAP$_{50}$ & mAP$_{75}$ & mAP$_\textrm{S}$ & mAP$_\textrm{M}$ & mAP$_\textrm{L}$ \\
% \hline
% \multirow{3}{*}{train} & Faster R-CNN (no aug.) & R-50-C4 & 90.78 & 95.67 & 95.59 & -- & 90.72 & 91.75 \\
% & Faster R-CNN & R-50-C4 & 86.95 & 94.53 & 93.65 & -- & 85.58 & 87.98 \\
% & Faster R-CNN & R-101-C4 & 85.23 & 93.31 & 93.31 & -- & 86.85 & 85.52 \\
% \hline
% \multirow{3}{*}{test} & Faster R-CNN (no aug.) & R-50-C4 & 45.92 & 65.76 & 58.05 & -- & 35.24 & 63.11 \\
% & Faster R-CNN & R-50-C4 & 51.31 & 69.69 & 64.34 & -- & 41.98 & 67.50 \\
% & Faster R-CNN & R-101-C4 & 55.77 & 75.89 & 70.04 & -- & 46.63 & 70.93 \\
% \bottomrule
% \end{tabular}
% }
% \caption{Results for CEFET TRAIN dataset.}
% \label{tab:results_CEFET}
% \end{table}
\newpage
\subsection{Visual analysis}\label{sec:res_vis}
%
In this section, we discuss some visual results by analyzing the detections in the images.
We try to associate the visualizations with the numerical results obtained in the previous section.
To do so, we plot the ground truth bounding boxes in blue and overlay the models' detections in red, along with the label and confidence scores.
As mentioned, the recall is low for all models, which may be due to a high rate of FNs in our results.
By looking at an example in Figure~\ref{fig:hard}, one may notice that there are many hard-to-detect tires in the dataset, even for humans.
In this image, none of the models was capable of detecting any tire.
%
% hard: no one model detected anything
\begin{figure}[h!]
\centering
\includegraphics[width=\textwidth,trim={0 2.9cm 0 2.4cm},clip]{_vis_results/e2e_faster_rcnn_R_101_C4_1x_cocostyle/img_30.pdf}
\caption{Hard example.}
\label{fig:hard}
\end{figure}
% improvement
%
Another interesting fact can be observed in Figure~\ref{fig:improv_1} where the same tire (the rightmost one) was not detected by the R-50-C4 model without data augmentation; detected with a low confidence score ($0.28$) by the model with R-50-C4, using data augmentation; and detected with a high confidence score ($1.00$) by the model with R-101-C4, also using data augmentation.
All models found the leftmost tire with the same confidence score and none of them found the tire at the top.
Following this, we can observe a similar case in Figure~\ref{fig:improv_2}, where neither of the models with the R-50-C4 backbone was capable of finding the tire, while the model with R-101-C4 detected it with a high confidence score ($0.97$).
%
\begin{figure}[th!]
\centering
\begin{subfigure}[t]{.9\linewidth}
\centering
\includegraphics[width=\textwidth,trim={0 8.5cm 0 2.4cm},clip]{_vis_results/e2e_faster_rcnn_R_50_C4_1x_cocostyle_NO-AUG/img_1.pdf}
\caption{R-50-C4 (no aug.).}
\label{fig:improv_50N}
\end{subfigure}\\
%
\begin{subfigure}[t]{0.9\linewidth}
\centering
\includegraphics[width=\textwidth,trim={0 8.5cm 0 2.4cm},clip]{_vis_results/e2e_faster_rcnn_R_50_C4_1x_cocostyle/img_1.pdf}
\caption{R-50-C4.}
\label{fig:improv_50}
\end{subfigure}
%
\begin{subfigure}[t]{0.9\linewidth}
\centering
\includegraphics[width=\textwidth,trim={0 8.5cm 0 2.4cm},clip]{_vis_results/e2e_faster_rcnn_R_101_C4_1x_cocostyle/img_1.pdf}
\caption{R-101-C4.}
\label{fig:improv_101}
\end{subfigure}
\caption{Detection improvement over the models (cropped images).}
\label{fig:improv_1}
\end{figure}
%
%
% only R-101 detected
\begin{figure}[th!]
\centering
\begin{subfigure}[t]{.49\linewidth}
\centering
\includegraphics[width=.7\textwidth,trim={2cm 2.9cm 11.5cm 8.4cm},clip]{_vis_results/e2e_faster_rcnn_R_50_C4_1x_cocostyle_NO-AUG/img_12.pdf}
\caption{R-50-C4 (no aug.).}
\label{fig:improv2_50N}
\end{subfigure}~
%
\begin{subfigure}[t]{0.49\linewidth}
\centering
\includegraphics[width=.7\textwidth,trim={2cm 2.9cm 11.5cm 8.4cm},clip]{_vis_results/e2e_faster_rcnn_R_50_C4_1x_cocostyle/img_12.pdf}
\caption{R-50-C4.}
\label{fig:improv2_50}
\end{subfigure}\\
%
\begin{subfigure}[t]{0.49\linewidth}
\centering
\includegraphics[width=.7\textwidth,trim={2cm 2.9cm 11.5cm 8.4cm},clip]{_vis_results/e2e_faster_rcnn_R_101_C4_1x_cocostyle/img_12.pdf}
\caption{R-101-C4.}
\label{fig:improv2_101}
\end{subfigure}
\caption{Another detection improvement over the models (cropped images).}
\label{fig:improv_2}
\end{figure}
%
% % 75
% \begin{figure}[th!]
% \centering
% \begin{subfigure}[t]{.85\linewidth}
% \centering
% \includegraphics[width=\textwidth,trim={0 0 0 0},clip]{_vis_results/e2e_faster_rcnn_R_50_C4_1x_cocostyle_NO-AUG/img_75.pdf}
% \caption{R-50-C4 (no aug.).}
% \label{fig:improv_50N}
% \end{subfigure}\\
% %
% \begin{subfigure}[t]{0.85\linewidth}
% \centering
% \includegraphics[width=\textwidth,trim={0 0 0 0},clip]{_vis_results/e2e_faster_rcnn_R_50_C4_1x_cocostyle/img_75.pdf}
% \caption{R-50-C4.}
% \label{fig:improv_50}
% \end{subfigure}
% %
% \begin{subfigure}[t]{0.85\linewidth}
% \centering
% \includegraphics[width=\textwidth,trim={0 0 0 0},clip]{_vis_results/e2e_faster_rcnn_R_101_C4_1x_cocostyle/img_75.pdf}
% \caption{R-101-C4.}
% \label{fig:improv_101}
% \end{subfigure}
% \caption{75}
% \label{fig:improv}
% \end{figure}
%
%
In Figure~\ref{fig:FP_cases}, we observe some cases of false positives.
For the same image, only the model with the R-50-C4 backbone and data augmentation outputs the correct detections.
The other two models output one FP each. The R-50-C4 (no aug.) model wrongly outputs a tire with a low confidence score ($0.05$) between two true tires, while the R-101-C4 model predicted the yellow garbage bin as a tire with a high confidence score ($1.00$).
%
% 78
\begin{figure}[ht!]
\centering
\begin{subfigure}[t]{.85\linewidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 0},clip]{_vis_results/e2e_faster_rcnn_R_50_C4_1x_cocostyle_NO-AUG/img_78.pdf}
\caption{R-50-C4 (no aug.).}
\label{fig:FP_cases_50N}
\end{subfigure}\\
%
\begin{subfigure}[t]{0.85\linewidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 0},clip]{_vis_results/e2e_faster_rcnn_R_50_C4_1x_cocostyle/img_78.pdf}
\caption{R-50-C4.}
\label{fig:FP_cases_50}
\end{subfigure}
%
\begin{subfigure}[t]{0.85\linewidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 0},clip]{_vis_results/e2e_faster_rcnn_R_101_C4_1x_cocostyle/img_78.pdf}
\caption{R-101-C4.}
\label{fig:FP_cases_101}
\end{subfigure}
\caption{Some false positives cases.}
\label{fig:FP_cases}
\end{figure}
%
%
% % 80
% \begin{figure}[th!]
% \centering
% \begin{subfigure}[t]{.85\linewidth}
% \centering
% \includegraphics[width=\textwidth,trim={0 0 0 0},clip]{_vis_results/e2e_faster_rcnn_R_50_C4_1x_cocostyle_NO-AUG/img_80.pdf}
% \caption{R50NO.}
% \label{fig:improv_50N}
% \end{subfigure}\\
% %
% \begin{subfigure}[t]{0.85\linewidth}
% \centering
% \includegraphics[width=\textwidth,trim={0 0 0 0},clip]{_vis_results/e2e_faster_rcnn_R_50_C4_1x_cocostyle/img_80.pdf}
% \caption{R50}
% \label{fig:improv_50}
% \end{subfigure}
% %
% \begin{subfigure}[t]{0.85\linewidth}
% \centering
% \includegraphics[width=\textwidth,trim={0 0 0 0},clip]{_vis_results/e2e_faster_rcnn_R_101_C4_1x_cocostyle/img_80.pdf}
% \caption{R101}
% \label{fig:improv_101}
% \end{subfigure}
% \caption{80}
% \label{fig:improv}
% \end{figure}
%
In Figure~\ref{fig:occlusion}, all models could find almost all tires, except for the one with a high level of occlusion (the middle one, under all the other tires).
That tire is hard to find even for humans.
Besides, the dataset does not have many occlusion examples, which makes the detection of such cases even harder.
%
% 127
\begin{figure}[th!]
\centering
\begin{subfigure}[t]{.85\linewidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 0},clip]{_vis_results/e2e_faster_rcnn_R_50_C4_1x_cocostyle_NO-AUG/img_127.pdf}
\caption{R-50-C4 (no aug.).}
\label{fig:occlusion_50N}
\end{subfigure}\\
%
\begin{subfigure}[t]{0.85\linewidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 0},clip]{_vis_results/e2e_faster_rcnn_R_50_C4_1x_cocostyle/img_127.pdf}
\caption{R-50-C4.}
\label{fig:occlusion_50}
\end{subfigure}
%
\begin{subfigure}[t]{0.85\linewidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 0},clip]{_vis_results/e2e_faster_rcnn_R_101_C4_1x_cocostyle/img_127.pdf}
\caption{R-101-C4.}
\label{fig:occlusion_101}
\end{subfigure}
\caption{High occlusion example.}
\label{fig:occlusion}
\end{figure}
In Figure~\ref{fig:wrong_fp}, all models detect, with a high confidence score ($1.00$), a tire that had not been annotated in the dataset.
Although it has been correctly detected, it was counted as an FP.
% 133
\begin{figure}[th!]
\centering
\begin{subfigure}[t]{\linewidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 0},clip]{_vis_results/e2e_faster_rcnn_R_50_C4_1x_cocostyle_NO-AUG/img_133.pdf}
% \caption{R-50-C4 (no aug.).}
% \label{fig:wrong_fp_50N}
\end{subfigure}%\\
% %
% \begin{subfigure}[t]{0.85\linewidth}
% \centering
% \includegraphics[width=\textwidth,trim={0 0 0 0},clip]{_vis_results/e2e_faster_rcnn_R_50_C4_1x_cocostyle/img_133.pdf}
% \caption{R-50-C4.}
% \label{fig:wrong_fp_50}
% \end{subfigure}
% %
% \begin{subfigure}[t]{0.85\linewidth}
% \centering
% \includegraphics[width=\textwidth,trim={0 0 0 0},clip]{_vis_results/e2e_faster_rcnn_R_101_C4_1x_cocostyle/img_133.pdf}
% \caption{R-101-C4.}
% \label{fig:wrong_fp_101}
% \end{subfigure}
\caption{Wrong false positive.}
\label{fig:wrong_fp}
\end{figure}
Even though our models were trained on a small dataset, they were capable of detecting tires in the MBG dataset, as depicted in Figure~\ref{fig:mbg_res}.
However, they also produced many false positives, as shown in Figure~\ref{fig:mbg_res_fp}.
%
\begin{figure}[th!]
\centering
\begin{subfigure}[t]{.33\linewidth}
\centering
\includegraphics[width=.7\textwidth,trim={10cm 5cm 10cm 5cm},clip]{_vis_res_MBG/e2e_faster_rcnn_R_50_C4_1x_cocostyle/rectfied_DJI_0010/im_frame_4900.pdf}
% \caption{R-50-C4 (no aug.).}
\label{fig:mbg_res_50N}
\end{subfigure}~
%
%
\begin{subfigure}[t]{0.33\linewidth}
\centering
\includegraphics[width=.7\textwidth,trim={10cm 5cm 10cm 5cm},clip]{_vis_res_MBG/e2e_faster_rcnn_R_50_C4_1x_cocostyle_NO-AUG/rectfied_DJI_0041/im_frame_2900.pdf}
% \caption{R-101-C4.}
\label{fig:mbg_res_101}
\end{subfigure}~
% \vspace{-12mm}
%
\begin{subfigure}[t]{0.33\linewidth}
\centering
\includegraphics[width=.7\textwidth,trim={15cm 5cm 5cm 5cm},clip]{_vis_res_MBG/e2e_faster_rcnn_R_50_C4_1x_cocostyle/rectfied_DJI_0019/im_frame_0200.pdf}
% \caption{R-50-C4.}
\label{fig:mbg_res_50}
\end{subfigure}
%
\caption{Tires from the MBG dataset detected by the models trained using the CEFET dataset (cropped images).}
\label{fig:mbg_res}
\end{figure}
\begin{figure}[th!]
\centering
\begin{subfigure}[t]{.49\linewidth}
\centering
\includegraphics[width=\textwidth,trim={3cm 0 9.8cm 4.5cm},clip]{_vis_res_MBG/e2e_faster_rcnn_R_50_C4_1x_cocostyle/rectfied_DJI_0010/im_frame_4150.pdf}
% \caption{R-50-C4 (no aug.).}
\label{fig:mbg_res_fp_a}
\end{subfigure}~
%
%
\begin{subfigure}[t]{0.49\linewidth}
\centering
\includegraphics[width=\textwidth,trim={12.8cm 4.1cm 0 0},clip]{_vis_res_MBG/e2e_faster_rcnn_R_50_C4_1x_cocostyle_NO-AUG/rectfied_DJI_0041/im_frame_0700.pdf}
% \caption{R-101-C4.}
\label{fig:mbg_res_fp_b}
\end{subfigure}
\\
\vspace{-12mm}
%
\begin{subfigure}[t]{0.49\linewidth}
\centering
\includegraphics[width=\textwidth,trim={10cm 0 3cm 0cm},clip]{_vis_res_MBG/e2e_faster_rcnn_R_50_C4_1x_cocostyle/rectfied_DJI_0019/im_frame_0600.pdf}
% \caption{R-50-C4.}
\label{fig:mbg_res_fp_c}
\end{subfigure}~
%
\begin{subfigure}[t]{0.49\linewidth}
\centering
\includegraphics[width=\textwidth,trim={10cm 0 3cm 0},clip]{_vis_res_MBG/e2e_faster_rcnn_R_101_C4_1x_cocostyle/rectfied_DJI_0043/im_frame_2550.pdf}
% \caption{R-101-C4.}
\label{fig:mbg_res_fp_d}
\end{subfigure}\\
\caption{Example of objects in the MBG dataset misclassified as tires by the models trained using the CEFET dataset (cropped images).}
\label{fig:mbg_res_fp}
\end{figure}
\section{Conclusions}
%
In this chapter, we applied deep-learning-based models to detect potential mosquito breeding sites, particularly tires.
We have seen that deeper models achieved better results.
Further adjustments to the models may improve them even more.
Nevertheless, the obtained results are promising and indicate that models trained with the CEFET dataset may be useful for detecting potential mosquito breeding sites.