\chapter{Vision Meets Unmanned Vehicles}
\label{chap:database}
In this chapter, we briefly review works related to our problem, with particular focus on those that detect mosquito breeding sites directly.
Lastly, we describe our new dataset in detail.
\section{Related works}\label{sec:trab_rel}
In this section, we provide a brief review of related works.
We first discuss techniques for detecting stagnant water, which is the perfect environment for \Aedes reproduction.
We then highlight works that directly perform mosquito breeding ground detection.
%
\subsection{Stagnant water detection}
Image-based water detection systems can be useful for many applications~\cite{Rankin2006, Rankin2010a, Rankin2010b, Zhang2010b, Rankin2011a, Santana2012a, Zhong2013a}.
In our problem specifically, this approach could also be applied since \textit{Aedes aegypti} reproduces in stagnant water.
The method in~\cite{Rankin2006} uses a water detection system to assist in the navigation of autonomous off-road vehicles.
The system is based on several types of features, including texture and others using the HSV (hue, saturation and value) color space.
It also uses a pair of cameras to identify image regions that are candidates for being as reflective as water (regions whose estimated depth is greater than that of their neighborhood).
It also presents a fusion rule to combine these features and segment the region of interest containing water.
In~\cite{Rankin2010a}, one finds a methodology to assess the performance of water detectors applied to images from unmanned vehicles.
Two types of evaluation are presented: one that considers the intersection between the detector outputs and the ground truth, and another that assesses the georeferencing accuracy.
The authors in~\cite{Rankin2010b} observed that, to detect water bodies, the reflections of the sky are more useful in images whose objects are distant from the camera, while colors have more discriminative power at shorter ranges.
The authors of~\cite{Zhang2010b} propose a shape descriptor in images which is invariant to scale, rotation, affine transformations, mirroring and non-rigid distortions such as ripple effects.
This descriptor is quantitatively and qualitatively compared with other methods,
%based on mirroring invariance forms,
achieving better results.
The work in~\cite{Rankin2011a} performs water detection based on sky reflections and was developed for unmanned land vehicles.
In that work, it is considered that water bodies act as flat mirrors for large incidence angles, so the proposed method geometrically locates the pixels of the sky that are reflected in a water-body candidate.
Based on the color similarity and the local characteristics of the terrain,
it is then decided whether the candidate is water or not, given that it lies below the horizon line.
Tests performed in open rural areas at distances greater than 7 meters obtained 100\% true positives and at most 0.58\% false positives under different climatic conditions.
Another way to detect water in videos is through the dynamic texture segmentation~\cite{Santana2012a}.
In that work, the proposed technique removes the static background image and even dynamic objects present in the scene.
To do so, an entropy measure, computed through optical flow over the course of several frames, is used to obtain the water signature.
In order to detect regions without motion, an image segmentation method based on the propagation of labels is applied.
The technique was validated on 12 videos with static and moving cameras, obtaining 95\% true positives and 10\% false positives.
A new water-reflection recognition technique is presented in~\cite{Zhong2013a}.
First, they construct a new feature space using moments invariant to motion distortion in low- and high-frequency curvelet space.
In this new feature space, they apply algorithms that minimize the reflection cost at low frequencies together with the discrimination of {\it curvelet} coefficients at high frequencies.
By doing so, they classify the water reflection and detect the reflection axis in the images with a hit rate ranging from 80\% to 95\%.
\subsection{Mosquito breeding grounds}
%
Regarding mosquito breeding ground detection, we point out references~\cite{Agarwal2014a, Prasad2015a, Mehra2016a}.
%
In~\cite{Agarwal2014a}, a system receives geotagged images generated by the population.
Then, the image quality is evaluated in order to reject images that contain high levels of distortion or artifacts.
Next, each image is converted into a feature vector using the bag of visual words model through
the scale-invariant feature transform (SIFT) descriptor.
\abbrev{\id{SIFT}SIFT}{Scale-Invariant Feature Transform}
Afterwards, a support vector machine (SVM) classifier is trained to identify whether the
\abbrev{\id{SVM}SVM}{Support Vector Machine}
images contain potential mosquito breeding sites (stagnant water, open tyres or containers, bushes, etc.) or not (running water, manicured lawns, tyres attached to vehicles, etc.).
Finally, the system outputs a heat map indicating the regions with the highest risk of harboring mosquito habitats.
%
The authors of~\cite{Prasad2015a} use a trained SVM classifier to detect water puddles, which may contain stagnant water, in images obtained from videos acquired by a quadcopter.
%
The work in~\cite{Mehra2016a} uses thermal and gray-level images to detect stagnant water.
From each image, they compute a 128-dimensional feature vector using the speeded-up robust features (SURF)
\abbrev{\id{SURF}SURF}{Speeded-Up Robust Features}
descriptor.
This vector is then reduced to a 64-dimensional vector using principal component analysis (PCA).
\abbrev{\id{PCA}PCA}{Principal Component Analysis}
The computed vectors are used to train an ensemble of naive Bayes classifiers to identify potential mosquito breeding grounds.
In this work, we propose a methodology for mosquito breeding ground detection that applies machine learning techniques to videos acquired by an Unmanned Aerial Vehicle (UAV)\abbrev{\id{UAV}UAV}{Unmanned Aerial Vehicle}, also known as a drone.
%
To our knowledge, there is no large public dataset available for training a model for this task.
The dataset proposed in~\cite{casfinal2018} is interesting, but has some drawbacks, which we describe in Section~\ref{sec:cefet_data}.
Its authors train a Random Forest classifier using features extracted from the images: the H channel of the HSV color space for detecting tires, and histograms of the S and V channels of the same space for detecting stagnant water.
Taking~\cite{casfinal2018} as inspiration, we propose to design and acquire a new dataset, described in Section~\ref{sec:new_data}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{The CEFET dataset}\label{sec:cefet_data}
%
The authors of~\cite{casfinal2018} use the commercial DJI Phantom 2 Vision+ UAV to capture several videos containing tires, puddles, and other objects with water, such as water tanks and pails, in different simulated environmental situations.
This UAV model has 20 minutes of flight autonomy and is equipped with a camera capable of recording videos with resolution up to 1080$\times$720 pixels at 60~Hz.
They chose to record the videos using 1080p at 30~frames per second (fps)\abbrev{\id{fps}fps}{frames per second}.
The camera of this UAV model presents high distortion, especially radial distortion.
In order to mitigate this problem, they reduced the field of view (FoV)\abbrev{\id{FOV}FOV}{Field of View} to 85 degrees.
As a consequence, the flight time increases, since the drone takes more time to cover a given region.
All the videos, each about 15~s long, were recorded with the camera pointing downward and the drone following a specified route at an
%three different
approximately constant altitude of 5~m.
% (5, 10 and $15~m$).
The speed is set to 7~km/h.
They point out that these parameters may vary over the course of the flight due to environmental interference such as wind.
The drone telemetry measurements (\eg position, velocity and altitude) are extracted using the third-party application Litchi~\cite{web:litchi}.
For each video, there is a text file containing frame-by-frame object annotations.
Another limitation of this dataset is that only tires are annotated.
Their dataset, including the 29 videos in \verb|mp4| format and corresponding text annotation files, is publicly available\footnote{\url{https://drive.google.com/open?id=1tDOVdb_vALUnD_cY3lQf0ggoiM1F63Jl}}.
We show some examples of scenarios from this dataset in Figure~\ref{fig:base_cefet}.
The train-test split is included in the annotation files.
However, in this work, we propose a different split that we mention in Chapter~\ref{chap:results}.
\begin{figure*}[htb!]
\centering
\includegraphics[width=.49\linewidth, height=.3\linewidth, trim={10cm 3.2cm 0 0},clip]{base_cefet_2.png}~
\vspace{2mm}
\includegraphics[width=.49\linewidth, height=.3\linewidth, trim={0 0 0 2.2cm},clip]{base_cefet_3.png}\\
\vspace{2mm}
\includegraphics[width=.49\linewidth, height=.3\linewidth, trim={0 0 0 3.2cm},clip]{base_cefet_1.png}~
\vspace{2mm}
\includegraphics[width=.49\linewidth, height=.3\linewidth, trim={0 5.0cm 0 5.0cm},clip]{base_cefet_4.png}\\
% \includegraphics[width=.5\linewidth]{base4.png}
\caption{Examples of scenarios contemplated by the CEFET dataset.}
\label{fig:base_cefet}
\end{figure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{The Mosquito Breeding Grounds dataset}\label{sec:new_data}
%
% We use the commercial Phantom Vision 2 Plus UAV from DJI Company for acquiring the dataset videos.
% %
% This UAV has 20 minutes of flight autonomy and is capable of executing a predefined flight plan.
% One may retrieve all telemetry information (\eg altitude, latitude, longitude, speed, etc) from a \verb|.csv| file.
% It is equipped with a high-definition camera with passive and active stabilization (dampers and gimbal).
% %
% The camera has many parameters that can be adjusted and can generate videos up to 1080i at 60Hz.
% %
% Since the telemetry sampling rate is higher than the camera frame rate, we employ a decimation in the signals obtained from the \verb|.csv| file in order to associate each video frame to its respective telemetry measurement.
% %
%
% The dataset has the following technical specifications:
% %
% \begin{itemize}
% \item We automatically predefine a flight plan using the Litchi~\cite{web:litchi} software.
% This plane performs a zigzag sweep over the entire terrain area autonomously.
% However, because of the limited accuracy of the telemetry and uncontrolled situations such as wind, the UAV does not fly straight in the path between two points.
% For this reason, once the UAV reaches a point, it takes some time to the UAV realigns with the direction of the next point.
% %
% \item We turn off the camera's auto adjustment and set all the parameters manually.
% We keep the focus fixed. We also reduce the field of view (FoV)\abbrev{\id{FOV}FOV}{Field of View} to lessen the effect of radial distortions.
% We set the video scan to 1080p at 30~frames per second (fps)\abbrev{\id{fps}fps}{frames per second}.
% %
% \item Before starting a flight, we execute camera calibration using a calibration pattern, as described in Section~\ref{sec:calibration}.
% %
% \item The altitude is approximately constant in each video.
% Currently, the dataset has videos acquired at two different altitudes, $10 $ and $ 25 $~m, both predefined in the flight plan through Litchi.
% Small variations in altitude caused by the limited accuracy of telemetry, wind, etc. are within acceptable ranges.
% %
% \item Approximately, constant speed of $10~km/h$ is preset via Litchi.
% Again, the variations caused by the instruments' limited accuracy, wind, and interferences of unknown nature are within acceptable thresholds.
% %
% \item The dataset includes two types of terrain: green vegetation, to simulate the environment of a wasteland, and asphalt, to simulate urban environment, as depicted in Figure~\ref{fig:scenarios1}.
% %
% \item The videos have the following objects randomly arranged on the recording area: tires, bottles, and objects that can accumulate water such as buckets and plastic pool, as seen in Figure~\ref{fig:objects1}.
% %
% \item Afterward, we compensate the videos distortions, as described in Section~\ref{sec:calibration} and manually annotate them using the Zframer software, according to Section~\ref{sec:annot}.
% %
% \end{itemize}
%
In this section, we describe our dataset, named the ``Mosquito Breeding Grounds'' (MBG) dataset. \abbrev{\id{MBG}MBG}{Mosquito Breeding Grounds dataset}
We use the commercial DJI Phantom 4 PRO UAV for acquiring the aerial videos that might depict potential mosquito breeding sites.
%
This UAV has approximately 30 minutes of flight autonomy and is capable of executing a predefined flight plan.
One may retrieve all telemetry information (\eg altitude, latitude, longitude, speed, etc) from a \verb|.csv| file using third-party software.
It is equipped with a high-definition camera with passive and active stabilization (dampers and gimbal).
%
The camera has many adjustable parameters and can record videos at resolutions up to 4K ($4096\times 2160$ pixels) at 60~Hz.
%
Since the telemetry sampling rate is higher than the camera frame rate, we decimate the signals obtained from the \verb|.csv| file in order to associate each video frame with its respective telemetry measurement.
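A simple way to pair each frame with its nearest telemetry sample is sketched below.
This is only an illustration of the decimation step: the column name \verb|time| and the use of the \verb|pandas| library are assumptions, not a description of our actual scripts.
\begin{verbatim}
# Illustrative sketch of the telemetry decimation step; the column
# name "time" (seconds) is a hypothetical field of the Litchi .csv log.
import numpy as np
import pandas as pd

def telemetry_per_frame(csv_path, num_frames, fps):
    """Return, for each video frame, the telemetry row closest in time."""
    log = pd.read_csv(csv_path)
    t_frames = np.arange(num_frames) / fps           # frame timestamps (s)
    t_log = log["time"].to_numpy()                   # telemetry timestamps (s)
    idx = np.abs(t_log[None, :] - t_frames[:, None]).argmin(axis=1)
    return log.iloc[idx].reset_index(drop=True)
\end{verbatim}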
%
The dataset has the following technical specifications:
%
\begin{itemize}
\item We automatically predefine a flight plan using the Litchi~\cite{web:litchi} software.
This plan performs a serpentine-like sweep over the entire terrain area autonomously (a simple waypoint-generation sketch is given after this list).
%However, because of the limited accuracy of the telemetry and uncontrolled situations such as wind, the UAV does not fly straight in the path between two points.
%For this reason, once the UAV reaches a point, it takes some time to the UAV realigns with the direction of the next point.
%
\item We turn off the camera auto-adjustment and set all the parameters manually,
keeping the focus fixed at infinity and
%We also reduce the field of view (FoV)\abbrev{\id{FOV}FOV}{Field of View} to lessen the effect of radial distortions.
setting the video mode to $3840\times 2160$ pixels at 50~frames per second (fps)\abbrev{\id{fps}fps}{frames per second}.
This step is important because we want to keep all camera parameters constant in order to perform camera calibration.
%
\item Before starting a flight, we record a calibration video using a calibration pattern, as described in Section~\ref{sec:calibration}.
%
\item The altitude is approximately constant in each video.
Currently, the dataset has videos acquired at different altitudes, \eg $10$, $25$, $40$~m, all of them predefined in the flight plan through Litchi.
Small variations in altitude ($\pm 0.5$~m, according to the manufacturer~\cite{web:djip4prospec}) caused by the limited accuracy of telemetry, wind, etc. are within acceptable ranges.
%
\item An approximately constant speed of $15$~km/h is preset via Litchi.
This parameter can suffer small variations caused, for example, by wind.
However, our drone withstands wind speeds of up to $10$~m/s~\cite{web:djip4prospec}.
%
\item The dataset includes different types of terrain.
Currently, it contains high and low grass, asphalt, wasteland, and buildings, as depicted in Figure~\ref{fig:scenarios1}.
% green vegetation, to simulate the environment of a wasteland, and asphalt, to simulate urban environment
%
\item The videos contain about 15 manually inserted objects, randomly arranged over the recording area, including tires, bottles, and other objects that can accumulate water, such as buckets and plastic pools.
The videos also contain objects that were originally part of the recorded scenes, such as water tanks.
We show examples of these objects in Figure~\ref{fig:objects1}.
%
\item Afterward, we compensate for the videos' distortions, as described in Section~\ref{sec:calibration}, and manually annotate them using the Zframer software, according to Section~\ref{sec:annot}.
%
\end{itemize}
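To make the serpentine sweep mentioned in the flight-plan item above concrete, the following toy sketch generates boustrophedon waypoints over a rectangular area.
The actual flight plans were defined through Litchi; this code is purely illustrative.
\begin{verbatim}
# Toy generator of a serpentine (boustrophedon) sweep over a rectangular
# area, given in local metric coordinates. Purely illustrative; the real
# flight plans were defined manually in Litchi.
def serpentine_waypoints(width, height, lane_spacing):
    """Return (x, y) waypoints covering a width x height rectangle."""
    waypoints = []
    y, going_right = 0.0, True
    while y <= height:
        x_start, x_end = (0.0, width) if going_right else (width, 0.0)
        waypoints.extend([(x_start, y), (x_end, y)])
        y += lane_spacing
        going_right = not going_right
    return waypoints

# Example: a 100 m x 60 m field swept with 20 m between lanes.
print(serpentine_waypoints(100.0, 60.0, 20.0))
\end{verbatim}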
\begin{figure*}[htb!]
\centering
\includegraphics[width=.49\linewidth, height=.3\linewidth, trim={2cm 0 0 0},clip]{new_base_04.png}~
\vspace{2mm}
\includegraphics[width=.49\linewidth, height=.3\linewidth, trim={2cm 0 0 0},clip]{new_base_02.png}\\
\vspace{2mm}
\includegraphics[width=.49\linewidth, height=.3\linewidth, trim={2cm 0 0 0},clip]{new_base_01.png}~
\vspace{2mm}
\includegraphics[width=.49\linewidth, height=.3\linewidth, trim={2cm 0 0 0},clip]{new_base_03.png}\\
\vspace{2mm}
\includegraphics[width=.49\linewidth, height=.3\linewidth, trim={0 0 0 2cm},clip]{new_base_05.png}~
\vspace{2mm}
\includegraphics[width=.49\linewidth, height=.3\linewidth, trim={2cm 0 0 0},clip]{new_base_06.png}
% \includegraphics[width=.5\linewidth]{base4.png}
\caption{Examples of scenarios contemplated by the MBG dataset.}
\label{fig:scenarios1}
\end{figure*}
The generated dataset comprises several aerial video sequences containing tires, water puddles, water reservoirs, and several other objects filled with water.
So far, the videos have been recorded at the Technology Center of UFRJ, {\it campus} Ilha do Fundão, and at the CEFET/RJ {\it campus} Nova Iguaçu.
%
\begin{figure*}[htb!]
\centering
\includegraphics[width=.15\linewidth]{garrafa1.png}
\includegraphics[width=.25\linewidth]{garrafa2.png}
\includegraphics[width=.15\linewidth]{pneu1.png}
\includegraphics[width=.15\linewidth]{pneu3.png}\\
\vspace{2mm}
\includegraphics[width=.15\linewidth]{pneu5.png}
% \includegraphics[width=.15\linewidth]{water1.png}
\includegraphics[width=.15\linewidth]{water2.png}
\includegraphics[width=.15\linewidth]{water3.png}
% \includegraphics[width=.26\linewidth]{base4.png}
\caption{Examples of objects in the video dataset.}
\label{fig:objects1}
\end{figure*}
% \red{
% % \subsection{The Equipment}
% % Talk about the first drone used (DJI Phanton 2) that was later replaced by DJI Phantom 4 Pro.
%
% \subsection{Planning the UAV Trajectory}
%
% \subsection{The Smallest Surrounding Rectangle Problem}
% %https://gis.stackexchange.com/questions/22895/finding-minimum-area-rectangle-for-given-points
% %https://www.geometrictools.com/Documentation/MinimumAreaRectangle.pdf
%
% % \subsection{Defining Height and Speed}
%
% }
\subsection{Camera calibration}\label{sec:calibration}
%
Camera calibration is an important process for computer vision applications.
This step is critical because three-dimensional (3D) metric information may be extracted from an image if calibration is well performed.
Through camera calibration one may find the intrinsic camera parameters, such as focal length and principal point.
Many calibration techniques have been developed~\cite{Geiger2012,Zhang2010a},
even for photometry applications~\cite{Yusoff2017,Perez2012}.
In this work, we apply camera calibration for minimizing the distortions caused by the camera lens, mainly the radial distortions.
We have chosen the method proposed in~\cite{Zhang2000} due to its simplicity and low cost.
This technique consists of
extracting the key points (corners) from an image containing a calibration pattern, usually a chessboard;
estimating the camera parameters;
estimating the distortion coefficients; and
applying the correction.
%
\subsubsection{Keypoints detection}
%
Harris~\cite{Harris1988} and SIFT~\cite{Lowe2004} are classical detectors of image key points.
However, in this work, we use the method implemented in OpenCV's
\verb|findChessboardCorners| function~\cite{opencv_library},
as it is more robust to cluttered images containing smoothing artifacts.
Initially proposed by Vladimir Vezhnevets~\cite{web:Vezhnevets},
this method first converts the image to a binary image using an adaptive threshold, segmenting the white and black squares of the calibration pattern.
Then, the borders of the black squares are found and their contours are approximated by quadrilaterals, whose corners are selected and grouped according to the calibration object.
%
% The algorithm finds all squares as connected components, then fits a polygon to each connected component.
% If a resulting polygon has four vertices, it is a qualified square.
% Then the algorithm orders qualified squares into a grid until the pattern is found.
%This function has some limitations.
%The calibration object must have $ m\times n$ squares (or $ n\times m$),
%where $m$ is even and $n$ is odd, or vice versa (i.e., $5\times6$, $7\times8$, $10\times7$ etc).
In this work, we record calibration videos showing a calibration pattern consisting of a $10\times7$ checkerboard.
Key-point detection is performed every 20 frames of the video.
Before performing detection, we filter the image with a Gaussian filter of size $7\times 7$ pixels, zero mean, and standard deviation $1.4$ in both directions.
We also lower the image resolution by $40$\% to reduce detection time.
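The following sketch illustrates this detection step with OpenCV, using the parameters just described (Gaussian blur, 40\% downscaling, detection every 20 frames, $10\times7$ board, \ie $9\times6$ inner corners).
File names are placeholders, and the exact scripts we use may differ.
\begin{verbatim}
# Sketch of the key-point detection step (placeholder file name).
import cv2

PATTERN = (9, 6)  # inner corners of a 10x7 checkerboard

cap = cv2.VideoCapture("calibration_video.mp4")
frame_idx, detections = 0, []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % 20 == 0:
        img = cv2.GaussianBlur(frame, (7, 7), 1.4)    # sigma 1.4, both axes
        img = cv2.resize(img, None, fx=0.6, fy=0.6)   # 40% resolution reduction
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        found, corners = cv2.findChessboardCorners(gray, PATTERN)
        if found:
            detections.append((corners / 0.6).astype("float32"))  # undo resize
    frame_idx += 1
cap.release()
\end{verbatim}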
\subsubsection{Camera model}\label{sec:cam_model}
%
Having detected the key points in the calibration pattern, one may estimate the camera projection matrix from the coordinates of these points in the real world and the image.
Therefore, we consider that a camera maps a world point $\Mbf' = [X,Y,Z,1]^\Trm$ to an image point $\mbf' = [u,v,1]^\Trm$ through a projective transformation of the form~\cite{Hartley2004}
\begin{equation}
\label{eq:projection}
s\mbf' = \Asf[\Rsf~|~\tbf]\Mbf',
\end{equation}
%
where $s$ is an arbitrary scale factor, and
$\Rsf$ and $\tbf$ are, respectively, the rotation matrix and translation vector (extrinsic parameters) that relate the world coordinate system to the camera coordinate system.
The matrix $\Asf$ is the calibration camera matrix (intrinsic parameters), defined as
%
\symbl{\id{A }$\Asf$}{calibration camera matrix (intrinsic parameters)}
\begin{equation}
\Asf =
\begin{bmatrix}
\alpha & \gamma & u_0\\
0 & \beta & v_0\\
0 & 0 & 1\\
\end{bmatrix},
\end{equation}
%
where $[u_0,v_0]^\Trm$ denotes the principal point coordinates,
$\alpha$ and $\beta$ the scale factors of the $u$ and $v$ image axes, respectively, and
$\gamma$ the skewness of the two image axes.
The intrinsic parameters do not depend on the image viewed.
As a result, once computed, they may be used for all images since the focal length is fixed (same zoom level).
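As a toy numerical illustration of Equation~\eqref{eq:projection} (with made-up intrinsic and extrinsic values, not our calibration results), the projection of a world point can be computed as follows.
\begin{verbatim}
# Toy example of the projection s*m' = A [R | t] M' (made-up numbers).
import numpy as np

A = np.array([[1000.0,    0.0, 960.0],   # alpha, gamma, u0
              [   0.0, 1000.0, 540.0],   # beta, v0
              [   0.0,    0.0,   1.0]])
R = np.eye(3)                            # camera aligned with the world frame
t = np.array([[0.0], [0.0], [10.0]])     # camera 10 m away from the plane Z=0

M = np.array([[1.0], [2.0], [0.0], [1.0]])   # homogeneous world point
m = A @ np.hstack([R, t]) @ M                # s * [u, v, 1]^T
u, v = (m[:2] / m[2]).ravel()                # divide by the scale factor s
print(u, v)                                  # pixel coordinates (1060.0, 740.0)
\end{verbatim}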
% \subsubsection{Zhang's Algorithm}
\subsubsection{Distortion compensation}\label{sec:Zhang}
%
Conventional cameras usually have significant lens distortion, specifically radial distortion.
In order to minimize such distortions, we apply Zhang's algorithm~\cite{Zhang2000a} as follows.
%
Let $(u,v)$ be the ideal (nonobservable distortion-free) pixel image coordinates and
$(\breve{u},\breve{v})$ the corresponding coordinates at observed (distorted) image.
The ideal points are projections of calibration pattern points according to the model given by Equation~\eqref{eq:projection}.
Likewise, $(x,y)$ and $(\breve{x},\breve{y})$ are, respectively, the ideal and real normalized image coordinates.
Hence, the radial distortion may be modeled as~\cite{Zhang2000a}:
%
\begin{equation}
\begin{cases}
\breve{x} = x + x(k_1 r^2 + k_2 r^4) \\% + k_3 r^6)\\% + 2p_1 x y + p_2(r^2 + 2x^2), \\
\breve{y} = y + y(k_1 r^2 + k_2 r^4) % + k_3 r^6),% + p_1(r^2 + 2y^2) + 2 p_2 x y,
\end{cases},
\end{equation}
%
where $k_1$ and $k_2$ are radial distortion coefficients and $r^2 = x^2 + y^2$.
Higher order distortion models do not present significant improvements and may lead to numerical instability~\cite{Zhang2000a}.
The center of the radial distortion is located at the principal point.
From
$\breve{u} = \alpha \breve{x} + \gamma \breve{y} + u_0 $ and $\breve{v} = \beta \breve{y} + v_0$, assuming $\gamma = 0$,
we may write
%
\begin{equation}
\begin{cases}
\breve{u} = u + (u-u_0)(k_1 r^2 + k_2 r^4 ) \\
\breve{v} = v + (v-v_0)(k_1 r^2 + k_2 r^4 )
\end{cases}.
\label{eq:u_dist}
% \label{eq:v_dist}
\end{equation}
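For illustration, Equation~\eqref{eq:u_dist} translates directly into code; the sketch below simply evaluates the model for given normalized coordinates and is not part of our processing chain.
\begin{verbatim}
# Direct evaluation of the radial distortion model in Equation (u_dist):
# maps ideal pixel coordinates (u, v) to distorted ones, given the ideal
# normalized coordinates (x, y), the principal point (u0, v0) and the
# radial coefficients k1, k2.
def distort(u, v, x, y, u0, v0, k1, k2):
    r2 = x * x + y * y
    factor = k1 * r2 + k2 * r2 * r2
    return u + (u - u0) * factor, v + (v - v0) * factor
\end{verbatim}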
Zhang's algorithm~\cite{Zhang2000}
first makes a coarse estimation of the camera extrinsic and intrinsic parameters and then refines them through maximum likelihood estimation.
Given $n$ images of the calibration pattern, each containing $m$ points, and assuming that the image points are corrupted by independent and identically distributed (i.i.d.) noise,
the maximum likelihood estimate may be obtained by minimizing the following function~\cite{Zhang2000a}
%
\begin{equation}
\sum_{i=1}^{n}\sum_{j=1}^{m} \Vert \xbf_{ij} - \breve{\xbf}(\Asf, k_1, k_2, \Rsf_i, \tbf_i, \Xbf_j) \Vert^2
\label{eq:functional},
\end{equation}
where $\breve{\xbf}(\Asf, k_1, k_2, \Rsf_i, \tbf_i, \Xbf_j)$ is the projection of point $\Xbf_j$ in image $i$, according to Equation~\eqref{eq:projection},
following the distortion model in Equations~\eqref{eq:u_dist}. %and~\eqref{eq:v_dist}.
The minimization of the functional given in Equation~\eqref{eq:functional} is a nonlinear optimization problem that may be solved with the Levenberg-Marquardt algorithm.
%\subsection{Summary}
%
%Zhang's calibration method may be summarized as follows~\cite{Zhang2000}:
%\begin{enumerate}
%	\item print a calibration pattern and attach it to a planar surface;
%	\item take some pictures of the pattern under different orientations, by moving either the pattern or the camera;
%	\item detect the corners of the calibration pattern;
%	\item initialize the intrinsic and extrinsic parameters using the closed-form solution described in~\cite{Zhang2000};
%	\item initialize the radial distortion coefficients by solving~\eqref{eq:v_dist} for the $m$ points and $n$ images, or set them to $0$;
%	\item refine the parameters by minimizing~\eqref{eq:functional}.
%\end{enumerate}
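In practice, the coarse estimation and the Levenberg-Marquardt refinement are conveniently available in OpenCV, whose \verb|calibrateCamera| function implements Zhang's method.
The sketch below shows how the corners detected earlier could be used to estimate $\Asf$ and the distortion coefficients and to undistort a frame; \verb|detections| and \verb|frame| are placeholders, and by default OpenCV also estimates tangential and higher-order coefficients, which can be disabled through flags such as \verb|cv2.CALIB_ZERO_TANGENT_DIST|.
\begin{verbatim}
# Sketch of calibration and undistortion with OpenCV (Zhang's method).
# "detections" is the list of detected corner sets from the key-point
# step and "frame" a frame to be rectified; both are placeholders.
import cv2
import numpy as np

PATTERN = (9, 6)
# 3D coordinates of the checkerboard corners on the pattern plane (Z = 0)
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2)

obj_points = [objp] * len(detections)        # same board in every view
img_size = (3840, 2160)                      # (width, height) of the frames

rms, A, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, detections, img_size, None, None)

undistorted = cv2.undistort(frame, A, dist)  # rectified frame
\end{verbatim}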
Figure~\ref{fig:undistort_1} shows an example of the original (distorted) image with the detected calibration pattern corners overlaid,
and Figure~\ref{fig:undistort_2} shows the corresponding image after applying Zhang's algorithm to undistort it.
It is interesting to note that the ``barrel'' effect was eliminated from the image, which
is most apparent when observing the chessboard calibration pattern.
%
\begin{figure}[th!]
\centering
\begin{subfigure}[t]{.8\linewidth}
\centering
\includegraphics[width=\textwidth]{frame_420_pts.png}
\caption{Original (distorted) image with detected calibration pattern corners overlaid.}
\label{fig:undistort_1}
\end{subfigure}\\
%
\begin{subfigure}[t]{0.8\linewidth}
\centering
\includegraphics[width=\textwidth]{frame_420_undistorted.png}
\caption{Undistorted image.}
\label{fig:undistort_2}
\end{subfigure}
\caption{Example of image undistortion using Zhang's algorithm.}
\label{fig:undistort}
\end{figure}
\section{Generating rectified videos}
% https://slhck.info/video/2017/03/01/rate-control.html
% https://trac.ffmpeg.org/wiki/Encode/H.264
%
We apply rectification frame by frame, \ie
we extract all video frames and apply the rectification transform to each one.
% After that, we need to generate a new rectified video in compressed format due to storage issues. since
% videos in a single file are better to store than videos broken into images sequence.
% Hence, we need to compress these videos in order to store them.
To verify how compression compromises the frames' quality, we conducted a brief study, which we describe in the sequel.
First, we extract all frames from the original (non-rectified) videos.
We then save them as new videos using the same codec (x264) as the original videos, through the
\verb|imageio| Python library~\cite{imageio} (an \verb|FFMPEG| wrapper).
\verb|FFMPEG| is a software tool that can save, convert, and create video and audio streams in various formats~\cite{web:ffmpeg}.
Lastly, we compare the videos both objectively and subjectively.
We choose the constant rate factor (CRF)
\abbrev{\id{CRF}CRF}{Constant Rate Factor}
mode, whose value ranges from 0 to 51 (using 8 bits), where 0 is lossless and 51 yields the worst possible quality.
A lower value leads to higher quality, and a subjectively sane range tends to be $17$--$28$~\cite{web:ffmpeg}.
By using this mode we want to keep the best quality without caring much about the final file size.
In this brief experiment, we use a ``quality'' scale~\cite{imageio} in which the conversion to CRF is done as follows:
\begin{equation}
\textrm{CRF} = 51\bigg(1 - \frac{\textrm{quality}}{10}\bigg).
\end{equation}
%
We evaluate quality values from 0 to 10 in unit steps, according to Table~\ref{tab:crfxqual}.
%
\begin{table}[htb!]
\caption{Quality indexes evaluated and respective CRF values~\cite{web:ffmpeg}.}
\label{tab:crfxqual}
\centering
\begin{tabular}{@{}cccccccccccc@{}}
\toprule
\textbf{Quality} & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \midrule
\textbf{CRF} & 51 & 45.9 & 40.8 & 35.7 & 30.6 & 25.5 & 20.4 & 15.3 & 10.2 & 5.1 & 0 \\ \bottomrule
\end{tabular}
\end{table}
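A minimal sketch of the re-encoding procedure using \verb|imageio| is shown below; file paths are placeholders and error handling is omitted.
\begin{verbatim}
# Re-encode a video at a given imageio "quality" level (0-10), which
# imageio maps to an x264 CRF value. Paths are placeholders.
import imageio

def reencode(src_path, dst_path, quality):
    reader = imageio.get_reader(src_path)
    fps = reader.get_meta_data()["fps"]
    writer = imageio.get_writer(dst_path, fps=fps, codec="libx264",
                                quality=quality)
    for frame in reader:
        writer.append_data(frame)
    writer.close()
    reader.close()

crf = 51 * (1 - 5 / 10)    # quality = 5 corresponds to CRF = 25.5
reencode("original_001.mp4", "original_001_q5.mp4", quality=5)
\end{verbatim}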
To evaluate our results quantitatively, we use the peak signal-to-noise ratio (PSNR), defined as
%
\begin{equation}
\textrm{PSNR} = 10\log_{10}\bigg(\frac{(2^B-1)^2}{\textrm{MSE}}\bigg),
\label{eq:psnr}
\end{equation}
%
where $B$ is the number of bits used to represent a pixel (in our case, $B=8$) and MSE is the mean squared error between the original frame $I$ and the compressed frame $K$ of the same size $m\times n$, defined as
%
\begin{equation}
\textrm{MSE} = \frac{1}{mn} \sum_{i=1}^{m}\sum_{j=1}^{n} \bigg(I(i,j) - K(i,j) \bigg)^2.
\label{eq:mse}
\end{equation}
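The two equations above correspond to the following straightforward computation for a pair of frames.
\begin{verbatim}
# PSNR (in dB) between two frames of the same size, following
# Equations (psnr) and (mse) for B-bit pixels.
import numpy as np

def psnr(original, compressed, bits=8):
    diff = original.astype(np.float64) - compressed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")               # identical frames
    return 10 * np.log10((2 ** bits - 1) ** 2 / mse)
\end{verbatim}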
%
%
Good-quality compressed images present typical PSNR values between 30 and 50~dB, provided the bit depth is 8 bits, where higher is better~\cite{Welstead20019compression}.
For each video and each quality level, we compute the frame-by-frame PSNR between the original and the compressed video
%, as shown in Figure~\ref{fig:vid_psnr},
and we take their mean. % (red dashed line).
%
% \begin{figure}[htb]
% \centering
% \includegraphics[width=\textwidth]{original_001_imageio07.pdf}
% \caption[Frame-by-frame PSNR between original and compressed frames after compression.]
% {Frame-by-frame PSNR between original and compressed frames after compression; the dashed line represents their mean.}
% \label{fig:vid_psnr}
% \end{figure}
%
We run this experiment on 4 videos; Video 1 is a calibration video and the other 3 were recorded over a grass area (we chose these 3 because grass is a texture that is hard to compress).
We show the average PSNR values along with the final video compressed size in Figure~\ref{fig:compression_comp}.
Note that, as we increase the quality, the file size increases exponentially,
but, since video encoders are essentially lossy, the PSNR reaches a saturation level~\cite{web:ffmpeg}.
% We hoped the PSNR $\rightarrow \infty$ when generating lossless videos.
% However the documentation says even though the videos are lossless compressed it is not always possible to decode lossless~\cite{web:ffmpeg}.
% Note that lossless output files will likely be huge, and most non-FFmpeg based players will not be able to decode lossless.
%Therefore, if compatibility or file size are an issue, you should not use lossless.
%
%
\begin{figure}[hbt!]
\centering
\includegraphics[width=\textwidth,trim={0 0 0 3.2cm},clip]{comparisson_filesize.pdf}
\caption[Video compression and file size comparison]{Video compression and file size comparison between the original frames and the frames after compression.
The leftmost bin represents quality$=0$ and the rightmost quality$=10$.}
\label{fig:compression_comp}
\end{figure}
%
We also visually evaluate the generated videos.
As shown in Figure~\ref{fig:compression_vis}, at zero quality the image is seriously damaged, as expected.
We chose the quality parameter that gives the best balance between quality and file size.
The quantitative and qualitative results show that the quality is not significantly compromised when we use a quality factor of around 5 (CRF $= 25.5$).
Therefore, we use this factor when storing the new videos with the rectified frames.
\begin{figure*}[htb!]
\centering
\includegraphics[width=.8\linewidth, trim={2cm 0 0 0},clip]{frame_3850_orig.png}\\
\vspace{2mm}
\includegraphics[width=.8\linewidth, trim={2cm 0 0 0},clip]{frame_3850_q0.png}\\
\vspace{2mm}
\includegraphics[width=.8\linewidth, trim={2cm 0 0 0},clip]{frame_3850_q5.png}
% \includegraphics[width=.5\linewidth]{base4.png}
\caption[Compression quality comparison]{Compression quality comparison. From top to bottom: frame from the original video, quality = 0, and quality = 5.}
\label{fig:compression_vis}
\end{figure*}
\section{Database annotation}\label{sec:annot}
%
Our dataset is currently undergoing manual annotation, which is being performed using the Zframer software\footnote{\url{http://www.smt.ufrj.br/~tvdigital/Software/zframer}}, developed at the Signals, Multimedia, and Telecommunications (SMT)
\abbrev{\id{SMT}SMT}{Signals, Multimedia, and Telecommunications} Laboratory of Coppe/UFRJ.
%
After being rectified, the acquired video sequences are labeled frame by frame with Zframer, as depicted in Figure~\ref{fig:zframer1}.
%
The students from the signal processing laboratory at CEFET/RJ {\it Campus} Nova Iguaçu have helped us in this labeling process.
%
Using Zframer, one may annotate, in each frame of the videos, the objects that have been determined to be potential mosquito breeding grounds (\eg tires, water reservoirs, bottles).
Moreover, the software allows interpolation between annotations of selected frames, so that it is not necessary to annotate every frame in which an object appears.
The software output is a text file containing, in each line, the annotation format (in our case, rectangles) and the frame number, along with the pixel coordinates of the upper-left and bottom-right corners of the bounding box of each annotated object, as shown in Figure~\ref{fig:zframer_output}.
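For illustration only, a file with this structure could be parsed as sketched below; the exact field order and the \verb|rect| keyword are assumptions based on Figure~\ref{fig:zframer_output}, not the official Zframer specification.
\begin{verbatim}
# Hypothetical parser for the Zframer output file. The assumed layout
# (shape label, frame number, then the four corner coordinates) is an
# illustration, not the official format specification.
def parse_annotations(path):
    boxes = []  # (frame, x_min, y_min, x_max, y_max)
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 6 or fields[0] != "rect":
                continue
            frame = int(fields[1])
            x0, y0, x1, y1 = map(int, fields[2:6])
            boxes.append((frame, x0, y0, x1, y1))
    return boxes
\end{verbatim}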
%
\begin{figure}[htb]
\centering
\includegraphics[width=\textwidth]{zframer_marking.png}%\\
% \includegraphics[width=\textwidth]{annot_1-2.png}
\caption{A video frame annotation using Zframer.}
\label{fig:zframer1}
\end{figure}
%
\begin{figure}[htb]
\centering
\includegraphics[width=.7\textwidth]{zframer_output_marked.pdf}
\caption{Example of a text file generated by the annotation process.}
\label{fig:zframer_output}
\end{figure}
\section{Conclusions}
%
In this chapter, we discussed works that are directly and indirectly related to our problem.
We have seen that many of them are concerned with helping to develop intelligent systems.
In particular, we went through those that are of interest for detecting mosquito foci.
However, since none of them provides a consolidated, publicly available dataset,
we proposed a new one and described its construction.