This vehicle detection project uses machine learning and computer vision techniques, combined with advanced lane detection. I applied two different methods for detection. The steps of this project are the following:
1) SVM Algorithm
- Perform a Histogram of Oriented Gradients (HOG) feature extraction on a labeled training set of images and train a Linear SVM classifier.
- Implement an efficient sliding-window technique and use the trained SVM classifier to search for vehicles in images.
- Run the pipeline on a video stream and create a heat map of recurring detections frame by frame to reject outliers and follow detected vehicles. watch full video
2) YOLO Algorithm
- Construct a Keras-based neural network and use a pre-trained model to make predictions on images.
- Run the pipeline on a video stream and create a console to monitor lane status and detections. watch full video
`Project-SVM.py` and `helper.py` contain the code for the SVM classifier structure and pipeline. `dist.p` contains a trained SVM classifier based on YUV color features and HOG features, trained on 17,000+ car and not-car pictures. `Project-yolo.py` and `helper_yolo.py` contain the code for the Keras network and pipeline.
The project depends on the following packages:
- Numpy
- cv2
- sklearn
- scipy
- skimage
- keras
SVM (Support Vector Machine) is a powerful machine learning technique. In this project it is trained and used to classify images as car or not-car.
My main training data was downloaded from the GTI vehicle image database and the KITTI vision benchmark suite, which together contain about 8,700 pictures of cars and 8,900 of not-cars. In addition, to increase detection accuracy, I created about 20 not-car pictures from the video.
I explored different color spaces and different `skimage.hog()` parameters (`orientations`, `pixels_per_cell`, and `cells_per_block`) and made a comparison.
Color Space | Accuracy | Training Time (CPU) |
---|---|---|
YUV | 97.75% | 65 s |
YCrCb | 98.11% | 51 s |
LUV | 98.23% | 59 s |
HLS | 98% | 60 s |
HSV | 97.8% | 112 s |
The table above indicates that accuracy in the different color spaces is almost the same. To reduce false positives, I chose YUV to extract color features.
Here is an example using the YUV color space and HOG parameters of `orientations=15`, `pixels_per_cell=(8, 8)` and `cells_per_block=(2, 2)`:
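As a minimal sketch of the feature extraction (assuming the project's helpers simply wrap `skimage.feature.hog()`; the image path below is hypothetical):

```python
import cv2
from skimage.feature import hog

# Load a training sample and convert it to the YUV color space
img = cv2.imread('vehicles/sample.png')  # hypothetical path
yuv = cv2.cvtColor(img, cv2.COLOR_BGR2YUV)

# HOG features for one channel with the parameters chosen above
# (the keyword is spelled `visualise` in older scikit-image releases)
features, hog_image = hog(yuv[:, :, 0],
                          orientations=15,
                          pixels_per_cell=(8, 8),
                          cells_per_block=(2, 2),
                          visualize=True,
                          feature_vector=True)
```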
I trained a linear SVM using the following code:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from helper import extract_features  # assumed to be defined in helper.py

car_features = extract_features(cars, color_space, orient, pix_per_cell, cell_per_block, spatial_feat=False, hist_feat=False, hog_channel=hog_channel)
notcar_features = extract_features(notcars, color_space, orient, pix_per_cell, cell_per_block, spatial_feat=False, hist_feat=False, hog_channel=hog_channel)
# Create an array stack of feature vectors
X = np.vstack((car_features, notcar_features)).astype(np.float64)
# Fit a per-column scaler
X_scaler = StandardScaler().fit(X)
# Apply the scaler to X
scaled_X = X_scaler.transform(X)
# Define the labels vector
y = np.hstack((np.ones(len(car_features)), np.zeros(len(notcar_features))))
# Split up data into randomized training and test sets
rand_state = np.random.randint(0, 100)
X_train, X_test, y_train, y_test = train_test_split(
    scaled_X, y, test_size=0.2, random_state=rand_state)
# Use a linear SVC
svc = LinearSVC()
svc.fit(X_train, y_train)
Note that feature normalization through `sklearn.preprocessing.StandardScaler` is one of the key steps before training. Here is a comparison:
An efficient sliding-window search method is applied, which allows the HOG features to be extracted only once per frame.
def find_cars(img, ystart=400, ystop=656, scale=1.5):
    draw_img = np.copy(img)
    img = img.astype(np.float32) / 255

    img_tosearch = img[ystart:ystop, :, :]
    ctrans_tosearch = convert_color(img_tosearch, conv='RGB2YUV')
    cspace = 'YUV'
    if scale != 1:
        imshape = ctrans_tosearch.shape
        ctrans_tosearch = cv2.resize(ctrans_tosearch, (int(imshape[1] / scale), int(imshape[0] / scale)))

    ch1 = ctrans_tosearch[:, :, 0]
    ch2 = ctrans_tosearch[:, :, 1]
    ch3 = ctrans_tosearch[:, :, 2]

    # Define blocks and steps as above
    nxblocks = (ch1.shape[1] // pix_per_cell) - cell_per_block + 1
    nyblocks = (ch1.shape[0] // pix_per_cell) - cell_per_block + 1
    nfeat_per_block = orient * cell_per_block ** 2

    # 64 was the original sampling rate, with 8 cells and 8 pix per cell
    window = 64
    nblocks_per_window = (window // pix_per_cell) - cell_per_block + 1
    cells_per_step = 2  # Instead of overlap, define how many cells to step
    nxsteps = (nxblocks - nblocks_per_window) // cells_per_step
    nysteps = (nyblocks - nblocks_per_window) // cells_per_step

    # Compute individual channel HOG features for the entire image
    hog1 = get_hog_features(ch1, orient, pix_per_cell, cell_per_block, feature_vec=False)
    hog2 = get_hog_features(ch2, orient, pix_per_cell, cell_per_block, feature_vec=False)
    hog3 = get_hog_features(ch3, orient, pix_per_cell, cell_per_block, feature_vec=False)

    bbox_list = []
    for xb in range(nxsteps):
        for yb in range(nysteps):
            ypos = yb * cells_per_step
            xpos = xb * cells_per_step
            # Extract HOG for this patch
            hog_feat1 = hog1[ypos:ypos + nblocks_per_window, xpos:xpos + nblocks_per_window].ravel()
            hog_feat2 = hog2[ypos:ypos + nblocks_per_window, xpos:xpos + nblocks_per_window].ravel()
            hog_feat3 = hog3[ypos:ypos + nblocks_per_window, xpos:xpos + nblocks_per_window].ravel()
            hog_features = np.hstack((hog_feat1, hog_feat2, hog_feat3))

            xleft = xpos * pix_per_cell
            ytop = ypos * pix_per_cell

            # Extract the image patch
            subimg = cv2.resize(ctrans_tosearch[ytop:ytop + window, xleft:xleft + window], (64, 64))

            # Get color features
            spatial_features = bin_spatial(subimg, color_space=cspace, size=spatial_size)
            hist_features = color_hist(subimg, nbins=hist_bins)

            # Scale features and make a prediction
            test_features = X_scaler.transform(
                np.hstack((spatial_features, hist_features, hog_features)).reshape(1, -1))
            test_prediction = svc.predict(test_features)

            if test_prediction == 1:
                xbox_left = int(xleft * scale)
                ytop_draw = int(ytop * scale)
                win_draw = int(window * scale)
                cv2.rectangle(draw_img, (xbox_left, ytop_draw + ystart),
                              (xbox_left + win_draw, ytop_draw + win_draw + ystart), (0, 0, 255), 6)
                bbox_list.append(((xbox_left, ytop_draw + ystart), (xbox_left + win_draw, ytop_draw + win_draw + ystart)))

    return bbox_list, draw_img
Additionally, a multi-scale window search is applied with different scale values.
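A minimal sketch of how the multi-scale search could be combined (the exact scale values and search bands here are assumptions for illustration):

```python
def multi_scale_search(img):
    # Search the lower part of the frame at several window scales and
    # merge all positive detections into a single box list
    bbox_list = []
    for ystart, ystop, scale in [(400, 528, 1.0), (400, 592, 1.5), (400, 656, 2.0)]:
        boxes, _ = find_cars(img, ystart=ystart, ystop=ystop, scale=scale)
        bbox_list.extend(boxes)
    return bbox_list
```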
A heatmap with a certain threshold is a good way to filter false positives and handle multiple detections. I then used `scipy.ndimage.measurements.label()` to identify individual blobs in the heatmap.
def add_heat(heatmap, bbox_list):
    # Iterate through list of bboxes
    for box in bbox_list:
        # Add += 1 for all pixels inside each bbox
        # Assuming each "box" takes the form ((x1, y1), (x2, y2))
        heatmap[box[0][1]:box[1][1], box[0][0]:box[1][0]] += 1
    return heatmap

heatmap = threshold(heatmap, 2)
labels = label(heatmap)
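As a minimal sketch of how the labeled blobs become final boxes (the `threshold` helper is assumed to clip low values, and `draw_labeled_bboxes` is an illustrative helper rather than the project's exact code):

```python
import numpy as np
import cv2
from scipy.ndimage.measurements import label

def threshold(heatmap, thresh):
    # Zero out pixels below the threshold to suppress weak, isolated detections
    heatmap[heatmap <= thresh] = 0
    return heatmap

def draw_labeled_bboxes(img, labels):
    # Draw one bounding box around each blob returned by label()
    for car_number in range(1, labels[1] + 1):
        nonzero = (labels[0] == car_number).nonzero()
        bbox = ((int(np.min(nonzero[1])), int(np.min(nonzero[0]))),
                (int(np.max(nonzero[1])), int(np.max(nonzero[0]))))
        cv2.rectangle(img, bbox[0], bbox[1], (0, 0, 255), 6)
    return img

draw_img = draw_labeled_bboxes(np.copy(img), labels)
```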
A video stream is a series of images processed with the techniques above. To make detections between frames smoother, I built a simple historical heatmap queue to connect consecutive frames.
from collections import deque

history = deque(maxlen=8)
current_heatmap = np.clip(heat, 0, 255)
history.append(current_heatmap)
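A minimal sketch of how the queue could be combined per frame (the accumulation scheme and the threshold value of 8 are assumptions for illustration):

```python
def smoothed_heatmap(history):
    # Sum the heatmaps of the last few frames and apply a stronger threshold,
    # so a detection has to recur over several frames to survive
    combined = np.sum(np.array(history), axis=0)
    return threshold(combined, 8)
```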
YOLO (You Only Look Once) is a popular end-to-end, real-time object detection algorithm based on deep learning. Compared with other object recognition methods, such as Fast R-CNN, YOLO integrates region proposal and object classification into a single neural network. Its most outstanding point is its speed at reasonably high accuracy: nearly 45 fps for the base version and up to 155 fps for Fast YOLO, which makes it quite favourable for real-time applications such as computer vision for self-driving cars.
YOLO uses a unified single neural network, which makes full use of the whole image's information for both bounding box regression and classification. It divides the image into an S×S grid, and for each grid cell it predicts B bounding boxes, confidence scores for those boxes, and C class probabilities. The output is a 1470-dimensional vector containing class probabilities, confidences and box coordinates.
It predicts the following 20 classes:
classes = ["aeroplane", "bicycle", "bird", "boat", "bottle",
"bus", "car", "cat", "chair", "cow",
"diningtable", "dog", "horse", "motorbike", "person",
"pottedplant", "sheep", "sofa", "train", "tvmonitor"]
In this project, I used tiny YOLO v1 as it is easy to implement and impressively fast.
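Tiny YOLO v1 keeps the same 1470-dimensional output. With S=7, B=2 and C=20 it breaks down as 7×7×20 = 980 class probabilities, 7×7×2 = 98 box confidences and 7×7×2×4 = 392 box coordinates. A minimal sketch of slicing that vector (the ordering follows the common Keras tiny-YOLO ports and is an assumption here):

```python
import numpy as np

def split_yolo_output(output, S=7, B=2, C=20):
    # Slice the flat 1470-vector into class probabilities, confidences and boxes
    probs = output[0:S * S * C].reshape((S, S, C))                # 7*7*20  = 980
    confs = output[S * S * C:S * S * (C + B)].reshape((S, S, B))  # 7*7*2   = 98
    boxes = output[S * S * (C + B):].reshape((S, S, B, 4))        # 7*7*2*4 = 392
    return probs, confs, boxes
```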
The first step is to construct a convolutional neural network architecture with Keras, containing 9 convolution layers and 3 fully connected layers.
____________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
====================================================================================================
convolution2d_1 (Convolution2D) (None, 16, 448, 448) 448 convolution2d_input_1[0][0]
____________________________________________________________________________________________________
leakyrelu_1 (LeakyReLU) (None, 16, 448, 448) 0 convolution2d_1[0][0]
____________________________________________________________________________________________________
maxpooling2d_1 (MaxPooling2D) (None, 16, 224, 224) 0 leakyrelu_1[0][0]
____________________________________________________________________________________________________
convolution2d_2 (Convolution2D) (None, 32, 224, 224) 4640 maxpooling2d_1[0][0]
____________________________________________________________________________________________________
leakyrelu_2 (LeakyReLU) (None, 32, 224, 224) 0 convolution2d_2[0][0]
____________________________________________________________________________________________________
maxpooling2d_2 (MaxPooling2D) (None, 32, 112, 112) 0 leakyrelu_2[0][0]
____________________________________________________________________________________________________
convolution2d_3 (Convolution2D) (None, 64, 112, 112) 18496 maxpooling2d_2[0][0]
____________________________________________________________________________________________________
leakyrelu_3 (LeakyReLU) (None, 64, 112, 112) 0 convolution2d_3[0][0]
____________________________________________________________________________________________________
maxpooling2d_3 (MaxPooling2D) (None, 64, 56, 56) 0 leakyrelu_3[0][0]
____________________________________________________________________________________________________
convolution2d_4 (Convolution2D) (None, 128, 56, 56) 73856 maxpooling2d_3[0][0]
____________________________________________________________________________________________________
leakyrelu_4 (LeakyReLU) (None, 128, 56, 56) 0 convolution2d_4[0][0]
____________________________________________________________________________________________________
maxpooling2d_4 (MaxPooling2D) (None, 128, 28, 28) 0 leakyrelu_4[0][0]
____________________________________________________________________________________________________
convolution2d_5 (Convolution2D) (None, 256, 28, 28) 295168 maxpooling2d_4[0][0]
____________________________________________________________________________________________________
leakyrelu_5 (LeakyReLU) (None, 256, 28, 28) 0 convolution2d_5[0][0]
____________________________________________________________________________________________________
maxpooling2d_5 (MaxPooling2D) (None, 256, 14, 14) 0 leakyrelu_5[0][0]
____________________________________________________________________________________________________
convolution2d_6 (Convolution2D) (None, 512, 14, 14) 1180160 maxpooling2d_5[0][0]
____________________________________________________________________________________________________
leakyrelu_6 (LeakyReLU) (None, 512, 14, 14) 0 convolution2d_6[0][0]
____________________________________________________________________________________________________
maxpooling2d_6 (MaxPooling2D) (None, 512, 7, 7) 0 leakyrelu_6[0][0]
____________________________________________________________________________________________________
convolution2d_7 (Convolution2D) (None, 1024, 7, 7) 4719616 maxpooling2d_6[0][0]
____________________________________________________________________________________________________
leakyrelu_7 (LeakyReLU) (None, 1024, 7, 7) 0 convolution2d_7[0][0]
____________________________________________________________________________________________________
convolution2d_8 (Convolution2D) (None, 1024, 7, 7) 9438208 leakyrelu_7[0][0]
____________________________________________________________________________________________________
leakyrelu_8 (LeakyReLU) (None, 1024, 7, 7) 0 convolution2d_8[0][0]
____________________________________________________________________________________________________
convolution2d_9 (Convolution2D) (None, 1024, 7, 7) 9438208 leakyrelu_8[0][0]
____________________________________________________________________________________________________
leakyrelu_9 (LeakyReLU) (None, 1024, 7, 7) 0 convolution2d_9[0][0]
____________________________________________________________________________________________________
flatten_1 (Flatten) (None, 50176) 0 leakyrelu_9[0][0]
____________________________________________________________________________________________________
dense_1 (Dense) (None, 256) 12845312 flatten_1[0][0]
____________________________________________________________________________________________________
dense_2 (Dense) (None, 4096) 1052672 dense_1[0][0]
____________________________________________________________________________________________________
leakyrelu_10 (LeakyReLU) (None, 4096) 0 dense_2[0][0]
____________________________________________________________________________________________________
dense_3 (Dense) (None, 1470) 6022590 leakyrelu_10[0][0]
====================================================================================================
Total params: 45,089,374
Trainable params: 45,089,374
Non-trainable params: 0
____________________________________________________________________________________________________
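A minimal sketch of that architecture, written with the current Keras API rather than the Keras 1 layer names shown in the summary (assumptions: 3×3 kernels with 'same' padding, 2×2 max pooling, channels-first input of shape (3, 448, 448) and LeakyReLU(alpha=0.1), which reproduce the parameter counts above):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, LeakyReLU, Flatten, Dense

def build_tiny_yolo():
    model = Sequential()
    # Convolutional feature extractor: filter counts grow from 16 to 1024
    model.add(Conv2D(16, (3, 3), padding='same', data_format='channels_first',
                     input_shape=(3, 448, 448)))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2), data_format='channels_first'))
    for filters in [32, 64, 128, 256, 512]:
        model.add(Conv2D(filters, (3, 3), padding='same', data_format='channels_first'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2), data_format='channels_first'))
    for filters in [1024, 1024, 1024]:
        model.add(Conv2D(filters, (3, 3), padding='same', data_format='channels_first'))
        model.add(LeakyReLU(alpha=0.1))
    # Detection head: 3 fully connected layers ending in the 1470-way output
    model.add(Flatten())
    model.add(Dense(256))
    model.add(Dense(4096))
    model.add(LeakyReLU(alpha=0.1))
    model.add(Dense(1470))
    return model
```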
The next step is to load the pre-trained weights (172 MB, link) downloaded from the web, since training the network from scratch is really time-consuming.
After loading the weights, the detected bounding boxes can be drawn onto the images, and the detector is finally applied to the video-stream pipeline with a confidence threshold of 0.2.
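A minimal sketch of that filtering step, building on `split_yolo_output` and the `classes` list above (the grid-to-pixel conversion follows the YOLO v1 paper and is an illustrative assumption; the project's `helper_yolo.py` performs the full decoding):

```python
import cv2

def detect_cars(output, img, conf_threshold=0.2):
    # Keep only boxes whose class-specific score for 'car' exceeds the threshold
    probs, confs, boxes = split_yolo_output(output)
    car_idx = classes.index("car")
    h, w = img.shape[0], img.shape[1]
    for row in range(7):
        for col in range(7):
            for b in range(2):
                score = confs[row, col, b] * probs[row, col, car_idx]
                if score > conf_threshold:
                    x, y, bw, bh = boxes[row, col, b]
                    # (x, y) are relative to the grid cell; YOLO v1 predicts the
                    # square roots of the box width and height
                    cx = int((col + x) / 7.0 * w)
                    cy = int((row + y) / 7.0 * h)
                    box_w = int(bw * bw * w)
                    box_h = int(bh * bh * h)
                    cv2.rectangle(img, (cx - box_w // 2, cy - box_h // 2),
                                  (cx + box_w // 2, cy + box_h // 2), (0, 0, 255), 4)
    return img
```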
SVM reaches an acceptable detection accuracy; however, it has two shortcomings:
- Because of the heatmap thresholding, the bounding box is often unstable and smaller than the actual size of the detected car.
- The processing speed is only about 2 fps because of the sliding-window search. Even with GPU parallel computing, it is not suitable for real-time applications.
YOLO is preferable because its strengths address exactly these shortcomings:
- Better detection and a more stable bounding box position.
- Real-time processing speed, nearly 40 fps.
Note that YOLO has the following limitations:
- Only two boxes and one class are predicted per grid cell, which lowers the detection accuracy for adjacent objects.
- YOLO is learned from pre-trained data. As a result, it performs poorly on new objects or unusual viewing angles.
- Since YOLO and Fast R-CNN produce different types of false detections, the two models could be combined to improve performance.
- Apply YOLO to more videos to test classes other than cars.
- J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, You Only Look Once: Unified, Real-Time Object Detection, arXiv:1506.02640 (2015).
- darkflow
- yolo_tensorflow
- YOLO Introduction