
Appendix A

This appendix provides background material on convolutional neural networks.

Convolutional Neural Networks

Convolutional neural networks (CNNs) are specialized neural networks for processing data that has a grid-like structure. Unlike other types of neural networks, CNNs use the convolution operator in at least one of their layers. Use of the convolution operator provides three advantages that play significant roles in improving a machine learning system: sparse interactions, parameter sharing, and equivariant representations [1]. A CNN can be thought of as a series of layers. In this study, convolutional layers and downsampling layers are used to extract features, and a flatten layer is used to create vector forms of the feature maps before the classification part of the network architecture.

CNN Layers

Convolution Layer

The convolution layer is based on a discrete convolution process. Discrete convolution is given as follows:

$$
s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a)
$$

where $x$ is the input and $w$ is the kernel, which shifts over the input, combining the overlapping parts in the summation and excluding the rest. The input data of convolution layers are generally multidimensional arrays, and depending on the tensor shape the convolution operator can be implemented in higher dimensions. The two-dimensional convolution employed in our study is defined as

$$
S(i, j) = (I * K)(i, j) = \sum_{m}\sum_{n} I(m, n)\, K(i - m, j - n)
$$

where $I$ represents the two-dimensional input matrix and $K$ is a kernel matrix of size $k_1 \times k_2$. The main goal of using the convolution operator is to reduce the input image to its essential features. A feature map is produced by sliding the convolution filter over the input signal; the size of each slide is a hyperparameter known as the stride. The size of the feature map, i.e., the convolution layer output length in each dimension, is obtained with the following equation:

$$
o_i = \left\lfloor \frac{n_i - k_i}{s} \right\rfloor + 1, \qquad i = 1, \dots, D
$$

where $D$ is the number of dimensions, $n_i$ and $k_i$ represent the length of the input and of the kernel in dimension $i$, and $s$ is the value of the stride.
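A minimal NumPy sketch of this operation and of the output-size formula; the `conv2d` helper, the toy input, and the kernel are illustrative and not code from this repository. Like most deep learning libraries, it computes cross-correlation, i.e., convolution without flipping the kernel.

```python
import numpy as np

def conv2d(inp, kernel, stride=1):
    """Valid 2D cross-correlation, the operation implemented by CNN convolution layers."""
    n1, n2 = inp.shape
    k1, k2 = kernel.shape
    # Output length per dimension: floor((n - k) / s) + 1
    o1 = (n1 - k1) // stride + 1
    o2 = (n2 - k2) // stride + 1
    out = np.zeros((o1, o2))
    for i in range(o1):
        for j in range(o2):
            patch = inp[i * stride:i * stride + k1, j * stride:j * stride + k2]
            out[i, j] = np.sum(patch * kernel)   # weighted summation over the overlap
    return out

I = np.arange(25, dtype=float).reshape(5, 5)     # toy 5x5 input
K = np.array([[1.0, 0.0], [0.0, -1.0]])          # toy 2x2 kernel
print(conv2d(I, K).shape)                        # (4, 4) = ((5 - 2)//1 + 1, (5 - 2)//1 + 1)
```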

Activation functions

In neural networks, activation functions are used to introduce non-linearity when the output is generated from the input data. The activation functions employed in our study are described below.

  • Rectified Linear Unit (ReLU): It offers much faster learning than the sigmoid and hyperbolic tangent functions because of its simpler mathematical operations. Although it is continuous, it is not differentiable at zero.

  • Softmax function: It is a type of sigmoid function, and the softmax output can be considered a probability distribution over a finite set of outcomes [1]. Therefore it is used in the output layer of the proposed architecture, especially for multiclass classification problems (a small numerical sketch is given after this list):

    $$
    \sigma(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}
    $$

    where $\mathbf{z}$ is the input of the softmax, $k$ is the output index, and $K$ is the number of classes.
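A minimal NumPy sketch of the two activation functions above, assuming a vector of raw scores as input; the `relu` and `softmax` helpers are illustrative only.

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: element-wise max(0, z)."""
    return np.maximum(0.0, z)

def softmax(z):
    """Numerically stable softmax over a score vector."""
    z = z - np.max(z)          # shifting the scores does not change the result
    e = np.exp(z)
    return e / np.sum(e)

print(relu(np.array([-1.5, 0.0, 2.3])))   # [0.  0.  2.3]
scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))                    # class probabilities
print(softmax(scores).sum())              # 1.0
```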

Pooling layer

Another important part of CNNs is the pooling operation. A pooling layer does not include learnable parameters such as bias units or weights. Pooling operations decrease the size of the feature maps by applying a function, such as the average or the maximum, to each distinct region of the input. This helps the representation become slightly invariant to small translations of the input. A pooling layer also reduces the risk of over-fitting and the computational complexity [2].
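A minimal sketch of max pooling on a single feature map, assuming 2x2 regions; the `max_pool2d` helper is illustrative, and replacing `max` with `mean` gives average pooling.

```python
import numpy as np

def max_pool2d(fmap, size=2, stride=2):
    """Max pooling over strided regions of a 2D feature map."""
    h, w = fmap.shape
    oh = (h - size) // stride + 1
    ow = (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            region = fmap[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = region.max()      # no learnable parameters are involved
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(fmap))                   # 2x2 output: the maximum of each 2x2 block
```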

Flatten layer

A flatten layer is used between the feature extraction and classification sections to arrange the tensor shape. The output of the feature extraction part is usually a tensor with two or more dimensions, so it is reduced to a one-dimensional vector with the flatten layer to obtain a suitable input size for the dense layers.
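For illustration, flattening amounts to a reshape; the shapes below are assumptions of this sketch, not values from the study.

```python
import numpy as np

# e.g. 8 feature maps of size 4x4 coming out of the last pooling layer
feature_maps = np.random.default_rng(0).standard_normal((8, 4, 4))
flat = feature_maps.reshape(-1)           # one-dimensional vector of length 8 * 4 * 4 = 128
print(flat.shape)                         # (128,)
```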

Fully-connected layer

Fully-connected layers are also called dense layers and can be considered as convolution layers with a kernel of size $1 \times 1$. In a fully-connected layer, all units are connected to the units of the previous layer. The outputs of the previous layer are multiplied by a weight matrix and given as inputs to the units of the next layer. This process can be represented as follows:

$$
\mathbf{y} = W \mathbf{x} + \mathbf{b}
$$

where $\mathbf{y}$ is the output vector of the fully-connected layer, $\mathbf{x}$ is the input vector, $W$ denotes the matrix containing the weights of the connections between the neurons, and $\mathbf{b}$ represents the bias term vector.
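A minimal sketch of a dense layer as a matrix-vector product; the layer sizes are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, W, b, activation=None):
    """Fully-connected layer: y = W x + b, optionally followed by an activation."""
    y = W @ x + b
    return activation(y) if activation is not None else y

x = rng.standard_normal(128)              # flattened feature vector
W = rng.standard_normal((10, 128))        # weights for 10 output units (e.g. 10 classes)
b = np.zeros(10)                          # bias term vector
print(dense(x, W, b).shape)               # (10,)
```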

CNN Learning

CNNs consist of many layers and of connections between these layers; as a result, they contain many parameters that need to be tuned. The main purpose of training a CNN is to find the best values for these parameters, because they directly affect the classification performance. The learning ability of a CNN improves as its parameters are tuned. In the following, we explain the main parts of the CNN learning mechanism mathematically.

Cross-Entropy Loss Function

A loss function quantifies the difference between the estimated output of the model (the prediction) and the correct output (the ground truth) in order to drive convergence. In this study, we utilized the cross-entropy loss function for multi-class classification. The probability of each class is calculated with the softmax function defined above, and the cross-entropy loss for one instance is computed as follows:

$$
L(\mathbf{y}, \hat{\mathbf{p}}) = -\sum_{k=1}^{K} y_k \log\!\left(\hat{p}_k\right)
$$

where $\mathbf{y}$ is the target output vector, $K$ is the number of classes, and $\hat{p}_k$ is the estimated probability that the instance belongs to class $k$; $y_k$ is equal to 1 if the target class is $k$ and 0 otherwise.
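A minimal sketch of the loss for a single instance with a one-hot target; the small constant added inside the logarithm is a common numerical safeguard and an assumption of this sketch.

```python
import numpy as np

def cross_entropy(y_true, p_hat, eps=1e-12):
    """Cross-entropy loss for one instance given a one-hot target and predicted probabilities."""
    return -np.sum(y_true * np.log(p_hat + eps))

y_true = np.array([0.0, 0.0, 1.0])        # one-hot target: the instance belongs to class 2
p_hat = np.array([0.1, 0.2, 0.7])         # softmax output of the model
print(cross_entropy(y_true, p_hat))       # -log(0.7) ≈ 0.357
```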

Optimization

Optimization algorithms are used when a closed-form solution is not preferable, for example because of a singular matrix or a large dataset. In this study, the gradient-descent-based adaptive moment estimation (ADAM) algorithm is utilized for training the deep learning models. Gradient-descent-based optimization methods adjust the parameters iteratively to minimize the cost function, which is the sum of the loss functions over the training data set. When the cross-entropy loss is used, the cost function for multiclass classification problems is defined as follows:

$$
J(\Theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_k^{(i)} \log\!\left(\hat{p}_k^{(i)}\right)
$$

where $N$ is the number of instances, $\theta$ is the parameter vector, and $\Theta$ is the parameter matrix. Gradient descent measures the local gradient of the cost function with respect to the parameter vector and goes in the direction of the descending gradient until the algorithm converges to a minimum. An important hyperparameter that must be chosen carefully here is the learning rate, which specifies how large the parameter updates are. The updates depend on the gradient of the cost function, which is calculated at each iteration and for each unit. At the output layer, the gradient of the cost function can be written for class $k$ as follows:

$$
\nabla_{\theta_k} J(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left(\sigma\!\left(\mathbf{s}^{(i)}\right)_k - y_k^{(i)}\right) \mathbf{x}^{(i)}
$$

where $\mathbf{s}^{(i)}$ is the score vector regarding the $i$-th instance and $\sigma(\cdot)_k$ is the softmax probability of class $k$. The weights are then updated at each iteration according to the equations below:

$$
g_t = \frac{\partial J(W)}{\partial W}\bigg|_{W = W_t}, \qquad W_{t+1} = W_t - \eta\, g_t
$$

where $\partial$ indicates the partial derivative, $\eta$ is the learning rate, and $g_t$ is the gradient matrix of the weight matrix $W$ at iteration $t$. All weights are updated according to the chain rule in the backpropagation algorithm at each iteration, from the output unit back to the inputs.
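A minimal sketch of gradient descent on the softmax cross-entropy cost, using a randomly generated toy dataset; all sizes and the learning rate are assumptions of this sketch, not values from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

N, D, K = 100, 5, 3                       # instances, features, classes
X = rng.standard_normal((N, D))
Y = np.eye(K)[rng.integers(0, K, N)]      # one-hot targets

W = np.zeros((D, K))
eta = 0.1                                 # learning rate
for t in range(200):
    P = softmax(X @ W)                    # predicted class probabilities
    g = X.T @ (P - Y) / N                 # gradient of the cross-entropy cost
    W = W - eta * g                       # gradient descent update
print(round(-np.mean(np.sum(Y * np.log(P + 1e-12), axis=1)), 4))   # final cost
```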

Adaptive Moment Estimation

Adam can be defined as an adaptive learning rate method because of its capability to compute individual learning rates for different parameters. It computes the update from the gradient of the cost function by estimating the first moment (mean) and the second moment (variance) of the gradient. The estimates of the mean and the variance are calculated using the following equations:

$$
m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2
$$

where $g_t$ indicates the gradient of the cost function, and $\beta_1$ and $\beta_2$ are the decay values. After that, the weights are updated according to the following equation:

$$
W_{t+1} = W_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, m_t
$$

where $\eta / (\sqrt{v_t} + \epsilon)$ denotes the adaptive learning rate of each individual parameter and $\epsilon$ is a small term used to prevent division by zero.
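A minimal sketch of single ADAM parameter updates with the commonly used default hyperparameters; the bias-correction step in the comments is part of the published ADAM algorithm, although the simplified equations above omit it.

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update; returns the new weights and the updated moment estimates."""
    m = beta1 * m + (1 - beta1) * grad             # first moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2        # second moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)                   # bias-corrected estimates used in
    v_hat = v / (1 - beta2 ** t)                   # the original ADAM formulation
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step
    return w, m, v

target = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 1001):
    grad = 2 * (w - target)                        # gradient of the toy cost sum((w - target)**2)
    w, m, v = adam_step(w, grad, m, v, t, eta=0.1)
print(np.round(w, 2))                              # w approaches the minimizer [1., -2., 0.5]
```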

Backpropagation_mlp.png

Figure 1: Backpropagation algorithm on a two-layer neural network.

Backpropagation

Backpropagation is the iterative procedure that decreases the cost by adjusting the weights in sequence. Backpropagation itself does not update the weights of the model; the optimizer performs the adjustment, which depends on the gradient of the cost function. The backpropagation algorithm takes the partial derivative of the cost function with respect to the weights in accordance with the chain rule and propagates it back through the network from the outputs to the inputs. For the multiclass classification problem, we assume that the softmax function is used at the output layer and the sigmoid function at the hidden layer. The gradients are computed for one instance, following the neural network architecture in Figure 1, as follows. To update a weight at the second layer, the chain rule is:

$$
\frac{\partial L}{\partial w^{(2)}_{jk}} = \frac{\partial L}{\partial z_k}\, \frac{\partial z_k}{\partial w^{(2)}_{jk}}, \qquad z_k = \sum_{j} w^{(2)}_{jk}\, h_j + b^{(2)}_k
$$

where $b^{(2)}_k$ is the bias term and $h_j$ is the output of neuron $j$ at the hidden layer.

To update a weight at the first layer, the chain rule is:

$$
\frac{\partial L}{\partial w^{(1)}_{ij}} = \left(\sum_{k} \frac{\partial L}{\partial z_k}\, \frac{\partial z_k}{\partial h_j}\right) \frac{\partial h_j}{\partial a_j}\, \frac{\partial a_j}{\partial w^{(1)}_{ij}}, \qquad a_j = \sum_{i} w^{(1)}_{ij}\, x_i + b^{(1)}_j, \quad h_j = \mathrm{sigmoid}(a_j)
$$

where $x_i$ is the $i$-th input and $a_j$ is the pre-activation of hidden neuron $j$.
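A minimal NumPy sketch of these chain-rule computations for one instance, assuming the two-layer architecture of Figure 1 (a sigmoid hidden layer and a softmax output with cross-entropy loss); all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

D, H, K = 4, 6, 3                        # inputs, hidden units, classes
W1, b1 = 0.1 * rng.standard_normal((H, D)), np.zeros(H)
W2, b2 = 0.1 * rng.standard_normal((K, H)), np.zeros(K)
x = rng.standard_normal(D)               # one instance
y = np.array([0.0, 1.0, 0.0])            # one-hot target

# Forward pass
a = W1 @ x + b1                          # hidden pre-activations
h = sigmoid(a)                           # hidden outputs
z = W2 @ h + b2                          # output scores
p = softmax(z)                           # class probabilities

# Backward pass: chain rule from the output layer back to the inputs
dz = p - y                               # dL/dz for softmax + cross-entropy
dW2 = np.outer(dz, h)                    # second-layer weight gradients
db2 = dz                                 # second-layer bias gradients
dh = W2.T @ dz                           # error propagated to the hidden layer (sum over k)
da = dh * h * (1.0 - h)                  # sigmoid derivative: h * (1 - h)
dW1 = np.outer(da, x)                    # first-layer weight gradients
db1 = da                                 # first-layer bias gradients
print(dW1.shape, dW2.shape)              # (6, 4) (3, 6)
```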

Regularization

There are many parameters in the training phase of deep CNNs, and this sometimes causes over-fitting, meaning that the model is very successful on the training data but fails on new data. The regularization techniques used in our study are explained briefly in the following.

Batch Normalization: Batch normalization makes it possible to learn a more complex or flexible model by normalizing the mean and variance of the output activations. The distribution of the activations at each layer varies as the parameters are updated during training; batch normalization improves learning by reducing this internal covariate shift. In this way, the model is more resistant to problems such as vanishing and exploding gradients. As a first step, given a $d$-dimensional feature vector $\mathbf{x} = (x^{(1)}, \dots, x^{(d)})$, every feature is normalized as follows:

$$
\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}\!\left[x^{(k)}\right]}{\sqrt{\mathrm{Var}\!\left[x^{(k)}\right]}}
$$

After the feature normalization, batch normalization is defined over each mini-batch as follows:

$$
\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x_i - \mu_B\right)^2, \qquad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
$$

where $m$ represents the total number of instances in one batch, and $\mu_B$ and $\sigma_B^2$ are the mean and the variance of the batch, respectively.

All normalized features are then scaled and shifted, $y_i = \gamma \hat{x}_i + \beta$, with learnable parameters $\gamma$ and $\beta$, so that the normalization does not hinder the network's ability to utilize nonlinear transformations fully. Batch normalization allows much higher learning rates to be used and makes the model less sensitive to initialization. In some cases it also eliminates the need for dropout.
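A minimal sketch of batch normalization over one mini-batch; the batch size and feature count are assumptions of this sketch, and `gamma` and `beta` would be learnable parameters in practice.

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """Batch normalization for a mini-batch X of shape (m, d)."""
    mu = X.mean(axis=0)                      # per-feature batch mean
    var = X.var(axis=0)                      # per-feature batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)    # normalize each feature
    return gamma * X_hat + beta              # scale and shift

X = np.random.default_rng(0).standard_normal((32, 4)) * 5.0 + 10.0   # batch of 32, 4 features
gamma, beta = np.ones(4), np.zeros(4)
Y = batch_norm(X, gamma, beta)
print(np.round(Y.mean(axis=0), 3), np.round(Y.std(axis=0), 3))       # ~0 mean, ~1 std per feature
```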

Dropout: Dropout can be seen as a stochastic regularization technique. It prevents over-fitting and provides a way of approximately combining exponentially many different neural network architectures efficiently. Dropout blocks some of the outputs of the previous layer from being passed to the next layer, which can be considered as a mask applied to the cross-layer transitions. A unit in a layer to which dropout is applied must therefore learn the pattern from a randomly selected subset of the previous units' outputs. In this way, the hidden units are forced to extract valuable features, and the risk of memorizing the training data is reduced. The hyperparameter expressing the probability of the masking process is called the "dropout rate".
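A minimal sketch of dropout in its common "inverted" form, in which the surviving activations are rescaled so that their expected value is unchanged; the rate and shapes are assumptions of this sketch.

```python
import numpy as np

def dropout(h, rate=0.5, training=True, seed=None):
    """Inverted dropout: randomly zero activations and rescale the survivors."""
    if not training or rate == 0.0:
        return h
    rng = np.random.default_rng(seed)
    mask = rng.random(h.shape) >= rate       # keep each unit with probability 1 - rate
    return h * mask / (1.0 - rate)           # rescaling keeps the expected activation unchanged

h = np.ones(10)                              # outputs of the previous layer
print(dropout(h, rate=0.5, seed=0))          # roughly half of the units are masked to zero
```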

References

[1] I. Goodfellow, Y. Bengio and A. Courville, “Deep Learning”, MIT Press, 2016.

[2] S. Raschka, “Python Machine Learning”, Packt Publishing, 2015.
