• The CNN then computes the output probabilities for each class in the training dataset using the score function, which maps the raw data to class scores.

The score function for a linear classifier is $f(x_i, W, b) = W x_i + b$,

where $x_i$ denotes the input image. The matrix $W$ holds the weights associated with the neurons, and the vector $b$ holds the biases associated with the neurons. $W$ and $b$ are the parameters of the score function.

For the image classification task, the input to the score function is an image $x_i$, for which the score function computes the vector $f(x_i, W)$ of raw class scores, denoted by $s$. Therefore, given an image $x_i$, the predicted score for the $j$th class is the $j$th element of $s$, given by $s_j = f(x_i, W)_j$.
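The score function described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `linear_scores`, the 4-value input, and the 3-class weight matrix are hypothetical and do not come from the original text.

```python
def linear_scores(x, W, b):
    """Return the raw class scores s = W x + b for one flattened input x."""
    return [sum(w_jk * x_k for w_jk, x_k in zip(row, x)) + b_j
            for row, b_j in zip(W, b)]

# Hypothetical toy problem (assumption): a 4-value "image" and 3 classes.
x_i = [0.5, -1.0, 2.0, 0.0]
W = [[0.1, 0.2, -0.3, 0.4],    # one row of weights per class
     [0.0, -0.5, 0.25, 0.1],
     [1.0, 0.0, 0.0, -1.0]]
b = [0.0, 0.1, -0.2]           # one bias per class

s = linear_scores(x_i, W, b)   # vector of raw class scores
# The predicted class is the index j with the largest score s_j.
predicted_class = max(range(len(s)), key=lambda j: s[j])
```

Each row of `W` produces one class score, so the length of `s` equals the number of classes.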

The class scores obtained from the training data will be used to compute the loss.

• When training begins, the weights are initialized randomly. As a result, the output probabilities are also random.

Step 3: Loss function

The class scores, which are initially random, are used to compute the loss function, which measures how well the predicted scores match the ground-truth labels in the training data.

The loss function is also called the cost function or the objective. It quantifies how unsatisfactory the predicted scores output by the score function are. Intuitively, if the predicted scores closely match the training data labels, the loss is low; otherwise, the loss is high.

Softmax Classifier

The Softmax classifier uses the cross-entropy loss, also called the softmax loss. The softmax function can be written as: $f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}$

The input to the softmax function is a vector of real-valued scores $z$. The function squashes the scores into a vector of values between zero and one that sum to one.
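A minimal sketch of the softmax function follows. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is not mentioned in the original text; it is added here as an assumption and does not change the result.

```python
import math

def softmax(z):
    """Map real-valued scores z to positive values that sum to one."""
    z_max = max(z)  # assumption: shift by the max to avoid overflow in exp
    exps = [math.exp(z_j - z_max) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw class scores for a 3-class problem.
probs = softmax([-0.75, 1.1, 0.3])
```

The largest score receives the largest probability, and the relative ordering of the classes is preserved.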

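For the $i$th example in the data, the cross-entropy loss is given by:

$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$

where $f_j$ is the $j$th element of the vector of class scores $f$, and $y_i$ is the index of the correct class for the $i$th example.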

The softmax function normalizes the raw class scores $s$ into positive values that add up to one, to which the cross-entropy loss can be applied.
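The per-example cross-entropy loss can be sketched as below, assuming the standard form in which the loss is the negative log of the softmax probability assigned to the correct class. The name `cross_entropy_loss` and the argument `y_i` (index of the correct class) are illustrative assumptions.

```python
import math

def cross_entropy_loss(f, y_i):
    """Softmax cross-entropy loss for one example.

    f   : list of raw class scores
    y_i : index of the correct class (assumed notation)
    """
    f_max = max(f)  # shift for numerical stability
    log_sum_exp = f_max + math.log(sum(math.exp(f_j - f_max) for f_j in f))
    # Equivalent to -log(softmax(f)[y_i]).
    return log_sum_exp - f[y_i]

scores = [-0.75, 1.1, 0.3]                      # hypothetical class scores
loss_if_correct = cross_entropy_loss(scores, 1)  # true class has the top score
loss_if_wrong = cross_entropy_loss(scores, 0)    # true class has the lowest score
```

The loss is small when the correct class already has the highest score and grows as that class is scored lower.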

The total error at the output layer, summed over all 10 classes considered in the problem, is given by:

$\text{Total Error} = \sum \frac{1}{2} (\text{target probability} - \text{output probability})^2$
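As a small worked example of the total error formula, the sketch below assumes a one-hot target vector over 10 classes and an untrained network that outputs a uniform probability of 0.1 for every class. Both vectors are hypothetical.

```python
def total_error(target, output):
    """Sum of 0.5 * (target - output)^2 over all classes."""
    return sum(0.5 * (t - o) ** 2 for t, o in zip(target, output))

# Assumption: class 3 is the correct class, encoded as a one-hot target.
target = [0.0] * 10
target[3] = 1.0
# Assumption: an untrained network that spreads probability evenly.
output = [0.1] * 10

err = total_error(target, output)
```

The nine incorrect classes each contribute 0.5 × 0.1² and the correct class contributes 0.5 × 0.9², so the error is dominated by the class the network should have predicted.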

The complete loss for the dataset is the mean of $L_i$ over all training examples, together with a regularization term, $R(W)$.

$L = \frac{1}{N} \sum_i L_i + \lambda R(W)$

where $N$ is the total number of images in the training set.

$\lambda$ is the regularization strength, which is a hyperparameter of the network.
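The complete loss can be sketched as follows. The original text does not say which regularization term $R(W)$ is used; the sketch assumes the common L2 penalty (sum of squared weights), and the names `l2_penalty`, `full_loss`, and `reg_strength` are hypothetical.

```python
def l2_penalty(W):
    """Assumed form of R(W): the sum of all squared weights."""
    return sum(w * w for row in W for w in row)

def full_loss(per_example_losses, W, reg_strength):
    """Mean of the per-example losses plus reg_strength * R(W)."""
    N = len(per_example_losses)
    data_loss = sum(per_example_losses) / N
    return data_loss + reg_strength * l2_penalty(W)

# Hypothetical per-example losses L_i and a small weight matrix.
W = [[1.0, -2.0],
     [0.5, 0.0]]
L = full_loss([0.2, 1.0, 0.6], W, reg_strength=0.1)
```

A larger `reg_strength` penalizes large weights more heavily, trading some fit to the training data for smaller weights.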

The loss function makes it possible to evaluate the quality of any particular set of parameters used in the model. The aim is to minimize the loss so that the model achieves better accuracy.

Step 4: Optimization

Optimization is the process of finding the set of model parameters that minimizes the total loss. The core principle of optimization techniques is to compute the gradient of the loss with respect to the parameters of the model. The gradient of a function gives the direction of steepest ascent, so the parameters are moved in the opposite direction to reduce the loss. An efficient way to compute the gradient is to apply the chain rule recursively. This method is called “Backpropagation”, and it provides an efficient way to optimize arbitrary loss functions. These loss functions may express different kinds of network architectures (e.g. fully connected neural networks, convolutional networks, etc.).

The backpropagation algorithm is used to compute the gradients of the error with respect to all weights in the network, and gradient descent then uses these gradients to update all filter values, weights, and other parameter values so as to minimize the output error.
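To make the chain rule concrete, the sketch below computes the gradient of the softmax cross-entropy loss with respect to the weights of a single linear layer and compares it against a numerical estimate. It assumes a linear classifier rather than a full CNN, and all function names and values are hypothetical.

```python
import math

def scores(x, W, b):
    return [sum(w * xk for w, xk in zip(row, x)) + bj for row, bj in zip(W, b)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    t = sum(e)
    return [v / t for v in e]

def loss(x, y, W, b):
    return -math.log(softmax(scores(x, W, b))[y])

def analytic_grad_W(x, y, W, b):
    """Chain rule: dL/dW[j][k] = dL/ds[j] * ds[j]/dW[j][k] = (p[j] - 1[j == y]) * x[k]."""
    p = softmax(scores(x, W, b))
    ds = [p_j - (1.0 if j == y else 0.0) for j, p_j in enumerate(p)]
    return [[ds_j * x_k for x_k in x] for ds_j in ds]

def numerical_grad_W(x, y, W, b, h=1e-5):
    """Central-difference estimate, used only to check the analytic gradient."""
    grad = [[0.0] * len(row) for row in W]
    for j in range(len(W)):
        for k in range(len(W[j])):
            old = W[j][k]
            W[j][k] = old + h
            plus = loss(x, y, W, b)
            W[j][k] = old - h
            minus = loss(x, y, W, b)
            W[j][k] = old  # restore the original weight
            grad[j][k] = (plus - minus) / (2 * h)
    return grad

# Hypothetical input, label, and parameters.
x, y = [0.5, -1.0, 2.0], 2
W = [[0.1, 0.2, -0.3], [0.0, -0.5, 0.25], [1.0, 0.0, 0.0]]
b = [0.0, 0.1, -0.2]

max_diff = max(abs(a - n)
               for ra, rn in zip(analytic_grad_W(x, y, W, b),
                                 numerical_grad_W(x, y, W, b))
               for a, n in zip(ra, rn))
```

In a deeper network the same idea is applied layer by layer, with each layer passing the gradient of the loss with respect to its input back to the layer before it.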

Step 5: Parameter Updates

After the analytic gradient has been computed using backpropagation, the resulting gradients are used to perform the parameter update. Various approaches are available for performing the parameter update: SGD (Stochastic Gradient Descent), SGD+Momentum, Nesterov Momentum, Adagrad, RMSprop, Adam, etc.
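Two of the update rules listed above, plain SGD and SGD+Momentum, can be sketched as follows. The toy loss $L(w) = \sum_i w_i^2$, the learning rate, and the momentum coefficient `mu` are assumptions chosen only to make the example runnable; they are not taken from the original text.

```python
def sgd_step(w, grad, lr):
    """Plain gradient descent: step against the gradient."""
    return [w_i - lr * g_i for w_i, g_i in zip(w, grad)]

def momentum_step(w, v, grad, lr, mu=0.9):
    """SGD with momentum: the velocity v accumulates past gradients."""
    v = [mu * v_i - lr * g_i for v_i, g_i in zip(v, grad)]
    w = [w_i + v_i for w_i, v_i in zip(w, v)]
    return w, v

def grad_fn(w):
    """Gradient of the assumed toy loss L(w) = sum(w_i^2)."""
    return [2.0 * w_i for w_i in w]

w_sgd = [1.0, -2.0]
w_mom, v = [1.0, -2.0], [0.0, 0.0]
for _ in range(100):
    w_sgd = sgd_step(w_sgd, grad_fn(w_sgd), lr=0.1)
    w_mom, v = momentum_step(w_mom, v, grad_fn(w_mom), lr=0.1)
```

Both variants move the parameters toward the minimum at zero; the other listed methods differ mainly in how they scale or accumulate the gradient before applying it.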

• The weights are adjusted in proportion to their contribution to the total error.

• When the same image is input again, the output probabilities are closer to the target vector. This indicates that the network has learnt to classify this particular image correctly by fine-tuning its weights/filters and thereby decreasing the output error.

• The filter values used in the convolution operation and the connection weights are the two sets of parameters that are updated during the training process, while the number of filters, the filter size, and the architecture remain unchanged.
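The effect described in the points above can be illustrated with a small sketch: after a few gradient-descent updates on one example, the probability assigned to the correct class rises, while the shape of the weight matrix stays fixed. A single linear layer stands in for the CNN here, and the input, label, learning rate, and number of steps are all hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    t = sum(e)
    return [v / t for v in e]

def forward(x, W, b):
    """Class probabilities for one input under a linear score function."""
    return softmax([sum(w * xk for w, xk in zip(row, x)) + bj
                    for row, bj in zip(W, b)])

def train_step(x, y, W, b, lr):
    """One gradient-descent update of W and b, in place, for one example."""
    p = forward(x, W, b)
    ds = [p_j - (1.0 if j == y else 0.0) for j, p_j in enumerate(p)]
    for j, ds_j in enumerate(ds):
        for k, x_k in enumerate(x):
            W[j][k] -= lr * ds_j * x_k
        b[j] -= lr * ds_j

# Hypothetical example: input x with correct class y = 2.
x, y = [0.5, -1.0, 2.0], 2
W = [[0.1, 0.2, -0.3], [0.0, -0.5, 0.25], [1.0, 0.0, 0.0]]
b = [0.0, 0.1, -0.2]

shape_before = (len(W), len(W[0]))
p_before = forward(x, W, b)[y]
for _ in range(20):
    train_step(x, y, W, b, lr=0.1)
p_after = forward(x, W, b)[y]
shape_after = (len(W), len(W[0]))
```

Only the values inside `W` and `b` change during training; their dimensions, which play the role of the fixed architecture, do not.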