Recognizing characters and digits in documents such as photographs captured at street level is an important factor in developing digital maps. For example, Google Street View includes millions of geo-located images. By recognizing the text in these images, we can build more precise maps, which in turn improves navigation services.
Although conventional character classification is largely considered solved in computer vision, recognizing digits or characters in natural scenes such as photographs remains a complex problem. The difficulties stem from non-contrasting backgrounds, low resolution, blurred images, font variation, lighting, and similar factors. The traditional approach was a two-step process: first slice the image to isolate each character, then perform recognition on each extracted image. This used to be done with multiple hand-crafted features and template matching [1].
The main purpose of this project is to recognize street view house numbers using a deep convolutional neural network. For this work, I considered the digit classification dataset of house numbers extracted from street-level images [5]. This dataset is similar in flavor to the MNIST dataset but has more labeled data. It contains more than 600,000 digit images with color information and various natural backgrounds [5]. To achieve the goal, I developed an application that detects the number in an image. A convolutional neural network with multiple layers is trained on the dataset and used to detect the house-number digits.
I used a traditional convolutional architecture with different pooling methods and multistage features and reached almost 92% accuracy. Street view number detection is a natural scene text recognition problem, which is quite different from printed-character or handwriting recognition. Research in this field started in the 1990s, but the problem is still considered unsolved.
As mentioned earlier, the difficulties arise from variation in fonts, scales, and rotations, from low light, and so on. Earlier work treated natural scene text identification sequentially: character classification by sliding window or connected components came first [4], and word prediction was then performed by applying the character classifier from left to right. More recently, segmentation methods guided by a supervised classifier have been used, in which words are recognized through a sequential beam search [4]. However, none of these approaches solves the street view recognition problem. In recent work, convolutional neural networks have proven more accurate at object recognition tasks [4].
Some research has applied CNNs to scene text recognition tasks [4]. Studies show that CNNs have a large capacity to represent the many kinds of character variation found in natural scenes and to cope with this high variability. Work on convolutional neural networks started in the early 1980s, and they were successfully applied to handwritten digit recognition in the 1990s [4]. With recent developments in computing resources, training sets, advanced algorithms, and dropout training, deep convolutional neural networks have become more effective at recognizing natural scene digits and characters [3]. Previously, CNNs were used mainly to detect a single object in an input image, and it was quite difficult to isolate each character in an image and identify it.
Goodfellow et al. solved this problem by using a large, deep CNN to model the whole image directly, with a simple graphical model as the top inference layer [4]. The rest of the paper is organized as follows: Section III describes the convolutional neural network architecture, Section IV presents the experiment, results, and discussion, and Section V covers future work and the conclusion. A convolutional neural network (CNN) is a multilayer network designed to handle complex, high-dimensional data; its overall architecture resembles that of a typical neural network [8]. Each layer contains neurons that carry weights and biases. The network takes images as input, and this assumption makes the forward computation more efficient to implement and reduces the number of parameters in the network [7].
The first layer is a convolutional layer. Here the input is convolved with a set of filters to extract features. The size of the resulting feature maps depends on three parameters: the number of filters, the stride, and the padding. Each convolutional layer is followed by a non-linear operation, ReLU.
ReLU converts all negative values to zero. Next comes the pooling, or sub-sampling, layer, which reduces the size of the feature maps. Pooling can be of different types: max, average, or sum.
Max pooling is the most commonly used. Down-sampling also helps control overfitting. The outputs of the convolution and pooling layers together form the feature extractor.
The feature extractor retrieves selective features from the input images. These features are then passed to the fully connected layers (FCL) and the output layer. In a CNN, the output of each layer serves as the input to the next. The exact architecture differs from one type of problem to another.
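The ReLU and max pooling operations described above can be illustrated with a minimal sketch in plain Python. The 4x4 feature map and the 2x2 pooling window are assumptions chosen for illustration; the sketch is not the code used in the experiment.

```python
def relu(feature_map):
    """Replace every negative value with zero, element by element."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Down-sample by keeping the largest value in each 2x2 block.

    Assumes the height and width are even (a simplification of this sketch).
    """
    pooled = []
    for r in range(0, len(feature_map), 2):
        pooled_row = []
        for c in range(0, len(feature_map[0]), 2):
            block = (
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            pooled_row.append(max(block))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4x4 feature map produced by a convolution.
feature_map = [
    [1.0, -2.0, 3.0, 0.5],
    [-1.0, 4.0, -0.5, 2.0],
    [0.0, -3.0, -1.0, -2.0],
    [2.5, 1.0, -4.0, -0.1],
]

activated = relu(feature_map)     # negatives become 0.0
pooled = max_pool_2x2(activated)  # 4x4 map shrinks to 2x2
print(pooled)
```

The pooled map has half the width and half the height of its input, which is the size reduction the text attributes to the sub-sampling layer.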
The main objective of this project is to detect and identify house-number signs in street view images. The dataset considered for this project is the Street View House Numbers (SVHN) dataset taken from [5], which is similar to the MNIST dataset. The SVHN dataset has more than 600,000 labeled characters, and the images are in .png format.
After extracting the dataset, I resized all images to 32x32 pixels with three color channels. There are 10 classes, one for each digit. Digit ‘1’ is labeled 1, ‘9’ is labeled 9, and ‘0’ is labeled 10 [5]. The dataset is divided into three subsets: a train set, a test set, and an extra set. The extra set is the largest, containing 531,131 images.
Correspondingly, the train set has 73,257 images and the test set has 26,032 images. Figure 3 shows examples of the original, variable-resolution, color house-number images, in which each digit is marked by a bounding box. The bounding box information is stored in the digitStruct.mat file rather than drawn directly on the images in the dataset.
The digitStruct.mat file contains a struct called digitStruct whose length equals the number of original images. Each element of digitStruct has the following fields: “name”, a string containing the filename of the corresponding image, and “bbox”, a struct array containing the position, size, and label of each digit bounding box in the image. For example, digitStruct(100).bbox(1).height is the height of the 1st digit bounding box in the 100th image [5].
It is clear from Figure 3 that most house-number signs in the SVHN dataset are printed signs and are easy to read [2]. However, the large variation in font, size, and color makes detection very difficult. The variation in resolution is also large (median: 28 pixels; max: 403 pixels; min: 9 pixels) [2].
The graph below shows the large variation in character heights, measured as the height of the bounding box in the original street view dataset. In other words, the size, placement, and resolution of the characters are not evenly distributed across the dataset.
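To make the digitStruct layout and the height statistics more concrete, here is a small sketch that mirrors the struct as nested Python dictionaries and summarizes bounding-box heights. The field names other than "name", "bbox", and "height" are assumptions, and every value is a toy number rather than data from SVHN. Python indexing is 0-based, so the MATLAB expression digitStruct(100).bbox(1).height would correspond to index 99 and index 0 here.

```python
from statistics import median

# Toy stand-in for digitStruct: one entry per image (all values are made up).
digit_struct = [
    {"name": "example_a.png", "bbox": [
        {"left": 10, "top": 5, "width": 12, "height": 30, "label": 1},
        {"left": 24, "top": 6, "width": 12, "height": 28, "label": 9},
    ]},
    {"name": "example_b.png", "bbox": [
        {"left": 3, "top": 2, "width": 6, "height": 12, "label": 2},
        {"left": 10, "top": 2, "width": 6, "height": 12, "label": 10},  # label 10 encodes digit 0
    ]},
    {"name": "example_c.png", "bbox": [
        {"left": 40, "top": 20, "width": 18, "height": 40, "label": 7},
    ]},
]

# Height of the first digit's bounding box in the first image.
first_height = digit_struct[0]["bbox"][0]["height"]

# Collect every digit height and summarize the spread.
heights = [box["height"] for image in digit_struct for box in image["bbox"]]
summary = {"min": min(heights), "median": median(heights), "max": max(heights)}
print(first_height, summary)
```

The same three statistics (minimum, median, maximum) are the ones the text reports for the real dataset.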
Because the data are not uniformly distributed, correct house-number detection is difficult. In my experiment, I train a multilayer CNN for street view house number recognition and check its accuracy on the test data. The code is written in Python using TensorFlow, a powerful library for implementing and training deep neural networks. The central unit of data in TensorFlow is the tensor. A tensor consists of a set of primitive values shaped into an array of any number of dimensions. A tensor’s rank is its number of dimensions [9].
Along with TensorFlow, I used other libraries such as NumPy, Matplotlib, and SciPy. I performed my analysis using only the train and test sets because of limited technical resources, and omitted the extra set, which is almost 2.7 GB. To simplify the analysis, I deleted all data points that have more than 5 digits. Preprocessing the original SVHN dataset produces a pickle file, which is used in my experiment.
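The filtering and pickling step described above might look roughly like the following sketch. The record layout (a dictionary with "name" and "digits" keys) and the sample values are hypothetical, since the text does not describe the actual preprocessing code; only the 5-digit threshold comes from the text.

```python
import pickle

MAX_DIGITS = 5  # threshold stated in the text


def drop_long_numbers(samples, max_digits=MAX_DIGITS):
    """Keep only samples whose house number has at most `max_digits` digits."""
    return [sample for sample in samples if len(sample["digits"]) <= max_digits]


# Hypothetical samples; real ones would also carry the image pixels.
samples = [
    {"name": "a.png", "digits": [1, 9]},
    {"name": "b.png", "digits": [2, 0, 4, 8, 7, 3]},  # 6 digits, so it is dropped
    {"name": "c.png", "digits": [5, 5, 1, 0, 2]},
]
kept = drop_long_numbers(samples)

# Serialize to bytes; pickle.dump(kept, file_object) would write the same data to disk.
blob = pickle.dumps(kept, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)
print([sample["name"] for sample in restored])
```

Saving the preprocessed data once means later training runs can load it directly instead of repeating the preprocessing.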
For the implementation, I randomly shuffled the validation dataset, then loaded the pickle file and trained a 7-layer convolutional neural network. The first convolution layer has 16 feature maps with 5x5 filters and produces a 28x28x16 output. A ReLU layer is added after each convolution layer to add non-linearity to the decision-making process. After the first sub-sampling, the output size decreases to 14x14x16, since pooling does not change the number of feature maps. The second convolution uses 5×5 filters and produces a 10x10x32 output, that is, 32 feature maps; the figure of 512 quoted for this layer is consistent with the 16 × 32 individual kernels that connect its 16 input maps to its 32 output maps.
Applying sub-sampling a second time gives an output size of 5x5x32. Finally, the third convolution uses the same filter size; the figure of 2048 quoted for it would correspond to 32 × 64 individual kernels, that is, 64 feature maps. The stride is 1 throughout my experiment, with no padding (a padding of zero), so each 5x5 convolution shrinks the width and height by 4. During the experiment, I used dropout to reduce overfitting. Finally, a softmax regression layer produces the final output. Weights are initialized randomly using Xavier initialization, which keeps the weights in the right range.
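The layer sizes in the walkthrough above can be checked with a short calculation. The sketch below uses the standard output-size formula for a convolution, with the stride of 1 and padding of 0 stated in the text; the 2x2 non-overlapping pooling window is an assumption inferred from the halving of the sizes, and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution over a square input."""
    return (size - kernel + 2 * padding) // stride + 1


def pool_out(size, window=2):
    """Output width/height of non-overlapping pooling (window equals stride)."""
    return size // window


# (layer name, kind, number of output feature maps)
layers = [
    ("conv1", "conv", 16),
    ("pool1", "pool", 16),  # pooling keeps the number of feature maps
    ("conv2", "conv", 32),
    ("pool2", "pool", 32),
]

size = 32  # input images are 32x32
shapes = {}
for name, kind, depth in layers:
    size = conv_out(size, kernel=5) if kind == "conv" else pool_out(size)
    shapes[name] = (size, size, depth)

for name, shape in shapes.items():
    print(name, shape)
```

A third 5x5 convolution applied to the 5x5 maps gives `conv_out(5, 5) == 1`, so its spatial output is 1x1.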
Xavier initialization automatically scales the initial weights based on the number of input and output neurons. Once the model is built, I start training the network and log the accuracy, loss, and validation accuracy every 500 steps. Once training is done, I compute the test set accuracy. The Adagrad optimizer is used to minimize the loss. After reaching a suitable accuracy level, I stop training and save the learned parameters in a checkpoint file. When detection is needed, the program loads the checkpoint file without training the model again.
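The scaling rule behind the Xavier initialization mentioned above can be sketched in plain Python. This shows the uniform variant, in which weights are drawn from a range whose width depends on the number of input and output neurons; whether the experiment used the uniform or the normal variant is not stated, so that choice and the layer sizes below are assumptions.

```python
import math
import random


def xavier_limit(fan_in, fan_out):
    """Half-width of the uniform range: sqrt(6 / (fan_in + fan_out))."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix from U(-limit, limit)."""
    limit = xavier_limit(fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical layer sizes: a small layer and a wider one.
small_weights = xavier_uniform(64, 10, rng)
wide_weights = xavier_uniform(400, 64, rng)

print(round(xavier_limit(64, 10), 4), round(xavier_limit(400, 64), 4))
```

Layers with more connections get a narrower range, which keeps the scale of the activations roughly constant from layer to layer.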
Initially, the model reached an accuracy of 89% after just 3000 steps. That is a good starting point, and after a few more rounds of training the accuracy can be expected to reach 90%. However, I added some features to increase accuracy further. First, I added a dropout layer between the third convolution layer and the fully connected layer. This makes the network more robust and helps prevent overfitting.
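As an illustration of what a dropout layer does during training, the following sketch implements the common "inverted dropout" formulation in plain Python. The keep probability of 0.5 is an assumption, since the text does not give the value used, and the function is a simplified stand-in rather than the layer used in the experiment.

```python
import random


def dropout(activations, keep_prob, rng, training=True):
    """Inverted dropout over a flat list of activations.

    During training, each unit is zeroed with probability 1 - keep_prob and
    the survivors are scaled by 1 / keep_prob, so the expected value of each
    unit is unchanged. At inference time the input passes through untouched.
    """
    if not training:
        return list(activations)
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10000

dropped = dropout(activations, keep_prob=0.5, rng=rng)
mean_after = sum(dropped) / len(dropped)  # stays close to the original mean of 1.0

unchanged = dropout(activations[:5], keep_prob=0.5, rng=rng, training=False)
print(round(mean_after, 2), unchanged)
```

Because a different random subset of units is silenced at every step, the network cannot rely on any single unit, which is the source of the robustness mentioned above.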
Second, I introduced exponential decay of the learning rate, with an initial rate of 0.05. The rate decays every 10,000 steps with a base of 0.95. This lets the network take bigger steps at first, so that it learns fast, and smaller steps over time as it moves closer to a minimum of the loss. With these changes, the model reaches an accuracy of 91.9% on the test set. Since the training and test sets are large, further improvement is possible if the model is trained for longer.
During my analysis, I reached an accuracy level of almost 92%. After the first round of training, the accuracy was 89%.
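The learning-rate schedule described above (initial rate 0.05, base 0.95, decay every 10,000 steps) can be written as a one-line formula. The sketch below assumes a "staircase" schedule in which the rate drops in discrete jumps every 10,000 steps; the text does not say whether the decay was stepped or continuous, so both are supported, and the function name is hypothetical.

```python
def decayed_learning_rate(step, initial_rate=0.05, decay_steps=10_000,
                          decay_rate=0.95, staircase=True):
    """Exponential decay: initial_rate * decay_rate ** (step / decay_steps).

    With staircase=True the exponent is an integer, so the rate changes
    only once every `decay_steps` steps.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_rate * decay_rate ** exponent


for step in (0, 5_000, 10_000, 20_000, 100_000):
    print(step, decayed_learning_rate(step))
```

Under the staircase assumption the rate stays at 0.05 for the first 10,000 steps, then becomes 0.05 × 0.95 = 0.0475, then 0.05 × 0.95² = 0.045125, and so on.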
After several rounds of training, it reached 92%. As mentioned earlier, the saved checkpoint file can be restored later to continue training or to detect digits in new images. The use of dropout gives some confidence that the model generalizes and can predict most images. The model was tested over a wide range of inputs from the test dataset, and it recognizes the house numbers in most of the images.
Figure 5 shows that the model correctly recognizes seven out of ten house numbers. However, the model still gives incorrect output when the images are blurry or contain other noise. Because of limited resources, I trained the model only a few times, as each run takes a long time. I believe there is a strong possibility of increasing the accuracy by working with the whole dataset. Better hardware and a GPU would also let the model run faster. In this experiment, I proposed a multi-layer deep convolutional neural network to recognize street view house numbers.
Using the train and test subsets of a dataset of more than 600,000 images, the model achieved almost 92% accuracy. The analysis makes it evident that the model produces correct output for most images. However, detection may fail if the image is blurry or contains noise. The most exciting aspect of the project was discovering how techniques such as dropout and exponential learning rate decay perform on real data.
Because many variations of the CNN architecture can be implemented, it is very difficult to know which architecture will work best for a specific dataset. Determining the most appropriate CNN architecture was a very challenging aspect of this experiment. The model implemented in this project is relatively simple, but it does the job well and is quite robust. However, some work remains to optimize the accuracy. As future work, I will extend my experiment to other CNN architectures, along with hybrid techniques and algorithms, and try to find out which one gives better accuracy at minimum cost and with lower loss. I will also try to incorporate the whole dataset in the next experiment.