Text Extraction and Transcription System Using CRNN and EAST For Automatic Multiple Text Transcription From Images

NOTE: DOWNLOAD THE FIRST DATASET REQUIRED FROM HERE! THE SECOND DATASET IS NOT AVAILABLE YET SO MODIFY THE CODE ACCORDINGLY! DOWNLOAD THE EAST MODEL HERE!

A Text Extraction and Transcription System created in 2023 as a project for my Machine Learning & Pattern Recognition course at the British University in Dubai

Methodology

The methodology proposed for this project is organized around the different objectives of text recognition. Three components need to be addressed and implemented: text detection, optical character recognition, and sequence recognition. For text detection, a model is required that can detect text in an image. Text detection is essential for isolating text from the rest of an image, much like object detection. This can be achieved in multiple ways, such as segmentation or convolutional neural networks. However, since text must be extracted from real-world environments, simple segmentation using image manipulation is not enough because of the sheer amount of noise present in such scenes. For optical character recognition, a model is required that takes an image of a single character and classifies which English character it represents. Optical character recognition is needed to recognize each letter in a sentence, and it can be achieved with a convolutional neural network trained on labeled data for every character in the English language. For sequence recognition, a model is required that recognizes sentences as sequences of characters; this can be implemented using recurrent neural networks built on Long Short-Term Memory.

With that in mind, a system of neural networks is required to achieve the desired outcome. The system works as a chain of parts that pass input along. First, the text detection model extracts the text from the image and isolates it from the rest of the image. Then, the sentences are recognized as sequences by the sequence recognition model. Finally, using optical character recognition, each character in the sequence is classified into its corresponding class, which is the English character it represents. Further additions, such as a spell-checking and grammar model, could provide more accurate real-world results but are not required.

For this implementation, Python is used as the development language, and a number of packages are required for the proper development of the system. These include, but are not limited to, OpenCV, NumPy, TensorFlow, Keras, Matplotlib, Skimage, Visualkeras, and Pillow. The most important are TensorFlow and Keras, which go hand in hand when building neural networks and provide the classes and functions needed to construct them. OpenCV, NumPy, and Skimage are used for storing, manipulating, and processing data: OpenCV and Skimage mainly handle image data, while NumPy handles arrays, and since images are ultimately arrays, all three are usually used together. Meanwhile, Matplotlib and Visualkeras serve as visualization tools. Matplotlib is used to create graphs of the accuracy and loss during training and validation, while Visualkeras is used to visualize the layers of the model for a better understanding of the structure of the neural network. These tools provide more context for the implementation and the evaluation of the model.

Dataset Used

To train the model, datasets of text from real-world images are required. The images must be labeled and should contain the kind of text the neural network is meant to learn. Two datasets were used for this implementation. The first is the MJSynth Dataset provided by the University of Oxford Visual Geometry Group, a synthetic word dataset containing 9 million synthetically generated word images covering 90 thousand English words. The images in the dataset are comparable to those found in real-world environments and are well suited to the task at hand. The second dataset is a brand-new dataset created for this project, which you can learn more about here.

Models Used

For the implementation, two models are utilized. These models are necessary to carry out text detection, optical character recognition, and sequence recognition. For text detection, OpenCV's Efficient and Accurate Scene Text (EAST) model was used. The EAST model is a widely used text detection algorithm designed to identify and localize text regions within an image, cropping out the parts of the image where text is located. It is built on a fully convolutional network that predicts where text regions are, and it offers both efficiency and accuracy high enough for industry-wide use. For optical character recognition and sequence recognition, a convolutional recurrent neural network was implemented. The convolutional part of the model recognizes the characters in an image, while the recurrent part detects the sequences they form. Combining both aspects, the convolutional recurrent neural network is able to recognize words and transcribe them accordingly.
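For reference, below is a minimal sketch of how the EAST detector can be run through OpenCV's dnn module. The model filename, image path, input size, and the thresholding/box-decoding details are assumptions and should be adapted to the downloaded files.

```python
import cv2
import numpy as np

# Load the pretrained EAST graph (filename assumed; use the downloaded model file).
net = cv2.dnn.readNet("frozen_east_text_detection.pb")

image = cv2.imread("sample.jpg")
orig_h, orig_w = image.shape[:2]

# EAST expects input dimensions that are multiples of 32.
new_w, new_h = 320, 320
blob = cv2.dnn.blobFromImage(image, 1.0, (new_w, new_h),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                "feature_fusion/concat_3"])

# `scores` holds per-location text confidences and `geometry` holds box offsets
# and angles. Boxes above a confidence threshold are decoded, filtered with
# non-maximum suppression, scaled back to the original image size, and cropped.
```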

The convolutional recurrent neural network consists of 7 convolutional layers, 4 max-pooling layers, 2 batch normalization layers, and 2 bidirectional LSTM layers, along with an input and an output layer. The convolutional layers contain from 64 up to 512 filters, in multiples of 64, with a kernel size of 3x3 and ReLU as the activation function. Two variants of max pooling are used: the first two max-pooling layers have a pooling size of 2x2 with a stride of 2, while the rest have a pooling size of 2x1 with the default stride. The bidirectional LSTM layers contain 128 neurons each with a dropout of 0.2. Lastly, the output layer is a dense layer with 62 neurons corresponding to the classes and Softmax as the activation function. The architecture of the model can be observed in Figure 1. The classes for this model are the 62 characters of the English language: uppercase letters, lowercase letters, and numeral digits. Connectionist Temporal Classification (CTC) is used to calculate the loss. CTC is a technique for sequence-to-sequence mapping that allows variable-length alignment between input and output sequences without segmentation, enabling end-to-end training without splitting words into explicit characters. Overall, CTC simplifies training, handles variable-length alignments, and is widely used in sequence-based applications, which is why it is essential here: without it, explicit character-level segmentation would be required to transcribe text from images.
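A minimal Keras sketch of a CRNN along these lines is shown below. The layer counts follow the description above, but the exact filter progression, the ordering of the batch-normalization and pooling layers, the reshaping step, and the extra CTC blank class are assumptions rather than the exact code used.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_crnn(num_classes=62, img_h=32, img_w=128):
    inp = layers.Input(shape=(img_h, img_w, 1), name="image")

    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D(pool_size=(2, 2), strides=2)(x)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=(2, 2), strides=2)(x)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=(2, 1))(x)
    x = layers.Conv2D(512, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(512, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(pool_size=(2, 1))(x)
    x = layers.Conv2D(512, 3, padding="same", activation="relu")(x)

    # Collapse the height axis so the width axis becomes the time dimension.
    x = layers.Permute((2, 1, 3))(x)           # (width, height, channels)
    x = layers.Reshape((img_w // 4, -1))(x)    # (time steps, features)

    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True, dropout=0.2))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True, dropout=0.2))(x)

    # 62 character classes plus one extra "blank" label required by CTC.
    out = layers.Dense(num_classes + 1, activation="softmax", name="chars")(x)
    return Model(inp, out)

model = build_crnn()
model.summary()
```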

Preprocessing

For the implementation, preprocessing of the images is required. First, the images are read from the directory and saved into an array, and the words they depict are saved into another array. The words are then encoded and vectorized so they can be used as labels representing the classes of each character. The images, in turn, are either cropped or padded to a size of 32x128 and converted to grayscale. The labels are padded so they all have the same length as the longest word present. Finally, all of the arrays are converted into NumPy arrays and used for training and validating the convolutional recurrent neural network.
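The sketch below illustrates this preprocessing under a few assumptions: the 62-character vocabulary, the white padding value for images, and the -1 padding value for labels are illustrative choices rather than taken from the actual code.

```python
import numpy as np
import cv2

CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"  # 62 classes
CHAR_TO_IDX = {c: i for i, c in enumerate(CHARS)}

def preprocess_image(path, target_h=32, target_w=128):
    """Read an image, convert it to grayscale, and crop or pad it to 32x128."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = img[:target_h, :target_w]                       # crop if larger than target
    canvas = np.full((target_h, target_w), 255, np.uint8)  # pad with white otherwise
    canvas[:img.shape[0], :img.shape[1]] = img
    return canvas.astype("float32") / 255.0

def encode_label(word, max_len):
    """Vectorize a word into class indices and pad it to the longest word length."""
    encoded = [CHAR_TO_IDX[c] for c in word if c in CHAR_TO_IDX]
    # Pad with -1 so padded positions can be ignored when computing the CTC loss.
    return encoded + [-1] * (max_len - len(encoded))
```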

Training and Evaluation

The convolutional recurrent neural network was trained using 150k images from the MJSynth Dataset and 150k images from the SNIST Dataset. The data was divided into training and validation subsets, with 10% of the images used for validation and 90% for training. The model was trained for 20 iterations with the loss metric being monitored, and early stopping was applied to avoid overfitting. On average, each iteration took around an hour, putting the estimated training time for 20 iterations at roughly 20 hours. In the end, the model trained for the full 20 iterations, which took around 1344 minutes, or roughly 22 hours.
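Below is a minimal sketch of this training setup. `model` refers to the CRNN sketch above, `x_data` and `y_labels` are assumed to be the preprocessed arrays, and the CTC loss wiring (built on keras.backend.ctc_batch_cost) is a common pattern rather than the exact code used.

```python
import tensorflow as tf
from tensorflow import keras

def ctc_loss(y_true, y_pred):
    """CTC loss over padded labels; padded positions are marked with -1."""
    batch = tf.shape(y_pred)[0]
    input_len = tf.fill([batch, 1], tf.shape(y_pred)[1])           # time steps per sample
    label_len = tf.reduce_sum(tf.cast(tf.not_equal(y_true, -1), "int32"),
                              axis=1, keepdims=True)               # real label lengths
    return keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)

model.compile(optimizer="adam", loss=ctc_loss)

# Stop early if the monitored validation loss stops improving.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                           restore_best_weights=True)

history = model.fit(x_data, y_labels,
                    validation_split=0.1,   # 10% of the images held out for validation
                    epochs=20,
                    batch_size=64,
                    callbacks=[early_stop])
```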

After training was completed, the loss, accuracy, validation loss, and validation accuracy were recorded. The model achieved an accuracy score of 79% and a validation accuracy score of 70%. Since no models other than the one provided in this paper have been trained on the new SNIST dataset, it is hard to judge the model's performance directly. However, a similar convolutional recurrent neural network created by Baoguang Shi, Xiang Bai, and Cong Yao in the paper “An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition” reported an accuracy of 78.2%, as well as an accuracy of 74.6% when trained on sheet music for Optical Music Recognition. Going by these accuracies, there is a good chance that the model presented here outperforms previously implemented ones, possibly thanks to the new dataset created for this paper. Although not a perfect comparison, the accuracy reported in that paper gives a good indication of how well the model created here has performed. In the figure below, the training process can be observed through the monitored accuracy metrics.

[Figure: training and validation accuracy and loss curves over the training iterations]
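Curves like the ones above can be produced with Matplotlib from the Keras training history; the `history` object is assumed to come from the training sketch above, and the metric names depend on what was tracked during compilation.

```python
import matplotlib.pyplot as plt

# `history` comes from the model.fit call in the training sketch above.
plt.figure(figsize=(6, 4))
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
# If an accuracy metric is tracked, its curves can be plotted the same way
# from history.history["accuracy"] and history.history["val_accuracy"].
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```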

Experimentation

After the convolutional recurrent neural network has been trained, the EAST model is brought in. Combining the two models yields a system for automatic text extraction followed by transcription. First, an image is segmented into different text regions using the EAST model, with each text region representing a word in the image. These text regions are then cropped out of the original image and saved into an array representing the sentence present in the image. The cropped words are sent to the convolutional recurrent neural network, where they are classified. The network returns a vector representing the letters of the word, which is then decoded back into a string.
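A minimal sketch of this pipeline is shown below. `boxes` (the regions from the EAST step), `model`, `CHARS`, and the `preprocess_region` helper are assumptions carried over from the earlier sketches, and greedy CTC decoding is used as one common way to turn the network output back into text.

```python
import numpy as np
from tensorflow import keras

def transcribe(image, boxes):
    """Crop each detected region, run the CRNN on it, and decode the CTC output."""
    words = []
    for (x1, y1, x2, y2) in boxes:
        crop = image[y1:y2, x1:x2]
        crop = preprocess_region(crop)   # hypothetical helper: grayscale, 32x128, scaled to [0, 1]
        pred = model.predict(crop[np.newaxis, ..., np.newaxis])
        # Greedy CTC decoding collapses repeated labels and removes the blank class.
        seq_len = np.array([pred.shape[1]])
        decoded, _ = keras.backend.ctc_decode(pred, input_length=seq_len, greedy=True)
        indices = decoded[0].numpy()[0]
        words.append("".join(CHARS[i] for i in indices if i >= 0))
    return words
```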

To test the system and its models, a random picture was used to check whether the models could efficiently detect the words and properly transcribe them into strings. The words in the picture are “Selecting”, “TYPEFACES”, “for”, “body”, and “text”. The models should discern the words, and a human then verifies whether the predicted words are correct. After using the models to extract and predict the words, an array of words was produced. The model detected “TYPEFACES”, “body”, and “text” correctly. For the word “Selecting”, however, it predicted “selectinn”, which is close but not perfect, and for the word “for”, it missed the mark completely by predicting “JECDTC”. After reviewing the text regions the EAST model detected, it can be observed that the EAST model did not accurately extract the text from the image. Some words were cropped incorrectly, while others were close enough for the model to predict. In the case of “Selecting”, the letter “g” had its lower half cropped out, which resulted in the prediction of “n” instead of “g”. For the word “for”, the preprocessing and resizing stretched the crop to the extent that the model was unable to recognize and classify it properly. These experiments suggest that the EAST model and the preprocessing steps applied to its input may not have been the optimal choice for text detection. However, it can be inferred that the model created for this paper and trained using the SNIST dataset provides surprisingly accurate results when the text regions are detected properly. The figure below provides a visual representation of the experiments and the text regions the EAST model detected.

[Figure: experiment results and the text regions detected by the EAST model]
