Adi5598/Automatic-Speech-Recognition

Automatic Speech Recognition for Regional Indian Languages

INTRODUCTION

The aim of this project is to implement automatic speech recognition algorithms using Hidden Markov Models (HMMs) for regional Indian languages. We self-recorded Tamil digits, Telugu digits and words, and English continuous speech, and used external datasets for Hindi continuous speech and English digits. We implemented HMM-based systems using hmmlearn (a Python library) and HTK (a toolkit). We also implemented a Deep Neural Network (DNN) based system for comparison and present our analysis. The implementations are:

  • hmmlearn for Tamil, Telugu and English digits, and Telugu words recognition
  • HTK for Hindi continuous speech and Telugu words recognition
  • DNN for Tamil and Telugu digits recognition

LITERATURE REVIEW

Automatic Speech Recognition (ASR) is a well-researched field. The use of HMMs for ASR is covered in depth in The Application of Hidden Markov Models in Speech Recognition, which presents the core architecture of an HMM-based Large Vocabulary Continuous Speech Recognition (LVCSR) system and then describes ways to achieve state-of-the-art performance. A recent seminar report, Hidden Markov Model and Speech Recognition, concisely explains the Forward algorithm, the Viterbi algorithm and the Baum-Welch algorithm in the context of speech recognition.
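To make the role of the Forward algorithm concrete, here is a minimal sketch in plain Python. It scores an observation sequence under several small HMMs and picks the best-scoring one, which is the usual recipe for isolated-word recognition. This is only an illustration: the two models, their hand-set probabilities and the discrete symbols 0 and 1 are invented for the example, whereas a real system would use continuous features such as MFCCs and parameters learned with Baum-Welch.

```python
import math


def forward_log_likelihood(obs, start_p, trans_p, emit_p):
    """Return log P(obs | model) for a discrete HMM (scaled Forward algorithm)."""
    n = len(start_p)
    # alpha[i] is proportional to P(o_1..o_t, state_t = i)
    alpha = [start_p[i] * emit_p[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transitions, then weight by the emission.
            alpha = [
                sum(alpha[i] * trans_p[i][j] for i in range(n)) * emit_p[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        # Rescaling avoids underflow; the log of the scales sums to log P(obs).
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


# Two hypothetical left-to-right models over the symbols {0, 1}.
MODELS = {
    "A": {  # tends to emit 0s first, then 1s
        "start_p": [1.0, 0.0],
        "trans_p": [[0.5, 0.5], [0.0, 1.0]],
        "emit_p": [[0.9, 0.1], [0.1, 0.9]],
    },
    "B": {  # tends to emit 1s first, then 0s
        "start_p": [1.0, 0.0],
        "trans_p": [[0.5, 0.5], [0.0, 1.0]],
        "emit_p": [[0.1, 0.9], [0.9, 0.1]],
    },
}


def classify(obs, models=MODELS):
    """Pick the model name with the highest log-likelihood for obs."""
    return max(models, key=lambda name: forward_log_likelihood(obs, **models[name]))


if __name__ == "__main__":
    print(classify([0, 0, 1, 1]))
    print(classify([1, 1, 0, 0]))
```

The Viterbi algorithm has the same structure, with the sum over previous states replaced by a maximum.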

In the past few years, there has been significant work on developing HMM-based speech recognition systems for regional Indian languages. Syllable Based Continuous Speech Recognition for Tamil Language uses MFCC feature vectors and an acoustic HMM to build a recognition system for Tamil. We used a similar methodology to build a recognition system for Telugu words using HTK, a toolkit for building HMMs. Grapheme Gaussian Model and Prosodic Syllable Based Tamil Speech Recognition System builds upon this system and reports an accuracy of 77% on a dataset of 20 Tamil words, with 2 speakers and 2 utterances each; implementing it was beyond the scope of this project. HTK Based Speech Recognition Systems for Indian Regional languages: A Review summarizes the HTK-based speech recognition systems developed for 13 regional languages, including Tamil, Telugu, Hindi and English, along with the best accuracies obtained.

Automatic Speech Recognition Systems for Regional Languages in India argues that Deep Neural Networks (DNNs) must be more efficient and accurate for speech recognition.

DATASETS

Tamil Digits

The Tamil digits dataset consists of audio files recorded in ‘.wav’ format. Each file contains the utterance of one Tamil digit from 0-9. The length of each file is approximately 1 second. A total of 230-250 samples are present with each digit having around 13-15 samples. The dataset can be accessed here.

The digit-label-utterance mapping is given in the following table.

Digit Label Utterance
Zero 0 Poojyam
One 1 Onnu
Two 2 Rendu
Three 3 Munnu
Four 4 Naalu
Five 5 Anju
Six 6 Aaru
Seven 7 Yezhu
Eight 8 Yettu
Nine 9 Ombodu

Telugu Digits

The Telugu digits dataset consists of audio files recorded in ‘.wav’ format. Each file contains the utterance of one Telugu digit from 1-10. A total of ~60 samples are present with each digit having around 6-7 samples. The dataset can be accessed here.

The digit-label-utterance mapping is given in the following table.

Digit Label Utterance
One 1 Okati
Two 2 Rendu
Three 3 Mudu
Four 4 Nalugu
Five 5 Aidu
Six 6 Aaru
Seven 7 Edu
Eight 8 Enimidi
Nine 9 Tommidi
Ten 10 Padi

Telugu Words

The Telugu words dataset consists of audio files recorded in ‘.wav’ format. Each file contains the utterance of one Telugu word. A total of 80 samples are present with each word having 4 samples. The dataset can be accessed here.

The Telugu-English word mapping is given in the following table.

Word Meaning
abbayi boy
amma mother
ammayi girl
andarum all
batuku everyone
bojanum meal
chudama check
cinnema movie
dhairyam courage
kalisi together
kannu eye
kodatanu beats
konchum slightly
manum us
meeru you
nanna father
nenu I
pinni aunt
sonthum ourselves
yevaru who

English (Indian Accent) Continuous Speech

Three speakers recorded data for English continuous speech in an Indian accent.

Speaker 1 (Male, 20 yrs):

File Duration
rec1.wav 6:49
rec2.wav 7:04
rec3.wav 13:34
rec4.wav 4:07
rec5.wav 7:53

Speaker 2 (Male, 21 yrs):

File Duration
rec1.wav 18:14
rec2.wav 30:59

Speaker 3 (Male, 21 yrs):

File Duration
rec1.wav 8:36
rec2.wav 10:29
rec3.wav 11:25
rec4.wav 9:54
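As a quick check on how much audio the tables above represent, the durations can be summed per speaker. The sketch below assumes every duration is in minutes:seconds format; the helper names are made up for this example.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def format_hms(total_seconds):
    """Format a number of seconds as h:mm:ss."""
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return "{}:{:02d}:{:02d}".format(hours, minutes, seconds)


# Durations copied from the three speaker tables.
recordings = {
    "Speaker 1": ["6:49", "7:04", "13:34", "4:07", "7:53"],
    "Speaker 2": ["18:14", "30:59"],
    "Speaker 3": ["8:36", "10:29", "11:25", "9:54"],
}

per_speaker = {
    name: sum(to_seconds(d) for d in durations)
    for name, durations in recordings.items()
}
total = sum(per_speaker.values())

if __name__ == "__main__":
    for name in sorted(per_speaker):
        print(name, format_hms(per_speaker[name]))
    print("Total", format_hms(total))
```

Under that assumption, the three speakers contribute roughly 39, 49 and 40 minutes, a little over two hours in total.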

Externally Obtained Datasets:

  • Hindi Continuous Speech:

    The dataset consists of 150 Hindi sentences, each spoken by 7 different speakers. It can be accessed here.

  • English Digits:

    The dataset can be accessed here.

IMPLEMENTATIONS

HMMLEARN FOR TAMIL, TELUGU AND ENGLISH DIGITS, AND TELUGU WORDS RECOGNITION

Dependencies

  • Python (version 2.7.*)
  • hmmlearn
  • python_speech_features

Tamil Digits

Note: To reproduce the results, refer to this section for the script and weight file.

Training Results:

Fig Accuracy: 61.48%

Testing Results:

Fig Accuracy: 60.97%

Summary:

Fig

Telugu Digits

Note: To reproduce the results, refer to this section for the script and weight file.

Training Results:

Fig Accuracy: 50.24%

Testing Results:

Fig Accuracy: 58.06%

Telugu Words

Note: To reproduce the results, refer to this section for the script and weight file.

Training Results:

Fig Accuracy: 65%

Testing Results:

Fig Accuracy: 60%

Summary: Fig

English Digits

Note: To reproduce the results, refer to this section for the script and weight file.

  • Entire Dataset

    Training Results:

    Fig Accuracy: 96.75%

    Testing Results:

    Fig Accuracy: 94%

  • Limited Dataset

    We used 20% of the original test data and took 15 samples per digit.

    Training Results:

    Fig Accuracy: 60%

    Testing Results:

    Fig Accuracy: 60%
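The limited dataset above keeps a fixed number of samples per digit. One way to draw such a subset is sketched below; it is an illustration only, not the repository's script. The `(path, label)` representation, the file names and the function name are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_per_label(items, per_label, seed=0):
    """Return up to per_label randomly chosen (path, label) pairs for each label."""
    by_label = defaultdict(list)
    for path, label in items:
        by_label[label].append(path)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for label in sorted(by_label):
        paths = by_label[label]
        chosen = rng.sample(paths, min(per_label, len(paths)))
        subset.extend((path, label) for path in chosen)
    return subset


if __name__ == "__main__":
    # Hypothetical file list: 50 recordings for each of the digits 0-9.
    items = [("digit{}_{}.wav".format(d, k), d) for d in range(10) for k in range(50)]
    limited = sample_per_label(items, per_label=15)
    print(len(limited))
```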

HTK FOR HINDI CONTINUOUS SPEECH AND TELUGU WORDS RECOGNITION

HTK Installation (Linux)

Hindi Continuous Speech

Fig

Telugu Words

Note: Our forked repository can be found here.

  • Upload the data to the ./data directory, in the corresponding train and test directories.
  • Prepare a transliteration file and a phone-level lexicon file (hindiSentences150.txt and lexicon.txt respectively in the original repository) covering all the words present in the speech samples, and put them in ./doc and ./lm respectively.
  • Go to the scripts_sh_pl_py folder and edit the HTK_home variable in master.sh to hold the absolute path of your HTK directory.
  • Give read and execute permissions to all the scripts there: chmod a+rx *.sh *.pl *.py.
  • Now cd into the parent directory and run the following commands to:
    1. Generate env var and mfcc features
    2. Write the transcription
    3. Initialize PDF of each phone model
    4. Fit the data
    5. Evaluate the output

scripts_sh_pl_py/master.sh HCOPY
scripts_sh_pl_py/master.sh LEXICON
scripts_sh_pl_py/master.sh HCOMPV
scripts_sh_pl_py/master.sh HEREST
scripts_sh_pl_py/master.sh ALIGN
scripts_sh_pl_py/master.sh HVITE_MONO
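The six stages above have to run in order, and a later stage is of little use if an earlier one failed. A small Python wrapper can enforce that. This is a convenience sketch only: the stage names and script path are taken from the commands above, while the function names and the dry-run flag are made up for this example.

```python
import subprocess

MASTER_SCRIPT = "scripts_sh_pl_py/master.sh"
STAGES = ["HCOPY", "LEXICON", "HCOMPV", "HEREST", "ALIGN", "HVITE_MONO"]


def build_commands(stages=STAGES, script=MASTER_SCRIPT):
    """Return one argument list per stage, in execution order."""
    return [[script, stage] for stage in stages]


def run_pipeline(dry_run=True):
    """Print each command; when dry_run is False, also execute it."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError, stopping at the first failing stage.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Run from the repository's parent directory; set dry_run=False to execute.
    run_pipeline(dry_run=True)
```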

Fig

DEEP NEURAL NETWORK (DNN) FOR TAMIL AND TELUGU DIGITS RECOGNITION

Dependencies

  • Numpy
  • Pandas
  • Librosa
  • Pytorch
  • Sklearn

Tamil Digits

Note: To reproduce the results, refer to this section for the script and weight file.

To compare against the HMM model on the Tamil digits dataset, we train a modern deep learning architecture on the same dataset and compare its performance with that of the HMM model.

The deep learning model chosen is a Long Short-Term Memory (LSTM) network. LSTMs are a special member of the Recurrent Neural Network (RNN) family and can model the current input in the context of earlier inputs. A non-recurrent neural network has no memory, whereas a plain RNN has only a limited memory and tends to perform badly on data with long-term temporal dependencies. An LSTM can also decide how much information to keep in its memory, because it has input, forget and output gates.
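The effect of the three gates is easiest to see on a single scalar LSTM cell. The sketch below follows the standard LSTM update equations in plain Python. It is not the model used in this project: the weights are hand-picked for the example, and the function and dictionary names are invented.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, weights):
    """One time step of a scalar LSTM cell.

    weights maps a gate name to a (w_x, w_h, bias) tuple.
    """
    def pre_activation(gate):
        w_x, w_h, bias = weights[gate]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre_activation("input"))        # how much new content to write
    f = sigmoid(pre_activation("forget"))       # how much old memory to keep
    o = sigmoid(pre_activation("output"))       # how much memory to expose
    g = math.tanh(pre_activation("candidate"))  # proposed new content
    c = f * c_prev + i * g                      # updated cell state (memory)
    h = o * math.tanh(c)                        # updated hidden state (output)
    return h, c


# Hand-picked weights: a large positive bias opens a gate, a large negative one closes it.
KEEP_MEMORY = {"input": (0, 0, -50), "forget": (0, 0, 50),
               "output": (0, 0, 50), "candidate": (0, 0, 0)}
ERASE_MEMORY = {"input": (0, 0, -50), "forget": (0, 0, -50),
                "output": (0, 0, 50), "candidate": (0, 0, 0)}

if __name__ == "__main__":
    print(lstm_step(1.0, 0.0, 0.8, KEEP_MEMORY))   # cell state stays near 0.8
    print(lstm_step(1.0, 0.0, 0.8, ERASE_MEMORY))  # cell state drops to near 0
```

With the forget gate open and the input gate closed, the cell carries its previous state forward unchanged. This is how an LSTM can retain information across many time steps.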

The LSTM architecture and the other hyper-parameters and functions used are given below:

  • Architecture

LSTM(
  (rnn): LSTM(input = 81, hidden_neurons = 10, num_layers=2, dropout=0.1)
  (fc): Sequential(Linear(in_features=10, out_features=10, bias=True))
  (output) : Softmax(input = 10 , output = 1)
)

  • Hyper-Parameters
    • Learning Rate = 0.01
    • Loss function used = MSE (Mean Squared Error) Loss
    • Optimizer used = Adam Optimizer
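For readers unfamiliar with pairing a softmax output with an MSE loss, the sketch below shows the computation for one sample in plain Python. It assumes the targets are one-hot encoded over the 10 digit classes and that the loss is averaged over the classes; neither detail is stated above, and the function names are invented for the example.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(label, num_classes):
    """Encode an integer label as a one-hot vector (assumed target format)."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def mse_loss(pred, target):
    """Mean squared error, averaged over the vector elements."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    probs = softmax([0.0] * 10)      # an untrained model: uniform over 10 digits
    target = one_hot(3, 10)          # true digit is 3
    print(mse_loss(probs, target))   # approximately 0.09
```

Cross-entropy is the more common loss for classification, but MSE on softmax outputs also trains, which is the choice reported here.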

Training Results

We trained the model on 220 shuffled samples for 100 epochs, using mini-batches of 20 samples. The results are as follows:
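The batching described above can be sketched as follows. This is a generic illustration of shuffling and splitting sample indices, not the notebook's actual data loader; the function name and the seed are made up for the example.

```python
import random


def minibatches(num_samples, batch_size, seed=None):
    """Yield lists of shuffled sample indices, batch_size at a time."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]


if __name__ == "__main__":
    batches = list(minibatches(220, 20, seed=0))
    print(len(batches))        # 11 batches per epoch
    print(len(batches) * 100)  # 1100 batches over 100 epochs
```

With 220 samples and a batch size of 20, each epoch has 11 batches, so 100 epochs amount to 1100 parameter updates if the optimizer steps once per batch.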

  • Total number of training samples evaluated = 220
  • Correct predictions = 192
  • Accuracy = 87.27%
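The accuracy figures for the DNN follow directly from the prediction counts. The short check below recomputes them from the counts reported in this document; the helper name is made up for the example.

```python
def accuracy_percent(correct, total):
    """Accuracy as a percentage of correct predictions."""
    return 100.0 * correct / total


# (correct, total) pairs as reported in the DNN results.
dnn_counts = {
    "Tamil training": (192, 220),
    "Tamil testing": (13, 20),
    "Telugu training": (45, 50),
    "Telugu testing": (10, 16),
}

if __name__ == "__main__":
    for name, (correct, total) in dnn_counts.items():
        print("{}: {:.2f}%".format(name, accuracy_percent(correct, total)))
```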

Confusion Matrix:

Confusion Matrix

Loss Plot for Training:

Loss Plot for Training

Testing Results

After training the model, we test on a few unseen samples to see the performance of the model.

  • Total number of test samples = 20
  • Correct predictions = 13
  • Accuracy = 65.0%

Confusion Matrix:

Confusion Matrix

Telugu Digits

Note: To reproduce the results, refer to this section for the script and weight file.

To compare against the HMM model on the Telugu digits dataset, we train a modern deep learning architecture on the same dataset and compare its performance with that of the HMM model.

The deep learning model chosen is a Long Short-Term Memory (LSTM) network. LSTMs are a special member of the Recurrent Neural Network (RNN) family and can model the current input in the context of earlier inputs. A non-recurrent neural network has no memory, whereas a plain RNN has only a limited memory and tends to perform badly on data with long-term temporal dependencies. An LSTM can also decide how much information to keep in its memory, because it has input, forget and output gates.

The LSTM architecture and the other hyper-parameters and functions used are given below:

  • Architecture

LSTM(
  (rnn): LSTM(81, 10, num_layers=2, dropout=0.1)
  (fc): Sequential(
    (0): Linear(in_features=10, out_features=10, bias=True)
  )
)

  • Hyper-Parameters

    • Learning Rate = 0.01
    • Loss function used = MSE (Mean Squared Error) Loss
    • Optimizer used = Adam Optimizer

Training Results

We trained the model on 50 shuffled samples for 20 epochs, using mini-batches of 5 samples. The results are as follows:

  • Total number of training samples evaluated = 50
  • Correct predictions = 45
  • Accuracy = 90.0%

Confusion Matrix:

Confusion Matrix

Loss Plot for Training:

Loss Plot for Training

Testing Results

After training the model, we test on a few unseen samples to see the performance of the model.

  • Total number of test samples = 16
  • Correct predictions = 10
  • Accuracy = 62.5%

Confusion Matrix:

Confusion Matrix

COMPARISON

DATASET IMPLEMENTATION TRAINING ACCURACY TESTING ACCURACY SCRIPT WEIGHT FILE
Tamil Digits hmmlearn 61.48% 60.97% hmm_digits.py tamil_digits.pkl
Tamil Digits DNN 87.27% 65% DNN_Tamil.ipynb link
Telugu Digits hmmlearn 50.24% 58.06% hmm_digits.py telugu_digits.pkl
Telugu Digits DNN 90% 62.5% DNN_TELUGU.ipynb link
English Digits (entire dataset) hmmlearn 96.75% 94% hmm_digits.py english_digits.pkl
English Digits (limited dataset) hmmlearn 60% 60% hmm_digits.py english_digits_limited.pkl
Telugu Words hmmlearn 65% 60% hmm_words.py telugu_words.pkl
Telugu Words HTK ❗
Hindi Continuous Speech HTK 67.35%

CONCLUSION AND FUTURE WORK

  • The DNN outperforms hmmlearn on training accuracy by a large margin (87.27% vs 61.48% on Tamil digits, 90% vs 50.24% on Telugu digits).
  • The DNN outperforms hmmlearn on testing accuracy by a smaller margin (65% vs 60.97% on Tamil digits, 62.5% vs 58.06% on Telugu digits).
  • Training and testing on the entire English digits dataset gave ~95% accuracy, as opposed to 60% for the limited dataset. This suggests that more data would significantly improve the accuracies on our self-recorded regional language datasets.
  • The HTK implementation works successfully for continuous speech data. The next step would be to try regional language datasets for continuous speech.
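The margins behind the first two conclusions can be recomputed from the comparison table. The sketch below does so in percentage points; the dictionary layout and function name are made up for the example, and the numbers are copied from the table.

```python
# (training accuracy, testing accuracy) in percent, from the comparison table.
RESULTS = {
    "Tamil Digits": {"hmmlearn": (61.48, 60.97), "DNN": (87.27, 65.0)},
    "Telugu Digits": {"hmmlearn": (50.24, 58.06), "DNN": (90.0, 62.5)},
}


def dnn_margins(results):
    """Return {dataset: (training margin, testing margin)} of DNN over hmmlearn."""
    margins = {}
    for dataset, by_model in results.items():
        hmm_train, hmm_test = by_model["hmmlearn"]
        dnn_train, dnn_test = by_model["DNN"]
        margins[dataset] = (
            round(dnn_train - hmm_train, 2),
            round(dnn_test - hmm_test, 2),
        )
    return margins


if __name__ == "__main__":
    for dataset, (train_gap, test_gap) in dnn_margins(RESULTS).items():
        print("{}: +{} training, +{} testing".format(dataset, train_gap, test_gap))
```

The DNN's lead is roughly 26 and 40 percentage points in training but only about 4 points in testing on both datasets.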

CONTRIBUTORS
