In this repo, I have covered the following preprocessing steps:
Cleaning: Remove irrelevant items such as HTML tags, symbols, and non-alphabetic characters from the corpus (the dataset in NLP).
Normalization: Convert all words to lowercase. Remove punctuation and extra spaces.
Tokenization: Split the text into words, also known as tokens.
Stop Words Removal: Remove the most common words (a, an, the, etc.).
Parts of Speech Tagging: Identify the parts of speech for the remaining words.
Named Entity Recognition: Recognize the named entities in the data.
Stemming and Lemmatization: Reduce words to a base form. Stemming strips suffixes heuristically and may produce non-words, while lemmatization maps each word to its dictionary form (lemma).
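
The steps above can be sketched in a few lines of Python. This is a minimal, standard-library-only illustration and not necessarily how the code in this repo is written: the function names, the tiny `STOP_WORDS` set, and the `naive_stem` suffix stripper are all hypothetical stand-ins. Parts of speech tagging, named entity recognition, and true lemmatization are left out because they need a trained model or a dictionary, which a library such as NLTK or spaCy would normally provide.

```python
import html
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "of", "to", "in", "on"}


def clean(text):
    """Cleaning: drop HTML tags and entities, then non-alphabetic characters."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = html.unescape(text)                 # decode entities such as &amp;
    return re.sub(r"[^A-Za-z\s]", " ", text)   # keep only letters and whitespace


def normalize(text):
    """Normalization: lowercase and collapse extra whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def tokenize(text):
    """Tokenization: split on whitespace (punctuation is already gone)."""
    return text.split()


def remove_stop_words(tokens):
    """Stop words removal: filter out very common words."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Toy stemmer: strip one common suffix. This is NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Require at least three remaining characters to avoid over-stripping.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run cleaning, normalization, tokenization, stop-word removal, stemming."""
    tokens = tokenize(normalize(clean(text)))
    tokens = remove_stop_words(tokens)
    return [naive_stem(t) for t in tokens]


sample = "<p>The cats are jumping &amp; playing in the garden!</p>"
print(preprocess(sample))  # ['cat', 'jump', 'play', 'garden']
```

Keeping each step as its own small function makes it easy to swap in a library implementation later, for example replacing `naive_stem` with a proper stemmer or lemmatizer, without touching the rest of the pipeline.
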
Applications that are covered:
-> Speech-to-text conversion
-> Text Preprocessing
-> Language Modelling
-> Language Translation