cleantext is an open-source Python package for cleaning raw text data. Source code for the library can be found here.
cleantext has two main methods:
- clean: to clean raw text and return the cleaned text
- clean_words: to clean raw text and return a list of clean words
cleantext can apply all, or a selected combination, of the following cleaning operations:
- Remove extra white spaces
- Convert the entire text into a uniform lowercase
- Remove digits from the text
- Remove punctuation from the text
- Remove or replace parts of the text using a custom regex
- Remove stop words, and choose a language for stop words (stop words are generally the most common words in a language and carry little meaning on their own, such as is, am, the, this, are, etc.)
- Stem the words (stemming reduces related forms of a word to a common stem. For example, stemming the words run, runs, running results in run, run, run)
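To make the operations listed above concrete, here is a minimal sketch of most of them using only the Python standard library. It is not cleantext's implementation: the function name `sketch_clean` and the tiny `DEMO_STOP_WORDS` set are hypothetical, and stemming is left out because it depends on NLTK.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word set for illustration only;
# cleantext itself relies on NLTK for its stop-word lists.
DEMO_STOP_WORDS = {"is", "am", "the", "this", "are", "a", "to"}


def sketch_clean(text, stop_words=DEMO_STOP_WORDS):
    """Illustrative stand-in for the cleaning steps (no stemming)."""
    text = text.lower()                      # uniform lowercase
    text = re.sub(r"\d+", "", text)          # remove digits
    # remove ASCII punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = text.split()                     # splitting also drops extra white space
    return [w for w in words if w not in stop_words]  # remove stop words


print(sketch_clean("This is A s$ample !!!! tExt3% to cleaN566556+2+59*/133"))
# prints ['sample', 'text', 'clean']
```

Without a stemming step the first word stays `sample`; a stemmer would reduce it further.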
cleantext requires Python 3 and NLTK to execute.
To install using pip, use
pip install cleantext
- Import the library:
import cleantext
- Choose a method:
To return the text in a string format,
cleantext.clean("your_raw_text_here")
To return a list of words from the text,
cleantext.clean_words("your_raw_text_here")
To choose a specific set of cleaning operations,
cleantext.clean_words("your_raw_text_here",
    clean_all=False,  # Whether to execute all cleaning operations
    extra_spaces=True,  # Remove extra white spaces
    stemming=True,  # Stem the words
    stopwords=True,  # Remove stop words
    lowercase=True,  # Convert to lowercase
    numbers=True,  # Remove all digits
    punct=True,  # Remove all punctuation
    reg='<regex>',  # Remove parts of text based on regex
    reg_replace='<replace_value>',  # String to replace the regex used in reg
    stp_lang='english'  # Language for stop words
)
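The `reg` and `reg_replace` options behave much like a regular-expression substitution. The sketch below shows that idea with the standard `re` module; the helper `sketch_regex_replace` and the hashtag pattern are hypothetical illustrations, not part of cleantext, and the library's internal handling may differ.

```python
import re

# Hypothetical pattern for illustration: a '#' followed by word characters.
HASHTAG_PATTERN = r"#\w+"


def sketch_regex_replace(text, pattern, replacement=""):
    """Replace every match of `pattern` in `text` with `replacement`.

    With the default empty replacement, matches are simply removed.
    """
    return re.sub(pattern, replacement, text)


text = "release notes #v2 and #bugfix"
print(sketch_regex_replace(text, HASHTAG_PATTERN, "tag"))
# prints release notes tag and tag
```

Leaving the replacement empty corresponds to removing the matched text, while a non-empty value substitutes it.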
import cleantext
cleantext.clean('This is A s$ample !!!! tExt3% to cleaN566556+2+59*/133', extra_spaces=True, lowercase=True, numbers=True, punct=True)
returns,
'this is a sample text to clean'
import cleantext
cleantext.clean_words('This is A s$ample !!!! tExt3% to cleaN566556+2+59*/133')
returns,
['sampl', 'text', 'clean']
from cleantext import clean
text = "my id, name1@dom1.com and your, name2@dom2.in"
clean(text, reg=r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", reg_replace='email', clean_all=False)
returns,
"my id, email and your, email"
For questions, issues, bugs, and suggestions, please visit here.