text_nlp

Text analysis, cleaning and classification modeling

Main Libraries used

Text processing

pytesseract 0.3.9
opencv 3.4.2
pillow 9.0.1

Text classification

pandas 1.3.4
numpy 1.21.2
scikit-learn 0.23.2
xgboost 1.5.1
nltk 3.7
smart_open 1.8.0
gensim 3.8.0

Installation and setup

Install Tesseract OCR:
brew install tesseract (macOS)
sudo apt-get install tesseract-ocr (Ubuntu)

If you get this error:
Error: The following directories are not writable by your user: /usr/local/share/man/man8

change the ownership of that directory to your user and make sure your user has write permission:
sudo chown -R $(whoami) /usr/local/share/man/man8
chmod u+w /usr/local/share/man/man8
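
Once Tesseract is installed, it can help to confirm that the binary is visible on your PATH before running any OCR code. The snippet below is a minimal sketch of such a check using only the standard library; the helper names `find_tesseract` and `tesseract_version` are made up for illustration and are not part of this repository.

```python
import shutil
import subprocess


def find_tesseract(binary="tesseract"):
    """Return the full path to the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(binary)


def tesseract_version(binary="tesseract"):
    """Return the first line printed by `<binary> --version`, or None if unavailable."""
    path = find_tesseract(binary)
    if path is None:
        return None
    # Some Tesseract releases print the version to stderr, so merge both streams.
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


print(tesseract_version() or "tesseract not found on PATH")
```

If this prints "tesseract not found on PATH", the install step above most likely did not complete, or the shell needs to be restarted to pick up the new PATH.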

conda create -n ENV_NAME python=3.7
conda activate ENV_NAME
conda install pandas pytesseract toe opencv pillow nltk pytest
conda install -c anaconda scikit-learn
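
After creating the environment, a quick import check can catch missing packages early. Note that several packages are imported under a different name than the one used to install them (opencv as `cv2`, pillow as `PIL`, scikit-learn as `sklearn`). The following is a rough sketch, not a script shipped with this repository; `IMPORT_NAMES` and `check_imports` are hypothetical names.

```python
import importlib

# Install name -> import name. The two differ for opencv, pillow and scikit-learn.
IMPORT_NAMES = {
    "pandas": "pandas",
    "numpy": "numpy",
    "pytesseract": "pytesseract",
    "opencv": "cv2",
    "pillow": "PIL",
    "scikit-learn": "sklearn",
    "xgboost": "xgboost",
    "nltk": "nltk",
    "smart_open": "smart_open",
    "gensim": "gensim",
}


def check_imports(import_names=IMPORT_NAMES):
    """Try each import and return {package: version string, "unknown", or None}.

    None means the import failed; "unknown" means the module imported
    but does not expose a __version__ attribute.
    """
    report = {}
    for package, module_name in import_names.items():
        try:
            module = importlib.import_module(module_name)
        except Exception:
            # Broad on purpose: a broken install can raise more than ImportError.
            report[package] = None
            continue
        report[package] = str(getattr(module, "__version__", "unknown"))
    return report


for package, version in check_imports().items():
    print(f"{package:15s} {version or 'NOT INSTALLED'}")
```

The printed versions can then be compared by eye against the versions listed under "Main Libraries used".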

For xgboost: the XGBoost package from the conda-forge channel currently doesn't support GPU. There is an ongoing discussion about this in conda-forge/xgboost-feedstock#26.
For now, you can install XGBoost using one of the following:

conda install -c nvidia -c rapidsai py-xgboost
pip install xgboost
conda install -c conda-forge py-xgboost-gpu

Code Structure

model_data.py
|
|------> preprocess_data.py
|
data------>|

How to Run

Notebook: /partII/data_exploratory_analysis.ipynb
Preprocessing data: /partII/preprocess_data.py
Models: /partII/model_data.py

python model_data.py <train_filename> <test_filename> <filedir> <steps> <text_cols> <target>
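
The command takes six positional arguments, in the order shown. As an illustration of that interface, here is a minimal argparse sketch that accepts the same six values. It is an assumption about how the arguments could be parsed, not the actual code in model_data.py; the help strings are guesses, and every value is kept as a plain string because the README does not specify the expected format of `<steps>` or `<text_cols>`.

```python
import argparse


def build_parser():
    """Build a parser mirroring the six positional arguments in the usage line."""
    parser = argparse.ArgumentParser(prog="model_data.py")
    # Help texts below are assumptions; only the argument names come from the README.
    parser.add_argument("train_filename", help="training data file name")
    parser.add_argument("test_filename", help="test data file name")
    parser.add_argument("filedir", help="directory containing the data files")
    parser.add_argument("steps", help="steps to run (format not specified here)")
    parser.add_argument("text_cols", help="text column(s) (format not specified here)")
    parser.add_argument("target", help="name of the target column")
    return parser


# Example invocation with made-up values, parsed from an explicit list
# instead of sys.argv so the sketch can be run on its own.
args = build_parser().parse_args(
    ["train.csv", "test.csv", "data/", "clean", "text", "label"]
)
print(args.train_filename, args.test_filename, args.target)
```

Omitting any of the six values makes argparse exit with a usage message, which is a convenient way to see the expected argument order.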
