Text analysis, cleaning and classification modeling
- pytesseract 0.3.9
- opencv 3.4.2
- pillow 9.0.1
- pandas 1.3.4
- numpy 1.21.2
- scikit-learn 0.23.2
- xgboost 1.5.1
- nltk 3.7
- smart_open 1.8.0
- gensim 3.8.0
- Install tesseract OCR
- brew install tesseract (macOS)
- sudo apt-get install tesseract-ocr (Ubuntu)
- If you get this error:
Error: The following directories are not writable by your user: /usr/local/share/man/man8
You should change the ownership of these directories to your user. sudo chown -R $(whoami) /usr/local/share/man/man8
And make sure that your user has write permission. chmod u+w /usr/local/share/man/man8
- To fix it, run: sudo chown -R $(whoami) /usr/local/share/man/man8
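Before running the OCR steps, it may help to confirm that the tesseract binary is visible to Python. The following is a minimal sketch using only the standard library. The helper names `find_tesseract` and `tesseract_version` are illustrative and not part of this repository.

```python
import shutil
import subprocess


def find_tesseract(which=shutil.which):
    """Return the full path of the tesseract binary, or None if it is not on PATH.

    `which` is injectable so the lookup can be exercised without tesseract installed.
    """
    return which("tesseract")


def tesseract_version(path):
    """Return the first line printed by `tesseract --version`."""
    # Some tesseract releases print the version to stderr, so merge both streams.
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; install it with brew or apt-get first")
    else:
        print(path, "->", tesseract_version(path))
```

If the binary is installed but not on PATH, pytesseract can be pointed at it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path.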
- conda create -n ENV_NAME python=3.7
- conda activate ENV_NAME
- conda install pandas pytesseract toe opencv pillow nltk pytest
- conda install -c anaconda scikit-learn
For xgboost: the XGBoost package from the conda-forge channel currently does not support GPU. This is under discussion in conda-forge/xgboost-feedstock#26.
For now, install XGBoost using one of the following:
- conda install -c nvidia -c rapidsai py-xgboost
- pip install xgboost
- conda install -c conda-forge py-xgboost-gpu
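Once the environment is set up, the installed versions can be compared with the pinned versions listed at the top of this README. The sketch below is one way to do that. The function names are hypothetical, and the dictionary keys are assumed to match the installed distribution names, which is not always true (OpenCV, for example, is `opencv-python` when installed with pip).

```python
# Pinned versions copied from the dependency list in this README.
# Assumption: the keys match the installed distribution names.
PINNED = {
    "pytesseract": "0.3.9",
    "opencv": "3.4.2",
    "pillow": "9.0.1",
    "pandas": "1.3.4",
    "numpy": "1.21.2",
    "scikit-learn": "0.23.2",
    "xgboost": "1.5.1",
    "nltk": "3.7",
    "smart_open": "1.8.0",
    "gensim": "3.8.0",
}


def check_versions(pinned, get_version):
    """Return {name: (expected, found)} for each mismatched or missing package."""
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except Exception:  # lookup failed: treat the package as not installed
            found = None
        if found != expected:
            problems[name] = (expected, found)
    return problems


def default_get_version():
    """Return a callable that maps a distribution name to its installed version."""
    try:
        from importlib.metadata import version  # Python 3.8+
    except ImportError:
        # Python 3.7 (the version used for the conda env): fall back to setuptools.
        import pkg_resources

        def version(name):
            return pkg_resources.get_distribution(name).version
    return version


if __name__ == "__main__":
    problems = check_versions(PINNED, default_get_version())
    for name, (expected, found) in sorted(problems.items()):
        print("{}: expected {}, found {}".format(name, expected, found))
    if not problems:
        print("all pinned versions match")
```

A package reported as `found None` may simply be registered under a different distribution name, so treat the output as a hint rather than a verdict.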
model_data.py
|
|------> preprocess_data.py
|
data------>|
- Notebook: /partII/data_exploratory_analysis.ipynb
- Preprocessing data: /partII/preprocess_data.py
- Models: /partII/model_data.py
python model_data.py <train_filename> <test_filename> <filedir> <steps> <text_cols> <target>
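The six positional arguments in the usage line above could be parsed along the following lines. This is only a sketch of the calling convention, not the actual argument handling in model_data.py, which may differ. `build_parser` and `ARG_NAMES` are hypothetical names, and the example values are placeholders rather than files from this repository.

```python
import argparse

# Argument names taken from the usage line, in the same order.
ARG_NAMES = [
    "train_filename",
    "test_filename",
    "filedir",
    "steps",
    "text_cols",
    "target",
]


def build_parser():
    """Build a parser that accepts the six positional arguments as plain strings."""
    parser = argparse.ArgumentParser(prog="model_data.py")
    # The formats expected for <steps> and <text_cols> are not documented here,
    # so every value is left as an uninterpreted string.
    for name in ARG_NAMES:
        parser.add_argument(name)
    return parser


# Placeholder values for illustration only.
example = build_parser().parse_args(
    ["train.csv", "test.csv", "data/", "steps", "text", "label"]
)
print(vars(example))
```

Passing fewer than six values makes argparse exit with a usage message, which gives an early check that no argument was dropped.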