This project builds a robust classifier that predicts, from the content of its abstract, whether a document belongs to the cancer class.
Dataset Description
The training and test data both consist of research paper abstracts in .nxml format. The training data contains two folders:
- Cancer: contains documents related to cancer.
- Non Cancer: contains documents not related to cancer. These cover any other category, ranging from music and videos to HIV and stroke.
Test data contains 100 files named 1.nxml to 100.nxml. The output should contain labels in the format below.
pip install bs4
pip install html2text
pip install tqdm
# xml ships with the Python standard library, so no separate install is needed
pip install nltk
pip install numpy
pip install scikit-learn
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
make -j4
cd python-package
python setup.py install
conda install libgcc
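
With the dependencies in place, the first step is to get the abstract text out of each .nxml file. The following is a minimal sketch of that step, not the project's actual code. It uses only the standard-library `xml.etree.ElementTree` module and assumes the abstract sits in an `<abstract>` element, as in JATS-style .nxml files. The helper name `extract_abstract` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_abstract(nxml_text):
    """Return the abstract of an .nxml document as plain text ('' if none)."""
    root = ET.fromstring(nxml_text)
    # Assumption: the abstract is stored in an <abstract> element.
    abstract = root.find(".//abstract")
    if abstract is None:
        return ""
    # itertext() walks nested tags (<p>, <italic>, ...) and yields only text.
    pieces = " ".join(abstract.itertext())
    # Collapse runs of whitespace left over from the markup.
    return " ".join(pieces.split())


# Hypothetical sample document, for illustration only.
sample = """<article>
  <front><article-meta>
    <abstract><p>Tumour growth was <italic>reduced</italic> in treated mice.</p></abstract>
  </article-meta></front>
</article>"""

print(extract_abstract(sample))
```

The resulting strings can then be passed to whatever feature extraction and model the classifier uses. Files that reference undefined entities may need a more lenient parser than `ElementTree`.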