This project focuses on categorizing news headlines into six categories using a statistical machine learning approach. The dataset consists of 3000 headlines evenly distributed across six categories: Politics, Economy, Sports, Current Affairs, Health, and Technology.
- Data: The dataset contains 500 headlines for each of the six categories. The data has been cleaned, with stop words and numbers removed.
- Model: A Random Forest classifier is used.
- Evaluation: Cross-validation is employed to test the model, achieving a precision of 0.70 and an accuracy of 0.61.
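
The three points above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the TF-IDF features, the English stop-word list, the macro-averaged precision, the two-fold split, and the sample headlines are all assumptions made for the example (the notebook may, for instance, build its features with gensim instead).

```python
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline


def clean(text):
    """Lowercase the headline and replace digit runs with a space."""
    return re.sub(r"\d+", " ", text.lower())


# Made-up sample data; the real dataset has 500 headlines per category.
headlines = [
    "Parliament passes new budget law", "President meets opposition leaders",
    "Minister resigns after election defeat", "Senate debates voting reform in 2024",
    "Striker scores twice in cup final", "Team wins league title after 10 years",
    "Coach praises young goalkeeper", "Tennis champion retires from tour",
    "New vaccine shows promise in trial", "Hospital opens cancer treatment unit",
    "Doctors warn about flu season", "Study links sleep to heart health",
]
labels = ["Politics"] * 4 + ["Sports"] * 4 + ["Health"] * 4

# Assumed feature extraction: TF-IDF with stop words removed.
pipeline = make_pipeline(
    TfidfVectorizer(preprocessor=clean, stop_words="english"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

# Stratified 2-fold cross-validation (the fold count is an assumption).
scores = cross_validate(
    pipeline, headlines, labels, cv=2,
    scoring=["accuracy", "precision_macro"],
)
print("accuracy:", scores["test_accuracy"].mean())
print("macro precision:", scores["test_precision_macro"].mean())
```

On a sample this small the scores are not meaningful; the sketch only shows how the pieces fit together.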
To set up the environment, install the following packages:
- the dependencies listed in `requirements.txt`
- chardet
- gensim
- openpyxl
- Clone the repository and change into its directory:
  - `git clone https://github.com/eraybuyukkanat/nlp_news_classification.git`
  - `cd nlp_news_classification`
- Install the necessary packages:
  - `pip install -r requirements.txt`
  - `pip install chardet`
  - `pip install gensim`
  - `pip install openpyxl`
- Ensure your dataset is properly formatted and placed in the appropriate directory.
- Open and run the Jupyter Notebook to train and evaluate the model:
  - `jupyter notebook classification.ipynb`
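
Before running the notebook, it can help to confirm that the dataset is balanced as described (500 headlines per category). The expected file layout is not specified here, so the sketch below assumes the data has already been loaded into `(headline, category)` pairs; `check_balance` is a hypothetical helper, not part of the repository.

```python
from collections import Counter


def check_balance(rows, expected_per_class=500):
    """Return the categories whose headline count differs from the expected count."""
    counts = Counter(category for _headline, category in rows)
    return {cat: n for cat, n in counts.items() if n != expected_per_class}


# Made-up rows for illustration only.
rows = [
    ("Team wins final", "Sports"),
    ("Parliament votes", "Politics"),
    ("Striker injured", "Sports"),
]
print(check_balance(rows, expected_per_class=1))  # {'Sports': 2}
```

An empty dictionary means every category has exactly the expected number of headlines.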
The model achieves the following performance metrics:
- Precision: 0.70
- Accuracy: 0.61
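
For readers unsure how precision can be higher than accuracy, the toy example below computes both by hand. It assumes precision is macro-averaged across categories, which this README does not state; the labels are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of headlines whose predicted category matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_precision(y_true, y_pred):
    """Unweighted mean of per-category precision (0.0 if a category is never predicted)."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        # True labels of every headline that was predicted as category c.
        predicted = [t for t, p in zip(y_true, y_pred) if p == c]
        correct = sum(t == c for t in predicted)
        per_class.append(correct / len(predicted) if predicted else 0.0)
    return sum(per_class) / len(per_class)


y_true = ["Sports", "Sports", "Sports", "Health", "Health", "Politics"]
y_pred = ["Sports", "Health", "Health", "Health", "Health", "Politics"]

print(round(accuracy(y_true, y_pred), 2))         # 0.67
print(round(macro_precision(y_true, y_pred), 2))  # 0.83
```

Here the model over-predicts one category, which lowers accuracy, while the other categories are predicted rarely but correctly, which keeps the averaged precision high.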
## License
This project is licensed under the MIT License - see the LICENSE file for details.