Welcome to the News Sorting project, which applies Natural Language Processing (NLP) techniques to the BBC News Dataset. The project classifies news articles into five predefined categories: business, entertainment, politics, sport, and tech. Using NLP, we extract features from the article text and train a machine learning model to assign each article to its category.
The BBC News Dataset consists of news articles published by the BBC, categorized into five predefined classes: business, entertainment, politics, sport, and tech. Each article contains textual content along with its corresponding category label. The dataset is available on Kaggle.
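As a rough illustration of the layout described above (one text field plus one category label per article), the sketch below builds a tiny stand-in table with pandas and counts articles per class. The rows are invented for illustration, and the column names `text` and `category` are assumptions; the actual Kaggle files may use different names.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset; the column names
# "text" and "category" are assumptions, not taken from the Kaggle files.
articles = pd.DataFrame(
    {
        "text": [
            "Shares rose after the bank reported strong quarterly profits.",
            "The film won three awards at the festival.",
            "The striker scored twice in the cup final.",
            "The minister announced a new housing bill.",
            "The new phone ships with a faster chip.",
            "Oil prices fell as supply increased.",
        ],
        "category": [
            "business",
            "entertainment",
            "sport",
            "politics",
            "tech",
            "business",
        ],
    }
)

# Count how many articles fall into each class to spot imbalance early.
class_counts = articles["category"].value_counts()
print(class_counts)
```

With the real files, a DataFrame loaded from the `data/` directory would take the place of `articles`.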
- `data/`: Directory to store the dataset files.
- `notebooks/`: Jupyter notebooks for data exploration, text preprocessing, model training, and evaluation.
- `models/`: Saved models after training.
- `results/`: Results and evaluation metrics.
- `README.md`: This file, providing an overview of the project.
To run the project, you'll need Python 3.x and the following libraries:
- numpy
- pandas
- scikit-learn
- matplotlib
- seaborn
- nltk
- wordcloud
You can install these dependencies using pip:
pip install numpy pandas scikit-learn matplotlib seaborn nltk wordcloud
- Download the BBC News Dataset from Kaggle.
- Place the dataset files in the `data/` directory.
- Open and run the Jupyter notebooks in the `notebooks/` directory sequentially. These notebooks cover data preprocessing, text feature extraction, model training, and evaluation.
- After training the models, evaluate their performance using appropriate evaluation metrics.
- Experiment with different NLP techniques, models, and hyperparameters to improve performance.
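The feature extraction and training steps listed above can be sketched with scikit-learn as follows. This is a minimal sketch, not the notebooks' actual code: the choice of TF-IDF features and logistic regression is an assumption (the project does not name a specific technique), and the short training sentences are invented placeholders for real articles.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented placeholder articles, two per category.
train_texts = [
    "The bank reported higher profits and rising share prices.",
    "Markets fell as investors worried about company earnings.",
    "The actor won an award for the new film.",
    "The band released an album and announced a concert tour.",
    "The prime minister defended the government election pledge.",
    "Parliament voted on the new tax bill proposed by ministers.",
    "The team won the match with a late goal.",
    "The champion claimed victory in the tennis final.",
    "The company launched new software for mobile phones.",
    "Researchers built a faster computer chip for broadband devices.",
]
train_labels = [
    "business", "business",
    "entertainment", "entertainment",
    "politics", "politics",
    "sport", "sport",
    "tech", "tech",
]

# Chain text vectorization and the classifier so both are fitted together.
# TF-IDF and logistic regression are illustrative choices only.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# Classify unseen text with the fitted pipeline.
new_articles = [
    "The goal in the final minute decided the match.",
    "Investors expect profits to grow at the bank.",
]
predicted = model.predict(new_articles)
for text, label in zip(new_articles, predicted):
    print(f"{label}: {text}")
```

Keeping the vectorizer and classifier in one pipeline makes it easy to swap either step when experimenting with other features, models, or hyperparameters.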
The project aims to achieve the following outcomes:
- Develop an NLP model capable of accurately categorizing news articles into business, entertainment, politics, sport, and tech categories.
- Evaluate the model's performance using metrics such as accuracy, precision, recall, and F1-score.
- Visualize the results and gain insights into the classification performance.
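The metrics named above can be computed with `sklearn.metrics`, as in the sketch below. The true and predicted labels are made up for illustration and are not results from this project.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

labels = ["business", "entertainment", "politics", "sport", "tech"]

# Made-up labels for ten articles; two of the predictions are wrong.
y_true = [
    "business", "business", "entertainment", "entertainment", "politics",
    "politics", "sport", "sport", "tech", "tech",
]
y_pred = [
    "business", "politics", "entertainment", "entertainment", "politics",
    "politics", "sport", "sport", "tech", "business",
]

# Fraction of articles whose predicted category matches the true one.
accuracy = accuracy_score(y_true, y_pred)

# Rows are true categories and columns are predicted categories,
# both in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Per-class precision, recall, and F1-score as a text table.
report = classification_report(y_true, y_pred, labels=labels, zero_division=0)

print(f"accuracy = {accuracy:.2f}")
print(cm)
print(report)
```

For the visualization step, a matrix like `cm` could be drawn as a heatmap with seaborn to show which categories are confused with each other.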
This project is licensed under the MIT License - see the LICENSE file for details.