Overview: This project focuses on developing a news classification system using Natural Language Processing (NLP) techniques. The goal is to automatically categorize news articles into different topics, such as sports, business, politics, and weather. The project involves web scraping, data preprocessing, and the implementation of machine learning models to achieve accurate and reliable classification.
Key Features: Web Scraping: The system extracts news articles from diverse sources, including prominent websites such as BBC, The Hindu, Times Now, and CNN, using web scraping tools like BeautifulSoup and Selenium.
Data Cleaning and Preprocessing: Irrelevant information, including HTML tags and advertisements, is removed. The text is tokenized, stop words are eliminated, and lemmatization or stemming is applied for effective data preprocessing.
Text Representation: The textual data is converted into a numerical format suitable for machine learning models. This includes leveraging techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings such as Word2Vec and GloVe.
Topic Clustering: Clustering algorithms, such as K-means and hierarchical clustering, are applied to group similar articles together based on their content.
Topic Labeling: Manual inspection of a sample of articles within each cluster is conducted to assign meaningful topic labels, enhancing the interpretability of the classification.
Classification Model: A supervised machine learning model is trained to predict the topic of a news article. The model may include algorithms like Naive Bayes, Support Vector Machines, or advanced models like LSTM or BERT. Labeled clusters serve as ground truth labels for model training.
Evaluation: The performance of the classification model is rigorously evaluated on a testing set, utilizing metrics such as accuracy, precision, recall, and F1-score.
Deployment with Streamlit: The classification model is deployed through a user-friendly Streamlit application, allowing users to input news text and receive predictions on the category in real-time.