Kitwe-Focused News Project-Omdena

A news aggregation project that collects, cleans, and analyzes local news articles about Kitwe, Zambia, drawn from Zambian news sources.

🔥 Project Overview

The Kitwe-Focused News Project aims to gather, analyze, and interpret local news articles centered around Kitwe, Zambia, by creating a structured dataset from reputable Zambian news sources. This README will document each stage of the project, beginning with Data Collection and continuing through Data Preprocessing, Data Analysis, and Insights Generation.

Project Stages

Data Collection
Data Preprocessing

👨‍💻 Author

Hi! I'm Ambigathi, an early-career data scientist with experience in machine learning, API development, and model deployment. I'm passionate about creating solutions that have real-world impact.

LinkedIn: linkedin.com/in/ambigathi
GitHub: github.com/yourusername

1. Data Collection

The Data Collection stage focuses on gathering Kitwe-centered news articles to create a dataset for analysis.

Approach

To collect relevant articles, we employed a systematic approach using RSS (Really Simple Syndication) feeds from Zambian news websites. The process involves:

RSS Feed Access: Accessing RSS feeds from reputable sources to ensure real-time, structured updates.
Filtering by “Kitwe”: Using a Python script to retrieve only articles tagged with "Kitwe" from these feeds, ensuring that the dataset remains focused on Kitwe-related news.
Iterative Data Extraction: Configuring the script to retrieve up to 1,000 pages per source to maximize data collection.
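
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function names, the `?paged=N` pagination scheme (common on WordPress feeds), and the substring match on tag terms are all assumptions.

```python
from typing import Iterable, Iterator, Mapping


def has_kitwe_tag(entry: Mapping, keyword: str = "kitwe") -> bool:
    """Return True if any tag term on the entry contains the keyword."""
    tags = entry.get("tags") or []
    return any(keyword in (tag.get("term") or "").lower() for tag in tags)


def filter_kitwe(entries: Iterable[Mapping]) -> Iterator[Mapping]:
    """Yield only the entries tagged with "Kitwe"."""
    return (entry for entry in entries if has_kitwe_tag(entry))


def collect(feed_url: str, max_pages: int = 1000) -> list:
    """Walk a paginated feed and keep the Kitwe-tagged entries."""
    import feedparser  # third-party: pip install feedparser

    articles = []
    for page in range(1, max_pages + 1):
        # Assumption: the site paginates its feed with ?paged=N.
        parsed = feedparser.parse(f"{feed_url}?paged={page}")
        if not parsed.entries:
            break  # an empty page is treated as the end of the feed
        articles.extend(filter_kitwe(parsed.entries))
    return articles
```

Keeping the filter separate from the network call makes it easy to check the tag logic on plain dictionaries before pointing it at a live feed.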

Tools and Libraries

Python: The primary programming language for data collection.
Feedparser: Used to parse and extract content from RSS feeds efficiently.
CSV Library: Employed to store the data in a structured CSV file format, with each row representing a unique news article.

Data Structure

Each article entry in the dataset includes:

Source: Origin of the news article
Category Tags: Relevant categories or tags from the source
Headline: Title of the article
Link: URL to the article
Description: Brief description of the article
Publication Date: Date the article was published
Author: Author of the article, if available
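
As an illustration of this layout, the sketch below maps one parsed feed entry to a CSV row. The column names and the helper functions are hypothetical; the entry keys (`title`, `link`, `summary`, `published`, `author`, `tags`) are the ones Feedparser normally exposes, and any that are absent are written as empty strings.

```python
import csv
import io

# Hypothetical column names mirroring the fields listed above.
FIELDNAMES = [
    "source", "category_tags", "headline", "link",
    "description", "publication_date", "author",
]


def entry_to_row(entry, source: str) -> dict:
    """Map one feed entry to a flat row; missing fields become ""."""
    tags = entry.get("tags") or []
    return {
        "source": source,
        "category_tags": "; ".join(tag.get("term") or "" for tag in tags),
        "headline": entry.get("title", ""),
        "link": entry.get("link", ""),
        "description": entry.get("summary", ""),
        "publication_date": entry.get("published", ""),
        "author": entry.get("author", ""),
    }


def rows_to_csv(rows) -> str:
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

To write to disk instead, the same `csv.DictWriter` can be given a file opened with `open(path, "w", newline="", encoding="utf-8")`.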

Challenges

Some challenges encountered during the data collection stage include:

Inconsistent Data Fields: Some articles lacked information like publication dates or descriptions, resulting in gaps in the dataset.
Irregular Pagination: Non-standard pagination on some sites caused the script to halt, requiring adjustments.
Connection Interruptions: Occasional connectivity issues with RSS feeds led to delays. Retry mechanisms were implemented, but some interruptions persisted.
Duplicate Entries: Repeated articles across pages required duplicate filtering for data integrity.
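
The retry mechanism mentioned above is not shown in this README. One common shape for it is exponential backoff, sketched here under assumptions: `with_retries` and its parameters are hypothetical, and `fetch` stands for any callable that raises `OSError` (or a subclass such as `ConnectionError`) when the request fails.

```python
import time


def with_retries(fetch, attempts: int = 3, base_delay: float = 1.0,
                 sleep=time.sleep):
    """Call fetch(), retrying on OSError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError as exc:  # covers ConnectionError and URLError
            last_error = exc
            if attempt < attempts - 1:
                # Wait base_delay, then 2x, 4x, ... before retrying.
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` applies.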

Limitations

RSS Dependence: Only articles indexed by the RSS feeds were collected, potentially limiting coverage.
Incomplete Metadata: Some articles lacked sufficient details, complicating classification efforts.
Manual Quality Checks: Minor manual checks introduced a small margin for human error.

2. Data Cleaning

This stage focuses on improving the dataset's quality for analysis.

Key Steps

Load Data: Import the raw dataset from a CSV file.
Handle Missing Values: Remove or impute missing entries.
Remove Duplicates: Eliminate duplicate articles based on headline and publication date.
Text Preprocessing:

Lowercase text.
Remove special characters and stop words.
Lemmatize words.

Normalize Text: Remove URLs embedded in the text so they do not add noise to the analysis.
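
A simplified sketch of these steps, using only the standard library, might look like the following. The stop-word set is a tiny illustrative sample, the row keys `headline` and `publication_date` are assumed column names, and lemmatization is left out because it needs an NLP library such as NLTK or spaCy.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")
# Illustrative sample only; a real run would use a full stop-word list.
STOP_WORDS = frozenset({
    "a", "an", "the", "and", "or", "of", "in", "on",
    "to", "is", "are", "for", "at", "by", "with",
})


def clean_text(text: str) -> str:
    """Lowercase, strip URLs and special characters, drop stop words."""
    text = URL_RE.sub(" ", text.lower())
    text = NON_ALNUM_RE.sub(" ", text)
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


def drop_duplicates(rows: list) -> list:
    """Keep the first row for each (headline, publication date) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (
            row.get("headline", "").strip().lower(),
            row.get("publication_date", "").strip(),
        )
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

URLs are removed before the special-character pass so that fragments of a link do not survive as stray words.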

Logging

Track counts before and after cleaning, including any entries removed.
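
For example, a small helper built on Python's `logging` module could record the counts for each cleaning step. The logger name and the function are hypothetical.

```python
import logging

# Hypothetical logger name; any name works.
logger = logging.getLogger("newsaggregator.cleaning")


def log_cleaning_summary(step: str, before: int, after: int) -> int:
    """Log row counts around one cleaning step; return the number removed."""
    removed = before - after
    logger.info(
        "%s: %d rows before, %d rows after, %d removed",
        step, before, after, removed,
    )
    return removed
```

Calling `logging.basicConfig(level=logging.INFO)` once at start-up makes these messages visible on the console.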

Challenges

Inconsistent article titles can complicate duplicate detection.
Articles with incomplete information require careful handling.
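
One way to reduce the impact of inconsistent titles is to compare headlines through a normalized key rather than the raw string. The function below is an illustrative assumption, not the project's method: it folds case, normalizes Unicode, and treats punctuation and repeated whitespace as insignificant.

```python
import re
import unicodedata


def headline_key(headline: str) -> str:
    """Build a comparison key that ignores case, punctuation, and spacing."""
    text = unicodedata.normalize("NFKC", headline).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    return " ".join(text.split())  # collapse runs of whitespace
```

Two headlines that differ only in capitalization, quote style, or trailing punctuation then map to the same key and can be treated as duplicates.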

🛠️ Installation

Clone the repository:

git clone https://github.com/Ambigapathi-V/Kitwe-Local-News-Aggregator-Omdena-

Install dependencies:
```
pip install -r requirements.txt
```
Run the pipeline:
```
python main.py
```

💬 Feedback

If you have any feedback, feel free to reach out via feedback.

🌟 Support

For support, please contact support.

