Challenge - News Content Collect and Store

In this project, I was challenged to develop a complete application that scrapes articles from The Guardian's website, processes and stores the data in Google BigQuery, and provides an API built with FastAPI for querying the collected data.

The key concepts and principles applied include:

  • Asynchronous Programming with asyncio and aiohttp to efficiently handle I/O-bound operations during web scraping.
  • Use of the Factory Pattern to create scraper instances and manage processing and transformations, which keeps the code maintainable and makes it straightforward to add new data sources (a short sketch follows this list).
  • Application of SOLID Principles to enhance modularity and separation of concerns, promoting cleaner and more maintainable code.
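
A minimal sketch of how the first two points can fit together. The names used here (GuardianScraper, ScraperFactory, collect) are illustrative assumptions, not the repository's actual modules:

# Illustrative sketch: GuardianScraper and ScraperFactory are assumed names,
# not necessarily the classes used in this repository.
import asyncio
import aiohttp

class GuardianScraper:
    """Fetches article pages from The Guardian asynchronously."""
    async def fetch(self, session: aiohttp.ClientSession, url: str) -> str:
        async with session.get(url) as response:
            response.raise_for_status()
            return await response.text()

class ScraperFactory:
    """Factory Pattern: returns a scraper instance for a given source."""
    _registry = {"guardian": GuardianScraper}

    @classmethod
    def create(cls, source: str):
        return cls._registry[source]()

async def collect(urls):
    scraper = ScraperFactory.create("guardian")
    async with aiohttp.ClientSession() as session:
        # Non-blocking I/O: all requests are awaited concurrently.
        return await asyncio.gather(*(scraper.fetch(session, url) for url in urls))

# asyncio.run(collect(["https://www.theguardian.com/world"]))

Adding another source would only require registering a new scraper class in the factory, which is what keeps the design open to additional data sources.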

Technologies Used

  • Python 3.9
  • FastAPI: Framework for building APIs with Python.
  • Uvicorn: A lightning-fast ASGI server used to run the FastAPI application.
  • Google Cloud BigQuery: Data warehouse used to store the collected articles.
  • aiohttp and asyncio: Libraries for asynchronous HTTP clients and servers, enabling non-blocking I/O operations.
  • Pandas: Data analysis and manipulation.
  • Docker: A platform for developing, shipping, and running applications inside containers.
  • Docker Compose: A tool for defining and running multi-container Docker applications.
  • YAML: Used for configuration files.
  • db-dtypes: A package providing support for BigQuery data types when using Pandas.
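
As an illustration of how Pandas, BigQuery, and db-dtypes typically interact in a pipeline like this one (the dataset and table names below are placeholders, not the project's actual configuration):

# Sketch under assumed names: "news_dataset.articles" is a placeholder table.
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()

# Load processed articles into BigQuery.
df = pd.DataFrame([{"title": "Example headline", "url": "https://www.theguardian.com/world"}])
client.load_table_from_dataframe(df, "news_dataset.articles").result()

# Read back; db-dtypes lets Pandas represent BigQuery types such as DATE and TIMESTAMP.
articles = client.query("SELECT title, url FROM news_dataset.articles").to_dataframe()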

Requirements

  • Docker and Docker Compose installed on the system.
  • A Google Cloud Platform account with access to BigQuery.
  • Google Cloud credentials saved in ./credentials/your_credentials.json.
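
A quick way to confirm the key is picked up correctly (the file name below is the placeholder from the path above; substitute your own):

# Placeholder file name; replace with your actual key in ./credentials/.
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("./credentials/your_credentials.json")
print(client.project)  # Should print the GCP project the key belongs to.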

How to Run the Project

  1. Create a Google Cloud project:
    • In the Google Cloud Console, create a project named news-collect-project, or change the project name in the config.yaml files.
  2. Create a service account and grant it the BigQuery Admin role.
  3. Create a service account key.
  4. Save the credentials .json file in the ./credentials/ folder, and/or change the file name referenced in both the Dockerfiles and the config.yaml files.
  5. Build the Docker image:
  • In the project’s root directory, run:

docker-compose build

  6. Start the services:

docker-compose up

  7. Access the API:

http://localhost:8000/api/articles

http://localhost:8000/api/articles?keyword=peacekeeping
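
The endpoint behind these URLs can be pictured roughly as in the sketch below; the in-memory ARTICLES list and the filtering logic are simplified assumptions standing in for the project's BigQuery-backed implementation, while the route path and keyword parameter mirror the URLs above.

# Simplified sketch; in the real application the data would come from BigQuery.
from typing import Optional
from fastapi import FastAPI

app = FastAPI()

# Placeholder data standing in for the BigQuery query results.
ARTICLES = [
    {"title": "UN extends peacekeeping mandate", "url": "https://www.theguardian.com/world"},
]

@app.get("/api/articles")
def get_articles(keyword: Optional[str] = None):
    # Optional filter, mirroring /api/articles?keyword=peacekeeping
    if keyword:
        return [a for a in ARTICLES if keyword.lower() in a["title"].lower()]
    return ARTICLES

# Run with: uvicorn main:app --host 0.0.0.0 --port 8000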

Contact

For more information or questions, please contact:

Name: Guilherme Carvalho

Email: guilhermerdcarvalho@gmail.com

LinkedIn: https://www.linkedin.com/in/devguilhermecarvalho/
