Takes a list of RSS feeds, downloads the articles they link to, processes them, and stores the results in a SQLite database.
This project uses Trafilatura to extract text from HTML pages and feedparser to parse RSS feeds.
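Conceptually, the pipeline looks something like the sketch below. It is a simplified illustration of the idea, not the project's actual code; it assumes only the public APIs of feedparser and trafilatura, and the feed URL is a placeholder.

```python
import feedparser
import trafilatura

# Parse an RSS feed and walk its entries.
feed = feedparser.parse("https://example.com/rss")

for entry in feed.entries:
    # Download the linked article and extract the main text from the HTML.
    html = trafilatura.fetch_url(entry.link)
    if html is None:
        continue
    text = trafilatura.extract(html)
    print(entry.title, (text or "")[:80])
```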
This project uses Poetry to manage dependencies. Make sure you have it installed.

Add Newscorpus as a dependency to your own project:

```bash
poetry add "git+https://github.com/gambolputty/newscorpus.git"
```

Or set it up locally for development:

```bash
# Clone this repository
git clone git@github.com:gambolputty/newscorpus.git

# Install dependencies with Poetry
cd newscorpus
poetry install
```
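To check the installation, you can print the scraper's help menu (the `--help` flag is described in the options table below):

```bash
poetry run scrape --help
```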
Copy the example sources file and edit it to your liking:

```bash
cp sources.example.json sources.json
```
It is expected to be in the following format:

```json
[
    {
        "id": 0,
        "name": "Example",
        "url": "https://example.com/rss"
    },
    ...
]
```
To start the scraping process, run:

```bash
poetry run scrape [OPTIONS]
```
| Option | Default | Description |
|---|---|---|
| `--src-path` | `sources.json` | Path to a `sources.json` file. |
| `--db-path` | `newscorpus.db` | Path to the SQLite database to use. |
| `--debug` | none (flag) | Show debug information. |
| `--workers` | `4` | Number of download workers. |
| `--keep` | `2` | Don't save articles older than n days. |
| `--min-length` | `350` | Don't process articles whose text is shorter than x characters. |
| `--help` | none (flag) | Show the help menu. |
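A typical invocation combining several of these options might look like this (the database filename and values are just illustrative):

```bash
poetry run scrape --src-path sources.json --db-path my-corpus.db --workers 8 --keep 7
```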
Access the database within your Python script:

```python
from newscorpus.database import Database

db = Database()

for article in db.iter_articles():
    print(article.title)
    print(article.published_at)
    print(article.text)
    print()
```
Arguments to `iter_articles()` are the same as for `rows_where()` in sqlite-utils (Docs, Reference).
The `Database` class takes an optional `path` argument to specify the path to the database file.
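A filtered query might then look like the following sketch. It assumes only what is stated above: `Database` accepts an optional path, and `iter_articles()` forwards `rows_where()`-style arguments (`where`, `where_args`, `order_by`, `limit`); the path and query values are hypothetical.

```python
from newscorpus.database import Database

# Open a database at a custom location (this path is hypothetical).
db = Database("data/newscorpus.db")

# iter_articles() takes the same arguments as sqlite-utils' rows_where(),
# so SQL fragments and bound parameters can be passed directly.
for article in db.iter_articles(
    where="length(text) >= :min_len",
    where_args={"min_len": 1000},
    order_by="published_at desc",
    limit=10,
):
    print(article.title)
```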