Takes a list of RSS feeds, downloads the articles they link to, processes them, and stores the results in a SQLite database.
This project uses Trafilatura to extract text from HTML pages and feedparser to parse RSS feeds.
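Conceptually, the pipeline looks something like the sketch below. It is a simplified illustration of the idea, not the project's actual code; it assumes only the public APIs of feedparser and trafilatura, and the feed URL is a placeholder.

```python
import feedparser
import trafilatura

# Parse an RSS feed and walk its entries.
feed = feedparser.parse("https://example.com/rss")

for entry in feed.entries:
    # Download the linked article and extract the main text from the HTML.
    html = trafilatura.fetch_url(entry.link)
    if html is None:
        continue
    text = trafilatura.extract(html)
    print(entry.title, (text or "")[:80])
```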
This project uses Poetry to manage dependencies. Make sure you have it installed.

Add Newscorpus as a dependency to your own project:

```bash
poetry add "git+https://github.com/gambolputty/newscorpus.git"
```

Or set it up locally for development:

```bash
# Clone this repository
git clone git@github.com:gambolputty/newscorpus.git

# Install dependencies with Poetry
cd newscorpus
poetry install
```
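To check the installation, you can print the scraper's help menu (the `--help` flag is described in the options table below):

```bash
poetry run scrape --help
```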
Copy the example sources file and edit it to your liking:

```bash
cp sources.example.json sources.json
```
It is expected to be in the following format:

```json
[
    {
        "id": 0,
        "name": "Example",
        "url": "https://example.com/rss"
    },
    ...
]
```
To start the scraping process, run:

```bash
poetry run scrape [OPTIONS]
```
| Option | Default | Description |
|---|---|---|
| `--src-path` | `sources.json` | Path to a `sources.json` file. |
| `--db-path` | `newscorpus.db` | Path to the SQLite database to use. |
| `--debug` | none (flag) | Show debug information. |
| `--workers` | `4` | Number of download workers. |
| `--keep` | `2` | Don't save articles older than n days. |
| `--min-length` | `350` | Don't process articles whose text is shorter than x characters. |
| `--help` | none (flag) | Show the help menu. |
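A typical invocation combining several of these options might look like this (the database filename and values are just illustrative):

```bash
poetry run scrape --src-path sources.json --db-path my-corpus.db --workers 8 --keep 7
```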
Access the database within your Python script:

```python
from newscorpus.database import Database

db = Database()

for article in db.iter_articles():
    print(article.title)
    print(article.published_at)
    print(article.text)
    print()
```
Arguments to `iter_articles()` are the same as for `rows_where()` in sqlite-utils (Docs, Reference).
The `Database` class takes an optional `path` argument to specify the path to the database file.
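A filtered query might then look like the following sketch. It assumes only what is stated above: `Database` accepts an optional path, and `iter_articles()` forwards `rows_where()`-style arguments (`where`, `where_args`, `order_by`, `limit`); the path and query values are hypothetical.

```python
from newscorpus.database import Database

# Open a database at a custom location (this path is hypothetical).
db = Database("data/newscorpus.db")

# iter_articles() takes the same arguments as sqlite-utils' rows_where(),
# so SQL fragments and bound parameters can be passed directly.
for article in db.iter_articles(
    where="length(text) >= :min_len",
    where_args={"min_len": 1000},
    order_by="published_at desc",
    limit=10,
):
    print(article.title)
```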