feat: cache locations action (#31)
* feat: github workflow for caching parishes (#30)

* chore(README): badges for workflow

* chore: update README

* chore: bump version from 0.4.1 to 0.4.2
lsg551 authored May 5, 2024
1 parent 8b3a348 commit 5b833fe
Showing 3 changed files with 53 additions and 1 deletion.
41 changes: 41 additions & 0 deletions .github/workflows/cache-parishes.yml
@@ -0,0 +1,41 @@
# This workflow runs the scraper once a week via GitHub Actions.
# It scrapes the parishes and pushes the compressed file to the branch `cache/parishes`.

name: Cache Parishes
on:
  workflow_dispatch:
  schedule:
    - cron: "0 3 * * 1" # every Monday at 03:00 UTC
env:
  # TEMPORARY: pin this version as long as the tool is <= 1.x to avoid breaking anything
  MOS_VERSION: v0.4.1 # matricula-online-scraper version

jobs:
  cache-parishes:
    name: Cache Parishes
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4
        with:
          ref: cache/parishes
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: 3.12
          check-latest: true
      - run: python -m pip install --upgrade pip
      # Install this tool from PyPI instead of building it from source
      - name: Install from PyPI
        run: pip install matricula-online-scraper==$MOS_VERSION
      - name: Scrape parishes
        run: matricula-online-scraper fetch location "parishes" -e csv # -> 'parishes.csv'
      - name: Zip file
        run: gzip parishes.csv # -> 'parishes.csv.gz'
      - name: Push to branch 'cache/parishes'
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add parishes.csv.gz
          git commit -m "cache parishes"
          git push
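To make the schedule concrete, here is a minimal Python sketch of what the cron expression `0 3 * * 1` in the workflow above means: the job is due every Monday at 03:00 UTC. The function name `next_scheduled_run` is hypothetical and not part of this repository, and the sketch handles only this one expression, not cron syntax in general.

```python
from datetime import datetime, timedelta, timezone


def next_scheduled_run(now: datetime) -> datetime:
    """Return the first Monday 03:00 UTC strictly after `now`.

    Hypothetical helper: it mirrors only the cron expression "0 3 * * 1".
    `now` must be timezone-aware.
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now = now.astimezone(timezone.utc)
    # 03:00 on the Monday of the current week (weekday() == 0 is Monday).
    candidate = (now - timedelta(days=now.weekday())).replace(
        hour=3, minute=0, second=0, microsecond=0
    )
    # If that moment is already past (or is exactly now), use next week's Monday.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


if __name__ == "__main__":
    print(next_scheduled_run(datetime.now(timezone.utc)).isoformat())
```

The computed time is only the nominal trigger; scheduled GitHub Actions runs can start later than that when the service is under load.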
11 changes: 11 additions & 0 deletions README.md
@@ -8,6 +8,15 @@
[Matricula Online](https://data.matricula-online.eu/) is a website that hosts parish registers from various regions across Europe. This CLI tool allows you to fetch data from it and save the data to a file.

---

Our GitHub Workflow automatically scrapes a list of all parishes once a week and pushes it to [`cache/parishes`](https://github.com/lsg551/matricula-online-scraper/tree/cache/parishes). Download [`parishes.csv.gz`](https://github.com/lsg551/matricula-online-scraper/raw/cache/parishes/parishes.csv.gz) ⚡️

[![Cache Parishes](https://github.com/lsg551/matricula-online-scraper/actions/workflows/cache-parishes.yml/badge.svg)](https://github.com/lsg551/matricula-online-scraper/actions/workflows/cache-parishes.yml)
![GitHub last commit (branch)](https://img.shields.io/github/last-commit/lsg551/matricula-online-scraper/cache%2Fparishes?path=parishes.csv.gz&label=last%20caching&cacheSeconds=43200)

---

Note that this tool will not format or clean the data in any way. Instead, the data is saved as-is to a file. I mention this because the original data is especially poorly formatted and contains a lot of inconsistencies. It is up to the user to process the data further.

## 🔧 Installation
@@ -38,6 +47,8 @@ Fetch all available locations and save them to a `.jsonl` file:
$ matricula-online-scraper fetch locations ./output.jsonl
```

> :warning: This will fetch all parishes from Matricula Online, which may take a few minutes. This data changes only rarely, so frequent scraping puts unnecessary load on the server. Our GitHub Workflow therefore caches this data once a week and pushes it to [`cache/parishes`](https://github.com/lsg551/matricula-online-scraper/tree/cache/parishes). ⚡️ [Download CSV](https://github.com/lsg551/matricula-online-scraper/raw/cache/parishes/parishes.csv.gz) ⚡️
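
Instead of scraping, the cached file can also be read directly. The following is a minimal Python sketch, not part of this tool: the function names `parse_parishes` and `download_parishes` are hypothetical, and it assumes the cached CSV is UTF-8 encoded and starts with a header row.

```python
import csv
import gzip
import io
import urllib.request

# Download link of the cached file on the `cache/parishes` branch.
CACHE_URL = (
    "https://github.com/lsg551/matricula-online-scraper"
    "/raw/cache/parishes/parishes.csv.gz"
)


def parse_parishes(gz_bytes: bytes) -> list[dict[str, str]]:
    """Decompress gzipped CSV bytes and return one dict per row.

    Assumption: UTF-8 text whose first line holds the column names.
    """
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text, newline="")))


def download_parishes(url: str = CACHE_URL) -> list[dict[str, str]]:
    """Fetch the cached file over HTTPS and parse it (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return parse_parishes(response.read())


if __name__ == "__main__":
    parishes = download_parishes()
    print(f"{len(parishes)} parishes in the cache")
```

Parsing is kept separate from downloading so the CSV handling can be checked offline against a small in-memory sample.
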
### Example 2:

Fetch all available registers from one parish in Münster, Germany and save them to a `.jsonl` file:
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "matricula-online-scraper"
-version = "0.4.1"
+version = "0.4.2"
description = "Command Line Interface tool for scraping Matricula Online https://data.matricula-online.eu."
repository = "https://github.com/lsg551/matricula-online-scraper"
authors = ["Luis Schulte"]