Airflow for harvesting data for open access analysis and research intelligence. The workflow integrates data from sul_pub, rialto-orgs, OpenAlex and Dimensions APIs to provide a view of publication data for Stanford University research. The basic workflow is: fetch Stanford research publications from SUL-Pub, OpenAlex, and Dimensions; enrich them with additional metadata from OpenAlex and Dimensions using the DOI; merge in the organizational data found in rialto-orgs; and publish the data to our JupyterHub environment.
flowchart TD
sul_pub_harvest(SUL-Pub harvest) --> sul_pub_pubs[/SUL-Pub publications/]
rialto_orgs_export(Manual RIALTO app export) --> org_data[/Stanford organizational data/]
org_data --> dimensions_harvest_orcid(Dimensions harvest ORCID)
org_data --> openalex_harvest_orcid(OpenAlex harvest ORCID)
dimensions_harvest_orcid --> dimensions_orcid_doi_dict[/Dimensions DOI-ORCID dictionary/]
openalex_harvest_orcid --> openalex_orcid_doi_dict[/OpenAlex DOI-ORCID dictionary/]
dimensions_orcid_doi_dict -- DOI --> doi_set(DOI set)
openalex_orcid_doi_dict -- DOI --> doi_set(DOI set)
sul_pub_pubs -- DOI --> doi_set(DOI set)
doi_set --> dois[/All unique DOIs/]
dois --> dimensions_enrich(Dimensions harvest DOI)
dois --> openalex_enrich(OpenAlex harvest DOI)
dimensions_enrich --> dimensions_enriched[/Dimensions publications/]
openalex_enrich --> openalex_enriched[/OpenAlex publications/]
dimensions_enriched -- DOI --> merge_pubs(Merge publications)
openalex_enriched -- DOI --> merge_pubs
sul_pub_pubs -- DOI --> merge_pubs
merge_pubs --> all_enriched_publications[/All publications/]
all_enriched_publications --> join_org_data(Join organizational data)
org_data --> join_org_data
join_org_data --> publications_with_org[/Publication with organizational data/]
publications_with_org -- DOI & SUNET --> contributions(Publications to contributions)
contributions --> contributions_set[/All contributions/]
contributions_set --> publish(Publish)
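The "DOI set" and "Merge publications" steps in the flowchart can be sketched roughly as follows. This is a minimal illustration and not the project's implementation: the function names, the record shape (a dict with a `doi` key), the assumption that the DOI-ORCID dictionaries are keyed by DOI, and the DOI normalization rule are all hypothetical.

```python
from collections import defaultdict


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip a resolver prefix (illustrative rule only)."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def unique_dois(sul_pub_pubs, openalex_doi_orcid, dimensions_doi_orcid):
    """Union the DOIs from SUL-Pub records and the two DOI-ORCID dictionaries.

    Assumes each dictionary is keyed by DOI (hypothetical shape).
    """
    dois = {normalize_doi(p["doi"]) for p in sul_pub_pubs if p.get("doi")}
    dois.update(normalize_doi(d) for d in openalex_doi_orcid)
    dois.update(normalize_doi(d) for d in dimensions_doi_orcid)
    return dois


def merge_publications(sul_pub_pubs, openalex_pubs, dimensions_pubs):
    """Group the records from each source under their shared DOI."""
    merged = defaultdict(dict)
    sources = (
        ("sul_pub", sul_pub_pubs),
        ("openalex", openalex_pubs),
        ("dimensions", dimensions_pubs),
    )
    for source, pubs in sources:
        for pub in pubs:
            # Records without a DOI cannot be matched across sources, so skip them.
            if pub.get("doi"):
                merged[normalize_doi(pub["doi"])][source] = pub
    return dict(merged)
```

Keying every record by a normalized DOI lets the same publication from different sources land in one entry, which a later step could join with the organizational data.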
The following setup is based on the Airflow documentation, Running Airflow in Docker.
- Clone the repository:
git clone https://github.com/sul-dlss/rialto-airflow.git
- Start up Docker locally.
- Create a .env file with the AIRFLOW_UID and AIRFLOW_GROUP values. For local development these can usually be:
AIRFLOW_UID=50000
AIRFLOW_GROUP=0
AIRFLOW_VAR_DATA_DIR="data"
(See Airflow docs for more info.)
- Add to the .env file values for any environment variables used by DAGs. This is not in place yet; they will usually be applied to VMs by Puppet once productionized.
Here is a script to generate content for your dev .env file:
for i in `vault kv list -format yaml puppet/application/rialto-airflow/dev | sed 's/- //'` ; do \
val=$(echo $i| tr '[a-z]' '[A-Z]'); \
echo AIRFLOW_VAR_$val=`vault kv get -field=content puppet/application/rialto-airflow/dev/$i`; \
done
- The harvest DAG requires a CSV file of authors from rialto-orgs to be available. This is not yet automatically available, so to set up locally, download the file at https://sul-rialto-dev.stanford.edu/authors?action=index&commit=Search&controller=authors&format=csv&orcid_filter=&q= and put the authors.csv file in the data/ directory.
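Airflow exposes an environment variable named AIRFLOW_VAR_<NAME> as the Airflow Variable `<name>`, which is why the shell loop above uppercases each key. The same rendering can be sketched in Python as below; the function name is hypothetical, and fetching the values from Vault is left out.

```python
def airflow_var_env_lines(secrets: dict[str, str]) -> list[str]:
    """Render .env lines using Airflow's AIRFLOW_VAR_<NAME> naming convention.

    `secrets` maps lowercase variable names to values; how the values are
    obtained (for example from Vault) is outside this sketch.
    """
    lines = []
    for key, value in sorted(secrets.items()):
        # Uppercase the key, mirroring the `tr` call in the shell loop.
        lines.append(f"AIRFLOW_VAR_{key.upper()}={value}")
    return lines


if __name__ == "__main__":
    # "data_dir" matches the AIRFLOW_VAR_DATA_DIR entry shown earlier.
    print("\n".join(airflow_var_env_lines({"data_dir": "data"})))
```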
- Install uv for dependency management as described in the uv docs.
- Create a virtual environment:
uv venv
This will create the virtual environment at the default location of .venv/. uv automatically looks for a venv at this location when installing dependencies.
- Activate the virtual environment:
source .venv/bin/activate
- Install the dependencies:
uv pip install -r requirements.txt
To add a dependency:
uv pip install flask
- Add the dependency to pyproject.toml.
- To re-generate the locked dependencies in requirements.txt:
uv pip compile pyproject.toml -o requirements.txt
Unlike Poetry, uv's dependency resolution is not platform-agnostic. If we find we need to generate a requirements.txt for Linux, we can use uv's multi-platform resolution options.
To upgrade Python dependencies:
uv pip compile pyproject.toml -o requirements.txt --upgrade
First enable the virtual environment:
source .venv/bin/activate
Then ensure the app dependencies and dev dependencies are installed.
uv pip install -r requirements.txt -r requirements-dev.txt
Then run the tests:
pytest
- Run linting:
ruff check
- Automatically fix linting:
ruff check --fix
- Run formatting:
ruff format
(or ruff format --check to identify any unformatted files)
First you'll need to build a Docker image and publish it to Docker Hub:
DOCKER_DEFAULT_PLATFORM="linux/amd64" docker build . -t suldlss/rialto-airflow:latest
docker push suldlss/rialto-airflow
Deployment to https://sul-rialto-airflow-dev.stanford.edu/ is handled like other SDR services using Capistrano. You'll need to have Ruby installed and then:
bundle exec cap dev deploy