Data processing and microsite for Leeds 2023
New: this repo contains a `velociraptor.yaml` file to capture scripts. Take a look at https://velociraptor.run/. More documentation to come.
The repo contains a series of pipelines which are used to collect and process data.
If you are running the Python scripts, you will need to install the dependencies listed in `requirements.txt`.

You will also need to set `PYTHONPATH` in your environment to include `scripts`. On a Mac, this can be achieved with the following command: `export PYTHONPATH=scripts`. Without that, the scripts will not run and will throw an error similar to `ModuleNotFoundError: No module named 'metrics'`.
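For example, a minimal sketch of running one of the scripts from the repo root (the script name below is a hypothetical placeholder; use one of the files in `scripts/`):

```sh
# Make the modules under scripts/ (such as metrics) importable
export PYTHONPATH=scripts

# Run a script from the repo root (example_script.py is a hypothetical placeholder)
python3 scripts/example_script.py
```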
Some of the scripts and data are managed in a DVC pipeline. DVC has been added to the `requirements.txt` file, so ensure that your Python environment has the required dependencies installed. This could be as simple as running `pip3 install -r requirements.txt`. It's recommended to use a virtual environment tool such as `virtualenv` to avoid clashing requirements.
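For example, a typical setup on macOS/Linux might look like this (the `.venv` directory name is an arbitrary choice):

```sh
# Create and activate an isolated environment
virtualenv .venv
source .venv/bin/activate

# Install the project dependencies, including DVC
pip3 install -r requirements.txt

# Confirm DVC is available in the environment
dvc --version
```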
The repo uses data held in AWS S3 buckets. To access this, make sure `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are set for your environment.
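For example (the values shown are placeholders, assuming DVC and the scripts pick up the standard AWS environment variables):

```sh
# Standard AWS credentials read from the environment; replace with your own
export AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```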
Here are some useful DVC commands (a typical workflow is sketched after the list):

- Check the DVC status by running `dvc status`.
- To pull the latest data, run `dvc update -R working`.
- You can run all pipelines with `dvc repro -P`. If no stage dependencies (input files or code) have changed, nothing will be executed.
- To list the available pipeline stages, run `dvc stage list --all`.
- You can see the dependency graph with `dvc dag`.
- You can force a stage to re-run using `dvc repro --force <stage name>`.
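For example, a typical sequence when picking up the latest data and re-running the pipelines:

```sh
# See which stages and data are out of date
dvc status

# Pull the latest data into the working area
dvc update -R working

# Re-run any pipeline stages whose dependencies have changed
dvc repro -P
```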
- You may encounter an issue with the npm package markdown-it-attrs that prevents serving the site, displaying the error message `could not find npm package markdown-it-attrs`.
- This is a known bug caused by the way deno caches npm packages. Run `echo "import 'lume/cli.ts'" | deno run --unstable -A --reload -` to refresh deno's cache.
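For convenience, the same cache-refresh command as a copy-pasteable block:

```sh
# Force deno to re-download the lume CLI and its npm dependencies, refreshing the cache
echo "import 'lume/cli.ts'" | deno run --unstable -A --reload -
```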