Reflow README, add note on docker compose dev loop
tomalrussell committed Jun 14, 2024
1 parent 3e17325 commit 1ff4e20
Showing 1 changed file: README.md (138 additions, 64 deletions).
# IRV AutoPackaging

FastAPI + Celery for executing ETL (boundary clipping, re-formatting, move to
S3/Local FS) against source datasets hosted on
https://global.infrastructureresilience.org (see also
https://github.com/nismod/infra-risk-vis).

Encompasses API and backend-processing to generate / manage frictionless-data
datapackages associated with boundaries.

## Architecture

<a href="url"><img src="docs/architecture.png" width="500"></a>


## API

API covering boundaries, processors, packages and processing job submission
against packages.

The source of truth of package data for the API is the configured filesystem.

The API uses boundaries loaded into a configured postgres database (see below).

Boundaries are sourced by the API from a local PostGIS table.

To load the boundaries in the correct form you can use the helper script:
`tests/data/load_boundaries.py <geojson filepath> <name column> <long_name
column> <wip table true/false>`.
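
For example, a hypothetical GeoJSON of administrative boundaries with `name` and
`name_long` attribute columns could be loaded into the live (non-WIP) table like
this (the file path and column names here are illustrative):

```bash
python tests/data/load_boundaries.py ./data/boundaries.geojson name name_long false
```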

The boundaries table schema is managed by Alembic and can be found under
`api/db/models/boundary.py`.

**NOTE**: API Integration tests require a Db user who has RW access to this
table.

**NOTE**: The configured API database will be wiped during running of the
integration tests and loaded with test-boundaries.

#### PG Schema Management with Alembic

The database schema is managed through Alembic. The following serves as a guide
to basic usage for extending the schema - refer to
https://alembic.sqlalchemy.org/en/latest/ for more information.

##### Schema Updates

- Make changes as required to models
- From within the autopkg/api folder run the following to auto-generate an
upgrade/downgrade script:

```bash
alembic revision --autogenerate -m "Added Boundary Table"
```

**NOTE**: CHECK the script - remove extraneous operations (in particular those
relating to spatial-ref-sys)

- When ready run the following to upgrade the database:

```bash
# Ensure the AUTOPKG_POSTGRES_* env variables are set (see below)
alembic upgrade head
```

Run the API locally:

```bash
uvicorn api.main:app --host 0.0.0.0 --port 8000
```

Running using Docker, for example to run services in the background and
build/run/stop the API container:

```bash
# Run background services
docker compose -f docker-compose.yaml up -d dataproc flower redis db
# Build API container
docker compose -f docker-compose.yaml build api
# Run API container
docker compose -f docker-compose.yaml up api
# Logs printed directly to terminal
# CTRL-C to stop
```

### Documentation
OpenAPI JSON: https://global.infrastructureresilience.org/extract/openapi.json

#### OpenAPI

- Run the app as above
- Navigate to http://`host`:`port`/openapi.json
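
For example, to save a copy of the schema from a locally running instance
(assuming the default host and port used above):

```bash
curl http://localhost:8000/openapi.json -o openapi.json
```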

#### ReDoc

- Run the app as above
- Navigate to http://`host`:`port`/redoc

## Data Processing

Run the Celery worker locally:

```bash
celery --app dataproc.tasks worker
```

Or run using Docker (see `docker-compose.yaml`):

```bash
docker compose up dataproc
```

### Data Storage

Terms:

- `Package` - All data associated with a single boundary
- `Processor` - Code for clipping a particular Dataset and Version
- `Processing Backend` - Processor execution environment. Currently only the
local filesystem processing backend is supported. (Processing interim files
are generated and stored in the local execution environment.)
- `Storage Backend` - Package storage environment. Currently AWS S3 and LocalFS
are supported. Package files are hosted from here, either using NGINX (see
`docker-compose.yaml`) or S3.

#### Package Structure:

<a href="url"><img src="docs/package_structure.png" width="500"></a>

Processors will download and store source datafiles to a configured location on
the local execution environment filesystem, on first execution. (This means
source files _could_ be downloaded multiple times if multiple Celery workers
were deployed across separate filesystems.)

Processors will generate interim files in a configured location on the local
filesystem during processing of a boundary. These files are subsequently moved
to the configured storage backend and deleted from temporary storage on
processor exit.

### Processors

Dataset Core Processors (`dataproc/processors/core`) are executed as Celery
Tasks and are responsible for fetching, cropping and moving the dataset-version
with which they are associated.

Supporting Internal Processors (`dataproc/processors/internal`) generate
boundary and folder structures, and provide logging.

Celery tasks are constructed from API requests and executed against source data.
A processing request can only be executed against a single boundary, but can
include multiple processors to be executed.

The overall task for each request is executed as a Chord, with a nested Group of
tasks for each processor (which can run in parallel):

```python
dag = step_setup | group(processor_task_signatures) | step_finalise
```

The `step_setup` and `step_finalise` tasks are defined in `dataproc.tasks` and
are responsible for setting up the processing environment and cleaning up after
the processing has completed.

The `processor_task_signatures` are generated by the API and are responsible for
executing the processing for each processor.
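
The snippet below is a minimal, self-contained sketch of how such a chord can be
assembled with Celery primitives. The broker URL, task bodies and processor names
are illustrative assumptions, not the actual definitions in `dataproc.tasks`:

```python
from celery import Celery, group

# Illustrative broker/backend; the project configures Redis via AUTOPKG_REDIS_HOST
app = Celery("tasks", broker="redis://localhost:6379/0", backend="redis://localhost:6379/0")

@app.task
def step_setup(boundary_name):
    """Stand-in for the internal processor that prepares the package folders."""
    return boundary_name

@app.task
def run_processor(boundary_name, processor_name):
    """Stand-in for a dataset-version core processor."""
    return f"{processor_name} complete for {boundary_name}"

@app.task
def step_finalise(results):
    """Receives the list of group results once every processor has finished."""
    return results

boundary = "gambia"  # illustrative boundary name
processor_task_signatures = [
    run_processor.si(boundary, name)  # .si() = immutable signature, ignores the parent result
    for name in ["dataset_a.version_1", "dataset_b.version_1"]  # illustrative processors
]

# A chain whose middle element is a group is promoted to a chord by Celery, so
# step_finalise acts as the chord callback over the parallel group's results.
dag = step_setup.si(boundary) | group(processor_task_signatures) | step_finalise.s()
# dag.apply_async()
```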

Duplicate execution of tasks is prevented by using a Redis lock keyed on the
combination of boundary, dataset and version (see `dataproc/tasks.py`).
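
As an illustration only (the real lock lives in `dataproc/tasks.py`, and its key
format and expiry may differ), a lock of this kind can be built on Redis `SET NX`:

```python
import redis

# Illustrative connection details; the project reads AUTOPKG_REDIS_HOST from the environment
r = redis.Redis(host="localhost", port=6379, db=0)

def try_acquire_lock(boundary: str, dataset: str, version: str, ttl_seconds: int = 3600) -> bool:
    """Return True if this worker won the lock for the boundary-dataset-version combination."""
    key = f"lock-{boundary}-{dataset}-{version}"
    # SET NX only succeeds if the key does not already exist; EX guards against
    # locks leaking forever if a worker dies mid-task.
    return bool(r.set(key, "locked", nx=True, ex=ttl_seconds))

def release_lock(boundary: str, dataset: str, version: str) -> None:
    """Free the lock once the processor has finished (or failed)."""
    r.delete(f"lock-{boundary}-{dataset}-{version}")
```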

### Configuration

All config variables are parsed by `config.py` from the execution environment.

```bash
# Celery
AUTOPKG_LOG_LEVEL=DEBUG # API and Dataproc Logging Level
AUTOPKG_INTEGRATION_TEST_ENDPOINT="http://localhost:8000" # API Endpoint used during integration testing (integration testing deployment env)
AUTOPKG_REDIS_HOST="localhost" # Redis Host (API and Worker)
AUTOPKG_PACKAGES_HOST_URL= # Root-URL to the hosting engine for package data.
```

#### Processor Specific Configurations

Some processors require their own environment configuration (e.g. secrets for
source data).

```bash
# AWS OSM / Damages DB
GDAL_CACHEMAX=1024 # This flag limits the amount of memory GDAL uses when cropping
AUTOPKG_CELERY_CONCURRENCY=2 # The number of tasks that can be executed at once. Assume you'll get into the position of executing multiple very large crops / OSM cuts this number of times in parallel. Smaller tasks will be queued behind these larger blocking tasks.
```
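
For example, a worker could be started with the conservative settings above
exported into its environment (the values shown are illustrative):

```bash
export GDAL_CACHEMAX=1024
export AUTOPKG_CELERY_CONCURRENCY=2
celery --app dataproc.tasks worker
```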

Also when running under docker-compose you can change the container resource
limits in `docker-compose.yaml` to suit your execution environment.

**NOTE**: We have not yet extensively tested running on a distributed cluster
(i.e. workers running on separate nodes). In theory this is supported through
Celery and the Redis backend, however the processor data-folder will need to be
provided through some shared persistent storage to avoid pulling source data
multiple times.

### Testing

#### DataProcessors

Integration tests in `tests/dataproc/integration/processors` all run standalone
(without Redis / Celery), but you'll need access to the source data for each
processor (see above).

**NOTE**: Tests for geopkg (test_natural_earth_vector) loading include a load
from shapefile to postgres - the API database is used for this test, and the
configured user requires insert and delete rights on the API database for the
test to succeed.

```bash
# Run tests locally
python -m unittest discover tests/dataproc
docker-compose run test-dataproc
```

#### API & DataProcessing End to End

API and Dataproc tests require access to shared processing and package folders
for assertion of processor outputs.

API tests will add and remove boundary test-data to/from the Db during
execution.

API tests will add and remove package data to/from the configured packages
directory during execution. Temporary processing data for `natural_earth_raster`
will also be generated and removed from the configured processing backend
folder.

Dataproc will add and remove package data to/from the packages source tree
during execution. Processors will also remove data from their configured
temporary processing directories, depending on how they are configured.

Individual processor integration tests require access to source data to run
successfully.

#### Locally

Ensure the Celery worker, Redis, PG and API services are running somewhere
(ideally in an isolated environment, as assets will be generated by the tests)
if you want to run the integration tests successfully.

Ensure you also have `AUTOPKG_LOCALFS_STORAGE_BACKEND_ROOT_TEST` set in the
environment, so both the API and the Celery worker can pick up the same package
source tree.

```bash
export AUTOPKG_DEPLOYMENT_ENV=test
# Run API
uvicorn api.main:app --host 0.0.0.0 --port 8000

# Run Worker
celery --app dataproc.tasks worker --loglevel=debug --concurrency=1

# Run tests in docker
docker-compose run test-api
```

#### Localfs or S3 Backend

Altering deployment env with `AUTOPKG_STORAGE_BACKEND=awss3` or
`AUTOPKG_STORAGE_BACKEND=localfs` will also mean tests run against the
configured backend.

**NOTE**: awss3 integration tests require supplied access keys to have RW
permissions on the configured bucket.

```bash
export AUTOPKG_STORAGE_BACKEND=awss3 && python -m unittest discover tests/dataproc
```


### Extending / New Processor Development

- Create a new folder for your dataset beneath `dataproc/processors/core` (e.g.
`dataproc/processors/core/my_dataset`)
- Add a new Python file for the dataset version within the folder (and a
supporting `__init__.py`), e.g.
`dataproc/processors/core/my_dataset/version_1.py`
- Add a Metadata Class containing the processor-version metadata (which must
sub-class `BaseMetadataABC`), e.g.:

```python
class Metadata(BaseMetadataABC):
    # ... other Metadata attributes ...
    data_formats = ["GeoTIFF"]
```

- Add a Processor Class (which must sub-class BaseProcessorABC so it can be run
by the global Celery Task), which runs the fetching, cropping and moving logic
for your dataset-version. (**NOTE**: Helper methods are already provided for
the majority of tasks - e.g. Storage backend classes are provided for LocalFS
and AWSS3), e.g.:

```python
class Processor(BaseProcessorABC):
"""A Test Processor"""
Expand Down Expand Up @@ -360,10 +431,13 @@ class Processor(BaseProcessorABC):
)
```

- Write tests against the new Processor (see: `tests/dataproc/integration` for
examples)
- Rebuild image and deploy: The API will expose any valid processor-folder
placed under the `dataproc/core` folder.
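
For example, after adding a hypothetical
`dataproc/processors/core/my_dataset/version_1.py`, the affected services could
be rebuilt and restarted using the compose service names shown earlier in this
README:

```bash
docker compose -f docker-compose.yaml build api dataproc
docker compose -f docker-compose.yaml up -d api dataproc
```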

## Acknowledgments

This research received funding from the FCDO Climate Compatible Growth
Programme. The views expressed here do not necessarily reflect the UK
government's official policies.
