Merge pull request #164 from opentargets/tskir-3102-airflow-set-up
 [Preprocess #2] Instructions for setting up Airflow
tskir authored Oct 24, 2023
2 parents fe74705 + e5e3201 commit 99a662a
Showing 10 changed files with 108 additions and 287 deletions.
3 changes: 0 additions & 3 deletions docs/contributing/_contributing.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/development/_development.md
@@ -0,0 +1 @@
This section contains various technical information on how to develop and run the code.
98 changes: 98 additions & 0 deletions docs/development/airflow.md
@@ -0,0 +1,98 @@
# Running Airflow workflows

Airflow code is located in `src/airflow`. Make sure to execute all of the instructions from that directory, unless stated otherwise.

## Set up Docker

We will be running a local Airflow setup using Docker Compose. First, make sure it is installed (this and subsequent commands are tested on Ubuntu):

```bash
sudo apt install docker-compose
```
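
You can confirm the installation with a quick version check:

```bash
# Print the installed Docker Compose version.
docker-compose --version
```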

Next, verify that you can run Docker. This should say "Hello from Docker":

```bash
docker run hello-world
```

If the command above fails with a permission error, add your user to the `docker` group and refresh your group membership:

```bash
sudo usermod -a -G docker $USER
newgrp docker
```

## Set up Airflow

This section is adapted from the instructions at https://airflow.apache.org/docs/apache-airflow/stable/tutorial/pipeline.html. When running the commands below, make sure your current working directory is `src/airflow`.

```bash
# Download the latest docker-compose.yaml file.
curl -sLfO https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml

# Make expected directories.
mkdir -p ./config ./dags ./logs ./plugins

# Construct the modified Docker image with additional PIP dependencies.
docker build . --tag opentargets-airflow:2.7.1

# Set environment variables.
cat << EOF > .env
AIRFLOW_UID=$(id -u)
AIRFLOW_IMAGE_NAME=opentargets-airflow:2.7.1
EOF
```
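
Before moving on, you can sanity-check this step. The commands below simply list the freshly built image and print the generated `.env` file; they assume the build above completed without errors:

```bash
# Confirm the custom image was built and tagged as expected.
docker images opentargets-airflow

# Confirm the .env file contains the UID and image name that Docker Compose will pick up.
cat .env
```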

Now modify `docker-compose.yaml` and add the following to the `x-airflow-common` → `environment` section:

```yaml
GOOGLE_APPLICATION_CREDENTIALS: '/opt/airflow/config/application_default_credentials.json'
AIRFLOW__CELERY__WORKER_CONCURRENCY: 32
AIRFLOW__CORE__PARALLELISM: 32
AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG: 32
AIRFLOW__SCHEDULER__MAX_TIS_PER_QUERY: 16
AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG: 1
```
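
To check that the edit is picked up, you can render the resolved Compose configuration and look for one of the added variables. This is just an optional verification step:

```bash
# Render the fully resolved compose configuration and check one of the new settings.
docker-compose config | grep AIRFLOW__CORE__PARALLELISM
```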

## Start Airflow

```bash
docker-compose up
```

The Airflow UI will now be available at http://localhost:8080/home. The default username and password are both `airflow`.
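
If you prefer to keep your terminal free, you can run the stack in detached mode and check its health from the command line. The `/health` endpoint and the service layout below come from the stock Airflow Docker Compose setup, so adjust if your file differs:

```bash
# Start the stack in the background instead of the foreground.
docker-compose up -d

# List the containers and their health status.
docker-compose ps

# The webserver exposes a health endpoint reporting metadatabase and scheduler status.
curl -s http://localhost:8080/health
```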

## Configure Google Cloud access

To access Google Cloud and work with Dataproc, Airflow needs to be configured. First, obtain Google default application credentials by running this command and following the instructions:

```bash
gcloud auth application-default login
```

Next, copy the credentials file into the `config/` subdirectory that we created above:

```bash
cp ~/.config/gcloud/application_default_credentials.json config/
```
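
Optionally, you can verify that the credentials work and that the file is visible inside the containers. The service name `airflow-worker` is assumed from the stock `docker-compose.yaml`; adjust it if yours differs:

```bash
# Check that application default credentials can produce a token locally.
gcloud auth application-default print-access-token > /dev/null && echo "ADC OK"

# Check that the file is mounted into the container under /opt/airflow/config.
docker-compose exec airflow-worker ls -l /opt/airflow/config/
```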

Now open the Airflow UI and:

* Navigate to Admin → Connections.
* Click on "Add new record".
* Set "Connection type" to `Google Cloud``.
* Set "Connection ID" to `google_cloud_default`.
* Set "Credential Configuration File" to `/opt/airflow/config/application_default_credentials.json`.
* Click on "Save".

## Run a workflow

Workflows, which must be placed under the `dags/` directory, will appear in the "DAGs" section of the UI, which is also the main page. They can be triggered manually by opening a workflow and clicking on the "Play" button in the upper right corner.

To restart a failed task, click on it and then click on "Clear task".
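
Workflows can also be listed and triggered from the command line, which is convenient for scripting. The DAG ID below (`my_example_dag`) is just a placeholder; substitute the ID shown by `airflow dags list`:

```bash
# List the DAGs that Airflow has picked up from dags/.
docker-compose exec airflow-scheduler airflow dags list

# Trigger a single run of a DAG by its ID (placeholder shown here).
docker-compose exec airflow-scheduler airflow dags trigger my_example_dag
```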

## Troubleshooting

Note that when you add a new workflow under `dags/`, Airflow will not pick it up immediately. By default, the filesystem is only scanned for new DAGs every 300 seconds. However, once a DAG has been picked up, subsequent updates to it are applied nearly instantaneously.

Also, editing a DAG while an instance of it is running may cause problems with that run, as Airflow will try to update the tasks and their properties in the DAG according to the file changes.
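
A couple of commands can help when a DAG does not show up or misbehaves; the scan-interval variable mentioned in the comment goes into the same `environment` section edited earlier, and lowering it is optional:

```bash
# Show parsing errors for DAG files that Airflow failed to import.
docker-compose exec airflow-scheduler airflow dags list-import-errors

# Optionally, scan dags/ more frequently (default is 300 seconds) by adding this
# to the x-airflow-common environment section in docker-compose.yaml:
#   AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: 60
```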
@@ -1,5 +1,5 @@
---
title: Guidelines
title: Contributing guidelines
---

# Contributing guidelines
@@ -51,7 +51,7 @@ When making changes, and especially when implementing a new module or feature, i
- [ ] Run `make check`. This will run the linter and formatter to ensure that the code is compliant with the project conventions.
- [ ] Develop unit tests for your code and run `make test`. This will run all unit tests in the repository, including the examples appended in the docstrings of some methods.
- [ ] Update the configuration if necessary.
- [ ] Update the documentation and check it with `run build-documentation`. This will start a local server to browse it (URL will be printed, usually `http://127.0.0.1:8000/`)
- [ ] Update the documentation and check it with `make build-documentation`. This will start a local server to browse it (URL will be printed, usually `http://127.0.0.1:8000/`)

For more details on each of these steps, see the sections below.
### Documentation
File renamed without changes.
2 changes: 1 addition & 1 deletion docs/index.md
@@ -30,4 +30,4 @@ Ingestion and analysis of genetic and functional genomic data for the identifica

This project is still in experimental phase. Please refer to the [roadmap section](roadmap.md) for more information.

For information on how to contribute to the project see the [contributing section](./contributing/_contributing.md).
For all development information, including running the code, troubleshooting, or contributing, see the [development section](./development/).
2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -3,7 +3,7 @@ site_name: Open Targets Genetics
nav:
- installation.md
- usage.md
- ... | contributing/**
- ... | development/**
- ... | python_api/**

plugins:
2 changes: 0 additions & 2 deletions src/airflow/.env

This file was deleted.

32 changes: 5 additions & 27 deletions src/airflow/Dockerfile
@@ -1,29 +1,7 @@
FROM apache/airflow:2.7.1-python3.8
FROM apache/airflow:2.7.1-python3.10

# Install additional Python requirements.
# --no-cache-dir is a good practice when installing packages using pip, because it helps to keep the image lightweight.
COPY requirements.txt /requirements.txt
RUN pip install --user --upgrade pip
RUN pip install --no-cache-dir --user -r /requirements.txt # --no-cache-dir good practise when installing packages using pip. It helps to keep the image lightweight


# Source: https://airflow.apache.org/docs/docker-stack/recipes.html
# Installing the GCP CLI in the container
SHELL ["/bin/bash", "-o", "pipefail", "-e", "-u", "-x", "-c"]

USER 0
ARG CLOUD_SDK_VERSION=322.0.0
ENV GCLOUD_HOME=/home/google-cloud-sdk

ENV PATH="${GCLOUD_HOME}/bin/:${PATH}"

RUN DOWNLOAD_URL="https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-${CLOUD_SDK_VERSION}-linux-x86_64.tar.gz" \
&& TMP_DIR="$(mktemp -d)" \
&& curl -fL "${DOWNLOAD_URL}" --output "${TMP_DIR}/google-cloud-sdk.tar.gz" \
&& mkdir -p "${GCLOUD_HOME}" \
&& tar xzf "${TMP_DIR}/google-cloud-sdk.tar.gz" -C "${GCLOUD_HOME}" --strip-components=1 \
&& "${GCLOUD_HOME}/install.sh" \
--bash-completion=false \
--path-update=false \
--usage-reporting=false \
--quiet \
&& rm -rf "${TMP_DIR}" \
&& gcloud --version
RUN pip install --no-cache-dir --user --quiet --upgrade pip
RUN pip install --no-cache-dir --user --quiet --requirement /requirements.txt
