diff --git a/docs/contributing/_contributing.md b/docs/contributing/_contributing.md deleted file mode 100644 index 188cafbe5..000000000 --- a/docs/contributing/_contributing.md +++ /dev/null @@ -1,3 +0,0 @@ -# Contributing - -TBC diff --git a/docs/development/_development.md b/docs/development/_development.md new file mode 100644 index 000000000..be1e3426c --- /dev/null +++ b/docs/development/_development.md @@ -0,0 +1 @@ +This section contains various technical information on how to develop and run the code. diff --git a/docs/development/airflow.md b/docs/development/airflow.md new file mode 100644 index 000000000..efd276613 --- /dev/null +++ b/docs/development/airflow.md @@ -0,0 +1,98 @@ +# Running Airflow workflows + +Airflow code is located in `src/airflow`. Unless stated otherwise, execute all of the instructions below from that directory. + +## Set up Docker + +We will be running a local Airflow setup using Docker Compose. First, make sure Docker Compose is installed (this and the subsequent commands have been tested on Ubuntu): + +```bash +sudo apt install docker-compose +``` + +Next, verify that you can run Docker. This command should print "Hello from Docker": + +```bash +docker run hello-world +``` + +If the command above raises a permission error, add your user to the `docker` group and refresh your group membership: + +```bash +sudo usermod -a -G docker $USER +newgrp docker +``` + +## Set up Airflow + +This section is adapted from the official instructions at https://airflow.apache.org/docs/apache-airflow/stable/tutorial/pipeline.html. When you run the commands, make sure your current working directory is `src/airflow`. + +```bash +# Download the latest docker-compose.yaml file. +curl -sLfO https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml + +# Make expected directories. +mkdir -p ./config ./dags ./logs ./plugins + +# Build the modified Docker image with additional PIP dependencies. +docker build . --tag opentargets-airflow:2.7.1 + +# Set environment variables. +cat << EOF > .env +AIRFLOW_UID=$(id -u) +AIRFLOW_IMAGE_NAME=opentargets-airflow:2.7.1 +EOF +``` + +Now modify `docker-compose.yaml` and add the following to the `x-airflow-common` → `environment` section: +```yaml +GOOGLE_APPLICATION_CREDENTIALS: '/opt/airflow/config/application_default_credentials.json' +AIRFLOW__CELERY__WORKER_CONCURRENCY: 32 +AIRFLOW__CORE__PARALLELISM: 32 +AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG: 32 +AIRFLOW__SCHEDULER__MAX_TIS_PER_QUERY: 16 +AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG: 1 +``` + +## Start Airflow + +```bash +docker-compose up +``` + +The Airflow UI will now be available at http://localhost:8080/home. The default username and password are both `airflow`. + +## Configure Google Cloud access + +In order to access Google Cloud and run jobs on Dataproc, Airflow needs to be configured with credentials. First, obtain Google application default credentials by running this command and following the instructions: + +```bash +gcloud auth application-default login +``` + +Next, copy the credentials file into the `config/` subdirectory which we created above: + +```bash +cp ~/.config/gcloud/application_default_credentials.json config/ +``` + +Now open the Airflow UI and: + +* Navigate to Admin → Connections. +* Click on "Add new record". +* Set "Connection type" to `Google Cloud`. +* Set "Connection ID" to `google_cloud_default`. +* Set "Credential Configuration File" to `/opt/airflow/config/application_default_credentials.json`. +* Click on "Save".
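+Optionally, instead of clicking through the UI, the same connection can be declared as an environment variable in the `x-airflow-common` → `environment` section of `docker-compose.yaml`. This is only a sketch of the mechanism used by the previous Docker Compose setup and has not been tested with the configuration above; note that connections defined this way do not show up in the Admin → Connections list:
+
+```yaml
+# Sketch: define the default Google Cloud connection as an Airflow connection URI.
+# The key path points to the credentials file as it is mounted inside the container.
+AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: 'google-cloud-platform://?extra__google_cloud_platform__key_path=/opt/airflow/config/application_default_credentials.json'
+```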
+ +## Run a workflow + +Workflows, which must be placed under the `dags/` directory, will appear in the "DAGs" section of the UI, which is also the main page. They can be triggered manually by opening a workflow and clicking on the "Play" button in the upper right corner. + +To restart a failed task, click on it and then click on "Clear task". + +## Troubleshooting + +Note that when you add a new workflow under `dags/`, Airflow will not pick it up immediately. By default, the filesystem is only scanned for new DAGs every 300 seconds. However, once a DAG has been added, subsequent updates to it are applied nearly instantaneously. + +Also, if you edit a DAG while an instance of it is running, this might cause problems with the run, as Airflow will try to update the tasks and their properties in the DAG according to the file changes. diff --git a/docs/contributing/guidelines.md b/docs/development/contributing.md similarity index 96% rename from docs/contributing/guidelines.md rename to docs/development/contributing.md index 85a61de57..bdd0e3f54 100644 --- a/docs/contributing/guidelines.md +++ b/docs/development/contributing.md @@ -1,5 +1,5 @@ --- -title: Guidelines +title: Contributing guidelines --- # Contributing guidelines @@ -51,7 +51,7 @@ When making changes, and especially when implementing a new module or feature, i - [ ] Run `make check`. This will run the linter and formatter to ensure that the code is compliant with the project conventions. - [ ] Develop unit tests for your code and run `make test`. This will run all unit tests in the repository, including the examples appended in the docstrings of some methods. - [ ] Update the configuration if necessary. -- [ ] Update the documentation and check it with `run build-documentation`. This will start a local server to browse it (URL will be printed, usually `http://127.0.0.1:8000/`) +- [ ] Update the documentation and check it with `make build-documentation`. This will start a local server to browse it (URL will be printed, usually `http://127.0.0.1:8000/`) For more details on each of these steps, see the sections below. ### Documentation diff --git a/docs/contributing/troubleshooting.md b/docs/development/troubleshooting.md similarity index 100% rename from docs/contributing/troubleshooting.md rename to docs/development/troubleshooting.md diff --git a/docs/index.md b/docs/index.md index 75fa8fd5b..8de285d2f 100644 --- a/docs/index.md +++ b/docs/index.md @@ -30,4 +30,4 @@ Ingestion and analysis of genetic and functional genomic data for the identifica This project is still in experimental phase. Please refer to the [roadmap section](roadmap.md) for more information. -For information on how to contribute to the project see the [contributing section](./contributing/_contributing.md). +For all development information, including running the code, troubleshooting, or contributing, see the [development section](./development/). diff --git a/mkdocs.yml b/mkdocs.yml index 3d043e45f..28ecc992e 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -3,7 +3,7 @@ site_name: Open Targets Genetics nav: - installation.md - usage.md - - ... | contributing/** + - ... | development/** - ...
| python_api/** plugins: diff --git a/src/airflow/.env b/src/airflow/.env deleted file mode 100644 index 2b6ade116..000000000 --- a/src/airflow/.env +++ /dev/null @@ -1,2 +0,0 @@ -AIRFLOW_UID=1126896676 -AIRFLOW_IMAGE_NAME=extending_airflow:latest diff --git a/src/airflow/Dockerfile b/src/airflow/Dockerfile index 2962e4a98..aaed92a5b 100644 --- a/src/airflow/Dockerfile +++ b/src/airflow/Dockerfile @@ -1,29 +1,7 @@ -FROM apache/airflow:2.7.1-python3.8 +FROM apache/airflow:2.7.1-python3.10 +# Install additional Python requirements. +# --no-cache-dir is a good practice when installing packages using pip, because it helps to keep the image lightweight. COPY requirements.txt /requirements.txt -RUN pip install --user --upgrade pip -RUN pip install --no-cache-dir --user -r /requirements.txt # --no-cache-dir good practise when installing packages using pip. It helps to keep the image lightweight - - -# Source: https://airflow.apache.org/docs/docker-stack/recipes.html -# Installing the GCP CLI in the container -SHELL ["/bin/bash", "-o", "pipefail", "-e", "-u", "-x", "-c"] - -USER 0 -ARG CLOUD_SDK_VERSION=322.0.0 -ENV GCLOUD_HOME=/home/google-cloud-sdk - -ENV PATH="${GCLOUD_HOME}/bin/:${PATH}" - -RUN DOWNLOAD_URL="https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-${CLOUD_SDK_VERSION}-linux-x86_64.tar.gz" \ - && TMP_DIR="$(mktemp -d)" \ - && curl -fL "${DOWNLOAD_URL}" --output "${TMP_DIR}/google-cloud-sdk.tar.gz" \ - && mkdir -p "${GCLOUD_HOME}" \ - && tar xzf "${TMP_DIR}/google-cloud-sdk.tar.gz" -C "${GCLOUD_HOME}" --strip-components=1 \ - && "${GCLOUD_HOME}/install.sh" \ - --bash-completion=false \ - --path-update=false \ - --usage-reporting=false \ - --quiet \ - && rm -rf "${TMP_DIR}" \ - && gcloud --version +RUN pip install --no-cache-dir --user --quiet --upgrade pip +RUN pip install --no-cache-dir --user --quiet --requirement /requirements.txt diff --git a/src/airflow/docker-compose.yaml b/src/airflow/docker-compose.yaml deleted file mode 100644 index 32c50f23d..000000000 --- a/src/airflow/docker-compose.yaml +++ /dev/null @@ -1,251 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. -# - -# Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL. -# -# WARNING: This configuration is for local development. Do not use it in a production deployment. -# -# This configuration supports basic configuration using environment variables or an .env file -# The following variables are supported: -# -# AIRFLOW_IMAGE_NAME - Docker image name used to run Airflow. -# Default: apache/airflow:2.7.1 -# AIRFLOW_UID - User ID in Airflow containers -# Default: 50000 -# AIRFLOW_PROJ_DIR - Base path to which all the files will be volumed. -# Default: . 
-# Those configurations are useful mostly in case of standalone testing/running Airflow in test/try-out mode -# -# _AIRFLOW_WWW_USER_USERNAME - Username for the administrator account (if requested). -# Default: airflow -# _AIRFLOW_WWW_USER_PASSWORD - Password for the administrator account (if requested). -# Default: airflow -# _PIP_ADDITIONAL_REQUIREMENTS - Additional PIP requirements to add when starting all containers. -# Use this option ONLY for quick checks. Installing requirements at container -# startup is done EVERY TIME the service is started. -# A better way is to build a custom image or extend the official image -# as described in https://airflow.apache.org/docs/docker-stack/build.html. -# Default: '' -# -# Feel free to modify this file to suit your needs. ---- -version: '3.8' -x-airflow-common: - &airflow-common - # In order to add custom dependencies or upgrade provider packages you can use your extended image. - # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml - # and uncomment the "build" line below, Then run `docker-compose build` to build the images. - image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.7.1} - # build: . - environment: - &airflow-common-env - AIRFLOW__CORE__EXECUTOR: LocalExecutor - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow - # For backward compatibility, with Airflow <2.3 - AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow - AIRFLOW__CORE__FERNET_KEY: '' - AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true' - AIRFLOW__CORE__LOAD_EXAMPLES: 'false' - AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session' - # yamllint disable rule:line-length - # Use simple http server on scheduler for health checks - # See https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/check-health.html#scheduler-health-check-server - # yamllint enable rule:line-length - AIRFLOW__SCHEDULER__ENABLE_HEALTH_CHECK: 'true' - # WARNING: Use _PIP_ADDITIONAL_REQUIREMENTS option ONLY for a quick checks - # for other purpose (development, test and especially production usage) build/extend Airflow image. 
- _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-} - # GCLOUD Authentication - GOOGLE_APPLICATION_CREDENTIALS: /.google/credentials/google_credentials.json - AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: 'google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json' - # Nice to have, Not necessary - GCP_PROJECT_ID: 'open-targets-genetics-dev' - GCP_GCS_BUCKET: 'gs://genetics_etl_python_playground/' - - volumes: - - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags - - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs - - ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config - - ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins - # GCLOUD Authentication - - ~/.google/credentials/:/.google/credentials:ro - user: "${AIRFLOW_UID:-50000}:0" - depends_on: - &airflow-common-depends-on - redis: - condition: service_healthy - postgres: - condition: service_healthy - - - -services: - postgres: - image: postgres:13 - environment: - POSTGRES_USER: airflow - POSTGRES_PASSWORD: airflow - POSTGRES_DB: airflow - volumes: - - postgres-db-volume:/var/lib/postgresql/data - healthcheck: - test: ["CMD", "pg_isready", "-U", "airflow"] - interval: 10s - retries: 5 - start_period: 5s - restart: always - - redis: - image: redis:latest - expose: - - 6379 - healthcheck: - test: ["CMD", "redis-cli", "ping"] - interval: 10s - timeout: 30s - retries: 50 - start_period: 30s - restart: always - - airflow-scheduler: - <<: *airflow-common - command: scheduler - healthcheck: - test: ["CMD", "curl", "--fail", "http://localhost:8974/health"] - interval: 30s - timeout: 10s - retries: 5 - start_period: 30s - restart: always - depends_on: - <<: *airflow-common-depends-on - airflow-init: - condition: service_completed_successfully - - airflow-webserver: - <<: *airflow-common - command: webserver - ports: - - "8080:8080" - healthcheck: - test: ["CMD", "curl", "--fail", "http://localhost:8080/health"] - interval: 30s - timeout: 10s - retries: 5 - start_period: 30s - restart: always - depends_on: - <<: *airflow-common-depends-on - airflow-init: - condition: service_completed_successfully - - airflow-triggerer: - <<: *airflow-common - command: triggerer - healthcheck: - test: ["CMD-SHELL", 'airflow jobs check --job-type TriggererJob --hostname "$${HOSTNAME}"'] - interval: 30s - timeout: 10s - retries: 5 - start_period: 30s - restart: always - depends_on: - <<: *airflow-common-depends-on - airflow-init: - condition: service_completed_successfully - - airflow-init: - <<: *airflow-common - entrypoint: /bin/bash - # yamllint disable rule:line-length - command: - - -c - - | - function ver() { - printf "%04d%04d%04d%04d" $${1//./ } - } - airflow_version=$$(AIRFLOW__LOGGING__LOGGING_LEVEL=INFO && gosu airflow airflow version) - airflow_version_comparable=$$(ver $${airflow_version}) - min_airflow_version=2.2.0 - min_airflow_version_comparable=$$(ver $${min_airflow_version}) - if (( airflow_version_comparable < min_airflow_version_comparable )); then - echo - echo -e "\033[1;31mERROR!!!: Too old Airflow version $${airflow_version}!\e[0m" - echo "The minimum Airflow version supported: $${min_airflow_version}. Only use this or higher!" - echo - exit 1 - fi - if [[ -z "${AIRFLOW_UID}" ]]; then - echo - echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m" - echo "If you are on Linux, you SHOULD follow the instructions below to set " - echo "AIRFLOW_UID environment variable, otherwise files will be owned by root." 
- echo "For other operating systems you can get rid of the warning with manually created .env file:" - echo " See: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#setting-the-right-airflow-user" - echo - fi - one_meg=1048576 - mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg)) - cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat) - disk_available=$$(df / | tail -1 | awk '{print $$4}') - warning_resources="false" - if (( mem_available < 4000 )) ; then - echo - echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m" - echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))" - echo - warning_resources="true" - fi - if (( cpus_available < 2 )); then - echo - echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m" - echo "At least 2 CPUs recommended. You have $${cpus_available}" - echo - warning_resources="true" - fi - if (( disk_available < one_meg * 10 )); then - echo - echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m" - echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))" - echo - warning_resources="true" - fi - if [[ $${warning_resources} == "true" ]]; then - echo - echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m" - echo "Please follow the instructions to increase amount of resources available:" - echo " https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#before-you-begin" - echo - fi - mkdir -p /sources/logs /sources/dags /sources/plugins - chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins} - exec /entrypoint airflow version - # yamllint enable rule:line-length - environment: - <<: *airflow-common-env - _AIRFLOW_DB_MIGRATE: 'true' - _AIRFLOW_WWW_USER_CREATE: 'true' - _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow} - _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow} - _PIP_ADDITIONAL_REQUIREMENTS: '' - user: "0:0" - volumes: - - ${AIRFLOW_PROJ_DIR:-.}:/sources - -volumes: - postgres-db-volume: