kube-airflow


kube-airflow provides a set of tools to run Airflow in a Kubernetes cluster. This is useful when you want:

  • Easy high availability of the Airflow scheduler
  • Easy parallelism of task executions
    • The common way to scale out workers in Airflow is to use Celery. However, managing an H/A backend database and Celery workers just to parallelise task executions is a hassle. This is where Kubernetes comes into play again: if you already have a K8S cluster, just let K8S manage them for you.
    • If you have ever considered avoiding Celery for task parallelism, K8S can still help you for a while. Keep using LocalExecutor instead of CeleryExecutor and delegate the actual work to Kubernetes by calling e.g. kubectl run --restart=Never ... from your tasks. This works until the concurrent kubectl run executions (bounded by the concurrency implied by the scheduler's max_threads and LocalExecutor's parallelism; see this SO question for gotchas) consume all the resources a single airflow-scheduler pod provides, which should take a fairly long time.
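As a rough illustration of the delegation described in the last bullet, the sketch below composes the kind of kubectl run call a LocalExecutor task could make. The helper name `task_pod_cmd`, the pod name, and the busybox image are assumptions for illustration only; the helper just prints the command so it can be inspected.

```shell
# Print the kubectl command a task could run to hand its work to a one-off pod.
# The pod name, image, and command passed in are illustrative values.
task_pod_cmd() {
  name="$1"; image="$2"; shift 2
  # --restart=Never asks for a pod that runs once and is not restarted.
  echo "kubectl run ${name} --image=${image} --restart=Never -- $*"
}

task_pod_cmd nightly-report busybox echo hello
```

In a real task you would probably also want the call to block until the pod finishes (kubectl run has an `--attach` flag that may help here), so that Airflow sees the task's actual outcome.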

This repository contains:

  • Dockerfile(.template) of airflow for Docker images published to the public Docker Hub Registry.
  • airflow.all.yaml for creating Kubernetes services and deployments to run Airflow on Kubernetes

Information

Create cluster

    gcloud container clusters create airflow-cluster --enable-autorepair --machine-type=n1-standard-2 --num-nodes=1

Create a Google IAM service account with the following roles:

  • BigQuery Data Owner
  • BigQuery Job User
  • Storage Object Admin

Download the JSON key file and save it at #{AIRFLOW_JSON_PATH}.
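The service-account steps above could be scripted roughly as follows. This is a sketch under assumptions: the helper name `sa_setup_cmds`, the account name `airflow`, and the key path are made up for illustration, and the role IDs are the standard identifiers for the three roles listed. The helper only prints the gcloud commands so they can be reviewed before anything is executed.

```shell
# Print (not run) the gcloud commands for the service-account setup.
# Arguments (all illustrative): project ID, account name, key file path.
sa_setup_cmds() {
  project="$1"; account="$2"; key_path="$3"
  email="${account}@${project}.iam.gserviceaccount.com"

  echo "gcloud iam service-accounts create ${account} --project=${project}"
  for role in roles/bigquery.dataOwner roles/bigquery.jobUser roles/storage.objectAdmin; do
    echo "gcloud projects add-iam-policy-binding ${project} --member=serviceAccount:${email} --role=${role}"
  done
  echo "gcloud iam service-accounts keys create ${key_path} --iam-account=${email}"
}

sa_setup_cmds my-project airflow ./airflow-key.json
```

Once the printed commands look right, they can be piped to `sh`, and the resulting key file path is what GCP_JSON_PATH should point to in the build step.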

Create a persistent disk named postgres-data with a size of 10 GB.

The persistent disk will be used by the PostgreSQL database.
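One possible way to create that disk from the command line is sketched below. The helper name `disk_create_cmd` is hypothetical, and the zone is an assumption: a Compute Engine disk has to live in the same zone as the node that mounts it, so use the zone of your cluster. The helper prints the command rather than running it.

```shell
# Print the gcloud command that would create the 10 GB postgres-data disk.
# $1: the zone of the cluster nodes (assumed; pass your own).
disk_create_cmd() {
  zone="$1"
  echo "gcloud compute disks create postgres-data --size=10GB --zone=${zone}"
}

disk_create_cmd us-central1-a
```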

Build

Clone this repository with git and then run:

    export PROJECT_ID=xxxxx
    # Set the version of airflow dags to replace KUBE_AIRFLOW_VERSION in Makefile
    cd ../ && export VERSION="$(TZ=Asia/Tokyo date +%Y%m%dt%H%M%S)-$(git rev-parse --short HEAD)" && echo $VERSION && cd -
    GCP_JSON_PATH=#{AIRFLOW_JSON_PATH} make apply
    
    # or, to publish and roll out in separate steps:
    GCP_JSON_PATH=#{AIRFLOW_JSON_PATH} make publish
    make rolling-update

The apply task depends on the publish task, which in turn depends on the build task.
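The VERSION value exported in the build commands has the shape `<YYYYmmdd>t<HHMMSS>-<short commit hash>`, with the timestamp taken in Asia/Tokyo time. The sketch below rebuilds that string and checks its shape; the helper names `make_version` and `is_valid_version` and the sample hash are assumptions for illustration and are not part of the Makefile.

```shell
# Build a version tag in the same format as the export in the build commands.
# $1: a short commit hash, e.g. the output of `git rev-parse --short HEAD`.
make_version() {
  short_hash="$1"
  printf '%s-%s\n' "$(TZ=Asia/Tokyo date +%Y%m%dt%H%M%S)" "$short_hash"
}

# Succeed only if the tag has the expected shape.
is_valid_version() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{8}t[0-9]{6}-[0-9a-f]+$'
}

v="$(make_version abc1234)"
if is_valid_version "$v"; then
  echo "version tag: $v"
fi
```

A check like this can catch an empty or malformed VERSION before an image is built and tagged with it.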

Publish to GCP

Create all the deployments and services for Airflow:

    make publish

Usage

Create all the deployments and services to run Airflow on Kubernetes:

    vim airflow.all.yaml
    make create  # first deployment
    make deploy  # update

    make list-services
    make list-pods
    pod_name="web-2874099158-lxgm2" make pod-login

It will create deployments for:

  • postgres
  • rabbitmq
  • airflow-webserver
  • airflow-scheduler
  • airflow-flower
  • airflow-worker

and services for:

  • postgres
  • rabbitmq
  • airflow-webserver
  • airflow-flower

Log in to a pod:

    pod_name="scheduler-1413753147-fd3q7" make login-pod
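Pod names carry generated suffixes, so they change with every rollout. A small filter like the one below can pick the current name out of `kubectl get pods -o name` output instead of copying it by hand. The helper name `pick_pod` is hypothetical, and the sketch assumes pod names start with the deployment name followed by a hyphen, as in the examples in this README.

```shell
# Read `kubectl get pods -o name` output (lines like "pod/<name>") on stdin
# and print the first pod whose name starts with "<prefix>-".
pick_pod() {
  sed -n "s|^pod/\($1-.*\)\$|\1|p" | head -n 1
}

# Demonstration with canned input instead of a live cluster:
printf 'pod/web-2874099158-lxgm2\npod/scheduler-1413753147-fd3q7\n' | pick_pod scheduler
```

Against a live cluster this could be combined with the make target, for example `pod_name="$(kubectl get pods --namespace airflow-dev -o name | pick_pod scheduler)" make login-pod`, assuming the pods run in the airflow-dev namespace.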

You can browse the Airflow dashboard by running:

    make browse-web

and the Flower dashboard by running:

    make browse-flower

If you want to use Ad Hoc Query, make sure you have configured the connection: go to Admin -> Connections, edit "mysql_default", and set these values (equivalent to the values in config/airflow.cfg):

  • Host : mysql
  • Schema : airflow
  • Login : airflow
  • Password : airflow

See the Airflow documentation for details.

Run the test "tutorial"

    kubectl exec web-<id> --namespace airflow-dev -- airflow backfill tutorial -s 2015-05-01 -e 2015-06-01

Scale the number of workers

For now, update the value of the replicas field of the deployment you want to scale and then run:

    make apply
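Before editing, it can help to see every replicas value in the manifest along with its line number. The sketch below assumes the deployments are defined in airflow.all.yaml in the current directory; the helper name `show_replicas` is hypothetical.

```shell
# List every "replicas:" line in a manifest, prefixed with its line number,
# so the right deployment's value can be located before editing.
show_replicas() {
  grep -n '^ *replicas:' "$1"
}

# Only run against the manifest if it is present in the current directory.
if [ -f airflow.all.yaml ]; then
  show_replicas airflow.all.yaml
fi
```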

Connect to the cluster

    gcloud container clusters get-credentials airflow-cluster --zone us-central1-a --project #{GCP_PROJECT_ID}

    kubectl proxy