From 1d42476f9f3e04e20c47e11224fd76cdb338881c Mon Sep 17 00:00:00 2001
From: Felix Hennig
Date: Wed, 11 Sep 2024 14:55:14 +0200
Subject: [PATCH] Add descriptions

---
 .../pages/getting_started/first_steps.adoc | 1 +
 .../airflow/pages/getting_started/index.adoc | 4 +++-
 .../pages/getting_started/installation.adoc | 1 +
 docs/modules/airflow/pages/index.adoc | 4 ++--
 .../pages/required-external-components.adoc | 4 +++-
 .../applying-custom-resources.adoc | 24 +++++++++++++------
 .../airflow/pages/usage-guide/index.adoc | 3 +++
 .../pages/usage-guide/listenerclass.adoc | 7 ++++--
 .../airflow/pages/usage-guide/logging.adoc | 1 +
 .../airflow/pages/usage-guide/monitoring.adoc | 5 ++--
 .../pages/usage-guide/mounting-dags.adoc | 20 ++++++++++++----
 .../airflow/pages/usage-guide/overrides.adoc | 6 ++---
 .../airflow/pages/usage-guide/security.adoc | 7 ++++--
 .../pages/usage-guide/storage-resources.adoc | 1 +
 .../using-kubernetes-executors.adoc | 1 +
 15 files changed, 64 insertions(+), 25 deletions(-)

diff --git a/docs/modules/airflow/pages/getting_started/first_steps.adoc b/docs/modules/airflow/pages/getting_started/first_steps.adoc
index 4633cac1..fd37d1f7 100644
--- a/docs/modules/airflow/pages/getting_started/first_steps.adoc
+++ b/docs/modules/airflow/pages/getting_started/first_steps.adoc
@@ -1,4 +1,5 @@
 = First steps
+:description: Set up an Apache Airflow cluster using the Stackable Operator, PostgreSQL, and Redis. Run and monitor example workflows (DAGs) via the web UI or command line.
 Once you have followed the steps in the xref:getting_started/installation.adoc[] section to install the Operator and its dependencies, you will now deploy a Airflow cluster and its dependencies.
 Afterwards you can <<_verify_that_it_works, verify that it works>> by running and tracking an example DAG.
diff --git a/docs/modules/airflow/pages/getting_started/index.adoc b/docs/modules/airflow/pages/getting_started/index.adoc
index f1eb6250..2da2ed71 100644
--- a/docs/modules/airflow/pages/getting_started/index.adoc
+++ b/docs/modules/airflow/pages/getting_started/index.adoc
@@ -1,6 +1,8 @@
 = Getting started
+:description: Get started with the Stackable Operator for Apache Airflow by installing the operator, SQL database, and Redis, then setting up and running your first DAG.
-This guide will get you started with Airflow using the Stackable Operator. It will guide you through the installation of the Operator as well as an SQL database and Redis instance for trial usage, setting up your first Airflow cluster and connecting to it, and viewing and running one of the example workflows (called DAGs = Direct Acyclic Graphs).
+This guide will get you started with Airflow using the Stackable Operator.
+It will guide you through the installation of the Operator as well as an SQL database and Redis instance for trial usage, setting up your first Airflow cluster and connecting to it, and viewing and running one of the example workflows (called DAGs = Directed Acyclic Graphs).
 == Prerequisites for this guide
diff --git a/docs/modules/airflow/pages/getting_started/installation.adoc b/docs/modules/airflow/pages/getting_started/installation.adoc
index 235eb6c6..7ab34e31 100644
--- a/docs/modules/airflow/pages/getting_started/installation.adoc
+++ b/docs/modules/airflow/pages/getting_started/installation.adoc
@@ -1,4 +1,5 @@
 = Installation
+:description: Install the Stackable operator for Apache Airflow with PostgreSQL, Redis, and required components using Helm or stackablectl.
 On this page you will install the Stackable Airflow Operator, the software that Airflow depends on - Postgresql and Redis - as well as the commons, secret and listener operator which are required by all Stackable Operators.
diff --git a/docs/modules/airflow/pages/index.adoc b/docs/modules/airflow/pages/index.adoc
index d33d3f0d..064bae38 100644
--- a/docs/modules/airflow/pages/index.adoc
+++ b/docs/modules/airflow/pages/index.adoc
@@ -1,6 +1,6 @@
 = Stackable Operator for Apache Airflow
-:description: The Stackable Operator for Apache Airflow is a Kubernetes operator that can manage Apache Airflow clusters. Learn about its features, resources, dependencies and demos, and see the list of supported Airflow versions.
-:keywords: Stackable Operator, Apache Airflow, Kubernetes, k8s, operator, engineer, big data, metadata, job pipeline, scheduler, workflow, ETL
+:description: The Stackable Operator for Apache Airflow manages Airflow clusters on Kubernetes, supporting custom workflows, executors, and external databases for efficient orchestration.
+:keywords: Stackable Operator, Apache Airflow, Kubernetes, k8s, operator, job pipeline, scheduler, ETL
 :airflow: https://airflow.apache.org/
 :dags: https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html
 :k8s-crs: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
diff --git a/docs/modules/airflow/pages/required-external-components.adoc b/docs/modules/airflow/pages/required-external-components.adoc
index 5cc166c9..b58d4ecf 100644
--- a/docs/modules/airflow/pages/required-external-components.adoc
+++ b/docs/modules/airflow/pages/required-external-components.adoc
@@ -1,6 +1,8 @@
 = Required external components
+:description: Airflow requires PostgreSQL, MySQL, or SQLite for database support, and Redis for Celery executors. MSSQL has experimental support.
-Airflow requires an SQL database to operate. The https://airflow.apache.org/docs/apache-airflow/stable/installation/prerequisites.html[Airflow documentation] specifies:
+Airflow requires an SQL database to operate.
+The https://airflow.apache.org/docs/apache-airflow/stable/installation/prerequisites.html[Airflow documentation] specifies:
 Fully supported for production usage:
diff --git a/docs/modules/airflow/pages/usage-guide/applying-custom-resources.adoc b/docs/modules/airflow/pages/usage-guide/applying-custom-resources.adoc
index 46ef6215..4d440766 100644
--- a/docs/modules/airflow/pages/usage-guide/applying-custom-resources.adoc
+++ b/docs/modules/airflow/pages/usage-guide/applying-custom-resources.adoc
@@ -1,6 +1,10 @@
 = Applying Custom Resources
+:description: Learn to apply custom resources in Airflow, such as Spark jobs, using Kubernetes connections, roles, and modular DAGs with git-sync integration.
-Airflow can be used to apply custom resources from within a cluster. An example of this could be a SparkApplication job that is to be triggered by Airflow. The steps below describe how this can be done. The DAG will consist of modularized python files and will be provisioned using the git-sync facility.
+Airflow can be used to apply custom resources from within a cluster.
+An example of this could be a SparkApplication job that is to be triggered by Airflow.
+The steps below describe how this can be done.
+The DAG will consist of modularized Python files and will be provisioned using the git-sync facility.
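The following section shows how to define the in-cluster connection that the DAG uses. As a rough, illustrative sketch of one way this can look: Airflow reads connections from environment variables named `AIRFLOW_CONN_<CONN_ID>`, which could be supplied through `envOverrides`; the connection id and the exact URI encoding of the `in_cluster` extra below are assumptions and depend on the installed Kubernetes provider version.

[source,yaml]
----
# Illustrative sketch only (not a complete AirflowCluster): one way to inject an
# in-cluster Kubernetes connection via envOverrides. The connection id and the
# URL-encoded "in_cluster" extra are assumptions and depend on the installed
# Kubernetes provider version.
spec:
  celeryExecutors:
    envOverrides:
      # Airflow reads connections from AIRFLOW_CONN_<CONN_ID> environment variables.
      AIRFLOW_CONN_KUBERNETES_IN_CLUSTER: "kubernetes://?__extra__=%7B%22in_cluster%22%3A+true%7D"
----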
 == Define an in-cluster Kubernetes connection
@@ -38,7 +42,9 @@ include::example$example-airflow-spark-clusterrolebinding.yaml[]
 == DAG code
-Now for the DAG itself. The job to be started is a modularized DAG that uses starts a one-off Spark job that calculates the value of pi. The file structure fetched to the root git-sync folder looks like this:
+Now for the DAG itself.
+The job to be started is a modularized DAG that starts a one-off Spark job that calculates the value of pi.
+The file structure fetched to the root git-sync folder looks like this:
 ----
 dags
@@ -57,12 +63,15 @@ The Spark job will calculate the value of pi using one of the example scripts th
 include::example$example-pyspark-pi.yaml[]
 ----
-This will be called from within a DAG by using the connection that was defined earlier. It will be wrapped by the `KubernetesHook` that the Airflow Kubernetes provider makes available https://github.com/apache/airflow/blob/main/airflow/providers/cncf/kubernetes/operators/spark_kubernetes.py[here].There are two classes that are used to:
+This will be called from within a DAG by using the connection that was defined earlier.
+It will be wrapped by the `KubernetesHook` that the Airflow Kubernetes provider makes available https://github.com/apache/airflow/blob/main/airflow/providers/cncf/kubernetes/operators/spark_kubernetes.py[here].
+There are two classes that are used to:
-- start the job
-- monitor the status of the job
+* start the job
+* monitor the status of the job
-The classes `SparkKubernetesOperator` and `SparkKubernetesSensor` are located in two different Python modules as they will typically be used for all custom resources and thus are best decoupled from the DAG that calls them. This also demonstrates that modularized DAGs can be used for Airflow jobs as long as all dependencies exist in or below the root folder pulled by git-sync.
+The classes `SparkKubernetesOperator` and `SparkKubernetesSensor` are located in two different Python modules as they will typically be used for all custom resources and thus are best decoupled from the DAG that calls them.
+This also demonstrates that modularized DAGs can be used for Airflow jobs as long as all dependencies exist in or below the root folder pulled by git-sync.
 [source,python]
 ----
@@ -100,6 +109,7 @@ TIP: A full example of the above is used as an integration test https://github.c
 == Logging
-As mentioned above, the logs are available from the webserver UI if the jobs run with the `celeryExecutor`. If the SDP logging mechanism has been deployed, log information can also be retrieved from the vector backend (e.g. Opensearch):
+As mentioned above, the logs are available from the webserver UI if the jobs run with the `celeryExecutor`.
+If the SDP logging mechanism has been deployed, log information can also be retrieved from the vector backend (e.g. OpenSearch):
 image::airflow_dag_log_opensearch.png[Opensearch]
diff --git a/docs/modules/airflow/pages/usage-guide/index.adoc b/docs/modules/airflow/pages/usage-guide/index.adoc
index fbbf5141..882f4e5b 100644
--- a/docs/modules/airflow/pages/usage-guide/index.adoc
+++ b/docs/modules/airflow/pages/usage-guide/index.adoc
@@ -1 +1,4 @@
 = Usage guide
+:description: Practical instructions to make the most out of the Stackable operator for Apache Airflow.
+
+Practical instructions to make the most out of the Stackable operator for Apache Airflow.
diff --git a/docs/modules/airflow/pages/usage-guide/listenerclass.adoc b/docs/modules/airflow/pages/usage-guide/listenerclass.adoc
index af4b7538..67c9f330 100644
--- a/docs/modules/airflow/pages/usage-guide/listenerclass.adoc
+++ b/docs/modules/airflow/pages/usage-guide/listenerclass.adoc
@@ -1,8 +1,11 @@
 = Service exposition with ListenerClasses
+:description: Configure Airflow service exposure with ListenerClasses: cluster-internal, external-unstable, or external-stable.
-Airflow offers a web UI and an API, both are exposed by the webserver process under the `webserver` role. The Operator deploys a service called `-webserver` (where `` is the name of the AirflowCluster) through which Airflow can be reached.
+Airflow offers a web UI and an API, both of which are exposed by the webserver process under the `webserver` role.
+The Operator deploys a service called `<name>-webserver` (where `<name>` is the name of the AirflowCluster) through which Airflow can be reached.
-This service can have three different types: `cluster-internal`, `external-unstable` and `external-stable`. Read more about the types in the xref:concepts:service-exposition.adoc[service exposition] documentation at platform level.
+This service can have three different types: `cluster-internal`, `external-unstable` and `external-stable`.
+Read more about the types in the xref:concepts:service-exposition.adoc[service exposition] documentation at platform level.
 This is how the listener class is configured:
diff --git a/docs/modules/airflow/pages/usage-guide/logging.adoc b/docs/modules/airflow/pages/usage-guide/logging.adoc
index 027d5dd8..3ed2f025 100644
--- a/docs/modules/airflow/pages/usage-guide/logging.adoc
+++ b/docs/modules/airflow/pages/usage-guide/logging.adoc
@@ -1,4 +1,5 @@
 = Log aggregation
+:description: Forward Airflow logs to a Vector aggregator by configuring the ConfigMap and enabling the log agent.
 The logs can be forwarded to a Vector log aggregator by providing a discovery ConfigMap for the aggregator and by enabling the log agent:
diff --git a/docs/modules/airflow/pages/usage-guide/monitoring.adoc b/docs/modules/airflow/pages/usage-guide/monitoring.adoc
index 26710b41..b41ee906 100644
--- a/docs/modules/airflow/pages/usage-guide/monitoring.adoc
+++ b/docs/modules/airflow/pages/usage-guide/monitoring.adoc
@@ -1,4 +1,5 @@
 = Monitoring
+:description: Airflow instances export Prometheus metrics for monitoring.
-The managed Airflow instances are automatically configured to export Prometheus metrics. See
-xref:operators:monitoring.adoc[] for more details.
+The managed Airflow instances are automatically configured to export Prometheus metrics.
+See xref:operators:monitoring.adoc[] for more details.
diff --git a/docs/modules/airflow/pages/usage-guide/mounting-dags.adoc b/docs/modules/airflow/pages/usage-guide/mounting-dags.adoc
index 4f082523..52ff1e9c 100644
--- a/docs/modules/airflow/pages/usage-guide/mounting-dags.adoc
+++ b/docs/modules/airflow/pages/usage-guide/mounting-dags.adoc
@@ -1,6 +1,8 @@
 = Mounting DAGs
+:description: Mount DAGs in Airflow via ConfigMap for single DAGs or use git-sync for multiple DAGs. git-sync pulls from a Git repo and handles updates automatically.
-DAGs can be mounted by using a `ConfigMap` or `git-sync`. This is best illustrated with an example of each, shown in the sections below.
+DAGs can be mounted by using a `ConfigMap` or `git-sync`.
+This is best illustrated with an example of each, shown in the sections below.
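As a rough preview of the ConfigMap approach, such a ConfigMap might look like the sketch below; the ConfigMap name and the DAG code are purely illustrative, and the complete example (including how it is mounted into the Airflow pods) follows in the next section.

[source,yaml]
----
# Hypothetical sketch: a ConfigMap holding one DAG file. The name and DAG
# content are illustrative only; the working example used in this guide is
# included in the section below.
apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-dags  # hypothetical name
data:
  hello_dag.py: |
    """A trivial DAG, just to show the shape of the ConfigMap."""
    import pendulum
    from airflow.decorators import dag, task

    @dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1, tz="UTC"), catchup=False)
    def hello():
        @task
        def say_hello():
            print("Hello from a ConfigMap-mounted DAG")
        say_hello()

    hello()
----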
 == via `ConfigMap`
@@ -23,13 +25,18 @@ include::example$example-airflow-dags-configmap.yaml[]
 WARNING: If a DAG mounted via ConfigMap consists of modularized files then using the standard location is mandatory as python will use this as a "root" folder when looking for referenced files.
-The advantage of this approach is that a DAG can be provided "in-line", as it were. This becomes cumbersome when multiple DAGs are to be made available in this way, as each one has to be mapped individually. For multiple DAGs it is probably easier to expose them all via a mounted volume, which is shown below.
+The advantage of this approach is that a DAG can be provided "in-line", as it were.
+This becomes cumbersome when multiple DAGs are to be made available in this way, as each one has to be mapped individually.
+For multiple DAGs it is probably easier to expose them all via a mounted volume, which is shown below.
 == via `git-sync`
 === Overview
-https://github.com/kubernetes/git-sync/tree/v4.2.1[git-sync] is a command that pulls a git repository into a local directory and is supplied as a sidecar container for use within Kubernetes. The Stackable implementation is a wrapper around this such that the binary and image requirements are included in the Stackable Airflow product images and do not need to be specified or handled in the `AirflowCluster` custom resource. Internal details such as image names and volume mounts are handled by the operator, so that only the repository and synchronization details are required. An example of this usage is given in the next section.
+https://github.com/kubernetes/git-sync/tree/v4.2.1[git-sync] is a command that pulls a git repository into a local directory and is supplied as a sidecar container for use within Kubernetes.
+The Stackable implementation is a wrapper around this such that the binary and image requirements are included in the Stackable Airflow product images and do not need to be specified or handled in the `AirflowCluster` custom resource.
+Internal details such as image names and volume mounts are handled by the operator, so that only the repository and synchronization details are required.
+An example of this usage is given in the next section.
 === Example
@@ -51,6 +58,9 @@ include::example$example-airflow-gitsync.yaml[]
 <11> Git-sync settings can be provided inline, although some of these (`--dest`, `--root`) are specified internally in the operator and will be ignored if provided by the user.
 Git-config settings can also be specified, although a warning will be logged if `safe.directory` is specified as this is defined internally, and should not be defined by the user.
-IMPORTANT: The example above shows a _*list*_ of git-sync definitions, with a single element. This is to avoid breaking-changes in future releases. Currently, only one such git-sync definition is considered and processed.
+IMPORTANT: The example above shows a _list_ of git-sync definitions, with a single element.
+This is to avoid breaking changes in future releases.
+Currently, only one such git-sync definition is considered and processed.
-NOTE: git-sync can be used with DAGs that make use of Python modules, as Python will be configured to use the git-sync target folder as the "root" location when looking for referenced files. See the xref:usage-guide/applying-custom-resources.adoc[] example for more details.
+NOTE: git-sync can be used with DAGs that make use of Python modules, as Python will be configured to use the git-sync target folder as the "root" location when looking for referenced files.
+See the xref:usage-guide/applying-custom-resources.adoc[] example for more details.
diff --git a/docs/modules/airflow/pages/usage-guide/overrides.adoc b/docs/modules/airflow/pages/usage-guide/overrides.adoc
index 88e14b23..26d16d74 100644
--- a/docs/modules/airflow/pages/usage-guide/overrides.adoc
+++ b/docs/modules/airflow/pages/usage-guide/overrides.adoc
@@ -1,10 +1,10 @@
 = Configuration & Environment Overrides
+:description: Airflow supports configuration and environment variable overrides per role or role group, with role group settings taking precedence. Be cautious with overrides.
 The cluster definition also supports overriding configuration properties and environment variables, either per role or per role group, where the more specific override (role group) has precedence over the less specific one (role).
-IMPORTANT: Overriding certain properties which are set by operator (such as the HTTP port) can interfere with the operator and can lead to problems. Additionally, for Airflow it is recommended
-that each component has the same configuration: not all components use each setting, but some things - such as external end-points - need to be consistent for things to work as expected.
+IMPORTANT: Overriding certain properties which are set by the operator (such as the HTTP port) can interfere with the operator and can lead to problems. Additionally, for Airflow it is recommended that each component has the same configuration: not all components use each setting, but some things - such as external endpoints - need to be consistent for things to work as expected.
 == Configuration Properties
@@ -13,7 +13,7 @@ Airflow exposes an environment variable for every Airflow configuration setting,
 As Airflow can be configured with python code too, arbitrary code can be added to the `webserver_config.py`.
 You can use either `EXPERIMENTAL_FILE_HEADER` to add code to the top or `EXPERIMENTAL_FILE_FOOTER` to add to the bottom.
-IMPORTANT: This is an experimental feature
+IMPORTANT: This is an experimental feature.
 [source,yaml]
 ----
diff --git a/docs/modules/airflow/pages/usage-guide/security.adoc b/docs/modules/airflow/pages/usage-guide/security.adoc
index 6b2d6692..9d54dc45 100644
--- a/docs/modules/airflow/pages/usage-guide/security.adoc
+++ b/docs/modules/airflow/pages/usage-guide/security.adoc
@@ -1,4 +1,5 @@
 = Security
+:description: Airflow supports authentication via Web UI or LDAP, with role-based access control managed by Flask AppBuilder, and LDAP users assigned default roles.
 == Authentication
@@ -6,13 +7,15 @@ Every user has to authenticate themselves before using Airflow and there are sev
 === Webinterface
-The default setting is to view and manually set up users via the Webserver UI. Note the blue "+" button where users can be added directly:
+The default setting is to view and manually set up users via the Webserver UI.
+Note the blue "+" button where users can be added directly:
 image::airflow_security.png[Airflow Security menu]
 === LDAP
-Airflow supports xref:concepts:authentication.adoc[authentication] of users against an LDAP server. This requires setting up an AuthenticationClass for the LDAP server.
+Airflow supports xref:concepts:authentication.adoc[authentication] of users against an LDAP server.
+This requires setting up an AuthenticationClass for the LDAP server.
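A minimal AuthenticationClass for an LDAP server might look roughly like the sketch below; the hostname, port and search base are placeholders, and a real setup usually needs additional settings such as bind credentials or TLS (see the linked authentication documentation for the full schema).

[source,yaml]
----
# Rough sketch of an AuthenticationClass for LDAP. Hostname, port and
# searchBase are placeholders; bind credentials and TLS settings are omitted.
apiVersion: authentication.stackable.tech/v1alpha1
kind: AuthenticationClass
metadata:
  name: ldap  # this name is what the AirflowCluster references
spec:
  provider:
    ldap:
      hostname: ldap.example.org  # placeholder
      port: 389                   # placeholder
      searchBase: ou=users,dc=example,dc=org
----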
 The AuthenticationClass is then referenced in the AirflowCluster resource as follows:
 [source,yaml]
 ----
diff --git a/docs/modules/airflow/pages/usage-guide/storage-resources.adoc b/docs/modules/airflow/pages/usage-guide/storage-resources.adoc
index 1a2a61b4..e266e0c1 100644
--- a/docs/modules/airflow/pages/usage-guide/storage-resources.adoc
+++ b/docs/modules/airflow/pages/usage-guide/storage-resources.adoc
@@ -1,4 +1,5 @@
 = Resource Requests
+:description: Find out about minimal HA Airflow requirements for CPU and memory, with defaults for schedulers, Celery executors, and webservers using Kubernetes resource limits.
 include::home:concepts:stackable_resource_requests.adoc[]
diff --git a/docs/modules/airflow/pages/usage-guide/using-kubernetes-executors.adoc b/docs/modules/airflow/pages/usage-guide/using-kubernetes-executors.adoc
index a3717991..6d713e91 100644
--- a/docs/modules/airflow/pages/usage-guide/using-kubernetes-executors.adoc
+++ b/docs/modules/airflow/pages/usage-guide/using-kubernetes-executors.adoc
@@ -1,4 +1,5 @@
 = Using Kubernetes executors
+:description: Configure Kubernetes executors in Airflow to dynamically create pods for tasks, replacing Celery executors and bypassing Redis for job routing.
 Instead of using the Celery workers you can let Airflow run the tasks using Kubernetes executors, where pods are created dynamically as needed without jobs being routed through a redis queue to the workers.
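A rough sketch of how this can look in the cluster definition is shown below; the field names follow recent operator versions and the resource values are placeholders, so consult the AirflowCluster CRD reference for the exact schema of the version in use.

[source,yaml]
----
# Illustrative sketch: running tasks with Kubernetes executors instead of
# Celery workers. Field names and values are assumptions; check the
# AirflowCluster CRD reference for the operator version in use.
spec:
  # The celeryExecutors role (and the Redis queue) is not needed in this mode;
  # executor pods are instead created on demand for each task.
  kubernetesExecutors:
    config:
      resources:
        cpu:
          min: 100m
          max: 500m
        memory:
          limit: 1Gi
----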