Replies: 1 comment
-
When I enabled debug logging in Airflow, I can see the last operation is loading
On a successful run this is normally followed by a response from the EKS API:
and this one shows that EKS response times are very long, so I suppose this could simply be EKS responding slowly or never responding. However, the impact on Airflow tasks / workers is terrible: after a few minutes the whole worker seems to fail its liveness probe and gets killed.
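To sanity-check whether the slowness really is on the EKS side, the same API call can be timed outside Airflow. A minimal sketch, assuming boto3 is available on the worker; the cluster name and region are placeholders:

```python
# Hedged sketch: time an EKS API call outside Airflow to see whether the
# latency is on the EKS side. Cluster name and region are placeholders.
import time

import boto3
from botocore.config import Config

config = Config(
    connect_timeout=10,   # fail fast instead of hanging on the connection
    read_timeout=30,      # cap how long we wait for the API response
    retries={"max_attempts": 2, "mode": "standard"},
)

eks = boto3.client("eks", region_name="eu-west-1", config=config)

start = time.monotonic()
cluster = eks.describe_cluster(name="my-cluster")
elapsed = time.monotonic() - start

print(f"describe_cluster took {elapsed:.2f}s, status={cluster['cluster']['status']}")
```

If this call also takes minutes from the same VPC / network, the slowness is more likely on the EKS or networking side than in the operator itself.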
-
My setup is as follows:
The issue I'm observing is that workers get restarted due to the liveness probe in the Composer set-up (maybe it is a Composer-specific configuration; so far I have tried scaling up the environment, but the problem keeps popping up) after they start executing the pod creation task.
The log line is
after which the worker gets killed and all existing running tasks are set to failed (they manage to re-claim the running pods if there are remaining attempts, and the new attempt gets to run again).
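For reference, a minimal sketch of this kind of task, with retries so that a new attempt can re-claim a still-running pod. This is not the actual DAG: the cluster name, image, namespace and DAG id are placeholders, and it assumes Airflow 2.4+ with the Amazon provider's EksPodOperator. The re-claiming behaviour comes from retries together with reattach_on_restart, which is inherited from KubernetesPodOperator:

```python
# Hedged sketch of the kind of task described above; cluster name, image,
# namespace and DAG id are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksPodOperator

with DAG(
    dag_id="eks_pod_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    run_pod = EksPodOperator(
        task_id="run_pod",
        cluster_name="my-eks-cluster",      # placeholder
        namespace="default",
        pod_name="example-pod",
        image="busybox",
        cmds=["sh", "-c", "echo hello"],
        get_logs=True,
        retries=2,                          # gives a failed try a chance to re-claim the pod
        retry_delay=timedelta(minutes=1),
        reattach_on_restart=True,           # default; re-attach to a running pod on retry
    )
```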
When tasks are started in a slower fashion (e.g. one after another, minutes apart), things seem to behave more stably, so this is likely just resource exhaustion on the worker. However, I'm puzzled by how quickly it goes bad: it looks like starting something like >3 `EksPodOperator` tasks at the same moment on the same worker makes those tasks get stuck / extremely slow and takes the whole worker out. I'm looking for suggestions on whether:
- `EksPodOperator` is doing something wrong / deadlocking (?) due to several of them starting on the same worker
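One mitigation that matches the observation about staggered starts behaving better is to cap how many of these tasks can run at once with an Airflow pool. A minimal sketch; the pool name, slot count and all operator arguments are assumptions, and the pool has to be created first (Admin -> Pools in the UI, or `airflow pools set eks_pods 3 "EKS pod tasks"`):

```python
# Hedged sketch: cap how many EksPodOperator tasks run at once by putting
# them in a small pool. Pool name, slot count and operator arguments are
# assumptions, not the actual DAG.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksPodOperator

with DAG(
    dag_id="eks_pod_throttled",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    for i in range(10):
        EksPodOperator(
            task_id=f"run_pod_{i}",
            cluster_name="my-eks-cluster",   # placeholder
            namespace="default",
            pod_name=f"example-pod-{i}",
            image="busybox",
            cmds=["sh", "-c", "echo hello"],
            pool="eks_pods",                 # pool with 3 slots -> at most 3 of these at a time
        )
```

Note that a pool limits concurrency across all workers rather than per worker, and it does not explain why a handful of concurrent operators is enough to take a worker out, but it can keep workers alive while the root cause is investigated.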