Replies: 1 comment
-
When I enabled debug logging in Airflow, I can see the last operation is loading
On a successful run this is normally followed by a response from the EKS API:
and this one shows that EKS response times are very long, so I suppose this could simply be EKS responding slowly or never responding. However, the impact on Airflow tasks / workers is terrible: after a few minutes the whole worker seems to fail its liveness probe and gets killed.
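To sanity-check whether the slowness really is on the EKS side, the same API call can be timed outside Airflow. A minimal sketch, assuming boto3 is available on the worker; the cluster name and region are placeholders:

```python
# Hedged sketch: time an EKS API call outside Airflow to see whether the
# latency is on the EKS side. Cluster name and region are placeholders.
import time

import boto3
from botocore.config import Config

config = Config(
    connect_timeout=10,   # fail fast instead of hanging on the connection
    read_timeout=30,      # cap how long we wait for the API response
    retries={"max_attempts": 2, "mode": "standard"},
)

eks = boto3.client("eks", region_name="eu-west-1", config=config)

start = time.monotonic()
cluster = eks.describe_cluster(name="my-cluster")
elapsed = time.monotonic() - start

print(f"describe_cluster took {elapsed:.2f}s, status={cluster['cluster']['status']}")
```

If this call also takes minutes from the same VPC / network, the slowness is more likely on the EKS or networking side than in the operator itself.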
-
My setup is as follows:
The issue I'm observing is that workers get restarted due to the liveness probe in the Composer set-up (maybe it is a Composer-specific configuration; so far I have tried scaling up the environment, but the problem keeps popping up) after they start executing the pod creation task.
The log line is
after which the worker gets killed and all existing running tasks are set to failed (they manage to re-claim the running pods if there are remaining attempts, and the new attempt gets to run again).
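For reference, a minimal sketch of this kind of task, with retries so that a new attempt can re-claim a still-running pod. This is not the actual DAG: the cluster name, image, namespace and DAG id are placeholders, and it assumes Airflow 2.4+ with the Amazon provider's EksPodOperator. The re-claiming behaviour comes from retries together with reattach_on_restart, which is inherited from KubernetesPodOperator:

```python
# Hedged sketch of the kind of task described above; cluster name, image,
# namespace and DAG id are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksPodOperator

with DAG(
    dag_id="eks_pod_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    run_pod = EksPodOperator(
        task_id="run_pod",
        cluster_name="my-eks-cluster",      # placeholder
        namespace="default",
        pod_name="example-pod",
        image="busybox",
        cmds=["sh", "-c", "echo hello"],
        get_logs=True,
        retries=2,                          # gives a failed try a chance to re-claim the pod
        retry_delay=timedelta(minutes=1),
        reattach_on_restart=True,           # default; re-attach to a running pod on retry
    )
```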
When tasks are started in a slower fashion (e.g. one after another, minutes apart), things seem to behave more stably, so this is likely just resource exhaustion on the worker. However, I'm puzzled by how quickly it goes bad: it looks like starting something like >3 `EksPodOperator` tasks at the same moment on the same worker makes those tasks get stuck / extremely slow and takes the whole worker out. I'm looking for suggestions on whether:
- `EksPodOperator` is doing something wrong / deadlocking (?) due to several of them starting on the same worker
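One mitigation that matches the observation about staggered starts behaving better is to cap how many of these tasks can run at once with an Airflow pool. A minimal sketch; the pool name, slot count and all operator arguments are assumptions, and the pool has to be created first (Admin -> Pools in the UI, or `airflow pools set eks_pods 3 "EKS pod tasks"`):

```python
# Hedged sketch: cap how many EksPodOperator tasks run at once by putting
# them in a small pool. Pool name, slot count and operator arguments are
# assumptions, not the actual DAG.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksPodOperator

with DAG(
    dag_id="eks_pod_throttled",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    for i in range(10):
        EksPodOperator(
            task_id=f"run_pod_{i}",
            cluster_name="my-eks-cluster",   # placeholder
            namespace="default",
            pod_name=f"example-pod-{i}",
            image="busybox",
            cmds=["sh", "-c", "echo hello"],
            pool="eks_pods",                 # pool with 3 slots -> at most 3 of these at a time
        )
```

Note that a pool limits concurrency across all workers rather than per worker, and it does not explain why a handful of concurrent operators is enough to take a worker out, but it can keep workers alive while the root cause is investigated.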