testing: gke then eks #67

Merged
merged 1 commit into from
Feb 19, 2024

vsoch (Member) commented Feb 19, 2024

I am making small changes as I test on GKE and EKS. My first tests on GKE had me creating and deleting jobs, and I think the state of fluence (fluxion) got out of sync with the jobs: fluxion thought jobs were running that were not, and then was unable to allocate new ones. To adjust for that we can add back in the cancel response, but this only works if fluence has not lost memory of the job id. We likely need one of the following: save the jobids to the state data (so they could be reloaded), a way to inspect jobs explicitly and purge them, or (better) a way to look up a job based not on the id but on the group id (the command in the jobspec). That way, regardless of the jobid, we could lose all of our state and still find the old (stale) job to delete.

With a fresh state and a larger cluster I am able to run jobs on GKE, but they are enormously slow: lammps size 2 2 2 is taking over 20 minutes. This is not the fault of fluence; GKE networking is just that slow. To keep debugging I likely need to move over to AWS with EFA, which of course introduces more things to figure out.
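To make the first two options concrete, here is a minimal Python sketch of a group-keyed job index that is saved to disk and can purge stale entries after a restart. This is not fluence's code and calls no real fluxion API: `JobIndex`, `purge_stale`, the JSON file, and the `cancel` callback are all hypothetical names chosen for illustration. The third option (looking a job up by group id inside fluxion itself) depends on what fluxion supports, so it is not shown.

```python
import json
from pathlib import Path


class JobIndex:
    """Hypothetical group name -> jobid map that survives a restart."""

    def __init__(self, path):
        self.path = Path(path)
        self.jobs = {}
        # Reload any previously saved mapping so job ids are not lost.
        if self.path.exists():
            self.jobs = json.loads(self.path.read_text())

    def record(self, group, jobid):
        """Remember which jobid was allocated for a group."""
        self.jobs[group] = jobid
        self._save()

    def pop(self, group):
        """Forget a group and return its jobid (None if unknown)."""
        jobid = self.jobs.pop(group, None)
        if jobid is not None:
            self._save()
        return jobid

    def stale(self, live_groups):
        """Entries whose group no longer exists in the cluster."""
        return {g: j for g, j in self.jobs.items() if g not in live_groups}

    def _save(self):
        self.path.write_text(json.dumps(self.jobs))


def purge_stale(index, live_groups, cancel):
    """Cancel and forget every job whose group is gone.

    `cancel` is an assumed callback that takes a jobid; in practice it
    would wrap whatever cancel request the scheduler exposes.
    """
    purged = []
    for group, jobid in index.stale(live_groups).items():
        cancel(jobid)
        index.pop(group)
        purged.append(group)
    return purged
```

Keying on the group name rather than the jobid means a delete event only needs the group to find the job, and running `purge_stale` at startup against the groups that are actually live would clear allocations left over from jobs deleted while the scheduler was not watching.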

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch (Member, Author) commented Feb 19, 2024

Still having some trouble. I think it is now because the networking in GKE is abysmal, and there is still an issue of state in our operator. But I was able to get a few runs in and at least get a rough comparison.

I think the likely next step is to get this running on EKS (so the network isn't an issue) and to think harder about the state problem (primarily the jobid mapping).

[Figure: lammps-total-times]

I also think there is an issue with fluence not properly seeing the resources used by other things on the node that were not installed with it, which is most of what runs there. This shows up at smaller node sizes, which is why I increased the size for this test. We likely need a way to get a fuller picture of what is running on the cluster and keep it updated, which might also help with our current "getting stale" problem.
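As a rough illustration of the "fuller picture" idea, the sketch below computes free CPU per node by subtracting the requests of every pod on the node, not only the pods that fluence placed. The function name and the input shapes are assumptions for illustration; in a real cluster the numbers would come from the Kubernetes API (each node's allocatable resources and each pod's resource requests), and fluence may account for resources differently.

```python
from collections import defaultdict


def free_cpu_by_node(allocatable, pods):
    """Return free CPU (millicores) per node.

    allocatable: {node_name: allocatable millicores}
    pods: iterable of (node_name, requested millicores) for ALL pods
          on the cluster, including system pods not started by fluence.
    """
    used = defaultdict(int)
    for node, request in pods:
        used[node] += request
    # Clamp at zero so an overcommitted node never reports negative capacity.
    return {node: max(total - used[node], 0) for node, total in allocatable.items()}


# Made-up numbers: a small node where system pods take a large share.
allocatable = {"node-1": 2000, "node-2": 2000}
pods = [
    ("node-1", 900),   # system daemons
    ("node-1", 500),   # a job pod
    ("node-2", 900),   # system daemons
]
print(free_cpu_by_node(allocatable, pods))
```

On a small node the system pods take a large fraction of the allocatable CPU, so a scheduler that only counts its own pods would see more free capacity than actually exists, which matches the behavior described above at smaller node sizes.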

@vsoch vsoch merged commit 71156c5 into fluence-controller Feb 19, 2024
7 checks passed