[Flaky] when Creating a multikueue admission check Should run a kubeflow XGBoostJob #2838

Open
alculquicondor opened this issue Aug 15, 2024 · 6 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test.

Comments

@alculquicondor
Contributor

What happened:

End To End MultiKueue Suite: kindest/node:v1.30.0: [It] MultiKueue when Creating a multikueue admission check Should run a kubeflow XGBoostJob on worker if admitted (9s)
{Timed out after 5.000s.
The function passed to Eventually failed at /home/prow/go/src/kubernetes-sigs/kueue/test/e2e/multikueue/e2e_test.go:688 with:
Expected object to be comparable, diff:   &v1.ReplicaStatus{
- 	Active:        1,
+ 	Active:        0,
- 	Succeeded:     0,
+ 	Succeeded:     1,
  	Failed:        0,
  	LabelSelector: nil,
  	Selector:      "",
  }
 failed [FAILED] Timed out after 5.000s.
The function passed to Eventually failed at /home/prow/go/src/kubernetes-sigs/kueue/test/e2e/multikueue/e2e_test.go:688 with:
Expected object to be comparable, diff:   &v1.ReplicaStatus{
- 	Active:        1,
+ 	Active:        0,
- 	Succeeded:     0,
+ 	Succeeded:     1,
  	Failed:        0,
  	LabelSelector: nil,
  	Selector:      "",
  }
In [It] at: /home/prow/go/src/kubernetes-sigs/kueue/test/e2e/multikueue/e2e_test.go:703 @ 08/15/24 06:05:55.326
}

What you expected to happen:

Test to pass

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Kueue version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
alculquicondor added the kind/bug label on Aug 15, 2024
@alculquicondor
Contributor Author

/assign @mszadkow

@tenzen-y
Member

/kind flake

The XGBoostJob has some state transition bugs. So, maybe we need to remove the test case from Kueue or fix the root bug in the training-operator.

k8s-ci-robot added the kind/flake label on Aug 15, 2024
@alculquicondor
Contributor Author

I see, thanks for the context.

@mszadkow any chance you can take a look in the training-operator code?
In the meantime, let's disable this test by calling ginkgo.Skip() with an accompanying comment.

@mszadkow
Contributor

@tenzen-y Can you explain more about the transition bug? Is it a known one?

@mszadkow
Contributor

Yes, sure, I can take a look there, but as you said, I'll skip the test for now.

@tenzen-y
Member

tenzen-y commented Aug 16, 2024

> @tenzen-y Can you explain more about the transition bug? Is it a known one?

For historical reasons, we used to just rerun the failed flaky tests in the training-operator.
So, we do not have a dedicated issue for the specific transitions.

However, we explained the transition issue briefly here: kubeflow/training-operator#1711
