Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: Packit: Run cockpit storage tests in PRs #1161

Merged
merged 1 commit into from
Sep 1, 2023

Conversation

martinpitt
Copy link
Contributor

@martinpitt martinpitt commented Aug 4, 2023

See https://cockpit-project.org/blog/tmt-cross-project-testing.html


I tested this against my fork in martinpitt#2 . That also demonstrates that this "just works" in forks without magic setup, which makes it easy to try out stuff and collaborate.

When done seriously, this is an ongoing commitment. In our recent years the Cockpit team has reported a couple dozen udisks bug reports found by our tests, some of them were actual regressions like these:

With this setup, we have a good chance of prevent landing such regressions.

This approach is fairly new at least for us in the Fedora world, so let's treat this as an experiment. We do need to talk about commitments and expectations, though. From our side, I propose:

  • We don't expect you to become experts in our tests. When they fail, it would be nice if you could have a quick look at the log and see if it's something obvious (for example, the screenshot shows an error message, or the journal shows an udisks crash). But in general, we expect that someone from our team will investigate and discuss that with you. The PR where that happened provides a nice place to collect notes.

  • For now, contacting us is manual, i.e. writing a comment like "hey @martinpitt @marusak @mvollmer please have a look at the cockpit failure here". I put that list into the test plan, I'm happy to add it anywhere else where it's easier for you to find. In the future this can hopefully be automated, see RFE: Notify on failure + tag github user(s) along with custom failure message packit/packit-service#1911

  • We don't expect you to block PRs on these tests, especially not if the testing farm has infrastructure problems . Each Monday morning they tend to run into provisioning errors. Just ignore these then -- we see them too in our projects, and will usually prod #testing-farm then.

  • We would like you to at least give us a chance on that workday to look at a failed test (not infra failure, a "real" one) before you land a PR, so that this all makes sense. If something is urgent and you quick-land, then at least we still retain the benefit of having a trace in which PR tests started to fail, but then the damage is possibly already done. Note that this isn't a big new commitment from our side: we already spend many hours every week looking at regressions, and it takes a lot more effort to track them down weeks after they happened. So this will acually reduce both our and your time spent on hunting down regressions, because the context (PR) is fresh and small, as opposed to "whatever changed in Fedora in the last 4 weeks".

  • I set this up for Fedora latest and rawhide for the time being. If you think that rawhide has too much churn, of course feel free to drop that from .packit.yaml, or I do it here right away when you want to start slow. We have enabled rawhide in cockpit projects a while ago, and it does often fail due to unrelated reasons. Improving this is pretty much exactly why I start this initiative, and at least knowing about problems is still better IMHO than entirely ignoring rawhide, but it can get annoying.

Please let me know about any other question that you may have. I'm happy to discuss them here or in a gmeet for higher bandwidth.

Thanks!

@StorageGhoul
Copy link

Can one of the admins verify this patch?

@martinpitt

This comment was marked as resolved.

@martinpitt
Copy link
Contributor Author

So f38 is fine, rawhide failed because it seems to be uninstallable in the udisks-daily COPR. That may be part of the "too much churn in rawhide" issue from the description, or spot an actual bug.

@vojtechtrefny , @tbzatek I'll leave it at that for now, and let you do a round of review and commenting, in particular of the points in the description. Thanks!

@martinpitt
Copy link
Contributor Author

@vojtechtrefny , @tbzatek : FYI, I'll be on PTO for two weeks from tomorrow on. Please ask @marusak and @jelly for questions around this. Thanks!

@martinpitt
Copy link
Contributor Author

martinpitt commented Aug 24, 2023

Interesting, I get a consistent LVM failure in F39 and rawhide, while F38 succeeds. Cockpit so far has run the udisks-daily COPR only on Fedora 38 + updates-testing, that's why we didn't see that previously.

That might just be exactly the thing that I'm looking for with these tests. I'll reproduce/investigate tomorrow, maybe we can prevent a bug from landing in the next release :-) In the meantime this is blocked, moving to draft.

This reproduces easily locally. I prepare a fedora-39 cockpit test VM, the test works. I run dnf -y copr enable @storage/udisks-daily; dnf -y --setopt=install_weak_deps=False update. That exactly installs all updates from the COPR (udisks and libblockdev), no other updates from Fedora (my VM is current).

@martinpitt martinpitt marked this pull request as draft August 24, 2023 16:34
@martinpitt
Copy link
Contributor Author

Indeed that regression is already reported as issue #1170

@martinpitt
Copy link
Contributor Author

So this is actually exactly what I'd like to achieve -- with this in place, we would have noticed the regression right in PR #1158 which introduced it.

@vojtechtrefny , @tbzatek : So I propose to block this PR until #1170 is fixed, so that you don't start with red statuses. But could we already discuss this approach in general? I have a bunch of questions in the description that we should agree on. Happy to schedule a gmeet as well!

@martinpitt
Copy link
Contributor Author

Ah, I was confused -- in fact, #1170 does affect Fedora 38 as well, but @mvollmer already added a "naughty pattern" (failure suppression) for it. I now applied that to F39/40 as well, so that they should succeed now. So this is once again unblocked.

But it nicely demonstrates that we can avoid introducing such regression in the first place with this revdeps testing 😁

@martinpitt martinpitt marked this pull request as ready for review August 25, 2023 07:07
Copy link
Member

@vojtechtrefny vojtechtrefny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you.

@martinpitt
Copy link
Contributor Author

@tbzatek I discussed this mostly with @vojtechtrefny in slack yesterday (maybe you didn't see these messages? bridge may not work?). Please let me know if you have further questions, or doubts about the expectations laid out in the description. Thanks!

@tbzatek
Copy link
Member

tbzatek commented Aug 30, 2023

So this is actually exactly what I'd like to achieve -- with this in place, we would have noticed the regression right in PR #1158 which introduced it.

Exactly, I believe this regression would have been caught right in the iscsi fixes pull request. On the other hand, the failure in the lvm2 module was not covered by udisks tests either but well, tests will never cover every line of code. At least not at the time of their creation, one is expecting some stuff would never fail...

There's another bugreport that would have been caught the same way I believe: #1172

So running Cockpit tests is always a great asset.

To be fair, I rarely look at our Jenkins nightly runs. I also tend to ignore the unstable tests in our PR runs. I do however check rawhide even though there's always some known issue, it is also the first place where regressions introduced by other packages or the kernel do show up first.

Fixing tests in UDisks is often time consuming and not always failing test means a problem in the real world use. There are number of random failures caused by timing or race conditions (often on the tests side) or simply a slow system that can't keep up with uevent processing. In the past couple of years we've invested quite a bit of time to fix tests and various corner cases while the time spent on this may have been invested somewhere else. It's always about the tradeoff of quality vs. quantity. On an understaffed project like UDisks there's nobody else that would take care of such bugfixing however.

I would say: let's try this and let's see what our habit of checking Cockpit tests will be.

  • We don't expect you to block PRs on these tests, especially not if the testing farm has infrastructure problems . Each Monday morning they tend to run into provisioning errors. Just ignore these then -- we see them too in our projects, and will usually prod #testing-farm then.

That's fine, I have absolutely no problem with ignoring unknown errors :-)

  • We would like you to at least give us a chance on that workday to look at a failed test (not infra failure, a "real" one) before you land a PR, so that this all makes sense. If something is urgent and you quick-land, then at least we still retain the benefit of having a trace in which PR tests started to fail, but then the damage is possibly already done. Note that this isn't a big new commitment from our side: we already spend many hours every week looking at regressions, and it takes a lot more effort to track them down weeks after they happened. So this will acually reduce both our and your time spent on hunting down regressions, because the context (PR) is fresh and small, as opposed to "whatever changed in Fedora in the last 4 weeks".

Urgent PRs are quite an exception in UDisks, perhaps only shortly before upcoming release. Backports in Fedora and RHEL are always manual, not even tracking upstream releases. Upstream PRs can stay open for a while.

  • I set this up for Fedora latest and rawhide for the time being. If you think that rawhide has too much churn, of course feel free to drop that from .packit.yaml, or I do it here right away when you want to start slow. We have enabled rawhide in cockpit projects a while ago, and it does often fail due to unrelated reasons. Improving this is pretty much exactly why I start this initiative, and at least knowing about problems is still better IMHO than entirely ignoring rawhide, but it can get annoying.

Rawhide is important. It lets us spot kernel changes that would soon get to RHEL. Block layer, device drivers, filesystem drives - there have been various changes that looked innocent from kernel POV but made a big difference in userspace and UDisks especially. Same thing with userspace package rebases - e.g. util-linux. Even c9s often gets ahead of rawhide and is a good candidate to the most bleeding edge distro.

@tbzatek
Copy link
Member

tbzatek commented Aug 30, 2023

Jenkins, ok to test.

@martinpitt
Copy link
Contributor Author

The integration test results look rather sad, but I suppose these are unrelated? can they be retried?

@tbzatek
Copy link
Member

tbzatek commented Aug 31, 2023

The integration test results look rather sad, but I suppose these are unrelated? can they be retried?

Welcome to our world. Actually the number of failures this time is rather low.

These might be partially fixed (or silenced or removed) once #1168 and #1162 are merged. However tests are failing randomly and as I said above, not every test failure indicates an issue in the real world. It's also incredibly time consuming chasing every single test failure like these.

@martinpitt
Copy link
Contributor Author

So is anything still blocking this PR?

@tbzatek
Copy link
Member

tbzatek commented Sep 1, 2023

Nope, let's go for it!

@tbzatek tbzatek merged commit c640ad7 into storaged-project:master Sep 1, 2023
14 checks passed
@martinpitt martinpitt deleted the run-cockpit-storage branch September 1, 2023 15:12
@martinpitt
Copy link
Contributor Author

🚀 yay! Please holler if you see any trouble with this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants