debug testRaidRepair on rawhide #19745
cb96338
to
547e364
Compare
The first round failed for the wrong reason, retried. But it did save unique attachments 👍 |
547e364
to
1b85880
Compare
Second round succeeded. |
This time around it failed in the right way. As @mvollmer guessed, the screenshot doesn't actually show an error, just the UI being busy. Let's check if increasing the timeout works, plus amplifying the test.
Meh, that's a testing farm networking error. Not sure when it'll time out, I'll just force-push again. |
aa0a213
to
df92c68
Compare
The previous run finished after all, and it failed the same way. It can't possibly take longer than 5 minutes to repair a 50 MB toy RAID, so this is clearly a bug. I will try harder to reproduce it locally, now that we know that it is a race.
Explicitly check if the VG exists, and fail the cleanup if `vgremove` fails. This is more explicit than letting the next test run in an unclean VM, and also gives us more information as it doesn't hide stderr.
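A minimal sketch of what such an explicit cleanup could look like, not the actual commit. It assumes that `vgs NAME` exits non-zero when the volume group does not exist; the function name `remove_vg` and the injectable `run` parameter are made up for illustration.

```python
import subprocess


def remove_vg(name, run=subprocess.run):
    """Remove volume group `name` if it exists; raise if the removal fails.

    Returns True if a VG was removed, False if there was nothing to remove.
    `run` is injectable only so the logic can be exercised without LVM.
    """
    # Assumption: `vgs NAME` exits non-zero when the VG does not exist.
    probe = run(["vgs", name], capture_output=True)
    if probe.returncode != 0:
        return False
    # stderr is deliberately not redirected, so a vgremove error ends up in
    # the test log; check=True turns a failure into CalledProcessError.
    run(["vgremove", "--force", name], check=True)
    return True
```

Failing loudly here means the broken cleanup is reported in the test that caused it, instead of surfacing as a confusing failure in whichever test runs next in the same VM.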
So that we get screenshots from all retries of a test, for example.
This reverts commit 3956fac.
df92c68
to
22b78ea
Compare
I added D-Bus debugging. I want to see if the .Repair() call ever actually finishes. It does seem to finish on the kernel side, so there is some chance that we miss the finish call in the UI. The problem could also be somewhere between libblockdev and udisks. The resulting test run shows that the dialog makes the D-Bus call:
But indeed there never is a response for call 79. It creates a job:
but this also doesn't get any updates. There are no signals or other D-Bus messages from UDisks after that point. There has been no recent udisks rawhide update, nor libblockdev, nor lvm2. I suppose the next step would be to try and reproduce that interactively on a reserved tmt instance and strace what udisks/libblockdev do -- I figure they issue some kind of |
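For spotting calls like number 79 in a longer capture, here is a small sketch. It assumes the debug output has already been parsed into dicts with the hypothetical keys `type`, `serial`, `reply_serial`, `sender` and `destination`; the real log format is different and would need its own parser. It relies only on the D-Bus rule that a reply carries the serial of the call it answers.

```python
def unanswered_calls(messages):
    """Return the method calls that never got a method_return or error."""
    pending = {}
    for msg in messages:
        if msg["type"] == "method_call":
            # serials are only unique per sender, so key on both
            pending[(msg["sender"], msg["serial"])] = msg
        elif msg["type"] in ("method_return", "error"):
            # a reply is addressed to the caller and names the call's serial
            pending.pop((msg["destination"], msg["reply_serial"]), None)
    return list(pending.values())
```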
In a successful run, we see the following at the end:
There is some action from dmeventd and lvm after the recovery. This is missing in a failed run. Thus, I think there is something broken below libblockdev. |
There is a way via the C API, but udisks2 would need to call it, which it doesn't. I have wrapped /usr/sbin/lvm to log which commands are run and how they finish. |
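Roughly, such a wrapper could look like the sketch below. This is not the wrapper used in this PR; the function name `run_logged` and the idea of moving the real binary aside (say to a hypothetical `/usr/sbin/lvm.real`) are assumptions for illustration.

```python
import subprocess
import time


def run_logged(real_binary, args, log_path):
    """Run `real_binary` with `args`, logging the start and the exit status."""
    with open(log_path, "a") as log:
        log.write(f"{time.time():.3f} start {args!r}\n")
        # Flush before running: if the command hangs, the start line is
        # already on disk and the missing exit line shows which one it was.
        log.flush()
        rc = subprocess.call([real_binary, *args])
        log.write(f"{time.time():.3f} exit {rc} {args!r}\n")
    return rc
```

A wrapper script installed in place of the original binary would call this with `sys.argv[1:]` and pass the return value to `sys.exit()`, so callers still see the original exit code.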
605c96d
to
773d6fe
Compare
Alright, I'll try to get an strace or debugging spew out of it.
Ok, from local testing it looks like lvconvert waits for dmeventd to start monitoring after the repair, and we know already that it doesn't... let's see what the logs say for a real failure. |
It passed like any good Heisenbug ought to.. Re-running. |
653d80c
to
c848bf2
Compare
Now it failed properly again. I don't immediately see an strace in the journal but there's tons of debug info. Let's look at this tomorrow with @mvollmer |
The interesting part of the log:
lvconvert is "dm resume"ing all device mapper things, but the last one gets stuck. Presumably, lvconvert does not receive some expected udev event with the expected cookie. Interestingly, mdraid is done recovering at the same time. In a successful run, mdraid is done recovering much earlier, and I guess in a realistic scenario mdraid would be done recovering much much later and not interfere with lvconvert at all. So we might run into this because our test disk is too small or too fast. IF this has to do with when mdraid is done recovering, which is just a guess. |
Hmm, no, it's not waiting for anything according to an strace of a good run. Let's see...
This reverts commit 4704127.
Ok, lvconvert just hangs, very likely in a device mapper ioctl. (strace does not log the syscall since it hasn't completed)
This log shows that lvconvert does an ioctl after logging the "dm resume" line. This ioctl is missing from the logs after the "dm resume" for 253:6, probably because it hangs in the kernel.
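Since strace cannot show a syscall that never returns, one cheap cross-check is the process state in `/proc/<pid>/stat`. The sketch below assumes the hang would show up as uninterruptible sleep (state `D`), which is only a guess for this case; both helper names are made up. The kernel stack itself can usually be read as root from `/proc/<pid>/stack`.

```python
def parse_proc_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split at the last ')' rather than on whitespace.
    head, _, tail = stat_line.rpartition(")")
    pid, _, comm = head.partition(" (")
    state = tail.split()[0]
    return int(pid), comm, state


def is_uninterruptible(pid):
    """True if the process is in state D (uninterruptible sleep)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_proc_stat(f.read())[2] == "D"
```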
1d9fb8c
to
78bf91f
Compare
78bf91f
to
e222bb6
Compare
And here is the kernel stack trace of the hanging ioctl:
FTR, I cherry-picked the first two commits into #19761, let's land them -- they are good. |
Concluded in #19798. |
The journal actually looks fine; it seems the resyncing succeeded and was quick. But somehow the UI doesn't react to it?