debug testRaidRepair on rawhide #19745

mvollmer · 2023-12-13T08:57:06Z

The journal actually looks fine, it seems the resyncing succeeded and was quick: But somehow the UI doesn't react to it?

Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: device offline error, dev sda, sector 48888 op 0x0:(READ) flags 0x0 phys_seg 31 prio class 2
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal systemd-homed[752]: block device /sys/devices/pseudo_0/adapter0/host0/target0:0:0/0:0:0:0/block/sda has been removed.
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49136, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49144, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49152, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49160, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49168, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49176, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49184, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49192, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49200, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49208, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: md/raid:mdX: Disk failure on dm-5, disabling device.
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: md/raid:mdX: Operation continuing on 2 devices.
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: md: mdX: resync interrupted.
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal audit[57278]: USER_END pid=57278 uid=0 auid=0 ses=206 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 msg='op=login id=0 exe="/usr/sbin/sshd" hostname=? addr=::1 terminal=ssh res=success'
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal dmeventd[56501]: WARNING: Device #2 of raid5_ls array, vgroup0-lvol0, has failed.
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal dmeventd[56501]: WARNING: waiting for resynchronization to finish before initiating repair on RAID device vgroup0-lvol0.
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal dmeventd[56501]: WARNING: Couldn't find device with uuid FhanLT-Y18q-aw2f-cBoF-XpX3-h0eW-UNZHO2.
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal dmeventd[56501]: WARNING: VG vgroup0 is missing PV FhanLT-Y18q-aw2f-cBoF-XpX3-h0eW-UNZHO2 (last written to /dev/sda).
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal dmeventd[56501]: WARNING: Couldn't find device with uuid FhanLT-Y18q-aw2f-cBoF-XpX3-h0eW-UNZHO2.
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal dmeventd[56501]: Use 'lvconvert --repair vgroup0/lvol0' to replace failed device.
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: md: resync of RAID array mdX
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: md: mdX: resync done.

martinpitt · 2023-12-13T10:56:17Z

The first round failed for the wrong reason, retried. But it did save unique attachments 👍

mvollmer · 2023-12-13T11:59:49Z

Second round succeeded.

martinpitt · 2023-12-13T14:46:50Z

This time around it failed in the right way. As @mvollmer guessed, the screenshot doesn't actually show an error, but just being busy.

Let's check if increasing the timeout works, plus amplifying the test.

martinpitt · 2023-12-13T15:54:24Z

Meh, that's a testing farm networking error. Not sure when it'll time out, I'll just force-push again.

martinpitt · 2023-12-13T16:13:13Z

The previous run finished after all, and it failed the same way. It can't possibly take longer than 5 minutes to repair a 50 MB Spielzeug-RAID, so this is clearly a bug. I will try harder to reproduce it locally, now that we know that it is a race.

Explicitly check if the VG exists, and fail the cleanup if `vgremove` fails. This is more explicit than letting the next test run in an unclean VM, and also gives us more information as it doesn't hide stderr.

So that we get screenshots from all retries of a test, for example.

This reverts commit 3956fac.

martinpitt · 2023-12-13T17:03:56Z

I added D-bus debugging. I want to see if the .Repair() call actually ever finishes. It does seem to finish on the kernel side, and there is some chance that we miss the finish call in the UI. It could also be somewhere between libblockdev and udisks.

The resulting test run shows that the dialog makes the D-Bus call:

debug: dbus: {"call":["/org/freedesktop/UDisks2/lvm/vgroup0/lvol0","org.freedesktop.UDisks2.LogicalVolume","Repair",[["/org/freedesktop/UDisks2/block_devices/loop12"],{}]],"id":"79"}

But indeed there never is a response for call 79. It creates a job:

> debug: dbus: {
  "notify": {
    "/org/freedesktop/UDisks2/jobs/65": {
      "org.freedesktop.UDisks2.Job": {
        "Operation": "lvm-lvol-repair",
        "Progress": 0.0,
        "ProgressValid": false,
        "Bytes": 0,
        "Rate": 0,
        "StartTime": 1702488375182588,
        "ExpectedEndTime": 0,
        "Objects": [
          "/org/freedesktop/UDisks2/lvm/vgroup0/lvol0"
        ],
        "StartedByUID": 0,
        "Cancelable": true
      }
    }
  }
}

but this also doesn't get any updates. There are no signals or other D-Bus messages from UDisks after that point.

There has been no recent udisks rawhide update, nor libblockdev, nor lvm2.

I suppose the next step would be to try and reproduce that interactively on a reserved tmt instance and strace what udisks/libblockdev do -- I figure they issue some kind of vgrepair command, would be interesting to see if it finishes and what it does. I wonder if libblockdev has a way to increase verbosity, so that it logs what it does to the journal?

mvollmer · 2023-12-14T07:50:53Z

The journal actually looks fine, it seems the resyncing succeeded and was quick: But somehow the UI doesn't react to it?

In a successful run, we see the following at the end:

Dec 14 07:44:16 fedora-39-127-0-0-2-2201 kernel: md: mdX: recovery interrupted.
Dec 14 07:44:16 fedora-39-127-0-0-2-2201 systemd-homed[741]: block device /sys/devices/virtual/block/dm-9 has been removed.
Dec 14 07:44:16 fedora-39-127-0-0-2-2201 systemd-homed[741]: block device /sys/devices/virtual/block/dm-9 has been removed.
Dec 14 07:44:16 fedora-39-127-0-0-2-2201 kernel: md: recovery of RAID array mdX
Dec 14 07:44:17 fedora-39-127-0-0-2-2201 kernel: md: mdX: recovery done.
Dec 14 07:44:17 fedora-39-127-0-0-2-2201 dmeventd[2080]: Monitoring RAID device vgroup0-lvol0 for events.
Dec 14 07:44:17 fedora-39-127-0-0-2-2201 dmeventd[2080]: raid5_ls array, vgroup0-lvol0, is now in-sync.
Dec 14 07:44:17 fedora-39-127-0-0-2-2201 lvm[2230]: PV /dev/loop11 online, VG vgroup0 incomplete (need 1).
Dec 14 07:44:17 fedora-39-127-0-0-2-2201 lvm[2229]: PV /dev/loop12 online, VG vgroup0 incomplete (need 1).
Dec 14 07:44:17 fedora-39-127-0-0-2-2201 lvm[2228]: PV /dev/loop10 online, VG vgroup0 incomplete (need 1).

There is some action from dmeventd and lvm after the recovery. This is missing in a failed run.

Thus, I think there is something broken below libblockdev.

mvollmer · 2023-12-14T09:06:22Z

I wonder if libblockdev has a way to increase verbosity, so that it logs what it does to the journal?

There is a way via the C API, but udisks2 would need to call it, which it doesn't.

I have wrapped /usr/sbin/lvm to log which commands are run and how they finish.

mvollmer · 2023-12-14T13:01:03Z

I have wrapped /usr/sbin/lvm to log which commands are run and how they finish.

Alright, lvconvert --repair never finishes: https://artifacts.dev.testing-farm.io/849850a9-17c7-46ff-9b06-d5578e8c327f/work-optionalqoo1feu1/plans/all/optional/execute/data/guest/default-0/test/browser/optional-1/data/TestStorageLvm2-testRaidRepair6-fedora-40-localhost-22-FAIL.log.gz

I try to get a strace or debugging spew out of it.

mvollmer · 2023-12-14T13:28:42Z

Ok, from local testing it looks like lvconvert waits for dmeventd to start monitoring after the repair, and we know already that it doesn't... let's see what the logs say for a real failure.

martinpitt · 2023-12-14T14:25:02Z

It passed like any good Heisenbug ought to.. Re-running.

martinpitt · 2023-12-14T15:51:30Z

Now it failed properly again. I don't immediately see an strace in the journal but there's tons of debug info. Let's look at this tomorrow with @mvollmer

mvollmer · 2023-12-15T07:18:54Z

The interesting part of the log:

14:59:58 root[58360]: 14:59:58.453434 lvconvert[58359] device_mapper/libdm-deptree.c:1335  Resuming vgroup0-lvol0_rmeta_0 (253:0).
14:59:58 root[58360]: 14:59:58.453440 lvconvert[58359] device_mapper/libdm-common.c:2445  Udev cookie 0xd4ddbb6 (semid 32814) incremented to 9
14:59:58 root[58360]: 14:59:58.453444 lvconvert[58359] device_mapper/libdm-common.c:2678  Udev cookie 0xd4ddbb6 (semid 32814) assigned to RESUME task(5) with flags DISABLE_SUBSYSTEM_RULES DISABLE_DISK_RULES DISABLE_OTHER_RULES DISABLE_LIBRARY_FALLBACK         (0x2e)
14:59:58 root[58360]: 14:59:58.453449 lvconvert[58359] device_mapper/ioctl/libdm-iface.c:2097  dm resume   (253:0) [ noopencount flush ]   [2048] (*1)
14:59:58 root[58360]: 14:59:58.453468 lvconvert[58359] device_mapper/libdm-common.c:1485  vgroup0-lvol0_rmeta_0: Stacking NODE_ADD (253,0) 0:6 0660 [trust_udev]
14:59:58 root[58360]: 14:59:58.453472 lvconvert[58359] device_mapper/libdm-common.c:1496  vgroup0-lvol0_rmeta_0: Stacking NODE_READ_AHEAD 256 (flags=1)
14:59:58 root[58360]: 14:59:58.453476 lvconvert[58359] device_mapper/libdm-common.c:290  Suspended device counter reduced to 1
14:59:58 root[58360]: 14:59:58.453480 lvconvert[58359] device_mapper/libdm-deptree.c:1335  Resuming vgroup0-lvol0 (253:6).
14:59:58 root[58360]: 14:59:58.453486 lvconvert[58359] device_mapper/libdm-common.c:2445  Udev cookie 0xd4ddbb6 (semid 32814) incremented to 10
14:59:58 root[58360]: 14:59:58.453490 lvconvert[58359] device_mapper/libdm-common.c:2678  Udev cookie 0xd4ddbb6 (semid 32814) assigned to RESUME task(5) with flags DISABLE_LIBRARY_FALLBACK         (0x20)
14:59:58 root[58360]: 14:59:58.453495 lvconvert[58359] device_mapper/ioctl/libdm-iface.c:2097  dm resume   (253:6) [ noopencount flush ]   [2048] (*1)
14:59:58 kernel: md: mdX: recovery interrupted.
14:59:58 kernel: md: recovery of RAID array mdX
14:59:58 kernel: md: mdX: recovery done.
15:00:19 systemd[1]: systemd-hostnamed.service: Deactivated successfully.
15:00:19 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
15:00:19 audit: BPF prog-id=2437 op=UNLOAD

lvconvert is "dm resume"ing all device mapper things, but the last one gets stuck. Presumably, lvconvert does not receive some expected udev event with the expected cookie. Interestingly, mdraid is done recovering at the same time.

In a successful run, mdraid is done recovering much earlier, and I guess in a realistic scenario mdraid would be done recovering much much later and not interfere with lvconvert at all.

So we might run into this because our test disk is too small or too fast. IF this has to do with when mdraid is done recovering, which is just a guess.

mvollmer · 2023-12-15T07:29:50Z

Presumably, lvconvert does not receive some expected udev event with the expected cookie.

Hmm, no it's not waiting for anything according to a strace of a good run. Let's see...

This reverts commit 4704127.

mvollmer · 2023-12-15T11:41:17Z

Ok, lvconvert just hangs, very likely in a device mapper ioctl. (strace does not log the syscall since it hasn't completed)

10:09:11 root[69065]: 69066 write(2, "10:09:11.423052 lvconvert[69066] device_mapper/ioctl/libdm-iface.c:2097  ", 73) = 73
10:09:11 root[69065]: 69066 write(2, "dm resume   (253:0) [ noopencount flush ]   [2048] (*1)", 55) = 55
10:09:11 root[69065]: 69066 write(2, "\n", 1)                 = 1
10:09:11 root[69065]: 69066 ioctl(5, DM_DEV_SUSPEND, [{version=[4, 0, 0], data_size=2048, dev=makedev(0xfd, 0), event_nr=7217192, flags=DM_EXISTS_FLAG|DM_PERSISTENT_DEV_FLAG|DM_SKIP_BDGET_FLAG}] => [{version=[4, 48, 0], data_size=305, dev=makedev(0xfd, 0), name="vgroup0-lvol0_rmeta_0", uuid="LVM-e0qDDDOTLXbVtdzupmEwiMfvO4g9qTdKsdu1sObueWEORF37KBKIVfIgEynIQ48w", target_count=1, open_count=1, event_nr=0, flags=DM_EXISTS_FLAG|DM_PERSISTENT_DEV_FLAG|DM_ACTIVE_PRESENT_FLAG|DM_SKIP_BDGET_FLAG|DM_UEVENT_GENERATED_FLAG}]) = 0
10:09:11 root[69065]: 69066 write(2, "10:09:11.423649 lvconvert[69066] device_mapper/libdm-common.c:1485  ", 68) = 68
10:09:11 root[69065]: 69066 write(2, "vgroup0-lvol0_rmeta_0: Stacking NODE_ADD (253,0) 0:6 0660 [trust_udev]", 70) = 70
10:09:11 root[69065]: 69066 write(2, "\n", 1)                 = 1
10:09:11 root[69065]: 69066 write(2, "10:09:11.423815 lvconvert[69066] device_mapper/libdm-common.c:1496  ", 68) = 68
10:09:11 root[69065]: 69066 write(2, "vgroup0-lvol0_rmeta_0: Stacking NODE_READ_AHEAD 256 (flags=1)", 61) = 61
10:09:11 root[69065]: 69066 write(2, "\n", 1)                 = 1
10:09:11 root[69065]: 69066 write(2, "10:09:11.424105 lvconvert[69066] device_mapper/libdm-common.c:290  ", 67) = 67
10:09:11 root[69065]: 69066 write(2, "Suspended device counter reduced to 1", 37) = 37
10:09:11 root[62486]: KERNEL[937.341720] change   /devices/virtual/block/dm-0 (block)
10:09:11 root[57164]: KERNEL[937.342072] change   /devices/virtual/block/dm-0 (block)
10:09:11 root[65163]: KERNEL[937.343218] change   /devices/virtual/block/dm-0 (block)
10:09:11 root[59827]: KERNEL[937.341867] change   /devices/virtual/block/dm-0 (block)
10:09:11 root[69062]: 10:09:11.424105 lvconvert[69066] device_mapper/libdm-common.c:290  Suspended device counter reduced to 1
10:09:11 root[69065]: 69066 write(2, "\n", 1)                 = 1
10:09:11 root[69065]: 69066 write(2, "10:09:11.426340 lvconvert[69066] device_mapper/libdm-deptree.c:1335  ", 69) = 69
10:09:11 root[69065]: 69066 write(2, "Resuming vgroup0-lvol0 (253:6).", 31) = 31
10:09:11 root[69065]: 69066 write(2, "\n", 1)                 = 1
10:09:11 root[69062]: 10:09:11.426340 lvconvert[69066] device_mapper/libdm-deptree.c:1335  Resuming vgroup0-lvol0 (253:6).
10:09:11 root[69065]: 69066 semget(0xd4d2028, 1, 000)         = 65543
10:09:11 root[69065]: 69066 semtimedop(65543, [{sem_num=0, sem_op=1, sem_flg=0}], 1, NULL) = 0
10:09:11 root[69065]: 69066 semctl(65543, 0, GETVAL, NULL)    = 10
10:09:11 root[69065]: 69066 write(2, "10:09:11.426719 lvconvert[69066] device_mapper/libdm-common.c:2445  ", 68) = 68
10:09:11 root[69065]: 69066 write(2, "Udev cookie 0xd4d2028 (semid 65543) incremented to 10", 53) = 53
10:09:11 root[69062]: 10:09:11.426719 lvconvert[69066] device_mapper/libdm-common.c:2445  Udev cookie 0xd4d2028 (semid 65543) incremented to 10
10:09:11 root[69065]: 69066 write(2, "\n", 1)                 = 1
10:09:11 root[69065]: 69066 write(2, "10:09:11.427185 lvconvert[69066] device_mapper/libdm-common.c:2678  ", 68) = 68
10:09:11 root[69065]: 69066 write(2, "Udev cookie 0xd4d2028 (semid 65543) assigned to RESUME task(5) with flags DISABLE_LIBRARY_FALLBACK         (0x20)", 113) = 113
10:09:11 root[69062]: 10:09:11.427185 lvconvert[69066] device_mapper/libdm-common.c:2678  Udev cookie 0xd4d2028 (semid 65543) assigned to RESUME task(5) with flags DISABLE_LIBRARY_FALLBACK         (0x20)
10:09:11 root[69065]: 69066 write(2, "\n", 1)                 = 1
10:09:11 root[69065]: 69066 write(2, "10:09:11.430153 lvconvert[69066] device_mapper/ioctl/libdm-iface.c:2097  ", 73) = 73
10:09:11 root[69065]: 69066 write(2, "dm resume   (253:6) [ noopencount flush ]   [2048] (*1)", 55) = 55
10:09:11 root[69062]: 10:09:11.430153 lvconvert[69066] device_mapper/ioctl/libdm-iface.c:2097  dm resume   (253:6) [ noopencount flush ]   [2048] (*1)
10:09:11 root[69065]: 69066 write(2, "\n", 1)                 = 1
10:09:11 root[67876]: UDEV  [937.365534] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[65163]: UDEV  [937.365677] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[59827]: KERNEL[937.366479] change   /devices/virtual/block/dm-6 (block)
10:09:11 root[59827]: UDEV  [937.366533] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[62486]: UDEV  [937.365996] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[65163]: KERNEL[937.368672] change   /devices/virtual/block/dm-6 (block)
10:09:11 root[57164]: KERNEL[937.368118] change   /devices/virtual/block/dm-6 (block)
10:09:11 root[57164]: UDEV  [937.368177] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[67876]: KERNEL[937.368571] change   /devices/virtual/block/dm-6 (block)
10:09:11 root[62486]: KERNEL[937.368622] change   /devices/virtual/block/dm-6 (block)
10:09:11 root[67876]: UDEV  [937.395297] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[59827]: UDEV  [937.395798] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[65163]: UDEV  [937.395430] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[57164]: UDEV  [937.395980] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[62486]: UDEV  [937.399084] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[59827]: UDEV  [937.412151] change   /devices/virtual/block/dm-1 (block)
10:09:11 root[57164]: UDEV  [937.412237] change   /devices/virtual/block/dm-1 (block)
10:09:11 root[62486]: UDEV  [937.415253] change   /devices/virtual/block/dm-1 (block)
10:09:11 root[67876]: UDEV  [937.413388] change   /devices/virtual/block/dm-1 (block)
10:09:11 root[65163]: UDEV  [937.418307] change   /devices/virtual/block/dm-1 (block)
10:09:11 root[67876]: UDEV  [937.425707] change   /devices/virtual/block/dm-2 (block)
10:09:11 root[65163]: UDEV  [937.428412] change   /devices/virtual/block/dm-2 (block)
10:09:11 root[62486]: UDEV  [937.428491] change   /devices/virtual/block/dm-2 (block)
10:09:11 root[57164]: UDEV  [937.428814] change   /devices/virtual/block/dm-2 (block)
10:09:11 root[59827]: UDEV  [937.428878] change   /devices/virtual/block/dm-2 (block)
10:09:11 root[67876]: UDEV  [937.446285] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[65163]: UDEV  [937.446999] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[65163]: UDEV  [937.453140] change   /devices/virtual/block/dm-3 (block)
10:09:11 root[59827]: UDEV  [937.447582] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[57164]: UDEV  [937.452743] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[57164]: UDEV  [937.453353] change   /devices/virtual/block/dm-3 (block)
10:09:11 root[62486]: UDEV  [937.452947] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[67876]: UDEV  [937.453794] change   /devices/virtual/block/dm-3 (block)
10:09:11 root[59827]: UDEV  [937.453638] change   /devices/virtual/block/dm-3 (block)
10:09:11 root[62486]: UDEV  [937.453722] change   /devices/virtual/block/dm-3 (block)
10:09:11 root[65163]: UDEV  [937.461439] change   /devices/virtual/block/dm-0 (block)
10:09:11 root[62486]: UDEV  [937.459791] change   /devices/virtual/block/dm-0 (block)
10:09:11 root[67876]: UDEV  [937.459533] change   /devices/virtual/block/dm-0 (block)
10:09:11 root[59827]: UDEV  [937.463049] change   /devices/virtual/block/dm-0 (block)
10:09:11 root[57164]: UDEV  [937.461563] change   /devices/virtual/block/dm-0 (block)
10:09:11 root[67876]: UDEV  [937.470090] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[62486]: UDEV  [937.470199] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[59827]: UDEV  [937.470388] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[65163]: UDEV  [937.471202] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[57164]: UDEV  [937.472681] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[67876]: UDEV  [937.539775] change   /devices/virtual/block/dm-6 (block)
10:09:11 root[65163]: UDEV  [937.540380] change   /devices/virtual/block/dm-6 (block)
10:09:11 root[59827]: UDEV  [937.540323] change   /devices/virtual/block/dm-6 (block)
10:09:11 root[62486]: UDEV  [937.540115] change   /devices/virtual/block/dm-6 (block)
10:09:11 root[57164]: UDEV  [937.541119] change   /devices/virtual/block/dm-6 (block)
10:09:11 kernel: md: mdX: recovery done.
10:09:32 systemd[1]: systemd-hostnamed.service: Deactivated successfully.
10:09:32 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

This log shows that lvconvert does a ioctl after logging the "dm resume" line. This ioctl is missing from the logs after the "dm resume" for 253:6, probably because it hangs in the kernel.

mvollmer · 2023-12-15T19:54:17Z

And here is the kernel stack trace of the hanging ioctl:

[<0>] stop_sync_thread+0x12a/0x1d0
[<0>] __md_stop_writes+0x15/0xe0
[<0>] md_stop+0x12/0x20
[<0>] raid_dtr+0x27/0x40 [dm_raid]
[<0>] dm_table_destroy+0x5f/0x140
[<0>] dev_suspend+0x109/0x320
[<0>] ctl_ioctl+0x30c/0x5e0
[<0>] dm_ctl_ioctl+0xe/0x20
[<0>] __x64_sys_ioctl+0x97/0xd0
[<0>] do_syscall_64+0x64/0xe0
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76

martinpitt · 2023-12-19T10:59:27Z

FTR, I cherry-picked the first two commits into #19761, let's land them -- they are good.

mvollmer · 2024-01-02T08:51:26Z

Concluded in #19798.

mvollmer added the no-test For doc/workflow changes, or experiments which don't need a full CI run, label Dec 13, 2023

mvollmer force-pushed the test-unique-attachements branch from cb96338 to 547e364 Compare December 13, 2023 09:17

mvollmer force-pushed the test-unique-attachements branch from 547e364 to 1b85880 Compare December 13, 2023 11:59

martinpitt force-pushed the test-unique-attachements branch from aa0a213 to df92c68 Compare December 13, 2023 15:55

martinpitt and others added 4 commits December 13, 2023 17:42

test: Factorize and robustify VG cleanup

19b9bf5

Explicitly check if the VG exists, and fail the cleanup if `vgremove` fails. This is more explicit than letting the next test run in an unclean VM, and also gives us more information as it doesn't hide stderr.

test: Write attachments to unique filenames

dd120ac

So that we get screenshots from all retries of a test, for example.

DBG - Revert "fmf: Skip TestStorageLvm2.testRaidRepair on Rawhide"

e4276b7

This reverts commit 3956fac.

Amplify testRaidRepair

22b78ea

martinpitt force-pushed the test-unique-attachements branch from df92c68 to 22b78ea Compare December 13, 2023 16:42

martinpitt changed the title ~~test: Write attachements to unique filenames~~ test: Write attchements to unique filenames; debug testRaidRepair on rawhide Dec 13, 2023

martinpitt changed the title ~~test: Write attchements to unique filenames; debug testRaidRepair on rawhide~~ test: Write attachments to unique filenames; debug testRaidRepair on rawhide Dec 13, 2023

get d-dbus debugging

4704127

DBG - log lvm invocations

773d6fe

mvollmer force-pushed the test-unique-attachements branch from 605c96d to 773d6fe Compare December 14, 2023 09:45

DBG - add cleanup

07f1ca8

DBG - verbose repair

c848bf2

martinpitt force-pushed the test-unique-attachements branch from 653d80c to c848bf2 Compare December 14, 2023 14:27

mvollmer added 3 commits December 15, 2023 09:33

DBG - strace and udev monitor

0a33b19

Reverting "get d-dbus debugging"

9f6cbd5

This reverts commit 4704127.

DBG - add strace

9676dfe

DBG - use bigger disks so that mdraid recovery takes longer

aec5969

mvollmer force-pushed the test-unique-attachements branch from 1d9fb8c to 78bf91f Compare December 15, 2023 13:36

DBG - kernel stacks

e222bb6

mvollmer force-pushed the test-unique-attachements branch from 78bf91f to e222bb6 Compare December 15, 2023 14:55

DBG - bigger disks

b6f215e

martinpitt mentioned this pull request Dec 19, 2023

lib: Make date-fns/locale import compatible with 3.0; test improvements #19761

Merged

martinpitt changed the title ~~test: Write attachments to unique filenames; debug testRaidRepair on rawhide~~ debug testRaidRepair on rawhide Dec 19, 2023

mvollmer closed this Jan 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

debug testRaidRepair on rawhide #19745

debug testRaidRepair on rawhide #19745

mvollmer commented Dec 13, 2023 •

edited by martinpitt

Loading

martinpitt commented Dec 13, 2023

mvollmer commented Dec 13, 2023

martinpitt commented Dec 13, 2023

martinpitt commented Dec 13, 2023

martinpitt commented Dec 13, 2023

martinpitt commented Dec 13, 2023 •

edited

Loading

mvollmer commented Dec 14, 2023

mvollmer commented Dec 14, 2023

mvollmer commented Dec 14, 2023

mvollmer commented Dec 14, 2023

martinpitt commented Dec 14, 2023

martinpitt commented Dec 14, 2023

mvollmer commented Dec 15, 2023

mvollmer commented Dec 15, 2023

mvollmer commented Dec 15, 2023

mvollmer commented Dec 15, 2023

martinpitt commented Dec 19, 2023

mvollmer commented Jan 2, 2024

debug testRaidRepair on rawhide #19745

debug testRaidRepair on rawhide #19745

Conversation

mvollmer commented Dec 13, 2023 • edited by martinpitt Loading

martinpitt commented Dec 13, 2023

mvollmer commented Dec 13, 2023

martinpitt commented Dec 13, 2023

martinpitt commented Dec 13, 2023

martinpitt commented Dec 13, 2023

martinpitt commented Dec 13, 2023 • edited Loading

mvollmer commented Dec 14, 2023

mvollmer commented Dec 14, 2023

mvollmer commented Dec 14, 2023

mvollmer commented Dec 14, 2023

martinpitt commented Dec 14, 2023

martinpitt commented Dec 14, 2023

mvollmer commented Dec 15, 2023

mvollmer commented Dec 15, 2023

mvollmer commented Dec 15, 2023

mvollmer commented Dec 15, 2023

martinpitt commented Dec 19, 2023

mvollmer commented Jan 2, 2024

mvollmer commented Dec 13, 2023 •

edited by martinpitt

Loading

martinpitt commented Dec 13, 2023 •

edited

Loading