debug testRaidRepair on rawhide #19745

Closed

Conversation

@mvollmer (Member) commented Dec 13, 2023

The journal actually looks fine; it seems the resyncing succeeded and was quick. But somehow the UI doesn't react to it?

Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: device offline error, dev sda, sector 48888 op 0x0:(READ) flags 0x0 phys_seg 31 prio class 2
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal systemd-homed[752]: block device /sys/devices/pseudo_0/adapter0/host0/target0:0:0/0:0:0:0/block/sda has been removed.
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49136, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49144, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49152, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49160, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49168, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49176, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49184, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49192, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49200, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: mdX_raid5: attempt to access beyond end of device
                                                                    sda: rw=0, sector=49208, nr_sectors = 8 limit=0
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: md/raid:mdX: Disk failure on dm-5, disabling device.
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: md/raid:mdX: Operation continuing on 2 devices.
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: md: mdX: resync interrupted.
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal audit[57278]: USER_END pid=57278 uid=0 auid=0 ses=206 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 msg='op=login id=0 exe="/usr/sbin/sshd" hostname=? addr=::1 terminal=ssh res=success'
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal dmeventd[56501]: WARNING: Device #2 of raid5_ls array, vgroup0-lvol0, has failed.
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal dmeventd[56501]: WARNING: waiting for resynchronization to finish before initiating repair on RAID device vgroup0-lvol0.
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal dmeventd[56501]: WARNING: Couldn't find device with uuid FhanLT-Y18q-aw2f-cBoF-XpX3-h0eW-UNZHO2.
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal dmeventd[56501]: WARNING: VG vgroup0 is missing PV FhanLT-Y18q-aw2f-cBoF-XpX3-h0eW-UNZHO2 (last written to /dev/sda).
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal dmeventd[56501]: WARNING: Couldn't find device with uuid FhanLT-Y18q-aw2f-cBoF-XpX3-h0eW-UNZHO2.
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal dmeventd[56501]: Use 'lvconvert --repair vgroup0/lvol0' to replace failed device.
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: md: resync of RAID array mdX
Dec 13 12:32:03 ip-172-31-24-161.us-east-2.compute.internal kernel: md: mdX: resync done.

@mvollmer added the no-test label (for doc/workflow changes, or experiments which don't need a full CI run) Dec 13, 2023
@martinpitt (Member)

The first round failed for the wrong reason; retried. But it did save unique attachments 👍

@mvollmer (Member, Author)

Second round succeeded.

@martinpitt (Member)

This time around it failed in the right way. As @mvollmer guessed, the screenshot doesn't actually show an error, just a busy state.

Let's check if increasing the timeout works, plus amplifying the test.
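
A minimal sketch of what "increasing the timeout" amounts to, with a generic polling helper (helper name and values are hypothetical, not the actual test code):

import time

def wait_until(pred, timeout=300, interval=5):
    # Poll pred() until it returns truthy, or give up after `timeout` seconds.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if pred():
            return
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")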

@martinpitt (Member)

Meh, that's a Testing Farm networking error. Not sure when it'll time out; I'll just force-push again.

@martinpitt (Member)

The previous run finished after all, and it failed the same way. It can't possibly take longer than 5 minutes to repair a 50 MB toy RAID, so this is clearly a bug. I will try harder to reproduce it locally, now that we know that it is a race.

martinpitt and others added 4 commits December 13, 2023 17:42
Explicitly check if the VG exists, and fail the cleanup if `vgremove` fails. This is more explicit than letting the next test run in an unclean VM, and also gives us more information as it doesn't hide stderr.

So that we get screenshots from all retries of a test, for example.
@martinpitt changed the title from "test: Write attachements to unique filenames" to "test: Write attchements to unique filenames; debug testRaidRepair on rawhide" Dec 13, 2023
@martinpitt changed the title from "test: Write attchements to unique filenames; debug testRaidRepair on rawhide" to "test: Write attachments to unique filenames; debug testRaidRepair on rawhide" Dec 13, 2023
@martinpitt (Member) commented Dec 13, 2023

I added D-Bus debugging: I want to see if the .Repair() call ever actually finishes. It does seem to finish on the kernel side, and there is some chance that we miss the completion in the UI. It could also get lost somewhere between libblockdev and udisks.

The resulting test run shows that the dialog makes the D-Bus call:

debug: dbus: {"call":["/org/freedesktop/UDisks2/lvm/vgroup0/lvol0","org.freedesktop.UDisks2.LogicalVolume","Repair",[["/org/freedesktop/UDisks2/block_devices/loop12"],{}]],"id":"79"}

But indeed there is never a response for call 79. It creates a job:

debug: dbus: {
  "notify": {
    "/org/freedesktop/UDisks2/jobs/65": {
      "org.freedesktop.UDisks2.Job": {
        "Operation": "lvm-lvol-repair",
        "Progress": 0.0,
        "ProgressValid": false,
        "Bytes": 0,
        "Rate": 0,
        "StartTime": 1702488375182588,
        "ExpectedEndTime": 0,
        "Objects": [
          "/org/freedesktop/UDisks2/lvm/vgroup0/lvol0"
        ],
        "StartedByUID": 0,
        "Cancelable": true
      }
    }
  }
}

but this also doesn't get any updates. There are no signals or other D-Bus messages from UDisks after that point.
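
For reference, the same Repair call can be issued outside the browser; a minimal sketch with dbus-python, using the object paths from the log above (the timeout value is arbitrary):

import dbus

bus = dbus.SystemBus()
obj = bus.get_object("org.freedesktop.UDisks2",
                     "/org/freedesktop/UDisks2/lvm/vgroup0/lvol0")
lv = dbus.Interface(obj, "org.freedesktop.UDisks2.LogicalVolume")
# Repair(ao blocks, a{sv} options), matching the D-Bus call logged above;
# this blocks until UDisks replies, so a hang here mirrors the UI symptom.
lv.Repair([dbus.ObjectPath("/org/freedesktop/UDisks2/block_devices/loop12")],
          dbus.Dictionary({}, signature="sv"),
          timeout=600)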

There has been no recent rawhide update of udisks, libblockdev, or lvm2.

I suppose the next step would be to try to reproduce this interactively on a reserved tmt instance and strace what udisks/libblockdev do -- I figure they issue some kind of vgrepair command; it would be interesting to see whether it finishes and what it does. I wonder if libblockdev has a way to increase verbosity, so that it logs what it does to the journal?

@mvollmer (Member, Author)

> The journal actually looks fine; it seems the resyncing succeeded and was quick. But somehow the UI doesn't react to it?

In a successful run, we see the following at the end:

Dec 14 07:44:16 fedora-39-127-0-0-2-2201 kernel: md: mdX: recovery interrupted.
Dec 14 07:44:16 fedora-39-127-0-0-2-2201 systemd-homed[741]: block device /sys/devices/virtual/block/dm-9 has been removed.
Dec 14 07:44:16 fedora-39-127-0-0-2-2201 systemd-homed[741]: block device /sys/devices/virtual/block/dm-9 has been removed.
Dec 14 07:44:16 fedora-39-127-0-0-2-2201 kernel: md: recovery of RAID array mdX
Dec 14 07:44:17 fedora-39-127-0-0-2-2201 kernel: md: mdX: recovery done.
Dec 14 07:44:17 fedora-39-127-0-0-2-2201 dmeventd[2080]: Monitoring RAID device vgroup0-lvol0 for events.
Dec 14 07:44:17 fedora-39-127-0-0-2-2201 dmeventd[2080]: raid5_ls array, vgroup0-lvol0, is now in-sync.
Dec 14 07:44:17 fedora-39-127-0-0-2-2201 lvm[2230]: PV /dev/loop11 online, VG vgroup0 incomplete (need 1).
Dec 14 07:44:17 fedora-39-127-0-0-2-2201 lvm[2229]: PV /dev/loop12 online, VG vgroup0 incomplete (need 1).
Dec 14 07:44:17 fedora-39-127-0-0-2-2201 lvm[2228]: PV /dev/loop10 online, VG vgroup0 incomplete (need 1).

There is some action from dmeventd and lvm after the recovery. This is missing in a failed run.

Thus, I think there is something broken below libblockdev.

@mvollmer (Member, Author)

> I wonder if libblockdev has a way to increase verbosity, so that it logs what it does to the journal?

There is a way via the C API, but udisks2 would need to call it, which it doesn't.

I have wrapped /usr/sbin/lvm to log which commands are run and how they finish.
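
The wrapper itself isn't shown in the thread; roughly along these lines, assuming the real binary is moved aside to a hypothetical /usr/sbin/lvm.real:

#!/usr/bin/python3
# Hypothetical /usr/sbin/lvm wrapper: log every invocation and its exit
# status to the journal, then delegate to the real binary.
import subprocess
import sys
import syslog

syslog.openlog("lvm-wrapper")
syslog.syslog(syslog.LOG_INFO, "lvm run: %r" % (sys.argv[1:],))
ret = subprocess.run(["/usr/sbin/lvm.real"] + sys.argv[1:]).returncode
syslog.syslog(syslog.LOG_INFO, "lvm done: %r -> exit %d" % (sys.argv[1:], ret))
sys.exit(ret)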

@mvollmer (Member, Author)

> I have wrapped /usr/sbin/lvm to log which commands are run and how they finish.

Alright, lvconvert --repair never finishes: https://artifacts.dev.testing-farm.io/849850a9-17c7-46ff-9b06-d5578e8c327f/work-optionalqoo1feu1/plans/all/optional/execute/data/guest/default-0/test/browser/optional-1/data/TestStorageLvm2-testRaidRepair6-fedora-40-localhost-22-FAIL.log.gz

I'll try to get an strace or some debugging spew out of it.

@mvollmer (Member, Author)

OK, from local testing it looks like lvconvert waits for dmeventd to start monitoring again after the repair, and we already know that it doesn't. Let's see what the logs say for a real failure.
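
One way to check that state from the outside: lvs can report the per-segment monitoring status. A sketch (assumes the seg_monitor report field, which shows "monitored" once dmeventd has registered the LV):

import subprocess

# Print the dmeventd monitoring state of all LVs in vgroup0.
out = subprocess.run(
    ["lvs", "--noheadings", "-o", "lv_name,seg_monitor", "vgroup0"],
    capture_output=True, text=True, check=True)
print(out.stdout.strip())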

@martinpitt (Member)

It passed, like any good Heisenbug ought to. Re-running.

@martinpitt (Member)

Now it failed properly again. I don't immediately see an strace in the journal, but there's tons of debug info. Let's look at this tomorrow with @mvollmer.

@mvollmer (Member, Author)

The interesting part of the log:

14:59:58 root[58360]: 14:59:58.453434 lvconvert[58359] device_mapper/libdm-deptree.c:1335  Resuming vgroup0-lvol0_rmeta_0 (253:0).
14:59:58 root[58360]: 14:59:58.453440 lvconvert[58359] device_mapper/libdm-common.c:2445  Udev cookie 0xd4ddbb6 (semid 32814) incremented to 9
14:59:58 root[58360]: 14:59:58.453444 lvconvert[58359] device_mapper/libdm-common.c:2678  Udev cookie 0xd4ddbb6 (semid 32814) assigned to RESUME task(5) with flags DISABLE_SUBSYSTEM_RULES DISABLE_DISK_RULES DISABLE_OTHER_RULES DISABLE_LIBRARY_FALLBACK         (0x2e)
14:59:58 root[58360]: 14:59:58.453449 lvconvert[58359] device_mapper/ioctl/libdm-iface.c:2097  dm resume   (253:0) [ noopencount flush ]   [2048] (*1)
14:59:58 root[58360]: 14:59:58.453468 lvconvert[58359] device_mapper/libdm-common.c:1485  vgroup0-lvol0_rmeta_0: Stacking NODE_ADD (253,0) 0:6 0660 [trust_udev]
14:59:58 root[58360]: 14:59:58.453472 lvconvert[58359] device_mapper/libdm-common.c:1496  vgroup0-lvol0_rmeta_0: Stacking NODE_READ_AHEAD 256 (flags=1)
14:59:58 root[58360]: 14:59:58.453476 lvconvert[58359] device_mapper/libdm-common.c:290  Suspended device counter reduced to 1
14:59:58 root[58360]: 14:59:58.453480 lvconvert[58359] device_mapper/libdm-deptree.c:1335  Resuming vgroup0-lvol0 (253:6).
14:59:58 root[58360]: 14:59:58.453486 lvconvert[58359] device_mapper/libdm-common.c:2445  Udev cookie 0xd4ddbb6 (semid 32814) incremented to 10
14:59:58 root[58360]: 14:59:58.453490 lvconvert[58359] device_mapper/libdm-common.c:2678  Udev cookie 0xd4ddbb6 (semid 32814) assigned to RESUME task(5) with flags DISABLE_LIBRARY_FALLBACK         (0x20)
14:59:58 root[58360]: 14:59:58.453495 lvconvert[58359] device_mapper/ioctl/libdm-iface.c:2097  dm resume   (253:6) [ noopencount flush ]   [2048] (*1)
14:59:58 kernel: md: mdX: recovery interrupted.
14:59:58 kernel: md: recovery of RAID array mdX
14:59:58 kernel: md: mdX: recovery done.
15:00:19 systemd[1]: systemd-hostnamed.service: Deactivated successfully.
15:00:19 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
15:00:19 audit: BPF prog-id=2437 op=UNLOAD

lvconvert is "dm resume"-ing all the device-mapper devices, but the last one gets stuck. Presumably, lvconvert does not receive some expected udev event with the expected cookie. Interestingly, mdraid finishes recovering at the same time.

In a successful run, mdraid finishes recovering much earlier, and I guess in a realistic scenario mdraid would finish much, much later and not interfere with lvconvert at all.

So we might run into this because our test disk is too small or too fast. That is, if this has to do with when mdraid finishes recovering at all, which is just a guess.
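
If the missing-udev-event guess were right, the cookie's semaphore should stay outstanding. A sketch that polls device-mapper's cookie list while the repair runs (assumes dmsetup is installed):

import subprocess
import time

# List outstanding device-mapper udev cookies once per second; a cookie
# that never disappears would mean lvconvert is stuck waiting on its semaphore.
for _ in range(30):
    out = subprocess.run(["dmsetup", "udevcookies"],
                         capture_output=True, text=True).stdout.strip()
    print(out or "(no outstanding cookies)")
    time.sleep(1)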

@mvollmer (Member, Author)

> Presumably, lvconvert does not receive some expected udev event with the expected cookie.

Hmm, no, it's not waiting for anything, according to an strace of a good run. Let's see...

@mvollmer (Member, Author)

OK, lvconvert just hangs, very likely in a device-mapper ioctl. (strace does not log the syscall since it hasn't completed.)

10:09:11 root[69065]: 69066 write(2, "10:09:11.423052 lvconvert[69066] device_mapper/ioctl/libdm-iface.c:2097  ", 73) = 73
10:09:11 root[69065]: 69066 write(2, "dm resume   (253:0) [ noopencount flush ]   [2048] (*1)", 55) = 55
10:09:11 root[69065]: 69066 write(2, "\n", 1)                 = 1
10:09:11 root[69065]: 69066 ioctl(5, DM_DEV_SUSPEND, [{version=[4, 0, 0], data_size=2048, dev=makedev(0xfd, 0), event_nr=7217192, flags=DM_EXISTS_FLAG|DM_PERSISTENT_DEV_FLAG|DM_SKIP_BDGET_FLAG}] => [{version=[4, 48, 0], data_size=305, dev=makedev(0xfd, 0), name="vgroup0-lvol0_rmeta_0", uuid="LVM-e0qDDDOTLXbVtdzupmEwiMfvO4g9qTdKsdu1sObueWEORF37KBKIVfIgEynIQ48w", target_count=1, open_count=1, event_nr=0, flags=DM_EXISTS_FLAG|DM_PERSISTENT_DEV_FLAG|DM_ACTIVE_PRESENT_FLAG|DM_SKIP_BDGET_FLAG|DM_UEVENT_GENERATED_FLAG}]) = 0
10:09:11 root[69065]: 69066 write(2, "10:09:11.423649 lvconvert[69066] device_mapper/libdm-common.c:1485  ", 68) = 68
10:09:11 root[69065]: 69066 write(2, "vgroup0-lvol0_rmeta_0: Stacking NODE_ADD (253,0) 0:6 0660 [trust_udev]", 70) = 70
10:09:11 root[69065]: 69066 write(2, "\n", 1)                 = 1
10:09:11 root[69065]: 69066 write(2, "10:09:11.423815 lvconvert[69066] device_mapper/libdm-common.c:1496  ", 68) = 68
10:09:11 root[69065]: 69066 write(2, "vgroup0-lvol0_rmeta_0: Stacking NODE_READ_AHEAD 256 (flags=1)", 61) = 61
10:09:11 root[69065]: 69066 write(2, "\n", 1)                 = 1
10:09:11 root[69065]: 69066 write(2, "10:09:11.424105 lvconvert[69066] device_mapper/libdm-common.c:290  ", 67) = 67
10:09:11 root[69065]: 69066 write(2, "Suspended device counter reduced to 1", 37) = 37
10:09:11 root[62486]: KERNEL[937.341720] change   /devices/virtual/block/dm-0 (block)
10:09:11 root[57164]: KERNEL[937.342072] change   /devices/virtual/block/dm-0 (block)
10:09:11 root[65163]: KERNEL[937.343218] change   /devices/virtual/block/dm-0 (block)
10:09:11 root[59827]: KERNEL[937.341867] change   /devices/virtual/block/dm-0 (block)
10:09:11 root[69062]: 10:09:11.424105 lvconvert[69066] device_mapper/libdm-common.c:290  Suspended device counter reduced to 1
10:09:11 root[69065]: 69066 write(2, "\n", 1)                 = 1
10:09:11 root[69065]: 69066 write(2, "10:09:11.426340 lvconvert[69066] device_mapper/libdm-deptree.c:1335  ", 69) = 69
10:09:11 root[69065]: 69066 write(2, "Resuming vgroup0-lvol0 (253:6).", 31) = 31
10:09:11 root[69065]: 69066 write(2, "\n", 1)                 = 1
10:09:11 root[69062]: 10:09:11.426340 lvconvert[69066] device_mapper/libdm-deptree.c:1335  Resuming vgroup0-lvol0 (253:6).
10:09:11 root[69065]: 69066 semget(0xd4d2028, 1, 000)         = 65543
10:09:11 root[69065]: 69066 semtimedop(65543, [{sem_num=0, sem_op=1, sem_flg=0}], 1, NULL) = 0
10:09:11 root[69065]: 69066 semctl(65543, 0, GETVAL, NULL)    = 10
10:09:11 root[69065]: 69066 write(2, "10:09:11.426719 lvconvert[69066] device_mapper/libdm-common.c:2445  ", 68) = 68
10:09:11 root[69065]: 69066 write(2, "Udev cookie 0xd4d2028 (semid 65543) incremented to 10", 53) = 53
10:09:11 root[69062]: 10:09:11.426719 lvconvert[69066] device_mapper/libdm-common.c:2445  Udev cookie 0xd4d2028 (semid 65543) incremented to 10
10:09:11 root[69065]: 69066 write(2, "\n", 1)                 = 1
10:09:11 root[69065]: 69066 write(2, "10:09:11.427185 lvconvert[69066] device_mapper/libdm-common.c:2678  ", 68) = 68
10:09:11 root[69065]: 69066 write(2, "Udev cookie 0xd4d2028 (semid 65543) assigned to RESUME task(5) with flags DISABLE_LIBRARY_FALLBACK         (0x20)", 113) = 113
10:09:11 root[69062]: 10:09:11.427185 lvconvert[69066] device_mapper/libdm-common.c:2678  Udev cookie 0xd4d2028 (semid 65543) assigned to RESUME task(5) with flags DISABLE_LIBRARY_FALLBACK         (0x20)
10:09:11 root[69065]: 69066 write(2, "\n", 1)                 = 1
10:09:11 root[69065]: 69066 write(2, "10:09:11.430153 lvconvert[69066] device_mapper/ioctl/libdm-iface.c:2097  ", 73) = 73
10:09:11 root[69065]: 69066 write(2, "dm resume   (253:6) [ noopencount flush ]   [2048] (*1)", 55) = 55
10:09:11 root[69062]: 10:09:11.430153 lvconvert[69066] device_mapper/ioctl/libdm-iface.c:2097  dm resume   (253:6) [ noopencount flush ]   [2048] (*1)
10:09:11 root[69065]: 69066 write(2, "\n", 1)                 = 1
10:09:11 root[67876]: UDEV  [937.365534] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[65163]: UDEV  [937.365677] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[59827]: KERNEL[937.366479] change   /devices/virtual/block/dm-6 (block)
10:09:11 root[59827]: UDEV  [937.366533] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[62486]: UDEV  [937.365996] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[65163]: KERNEL[937.368672] change   /devices/virtual/block/dm-6 (block)
10:09:11 root[57164]: KERNEL[937.368118] change   /devices/virtual/block/dm-6 (block)
10:09:11 root[57164]: UDEV  [937.368177] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[67876]: KERNEL[937.368571] change   /devices/virtual/block/dm-6 (block)
10:09:11 root[62486]: KERNEL[937.368622] change   /devices/virtual/block/dm-6 (block)
10:09:11 root[67876]: UDEV  [937.395297] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[59827]: UDEV  [937.395798] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[65163]: UDEV  [937.395430] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[57164]: UDEV  [937.395980] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[62486]: UDEV  [937.399084] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[59827]: UDEV  [937.412151] change   /devices/virtual/block/dm-1 (block)
10:09:11 root[57164]: UDEV  [937.412237] change   /devices/virtual/block/dm-1 (block)
10:09:11 root[62486]: UDEV  [937.415253] change   /devices/virtual/block/dm-1 (block)
10:09:11 root[67876]: UDEV  [937.413388] change   /devices/virtual/block/dm-1 (block)
10:09:11 root[65163]: UDEV  [937.418307] change   /devices/virtual/block/dm-1 (block)
10:09:11 root[67876]: UDEV  [937.425707] change   /devices/virtual/block/dm-2 (block)
10:09:11 root[65163]: UDEV  [937.428412] change   /devices/virtual/block/dm-2 (block)
10:09:11 root[62486]: UDEV  [937.428491] change   /devices/virtual/block/dm-2 (block)
10:09:11 root[57164]: UDEV  [937.428814] change   /devices/virtual/block/dm-2 (block)
10:09:11 root[59827]: UDEV  [937.428878] change   /devices/virtual/block/dm-2 (block)
10:09:11 root[67876]: UDEV  [937.446285] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[65163]: UDEV  [937.446999] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[65163]: UDEV  [937.453140] change   /devices/virtual/block/dm-3 (block)
10:09:11 root[59827]: UDEV  [937.447582] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[57164]: UDEV  [937.452743] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[57164]: UDEV  [937.453353] change   /devices/virtual/block/dm-3 (block)
10:09:11 root[62486]: UDEV  [937.452947] change   /devices/virtual/block/dm-7 (block)
10:09:11 root[67876]: UDEV  [937.453794] change   /devices/virtual/block/dm-3 (block)
10:09:11 root[59827]: UDEV  [937.453638] change   /devices/virtual/block/dm-3 (block)
10:09:11 root[62486]: UDEV  [937.453722] change   /devices/virtual/block/dm-3 (block)
10:09:11 root[65163]: UDEV  [937.461439] change   /devices/virtual/block/dm-0 (block)
10:09:11 root[62486]: UDEV  [937.459791] change   /devices/virtual/block/dm-0 (block)
10:09:11 root[67876]: UDEV  [937.459533] change   /devices/virtual/block/dm-0 (block)
10:09:11 root[59827]: UDEV  [937.463049] change   /devices/virtual/block/dm-0 (block)
10:09:11 root[57164]: UDEV  [937.461563] change   /devices/virtual/block/dm-0 (block)
10:09:11 root[67876]: UDEV  [937.470090] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[62486]: UDEV  [937.470199] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[59827]: UDEV  [937.470388] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[65163]: UDEV  [937.471202] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[57164]: UDEV  [937.472681] change   /devices/virtual/block/dm-8 (block)
10:09:11 root[67876]: UDEV  [937.539775] change   /devices/virtual/block/dm-6 (block)
10:09:11 root[65163]: UDEV  [937.540380] change   /devices/virtual/block/dm-6 (block)
10:09:11 root[59827]: UDEV  [937.540323] change   /devices/virtual/block/dm-6 (block)
10:09:11 root[62486]: UDEV  [937.540115] change   /devices/virtual/block/dm-6 (block)
10:09:11 root[57164]: UDEV  [937.541119] change   /devices/virtual/block/dm-6 (block)
10:09:11 kernel: md: mdX: recovery done.
10:09:32 systemd[1]: systemd-hostnamed.service: Deactivated successfully.
10:09:32 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

This log shows that lvconvert does an ioctl right after logging the "dm resume" line. This ioctl is missing from the log after the "dm resume" for 253:6, probably because it hangs in the kernel.

@mvollmer (Member, Author)

And here is the kernel stack trace of the hanging ioctl:

[<0>] stop_sync_thread+0x12a/0x1d0
[<0>] __md_stop_writes+0x15/0xe0
[<0>] md_stop+0x12/0x20
[<0>] raid_dtr+0x27/0x40 [dm_raid]
[<0>] dm_table_destroy+0x5f/0x140
[<0>] dev_suspend+0x109/0x320
[<0>] ctl_ioctl+0x30c/0x5e0
[<0>] dm_ctl_ioctl+0xe/0x20
[<0>] __x64_sys_ioctl+0x97/0xd0
[<0>] do_syscall_64+0x64/0xe0
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
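
A trace like this can be read from /proc/<pid>/stack once the hang reproduces; a minimal sketch (needs root):

import pathlib

# Dump the kernel stack of every process named lvconvert; a task stuck
# in an ioctl shows its in-kernel call chain, as above.
for comm in pathlib.Path("/proc").glob("[0-9]*/comm"):
    if comm.read_text().strip() == "lvconvert":
        print(f"--- pid {comm.parent.name} ---")
        print((comm.parent / "stack").read_text())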

@martinpitt (Member)

FTR, I cherry-picked the first two commits into #19761; let's land them -- they are good.

@martinpitt changed the title from "test: Write attachments to unique filenames; debug testRaidRepair on rawhide" to "debug testRaidRepair on rawhide" Dec 19, 2023
@mvollmer (Member, Author) commented Jan 2, 2024

Concluded in #19798.

@mvollmer closed this Jan 2, 2024