
[Snaps] Filesystem errors when loading a snapshot on a different machine #4036

Closed
3 tasks done
CuriousGeorgiy opened this issue Aug 15, 2023 · 17 comments

CuriousGeorgiy commented Aug 15, 2023

Describe the bug

I am hacking on firecracker-containerd to support firecracker snapshots, and I am facing the following problem.

I create a snapshot of a microVM running an nginx container, and then I try to load this snapshot on a different machine (to preserve the disk state, I simply commit a container snapshot).

When restoring the disk state, I patch (see also #4014) the snapshot's disk device path for the container snapshot device to point to a fresh containerd snapshot mount point (the previously committed image is mounted).

Snapshot loading succeeds and the container is responsive even to HTTP requests (I am dropping the network setup details since I don't have problems with it now), but the nginx container returns internal server errors, and the following error appears in the microVM's kernel logs:

DEBU[2023-08-15T07:21:33.072516371-04:00] [   38.858355] EXT4-fs error (device vdb): ext4_find_entry:1447: inode #262: comm nginx: checksumming directory block 0  jailer=noop runtime=aws.firecracker vmID=1 vmm_stream=stdout
DEBU[2023-08-15T07:21:33.076584363-04:00] [   38.862540] EXT4-fs error (device vdb): ext4_find_entry:1447: inode #262: comm nginx: checksumming directory block 0  jailer=noop runtime=aws.firecracker vmID=1 vmm_stream=stdout

It seems like the restored disk state (via container snapshot commit) is inconsistent with the microVM's disk state.

In this regard, do you have any ideas of what the problem is (if it doesn't simply come down to disk state inconsistency) and how it could be solved? It seems like this requires a sort of filesystem checkpoint, but I couldn't find any relevant information on the web.

This issue is more a question than a bug report, so I am not adding any other context (though I can if it would help clarify my use case).

Checks

  • Have you searched the Firecracker Issues database for similar problems?
  • Have you read the existing relevant Firecracker documentation?
  • Are you certain the bug being reported is a Firecracker issue?
@CuriousGeorgiy CuriousGeorgiy changed the title [Bug] Filesystem errors when loading a snapshot on a different machine [Snaps] Filesystem errors when loading a snapshot on a different machine Aug 15, 2023

roypat commented Aug 16, 2023

Hi,
thanks for opening this issue! Firecracker does not support renaming resources at snapshot restore. While we are aware of this limitation, we have not yet explored what changes to Firecracker would be required to properly support it.
Can you reproduce this issue using an unpatched Firecracker binary? E.g., try to restore a snapshot where the path to the disk's backing file on the new machine is identical to the path on the machine where the snapshot was taken.


CuriousGeorgiy commented Aug 16, 2023

@roypat

Hi! Okay, I will try tinkering around with mounting to achieve this, though I don't see how the block device path can affect functionality.

Also, some additional context just in case, I save the disk state the following way (step 4 should actually be redundant, but I added it just in case):

  1. Pause microVM
  2. Create microVM snapshot
  3. dmsetup suspend on the corresponding container mount
  4. ioctl BLKFLSBUF on the corresponding container mount
  5. dmsetup resume on the corresponding container mount
  6. an analog of docker commit for containerd (it simply creates a new container image that accounts for the RW (topmost) layer of the container)
  7. Resume microVM

I can provide a code snippet for clarity if needed.
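For reference, steps 3–5 above could be sketched as the following commands. The device-mapper name is hypothetical, and `blockdev --flushbufs` is used here as the standard CLI way to issue the BLKFLSBUF ioctl:

```shell
# Hypothetical device-mapper name for the container's snapshot device.
DEV=/dev/mapper/fcci--vg-ctr_snap0

# 3. Block further I/O; any I/O already mapped by the device is flushed.
dmsetup suspend "$DEV"

# 4. Flush the host's buffer cache for the device
#    (blockdev --flushbufs issues the BLKFLSBUF ioctl).
blockdev --flushbufs "$DEV"

# 5. Allow I/O to the device again.
dmsetup resume "$DEV"
```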

@CuriousGeorgiy

/cc @pb8o


CuriousGeorgiy commented Aug 17, 2023

@roypat I got the same result using an unpatched firecracker binary.

@bchalios

Hey @CuriousGeorgiy, some thoughts:

Is there any chance that these:

  1. dmsetup suspend on the corresponding container mount
  2. ioctl BLKFLSBUF on the corresponding container mount

modify the contents of the file backing the drive?

Also, could you try to perform a flush operation in the guest and ensure it completes before pausing the microVM and taking the snapshot?
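A minimal sketch of such a guest-side flush, assuming the container's filesystem is mounted at a hypothetical path `/mnt/ctr` inside the guest:

```shell
# Inside the guest, before the host pauses the microVM:
sync                    # flush dirty pages for all mounted filesystems
fsfreeze -f /mnt/ctr    # quiesce the filesystem; new writes block

# ... host pauses the microVM and takes the snapshot ...

# Inside the guest, after the host resumes the microVM:
fsfreeze -u /mnt/ctr    # thaw the filesystem so writes proceed again
```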


CuriousGeorgiy commented Aug 17, 2023

@bchalios

No, they only flush the disk state.

As per the dmsetup suspend man page [1]:

Suspends a device. Any I/O that has already been mapped
by the device but has not yet completed will be flushed.
Any further I/O to that device will be postponed for as
long as the device is suspended.

As per ioctl BLKFLSBUF, it is supposed to flush the buffer cache for the device [2].

I also added an explicit fsync call on the device, but it didn't help and I still get the same errors.

Also, could you try to perform a flush operation in the guest and ensure it completes before pausing the microVM and taking the snapshot?

AFAIC, this is not enough, since the disk state is apparently changed during snapshot creation: if I suspend the container snapshot device via dmsetup suspend (which blocks I/O) and then take a snapshot of the microVM, the firecracker process hangs in the D state.

So, what I am doing right now is trying to flush the guest after creating a snapshot of the microVM, while it is still in the paused state.

Footnotes

  1. https://man7.org/linux/man-pages/man8/dmsetup.8.html

  2. https://github.com/torvalds/linux/blob/4853c74bd7ab7fdb83f319bd9ace8a08c031e9b6/tools/include/uapi/linux/fs.h#L152


CuriousGeorgiy commented Aug 18, 2023

@bchalios

Also, could you try to perform a flush operation in the guest and ensure it completes before pausing the microVM and taking the snapshot?

Sorry, looks like I got your question wrong. What do you mean by performing a flush operation in the guest? If you mean manually flushing files used by the container, then:

  1. How is this generalizable to any container/process?
  2. I tried running a simple Golang HTTP server (no filesystem manipulation except for possible lazy page loading of the binary), and when I restore the container and send a request, it crashes (panics), so I think the problem is not with file flushing.

AFAIC, there is no firecracker API exposed for performing a flush operation in the guest, except for snapshot creation, which is supposed to do exactly this:

The generated snapshot files are immediately available to be used (current process releases ownership). At this point, the block devices backing files should be backed up externally by the user. Please note that block device contents are only guaranteed to be committed/flushed to the host FS, but not necessarily to the underlying persistent storage (could still live in host FS cache). [1]

Footnotes

  1. https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/snapshot-support.md#creating-snapshots
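For reference, that snapshot-creation flush is driven through the Firecracker API socket. A minimal sketch, with a hypothetical socket path and output file names:

```shell
SOCK=/tmp/firecracker.socket   # hypothetical API socket path

# Snapshot creation requires a paused microVM.
curl --unix-socket "$SOCK" -X PATCH 'http://localhost/vm' \
  -H 'Content-Type: application/json' \
  -d '{"state": "Paused"}'

# Create a full snapshot; per the docs, block device contents are
# committed/flushed to the host FS as part of this operation.
curl --unix-socket "$SOCK" -X PUT 'http://localhost/snapshot/create' \
  -H 'Content-Type: application/json' \
  -d '{"snapshot_type": "Full",
       "snapshot_path": "./snapshot_file",
       "mem_file_path": "./mem_file"}'
```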


bchalios commented Aug 18, 2023

How is this generalizable to any container?

It doesn't have to be on files; you can do it for the drives using sync. In any case, I was just trying to root-cause. My thinking went: maybe while pausing the microVM some write operation was in flight, and maybe this triggers a corner case we don't handle, i.e., the write was interrupted and the guest does not know how to handle this.

Generally speaking, we mainly test the snapshotting feature with guests in a sort of "quiesced" state, so there might be a corner case we are missing.


CuriousGeorgiy commented Aug 18, 2023

@bchalios

It is not on files, but you can do that for the drives using sync.

Yeah, that's what I am already trying to achieve using dmsetup suspend, fsync and ioctl BLKFLSBUF (see comment above).

My thinking went: maybe while pausing the microVM some write operation was in flight, and maybe this triggers a corner case we don't handle, i.e., the write was interrupted and the guest does not know how to handle this.

I expect this to be handled by the snapshot creation request (and, as I noticed above, I observe some I/O on the container mount drive during snapshot creation).

Is there anything I could do to try to narrow down the problem and make a simple reproducer? As I said, I tried running a very simple Golang HTTP server and doing some filesystem I/O, but it just crashes even without filesystem manipulation, so I'm not sure what else I can try.

@bchalios

Could you try to reproduce the same scenario without firecracker-containerd?

  1. Launch a firecracker microVM "by hand", you could follow the Getting Started Guide
  2. Launch your Go server inside the microVM.
  3. Pause & snapshot
  4. Resume from the snapshot

And see if it reproduces. After step 4, you could also try to patch the drive to another path, so that we can see whether that's the problem.

@bchalios

Regarding the last comment, you can:

4a. LoadSnapshot (without resume)
4b. Patch the drive file path on the host
4c. Resume the microVM.
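Steps 4a–4c above can be sketched against the API socket roughly as follows. The socket path, drive ID, and new backing-file path are hypothetical:

```shell
SOCK=/tmp/firecracker.socket   # hypothetical API socket path

# 4a. Load the snapshot without resuming the microVM.
curl --unix-socket "$SOCK" -X PUT 'http://localhost/snapshot/load' \
  -H 'Content-Type: application/json' \
  -d '{"snapshot_path": "./snapshot_file",
       "mem_backend": {"backend_type": "File", "backend_path": "./mem_file"},
       "resume_vm": false}'

# 4b. Point the drive at its new backing file on this host
#     (drive ID "scratch" and the path are hypothetical).
curl --unix-socket "$SOCK" -X PATCH 'http://localhost/drives/scratch' \
  -H 'Content-Type: application/json' \
  -d '{"drive_id": "scratch", "path_on_host": "/new/path/to/disk.img"}'

# 4c. Resume the microVM.
curl --unix-socket "$SOCK" -X PATCH 'http://localhost/vm' \
  -H 'Content-Type: application/json' \
  -d '{"state": "Resumed"}'
```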


CuriousGeorgiy commented Aug 20, 2023

@bchalios

I have tried what you suggested with a RW rootfs and didn't encounter any problems; everything works okay.

Regarding the last comment, you can:

4a. LoadSnapshot (without resume)
4b. Patch the drive file path on the host
4c. Resume the microVM.

I haven't yet tried doing the same stuff on an attached block device like firecracker-containerd does.

In the meantime, I have tried patching the firecracker-go-sdk and adding patch drive requests in between loading a snapshot without resuming and resuming the microVM, and got the following errors in the logs:

DEBU[2023-08-20T15:15:59.964532664-04:00] Running handler fcinit.LoadSnapshot           runtime=aws.firecracker
INFO[2023-08-20T15:15:59.973124142-04:00] Updating drive /var/lib/firecracker-containerd/runtime/default-rootfs.img, slot root_drive, root true.  runtime=aws.firecracker
INFO[2023-08-20T15:15:59.973851533-04:00] Updated drive /var/lib/firecracker-containerd/runtime/default-rootfs.img: [PATCH /drives/{drive_id}][204] patchGuestDriveByIdNoContent   runtime=aws.firecracker
DEBU[2023-08-20T15:15:59.973889025-04:00] updateDrive returned for /var/lib/firecracker-containerd/runtime/default-rootfs.img  runtime=aws.firecracker
INFO[2023-08-20T15:15:59.973939266-04:00] Updating drive /var/lib/firecracker-containerd/shim-base/vanilla-firecracker-snapshots#0/ctrstub0, slot MN2HE43UOVRDA, root false.  runtime=aws.firecracker
INFO[2023-08-20T15:15:59.974400352-04:00] Updated drive /var/lib/firecracker-containerd/shim-base/vanilla-firecracker-snapshots#0/ctrstub0: [PATCH /drives/{drive_id}][204] patchGuestDriveByIdNoContent   runtime=aws.firecracker
DEBU[2023-08-20T15:15:59.974444103-04:00] updateDrive returned for /var/lib/firecracker-containerd/shim-base/vanilla-firecracker-snapshots#0/ctrstub0  runtime=aws.firecracker
DEBU[2023-08-20T15:15:59.975089233-04:00] VM resumed successfully                       runtime=aws.firecracker
DEBU[2023-08-20T15:15:59.975147605-04:00] snapshot loaded successfully                  runtime=aws.firecracker
INFO[2023-08-20T15:15:59.975205975-04:00] calling agent                                 runtime=aws.firecracker vmID=0
DEBU[2023-08-20T15:15:59.978783082-04:00] [    4.331801] virtio_blk virtio0: new size: 125800 512-byte logical blocks (64.4 MB/61.4 MiB)  jailer=noop runtime=aws.firecracker vmID=0 vmm_stream=stdout
DEBU[2023-08-20T15:15:59.981208538-04:00] [    4.334353] virtio_blk virtio1: new size: 1 512-byte logical blocks (512 B/512 B)  jailer=noop runtime=aws.firecracker vmID=0 vmm_stream=stdout
DEBU[2023-08-20T15:15:59.982562932-04:00] [    4.335939] vdb: detected capacity change from 10737418240 to 512  jailer=noop runtime=aws.firecracker vmID=0 vmm_stream=stdout
DEBU[2023-08-20T15:15:59.994733454-04:00] [    4.348439] VFS: busy inodes on changed media or resized disk vdb  jailer=noop runtime=aws.firecracker vmID=0 vmm_stream=stdout
INFO[2023-08-20T15:16:00.077125413-04:00] successfully started the VM                   runtime=aws.firecracker vmID=0
DEBU[2023-08-20T15:16:00.077233305-04:00] event published                               ns=vanilla-firecracker-snapshots runtime=aws.firecracker topic=/firecracker-vm/start type=VMStart
DEBU[2023-08-20T15:16:00.077696072-04:00] event forwarded                               ns=vanilla-firecracker-snapshots topic=/firecracker-vm/start type=VMStart
DEBU[2023-08-20T15:16:02.527490728-04:00] [    6.880446] attempt to access beyond end of device  jailer=noop runtime=aws.firecracker vmID=0 vmm_stream=stdout
DEBU[2023-08-20T15:16:02.529025777-04:00] [    6.882101] vdb: rw=2049, want=8650760, limit=1  jailer=noop runtime=aws.firecracker vmID=0 vmm_stream=stdout
DEBU[2023-08-20T15:16:02.531628879-04:00] [    6.883647] Buffer I/O error on dev vdb, logical block 1081344, lost sync page write  jailer=noop runtime=aws.firecracker vmID=0 vmm_stream=stdout
DEBU[2023-08-20T15:16:02.534177637-04:00] [    6.886330] JBD2: Error -5 detected when updating journal superblock for vdb-8.  jailer=noop runtime=aws.firecracker vmID=0 vmm_stream=stdout
DEBU[2023-08-20T15:16:02.535708470-04:00] [    6.888824] Aborting journal on device vdb-8.  jailer=noop runtime=aws.firecracker vmID=0 vmm_stream=stdout
DEBU[2023-08-20T15:16:02.537336420-04:00] [    6.890319] attempt to access beyond end of device  jailer=noop runtime=aws.firecracker vmID=0 vmm_stream=stdout
DEBU[2023-08-20T15:16:02.538357728-04:00] [    6.891948] vdb: rw=133121, want=8650760, limit=1  jailer=noop runtime=aws.firecracker vmID=0 vmm_stream=stdout
DEBU[2023-08-20T15:16:02.540002989-04:00] [    6.892971] Buffer I/O error on dev vdb, logical block 1081344, lost sync page write  jailer=noop runtime=aws.firecracker vmID=0 vmm_stream=stdout
DEBU[2023-08-20T15:16:02.541503916-04:00] [    6.894613] JBD2: Error -5 detected when updating journal superblock for vdb-8.  jailer=noop runtime=aws.firecracker vmID=0 vmm_stream=stdout

Can the whole problem be related to data integrity issues with attachable drives that are described here?

@bchalios

It could be, yes. Are these drives mounted when you resume the microVM? I guess so, right? If that's the case, is there a way we could unmount them before snapshotting?


CuriousGeorgiy commented Aug 21, 2023

@bchalios

I haven't yet tried doing the same stuff on an attached block device like firecracker-containerd does.

Everything also works when mounting a regular file (formatted as a filesystem) as a drive and running the server there.

AFAIC, to a rough approximation, the only thing firecracker-containerd does differently (and the last thing I can do to narrow this issue down to Firecracker) is that it uses devmapper for creating drives (i.e., virtual block devices). I will try setting this up tomorrow.

Are these drives mounted when you resume the microVM? I guess, so right?

AFAIC, yes.

If that's the case, is there a way we could unmount them before snapshotting?

I guess that requires hacking the firecracker-containerd agent running inside firecracker, not sure I can figure it out quickly.


CuriousGeorgiy commented Aug 22, 2023

@bchalios

AFAIC, to a rough approximation, the only thing firecracker-containerd does differently (and the last thing I can do to narrow this issue down to Firecracker) is that it uses devmapper for creating drives (i.e., virtual block devices). I will try setting this up tomorrow.

I also tried using a thin pool device for the drive, like firecracker-containerd does, and everything still works fine.

I guess further investigation requires digging deep into firecracker-containerd internals, and we should move this issue there. What do you think?

@bchalios

@CuriousGeorgiy thanks a lot for the effort you've put into this. I think it makes sense, yes. It looks like it's not a Firecracker problem, so the firecracker-containerd folks might be better suited to help you with this.

Feel free to move it. You can always open a new issue here if you find something that is Firecracker related.

@bchalios

I'll close this now since the issue is tracked in firecracker-microvm/firecracker-containerd#759.

Feel free to re-open if you realize the problem is with Firecracker after all.
