vfs: redesign MemFS strict mode #3888

RaduBerinde · 2024-08-26T13:54:11Z

Currently we use MemFS in "strict" mode to test crash recovery in
the following way:

- at the desired crash point we call `SetIgnoreSyncs(true)`
- after we close the database, we call `ResetToSyncedState()` and
  `SetIgnoreSyncs(false)` and proceed using the same filesystem.

This model is fragile because the operation whose crash we are
simulating and the operation that follows it both use the same
filesystem. For example, a background operation from the "crashed"
process that is still finishing some cleanup could in principle
interfere with the new process.

We switch to a "crash clone" model, where we instead extract a
crash-consistent copy of the filesystem; further testing can proceed
on this independent copy. This allows for more usage patterns - e.g.
we can take multiple crash clones at various points and check them all
afterwards.
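
To illustrate the idea only, here is a minimal sketch of a crash clone using a toy in-memory filesystem. The names `toyFS`, `toyFile`, `Append`, `Sync`, `Read` and this `CrashClone` are hypothetical and are not the pebble `vfs` API; the toy also ignores directory metadata, which a real strict-mode filesystem would have to track.

```go
package main

import "fmt"

// toyFile holds a file's full contents plus the length of the prefix that
// was made durable by the most recent Sync.
type toyFile struct {
	data   []byte
	synced int
}

// toyFS is a hypothetical, minimal stand-in for a strict-mode in-memory
// filesystem. It is not the pebble vfs.MemFS API.
type toyFS struct {
	files map[string]*toyFile
}

func newToyFS() *toyFS { return &toyFS{files: make(map[string]*toyFile)} }

// Append adds bytes to the named file, creating it if needed. The new bytes
// are not durable until Sync is called.
func (fs *toyFS) Append(name, s string) {
	f, ok := fs.files[name]
	if !ok {
		f = &toyFile{}
		fs.files[name] = f
	}
	f.data = append(f.data, s...)
}

// Sync marks everything written to the named file so far as durable.
func (fs *toyFS) Sync(name string) {
	if f, ok := fs.files[name]; ok {
		f.synced = len(f.data)
	}
}

// Read returns the current contents of the named file ("" if it is missing).
func (fs *toyFS) Read(name string) string {
	if f, ok := fs.files[name]; ok {
		return string(f.data)
	}
	return ""
}

// CrashClone returns an independent filesystem that contains only the synced
// prefix of every file. Later writes to fs do not affect the clone.
func (fs *toyFS) CrashClone() *toyFS {
	clone := newToyFS()
	for name, f := range fs.files {
		durable := append([]byte(nil), f.data[:f.synced]...)
		clone.files[name] = &toyFile{data: durable, synced: len(durable)}
	}
	return clone
}

func main() {
	fs := newToyFS()
	fs.Append("wal", "a")
	fs.Sync("wal")
	fs.Append("wal", "b") // unsynced at the first crash point
	crash1 := fs.CrashClone()

	fs.Sync("wal")
	fs.Append("wal", "c") // unsynced at the second crash point
	crash2 := fs.CrashClone()

	// The "crashed" process can keep mutating fs without touching the clones.
	fs.Append("wal", "late-cleanup")

	fmt.Println(crash1.Read("wal")) // a
	fmt.Println(crash2.Read("wal")) // ab
}
```

Because each clone is a separate object, a straggling background write to the original filesystem cannot leak into the state that the recovery test inspects, and several crash points can be captured and checked afterwards.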

We also add functionality to randomly retain part of the unsynced
data (which is closer to what would happen in a real crash).
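
As a rough sketch of what retaining part of the unsynced data could look like, the toy below keeps each file's unsynced tail with a given probability. The names (`lossyFS`, `lossyFile`, and this `CrashClone(unsyncedPercent, seed)` signature) and the per-file granularity are assumptions made for illustration; they do not describe how the PR implements it.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// lossyFile holds a file's contents and the length of its synced prefix.
type lossyFile struct {
	data   []byte
	synced int
}

// lossyFS is a hypothetical toy filesystem, not the pebble vfs.MemFS API.
type lossyFS struct {
	files map[string]*lossyFile
}

func newLossyFS() *lossyFS { return &lossyFS{files: make(map[string]*lossyFile)} }

// Append adds bytes to the named file; they stay unsynced until Sync.
func (fs *lossyFS) Append(name, s string) {
	f, ok := fs.files[name]
	if !ok {
		f = &lossyFile{}
		fs.files[name] = f
	}
	f.data = append(f.data, s...)
}

// Sync marks everything written to the named file so far as durable.
func (fs *lossyFS) Sync(name string) {
	if f, ok := fs.files[name]; ok {
		f.synced = len(f.data)
	}
}

// Read returns the current contents of the named file ("" if it is missing).
func (fs *lossyFS) Read(name string) string {
	if f, ok := fs.files[name]; ok {
		return string(f.data)
	}
	return ""
}

// CrashClone copies the synced prefix of every file. In addition, each file's
// unsynced tail survives with probability unsyncedPercent/100 (an assumed,
// simplified policy). The seed makes a given clone reproducible.
func (fs *lossyFS) CrashClone(unsyncedPercent int, seed int64) *lossyFS {
	rng := rand.New(rand.NewSource(seed))
	clone := newLossyFS()
	for name, f := range fs.files {
		keep := f.synced
		if rng.Intn(100) < unsyncedPercent {
			keep = len(f.data)
		}
		kept := append([]byte(nil), f.data[:keep]...)
		clone.files[name] = &lossyFile{data: kept, synced: len(kept)}
	}
	return clone
}

func main() {
	fs := newLossyFS()
	fs.Append("wal", "synced|")
	fs.Sync("wal")
	fs.Append("wal", "unsynced")

	fmt.Println(fs.CrashClone(0, 1).Read("wal"))   // synced|
	fmt.Println(fs.CrashClone(100, 1).Read("wal")) // synced|unsynced

	// For other percentages the outcome depends on the seed, but the synced
	// prefix is always present.
	got := fs.CrashClone(50, 1).Read("wal")
	fmt.Println(strings.HasPrefix(got, "synced|")) // true
}
```

Whatever the retention policy, the invariant a recovery test can rely on is that synced data is always present in the clone, while unsynced data may or may not be.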

jbowens

Reviewed 16 of 16 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @RaduBerinde)

error_test.go line 436 at r1 (raw file):

		inj := errorfs.InjectorFunc(func(op errorfs.Op) error {
			if op.Kind.ReadOrWrite() == errorfs.OpIsWrite && crashIndex.Add(-1) == -1 {
				// Allow an arbitrary subset of non-0synced state to survive beyond the

nit: "non-synced"

RaduBerinde · 2024-08-26T21:52:24Z

TFTR!

RaduBerinde requested a review from jbowens August 26, 2024 13:54

RaduBerinde requested a review from a team as a code owner August 26, 2024 13:54

jbowens approved these changes Aug 26, 2024

RaduBerinde force-pushed the memfs-clone branch from 4977c3c to 5baaf71 Compare August 26, 2024 21:50

RaduBerinde merged commit a70d5b3 into cockroachdb:master Aug 26, 2024
11 checks passed

RaduBerinde deleted the memfs-clone branch August 26, 2024 22:07

RaduBerinde mentioned this pull request Aug 27, 2024

tests: use UnsyncedDataPercent in more cases #3891

Open

itsbilal mentioned this pull request Aug 28, 2024

github.com/cockroachdb/pebble/internal/metamorphic: TestMetaTwoInstance failed #3894

Closed
