vfs: redesign MemFS strict mode #3888

Merged: 1 commit, Aug 26, 2024

RaduBerinde
Member

Currently we use MemFS in "strict" mode to test crash recovery in
the following way:

  • at the desired crash point we call SetIgnoreSyncs(true)
  • after we close the database, we call ResetToSyncedState() and
    SetIgnoreSyncs(false) and proceed using the same filesystem.
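The two steps above can be sketched with a toy model. This is only an illustration under stated assumptions: `sharedFS`, its fields, and `Write`/`Sync` are hypothetical stand-ins, not pebble's `MemFS`; only the method names `SetIgnoreSyncs` and `ResetToSyncedState` come from the description, and their behavior here is a guess at the intent.

```go
package main

import "fmt"

// sharedFS is a hypothetical toy filesystem (not pebble's MemFS). It keeps
// the last synced contents of each file separately from the current contents.
type sharedFS struct {
	ignoreSyncs bool
	synced      map[string]string // contents as of the last effective Sync
	current     map[string]string // contents including unsynced writes
}

func newSharedFS() *sharedFS {
	return &sharedFS{synced: map[string]string{}, current: map[string]string{}}
}

// Write appends data to the named file without making it durable.
func (fs *sharedFS) Write(name, data string) { fs.current[name] += data }

// Sync makes the current contents durable, unless syncs are being ignored.
func (fs *sharedFS) Sync(name string) {
	if fs.ignoreSyncs {
		return
	}
	fs.synced[name] = fs.current[name]
}

// SetIgnoreSyncs marks the simulated crash point: later syncs have no effect.
func (fs *sharedFS) SetIgnoreSyncs(ignore bool) { fs.ignoreSyncs = ignore }

// ResetToSyncedState discards everything that was not synced.
func (fs *sharedFS) ResetToSyncedState() {
	fs.current = make(map[string]string, len(fs.synced))
	for name, data := range fs.synced {
		fs.current[name] = data
	}
}

func main() {
	fs := newSharedFS()
	fs.Write("wal", "a")
	fs.Sync("wal")

	fs.SetIgnoreSyncs(true) // simulated crash point
	fs.Write("wal", "b")
	fs.Sync("wal") // ignored

	// "Restart": the same filesystem object is reused by the new process.
	fs.ResetToSyncedState()
	fs.SetIgnoreSyncs(false)
	fmt.Println(fs.current["wal"]) // prints "a"
}
```

Note that the "restarted" process operates on the very same `fs` value as the crashed one.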

This model is somewhat fragile because the operation whose crash we
are simulating and the operation that runs afterwards share the same
filesystem. For example, a background operation that is still
finishing up some cleanup could in principle interfere with the new
process.

We switch to a "crash clone" model, where we instead extract a
crash-consistent copy of the filesystem; further testing can proceed
on this independent copy. This allows for more usage patterns - e.g.
we can take multiple crash clones at various points and check them all
afterwards.
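A minimal sketch of the crash-clone idea, assuming a toy filesystem: `toyFS`, `toyFile`, and the `CrashClone` method shown here are hypothetical and are not taken from the actual change. The sketch also simplifies by treating file creation as always durable.

```go
package main

import "fmt"

// toyFile tracks the last synced contents separately from the current ones.
type toyFile struct {
	synced  []byte // contents as of the last Sync
	current []byte // contents including unsynced writes
}

// toyFS is a hypothetical stand-in for a strict in-memory filesystem.
type toyFS struct {
	files map[string]*toyFile
}

func newToyFS() *toyFS { return &toyFS{files: map[string]*toyFile{}} }

// Write appends data to the named file, creating it if needed.
func (fs *toyFS) Write(name string, data []byte) {
	f, ok := fs.files[name]
	if !ok {
		f = &toyFile{}
		fs.files[name] = f
	}
	f.current = append(f.current, data...)
}

// Sync makes the current contents of the named file durable.
func (fs *toyFS) Sync(name string) {
	if f, ok := fs.files[name]; ok {
		f.synced = append([]byte(nil), f.current...)
	}
}

// Read returns the current contents of the named file.
func (fs *toyFS) Read(name string) string {
	if f, ok := fs.files[name]; ok {
		return string(f.current)
	}
	return ""
}

// CrashClone returns an independent filesystem holding only synced data.
// Later writes to fs do not affect the clone, and vice versa.
func (fs *toyFS) CrashClone() *toyFS {
	clone := newToyFS()
	for name, f := range fs.files {
		clone.files[name] = &toyFile{
			synced:  append([]byte(nil), f.synced...),
			current: append([]byte(nil), f.synced...),
		}
	}
	return clone
}

func main() {
	fs := newToyFS()
	fs.Write("wal", []byte("a"))
	fs.Sync("wal")
	fs.Write("wal", []byte("b"))
	first := fs.CrashClone() // crash before "b" is synced

	fs.Sync("wal")
	second := fs.CrashClone() // crash after "b" is synced

	// Both clones can be checked afterwards, independently of fs.
	fmt.Println(first.Read("wal"), second.Read("wal")) // prints "a ab"
}
```

Because each clone is a separate value, leftover background work on the original filesystem cannot reach it.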

We also add functionality to randomly retain part of the unsynced
data (which is closer to what would happen in a real crash).

@RaduBerinde RaduBerinde requested a review from jbowens August 26, 2024 13:54
@RaduBerinde RaduBerinde requested a review from a team as a code owner August 26, 2024 13:54
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@jbowens jbowens left a comment

:lgtm:

Reviewed 16 of 16 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @RaduBerinde)


error_test.go line 436 at r1 (raw file):

		inj := errorfs.InjectorFunc(func(op errorfs.Op) error {
			if op.Kind.ReadOrWrite() == errorfs.OpIsWrite && crashIndex.Add(-1) == -1 {
				// Allow an arbitrary subset of non-0synced state to survive beyond the

nit: "non-synced"

@RaduBerinde
Member Author

TFTR!

@RaduBerinde RaduBerinde merged commit a70d5b3 into cockroachdb:master Aug 26, 2024
11 checks passed
@RaduBerinde RaduBerinde deleted the memfs-clone branch August 26, 2024 22:07
3 participants