Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make storage async #120

Closed
wants to merge 16 commits into from
Closed

Conversation

dan-da
Copy link
Collaborator

@dan-da dan-da commented Mar 20, 2024

This is a first cut at making the storage layer async in neptune-core.

I am making it a draft PR because I have a few remaining todos to polish things up:

  • add sync and storage [1] benches from twenty-first, modified as needed (easy)
  • investigate removal of impl Mmr for ArchivalMmr. done, see writeup.
  • investigate remaining runtime error(s). (do they also occur on master?)
  • rename crate::locks::sync to crate::locks::std. (easy)
  • rebase and squash commits, and/or improve commit messages.

stretch goal:

[1] storage benches put aside for now because they would require async support in divan which appears to be in progress, or else a re-write without divan.

Present status:

  • all tests and doctests pass.
  • fixed a runtime error: "In order to be transferred, a Block must have a non-None proof field." This seems to have been introduced elsewhere, not due to async mods.
  • One node (of 3) hits this runtime error:
thread 'main' panicked at /home/danda/dev/neptune/neptune-core/src/models/state/wallet/wallet_state.rs:559:9:
assertion `left == right` failed: Mutator set in wallet-handler must agree with that from applied block
  • there is a warning:
"2024-03-20T02:35:03.002579345Z  WARN ThreadId(03) neptune_core::mine_loop: Received block is timestamped in the future; mining on future-timestamped block."

dan-da added 16 commits March 17, 2024 10:16
This is an unmodified copy of twenty_first::storage at twenty-first
revision 890c451e4e513018d8500bedcd5bf76dd0bafdd9 (master)

It is not yet incorporated into the build.
fixes error:
  In order to be transferred, a Block must have a non-None proof field.

ProofType enum enables specifying/transferring an unimplemented Proof.
This is only temporary.

BlockType enum enables specifying Genesis vs Standard block.

A Standard block has a ProofType
The Genesis block has no ProofType
Under crate::locks we previously had:
 - sync
 - tokio

Each contains an impl of AtomicRw and AtomicMutex, with basically the
same API.  Yet:

 - sync refers to synchronous locks, ie sync vs async.
 - tokio refers to the tokio::sync lock implementation.

So the names are referring to different things.  Instead we change it to:

 - std
 - tokio

Now each refers to a lock implementation, ie std::sync and tokio::sync.

note: we could instead have changed `tokio` to `async`, but then there
might be multiple async lock impls to choose from.  So it seems cleanest
to use the name of each impl.
Moved this benchmark over from twenty_first.

Presently unable to build the twenty_first db_* bench tests because
the storage layer is now async and the divan bench crate doesn't yet
support async.  However, it may soon, see:

nvzqz/divan#39
@dan-da
Copy link
Collaborator Author

dan-da commented Mar 20, 2024

Regarding ArchivalMmr and the Mmr trait

Decoupling ArchivalMmr from the (non-async) Mmr trait seems complex. That trait is used in a number of places, such as:

  MutatorSetKernel<MMR: Mmr<Hash>>
  MsMembershipProof::batch_update_from_addition()
  RemovalRecord::batch_update_from_addition()
  mutator_set::insert_mock_item()
  mutator_set::remove_mock_item()

MutatorSetKernel is the most entangled. It is defined as:

    pub struct MutatorSetKernel<MMR: Mmr<Hash>> {
        pub aocl: MMR,
        pub swbf_inactive: MMR,
        pub swbf_active: ActiveWindow,
    }

It is instantiated in other locations as both ArchivalMmr (async) and MmrAccumulator (sync, from twenty_first). It has several complex methods, and is non-trivial.

Presently, I have ArchivalMmr implementing Mmr trait by spawning a new OS thread for each trait-method call which creates a new tokio runtime, which calls the ArchivalMmr async method that actually implements the functionality (and interacts with storage layer). Example:

    fn append(&mut self, new_leaf: Digest) -> MmrMembershipProof<H> {
        std::thread::scope(|s| {
            s.spawn(|| {
                let runtime = tokio::runtime::Runtime::new().unwrap();
                runtime.block_on(self.append_async(new_leaf))
            })
            .join()
            .unwrap()
        })
    }

This works, but seems quite inefficient. Unfortunately though the only way I know of to call an async method from a sync method already executing in an async runtime is to spawn another thread with a new runtime.

Worse though than inefficient, is that this actually makes the async calling fn block until the thread is finished. So for ArchivalMmr, this is no better (actually worse) than before this PR with regards to blocking on storage calls. Since ArchivalMmr is used heavily by MutatorSetKernel, this seems a big problem.

However, It may be adequate for merging this PR, and then I can optimize/improve in a follow-up.

Optimizing: Getting rid of the thread::spawn()

Brainstorming approaches.

  1. An improvement, though not a solution, would be to create a long-lived thread for each ArchivalMmr instance with a 2-way message channel. So instead of creating a thread per method call, we would pass tasks and results back and forth. This eliminates the cost of creating an OS thread plus tokio runtime for each method call. Unfortunately however it still means that ArchivalMmr methods block on storage, and our goal is to avoid that.

  2. An (obvious?) approach is to make a new MmrAsync trait and a new MmrAccumulatorAsync that also uses it. That should work and might be the best solution, but it forces code to become async that isn't presently, and doesn't really need to be. It just becomes forced into it because of implementing or using the Mmr trait. This also gets messy because implementors of various traits from tasm-lib are involved.


  1. Do not use any Mmr trait and keep MmrAccumulator sync. This would require that MutatorSetKernel can somehow accomodate both sync and async. It could mean adding a MutatorSetKernelAsync that duplicates a lot of logic. Or it could mean that we define it something like:
    pub enum MmrType {
        ArchivalMmr,
        MmrAccumulator,
    }

    pub struct MutatorSetKernel {
        pub aocl: MmrType,
        pub swbf_inactive: MmrType,
        pub swbf_active: ActiveWindow,
    }

The difficulty here is that all MutatorSetKernel methods must use a ton of match statements like:

   match self.aocl {
       MmrType::ArchivalMmr(m) => m.count_leaves().await,
       MmrType::MmrAccumulator(m) => m.count_leaves(),
   }

This is unwieldy. (though probably could be made less ugly with a macro)


Both approaches 2 and 3 require MutatorSetKernel to become async, which means that anything using it must also be async. Approach 3 allows MmrAccumulator to remain sync, is the primary advantage, at cost of making MutatorSetKernel more complex.


  1. A final approach I considered would impl ArchivalMmr with a Vec instead of a StorageVec. So it would not be database backed, but would instead load all data from DB at creation, store pending writes, and persist back on request. This should allow ArchivalMmr trait methods to remain sync. Async methods would exist for loading and persisting data.

The drawback is that all data must be loaded up-front and remain in mem. I'm unsure of how much data we are talking about, but my guess is that it may be huge eventually, and not possible to load/store all of it in RAM.

Specifically, ArchivalMmr is used for MutatorSetKernel::aocl and MutatorSetKernel::swbf_inactive fields.


All this considered, approach 2 seems to me the most straight-forward and the one I intend to pursue.

@dan-da
Copy link
Collaborator Author

dan-da commented Mar 22, 2024

Ok, I've made branch make_storage_and_mmr_trait_async with approach 2 (async Mmr trait) that builds cleanly, though not tests yet.

ArchivalMmr no longer needs to spawn a thread plus runtime for each trait method invocation. So that's a win.

I do have to spawn a thread in one other place though: RemovalRecordsIntegrity::rust_shadow(). This is in the impl CompiledProgram for RemovalRecordsIntegrity trait impl, and the trait is defined in tasm-lib and has default method impls, so that fn cannot be made async easily.

In order to get the program to build, I had to comment out several impls of tasm-lib traits. These seem mainly related to testing, and I think they really belong within test module anyway. I'm not certain they can be made to build again unless the tasm traits are made async as well.

my next task will be to get the tests building again.

@dan-da
Copy link
Collaborator Author

dan-da commented Mar 23, 2024

tests are building cleanly and passing in branch make_storage_and_mmr_trait_async.

except for some proptests I had to comment-out for now because proptest crate/macro is not async compatible. I see somebody created a proptest_async crate recently (yesterday?) but it only works with async_std, not tokio yet. The author states it would be easy to add tokio support.

The thread spawn in RemovalRecordsIntegrity::rust_shadow() remains. I don't know how often that fn will be called, or what impact it might have on performance, if any. For now I don't have a better solution, but it may be possible to refactor it out for those who understand tasm-lib well.

Anyway, I am happier with the code in make_storage_and_mmr_trait_async, so after a little more cleanup/review I will open a new PR for that branch that will obsolete this PR.

@aszepieniec
Copy link
Contributor

This error

thread 'main' panicked at /home/danda/dev/neptune/neptune-core/src/models/state/wallet/wallet_state.rs:559:9:
assertion `left == right` failed: Mutator set in wallet-handler must agree with that from applied block

indicates that somewhere the mutator set removal records or membership proofs are being updated incorrectly. The following propositions are probably true (quantified over unknown details):

  • The error was introduced as a side-effect of simplifying the Block struct, specifically the Body part. Previously it held two mutator set accumulators, one before and one after applying the additions and removals given by the transaction. This is redundant, but also makes keeping track of things easier.
  • The error is caused by incorrectly updating removal records or membership proofs with the update contained in a block. The new block structure makes this update step rather more complex.
  • It is possible to reduce this error to a failing unit test, by which I mean a test function that does not require running multiple interacting nodes or even threads.

This was referenced Mar 26, 2024
@dan-da
Copy link
Collaborator Author

dan-da commented Apr 2, 2024

closing in favor of #124.

@dan-da dan-da closed this Apr 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants