Stack overflow/segfault in RocksDB stress test #675

Open
jsantell opened this issue Oct 11, 2023 · 1 comment
Labels: Bug (Incorrect or unexpected behavior)

@jsantell (Contributor)

Running into a stack overflow/segfault when running RocksDB in multiplayer::orb_can_render_peers_in_the_sphere_address_book, discovered in #655.

To summarize, introducing RocksDB causes a segfault in one of our stress tests. It reproduces consistently both locally and in CI, but whether it triggers depends on things we wouldn't expect to matter, like the shape of some structs. In #655, we store a Storage instance in SphereDb. This change alone causes the segfault:

--- a/rust/noosphere-storage/src/db.rs
+++ b/rust/noosphere-storage/src/db.rs
@@ -40,6 +40,7 @@ where
     link_store: S::KeyValueStore,
     version_store: S::KeyValueStore,
     metadata_store: S::KeyValueStore,
+    storage: S,
 }
 
 impl<S> SphereDb<S>
@@ -52,6 +53,7 @@ where
             link_store: storage.get_key_value_store(LINK_STORE).await?,
             version_store: storage.get_key_value_store(VERSION_STORE).await?,
             metadata_store: storage.get_key_value_store(METADATA_STORE).await?,
+            storage: storage.to_owned(),
         })
     }

In #655, changing RocksDbStore's name property to be an Arc "fixes" the segfault:

--- a/rust/noosphere-storage/src/implementation/rocks_db.rs
+++ b/rust/noosphere-storage/src/implementation/rocks_db.rs
@@ -85,13 +85,13 @@ impl Storage for RocksDbStorage {
 
 #[derive(Clone)]
 pub struct RocksDbStore {
-    name: String,
+    name: Arc<String>,
     db: Arc<DbInner>,
 }
 
 impl RocksDbStore {
     pub fn new(db: Arc<DbInner>, name: String) -> Result<Self> {
-        Ok(RocksDbStore { db, name })
+        Ok(RocksDbStore { db, name: Arc::new(name) })
     }

While Arc is more appropriate here anyway, it shouldn't have any effect on this segfault. Using an even more appropriate Cow instead still fails (see the sketch below). That is to say, there is some spooky issue with using RocksDB that exists regardless of #655.
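
A Cow-based variant would look roughly like this (a sketch only, not the exact patch that was tried; DbInner and Result are the same types used in rocks_db.rs):

use std::{borrow::Cow, sync::Arc};

#[derive(Clone)]
pub struct RocksDbStore {
    // Clone-on-write name: a borrowed &'static str avoids the allocation
    // entirely, while an owned String still works.
    name: Cow<'static, str>,
    db: Arc<DbInner>,
}

impl RocksDbStore {
    pub fn new(db: Arc<DbInner>, name: String) -> Result<Self> {
        Ok(RocksDbStore {
            db,
            name: Cow::Owned(name),
        })
    }
}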


Things we've tried:

  • Are we overflowing the stack?
    • No: running with an unlimited stack size (ulimit -s unlimited) had no effect, and measuring address pointers in gdb shows only 2MB of the 8MB stack in use.
  • Does the multi-threaded impl of RocksDB work?
    • No: using either our single-threaded or our multi-threaded implementation makes no apparent difference.
  • While RocksDB is thread-safe, there are some issues with Sync and cf handles. Would wrapping the DB in a mutex fix it?
    • No, wrapping the DB in a mutex has no effect (a rough sketch of what that looks like follows this list).
  • What about using the latest rocksdb, built from source?
    • No effect.
  • Any hints from valgrind?
    • No invalid reads or writes, though there are some suspicious "definitely lost" blocks that appear only in the failing scenario; these could be attributed to cleanups never running after the crash
    • We allocate 750MB of memory in this single test flow, which could be relevant
  • Any RocksDB flags we can set that would avoid this, or give us more info?
    • Looked through the logs and enabled paranoid checks; no new useful information
  • What about nightly?
    • buzzer sounds
  • Any insight from running sanitizers?
    • RUSTFLAGS="-Z sanitizer=memory" cargo +nightly test --target x86_64-unknown-linux-gnu --features rocksdb,test-kubo orb_can_render
    • 👍 Address sanitizer: passes, what??
      • Failed once due to broken CAR stream
      • Unexpected warning: WARN flush_to_writer{writer=SphereWriter { kind: Root { force_full_render: false }, paths: SpherePaths { root: "/tmp/.tmpRjestr" }, base: Once(Uninit), mount: Once(Uninit), private: Once(Uninit) }}: Content write failed: No such file or directory (os error 2)
    • Thread sanitizer: fails earlier in the test flow, though the test runner itself has known data races under ThreadSanitizer (rust-lang/rust#39608)
    • Memory sanitizer: does not work with the test runner, which itself triggers uninitialized-value reports under MemorySanitizer (rust-lang/rust#39610)
    • Leak sanitizer: segfaults much earlier in the test flow (Thread 4 "tokio-runtime-w" received signal SIGSEGV, Segmentation fault.), while setting a key and serializing a multihash (libp2p/ns)
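
For reference, the mutex experiment mentioned above was roughly the following shape (a sketch only, not the actual change; DbInner, Result, and the read method are assumed from rocks_db.rs, and the real store reads through a column family handle rather than the DB directly):

use std::sync::{Arc, Mutex};

#[derive(Clone)]
pub struct RocksDbStore {
    name: Arc<String>,
    db: Arc<Mutex<DbInner>>,
}

impl RocksDbStore {
    pub fn read(&self, key: &[u8]) -> Result<Option<Vec<u8>>> {
        // Every access takes the lock, so no two threads can touch the DB or
        // its column family handles concurrently.
        let db = self.db.lock().unwrap();
        Ok(db.get(key)?)
    }
}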

Stack trace of the offending thread:

(cpp) rocksdb::TableCache::Get
..
BlockStore::get_block
..
OnceCell::get_or_try_init/closure
Sphere::to_memo
OnceCell::get_or_try_init/closure
Sphere::to_body
Sphere::get_content
Sphere::derive_mutation
Sphere::hydrate_with_cid
Sphere::hydrate
Sphere::hydrate_timeslice
Sphere::rebase
sync::fetch_remote_changes
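
For context on the repeated OnceCell frames: each lazily-initialized field is forced via get_or_try_init, and the closure bottoms out in a block-store read, which is what eventually reaches rocksdb::TableCache::Get. An illustrative sketch of that pattern (simplified names and types, not the actual noosphere-core code; assumes tokio's OnceCell):

use anyhow::Result;
use std::future::Future;
use tokio::sync::OnceCell;

struct LazyMemo {
    memo: OnceCell<Vec<u8>>,
}

impl LazyMemo {
    // Stands in for the Sphere::to_memo -> OnceCell::get_or_try_init ->
    // BlockStore::get_block frames above; `load_block` is a placeholder for
    // the real block-store read.
    async fn to_memo<F, Fut>(&self, load_block: F) -> Result<&Vec<u8>>
    where
        F: FnOnce() -> Fut,
        Fut: Future<Output = Result<Vec<u8>>>,
    {
        self.memo.get_or_try_init(load_block).await
    }
}
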
@jsantell jsantell added the Bug Incorrect or unexpected behavior label Oct 11, 2023
jsantell added a commit that referenced this issue Oct 19, 2023
@jsantell (Contributor, Author)

Also reproducible on arm64 macOS.
