Stack overflow/segfault in RocksDB stress test #675

Open
jsantell opened this issue Oct 11, 2023 · 1 comment
Labels: Bug (Incorrect or unexpected behavior)

@jsantell (Contributor)

Running into a stack overflow/segfault when running RocksDB in multiplayer::orb_can_render_peers_in_the_sphere_address_book, discovered in #655.

To summarize, introducing RocksDB causes a segfault in one of our stress tests. It reproduces consistently both locally and in CI, but whether it triggers depends on things we wouldn't expect to matter, like the shape of some structs. In #655, we store a Storage instance in SphereDb. This change alone causes the segfault:

--- a/rust/noosphere-storage/src/db.rs
+++ b/rust/noosphere-storage/src/db.rs
@@ -40,6 +40,7 @@ where
     link_store: S::KeyValueStore,
     version_store: S::KeyValueStore,
     metadata_store: S::KeyValueStore,
+    storage: S,
 }
 
 impl<S> SphereDb<S>
@@ -52,6 +53,7 @@ where
             link_store: storage.get_key_value_store(LINK_STORE).await?,
             version_store: storage.get_key_value_store(VERSION_STORE).await?,
             metadata_store: storage.get_key_value_store(METADATA_STORE).await?,
+            storage: storage.to_owned(),
         })
     }

In #655, changing RocksDbStore's name property to be an Arc "fixes" the segfault:

--- a/rust/noosphere-storage/src/implementation/rocks_db.rs
+++ b/rust/noosphere-storage/src/implementation/rocks_db.rs
@@ -85,13 +85,13 @@ impl Storage for RocksDbStorage {
 
 #[derive(Clone)]
 pub struct RocksDbStore {
-    name: String,
+    name: Arc<String>,
     db: Arc<DbInner>,
 }
 
 impl RocksDbStore {
     pub fn new(db: Arc<DbInner>, name: String) -> Result<Self> {
-        Ok(RocksDbStore { db, name })
+        Ok(RocksDbStore { db, name: Arc::new(name) })
     }

While Arc is more appropriate here anyway, it shouldn't have any effect on this segfault. Using an even more appropriate Cow instead still fails (see the sketch below). That is to say, there is some spooky issue with using RocksDB that exists regardless of #655.
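
A Cow-based variant would look roughly like this (a sketch only, not the exact patch that was tried; DbInner and Result are the same types used in rocks_db.rs):

use std::{borrow::Cow, sync::Arc};

#[derive(Clone)]
pub struct RocksDbStore {
    // Clone-on-write name: a borrowed &'static str avoids the allocation
    // entirely, while an owned String still works.
    name: Cow<'static, str>,
    db: Arc<DbInner>,
}

impl RocksDbStore {
    pub fn new(db: Arc<DbInner>, name: String) -> Result<Self> {
        Ok(RocksDbStore {
            db,
            name: Cow::Owned(name),
        })
    }
}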


Things we've tried:

  • Are we overflowing the stack?
    • No: running with an unlimited stack size (ulimit -s unlimited) had no effect, and measuring address pointers in gdb shows only 2MB of the 8MB stack in use.
  • Does the multi-threaded impl of RocksDB work?
    • No: using either our single-threaded or our multi-threaded implementation makes no apparent difference.
  • While RocksDB is thread-safe, there are some issues with Sync and cf handles. Would wrapping the DB in a mutex fix it?
    • No, wrapping the DB in a mutex has no effect (a rough sketch of what that looks like follows this list).
  • What about using the latest rocksdb, built from source?
    • No effect.
  • Any hints from valgrind?
    • No invalid reads or writes, though there are some suspicious "definitely lost" blocks that appear only in the failing scenario; these could be attributed to cleanups never running after the crash
    • We allocate 750MB of memory in this single test flow, which could be relevant
  • Any RocksDB flags we can set that would avoid this, or give us more info?
    • Looked through the logs and enabled paranoid checks; no new useful information
  • What about nightly?
    • buzzer sounds
  • Any insight from running sanitizers?
    • RUSTFLAGS="-Z sanitizer=memory" cargo +nightly test --target x86_64-unknown-linux-gnu --features rocksdb,test-kubo orb_can_render
    • 👍 Address sanitizer: passes, what??
      • Failed once due to broken CAR stream
      • Unexpected warning: WARN flush_to_writer{writer=SphereWriter { kind: Root { force_full_render: false }, paths: SpherePaths { root: "/tmp/.tmpRjestr" }, base: Once(Uninit), mount: Once(Uninit), private: Once(Uninit) }}: Content write failed: No such file or directory (os error 2)
    • Thread sanitizer: fails earlier in the test flow, though the test runner itself has known data races under ThreadSanitizer (rust-lang/rust#39608)
    • Memory sanitizer: does not work with the test runner, which itself triggers uninitialized-value reports under MemorySanitizer (rust-lang/rust#39610)
    • Leak sanitizer: segfaults much earlier in the test flow (Thread 4 "tokio-runtime-w" received signal SIGSEGV, Segmentation fault.), while setting a key and serializing a multihash (libp2p/ns)
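
For reference, the mutex experiment mentioned above was roughly the following shape (a sketch only, not the actual change; DbInner, Result, and the read method are assumed from rocks_db.rs, and the real store reads through a column family handle rather than the DB directly):

use std::sync::{Arc, Mutex};

#[derive(Clone)]
pub struct RocksDbStore {
    name: Arc<String>,
    db: Arc<Mutex<DbInner>>,
}

impl RocksDbStore {
    pub fn read(&self, key: &[u8]) -> Result<Option<Vec<u8>>> {
        // Every access takes the lock, so no two threads can touch the DB or
        // its column family handles concurrently.
        let db = self.db.lock().unwrap();
        Ok(db.get(key)?)
    }
}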

Stack trace of the offending thread:

(cpp) rocksdb::TableCache::Get
..
BlockStore::get_block
..
OnceCell::get_or_try_init/closure
Sphere::to_memo
OnceCell::get_or_try_init/closure
Sphere::to_body
Sphere::get_content
Sphere::derive_mutation
Sphere::hydrate_with_cid
Sphere::hydrate
Sphere::hydrate_timeslice
Sphere::rebase
sync::fetch_remote_changes
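
For context on the repeated OnceCell frames: each lazily-initialized field is forced via get_or_try_init, and the closure bottoms out in a block-store read, which is what eventually reaches rocksdb::TableCache::Get. An illustrative sketch of that pattern (simplified names and types, not the actual noosphere-core code; assumes tokio's OnceCell):

use anyhow::Result;
use std::future::Future;
use tokio::sync::OnceCell;

struct LazyMemo {
    memo: OnceCell<Vec<u8>>,
}

impl LazyMemo {
    // Stands in for the Sphere::to_memo -> OnceCell::get_or_try_init ->
    // BlockStore::get_block frames above; `load_block` is a placeholder for
    // the real block-store read.
    async fn to_memo<F, Fut>(&self, load_block: F) -> Result<&Vec<u8>>
    where
        F: FnOnce() -> Fut,
        Fut: Future<Output = Result<Vec<u8>>>,
    {
        self.memo.get_or_try_init(load_block).await
    }
}
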
@jsantell jsantell added the Bug Incorrect or unexpected behavior label Oct 11, 2023
jsantell added a commit that referenced this issue Oct 19, 2023
@jsantell (Contributor, Author)

Also reproducible on arm64 macOS.
