This document applies STPA-style hazard analysis to the sled embedded database to guide design and testing efforts that prevent unacceptable losses.
Outline
- purpose of analysis
- model of control structure
- identify unsafe control actions
- [identify loss scenarios](#identify-loss-scenarios)
- resources for learning more about STAMP, STPA, and CAST
We wish to prevent the following undesirable situations:
- data loss
- inconsistent (non-linearizable) data access
- process crash
- resource exhaustion
We draw the line between system and environment where we can reasonably invest our efforts to prevent losses.
Inside the boundary:
- codebase
  - put safe control actions into place that prevent losses
- documentation
  - show users how to use sled safely
  - recommend hardware, kernels, user code
Outside the boundary:
- direct changes to hardware, kernels, user code
These hazards can result in the above losses:
- data may be lost due to
  - bugs in the logging system
    - `Db::flush` fails to make previous writes durable
  - bugs in the GC system
    - the old location is overwritten before the defragmented location becomes durable
  - bugs in the recovery system
  - hardware failures
- consistency violations may be caused by
  - transaction concurrency control failing to enforce linearizability (strict serializability)
  - non-linearizable lock-free single-key operations
- panic
  - of user threads
  - IO threads
  - flusher & GC thread
  - indexing
  - unwraps/expects
  - failed `TryInto`/`TryFrom` + unwrap
- persistent storage exceeding (2 + N concurrent writers) * logical data size
- in-memory cache exceeding the configured cache size
  - caused by incorrect calculation of cache
- use-after-free
- data race
- memory leak
- integer overflow
- buffer overrun
- uninitialized memory access
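Several of the hazards above (indexing, `unwrap` on a failed `TryFrom`, integer overflow, and the storage bound) have panic-free counterparts in the Rust standard library. The following is a minimal sketch using only `std`; the function names `byte_at`, `len_as_u32`, and `storage_bound` are hypothetical and are not part of sled's API.

```rust
use std::convert::TryFrom;
use std::num::TryFromIntError;

/// Checked indexing: returns `None` instead of panicking
/// when `idx` is out of bounds.
fn byte_at(buf: &[u8], idx: usize) -> Option<u8> {
    buf.get(idx).copied()
}

/// Fallible narrowing: the caller receives an error for an
/// out-of-range value instead of hitting `unwrap` on a failed `TryFrom`.
fn len_as_u32(len: u64) -> Result<u32, TryFromIntError> {
    u32::try_from(len)
}

/// Storage bound from the hazard list:
/// (2 + N concurrent writers) * logical data size.
/// Checked arithmetic yields `None` on overflow instead of
/// wrapping (release builds) or panicking (debug builds).
fn storage_bound(concurrent_writers: u64, logical_size: u64) -> Option<u64> {
    concurrent_writers.checked_add(2)?.checked_mul(logical_size)
}
```

For example, `storage_bound(3, 100)` evaluates to `Some(500)`, while an out-of-bounds `byte_at` call or an overflowing `storage_bound` call returns `None` and leaves the decision to the caller.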
For each control action we have, consider:
- what hazards happen when we fail to apply it, or it does not exist?
- what hazards happen when we do apply it?
- what hazards happen when we apply it too early or too late?
- what hazards happen if we apply it for too long or not long enough?
durability model
- recovery
  - `LogIter::max_lsn`
    - return `None` if `last_lsn_in_batch >= self.max_lsn`
  - batch requirement set to last reservation base + inline len - 1
    - reserve bumps
      - `bump_atomic_lsn(&self.iobufs.max_reserved_lsn, reservation_lsn + inline_buf_len as Lsn - 1);`
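The interplay between the reservation bump and the recovery cutoff can be sketched as follows. This is an illustration under stated assumptions, not sled's actual implementation: `Lsn` is assumed to be `i64`, `bump_atomic_lsn` is modelled as an atomic maximum, and `last_lsn_of_reservation` and `batch_within_max_lsn` are hypothetical helpers that follow the conditions in the list literally.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Log sequence number (assumed to be a signed 64-bit integer).
type Lsn = i64;

/// Stand-in for `bump_atomic_lsn`: raise `atomic` to `to`
/// if `to` is larger, and never lower it.
fn bump_atomic_lsn(atomic: &AtomicI64, to: Lsn) {
    atomic.fetch_max(to, Ordering::SeqCst);
}

/// LSN of the last byte of a reservation that starts at
/// `reservation_lsn` and spans `inline_buf_len` bytes
/// (reservation base + inline len - 1).
fn last_lsn_of_reservation(reservation_lsn: Lsn, inline_buf_len: usize) -> Lsn {
    reservation_lsn + inline_buf_len as Lsn - 1
}

/// Recovery-side condition as listed: iteration stops (the iterator
/// would return `None`) when `last_lsn_in_batch >= max_lsn`.
/// Returns `true` when iteration may continue.
fn batch_within_max_lsn(last_lsn_in_batch: Lsn, max_lsn: Lsn) -> bool {
    last_lsn_in_batch < max_lsn
}
```

A reservation at LSN 100 with a 16-byte inline buffer ends at LSN 115, so the writer would call `bump_atomic_lsn(&max_reserved_lsn, 115)`. Using an atomic maximum means a slower writer with a lower LSN cannot move the high-water mark backwards.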
lock-free linearizability model
transactional linearizability (strict serializability) model
panic model
memory usage model
storage usage model