diff --git a/async-bpm b/async-bpm
index 5d3afa2..7d758f3 160000
--- a/async-bpm
+++ b/async-bpm
@@ -1 +1 @@
-Subproject commit 5d3afa24c698b309fbf310c373c7705b3f01e0c0
+Subproject commit 7d758f3cf47499646d6b929f3ecc79022561b22a
diff --git a/proposal/15721-project.json b/proposal/15721-project.json
new file mode 100644
index 0000000..97f453b
--- /dev/null
+++ b/proposal/15721-project.json
@@ -0,0 +1,21 @@
+{
+  "info": {
+    "title": "Eggstrain & Async Buffer Pool Manager",
+    "github": "https://github.com/cmu-db/15721-s24-ee1",
+    "description": "An asynchronous, vectorized, push-based execution engine and asynchronous buffer pool manager written in Rust.",
+    "students": [
+      {
+        "name": "Kyle Booker",
+        "url": "https://www.linkedin.com/in/ktbooker/"
+      },
+      {
+        "name": "Sarvesh Tandon",
+        "url": "https://www.linkedin.com/in/sarvesh-tandon/"
+      },
+      {
+        "name": "Connor Tsui",
+        "url": "https://www.linkedin.com/in/connortsui/"
+      }
+    ]
+  }
+}
diff --git a/proposal/final-presentation.html b/proposal/final-presentation.html
new file mode 100644
index 0000000..3a8185a
--- /dev/null
+++ b/proposal/final-presentation.html
@@ -0,0 +1,247 @@
+
[final-presentation.html: rendered HTML export of the final-presentation.md slide deck added below; markup omitted here. The rendered deck additionally captions the zipfian benchmark plots with alpha = 1.01, 1.1, and 1.2.]
\ No newline at end of file diff --git a/proposal/final-presentation.md b/proposal/final-presentation.md new file mode 100644 index 0000000..64fa8df --- /dev/null +++ b/proposal/final-presentation.md @@ -0,0 +1,289 @@ +--- +marp: true +theme: default +#class: invert # Remove this line for light mode +paginate: true +--- + +# Eggstrain + +Vectorized Push-Based inspired Execution Engine +Asynchronous Buffer Pool Manager + +
+ +## **Authors: Connor, Sarvesh, Kyle** + +--- + +# Original Proposed Goals + +- 75%: First 7 operators working + integration with other components +- 100%: All operators listed above working +- 125%: TPC-H benchmark working + +--- + +# Design Goals + +- Robustness +- Modularity +- Extensibility +- Forward Compatibility + +We made heavy use of `tokio` and `rayon` in our implementation. + +--- + +# Refresher on Architecture + +![bg right:60% 100%](./images/architecture.drawio.svg) + +--- + +# Refresher on operators + +- `TableScan` +- `Filter` +- `Projection` +- `HashAggregation` +- `HashJoin` (`HashProbe` + `HashBuild`) +- `OrderBy` +- `TopN` + +--- + +# Example Operator Workflow + +![bg right:70% 80%](./images/hashjoin.svg) + +--- + +# Progress Towards Goals + +- 100%: All operators implemented, excluding `HashJoin` +- 125%: TPC-H benchmark working for Q1 + +--- + +# Execution Engine Benchmarks + +Hardware: + +- M1 Pro, 8 cores, 16GB RAM + +--- + +![bg 90%](./images/csvreader.png) + +--- + +# Correctness Testing and Code Quality Assessment + +We tested correctness by comparing our results to the results of the same queries run in DataFusion. + +Our code quality is high with respect to documentation, integration tests, and code review. + +However, we lack unit tests for each operator. We instead tested operators integrated inside of queries. + +--- + +# Problem: In Memory? + +We found that we needed to spill data to disk to handle large queries. + +However, to take advantage of our asynchronous architecture, we needed to implement an **asynchronous buffer pool manager.** + + +--- + +# Recap: Buffer Pool Manager + +A buffer pool manager manages synchronizing data between volatile memory and persistent storage. + +* In charge of bringing data from storage into memory in the form of pages +* In charge of synchronizing reads and writes to the memory-local page data +* In charge of writing data back out to disk so it is synchronized + +--- + +# Traditional Buffer Pool Manager + +![bg right:50% 100%](images/traditional_bpm.png) + +Traditional BPMs will use a global hash table that maps page IDs to memory frames. + +* Source: _LeanStore: In-Memory Data Management Beyond Main Memory (2018)_ + +--- + +# Recap: Blocking I/O + +Additionally, traditional buffer pool managers will use blocking reads and writes to send data between memory and persistent storage. + +Blocking I/O is heavily reliant on the Operating System. + +> The DBMS can almost always manage memory better than the OS + +* Source: 15-445 Lecture 6 on Buffer Pools + +--- + +# Recap: I/O System Calls + +What happens when we issue a `pread()` or `pwrite()` call? + +* We stop what we're doing +* We transfer control to the kernel +* _We are blocked waiting for the kernel to finish and transfer control back_ + * _A read from disk is *probably* scheduled somewhere_ + * _Something gets copied into the kernel_ + * _The kernel copies that something into userspace_ +* We come back and resume execution + +--- + +# Blocking I/O for Buffer Pool Managers + +Blocking I/O is fine for most situations, but might be a bottleneck for a DBMS's Buffer Pool Manager. + +- Typically optimizations are implemented to offset the cost of blocking: + - Pre-fetching + - Scan-sharing + - Background writing + - `O_DIRECT` + +--- + +# Non-blocking I/O + +What if we could do I/O _without_ blocking? 
There exist a few ways to do this: + +- `libaio` +- `io_uring` +- SPDK +- All of these allow for _asynchronous I/O_ + +--- + +# `io_uring` + +![bg right:50% 90%](images/linux_io.png) + +This Buffer Pool Manager is going to be built with asynchronous I/O using `io_uring`. + +* Source: _What Modern NVMe Storage Can Do, And How To Exploit It... (2023)_ + +--- + +# Asynchronous I/O + +Asynchronous I/O really only works when the programs running on top of it implement _cooperative multitasking_. + +* Normally, the kernel gets to decide what thread gets to run +* Cooperative multitasking allows the program to decide who gets to run +* Context switching between tasks is a _much more_ lightweight maneuver +* If one task is waiting for I/O, we can cheaply switch to a different task! + +--- + +# Eggstrain + +The key thing here is that our Execution Engine `eggstrain` fully embraces asynchronous execution. + +* Rust has first-class support for asynchronous programs +* Using `async` libraries is almost as simple as plug-and-play +* The `tokio` crate is an easy runtime to get set up +* We can easily create a buffer pool manager in the form of a Rust library crate + +--- + +# Goals + +The goal of this system is to _fully exploit parallelism_. + +* NVMe drives have gotten really, really fast +* Blocking I/O simply cannot match the full throughput of an NVMe drive +* They are _completely_ bottle-necked by today's software +* If we can fully exploit parallelism in software _and_ hardware... + * **We can actually get close to matching the speed of in-memory systems, _while using persistent storage_** + +--- + +![bg 60%](images/modern_storage.png) + +--- + +# Proposed Design + +The next slide has a proposed design for a fully asynchronous buffer pool manager. The full (somewhat incomplete) writeup can be found [here](https://github.com/Connortsui20/async-bpm). 
+
+- Heavily inspired by LeanStore
+  - Eliminates the global page table and uses tagged pointers to data
+- Even more inspired by this paper:
+  - _What Modern NVMe Storage Can Do, And How To Exploit It: High-Performance I/O for High-Performance Storage Engines (2023)_
+    - Gabriel Haas and Viktor Leis
+- The goal is to _eliminate as many sources of global contention as possible_
+
+---
+
+![bg 90%](images/bpm_design.png)
+
+---
+
+# BPM Benchmarks
+
+Hardware:
+
+* Cray/Appro GB512X - 32 Threads Xeon E5-2670 @ 2.60GHz, 64 GiB DDR3 RAM, 1x 240GB SSD, Gigabit Ethernet, QLogic QDR Infiniband
+* We will benchmark against RocksDB as a buffer pool manager
+
+---
+
+![bg 90%](./images/zip1.1dist.png)
+
+---
+
+![bg 90%](./images/20w80r.png)
+
+---
+
+![bg 90%](./images/80w20r.png)
+
+---
+
+![bg 90%](./images/uniform20w80r.png)
+
+---
+
+![bg 90%](./images/uniform80w20r.png)
+
+---
+
+![bg 90%](./images/uniform5050.png)
+
+---
+
+# Future Work
+
+- Asynchronous BPM ergonomics and API
+- Proper `io_uring` polling and batch evictions
+- Shared user/kernel buffers and file descriptors (avoiding `memcpy`)
+- Multiple NVMe SSD support (software-implemented RAID 0)
+- Optimistic hybrid latches
diff --git a/proposal/final-presentation.pdf b/proposal/final-presentation.pdf
new file mode 100644
index 0000000..83d9c38
Binary files /dev/null and b/proposal/final-presentation.pdf differ
diff --git a/proposal/final_designdoc.md b/proposal/final_designdoc.md
new file mode 100644
index 0000000..ec3c6fe
--- /dev/null
+++ b/proposal/final_designdoc.md
@@ -0,0 +1,246 @@
+# Execution Engine
+
+- Sarvesh (sarvesht)
+- Kyle (kbooker)
+- Connor (cjtsui)
+
+# Overview
+
+The purpose of this project was to create the Execution Engine (EE) for a distributed OLAP database.
+
+We took heavy inspiration from [DataFusion](https://arrow.apache.org/datafusion/), [Velox](https://velox-lib.io/), and [InfluxDB](https://github.com/influxdata/influxdb) (which itself is built on top of DataFusion).
+
+There were two subgoals. The first was to develop a functional EE, with a sufficient number of operators working and tested. Since all other components rely on us to see any sort of output, we had to have implementations of operators that simply work (even if some are inefficient).
+
+The second was to add interesting features, to optimize the engine to be more performant, or both. Since it is unlikely that we would outperform an off-the-shelf EE like DataFusion, we instead aimed to test features that these engines do not use themselves.
+
+# Architectural Design
+
+> Explain the input and output of the component, describe interactions and breakdown the smaller components if any. Include diagrams if appropriate.
+
+We created a vectorized push-based EE. This means operators push batches of data up to their parent operators in the physical plan tree.
+
+---
+
+### Operators
+
+We implemented a subset of the operators that [Velox implements](https://facebookincubator.github.io/velox/develop/operators.html):
+
+- TableScan (used DataFusion)
+- Filter (completed)
+- Project (completed)
+- HashAggregation (completed)
+- HashProbe + HashBuild (used DataFusion)
+- OrderBy (completed)
+- TopN (completed)
+
+The exact `trait` / interface used to define these operators is not fixed yet. We will likely follow whatever DataFusion outputs from its [`ExecutionPlan::execute()`](https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#tymethod.execute) method; a rough sketch of one possible push-based interface is shown below.
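To make that concrete, here is a minimal, hypothetical sketch of a push-based operator interface built on `tokio` channels and Arrow `RecordBatch`es. The trait name (`PushOperator`), the example `Projection` struct, and the use of the `async-trait` crate are illustrative assumptions, not the actual eggstrain API.

```rust
// Hypothetical sketch of a push-based operator interface (not the real
// eggstrain trait). Children push RecordBatches into `input`; this operator
// pushes its results into `output` for its parent to consume.
use arrow::record_batch::RecordBatch;
use tokio::sync::mpsc::{Receiver, Sender};

#[async_trait::async_trait]
pub trait PushOperator: Send + Sync {
    async fn execute(&self, input: Receiver<RecordBatch>, output: Sender<RecordBatch>);
}

/// Example: a Projection operator that keeps only the listed columns.
pub struct Projection {
    pub columns: Vec<usize>, // indices of the columns to keep
}

#[async_trait::async_trait]
impl PushOperator for Projection {
    async fn execute(&self, mut input: Receiver<RecordBatch>, output: Sender<RecordBatch>) {
        while let Some(batch) = input.recv().await {
            // `RecordBatch::project` builds a new batch containing only these columns.
            if let Ok(projected) = batch.project(&self.columns) {
                // A send error just means the parent operator has shut down.
                if output.send(projected).await.is_err() {
                    break;
                }
            }
        }
    }
}
```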
+
+---
+
+### 3rd-party crates
+
+We will be making heavy use of `tokio` and `rayon`.
+
+[`tokio`](https://tokio.rs/) is a high-performance asynchronous runtime for Rust. It provides a well-tested, work-stealing, lightweight thread pool, and it is well documented with lots of features and support.
+
+[`rayon`](https://docs.rs/rayon/latest/rayon/) is a data parallelism crate that allows us to write parallelized and safe code on an OS thread pool with almost no developer cost.
+
+The difference is a bit subtle, but it comes down to using `tokio` for interacting with the storage client from the I/O Service component (basically handling anything asynchronous), and using `rayon` for any CPU-intensive workloads (actual execution). `tokio` will provide the scheduling for us, and all we will need to do is write the executors that `tokio` will run for us.
+
+There is a Rust-native [`arrow`](https://arrow.apache.org/rust/arrow/index.html) crate that gives us bindings into the [Apache Arrow data format](https://arrow.apache.org/overview/). Notably, it provides a [`RecordBatch`](https://arrow.apache.org/rust/arrow/array/struct.RecordBatch.html) type that can hold vectorized batches of columnar data. This will be the main form of data communicated between the EE and other components, as well as between the operators of the EE itself.
+
+We will also be making heavy use of [`datafusion`](https://docs.rs/datafusion/latest/datafusion/). Since we are using its physical plan representation, we will make heavy use of its data types. We will also use the DataFusion implementations for the operators we do not plan to implement ourselves for the time being.
+
+---
+
+### Interface and API
+
+Each EE can be treated as a mathematical function. It receives as input a physical plan from the scheduler, and outputs an Apache Arrow `RecordBatch` or `SendableRecordBatchStream`.
+
+The physical plan tree it receives as input has nodes corresponding to the operators listed above. These plans will be given to each EE by the Scheduler / Coordinator as DataFusion query plan fragments. More specifically, the EE will parse out the specific nodes in the relational tree.
+
+Once the EE parses the plan, it will figure out what data it requires. From there, it will make a high-level request for the data it needs from the IO Service (e.g. logical columns from a table). For the first stages of this project, we will just request entire columns of data from specific tables. Later in the semester, we may want more granular columnar data, or even point lookups.
+
+The IO Service will have a `StorageClient` type that the EE can use locally on an execution node as a way to abstract the data it is requesting and receiving (_the IO service team will expose this type from their own crate, and we can link it into our EE crate_). The IO Service will return data to the EE as an asynchronous stream of Apache Arrow `RecordBatch`es (something like `Stream<Item = RecordBatch>`, which will probably mean using a re-exported [`SendableRecordBatchStream`](https://docs.rs/datafusion/latest/datafusion/physical_plan/type.SendableRecordBatchStream.html) from DataFusion).
+
+The IO Service will retrieve the data (presumably by talking with the Catalog) from the blob store, parse it into the Apache Arrow format (which means the IO Service is in charge of decoding Parquet files into Arrow), and then send it to the EE as some form of asynchronous stream. A sketch of how the EE might consume such a stream while handing CPU-heavy work to `rayon` is shown below.
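The following is a minimal sketch of the intended `tokio`/`rayon` split, under the simplifying assumption that batches arrive as a plain `Stream<Item = RecordBatch>` (the real items will likely be `Result`s coming from the IO Service's `StorageClient`, whose API is not ours to define):

```rust
// Sketch only: tokio drives the asynchronous stream of batches, while the
// CPU-heavy per-batch work is bridged onto the rayon thread pool so that
// tokio worker threads are never blocked.
use arrow::record_batch::RecordBatch;
use futures::{Stream, StreamExt};
use tokio::sync::oneshot;

async fn process_batches<S>(mut input: S) -> usize
where
    S: Stream<Item = RecordBatch> + Unpin,
{
    let mut total_rows = 0;
    while let Some(batch) = input.next().await {
        // Hand the batch to rayon and asynchronously wait for the result.
        let (tx, rx) = oneshot::channel();
        rayon::spawn(move || {
            // Stand-in for real operator work (filtering, hashing, sorting, ...).
            let rows = batch.num_rows();
            let _ = tx.send(rows);
        });
        total_rows += rx.await.unwrap_or(0);
    }
    total_rows
}
```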
+
+### Implementation Details
+
+For our group, data between operators will also be sent and processed through Apache Arrow `RecordBatch` types. This is not how Velox does it, but it is how DataFusion does its processing.
+
+It also became clear that we needed our own Buffer Pool Manager to manage in-memory data that must be spilled to disk when the amount of data we need to handle grows too large.
+
+The buffer pool manager in DataFusion was not asynchronous, so in order to fully exploit the advantages of the `tokio` asynchronous runtime, we shifted focus completely in the last 4 weeks to build out an asynchronous buffer pool manager similar to LeanStore.
+
+# Testing Plan For In-Memory Execution Engine
+
+> How should the component be tested?
+> The integration test was TPC-H, or something similar to TPC-H. This was a stretch goal. We have completed this, and the results of running TPC-H query 1 with scale factor = 10 are shown in the final presentation (a sketch of this kind of DataFusion cross-check is shown below).
+
+# Glossary
+
+> If you are introducing new concepts or giving unintuitive names to components, write them down here.
+
+- "Vectorized execution" is the name given to the concept of outputting batches of data. But since there is a `Vec`tor type in Rust, we'll likely be calling everything Batches instead of Vectors.
+
+---
+
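For reference, a minimal sketch of the kind of DataFusion cross-check described in the testing plan above. The table registration and query are placeholders; the idea is simply to collect DataFusion's `RecordBatch`es for the same query and compare them against eggstrain's output (the eggstrain side is elided because its public API is still in flux):

```rust
// Sketch of a correctness cross-check: run the same query through DataFusion
// and treat its collected RecordBatches as the expected output.
use datafusion::error::Result;
use datafusion::prelude::{CsvReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("lineitem", "data/lineitem.csv", CsvReadOptions::new())
        .await?;

    // Expected result, computed by DataFusion.
    let expected = ctx
        .sql("SELECT l_returnflag, COUNT(*) FROM lineitem GROUP BY l_returnflag")
        .await?
        .collect()
        .await?; // Vec<RecordBatch>

    // ...run the same plan fragment through eggstrain and compare batches here...
    println!("expected {} result batches", expected.len());
    Ok(())
}
```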
+
+
+
+
+# **Asynchronous Buffer Pool**
+
+_Note: This design documentation for the asynchronous buffer pool is slightly outdated, but the_
+_high-level components are still the same. The only real difference is in the eviction algorithm._
+
+For the real documentation, see the up-to-date repository
+[here](https://github.com/Connortsui20/async-bpm).
+
+After cloning the repository, run this command to generate the documentation:
+
+```sh
+$ cargo doc --document-private-items --open
+```
+
+# Design
+
+This design is aimed at a thread-per-core model with a single logical disk.
+This implies that tasks (coroutines) given to worker threads cannot be moved between threads.
+Future work could introduce the all-to-all model of threads to distinct SSDs,
+where each worker thread has a dedicated `io_uring` instance for every physical SSD.
+
+# Future Work
+
+There is still a lot of work to be done on this system. As of right now, it is in a state of
+"barely working". However, in this "barely working" state, it still matches and even outperforms
+RocksDB in IOPS on single-disk hardware. Even though this is not a very high bar, it shows the high
+potential of this system, especially since the goal is to scale with better hardware.
+
+Almost all of the [issues](https://github.com/Connortsui20/async-bpm/issues) are geared towards
+optimization, and it is not an overstatement to say that each of these features would contribute
+to a significant performance gain.
+
+# Objects and Types
+
+## Thread Locals
+
+- `PageHandle`: A shared pointer to a `Page` (through an `Arc`)
+- Local `io_uring` instances (that are `!Send`)
+- Futures stored in a local hash table defining the lifecycle of an `io_uring` event (private)
+
+### Local Daemons
+
+These thread-local daemons exist as foreground tasks, just like any other task the DBMS might have.
+
+- Listener: Dedicated to polling local `io_uring` completion events
+- Submitter: Dedicated to submitting `io_uring` submission entries
+- Evictor: Dedicated to cooling `Hot` pages and evicting `Cool` pages
+
+## Shared Objects
+
+- Shared pre-registered buffers / frames
+  - Frames are owned types (they can only belong to a specific `Page`)
+  - Frames also have pointers back to their parent `Page`s
+  - _We will have to register multiple sets, as you can only register 1024 frames at a time_
+- Shared multi-producer multi-consumer channel of frames
+- `Page`: A hybrid-latched (read-write locked for now) page header
+  - State is either `Unloaded`, `Loading` (private), or `Loaded`
+    - `Unloaded` implies that the data is not in memory
+    - `Loading` implies that the data is being loaded from disk, and contains a future (private)
+    - `Loaded` implies the data is on one of the pre-registered buffers, and owns a registered buffer
+  - `Page`s also have eviction state
+    - `Hot` implies that this is a frequently-accessed page
+    - `Cool` implies that it is not frequently accessed, and might be evicted soon
+
+Note that the eviction state is really just an optimization for making a decision on pages to evict.
+The page eviction state _does not_ imply anything about the load state of a page
+(so a page could be `Hot` and also `Unloaded`), and all accesses to a page must still go through the hybrid latch.
+A rough Rust sketch of these types follows below.
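The sketch below mirrors the objects described above, approximating the hybrid latch with an async `RwLock` for simplicity. The names (`Frame`, `PageState`, `Page`, `PageHandle`) are illustrative, not the crate's actual types:

```rust
// Illustrative only -- the real crate uses a hybrid latch and io_uring
// registered buffers; this sketch just mirrors the states described above.
use std::sync::atomic::AtomicBool;
use std::sync::Arc;
use tokio::sync::RwLock;

/// A pre-registered buffer frame; owned by at most one page at a time.
struct Frame {
    buf: Box<[u8]>,
}

/// Load state of a logical page. The `Loading` variant's private future is
/// omitted here.
enum PageState {
    Unloaded,
    Loading,
    Loaded(Frame),
}

/// Shared page header. The eviction hint is orthogonal to the load state:
/// a page can be `Hot` yet `Unloaded`.
struct Page {
    state: RwLock<PageState>, // stand-in for the hybrid latch
    hot: AtomicBool,          // true = `Hot`, false = `Cool`
}

/// What worker tasks hold: a cheap, cloneable handle to the shared header.
type PageHandle = Arc<Page>;
```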
+In summary, the possible states that the `Page` can be in are:
+
+- `Loaded` and `Hot` (frequently accessed)
+- `Loaded` and `Cool` (potential candidate for eviction)
+- `Loading` (`Hot`/`Cool`, plus private `io_uring` event state)
+- `Unloaded` (`Cool`)
+
+# Algorithms
+
+### Write Access Algorithm
+
+Let P1 be the page we want to get write access for.
+
+- Set eviction state to `Hot`
+- Write-lock P1
+- If `Loaded` (SEMI-HOT PATH):
+  - Modify the page data
+  - Unlock and return
+- Else `Unloaded`, and we need to load the page
+  - Load the page via the [load algorithm](#load-algorithm)
+  - Modify the page data
+  - Unlock and return
+
+### Read Access Algorithm
+
+Let P1 be the page we want to get read access for.
+All optimistic reads have to be done through a read closure (we cannot construct a reference `&`).
+
+- Set eviction state to `Hot`
+- Optimistically read P1
+- If `Loaded` (HOT PATH):
+  - Read the page data optimistically and return
+  - If the optimistic read fails, fall back to a pessimistic read
+    - If still `Loaded` (SEMI-HOT PATH):
+      - Read normally and return
+    - Else it is now `Unloaded`, so continue
+- The state is `Unloaded`, and we need to load the page
+  - Upgrade the read lock into a write lock (either drop and retake, or directly upgrade)
+  - Load the page via the [load algorithm](#load-algorithm)
+  - Read the page data
+  - Unlock and return
+
+### Load Algorithm
+
+Let P1 be the page we want to load from disk into memory. The caller must hold the write lock on P1.
+Once this algorithm is complete, the page is guaranteed to be loaded into an owned frame,
+and the page eviction state will be `Hot`.
+
+- If the page is `Loaded`, then immediately return
+- Otherwise, this page is `Unloaded`
+- Pick a random `FrameGroup` from the set of frame groups that we have
+- Run the cooling (clock) algorithm on all the frames in the `FrameGroup` until we have a free frame available
+- Set the frame's parent pointer to P1
+- Read P1's data from disk into the buffer
+- `await` read completion from the local `io_uring` instance
+- At the end, set the page eviction state to `Hot`
+
+### General Eviction Algorithm
+
+On every worker thread, if the random `FrameGroup` that it picks does not have a free frame,
+it will start acting like an evictor and will start running the clock algorithm.
+It will aim to keep a certain threshold of free pages in the free list.
+
+- Iterate over all frames in the `FrameGroup`
+- Collect the list of `Page`s that are `Loaded` (this should not be more than the number of frames)
+- For every `Page` that is `Hot`, change it to `Cool`
+- Collect all `Cool` pages
+- Randomly choose some (small) constant number of pages from the list of initially `Cool` pages (currently this constant is 1, but in the future we would like it to be 64, i.e. do 64 eviction flushes together)
+- `join!` the following page eviction futures:
+  - For each page Px we want to evict:
+    - Check if Px has been changed to `Hot`, and if so, return early
+    - Write-lock Px
+    - If Px is now either `Hot` or `Unloaded`, unlock and return early
+    - Write Px's buffer data out to disk via the local `io_uring` instance
+    - `await` write completion from the local `io_uring` instance
+    - Set Px to `Unloaded`
+    - Send Px's frame to the global channel of free frames
+    - Unlock Px
+
+# Glossary
+
+> If you are introducing new concepts or giving unintuitive names to components, write them down here.
+
+- "Vectorized execution" is the name given to the concept of outputting batches of data.
But since there is a `Vec`tor type in Rust, we'll likely be calling everything Batches instead of Vectors. diff --git a/proposal/images/20w80r.png b/proposal/images/20w80r.png new file mode 100644 index 0000000..38f11d9 Binary files /dev/null and b/proposal/images/20w80r.png differ diff --git a/proposal/images/80w20r.png b/proposal/images/80w20r.png new file mode 100644 index 0000000..a882090 Binary files /dev/null and b/proposal/images/80w20r.png differ diff --git a/proposal/images/bpm_design.png b/proposal/images/bpm_design.png new file mode 100644 index 0000000..1e3a8ec Binary files /dev/null and b/proposal/images/bpm_design.png differ diff --git a/proposal/images/csvreader.png b/proposal/images/csvreader.png new file mode 100644 index 0000000..42125e8 Binary files /dev/null and b/proposal/images/csvreader.png differ diff --git a/proposal/images/linux_io.png b/proposal/images/linux_io.png new file mode 100644 index 0000000..f303f13 Binary files /dev/null and b/proposal/images/linux_io.png differ diff --git a/proposal/images/modern_storage.png b/proposal/images/modern_storage.png new file mode 100644 index 0000000..cb9b526 Binary files /dev/null and b/proposal/images/modern_storage.png differ diff --git a/proposal/images/traditional_bpm.png b/proposal/images/traditional_bpm.png new file mode 100644 index 0000000..08d4547 Binary files /dev/null and b/proposal/images/traditional_bpm.png differ diff --git a/proposal/images/uniform20w80r.png b/proposal/images/uniform20w80r.png new file mode 100644 index 0000000..2555be0 Binary files /dev/null and b/proposal/images/uniform20w80r.png differ diff --git a/proposal/images/uniform5050.png b/proposal/images/uniform5050.png new file mode 100644 index 0000000..93af5e9 Binary files /dev/null and b/proposal/images/uniform5050.png differ diff --git a/proposal/images/uniform80w20r.png b/proposal/images/uniform80w20r.png new file mode 100644 index 0000000..df4a1f5 Binary files /dev/null and b/proposal/images/uniform80w20r.png differ diff --git a/proposal/images/zip1.1.png b/proposal/images/zip1.1.png new file mode 100644 index 0000000..26208f8 Binary files /dev/null and b/proposal/images/zip1.1.png differ diff --git a/proposal/images/zip1.1dist.png b/proposal/images/zip1.1dist.png new file mode 100644 index 0000000..315ee82 Binary files /dev/null and b/proposal/images/zip1.1dist.png differ diff --git a/proposal/images/zip1.2.png b/proposal/images/zip1.2.png new file mode 100644 index 0000000..58c5d9f Binary files /dev/null and b/proposal/images/zip1.2.png differ diff --git a/proposal/presentation2.html b/proposal/presentation2.html new file mode 100644 index 0000000..ef78e61 --- /dev/null +++ b/proposal/presentation2.html @@ -0,0 +1,188 @@ +
[presentation2.html: rendered HTML export of the presentation2.md slide deck added below; markup omitted here.]
\ No newline at end of file diff --git a/proposal/presentation2.md b/proposal/presentation2.md new file mode 100644 index 0000000..3641042 --- /dev/null +++ b/proposal/presentation2.md @@ -0,0 +1,179 @@ +--- +marp: true +theme: default +paginate: true +--- + +# Eggstrain and Beyond + +## **Authors: Connor, Kyle, Sarvesh** + +--- + +# Eggstrain - Demo on TPC-H Query 1 + +Our proof of concept is a demo on TPC-H Query 1 using the `eggstrain` execution engine. + +The main contribution is our asynchronous framework utilizing tokio and rayon threads. + +--- + +# The End of Eggstrain + +With this working demo, we have almost completed our initial goal of creating an execution engine with the operators we have previously discussed. + +- More work to be more robust and feature-rich +- Foundation for a powerful and efficient async execution engine +- Need new async bpm to support the engine + +--- + +# Recap: Buffer Pool Manager + +A buffer pool manager manages synchronizing data between volatile memory and persistent storage. + +- In charge of bringing data from storage into memory in the form of pages +- In charge of synchronizing reads and writes to the memory-local page data +- In charge of writing data back out to disk so it is synchronized + +--- + +# Traditional Buffer Pool Manager + +![bg right:50% 100%](images/traditional_bpm.png) + +Traditional BPMs will use a global hash table that maps page IDs to memory frames. + +- Source: _LeanStore: In-Memory Data Management Beyond Main Memory (2018)_ + +--- + +# Recap: Blocking I/O + +Additionally, traditional buffer pool managers will use blocking reads and writes to send data between memory and persistent storage. + +Blocking I/O is heavily reliant on the Operating System. + +> The DBMS can almost always manage memory better than the OS + +- Source: 15-445 Lecture 6 on Buffer Pools + +--- + +# Recap: I/O System Calls + +What happens when we issue a `pread()` or `pwrite()` call? + +- We stop what we're doing +- We transfer control to the kernel +- _We are blocked waiting for the kernel to finish and transfer control back_ + - _A read from disk is *probably* scheduled somewhere_ + - _Something gets copied into the kernel_ + - _The kernel copies that something into userspace_ +- We come back and resume execution + +--- + +# Blocking I/O for Buffer Pool Managers + +Blocking I/O is fine for most situations, but might be a bottleneck for a DBMS's Buffer Pool Manager. + +- Typically optimizations are implemented to offset the cost of blocking: + - Pre-fetching + - Scan-sharing + - Background writing + - `O_DIRECT` + +--- + +# Non-blocking I/O + +What if we could do I/O _without_ blocking? There exist a few ways to do this: + +- `libaio` +- `io_uring` +- SPDK +- All of these allow for _asynchronous I/O_ + +--- + +# `io_uring` + +![bg right:50% 90%](images/linux_io.png) + +This Buffer Pool Manager is going to be built with asynchronous I/O using `io_uring`. + +- Source: _What Modern NVMe Storage Can Do, And How To Exploit It... (2023)_ + +--- + +# Asynchronous I/O + +Asynchronous I/O really only works when the programs running on top of it implement _cooperative multitasking_. + +- At a high level, the kernel gets to decide what thread gets to run +- Cooperative multitasking allows the program to decide who gets to run +- Context switching between tasks is a lightweight maneuver +- If one task is waiting for I/O, we can cheaply switch to a different task! 
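To make the cooperative-multitasking point concrete, here is a tiny, self-contained illustration (not eggstrain code): two tasks interleave on a single-threaded `tokio` runtime, with the "I/O wait" simulated by a timer.

```rust
// Toy illustration of cooperative multitasking: both futures run on ONE OS
// thread, and the runtime switches between them at .await points instead of
// relying on kernel context switches.
use std::time::Duration;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let io_bound = async {
        println!("task A: issuing (simulated) I/O...");
        tokio::time::sleep(Duration::from_millis(100)).await; // yields to the scheduler
        println!("task A: I/O finished");
    };

    let other_work = async {
        for i in 0..3 {
            println!("task B: step {i}");
            tokio::task::yield_now().await; // explicit cooperative yield point
        }
    };

    // Run both concurrently; task B makes progress while task A waits.
    tokio::join!(io_bound, other_work);
}
```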
+ +--- + +# Eggstrain + +The key thing here is that our Execution Engine `eggstrain` fully embraces asynchronous execution. + +- Rust has first-class support for asynchronous programs +- Using `async` libraries is almost as simple as plug-and-play +- The `tokio` crate is an easy runtime to get set up +- We can easily create a buffer pool manager in the form of a Rust library crate + +--- + +# Goals + +The goal of this system is to _fully exploit parallelism_. + +- NVMe drives have gotten really, really fast +- Blocking I/O simply cannot match the full throughput of an NVMe drive +- They are _completely_ bottle-necked by today's software +- If we can fully exploit parallelism in software _and_ hardware, we can get close to matching the speed of in-memory systems, while using persistent storage + +--- + +# Proposed Design + +The next slide has a proposed design for a fully asynchronous buffer pool manager. The full (somewhat incomplete) writeup can be found [here](https://github.com/Connortsui20/async-bpm). + +- Heavily inspired by LeanStore + - Eliminates the global page table and uses tagged pointers to data +- Even more inspired by this paper: + - _What Modern NVMe Storage Can Do, And How To Exploit It: High-Performance I/O for High-Performance Storage Engines (2023)_ + - Gabriel Haas and Viktor Leis +- The goal is to _eliminate as many sources of global contention as possible_ + +--- + +![bg 100%](images/bpm_design.png) + +--- + +# Some Issues + +- There is a scheduler-per-thread, but no scheduler assigning tasks to specific workers (so it does not work with a multithreaded asynchronous runtime) +- The proposed design does not have a backend stage to fully synchronize I/O +- Eviction is done naively by a single worker thread + - Deadlocks!!! +- Will probably switch to polling a list of free frames that gets populated by foreground tasks + +--- + +# Future Work + +- This will definitely not be done by the end of this semester +- Which means our execution engine is also not going to be "complete" in that we will not support spill-to-disk +- Our contribution is the beginning of an implementation of an asynchronous Buffer Pool Manager in Rust + - That can theoretically be plugged into any asynchronous execution engine like `eggstrain` or even DataFusion + +--- + +# **Thank you!**