
[C++] F2 KV store #922

Open
wants to merge 179 commits into
base: main

kkanellis
Contributor

This PR introduces F2, an evolution of the FASTER key-value store. More info can be found here.

kkanellis and others added 30 commits August 6, 2021 15:36
* [C++] Add force option to record user Delete request

* If force is set to true, then a tombstone will be appended to the log,
  irrespective of whether the hash index contains the record itself.

* [C++] Support for defining a Guid for a session externally

* [C++] Replace checkpoint inline callback definition

... with predefined types.

* [C++] Rmw can be configured to not create record

... if one does not exist inside the log.

* [C++] Implement method for conditionally copying to log tail

* [C++] Use minimum number of mutable pages if value is 0

* [C++] Initial implementation of FASTER hot-cold design

* Currently supports reads, upserts, deletes and RMWs.

* [C++] Fix compilation error

* [C++] Initial tests for hot-cold design

* [C++] Lookup-based hybrid log compaction (microsoft#487)

* [C++] Log scan can now return record address, along with record

* [C++] Add implementation of Address + operator

* [C++] Add method for finding if a record exists in the hybrid log

* Note that if a tombstone record exists, it will return true.

* [C++] Initial implementation of a better log compaction algorithm

* It leverages the hash index to identify live records, and copy them
  to the tail of the log.

* Ensures that if a user performs a concurrent upsert, compaction won't
  overwrite their operation.

* Avoids expensive scan of the entire log -- only the relevant log
  section is read.
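
A minimal, single-threaded sketch of the lookup-based idea described above, for illustration only. `ToyStore`, `Record`, and every method name here are hypothetical and do not mirror the FASTER/F2 C++ API; the real implementation is concurrent and works on pages of the hybrid log. The sketch assumes a record is live only if the hash index still points at its address, which is also what keeps compaction from overwriting a newer upsert.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical toy model: the log is a vector and an address is a vector index.
struct Record {
  std::string key;
  std::string value;
  bool tombstone;
};

struct ToyStore {
  std::vector<Record> log;
  std::unordered_map<std::string, uint64_t> index;  // key -> latest address
  uint64_t begin_address = 0;  // addresses below this are logically truncated

  uint64_t Append(const std::string& key, const std::string& value,
                  bool tombstone) {
    log.push_back(Record{key, value, tombstone});
    index[key] = log.size() - 1;
    return log.size() - 1;
  }
  void Upsert(const std::string& key, const std::string& value) {
    Append(key, value, false);
  }
  void Delete(const std::string& key) { Append(key, "", true); }

  // Copies live records in [begin_address, until) to the tail, then truncates.
  // Only the compacted range is read. Returns the number of records copied.
  uint64_t Compact(uint64_t until) {
    uint64_t copied = 0;
    for (uint64_t addr = begin_address; addr < until; ++addr) {
      const Record rec = log[addr];  // copy: Append may reallocate the vector
      auto it = index.find(rec.key);
      // A newer record exists elsewhere, so this one is dead: skip it.
      if (it == index.end() || it->second != addr) continue;
      if (rec.tombstone) {  // nothing older survives, so drop the tombstone
        index.erase(it);
        continue;
      }
      Append(rec.key, rec.value, false);
      ++copied;
    }
    begin_address = until;
    return copied;
  }

  bool Read(const std::string& key, std::string* out) const {
    auto it = index.find(key);
    if (it == index.end() || it->second < begin_address) return false;
    const Record& rec = log[it->second];
    if (rec.tombstone) return false;
    *out = rec.value;
    return true;
  }
};
```

In this toy, dropping a tombstone found in the compacted range is only safe because a single log is modeled; a hot-cold setup would have to keep it so that older cold-log records stay hidden.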

* [C++] Remove unnecessary template typenames

* [C++] Fix several issues in compaction code

* [C++] Several bugfixes in log scan iterator

* Now correctly switches to read from next page if the record didn't
  entirely fit in the previous one.

* Fix bug where record address was wrong

* Fix bug where in-disk page wasn't read due to >0 offset in passed
  address.

* [C++] Minor bugfixes in lookup-based compaction

* [C++] Add tests for lookup-based compaction algorithm

* [C++] Fix bug in Address + operator

* [C++] Add a medium-sized value type for tests

* [C++] Compaction context/entry now stores record address

* [C++] Bugfixes in compaction code

* [C++] Update log compaction tests

* Add tests where other threads perform concurrent insertions &
  deletions

* Test actual log truncation correctness (using `ShiftBeginAddress`
  method).

* [C++] Fix test compilation error

* [C++] Refactor log compaction code

* [C++] Minor changes

* [C++] Better status handling in RecordExists method

* [C++] Log compaction with multiple threads

* [C++] Unoptimized concurrent page-granularity compaction

* [C++] Fix bug in tests

* [C++] Concurrent compaction w/ non-blocking waiting for threads

* [C++] Introduce page- and record-granularity log iterators

* Page-granularity iterator is used with the new lookup-based compaction
  method, while the (older) record-granularity one is used by the (old)
  compaction algorithm.

* The page log iterator can still be optimized further (e.g. by avoiding
  locking, adding prefetching, etc.).

* [C++] Avoid key/value copying on compaction contexts

* [C++] Improvements on the log compaction method

* + bugfix on sessions start/stop when using multiple threads.

* [C++] Make obsolete write key calls on Read/Exists contexts

* [C++] Add variable-length key tests for log compaction

* [C++] Add delete ops to varlen keys tests

* [C++] Concurrent lock-free log iterator with prefetching

* [C++] Add variable-length value tests for lookup compaction

* [C++] Bugfixes in tests

Co-authored-by: Badrish Chandramouli <badrishc@microsoft.com>
Co-authored-by: Kirk Olynyk <kirk.olynyk@microsoft.com>

* Better design of FASTER's copy to tail method

* Implement hot-cold & cold-cold compaction

* Include RMW in compact lookup tests

* Bugfixes in core FASTER

* Bugfixes & preliminary work for retrying RMW ops in hot-cold

* Minor changes in compact lookup tests

* Rework hot-cold implementation to support retries

* Bugfixes in FASTER RMW & log compaction

* Update hot-cold design tests

* Minor change in RMW

* Minor cleanup & better handling of complete pending requests

* Proper handling of deleted records in hot-cold

`Read` method returns `NOT_FOUND` either if no record was found, or if a
tombstone record was found. While there is no point separating the two
cases in the single log case, in the hot-cold design it is important to
know which is the case.

The most useful use-case for that is for the hot-cold `Read` method: if
a tombstone was found in hot log, there is no need to search the cold
log. In other words, `Read` will go through the cold log only if no record
(normal or tombstone) was found in the hot log.

Thus, FASTER's `Read` method can now be configured to return a different
status (i.e. `ABORTED`) if it finds a tombstone, instead of `NOT_FOUND`.
We support this using an additional optional flag, `abort_if_tombstone`,
in the Read function prototype. By default this is set to `false`; only
the hot-cold design sets this flag, when a Read is issued on the hot log.
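
The read path described above could be sketched roughly as follows. This is an illustrative toy, not the PR's code: `SingleLogStore`, `Entry`, and `HotColdRead` are hypothetical names, and each "log" is just an in-memory map. Only the `abort_if_tombstone` flag and the `NOT_FOUND` / `ABORTED` distinction are taken from the commit message.

```cpp
#include <string>
#include <unordered_map>

enum class Status { Ok, NotFound, Aborted };

// Hypothetical stand-in for one log; a tombstone is an entry with a flag set.
struct Entry {
  std::string value;
  bool tombstone;
};

class SingleLogStore {
 public:
  void Upsert(const std::string& key, const std::string& value) {
    entries_[key] = Entry{value, false};
  }
  void Delete(const std::string& key) { entries_[key] = Entry{"", true}; }

  // With abort_if_tombstone == false, "no record" and "tombstone" both map to
  // NotFound. With it set, a tombstone is reported as Aborted instead.
  Status Read(const std::string& key, std::string* out,
              bool abort_if_tombstone = false) const {
    auto it = entries_.find(key);
    if (it == entries_.end()) return Status::NotFound;
    if (it->second.tombstone) {
      return abort_if_tombstone ? Status::Aborted : Status::NotFound;
    }
    *out = it->second.value;
    return Status::Ok;
  }

 private:
  std::unordered_map<std::string, Entry> entries_;
};

// Searches the cold log only if the hot log holds no record at all for the key.
Status HotColdRead(const SingleLogStore& hot, const SingleLogStore& cold,
                   const std::string& key, std::string* out) {
  Status s = hot.Read(key, out, /*abort_if_tombstone=*/true);
  if (s == Status::Ok) return s;
  if (s == Status::Aborted) return Status::NotFound;  // deleted in hot log
  return cold.Read(key, out);
}
```

The caller still sees plain `NotFound` for a deleted key; the `Aborted` status stays internal to the hot-cold wrapper in this sketch.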

* Update hot-cold tests

* Bugfix to guarantee progress in both stores

Co-authored-by: Badrish Chandramouli <badrishc@microsoft.com>
Co-authored-by: Kirk Olynyk <kirk.olynyk@microsoft.com>
During log compaction (w/ lookup), live records are copied to the
tail of the log. Once all live records have been copied, the part
of the log that was just compacted is truncated. However, there is a
slim chance that during the log truncation a pending Read operation will
return NOT_FOUND, even though a record for this key exists.
Specifically, this can happen if a live record is being copied to the
tail of the log, but the Read operation has already checked the log
tail and has issued one (or more) I/O requests to read disk-resident
records. In this case, if we truncate the log before this Read operation
reaches the live record, the Read will return NOT_FOUND.

To handle this undesired behavior, we keep a global count of the log
truncations performed after log compaction. Each Read operation
keeps a local copy of this number in its context. If the Read operation
has reached the end of its search (the beginning of the log) and has not
found a live record, we check whether a log truncation occurred due to a
log compaction. If this is the case, the Read op will retry, in order to
check the newly introduced log part. This last part is now supported using
the `min_start_address` argument that can be defined in the Read context.
In this case, the Read operation will not go through the entire log.
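
A rough, single-threaded sketch of the retry scheme described above. The race is simulated by splitting a read into a "begin" and a "finish" step; `ToyLog`, `ReadContext`, `BeginRead`, and `FinishRead` are hypothetical names chosen for illustration. Only the truncation counter and the `min_start_address` idea come from the commit message.

```cpp
#include <atomic>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy log: address == vector index.
struct ToyLog {
  std::vector<std::pair<std::string, std::string>> records;
  uint64_t begin_address = 0;
  std::atomic<uint64_t> num_truncations{0};  // global truncation counter

  void Append(const std::string& key, const std::string& value) {
    records.emplace_back(key, value);
  }
  // Truncation performed at the end of a compaction.
  void TruncateAfterCompaction(uint64_t new_begin) {
    begin_address = new_begin;
    num_truncations.fetch_add(1);
  }
  // Newest-to-oldest search of [from, to), clamped to the live log.
  bool Search(const std::string& key, uint64_t from, uint64_t to,
              std::string* out) const {
    if (from < begin_address) from = begin_address;
    for (uint64_t addr = to; addr > from; --addr) {
      if (records[addr - 1].first == key) {
        *out = records[addr - 1].second;
        return true;
      }
    }
    return false;
  }
};

struct ReadContext {
  std::string key;
  uint64_t tail_at_start;          // tail observed when the read began
  uint64_t truncations_at_start;   // local copy of the global counter
  uint64_t min_start_address;      // lower bound for a retried search
};

ReadContext BeginRead(const ToyLog& log, const std::string& key) {
  return ReadContext{key, static_cast<uint64_t>(log.records.size()),
                     log.num_truncations.load(), 0};
}

bool FinishRead(const ToyLog& log, ReadContext& ctx, std::string* out) {
  if (log.Search(ctx.key, ctx.min_start_address, ctx.tail_at_start, out)) {
    return true;
  }
  // Nothing found. If a compaction truncated the log meanwhile, the record
  // may have been copied above the tail we observed: retry on that part only.
  uint64_t now = log.num_truncations.load();
  while (now != ctx.truncations_at_start) {
    ctx.truncations_at_start = now;
    ctx.min_start_address = ctx.tail_at_start;
    ctx.tail_at_start = static_cast<uint64_t>(log.records.size());
    if (log.Search(ctx.key, ctx.min_start_address, ctx.tail_at_start, out)) {
      return true;
    }
    now = log.num_truncations.load();
  }
  return false;
}
```

Because the retry starts at the previously observed tail, it scans only records appended since the read began rather than the whole log.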
This fixes some spurious error messages, including the following:

`Assertion `idx < size_' failed.'

Fixes a bug that was due to improper calculation of how many bytes to
read from disk.
For write-intensive workloads, it is possible that even during
compaction, the maximum hlog budget can be reached. For example,
this can occur when the rate of ingesting requests to the hot log is
higher than the rate of compacting records to the cold log.

To fix this, we now allow user threads to participate in the compaction
process, but only if we reach the (hard) hlog size limit. Note that
background compaction threads perform only compaction work anyway.
Once the compaction completes, user threads can resume serving
user requests, as before.
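
One way to picture this policy is the sketch below. It is a toy under stated assumptions, not the PR's implementation: `ToyHotLog`, `CompactOnePage`, and the page-count bookkeeping are all hypothetical, and "compacting" a page is reduced to decrementing a counter. A user thread helps with compaction only once the hard limit is reached, and returns to normal writes when no compaction work remains.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical toy: sizes are counted in pages, and each write adds one page.
struct ToyHotLog {
  explicit ToyHotLog(uint64_t hard_limit_pages) : hard_limit(hard_limit_pages) {}

  const uint64_t hard_limit;
  std::atomic<uint64_t> size{0};              // pages currently in the hot log
  std::atomic<uint64_t> pages_to_compact{0};  // work left in active compaction

  // Claims one page of compaction work, if any. Safe to call from several
  // threads: each page is claimed exactly once through the CAS loop.
  bool CompactOnePage() {
    uint64_t remaining = pages_to_compact.load();
    while (remaining > 0) {
      if (pages_to_compact.compare_exchange_weak(remaining, remaining - 1)) {
        size.fetch_sub(1);  // the page's live records moved to the cold log
        return true;
      }
      // On failure, `remaining` was reloaded; loop and try again.
    }
    return false;
  }

  // User-facing write. Returns how many pages this user thread compacted
  // before its own write was served.
  uint64_t Upsert() {
    uint64_t helped = 0;
    if (size.load() >= hard_limit) {
      // Hard limit reached: help until the compaction has no work left.
      while (CompactOnePage()) ++helped;
    }
    size.fetch_add(1);
    return helped;
  }
};
```

Below the hard limit, `Upsert` never touches compaction state, so in this sketch the common write path stays unchanged.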
We would like to keep the typedefs, even if unused, for clarity
purposes.