Zero-Copy Serialization/Deserialization #5

Open
somethingelseentirely opened this issue May 19, 2024 · 15 comments
Labels
enhancement New feature or request

Comments

@somethingelseentirely

somethingelseentirely commented May 19, 2024

The pointer-free nature of succinct data-structures makes them very amenable to (de)serialization by simply casting their memory to/from a bunch of bytes.

Not only would this remove most (de)serialization costs, it could also enable very fast and simple on-disk storage when combined with mmap.

One might want to implement this via rkyv, but simply providing a safe transmute to and from bytes::Bytes (with a potential rkyv implementation on top of that) might be the simpler, more agnostic solution.
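A minimal sketch of what such a cast could look like for a plain `&[u64]`, using only the standard library. The function names `words_as_bytes` and `bytes_as_words` are made up for illustration and are not part of this crate or of `bytes`. The reverse direction has to check alignment and length, which is the part a safe API would need to enforce.

```rust
use std::mem::{align_of, size_of, size_of_val};

/// View a slice of words as raw bytes without copying.
/// (Hypothetical helper for this sketch.)
fn words_as_bytes(words: &[u64]) -> &[u8] {
    // SAFETY: `u64` has no padding, every byte of it is a valid `u8`,
    // and the returned slice borrows from `words` with the same lifetime.
    unsafe { std::slice::from_raw_parts(words.as_ptr().cast::<u8>(), size_of_val(words)) }
}

/// View raw bytes as words without copying.
/// Returns `None` if the buffer is misaligned or not a whole number of words.
fn bytes_as_words(bytes: &[u8]) -> Option<&[u64]> {
    let aligned = bytes.as_ptr() as usize % align_of::<u64>() == 0;
    let whole = bytes.len() % size_of::<u64>() == 0;
    if !(aligned && whole) {
        return None;
    }
    // SAFETY: alignment and length were checked above, and every bit
    // pattern is a valid `u64`.
    Some(unsafe {
        std::slice::from_raw_parts(bytes.as_ptr().cast::<u64>(), bytes.len() / size_of::<u64>())
    })
}
```

The bytes produced this way are in native byte order, so a file written on one platform is only directly readable on platforms with the same endianness.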

@Cydhra
Owner

Cydhra commented May 19, 2024

rkyv looks good, and adding that as an optional dependency (because I am quite keen on keeping it zero-dependencies) might be an option. I'll look further into it, because it seems like a nice addition.
Doing it myself by just casting raw data sounds painful though, because while rkyv offloads endianness somewhere into the frontend (i.e. the user has to decide what to do, if I read it correctly), I'd have to handle that if I implement it myself (even if I do the same, I still have to keep it in mind). Maybe I am overthinking that though, I don't know.

@Cydhra Cydhra added enhancement New feature or request good first issue Good for newcomers labels May 19, 2024
@Cydhra
Owner

Cydhra commented May 19, 2024

When adding a new serialization framework, it's worth thinking about the serialization-breaking change of reducing stack size by changing Vec into Box in immutable data structures.

@somethingelseentirely
Author

Endianness is a good point. But I think rkyv handles it somewhat ungracefully, by using feature flags, and I'm not sure what happens when you use two libraries that transitively use both archive_be and archive_le.

@Cydhra
Owner

Cydhra commented May 20, 2024

No, using features is actually convenient for me, because I just disable all features, and let the downstream crate decide.

@somethingelseentirely
Author

somethingelseentirely commented May 20, 2024

I think the same late binding could be achieved with generic parameters though? Without the conflict problem where different pieces of code want different endianness.
To give an example, my use case requires the ability to reproducibly produce these archives in a bit-perfect manner, so that they can be hashed/checksummed. I could imagine a scenario where most of the system actually wants to go with native endianness for performance, but goes with big endian for those particular data structures.
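A rough sketch of that late binding, assuming a hypothetical `ByteOrder` trait with two marker types (none of these names exist in this crate or in rkyv). Because the byte order is a type parameter rather than a crate feature, two call sites in the same program can pick different orders without conflicting:

```rust
/// Hypothetical trait: the byte order is selected per call site
/// through a type parameter instead of a crate-wide feature flag.
trait ByteOrder {
    fn encode(word: u64) -> [u8; 8];
    fn decode(bytes: [u8; 8]) -> u64;
}

struct BigEndian;
struct LittleEndian;

impl ByteOrder for BigEndian {
    fn encode(word: u64) -> [u8; 8] {
        word.to_be_bytes()
    }
    fn decode(bytes: [u8; 8]) -> u64 {
        u64::from_be_bytes(bytes)
    }
}

impl ByteOrder for LittleEndian {
    fn encode(word: u64) -> [u8; 8] {
        word.to_le_bytes()
    }
    fn decode(bytes: [u8; 8]) -> u64 {
        u64::from_le_bytes(bytes)
    }
}

/// Serialize words in byte order `E`. The output depends only on `E`,
/// not on the platform, so it can be hashed reproducibly.
fn archive<E: ByteOrder>(words: &[u64]) -> Vec<u8> {
    words.iter().flat_map(|&w| E::encode(w)).collect()
}
```

Here `archive::<BigEndian>(..)` and `archive::<LittleEndian>(..)` can coexist in one binary, which is the situation the feature-flag approach makes awkward.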

@Cydhra
Owner

Cydhra commented May 20, 2024

I mean, as long as you don't import serialized data from systems with opposite endianness, using only native endianness shouldn't create any issues, no?

@somethingelseentirely
Author

somethingelseentirely commented May 20, 2024

Well, if you're using the (de)serialization as a way to create a data-exchange/file format (think .jpg, not an application-instance-specific .dat), then that format will want to decide on some endianness. In my case it's a file format for knowledge graphs, with the added bonus that you can query it without having to build any indexes first, just mmap and go. So it's always going to be big endian.

The Stable Cross-Platform Database File section in the SQLite documentation is probably the best description of that use case. Avoiding breaking changes caused by the way rkyv stores things is also an argument for rolling our own framework agnostic data layout.

Edit: Btw it's also completely fine if such a use-case doesn't align with the project goals 😄

@Cydhra
Owner

Cydhra commented May 20, 2024

Okay, I see where you are coming from, but exporting into pre-defined file formats using zero-copy serialization seems difficult.

For example, let's pretend you write a Wavelet Matrix that looks like the database file and can be directly serialized and deserialized to and from that format. The rank/select data structures still need to be written to file, if zero-copy deserialization is a goal, which will interleave the data format with junk data.
To solve that, the helper data structures would need to be excluded from the Archive format, but then the deserialization process needs to recalculate them, which negates any speedups gained from the zero-copy process.

@somethingelseentirely
Author

somethingelseentirely commented May 20, 2024

I'm not sure I'm following. Do you mean undefined data caused by padding / uninitialised memory? Rkyv for example zeroes these to get determinism and also avoid accidental memory data leakage.

Rkyv has an unsafe way to deserialize without any checks btw, but the default is having a validation/sanity-check step on read. So it's not just transmute and go.
I also think that it would be ok to buy cheap deserializability with a more expensive serialization step that does sanitization and cleanup, such as zeroing memory, as I think that most use cases for something like this are write-once-read-many.

On a more philosophical level and ignoring the more difficult mutable/growable succinct datastructures for now, I feel that the static/write-once-ness of most succinct data-structures makes them an interesting special case. I'm not certain myself where I would pinpoint "serialization" in their lifecycle; Is it at the point they are constructed from their mutable "builder" precursor, or is it at the point where one actually calls serialize?

Similarly, having sanitization performed only at serialization might be worthwhile, so that people who don't care about serialization don't have to pay for it, but on the other hand it might actually be cheaper to initialize the datastructure in a sanitized state, e.g. by calling alloc_zeroed.
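To illustrate the "initialize in a sanitized state" option: a small sketch of a hypothetical `pack_bits` builder (not this crate's API) that writes into zero-initialised words, so the unused high bits of the last word are always zero and two builds of the same input are byte-identical. The bit numbering used here (least significant bit first) is an assumption of the sketch.

```rust
/// Pack bits into 64-bit words, starting from zeroed memory so that the
/// unused tail of the last word is deterministic. (Hypothetical helper.)
fn pack_bits(bits: &[bool]) -> Vec<u64> {
    // `vec![0; n]` yields fully zero-initialised storage; no byte of the
    // buffer is ever left uninitialised.
    let mut words = vec![0u64; (bits.len() + 63) / 64];
    for (i, &bit) in bits.iter().enumerate() {
        if bit {
            // Assumption for this sketch: bit `i` lives at position
            // `i % 64`, counted from the least significant bit.
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}
```

With this approach nothing needs to be cleaned up at serialization time, at the price of touching the memory once during construction.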

@Cydhra
Owner

Cydhra commented May 20, 2024

> Do you mean undefined data caused by padding / uninitialised memory?

No, for example RsVec, the bit vector that supports rank and select (which enables Wavelet Trees), looks like this:

pub struct RsVec {
    data: Vec<u64>,
    len: usize,
    blocks: Vec<BlockDescriptor>,
    super_blocks: Vec<SuperBlockDescriptor>,
    select_blocks: Vec<SelectSuperBlockDescriptor>,
    rank0: usize,
    rank1: usize,
}

And I assume your data format only wants the data bits, but does not take the super_blocks and select_blocks into account (because why would the data format specify those). So if you serialize the WaveletTree into your data format, the actual bits in the vector (data) will sit next to all the bits of the supporting structures (blocks, super_blocks, select_blocks), which don't belong in your target format.

There are actually even more problems to this approach:

  • data is a vector of u64. So if you export it to big endian from a little endian platform, your data must be rearranged during serialization and deserialization, which destroys the zero-copy property.
  • this can be solved if the data is transformed into u8 and then stored, because now endianness doesn't matter anymore. However, this cannot be done with blocks, super_blocks, and select_blocks, because those aren't bit vector structures.
  • if the data is transformed into u8, you cannot transform it back into RsVec without copying, so all operations that work on RsVec need to be reimplemented for ArchivedRsVec (which is the one you want to serialize), but now the data looks different

So all in all, this is unfortunately not an easy issue: a lot of code needs to be written to support rkyv with a big endian serializer, and the operations need to be implemented manually in a way that lets you call them on both RsVec and ArchivedRsVec.
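The first two bullets above can be made concrete with a small sketch. It assumes bits are numbered from the least significant bit of each word or byte, which may not match RsVec's actual layout, and all function names are hypothetical. Under that assumption a byte-addressed bit vector lines up exactly with the little endian serialization of the words, while the big endian serialization puts the same bits at different positions:

```rust
/// Read bit `i` from 64-bit words (LSB-first numbering, an assumption).
fn bit_in_words(words: &[u64], i: usize) -> bool {
    (words[i / 64] >> (i % 64)) & 1 == 1
}

/// Read bit `i` from a byte buffer (LSB-first numbering within each byte).
fn bit_in_bytes(bytes: &[u8], i: usize) -> bool {
    (bytes[i / 8] >> (i % 8)) & 1 == 1
}

/// Serialize words as little endian bytes.
fn to_le_bytes(words: &[u64]) -> Vec<u8> {
    words.iter().flat_map(|w| w.to_le_bytes()).collect()
}

/// Serialize words as big endian bytes.
fn to_be_bytes(words: &[u64]) -> Vec<u8> {
    words.iter().flat_map(|w| w.to_be_bytes()).collect()
}
```

So reading big endian words back on a little endian machine needs a byte swap per word somewhere, either once up front (a copy) or on every access.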

@Cydhra Cydhra removed the good first issue Good for newcomers label May 20, 2024
@somethingelseentirely
Author

somethingelseentirely commented May 20, 2024

I would question that assumption and ask why shouldn't they be included? They are an essential component of the datastructure and enable the nice time complexities.
So not storing them would be a bit like storing only the leaves of a B-Tree.
More importantly, since succinct datastructures have a space complexity of ${Z+o(Z)}$, that support data is "asymptotically free", i.e. $\lim {support \over data} = 0$, and it costs "nothing" to store it.
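As a textbook illustration of that limit (this is the classic bound for a rank directory over a bit vector of $Z$ bits, not necessarily the overhead of this crate's implementation):

$$
\frac{\text{support}}{\text{data}} = \frac{O\!\left(Z \, \frac{\log\log Z}{\log Z}\right)}{Z} = O\!\left(\frac{\log\log Z}{\log Z}\right) \longrightarrow 0 \quad (Z \to \infty)
$$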

The only pitfall is that the support data should be deterministic.

Your points are related to what I tried to express with my previous musings, wondering at which moment the serialization happens.
I don't think that a separation between ArchivedRsVec and RsVec, as Rkyv suggests, is useful as they both have the same immutable-write-once property.
In my mind RsVec is more analogous to an ArchivedBitVec, since a writable BitVec is serialized/archived into a read-only RsVec, with an analogous deserialization from RsVec to BitVec.

With that line of thought, I would make RsVec (but not BitVec) parametric over the endianness <E>, always store E integers in the RsVec's memory/support structures, and have every integer access go through the appropriate conversion.

I suspect this isn't as expensive as it sounds:

  • on platforms where the parameter is equal to the native type that conversion should simply get compiled away
  • "embedded" architectures like ARM can actually load data in either endianness (due to their common application in networking devices)
  • in the cases where the CPU has to actually perform a conversion (e.g. my data format on an x86), I'm doubtful that the added BSWAP instruction will actually be measurable, given that all the operations are heavily memory bound
  • having everything munged together into a single contiguous allocation without indirections and bounds checks would probably be faster than the current std::Vecs (albeit [a lot?] less safe)
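A sketch of the "every integer access goes through a conversion" idea, under the assumption of a hypothetical `Endian` trait and a `Stored<E>` wrapper (neither exists in this crate). The `rank1` here is a naive linear scan, used only to show where the conversion happens, not the library's block-based algorithm:

```rust
use std::marker::PhantomData;

/// Hypothetical trait describing the byte order of stored integers.
trait Endian {
    fn to_native(raw: u64) -> u64;
    fn from_native(value: u64) -> u64;
}

struct Be;
struct Le;

impl Endian for Be {
    fn to_native(raw: u64) -> u64 { u64::from_be(raw) }
    fn from_native(value: u64) -> u64 { value.to_be() }
}

impl Endian for Le {
    fn to_native(raw: u64) -> u64 { u64::from_le(raw) }
    fn from_native(value: u64) -> u64 { value.to_le() }
}

/// A word kept in memory in byte order `E`. Same size and layout as `u64`.
#[repr(transparent)]
struct Stored<E: Endian>(u64, PhantomData<E>);

impl<E: Endian> Stored<E> {
    fn new(value: u64) -> Self {
        Stored(E::from_native(value), PhantomData)
    }
    /// Convert on access; a no-op when `E` matches the platform.
    fn get(&self) -> u64 {
        E::to_native(self.0)
    }
}

/// Naive rank: number of set bits in positions `[0, i)`, LSB-first.
fn rank1<E: Endian>(words: &[Stored<E>], i: usize) -> usize {
    let full: usize = words[..i / 64]
        .iter()
        .map(|w| w.get().count_ones() as usize)
        .sum();
    let rem = i % 64;
    if rem == 0 {
        full
    } else {
        let mask = (1u64 << rem) - 1;
        full + (words[i / 64].get() & mask).count_ones() as usize
    }
}
```

`u64::from_be` and `u64::from_le` are documented as no-ops on platforms whose native order already matches, which corresponds to the first bullet above.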

And yes, this would probably be a major rewrite of RsVec, or create a new version next to it.

@Cydhra
Owner

Cydhra commented May 20, 2024

> I would question that assumption and ask why shouldn't they be included?

Perhaps I still don't understand what you are doing, but it sounded like you want the data structure to serialize into an existing format (you mentioned jpg as an example for a specced format, and I assume you mean a database index format).

Obviously, the support data structures are proprietary and thus do not adhere to any existing format specification, hence my concerns.

But now it seems you don't want to do that.

> I don't think that a separation between ArchivedRsVec and RsVec, as Rkyv suggests, is useful as they both have the same immutable-write-once property.

This is not the reason why rkyv suggests this pattern. Zero-copy can only be achieved if the data structure is immutable and contains neither pointers nor allocating data structures (like Vec).

RsVec is immutable, but it does contain pointers (and currently it also contains Vecs).

  • A Vec cannot be constructed by zero-copy deserialization, because it has to own the data (which it will free when dropped), and it also wants to manage its own allocation. A pointer creates problems during deserialization, because it has to be stored as a relative offset, but when recreating the data structure, it has to be turned into an actual pointer again (so even more work besides mmap).
  • Box can also not be created by zero-copy deserialization because it, too, has to own the data.
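One way around the ownership problem is a borrowed view that never owns or frees the buffer and finds its parts via offsets relative to the start of the buffer. The sketch below assumes an invented layout (`[len_bits][data_words][data ...][support ...]`, all little endian) and an invented `RsVecView` type; it is not the layout of RsVec or of rkyv. Words are decoded on access, so the buffer does not need to be aligned:

```rust
use std::convert::TryInto;

/// Borrowed view over an archive. Hypothetical layout, all integers
/// little endian: [len_bits: u64][data_words: u64][data ...][support ...]
struct RsVecView<'a> {
    len: usize,
    data: &'a [u8],
    support: &'a [u8],
}

/// Decode the `index`-th little endian u64 of `bytes`, if it is in bounds.
fn read_u64(bytes: &[u8], index: usize) -> Option<u64> {
    let start = index.checked_mul(8)?;
    let chunk = bytes.get(start..start.checked_add(8)?)?;
    Some(u64::from_le_bytes(chunk.try_into().ok()?))
}

impl<'a> RsVecView<'a> {
    /// Validate the header and split the buffer; nothing is copied.
    fn new(bytes: &'a [u8]) -> Option<Self> {
        let len = usize::try_from(read_u64(bytes, 0)?).ok()?;
        let data_words = usize::try_from(read_u64(bytes, 1)?).ok()?;
        if len > data_words.checked_mul(64)? {
            return None;
        }
        // Offsets are relative to the start of the buffer, so the view
        // works wherever the buffer happens to be mapped.
        let data_end = data_words.checked_mul(8)?.checked_add(16)?;
        let data = bytes.get(16..data_end)?;
        let support = bytes.get(data_end..)?;
        Some(Self { len, data, support })
    }

    /// Decode one data word on access.
    fn word(&self, i: usize) -> Option<u64> {
        read_u64(self.data, i)
    }
}
```

The lifetime `'a` ties the view to the underlying buffer (for example a memory map), which is what a `Vec` or `Box` field cannot express.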

Finally, the largest hurdle that ArchivedRsVec tries to overcome is, again, endianness:

If you want to store the vector as big endian and then mmap it into memory, you cannot recreate a RsVec with zero-copy deserialization from it, because RsVec must use the native endianness (which commonly is little endian). So to avoid a copy, you have to keep the data in ArchivedRsVec and handle the endianness by duplicating the functionality of RsVec for that struct.

Edit: I am not saying it is impossible btw, I am just saying it involves major refactoring of the entire code base, and a lot of efforts to keep the efficiency (because having more indirections would suck)

@somethingelseentirely
Author

somethingelseentirely commented May 20, 2024

> Perhaps I still don't understand what you are doing, but it sounded like you want the data structure to serialize into an existing format (you mentioned jpg as an example for a specced format, and I assume you mean a database index format).

No I'm just trying to (de)serialize a bunch of wavelet matrices, but for my own to-be-spec-ed database index format similar to HDT.
The JPG analogy was meant to clarify that it is meant as a standardized interchange format, and not an ad-hoc format that can change with every release, but it's custom nevertheless.

> Obviously, the support data structures are proprietary and thus do not adhere to any existing format specification, hence my concerns.

Fair point, but I feel like I would have that problem with any implementation, even if I managed to find the most textbook DArray implementation out there. The alternative would be to build my own, but then I'd be having the same problem on top of reinventing the wheel. So it's easier to just go with something that works and then write that into a spec.

I figured, given that I'll have this problem regardless, that I'm just gonna go with the library made in Germany™️ (I'm from Bremen), then it'll at least look nice if we ever co-author a paper, and might give Rust more street-cred in Germany. 🤣

> This is not the reason why rkyv suggests this pattern. Zero-copy can only be achieved if the data structure is both immutable, and does not contain pointers, and no allocating data structures (like Vec).

My point was that from an API/user perspective RsVec behaves like an archived version of BitVec, regardless of its current internal implementation (which would require an additional archival step to get rid of the absolute pointers and to coalesce the allocations).

> A Vec cannot be constructed by zero-copy deserialization

Yeah when I said "munged together into a single contiguous allocation" I meant that the reworked implementation would get rid of the Vecs and manage the memory itself.

> Finally, the largest hurdle that ArchivedRsVec tries to overcome is, again, endianness:

I was assuming (based on a quick glance and the way other implementations work) that the implementation uses SIMD only to popcnt the bytes of the lowest block/chunk level, which shouldn't be affected by endianness? In which case doing the conversion on the fly for the other levels should have been feasible, but I should probably dig into the code to see where problems might arise.
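A quick way to check that intuition in isolation: counting the set bits of a whole word gives the same result regardless of byte order, but a rank inside a word (masking off the low `n` bits first) does not, so that step would need the conversion. `rank_in_word` is a hypothetical helper that assumes LSB-first bit numbering; it is not taken from the code base.

```rust
/// Number of set bits among the lowest `n` bits of `word` (n < 64).
/// Hypothetical helper, LSB-first numbering assumed.
fn rank_in_word(word: u64, n: u32) -> u32 {
    debug_assert!(n < 64);
    (word & ((1u64 << n) - 1)).count_ones()
}

/// Whole-word popcount is unaffected by reordering the bytes.
fn popcount_ignores_byte_order(word: u64) -> bool {
    word.count_ones() == word.swap_bytes().count_ones()
}
```

For example, for the word `0xFF` the lowest eight bits are all set, while its byte-swapped form has those bits in the top byte, so the in-word rank of the first eight positions differs even though the total popcount is identical.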

> Edit: I am not saying it is impossible btw, I am just saying it involves major refactoring of the entire code base, and a lot of efforts to keep the efficiency (because having more indirections would suck)

Sure sure, my hunch is that it might be a zero sum game, some perf improvements from removing vector bounds checks, some slowdown from having to do some on the fly endianness conversion. But I'll probably know more once I've properly read through the entire codebase. I'm aware that it would take some major rework. I'm also definitely not asking you to do it, I'd do it myself as part of the wavelet matrix. 😄

@Cydhra
Owner

Cydhra commented May 21, 2024

I pushed some changes to a new branch dev_zero_copy.

All functionality of RsVec is now moved to traits that abstract over the data layout. This way it should be possible to generate an ArchivedRsVec (gated behind a crate-feature) that also implements the trait.

As it stands, this constitutes a breaking change, because you now have to import the trait to access methods on RsVec. This could be fixed by renaming the trait methods and then providing convenience functions in RsVec.
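A rough sketch of what such a trait-based split could look like. All names here (`RankSupport`, `OwnedBits`, `ArchivedBits`) are invented for illustration and are not the ones used on the dev_zero_copy branch, and the rank is a naive scan rather than the real block-based implementation. The inherent method at the end shows one way callers could keep using `rank1` without importing the trait:

```rust
/// Hypothetical trait abstracting over how words are stored.
trait RankSupport {
    fn len(&self) -> usize;
    fn word(&self, i: usize) -> u64;

    /// Shared algorithm, written once against the accessors above.
    fn rank1(&self, pos: usize) -> usize {
        let pos = pos.min(self.len());
        let mut count = 0;
        for i in 0..pos / 64 {
            count += self.word(i).count_ones() as usize;
        }
        let rem = pos % 64;
        if rem != 0 {
            let mask = (1u64 << rem) - 1;
            count += (self.word(pos / 64) & mask).count_ones() as usize;
        }
        count
    }
}

/// Owned layout: native-endian words in a Vec.
struct OwnedBits { data: Vec<u64>, len: usize }

/// Archived layout: borrowed big endian bytes, decoded on access.
struct ArchivedBits<'a> { bytes: &'a [u8], len: usize }

impl RankSupport for OwnedBits {
    fn len(&self) -> usize { self.len }
    fn word(&self, i: usize) -> u64 { self.data[i] }
}

impl RankSupport for ArchivedBits<'_> {
    fn len(&self) -> usize { self.len }
    fn word(&self, i: usize) -> u64 {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(&self.bytes[i * 8..i * 8 + 8]);
        u64::from_be_bytes(buf)
    }
}

impl OwnedBits {
    /// Inherent convenience method forwarding to the trait, so existing
    /// callers do not need to import `RankSupport`.
    pub fn rank1(&self, pos: usize) -> usize {
        RankSupport::rank1(self, pos)
    }
}

/// Generic code works with either layout.
fn count_ones<T: RankSupport>(bits: &T) -> usize {
    bits.rank1(bits.len())
}
```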

@somethingelseentirely
Author

Awesome, I'll check it out asap!

@Cydhra Cydhra mentioned this issue Jul 13, 2024
6 tasks