Zero-Copy Serialization/Deserialization #5
rkyv looks good, and adding that with an optional dependency (because I am quite keen on keeping it zero-dependencies) might be an option. I'll look further into it, because it seems like a nice addition.
When adding a new serialization framework, it's worth thinking about the serialization-breaking change of reducing stack size by changing
Endianness is a good point. But I think rkyv handles it somewhat ungracefully, by using feature flags, and I'm not sure what happens when you use two libraries that transitively use both archive_be and archive_le.
No, using features is actually convenient for me, because I just disable all features, and let the downstream crate decide.
I think the same late binding could be achieved with generic parameters though? Without the conflict problem where different pieces of code want different endianness.
I mean, as long as you don't import serialized data from systems with opposite endianness, using only native endianness shouldn't create any issues, no?
Well, if you're using the (de)serialization as a way to create a data-exchange/file format (think .jpg, not application-instance-specific .dat), then that format will want to decide on some endianness. In my case it's a file format for knowledge graphs, with the added bonus that you can query it without having to build any indexes first, just mmap and go. So it's always going to be big-endian. The Stable Cross-Platform Database File section in the SQLite documentation is probably the best description of that use case. Avoiding breaking changes caused by the way rkyv stores things is also an argument for rolling our own framework-agnostic data layout.

Edit: Btw it's also completely fine if such a use-case doesn't align with the project goals 😄
Okay, I see where you are coming from, but exporting into pre-defined file formats using zero-copy serialization seems difficult. For example, let's pretend you write a Wavelet Matrix that looks like the database file format and can be directly serialized and deserialized to and from that format. The rank/select data structures still need to be written to the file if zero-copy deserialization is a goal, which will interleave the data format with junk data.
I'm not sure I'm following. Do you mean undefined data caused by padding / uninitialised memory? Rkyv for example zeroes these to get determinism and also to avoid accidental memory data leakage. Rkyv has an unsafe way to deserialize without any checks btw, but the default is having a validation/sanity-check step on read. So it's not just transmute and go.

On a more philosophical level, and ignoring the more difficult mutable/growable succinct data structures for now, I feel that the static/write-once-ness of most succinct data structures makes them an interesting special case. I'm not certain myself where I would pinpoint "serialization" in their lifecycle; is it at the point they are constructed from their mutable "builder" precursor, or is it at the point where one actually calls

Similarly, having sanitisation performed only at serialization might be worthwhile, so that people who don't care about serialization don't have to pay for that, but on the other hand it might actually be cheaper to initialize the data structure in a sanitized state, e.g. by calling
No, for example

```rust
pub struct RsVec {
    data: Vec<u64>,
    len: usize,
    blocks: Vec<BlockDescriptor>,
    super_blocks: Vec<SuperBlockDescriptor>,
    select_blocks: Vec<SelectSuperBlockDescriptor>,
    rank0: usize,
    rank1: usize,
}
```

And I assume your data format only wants the

There are actually even more problems to this approach:
So all in all, this is unfortunately not an easy issue: a lot of code needs to be written to support rkyv with a big-endian serializer, and the operations need to be manually implemented in a way that lets you call them on
I would question that assumption and ask why they shouldn't be included. They are an essential component of the data structure and enable the nice time complexities. The only pitfall is that the support data should be deterministic.

Your points are related to what I tried to express with my previous musings, wondering at which moment the serialization happens. With that line of thought, I would make

I suspect this isn't as expensive as it sounds:
And yes, this would probably be a major rewrite of
Perhaps I still don't understand what you are doing, but it sounded like you want the data structure to serialize into an existing format (you mentioned jpg as an example for a specced format, and I assume you mean a database index format). Obviously, the support data structures are proprietary and thus do not adhere to any existing format specification, hence my concerns. But now it seems you don't want to do that.
This is not the reason why
Finally, the largest hurdle that

If you want to store the vector as big endian and then

Edit: I am not saying it is impossible btw, I am just saying it involves major refactoring of the entire code base, and a lot of effort to keep the efficiency (because having more indirections would suck)
No, I'm just trying to (de)serialize a bunch of wavelet matrices, but for my own to-be-specced database index format similar to HDT.
Fair point, but I feel like I would have that problem with any implementation, even if I managed to find the most textbook

I figured, given that I'll have this problem regardless, that I'm just gonna go with the library made in Germany™️ (I'm from Bremen), then it'll at least look nice if we ever co-author a paper, and might give Rust more street-cred in Germany. 🤣
My point was that from an API/user perspective
Yeah, when I said "munged together into a single contiguous allocation" I meant that the reworked implementation would get rid of the
I was assuming (based on a quick glance and the way other implementations work) that the implementation uses SIMD only to
Sure, sure, my hunch is that it might be a zero-sum game: some perf improvements from removing vector bounds checks, some slowdown from having to do on-the-fly endianness conversion. But I'll probably know more once I've properly read through the entire codebase. I'm aware that it would take some major rework. I'm also definitely not asking you to do it, I'd do it myself as part of the wavelet matrix. 😄
I pushed some changes to a new branch

All functionality of

As it stands, this constitutes a breaking change, because you now have to import the trait to access methods on
Awesome, I'll check it out asap!
The pointer-free nature of succinct data-structures makes them very amenable to (de)serialization by simply casting their memory to/from a bunch of bytes.
Not only would this remove most (de)serialization costs, it could also enable very fast and simple on-disk storage when combined with mmap.
One might want to implement this via rkyv, but simply providing a safe transmute to and from `bytes::Bytes` (with a potential rkyv implementation on top of that) might be the simpler, more agnostic solution.