Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WaveletMatrix #4

Open
somethingelseentirely opened this issue May 19, 2024 · 8 comments
Open

WaveletMatrix #4

somethingelseentirely opened this issue May 19, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@somethingelseentirely
Copy link

somethingelseentirely commented May 19, 2024

It would be nice if there was a WaveletMatrix implementation, as it would enable the most common succinct-datastructure applications in text-processing and databases.

@Cydhra
Copy link
Owner

Cydhra commented May 19, 2024

I never used or worked with those, but from skimming the paper, it seems a lot less painful than my attempts at implementing succinct trees. Adding a feature that actually has use cases and is easier to implement definitely sounds more fun.
I am pretty tied up right now, but when I eventually have some free time again, I'll give it a shot.
Do sucds and cseq implement Wavelet Trees or Matrices?

@Cydhra Cydhra added the enhancement New feature or request label May 19, 2024
@somethingelseentirely
Copy link
Author

Yeah, they are both implementing WaveletMatrices. I think the main insight for wavelet matrices comes from the fact that the total number of bits in each level stays constant for a balances wavelet tree (this invariant doesn't hold for compressed versions like huffman trees), which makes them also a lot easier to implement.

I'd be happy to also give an implementation a shot, but I'd probably aready try to build it with zerocopy deserializability in mind (#5) if that's ok with you. Because I need this for an on-disk db storage format.

@Cydhra
Copy link
Owner

Cydhra commented May 20, 2024

Yeah feel free to. Zero copy deserializability is a feature that I'd like, so I don't think there is a reason to keep the two issues disjoint.

@Cydhra
Copy link
Owner

Cydhra commented Jul 13, 2024

Implementing it with zero-copy deserializability in mind is obviously possible, but not really necessary. While it is true, that the data structures are immutable, they do not freely lend themselves to zero-copy (de)serialization.
Zero-copy deserialization has to mind the endianess of the backing system. That is, each zero copy data structure needs to have multiple versions which handle either endianness, with different numeric types. Mapping, for example, a big-endian encoded file into a little-endian machine works, but the data types must be handled differently because of the inverted byte order. In contrast, converting the big-endian structure into little-endian on the fly is also possible, but destroys the zero-copy property. We cannot just write a data structure that is already the archived version, which rkyv usually provides, because of that.

I started implementing a WaveletMatrix on the branch dev_wavelet. The data structure, just like the other data structures in this library is agnostic to zero-copy deserialization. If zero-copy serialization were to be implemented for it, more data types that represent either endianness (among other things like #[repr(C)]) need to be added. The only thing that is possible to already do here, is to add the functionality of the WaveletMatrix via a trait, to later reuse code for the zero copy versions.

Zero copy versions of other data structures, as outlined in #5, need these traits too, inducing potentially breaking changes.

@Cydhra Cydhra mentioned this issue Jul 16, 2024
9 tasks
@somethingelseentirely
Copy link
Author

somethingelseentirely commented Jul 21, 2024

Thanks for pushing this! I carry the wavelet matrix paper around with me in my backpack, but there were other fires that popped up that need to be put out first. 😓

I've also recently encountered the endianness issue while working with candle and model weight storage formats like SafeTensor. The approach of all Rust libraries and storage formats I've encountered seems to be to just assume that all of the relevant platforms are/support little-endian these days.

I would still want to make sure that that is the case and add something like:

#[cfg(not(target_endian = "little"))]
compile_error!("Zero copy is only supported on LE architectures.)

But after some philosophical wrangling with my inner perfectionist, I've concluded that this is a valid approach 😅. There are few BE-only systems supported by Rust, and they are all embedded systems that don't have SIMD and maybe don't even have a file system let alone the ability to mmap files.

Edit:
To expand a bit on how this should make zero copy pretty easy to implement. Given that the data-structures are already immutable, they would only need to be repr(C) and derive(FromBytes, FromZeroes, AsBytes) from the zerocopy crate. Due to native number types already being the correct endian thanks to the above check, they don't need to use the byteorder types, and can just use rusts native number types.

@Cydhra
Copy link
Owner

Cydhra commented Jul 22, 2024

To expand a bit on how this should make zero copy pretty easy to implement. Given that the data-structures are already immutable, they would only need to be repr(C) and derive(FromBytes, FromZeroes, AsBytes) from the zerocopy crate.

Maybe it is easier when using zerocopy rather than rkyv. However, there is unsafe code in vers that may or may not play nicely with the safety assumptions of zerocopy.

I still strongly prefer to have this issue decoupled from zero-copying (despite what I said two months ago), but I'd be happy to see a proof of concept with zerocopy.

@somethingelseentirely
Copy link
Author

Definitely fair. With zerocopy there should be virtually no considerations to make while implementing this, and the struct can be made zero-copy later on. I think the only limitation is that once zero copy is implemented the struct's can't be changed anymore/need to be versioned, but that's not really relevant for implementing the wavelet matrix itself.

@Cydhra
Copy link
Owner

Cydhra commented Jul 22, 2024

That is unfortunately already true because of serde (I think). Changes that need to be made are collected in #9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants