Added optional serde support, including an example in the README #59
base: master
Conversation
Looks like this ought to work, but there is the question of the serialization format. How were you planning to use serialization? Just to get an idea of common usage.
```diff
@@ -70,6 +75,7 @@ use util::*;
 /// Note, a `BitSet` is limited by design to only `usize**4` indices.
 /// Adding beyond this limit will cause the `BitSet` to panic.
 #[derive(Clone, Debug, Default)]
+#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
```
Yeah, deriving should work.
However, it is possible to manually implement the serialization and deserialization so that they use only `layer0`. This would make the serialized result use less memory and make serialization faster, but possibly slow deserialization down. When deserializing we have to do operations for all layers in either implementation, and IO is usually the more expensive part, so the custom implementation would most likely be faster in general.
Which one to use should be decided before publishing a new version, as changing the serialization format is a breaking change.
The memory usage difference: with the derive we use `O(n log_b n)` memory (where `b` is the number of bits in `usize`), and with the custom implementation we use `O(n)` memory.
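To make the difference concrete, here is a quick arithmetic sketch (standalone code, not hibitset internals) of the per-layer word counts for a 100,000-bit set with 64-bit words:

```rust
// Rough per-layer word counts for a 100_000-bit set with 64-bit words.
// Plain arithmetic sketch, not hibitset internals.
fn ceil_div(a: usize, b: usize) -> usize {
    (a + b - 1) / b
}

fn layer_sizes(bits: usize) -> (usize, usize, usize) {
    let layer0 = ceil_div(bits, 64); // words holding the actual bits
    let layer1 = ceil_div(layer0, 64); // one bit per layer0 word
    let layer2 = ceil_div(layer1, 64); // one bit per layer1 word
    (layer0, layer1, layer2)
}

fn main() {
    let (l0, l1, l2) = layer_sizes(100_000);
    // layer0 dominates: 1563 words (~12.5 KB); the upper layers add
    // well under 2% on top of that.
    println!("layer0: {l0} words (~{} bytes)", l0 * 8);
    println!("layer1: {l1} words, layer2: {l2} word(s)");
}
```

This shows why the asymptotic gap is a small constant factor in practice: the upper layers shrink geometrically by a factor of 64.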
Probably best to benchmark the difference in speed if we want to have the custom implementation.
This is totally up to your team; no doubt a custom serializer would be more compact and faster. In my use case I have a bitset of 100K bits, and the serialized size is not as compact as I'd like, about the same as a test with `fixedbitset` (12K). If there is a way to get it much smaller I would be interested, and if you can guide me on that I may have time to implement it.
So how the data structure works is that `layer0` is essentially `fixedbitset`: if we want to access the `n`-th bit, we find which `usize` in the `Vec` it belongs to and then mask the right bit out of it.
Where `hibitset` differs is that it has layers above it. The idea is that if there is a `1` in an upper layer at some position, then the layer below contains `1`s in the corresponding `usize`. This means that if the `n`-th bit is `1` in `layer0`, then `layer1` will have a `1` at position `floor(n/64)`, `layer2` a `1` at position `floor(n/64^2)`, and so on.
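The indexing described above can be sketched with a couple of hypothetical helpers (not hibitset's actual API), assuming a 64-bit `usize`:

```rust
// Sketch of the layer indexing described above. These helpers are
// hypothetical illustrations, not part of hibitset's API.
const BITS: usize = 64; // assumes a 64-bit usize

/// For bit index `n` in layer0, return the word index in the Vec and
/// the mask used to test that bit (the fixedbitset-style lookup).
fn word_and_mask(n: usize) -> (usize, usize) {
    (n / BITS, 1 << (n % BITS))
}

/// Positions of the `1` that bit `n` implies in layers 1, 2 and 3:
/// floor(n/64), floor(n/64^2), floor(n/64^3).
fn upper_layer_positions(n: usize) -> [usize; 3] {
    [n / BITS, n / BITS.pow(2), n / BITS.pow(3)]
}

fn main() {
    let n = 72;
    let (word, mask) = word_and_mask(n);
    println!("bit {n}: word {word}, mask {mask:#x}");
    println!("upper layer positions: {:?}", upper_layer_positions(n));
}
```

For example, bit 72 lives in word 1 of `layer0` (mask `1 << 8`), and implies a `1` at position 1 of `layer1` and position 0 of the layers above.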
What all this means is that in serialization we can just serialize `layer0` as is (so the serialized form is pretty much the same as `fixedbitset`'s). However, when deserializing, the layers above have to be reconstructed. The most naive way to do this would be to create an empty `hibitset` of the correct size with `with_capacity` and then insert the bits one by one. A less naive way would be to create a `hibitset` with the deserialized vector as `layer0` and empty, but correctly sized, layers above it, then go through `layer1` checking whether each corresponding `layer0` `usize` is equal to `0` or not, repeating until the highest layer.
If you want to try your hand at implementing it, I can help you.
@WaDelma Here is a simple implementation of the naive way to do this; now if you can give me some pointers I'll try the layer building.
```rust
// Small test: serialize the layer0 structure and deserialize by setting each bit.
let set3 = &mut BitSet::new();
set3.add(3);
set3.add(9);
set3.add(72);
let set3_layer0 = set3.layer0_as_slice();

// Create a new set to deserialize into. Note that with_capacity takes a
// bit count, so multiply the word count by the word size in bits.
let bits = std::mem::size_of::<usize>() * 8;
let result_set3 = &mut BitSet::with_capacity((set3_layer0.len() * bits) as u32);
for (i, &word) in set3_layer0.iter().enumerate() {
    println!("hibitset: set3: word {}: {:#066b}", i, word);
    for j in 0..bits {
        if (word >> j) & 1 != 0 {
            result_set3.add((i * bits + j) as u32);
        }
    }
}
for elem in result_set3.iter() {
    println!("hibitset: elem: {}", elem);
}
assert_eq!(result_set3, set3);
assert!(result_set3.contains(9u32));
```
I'm sure your method will be much faster so I'd like to try it with some guidance, but this gives me a test to start with that functions properly.
@WaDelma I ran some tests, and using `serde` and `bitpacking` the size is only slightly larger than the custom serialization: about 12.7K using `serde` versus just over 12K for the custom serialization. It could be because my bitset was not very dense, so maybe it will make a bigger difference with your more advanced method. Let me know what you think; I'll wait until you can provide some direction on how to do the layer reconstruction upon deserialization.
@cisaacson, did you try to see how much memory `fixedbitset` uses? Just to see whether that reduction looks good or not.
It could be possible to do some tricks to make sparse/dense bitmaps take less space, but those might be format dependent and would make some bitsets use more memory, so I'm not sure about them.
As for updating the upper layers:
Let `b` := bits in `usize`. If the length of `layer0` is `n`, then the length of `layer1` has to be `m = ceil(n/b)`. This means that for `layer1` you can loop `j` from `0` to `m` (non-inclusive). For each iteration `j` we go through all bits of the corresponding `usize`: for each bit `i` we check whether `layer0[(j * b) + i] != 0` to see if the current bit should be `1` or not.
The same works for pretty much all layers.
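The loop described above can be sketched on plain `Vec<usize>` layers; this is a standalone illustration (hypothetical helper, not hibitset's internals), assuming a 64-bit `usize`:

```rust
// Rebuild an upper layer from the layer below, as described:
// bit i of word j in the upper layer is set iff lower[(j * b) + i] != 0.
// Standalone sketch on Vec<usize>, not hibitset's actual internals.
const B: usize = usize::BITS as usize; // b := bits in usize

fn build_upper(lower: &[usize]) -> Vec<usize> {
    let m = (lower.len() + B - 1) / B; // m = ceil(n / b)
    let mut upper = vec![0usize; m];
    for j in 0..m {
        for i in 0..B {
            let idx = j * B + i;
            if idx < lower.len() && lower[idx] != 0 {
                upper[j] |= 1 << i;
            }
        }
    }
    upper
}

fn main() {
    // A layer0 with nonzero words at indices 0 and 65.
    let mut layer0 = vec![0usize; 70];
    layer0[0] = 0b1000;
    layer0[65] = 1;
    let layer1 = build_upper(&layer0);
    println!("layer1: {:?}", layer1);
    // Repeat build_upper on layer1 to get layer2, and so on up the layers.
}
```

Nonzero word 0 of `layer0` sets bit 0 of `layer1[0]`, and nonzero word 65 sets bit 1 of `layer1[1]`; applying the same function to each result walks up the remaining layers.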
(Was away, so got to answer this only now)
@WaDelma Thanks, that makes sense.
I did some more work and research on this; I don't expect much savings over the `serde` method for reasonably sized sparse BitSets. `layer0` alone for 100K bits is about 12.5K, which is what you would expect; I don't have exact stats any more, but `serde` was about 12.7K. For my use case `serde` will work for sure.
I also did some homework on sparse bitsets and how to compress them. This is a challenging area (Roaring has done a tremendous amount of work on it). My use case is definitely random and sparse, which makes it virtually impossible to get good compression. Your library is better than Roaring in terms of space by about 25%, because once you have a random sparse BitSet, things like run-length encoding only slow it down and take more space. I'm very happy with hibitset, we will definitely be using it. It's ideal for our use case, but serializing to a smaller size does not seem too feasible right now.
Building the layers as you describe would be faster using the `layer0` serialization, but again I don't know if it is worth it over the `serde` implementation.
Let me know what you think; I am good for now with what you have provided and the enhancements I proposed for `serde`.
Yeah, it's maybe better to start with the serde-generated implementation and then optimize it later if necessary.
LGTM
bors r+ Sorry for leaving this hanging.
bors r+ Let's try again :D
I seem to still be on the list of people bors responds to. Is it dead, or what has happened?
Thanks, I hope it is useful to others. Let me know if you need me to do anything else.
Here is the PR for the `serde` feature. I added a simple example using `bincode` in the README. There is no test; I could add one easily, but it would mean adding another dev dependency, and I'm not sure if you want that in the codebase. The version has not been changed; that should be done if you decide to put this in a release.