Tracker: FlatGFA data structure optimization #184

Open · 3 of 9 tasks
sampsyo opened this issue May 18, 2024 · 0 comments

A next big phase for FlatGFA is to try implementing different data-layout optimizations and measuring their impact on data size, parsing performance, and various operators' performance. We're actually already a little bit "behind" on measuring performance because the baseline version uses a few minor representation tricks (sketched just after this list):

  • Use u32s for all indices instead of usizes.
  • Pack handles into 32 bits (31 bits for the segment reference, 1 bit for the orientation).
  • Pack CIGAR alignment operations into 32 bits (1 byte for the operation, 3 bytes for the length).
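For concreteness, here is a minimal Rust sketch of what that handle and CIGAR-op packing can look like. The type names and the exact bit assignments (orientation in the low bit, opcode in the high byte) are illustrative and may not match the real FlatGFA code.

```rust
/// Illustrative 32-bit handle: 31 bits of segment index plus 1 orientation bit.
#[derive(Copy, Clone)]
struct Handle(u32);

impl Handle {
    fn new(segment: u32, is_reverse: bool) -> Self {
        assert!(segment < (1u32 << 31), "segment index must fit in 31 bits");
        Handle((segment << 1) | (is_reverse as u32))
    }

    fn segment(self) -> u32 {
        self.0 >> 1
    }

    fn is_reverse(self) -> bool {
        self.0 & 1 == 1
    }
}

/// Illustrative 32-bit CIGAR op: 1 byte for the operation, 3 bytes for the length.
#[derive(Copy, Clone)]
struct CigarOp(u32);

impl CigarOp {
    fn new(op: u8, len: u32) -> Self {
        assert!(len < (1u32 << 24), "alignment length must fit in 24 bits");
        CigarOp(((op as u32) << 24) | len)
    }

    fn op(self) -> u8 {
        (self.0 >> 24) as u8
    }

    fn len(self) -> u32 {
        self.0 & 0x00ff_ffff
    }
}
```

The 16-bit CIGAR variant proposed below is the same idea with a `u16` payload and an 8-bit length field.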

Here are some next things to try:

  • Try smaller indices in general. Do we really need 32 bits for most indices? It would be really cool if, depending on the size of the data, we could customize the size of indices… without disrupting the processing code too much.
  • Further pack CIGAR alignment ops into 16 bits. It seems like 1 byte is probably enough for most lengths? We should check.
  • Sub-byte nucleotide sequence compression (e.g., 3 bits per bp). (A packing sketch appears after this list.)
  • Better alignment. We currently use #[repr(packed)] for basically everything, which means (a) more unaligned accesses, and (b) more copying to deal with those unaligned accesses. By strategically padding things out to word-sized chunks, we might sacrifice some space for memory performance.
  • Smaller Spans. A Span is currently a start/end pair (two u32s). This is clearly overkill for most cases because the previous entry's end is always the current entry's start… so, we could either (a) just store the start or end, and require two lookups in the case of random access, or (b) just store the size, and sacrifice random access altogether. Maybe we should try both! (Both options are sketched after this list.)
  • Optimize for sequential segment names. Currently, every Segment stores its name… but, in most GFAs we encounter, these names are serial integers, so the name is always equal to the index plus one. In this situation, we should avoid storing the name altogether; whenever code needs it, we should just return id+1. We'll still want a fallback for GFAs with non-sequential names. (A two-mode sketch of this appears after this list.)
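A rough sketch of what sub-byte sequence packing could look like, assuming a hypothetical 3-bit code assignment (the real encoding is an open question; 2 bits per bp would suffice for a strict ACGT alphabet, while 3 bits leaves room for N and other ambiguity codes):

```rust
/// Hypothetical 3-bit code assignment for nucleotides.
fn base_code(b: u8) -> u8 {
    match b {
        b'A' => 0,
        b'C' => 1,
        b'G' => 2,
        b'T' => 3,
        b'N' => 4,
        _ => 5, // catch-all for other IUPAC codes
    }
}

/// Pack a nucleotide sequence at 3 bits per bp into a flat byte vector.
fn pack_seq(seq: &[u8]) -> Vec<u8> {
    let mut out = vec![0u8; (seq.len() * 3 + 7) / 8];
    for (i, &b) in seq.iter().enumerate() {
        let code = base_code(b) as u16;
        let bit = i * 3;
        let (byte, off) = (bit / 8, bit % 8);
        // A 3-bit code can straddle a byte boundary, so write through a 16-bit window.
        let window = code << off;
        out[byte] |= window as u8;
        if off > 5 {
            out[byte + 1] |= (window >> 8) as u8;
        }
    }
    out
}
```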
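For the Span idea, here is a sketch of the two alternatives. The helper names are made up, and option (a) assumes the starts array carries one extra sentinel entry holding the final end offset:

```rust
/// Current representation (illustrative): a start/end pair, two u32s per entry.
struct Span {
    start: u32,
    end: u32,
}

/// Option (a): store only the start offsets (plus one sentinel at the end).
/// The end of entry i is the start of entry i + 1, so random access costs two lookups.
fn span_starts_only(starts: &[u32], i: usize) -> (u32, u32) {
    (starts[i], starts[i + 1])
}

/// Option (b): store only the sizes. Random access is gone; offsets have to be
/// reconstructed by a sequential scan (or a separately maintained prefix sum).
fn span_sizes_only(sizes: &[u32], i: usize) -> (u32, u32) {
    let start: u32 = sizes[..i].iter().sum();
    (start, start + sizes[i])
}
```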
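And a sketch of the sequential-names optimization, using a hypothetical two-mode `SegNames` store as the fallback mechanism (the real design might prefer a per-file flag or something else entirely):

```rust
/// Illustrative two-mode name store for segments.
enum SegNames {
    /// Names are serial integers, so name(id) == id + 1 and nothing is stored.
    Sequential,
    /// Fallback for GFAs with non-sequential names: store them explicitly.
    Explicit(Vec<u64>),
}

impl SegNames {
    fn name(&self, id: usize) -> u64 {
        match self {
            SegNames::Sequential => id as u64 + 1,
            SegNames::Explicit(names) => names[id],
        }
    }
}
```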

In a pretty different category, we also have the idea of stealing odgi's steps-by-node index. This is a different category because it is a fairly holistic change and not a low-level representation trick. I think we should only try it when/if we find evidence that it will really matter for something?

Something that will come up quickly when looking into these is that various optimizations will be beneficial in only some circumstances, or perhaps even possible in only some circumstances. So we will rapidly end up with multiple options for the representation. I'm not sure what to do about this—we would like to avoid duplicating code, so maybe there is some Rust trait trickery that can allow customization, or maybe we need macros, or maybe we need to switch to full-fledged code generation. Can't wait to find out what we need along the way!
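As one hedged example of what that trait trickery could look like: a small index trait that lets the processing code be generic over the index width. Everything here (`Index`, `SeqSpans`) is made up for illustration, and macros or full code generation might turn out to be the better fit.

```rust
/// Hypothetical trait abstracting over the index width.
trait Index: Copy {
    fn to_usize(self) -> usize;
    fn from_usize(i: usize) -> Self;
}

impl Index for u16 {
    fn to_usize(self) -> usize { self as usize }
    fn from_usize(i: usize) -> Self { u16::try_from(i).expect("index overflows u16") }
}

impl Index for u32 {
    fn to_usize(self) -> usize { self as usize }
    fn from_usize(i: usize) -> Self { u32::try_from(i).expect("index overflows u32") }
}

/// Processing code written once, generic over the index type.
struct SeqSpans<I: Index> {
    spans: Vec<(I, I)>, // start/end pairs into a flat sequence pool
}

impl<I: Index> SeqSpans<I> {
    fn push(&mut self, start: usize, end: usize) {
        self.spans.push((I::from_usize(start), I::from_usize(end)));
    }

    fn len_of(&self, seg: usize) -> usize {
        let (start, end) = self.spans[seg];
        end.to_usize() - start.to_usize()
    }
}
```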
