Tracker: FlatGFA data structure optimization #184

Open · 3 of 9 tasks
sampsyo opened this issue May 18, 2024 · 0 comments

A next big phase for FlatGFA is to try implementing different data-layout optimizations and measuring their impact on data size, parsing performance, and various operators' performance. We're actually already a little bit "behind" on measuring performance because the baseline version uses a few minor representation tricks (sketched just after this list):

  • Use u32s for all indices instead of usizes.
  • Pack handles into 32 bits (31 bits for the segment reference, 1 bit for the orientation).
  • Pack CIGAR alignment operations into 32 bits (1 byte for the operation, 3 bytes for the length).
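For concreteness, here is a minimal Rust sketch of what that handle and CIGAR-op packing can look like. The type names and the exact bit assignments (orientation in the low bit, opcode in the high byte) are illustrative and may not match the real FlatGFA code.

```rust
/// Illustrative 32-bit handle: 31 bits of segment index plus 1 orientation bit.
#[derive(Copy, Clone)]
struct Handle(u32);

impl Handle {
    fn new(segment: u32, is_reverse: bool) -> Self {
        assert!(segment < (1u32 << 31), "segment index must fit in 31 bits");
        Handle((segment << 1) | (is_reverse as u32))
    }

    fn segment(self) -> u32 {
        self.0 >> 1
    }

    fn is_reverse(self) -> bool {
        self.0 & 1 == 1
    }
}

/// Illustrative 32-bit CIGAR op: 1 byte for the operation, 3 bytes for the length.
#[derive(Copy, Clone)]
struct CigarOp(u32);

impl CigarOp {
    fn new(op: u8, len: u32) -> Self {
        assert!(len < (1u32 << 24), "alignment length must fit in 24 bits");
        CigarOp(((op as u32) << 24) | len)
    }

    fn op(self) -> u8 {
        (self.0 >> 24) as u8
    }

    fn len(self) -> u32 {
        self.0 & 0x00ff_ffff
    }
}
```

The 16-bit CIGAR variant proposed below is the same idea with a `u16` payload and an 8-bit length field.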

Here are some next things to try:

  • Try smaller indices in general. Do we really need 32 bits for most indices? It would be really cool if, depending on the size of the data, we could customize the size of indices… without disrupting the processing code too much.
  • Further pack CIGAR alignment ops into 16 bits. It seems like 1 byte is probably enough for most lengths? We should check.
  • Sub-byte nucleotide sequence compression (e.g., 3 bits per bp). (A packing sketch appears after this list.)
  • Better alignment. We currently use #[repr(packed)] for basically everything, which means (a) more unaligned accesses, and (b) more copying to deal with those unaligned accesses. By strategically padding things out to word-sized chunks, we might sacrifice some space for memory performance.
  • Smaller Spans. A Span is currently a start/end pair (two u32s). This is clearly overkill for most cases because the previous entry's end is always the current entry's start… so, we could either (a) just store the start or end, and require two lookups in the case of random access, or (b) just store the size, and sacrifice random access altogether. Maybe we should try both! (Both options are sketched after this list.)
  • Optimize for sequential segment names. Currently, every Segment stores its name… but, in most GFAs we encounter, these names are serial integers, so the name is always equal to the index plus one. In this situation, we should avoid storing the name altogether; whenever code needs it, we should just return id+1. We'll still want a fallback for GFAs with non-sequential names. (A two-mode sketch of this appears after this list.)
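A rough sketch of what sub-byte sequence packing could look like, assuming a hypothetical 3-bit code assignment (the real encoding is an open question; 2 bits per bp would suffice for a strict ACGT alphabet, while 3 bits leaves room for N and other ambiguity codes):

```rust
/// Hypothetical 3-bit code assignment for nucleotides.
fn base_code(b: u8) -> u8 {
    match b {
        b'A' => 0,
        b'C' => 1,
        b'G' => 2,
        b'T' => 3,
        b'N' => 4,
        _ => 5, // catch-all for other IUPAC codes
    }
}

/// Pack a nucleotide sequence at 3 bits per bp into a flat byte vector.
fn pack_seq(seq: &[u8]) -> Vec<u8> {
    let mut out = vec![0u8; (seq.len() * 3 + 7) / 8];
    for (i, &b) in seq.iter().enumerate() {
        let code = base_code(b) as u16;
        let bit = i * 3;
        let (byte, off) = (bit / 8, bit % 8);
        // A 3-bit code can straddle a byte boundary, so write through a 16-bit window.
        let window = code << off;
        out[byte] |= window as u8;
        if off > 5 {
            out[byte + 1] |= (window >> 8) as u8;
        }
    }
    out
}
```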
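For the Span idea, here is a sketch of the two alternatives. The helper names are made up, and option (a) assumes the starts array carries one extra sentinel entry holding the final end offset:

```rust
/// Current representation (illustrative): a start/end pair, two u32s per entry.
struct Span {
    start: u32,
    end: u32,
}

/// Option (a): store only the start offsets (plus one sentinel at the end).
/// The end of entry i is the start of entry i + 1, so random access costs two lookups.
fn span_starts_only(starts: &[u32], i: usize) -> (u32, u32) {
    (starts[i], starts[i + 1])
}

/// Option (b): store only the sizes. Random access is gone; offsets have to be
/// reconstructed by a sequential scan (or a separately maintained prefix sum).
fn span_sizes_only(sizes: &[u32], i: usize) -> (u32, u32) {
    let start: u32 = sizes[..i].iter().sum();
    (start, start + sizes[i])
}
```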
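And a sketch of the sequential-names optimization, using a hypothetical two-mode `SegNames` store as the fallback mechanism (the real design might prefer a per-file flag or something else entirely):

```rust
/// Illustrative two-mode name store for segments.
enum SegNames {
    /// Names are serial integers, so name(id) == id + 1 and nothing is stored.
    Sequential,
    /// Fallback for GFAs with non-sequential names: store them explicitly.
    Explicit(Vec<u64>),
}

impl SegNames {
    fn name(&self, id: usize) -> u64 {
        match self {
            SegNames::Sequential => id as u64 + 1,
            SegNames::Explicit(names) => names[id],
        }
    }
}
```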

In a pretty different category, we also have the idea of stealing odgi's steps-by-node index. This is a different category because it is a fairly holistic change and not a low-level representation trick. I think we should only try it when/if we find evidence that it will really matter for something?

Something that will come up quickly when looking into these is that various optimizations will be beneficial in only some circumstances, or perhaps even possible in only some circumstances. So we will rapidly end up with multiple options for the representation. I'm not sure what to do about this—we would like to avoid duplicating code, so maybe there is some Rust trait trickery that can allow customization, or maybe we need macros, or maybe we need to switch to full-fledged code generation. Can't wait to find out what we need along the way!
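As one hedged example of what that trait trickery could look like: a small index trait that lets the processing code be generic over the index width. Everything here (`Index`, `SeqSpans`) is made up for illustration, and macros or full code generation might turn out to be the better fit.

```rust
/// Hypothetical trait abstracting over the index width.
trait Index: Copy {
    fn to_usize(self) -> usize;
    fn from_usize(i: usize) -> Self;
}

impl Index for u16 {
    fn to_usize(self) -> usize { self as usize }
    fn from_usize(i: usize) -> Self { u16::try_from(i).expect("index overflows u16") }
}

impl Index for u32 {
    fn to_usize(self) -> usize { self as usize }
    fn from_usize(i: usize) -> Self { u32::try_from(i).expect("index overflows u32") }
}

/// Processing code written once, generic over the index type.
struct SeqSpans<I: Index> {
    spans: Vec<(I, I)>, // start/end pairs into a flat sequence pool
}

impl<I: Index> SeqSpans<I> {
    fn push(&mut self, start: usize, end: usize) {
        self.spans.push((I::from_usize(start), I::from_usize(end)));
    }

    fn len_of(&self, seg: usize) -> usize {
        let (start, end) = self.spans[seg];
        end.to_usize() - start.to_usize()
    }
}
```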
