Feature request: creates a distinct color for each sequence in the input file #27

Malfoy · 2024-03-29T13:50:15Z

Hi @jermp !
Could you provide an option similar to Themisto's to handle a single FASTA file that contains many sequences, each of which should be given a different color?

    -e, --sequence-colors    Default if the input has just a single sequence
                             file. Creates a distinct color 0,1,2,... for
                             each sequence in the input.

jermp · 2024-04-03T07:32:52Z

Hi @Malfoy, yes, this could be done in principle. First I need to understand what support (if any) GGCAT offers for this, though. I do not think we are going to implement it soon since our focus is currently on a different aspect, but I am happy to collaborate on this feature if you or your students are willing to try it. Please let me know!

Best,
-Giulio

jnalanko · 2024-06-11T08:09:58Z

Chiming in: in Themisto, this option uses my old construction algorithm, which relies on my own external-memory sorting algorithm. It's very slow compared to ggcat.

ggcat did not directly support this the last time I checked. In principle, it would be possible to split the input into one file per sequence and feed those files to ggcat, but this might create millions of files, so it is not ideal.
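To make the splitting workaround concrete, here is a minimal sketch in Python. It is an illustration only: the function names and the `seq_<color>.fa` naming are hypothetical, it is not part of ggcat, Themisto, or Fulgor, and it assumes a plain uncompressed FASTA input. The returned list of paths is in color order (0, 1, 2, ...), and it also shows where the "millions of files" problem comes from.

```python
import os
import tempfile


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # multi-line sequences are concatenated
    if header is not None:
        yield header, "".join(chunks)


def split_per_sequence(fasta_path, out_dir):
    """Write one FASTA file per sequence; return the paths in color order.

    Hypothetical helper: the i-th sequence of the input becomes file
    seq_<i>.fa, so its position in the returned list is its color.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with open(fasta_path) as f:
        for color, (header, seq) in enumerate(parse_fasta(f)):
            path = os.path.join(out_dir, f"seq_{color}.fa")
            with open(path, "w") as out:
                out.write(f">{header}\n{seq}\n")
            paths.append(path)
    return paths
```

The list of paths could then be written to a file list for whatever builder is used. With ~1M or more sequences this creates as many files, which is exactly the drawback mentioned above.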

jermp · 2024-06-11T08:21:54Z

Thanks Jarno for your input. Yes, splitting each file into one file per sequence would be the way to go but, as you said, it looks like overkill. I think one approach would be to keep the indexes as they are and add some (possibly compressed) metadata that records the sequence ids, not just the color. This would be a sort of hierarchical color scheme, where we have super-colors (the original colors) and colors (now, the sequence ids). Does that make sense? CC: @rob-p
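One way to picture the hierarchical scheme is the sketch below. It is only an assumption about how such metadata could be laid out, not Fulgor's design: sequence-level colors are numbered globally in file order, and an offsets array (all names here are hypothetical) records where each file's range of sequence ids starts, so a sequence-level color can be mapped back to its super-color with a binary search.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(seqs_per_file):
    """offsets[i] is the first sequence-level color of file (super-color) i."""
    return [0] + list(accumulate(seqs_per_file))


def super_color(offsets, color):
    """Map a sequence-level color to the super-color (file) that contains it."""
    if not 0 <= color < offsets[-1]:
        raise ValueError("color out of range")
    return bisect_right(offsets, color) - 1


def to_super_colors(offsets, colors):
    """Collapse a sorted list of sequence-level colors to sorted super-colors."""
    out = []
    for c in colors:
        s = super_color(offsets, c)
        if not out or out[-1] != s:  # sorted input keeps duplicates adjacent
            out.append(s)
    return out
```

For example, with three files holding 3, 1, and 2 sequences, the offsets are `[0, 3, 4, 6]`, and sequence-level colors 0..2 belong to super-color 0, color 3 to super-color 1, and colors 4..5 to super-color 2. The offsets array costs one integer per input file, independent of the number of sequences.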

rob-p · 2024-06-14T05:09:32Z

Hi all! @jermp; I think that this will be an increasingly common use-case, so it's worthwhile to figure out a principled way to do it. If the additional metadata results in a final index as efficient as if we had split the original into many files, then this seems like a practical way to go.

In fact, we have a use-case right now where I think fulgor (meta-colored dBG) would be perfect, but all of our input is middle-length sequences in a single file. We want to build an index on one file with ~1M sequences, and another with ~26M sequences, and there's no way we want to split those into 1 file / sequence just to build the index (happy to explain this specific use-case in more detail over e-mail / chat if you'd like).

jermp · 2024-06-15T09:10:06Z

> If the additional metadata results in a final index as efficient as if we had split the original into many files, then this seems like a practical way to go.

The idea would be to have something even more efficient. Splitting everything would result in many small lists, which would be hard to compress unless they have some special properties (which they might, since they come from the same file and are hence "correlated" somehow). Always happy to chat, of course!

jermp added the enhancement New feature or request label Apr 3, 2024
