Feature request: creates a distinct color for each sequence in the input file #27

Open
Malfoy opened this issue Mar 29, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

Malfoy commented Mar 29, 2024

Hi @jermp !
Could you provide an option similar to Themisto's to handle a single FASTA file that contains many sequences, each of which should be treated as a different color?

-e, --sequence-colors Default if the input has just a single sequence
file. Creates a distinct color 0,1,2,... for
each sequence in the input.

jermp added the enhancement label Apr 3, 2024
Owner

jermp commented Apr 3, 2024

Hi @Malfoy, yes, this could be done in principle. However, I first need to understand what support (if any) GGCAT offers for this. I do not think we will implement it soon, since our focus is currently on a different aspect, but I am happy to collaborate on this feature if you or your students are willing to try it. Please let me know!

Best,
-Giulio

@jnalanko

Chiming in: in Themisto, this option relies on my old construction algorithm, which uses my own external-memory sorting algorithm. It is very slow compared to ggcat.

While ggcat did not directly support this the last time I checked, it would in principle be possible to split the input into one file per sequence and feed those files to ggcat. However, this might create millions of files, so it is not ideal.
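
For concreteness, the splitting workaround could look roughly like the sketch below. It is only an illustration: the helper names `read_fasta` and `split_fasta` and the `seq_<i>.fa` naming are made up here, and how the resulting file list is then handed to ggcat is deliberately left out.

```python
from pathlib import Path
from typing import Iterator, List, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA stream."""
    header = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(handle: TextIO, out_dir: str) -> List[Path]:
    """Write one FASTA file per sequence (hypothetical naming: seq_<i>.fa).

    The i-th sequence of the input goes to the i-th returned path, so the
    position in the list plays the role of the color 0, 1, 2, ...
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    for color, (header, seq) in enumerate(read_fasta(handle)):
        path = out / f"seq_{color}.fa"
        path.write_text(f">{header}\n{seq}\n")
        paths.append(path)
    return paths
```

With millions of sequences this creates millions of tiny files in a single directory, which is exactly the drawback mentioned above.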

Owner

jermp commented Jun 11, 2024

Thanks Jarno for your input. Yes, splitting each file into one file per sequence would be the way to go but, as you said, it looks like overkill. I think one approach would be to keep the indexes as they are and add some (possibly compressed) metadata that records the sequence ids, and not just the color. This would be a sort of hierarchical color scheme, with super-colors (the original colors) and colors (now, the sequence ids). Does that make sense? CC: @rob-p
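
One possible reading of the hierarchical scheme, purely as a sketch: assume each input file is one super-color and that sequence ids are assigned globally in file order, so the only metadata needed to go from a color back to its super-color is a list of cumulative offsets. The class `HierarchicalColors` and its methods are hypothetical names for illustration, not part of Fulgor or GGCAT.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List


class HierarchicalColors:
    """Map colors (global sequence ids) to super-colors (input file ids).

    Assumption: sequences are numbered 0, 1, 2, ... across all files,
    in file order, so each file owns one contiguous range of colors.
    """

    def __init__(self, seqs_per_file: List[int]) -> None:
        # offsets[i] is the color of the first sequence of file i;
        # offsets[-1] is the total number of colors.
        self.offsets = [0] + list(accumulate(seqs_per_file))

    def num_colors(self) -> int:
        return self.offsets[-1]

    def num_super_colors(self) -> int:
        return len(self.offsets) - 1

    def super_color(self, color: int) -> int:
        """Return the file id that the given sequence id belongs to."""
        if not 0 <= color < self.offsets[-1]:
            raise IndexError(f"color {color} out of range")
        # Last offset that is <= color; binary search over the offsets.
        return bisect_right(self.offsets, color) - 1

    def colors_of(self, super_color: int) -> range:
        """Return the contiguous range of colors owned by one file."""
        return range(self.offsets[super_color], self.offsets[super_color + 1])
```

For example, `HierarchicalColors([3, 1, 2])` describes three files with six sequences in total; color 3 is the only sequence of file 1. The offsets are a non-decreasing integer sequence, so they could be stored compressed, but that choice is not explored here.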

Collaborator

rob-p commented Jun 14, 2024

Hi all! @jermp; I think that this will be an increasingly common use-case, so it's worthwhile to figure out a principled way to do it. If the additional metadata results in a final index as efficient as if we had split the original into many files, then this seems like a practical way to go.

In fact, we have a use-case right now where I think fulgor (meta-colored dBG) would be perfect, but all of our input is middle-length sequences in a single file. We want to build an index on one file with ~1M sequences, and another with ~26M sequences, and there's no way we want to split those into 1 file / sequence just to build the index (happy to explain this specific use-case in more detail over e-mail / chat if you'd like).

Owner

jermp commented Jun 15, 2024

> If the additional metadata results in a final index as efficient as if we had split the original into many files, then this seems like a practical way to go.

The idea would be to have something even more efficient. Splitting everything would result in many small lists, which would be hard to compress unless they have some special properties (which they might, since they come from the same file and are hence somehow "correlated"). Always happy to chat, of course!
