Enter at postbinning stage #623

Open
prototaxites opened this issue Jun 3, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@prototaxites
Contributor

Description of feature

I decided I wanted to try binning my data using Vamb, which isn't in the pipeline yet. It would be a useful feature to be able to supply a CSV (or directory?) of bins and jump directly into the bin QC/taxonomy/annotation steps!

Might try my hand at this one if I find a bit of time, but I suspect that it will be finicky depending on exactly what metadata we will want to tag onto the input bins, and that might need some discussion.

@prototaxites prototaxites added the enhancement New feature or request label Jun 3, 2024
@jfy133
Member

jfy133 commented Jun 3, 2024

You can try, but tbh it likely would be faster to just add vamb to the pipeline 🤣🤣🤣

@prototaxites
Contributor Author

Having looked at Vamb: as it (ideally) requires concatenating all assemblies and renaming contigs along a complicated scheme, I think it's going to play havoc with any system that compares bins using contig names (DAS_Tool and Tiara)... 😅

@jfy133
Member

jfy133 commented Jun 4, 2024

Uuugghhhhh

@jfy133
Member

jfy133 commented Jun 4, 2024

I guess we will need to make a metadata file to track them or something and convert headers back?
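For illustration, a minimal Python sketch of what "track them and convert back" could look like. Everything here is an assumption: the two-column TSV layout (renamed header, original header), the function names, and the example contig names are all made up and are not part of mag or Vamb.

```python
import csv
import io


def load_header_map(handle):
    """Read a two-column TSV (renamed header, original header) into a dict."""
    header_map = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) == 2:  # skip blank or malformed rows
            renamed, original = row
            header_map[renamed] = original
    return header_map


def restore_headers(fasta_lines, header_map):
    """Yield FASTA lines, swapping renamed headers back to the originals."""
    for line in fasta_lines:
        if line.startswith(">"):
            name = line[1:].strip()
            # Headers missing from the map pass through unchanged
            yield f">{header_map.get(name, name)}\n"
        else:
            yield line


if __name__ == "__main__":
    # Hypothetical mapping and renamed FASTA, purely for demonstration
    mapping = load_header_map(io.StringIO("S1C1\tk141_5\nS1C2\tk141_9\n"))
    renamed_fasta = [">S1C1\n", "ACGT\n", ">S1C2\n", "GGCC\n"]
    print("".join(restore_headers(renamed_fasta, mapping)), end="")
```

The mapping file would have to be written at the point where the headers are renamed, so that downstream tools comparing contig names can be handed the original names again.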

@jfy133
Member

jfy133 commented Jun 4, 2024

"Concatenate the FASTA files together while making sure all contig headers stay unique"

If that's all it's doing, might be a reasonable thing to do upstream immediately after assembly anyway thinking about it...
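A rough sketch of that step in Python, assuming it really is just "concatenate and keep headers unique". The sample-name prefix with an underscore is an arbitrary choice for illustration, not a scheme required by any tool, and the function name and inputs are hypothetical.

```python
import io


def concat_with_unique_headers(assemblies, out, joiner="_"):
    """Concatenate per-sample FASTA lines, prefixing each contig ID with its sample.

    assemblies: dict mapping sample name -> iterable of FASTA lines.
    out: writable text handle for the combined FASTA.
    Returns a list of (new_header, original_header) pairs.
    """
    seen = set()
    renamed = []
    for sample, lines in assemblies.items():
        for line in lines:
            if line.startswith(">"):
                old = line[1:].strip()
                # Use only the first whitespace-delimited token as the contig ID
                contig_id = old.split()[0] if old else ""
                new = f"{sample}{joiner}{contig_id}"
                if new in seen:
                    raise ValueError(f"duplicate contig header: {new}")
                seen.add(new)
                renamed.append((new, old))
                out.write(f">{new}\n")
            elif line.strip():
                # Normalise line endings so files without a trailing newline
                # do not run into the next sample's header
                out.write(line.rstrip("\n") + "\n")
    return renamed


if __name__ == "__main__":
    # Two made-up assemblies that share a contig name
    assemblies = {
        "S1": [">k141_1 flag=1\n", "ACGT\n"],
        "S2": [">k141_1 flag=0\n", "GGCC"],
    }
    merged = io.StringIO()
    pairs = concat_with_unique_headers(assemblies, merged)
    print(merged.getvalue(), end="")
```

Returning the (new, original) pairs keeps the door open for writing a tracking file alongside the combined assembly.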

@prototaxites
Contributor Author

Furthermore, if you want to use binsplitting (and you should!), your contig headers must be of the format {Samplename}{Separator}{X}, such that the part of the string before the first occurrence of {Separator} gives a name of the sample it originated from. For example, you could call contig number 115 from sample number 9 "S9C115", where "S9" would be {Samplename}, "C" is {Separator} and "115" is {X}.

So it's a little more complicated! I'm not sure if renaming all the contigs initially is the best solution disk-space wise - as we just create a copy of all assemblies with different headers for a tool that we (potentially) might not choose to run...

Not to mention mapping the reads to the concatenated assembly, and then parsing that separately through the depths workflow 🫢
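To make the header rule quoted above concrete, here is a small Python sketch. It is not Vamb's own code; `make_header` and `sample_of` are hypothetical helpers that only mirror the documented convention, where the sample is whatever precedes the first occurrence of the separator.

```python
def make_header(sample, index, separator="C"):
    """Build a contig header of the form {Samplename}{Separator}{X}."""
    if separator in sample:
        # Otherwise the first occurrence of the separator would fall
        # inside the sample name and the sample would be misread
        raise ValueError(f"sample name {sample!r} contains separator {separator!r}")
    return f"{sample}{separator}{index}"


def sample_of(header, separator="C"):
    """Return the part of the header before the first separator."""
    sample, found, _ = header.partition(separator)
    if not found:
        raise ValueError(f"no separator {separator!r} in header {header!r}")
    return sample


if __name__ == "__main__":
    header = make_header("S9", 115)
    print(header, sample_of(header))  # S9C115 S9
```

One practical consequence: a single-letter separator such as "C" rules out any sample name containing that letter, so a separator that cannot occur in sample names is the safer pick.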

@jfy133
Member

jfy133 commented Jun 4, 2024

> Furthermore, if you want to use binsplitting (and you should!), your contig headers must be of the format {Samplename}{Separator}{X}, such that the part of the string before the first occurrence of {Separator} gives a name of the sample it originated from. For example, you could call contig number 115 from sample number 9 "S9C115", where "S9" would be {Samplename}, "C" is {Separator} and "115" is {X}.
>
> So it's a little more complicated! I'm not sure if renaming all the contigs initially is the best solution disk-space wise - as we just create a copy of all assemblies with different headers for a tool that we (potentially) might not choose to run...
>
> Not to mention mapping the reads to the concatenated assembly, and then parsing that separately through the depths workflow 🫢

Ugh ok.

It's weird though as earlier the documentation implies you don't have to do all of that?

I don't have a good suggestion then 😅, sounds like it'll all be painful one way or another...

@maxibor
Member

maxibor commented Jun 26, 2024

As a stopgap measure, I've written mgenottate to do just that: genome QC, dereplication, and taxonomic annotation
https://github.com/maxibor/mgenottate

@prototaxites
Contributor Author

prototaxites commented Jun 26, 2024

> As a stopgap measure, I've written mgenottate to do just that: genome QC, dereplication, and taxonomic annotation https://github.com/maxibor/mgenottate

Ah, that's cool! In the end I just ended up forking mag, deleting the first part of the main workflow and dropping bins in via directory input: https://github.com/prototaxites/mag/tree/bin_entry

Also, I have a separate pipeline for metagenome gene annotation that is just a couple of characters different in name from yours: https://github.com/prototaxites/mgannotate 😅

3 participants