Question: best way to merge resulting a3m msa DBs #900

Open
EvanKomp opened this issue Nov 11, 2024 · 3 comments
@EvanKomp

Expected Behavior

Running `mmseqs concatdbs ... --preserve-keys` on a3m MSA DB inputs should produce a single DB that is effectively a plain concatenation of the inputs, as if they had been joined with `cat` in bash.

Current Behavior

The message "Empty msa1! Skipping entry" is reported for MSAs in the first file, for every entry other than the first.

For example:
`>1
MI>6
MLAGLLLAGPALTPMASATPGPLYRNPHASVSSRVDDLLKRMSLDDKVGQMTQAERGAVTPDQAAALKLGSLLSGGGSVPAGNTPNGWADMVDSYQKAAVSTPLGIPTIYGVDAVHGHNNVYGATIFPHNIGLGAANNPRLVEKIGRATALEVAGTGPQWDFSPCLCVARDDRWGRTYESFGESPRDAVANASAITGLQGHGLGEKPGSVLATAKHYVGDGGTTNGVDQGNTEISERELRQIHLPPFREAIDRGVGSVMISFSSFQGVRMHAQKYLITDVLKKELRFSGLVISDYNAINQIDGQEGFTPEEVRLSVNAGIDMFMVPWDAPQFIAYLKAEVEAGRVPTARIDDANRRILAEKFKLGLFEHPYTDRSLQKTFGSKEHRELARQAVRESQVLLKNDGVLPLAKKNNKIFVAGKNANDIGNQAGGWTLTWQGQSGPVIPGTTILDGLKSGAGKGTTVTYDRAGDGIDGSYQVAVAVVGETPYAEGQGDRPNGFGLDAEDLATIAKLKSSGVPVVVVTVSGRPLDIAAQLPQFDGLVAAWLPGSEGAGVADVLYGDYNPTGKLTFSWPASATQEPVNVGDGKKALYPYGFGLRYRR

UniRef100_UPI00118103EF 941 0.642 3.409E-303 3 599 601 14 606 817
...
`

This is the output.

I have confirmed that the MSAs are present in the input files and have more than one sequence.

Any guidance appreciated.

Ultimately I have hundreds of a3m DBs to combine, the result of running splitdbs and processing the pieces as jobs on separate nodes. mergedbs requires a query input, an output, and then the list of DBs to merge. That seems contrived for this use case because I have many query files, unless I am (likely) misunderstanding.

@EvanKomp
Author

EDIT: I think I screwed something up somewhere as I was able to get mergedbs to work for this use case.

@EvanKomp
Author

Reopening because mergedbs is REALLY slow, e.g.:

mmseqs mergedbs data/2/env_data_db/env_data_db data/2/msas/final_combined.a3m data/2/msas/split_10/final.a3m data/2/msas/split_4/final.a3m data/2/msas/split_0/final.a3m data/2/msas/split_16/final.a3m data/2/msas/split_5/final.a3m data/2/msas/split_11/final.a3m data/2/msas/split_20/final.a3m data/2/msas/split_3/final.a3m data/2/msas/split_1/final.a3m data/2/msas/split_2/final.a3m data/2/msas/split_18/final.a3m --compressed 0

Note that this is but a subset of the files I will eventually be merging. Each final.a3m has ~150k MSAs in it.
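With hundreds of splits, the argument list can be assembled from a glob instead of being typed out. The following is a minimal sketch, assuming the `split_*/final.a3m` layout shown in the command above and assuming each DB has a `final.a3m.index` file next to it. The function name `merge_splits` and the `MMSEQS` override variable are hypothetical and only there for illustration.

```shell
# Hypothetical helper: merge every split DB found under a base directory.
# Usage: merge_splits <queryDB> <outDB> <baseDir>
merge_splits() {
    query_db=$1
    out_db=$2
    base=$3
    # Start the argument list with the subcommand, query DB and output DB.
    set -- mergedbs "$query_db" "$out_db"
    # Assumption: each split DB has an index file named final.a3m.index.
    for f in "$base"/split_*/final.a3m.index; do
        if [ -e "$f" ]; then
            # Strip the .index suffix to get the DB path mmseqs expects.
            set -- "$@" "${f%.index}"
        fi
    done
    # MMSEQS is a hypothetical override; set MMSEQS=echo for a dry run.
    ${MMSEQS:-mmseqs} "$@" --compressed 0
}

# Example (dry run, prints the command instead of running it):
# MMSEQS=echo merge_splits data/2/env_data_db/env_data_db data/2/msas/final_combined.a3m data/2/msas
```

The glob expands in lexicographic order (split_0, split_1, split_10, ...), which differs from the order in the command above. The dry run makes it easy to check the list before starting a long merge.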

I recognize that this is a sh*t ton of data, but I am not sure why this particular step takes so long. Shouldn't it basically be just a concatenation of the a3m files, followed by recomputing the .index?

Thanks for wisdom.

@EvanKomp reopened this Nov 21, 2024
@milot-mirdita
Member

This looks correct. If you have root, you can install and use iotop to see how much write throughput mmseqs2 is getting. I guess you are limited by the write bandwidth of the target storage system.
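Without root, a rough estimate may still be possible on Linux by sampling the `write_bytes` counter in `/proc/<pid>/io` for the running process. The following is a minimal sketch, not part of mmseqs2: the function name `write_rate` is hypothetical, it assumes the kernel exposes per-process I/O accounting, and the counter may not reflect writes to network filesystems accurately.

```shell
# Hypothetical helper: average write rate (bytes/s) of a process.
# Usage: write_rate <pid> [seconds]   (seconds must be a whole number)
write_rate() {
    pid=$1
    secs=${2:-5}
    # write_bytes counts bytes the process caused to be sent to storage.
    before=$(awk '/^write_bytes:/ {print $2}' "/proc/$pid/io") || return 1
    sleep "$secs"
    after=$(awk '/^write_bytes:/ {print $2}' "/proc/$pid/io") || return 1
    echo $(( (after - before) / secs ))
}

# Example: sample the running mmseqs process for 10 seconds.
# write_rate "$(pgrep -n mmseqs)" 10
```

Comparing the printed rate against what the target storage is expected to sustain gives a first hint of whether the merge is I/O bound.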
