Question: best way to merge resulting a3m msa DBs #900

EvanKomp · 2024-11-11T23:17:29Z

Expected Behavior

mmseqs concatdbs ... --preserve-keys with a3m MSA db inputs should produce a single file that is effectively a bash concat of them.

Current Behavior

"Empty msa1! Skipping entry" is reported for every MSA in the first file except the first entry.

E.g.,
`>1
MI>6
MLAGLLLAGPALTPMASATPGPLYRNPHASVSSRVDDLLKRMSLDDKVGQMTQAERGAVTPDQAAALKLGSLLSGGGSVPAGNTPNGWADMVDSYQKAAVSTPLGIPTIYGVDAVHGHNNVYGATIFPHNIGLGAANNPRLVEKIGRATALEVAGTGPQWDFSPCLCVARDDRWGRTYESFGESPRDAVANASAITGLQGHGLGEKPGSVLATAKHYVGDGGTTNGVDQGNTEISERELRQIHLPPFREAIDRGVGSVMISFSSFQGVRMHAQKYLITDVLKKELRFSGLVISDYNAINQIDGQEGFTPEEVRLSVNAGIDMFMVPWDAPQFIAYLKAEVEAGRVPTARIDDANRRILAEKFKLGLFEHPYTDRSLQKTFGSKEHRELARQAVRESQVLLKNDGVLPLAKKNNKIFVAGKNANDIGNQAGGWTLTWQGQSGPVIPGTTILDGLKSGAGKGTTVTYDRAGDGIDGSYQVAVAVVGETPYAEGQGDRPNGFGLDAEDLATIAKLKSSGVPVVVVTVSGRPLDIAAQLPQFDGLVAAWLPGSEGAGVADVLYGDYNPTGKLTFSWPASATQEPVNVGDGKKALYPYGFGLRYRR

UniRef100_UPI00118103EF 941 0.642 3.409E-303 3 599 601 14 606 817
...
`

Is the output.

I have confirmed that the MSAs are present in the input files and have more than one sequence.

Any guidance appreciated.

Ultimately I have hundreds of a3ms to combine... a result of splitdb and running jobs on separate nodes. mergedbs requires a query input, an output, and then the list of dbs to merge. That is contrived for this use case because I have many query files, unless I am (likely) misunderstanding.


EvanKomp · 2024-11-11T23:50:24Z

EDIT: I think I screwed something up somewhere as I was able to get mergedbs to work for this use case.

EvanKomp · 2024-11-21T22:42:51Z

Reopening because it is REALLY slow, e.g.:

mmseqs mergedbs data/2/env_data_db/env_data_db data/2/msas/final_combined.a3m data/2/msas/split_10/final.a3m data/2/msas/split_4/final.a3m data/2/msas/split_0/final.a3m data/2/msas/split_16/final.a3m data/2/msas/split_5/final.a3m data/2/msas/split_11/final.a3m data/2/msas/split_20/final.a3m data/2/msas/split_3/final.a3m data/2/msas/split_1/final.a3m data/2/msas/split_2/final.a3m data/2/msas/split_18/final.a3m --compressed 0

Note that this is but a subset of the files I will eventually be merging. Each final.a3m has ~150k MSAs in it.

I recognize that this is a sh*t ton of data, but am not sure why this particular step would take so long. Shouldn't it basically just be a concatenation of the a3m files, followed by recomputing the .index?
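For illustration, the "concatenate, then recompute the .index" idea could be sketched roughly as below. This is not how mergedbs is implemented, only a sketch under stated assumptions: each DB is a single uncompressed data file (as with --compressed 0) accompanied by a tab-separated `.index` of key, offset, length and a `.dbtype` file; keys do not collide across the inputs; and the merged index should be sorted numerically by key. The function name `concat_dbs` is hypothetical.

```shell
# Sketch: concatenate uncompressed MMseqs2-style DBs and rebuild the .index.
# Usage: concat_dbs OUT IN1 [IN2 ...]
# Assumptions (not taken from the MMseqs2 sources): one data file per DB,
# IN.index lines are key<TAB>offset<TAB>length, keys are unique across inputs.
concat_dbs() {
    out=$1
    shift
    : > "$out" || return 1
    : > "$out.index" || return 1
    base=0
    for db in "$@"; do
        # Append the raw data; entries keep their order within each input.
        cat "$db" >> "$out" || return 1
        # Shift every offset by the number of bytes already written.
        # %.0f avoids scientific notation for very large offsets.
        awk -v base="$base" '{ printf "%s\t%.0f\t%s\n", $1, $2 + base, $3 }' \
            "$db.index" >> "$out.index" || return 1
        size=$(wc -c < "$db") || return 1
        base=$((base + size))
    done
    # Sort the merged index numerically by key.
    sort -n -k1,1 -o "$out.index" "$out.index" || return 1
    # Reuse the first input's dbtype, assuming all inputs share one type.
    cp "$1.dbtype" "$out.dbtype"
}

# Hypothetical usage:
#   concat_dbs combined.a3m split_0/final.a3m split_1/final.a3m
```

Even in this minimal form every input byte is read once and written once, so the run time is bounded below by the read and write bandwidth of the storage involved.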

Thanks for wisdom.

milot-mirdita · 2024-11-22T05:44:46Z

This looks correct. If you have root, you can install and use iotop to see how much write throughput mmseqs2 is getting. I guess you are limited by the write bandwidth of the target storage system.

EvanKomp closed this as completed Nov 11, 2024

EvanKomp reopened this Nov 21, 2024
