Question: ColabFold search pipeline for very large query, less sensitivity acceptable #897
Comments
25 million MSAs is quite considerable. What follows is a brain dump of some things to consider. Sorry it’s not super organized.

You can scale down sensitivity. You can consider searching only against the uniref and omitting the colabfold db. I think a batch size of 50k-100k queries is probably going to be sufficient.

I have also recently tweaked the setup_databases.sh script to not store the databases in a compressed form; this will increase RAM use but avoid constant decompression. However, I’ve not committed this change yet.

Storage space is likely going to be one of the bigger issues, since 25M MSAs will be quite large.

You might want to extract the mmseqs shell commands from the colabfold_search python script (https://github.com/sokrypton/ColabFold/blob/main/colabfold/mmseqs/search.py) so you have an easier time tweaking them. One option might be to run all searches in batches, but not run the final result2msa and unpackdb steps yet, as they are comparatively cheap and the file size increases a lot from the internal result format to the MSA a3m format. You can also run another mmseqs module here (filterdb --extract-lines) to return only a maximum number of hits to be converted to an MSA.

Currently, colabfold_search materializes each MSA as an individual file on the file system (through the unpackdb call). We have been meaning to make colabfold_batch work directly with the internal database format. I assume you don’t actually want to run 25M structure predictions and instead want to run something else that is faster. You might want to implement something that reads from the MMseqs2 databases directly; this will save a lot of filesystem headache.
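As a rough illustration of the last suggestion, here is a minimal Python sketch of reading entries straight from an MMseqs2 database instead of unpacking one file per query. It assumes the common on-disk layout: a data file plus a `.index` file whose lines hold a key, a byte offset, and a byte length separated by tabs, with each entry terminated by a null byte. It also assumes the database is uncompressed and stored as a single data file. The function names `read_index` and `iter_entries` are made up for this example and are not part of MMseqs2 or ColabFold.

```python
from typing import Iterator, List, Tuple


def read_index(index_path: str) -> List[Tuple[str, int, int]]:
    """Parse a .index file into (key, offset, length) tuples.

    Assumption: each line is "key<TAB>offset<TAB>length".
    """
    entries = []
    with open(index_path) as fh:
        for line in fh:
            if not line.strip():
                continue
            key, offset, length = line.rstrip("\n").split("\t")
            entries.append((key, int(offset), int(length)))
    return entries


def iter_entries(db_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (key, text) for every entry listed in db_path + '.index'.

    Assumption: the data file is uncompressed and not split into parts.
    """
    index = read_index(db_path + ".index")
    with open(db_path, "rb") as data:
        for key, offset, length in index:
            data.seek(offset)
            raw = data.read(length)
            # Drop the trailing null terminator before decoding.
            yield key, raw.rstrip(b"\x00").decode("utf-8")
```

A downstream tool could loop over `iter_entries("path/to/result_db")` and process each entry in memory, so no per-query file ever has to be written.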
@milot-mirdita Thanks for your response! This is good stuff. I hadn't considered the point about data size, though I do have a number of TB to work with. Re. batching, is there a built-in module to chunk up the query db, or should I manually split my FASTA and make separate DBs? I assume Here is my current list of commands in hard-coded format:
As it stands, I am not converting from DB to a3m until after the filter (with diff=256), so maybe MSA sizes will be okay? Any thoughts? What I changed from colabfold default, but plan to "search" over for MSA quality and run time: Thanks again.

EDIT: Since I am doing an iterative search, would we expect an indexed and in-memory target DB to speed up search? Based on other conversations, I am gathering the answer is no, given that my use case is large batches and not single queries?

EDIT2: Re. the purpose: correct, I have no need to run structure prediction. The MSAs are going into MSATransformer, so there is no point in them being extremely large or containing remote fragments.
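On the batching question, the thread does not say whether MMseqs2 offers a built-in way to chunk a query database, so here is a minimal sketch of the manual route: splitting one large FASTA into batch files of a fixed number of sequences. The helper names (`read_fasta`, `split_fasta`), the output file naming, and the default of 50,000 sequences per batch (the lower end of the 50k-100k range suggested above) are all assumptions for illustration.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            elif line:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_fasta(path: str, out_dir: str, batch_size: int = 50_000) -> List[Path]:
    """Write consecutive batches of `batch_size` records; return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    batch: List[Tuple[str, str]] = []

    def flush() -> None:
        # Write the current batch (if any) to its own numbered file.
        if not batch:
            return
        target = out / f"batch_{len(written):05d}.fasta"
        with open(target, "w") as fh:
            for header, seq in batch:
                fh.write(f">{header}\n{seq}\n")
        written.append(target)
        batch.clear()

    for record in read_fasta(path):
        batch.append(record)
        if len(batch) >= batch_size:
            flush()
    flush()  # final, possibly smaller batch
    return written
```

Each resulting file could then be turned into its own query database (for example with `mmseqs createdb`) and searched independently, which also makes it easy to resume after a failed batch.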
Context
I need to create MSAs for a very large set of protein sequences: about 25 million.
I was planning to use the ColabFold workflow. I figured that this would be achievable given the nonlinear scaling to large query sets. That being said, extremely remote alignments are not necessary for my use case in the same way that they are helpful for structure prediction. I am looking for relatively small MSAs (no more than 256 sequences) of diverse sets that do not have small fragments, i.e. high coverage.
I had intended to run some scaling tests over query size as well as parameters (first thoughts being sensitivity, max_seqs, align eval, max_accept), as well as not using the metagenomic database.
I figured I would first chat with experts and save some compute carbon before doing this. Are there any params I am missing? Any that would be good to change from default and forget? Am I totally off in thinking my job is achievable with a 104-thread compute node and a week of runtime?