MGnify clustering

Pipeline finding Mgnify/UniProt clusters to increase the metagenomic coverage in Pfam

Requirements

Python 3.6 and above with Cx_Oracle package MMseqs2 (https://github.com/soedinglab/MMseqs2)

Config file with the variables as specified in clustering_model.cfg

Clustering

The pipeline is divided in 2 steps, both executed from cluster.sh:

Clustering
Getting clusters with more than 2 sequences including MGnify and UniProt sequences where the representative sequence isn't found in Pfam

Usage: bsub -q production-rh74 -M 600000 -R "rusage[mem=600000]" -oo clustering_full_bidir.log -J cluster_full_uni -Pbigmem -n 16 ./cluster.sh clustering.cfg

Buiding family

This is the next step after the clustering, allowing the automatic build of good quality Pfam families. It runs a series of perl scripts in order to get a good quality family. It needs to be executed on an interactive shell, incompatible with bsub command.

Generation of the multiple sequence alignment for the clusters selected in the previous step
Alignment against pfamseq database
Keeping only non redundant at 80% identity sequences
Removing gappy columns from N and C terminus
Removing partial sequences, i.e. those with a gap character at start of end of alignment
Running pfbuild to generate missing files
Searching for PDB references
Searching for SwissProt information
Searching for matching species
Adding DUF identifier
Adding family identifier
Running final checks

Usage: python generate_alignments.py -i file_containing_stats_sorted_by_cluster_size_reverse [-d yes (delete previous files)] [-b family to start building from] [-n number of families to build]

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
build_families.py		build_families.py
cluster.sh		cluster.sh
clustering_model.cfg		clustering_model.cfg
complete_desc_file.py		complete_desc_file.py
filter_partial.py		filter_partial.py
find_incomplete_desc.py		find_incomplete_desc.py
generate_alignments.py		generate_alignments.py
get_stats.py		get_stats.py
update_uniprotkb.sh		update_uniprotkb.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MGnify clustering

Requirements

Clustering

Buiding family

About

Releases

Packages

Languages

License

ProteinsWebTeam/mgnify-clustering

Folders and files

Latest commit

History

Repository files navigation

MGnify clustering

Requirements

Clustering

Buiding family

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages