
Preparation

Frédéric Mahé edited this page Nov 27, 2022 · 4 revisions

Prepare amplicon fasta files

To facilitate the use of swarm, we provide examples of shell commands that can be used to format and check the input fasta file. Warning: these commands may not be suitable for very large files. The amplicon clipping step (adaptor and primer removal) and the filtering step are not discussed here.

Linearization

Amplicons written on two lines (one line for the fasta header, one line for the sequence) are easier to manipulate. The following command was tested with GNU Awk 4:

awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1} END {printf "\n"}' amplicons.fasta > amplicons_linearized.fasta
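As a quick sanity check, the sketch below runs the same awk command on a tiny made-up file and verifies that headers and sequences alternate in the result. The file names (demo.fasta, demo_linearized.fasta) and the sequences are hypothetical and only serve the illustration.

```shell
# Build a small multi-line fasta file (hypothetical names and sequences)
printf ">seq1\nACGT\nACGT\n>seq2\nTTGG\nCCAA\nGG\n" > demo.fasta

# Linearize it with the same awk command as above
awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1} END {printf "\n"}' demo.fasta > demo_linearized.fasta

# Odd lines must be headers, even lines must be sequences
check=$(awk 'NR % 2 == 1 && !/^>/ {bad++}
             NR % 2 == 0 &&  /^>/ {bad++}
             END {print (bad ? "malformed" : "ok")}' demo_linearized.fasta)
echo "${check}"
```

If the check prints "malformed", the input probably contains empty lines or an entry without a sequence, and should be inspected before going further.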

Dereplication

To speed up the clustering process, strictly identical amplicons should be merged. This step is not mandatory, but it yields an abundance value for each amplicon, which is necessary to refine the clusters produced by swarm. Dereplication is also an important time saver, especially for highly redundant high-throughput sequencing surveys.

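# Keep sequences made only of ACGT, count identical ones, name each by its SHA-1 hash,
# sort by decreasing abundance, then split "hash_abundance_sequence" onto two lines.
grep -v "^>" amplicons_linearized.fasta | \
grep -v "[^ACGTacgt]" | sort -d | uniq -c | \
while read abundance sequence ; do
    hash=$(printf "%s" "${sequence}" | sha1sum)
    hash=${hash:0:40}
    printf ">%s_%d_%s\n" "${hash}" "${abundance}" "${sequence}"
done | sort -t "_" -k2,2nr -k1.2,1d | \
sed -e 's/_/\n/2' > amplicons_linearized_dereplicated.fasta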

Amplicons containing characters other than "ACGT" are discarded. The dereplicated amplicons receive a unique name (160-bit hash values), and are sorted by decreasing number of copies and by hash values (to guarantee a stable sorting). The use of a hashing function provides an easy way to compare sets of amplicons. If two amplicons from two different sets have the same hash code, it means that the sequences they represent are identical.
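The comparison of two sets by hash can be sketched as follows. This is a minimal illustration under assumptions: the file names (set_A.fasta, set_B.fasta), the helper functions make_set and list_hashes, and the sequences are all hypothetical. Headers follow the hash_abundance naming used above.

```shell
# Hypothetical helper: write one two-line fasta entry per sequence,
# named <sha1>_<abundance> (abundance fixed to 1 for the demo)
make_set() {
    for sequence in "$@" ; do
        hash=$(printf "%s" "${sequence}" | sha1sum | cut -c 1-40)
        printf ">%s_1\n%s\n" "${hash}" "${sequence}"
    done
}

# Two small sets: ACGT is present in both, AAAA and CCCC are not
make_set ACGT AAAA > set_A.fasta
make_set ACGT CCCC > set_B.fasta

# Hypothetical helper: extract the 40-character hashes from the headers, sorted
list_hashes() {
    grep "^>" "${1}" | cut -c 2-41 | sort
}

list_hashes set_A.fasta > set_A.hashes
list_hashes set_B.fasta > set_B.hashes

# Hashes found in both sets, i.e. identical sequences
shared=$(comm -12 set_A.hashes set_B.hashes)
echo "${shared}"
```

Replacing `comm -12` with `comm -23` or `comm -13` would list the hashes specific to the first or to the second set.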

Launch swarm

To partition your dataset with the finest resolution (a local number of differences d = 1, with the fastidious option -f, which eliminates potential chained clusters and grafts small clusters secondarily) using four threads on a quad-core CPU:

./swarm -d 1 -f -t 4 amplicons.fasta > amplicons.swarms

See the user manual (man page and PDF) for details on swarm's options and parameters.
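Once swarm has run, the result can be summarized with a few lines of awk. The sketch below assumes swarm's default output format (one cluster per line, amplicon names separated by spaces) and uses a mock file, demo.swarms, with hypothetical amplicon names instead of a real swarm run.

```shell
# Mock swarm output (hypothetical names): 2 clusters, 3 amplicons in total
printf "a_10 b_2\nc_1\n" > demo.swarms

# One cluster per line: the number of clusters is the number of lines
clusters=$(awk 'END {print NR}' demo.swarms)

# One amplicon per field: the number of amplicons is the total number of fields
amplicons=$(awk '{n += NF} END {print n + 0}' demo.swarms)

echo "${clusters} clusters, ${amplicons} amplicons"
```

On real data, the same two commands can be applied to amplicons.swarms.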