Preparation
To facilitate the use of swarm, we provide examples of shell commands that can be used to format and check the input fasta file (warning: these commands may not be suitable for very large files). The amplicon clipping step (adaptor and primer removal) and the filtering step are not discussed here.
Amplicons written on two lines are easier to manipulate: one line for the fasta header, one line for the sequence (tested with GNU Awk 4).
awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1} END {printf "\n"}' amplicons.fasta > amplicons_linearized.fasta
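As a quick illustration, the linearization command can be tried on a small toy file. The file names `toy.fasta` and `toy_linearized.fasta` and their contents are made up for this sketch and are not part of swarm.

```shell
# Build a hypothetical two-entry fasta file with sequences wrapped over two lines.
printf ">seq1\nACGT\nACGT\n>seq2\nTTGA\nCC\n" > toy.fasta

# Linearize: each header on one line, each full sequence on the next line.
awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1} END {printf "\n"}' toy.fasta > toy_linearized.fasta

# The result should contain four lines: two headers, two unwrapped sequences.
cat toy_linearized.fasta
```

After linearization, the number of lines in the file should be exactly twice the number of headers, which is a simple way to check the result.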
To speed up the clustering process, strictly identical amplicons should be merged. This step is not mandatory, but it yields an abundance value for each amplicon which is necessary to refine the clusters produced by swarm. Dereplication is also an important time saver, especially for highly redundant high-throughput sequencing surveys.
# Keep ACGT-only sequences, count identical copies, name each by its SHA-1 hash.
grep -v "^>" amplicons_linearized.fasta | \
grep -v "[^ACGTacgt]" | sort -d | uniq -c | \
while read -r abundance sequence ; do
    hash=$(printf "%s" "${sequence}" | sha1sum)
    hash=${hash:0:40}
    printf ">%s_%d_%s\n" "${hash}" "${abundance}" "${sequence}"
done | sort -t "_" -k2,2nr -k1.2,1d | \
sed -e 's/_/\n/2' > amplicons_linearized_dereplicated.fasta
Amplicons containing characters other than "ACGT" are discarded. Each dereplicated amplicon is named after the SHA-1 hash of its sequence (a 160-bit value, written as 40 hexadecimal characters), followed by its number of copies. Amplicons are sorted by decreasing number of copies, then by hash value (to guarantee a stable sorting). The use of a hashing function provides an easy way to compare sets of amplicons: if two amplicons from two different sets have the same hash, the sequences they represent are identical.
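The comparison of two sets by hash can be sketched as follows. The file names (`sample_A.fasta`, `sample_B.fasta`, `shared.hashes`) and the helper functions `entry` and `hashes` are hypothetical and only serve this example. The toy files mimic the dereplicated format, with headers of the form `>hash_abundance`.

```shell
# Hypothetical helper: write one dereplicated entry (">hash_abundance", then the sequence).
entry() {
    hash=$(printf "%s" "$1" | sha1sum)
    printf ">%s_%d\n%s\n" "${hash:0:40}" "$2" "$1"
}

# Two toy sets sharing exactly one sequence (ACGTACGT).
{ entry ACGTACGT 5 ; entry TTGACC 2 ; } > sample_A.fasta
{ entry ACGTACGT 3 ; entry GGGCCC 1 ; } > sample_B.fasta

# Hypothetical helper: extract the 40-character hash of each header, sorted for comm.
hashes() { grep "^>" "$1" | cut -c 2-41 | sort ; }
hashes sample_A.fasta > sample_A.hashes
hashes sample_B.fasta > sample_B.hashes

# Hashes found in both sets, i.e. amplicons with identical sequences.
comm -12 sample_A.hashes sample_B.hashes > shared.hashes

# Number of amplicons shared by the two sets.
grep -c "" shared.hashes
```

Because only the hashes are compared, the sequences themselves never need to be aligned or even read; `comm -23` and `comm -13` would list the amplicons unique to the first or second set.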
If you want swarm to partition your dataset with the finest resolution (a local number of differences d = 1, with elimination of potential chained clusters and secondary grafting of small clusters) on a quad-core CPU:
./swarm -d 1 -f -t 4 amplicons.fasta > amplicons.swarms
See the user manual (man page and PDF) for details on swarm's options and parameters.