Get fasta sequences for all amplicons in a cluster

Swarm's README page indicates how to obtain a fasta file containing all the amplicons in a given cluster, for all clusters.

INPUT_SWARM="amplicons.swarms"
INPUT_FASTA="amplicons.fasta"
OUTPUT_FOLDER="swarms_fasta"
AMPLICONS=$(mktemp)
mkdir "${OUTPUT_FOLDER}"
while read cluster ; do
    tr " " "\n" <<< "${cluster}" | sed -e 's/^/>/' > "${AMPLICONS}"
    seed=$(head -n 1 "${AMPLICONS}")
    grep -A 1 -F -f "${AMPLICONS}" "${INPUT_FASTA}" | \
        sed -e '/^--$/d' > "./${OUTPUT_FOLDER}/${seed/>/}.fasta"
done < "${INPUT_SWARM}"
rm "${AMPLICONS}"

Potentially, that's a lot of files. Let's see how to be more restrictive. Please note that the examples below assume you used swarm with the option -s to get statistics on your clusters, and that your fasta sequences are not wrapped.

only the n largest clusters

INPUT_SWARM="amplicons.swarms"
INPUT_STATS="amplicons.stats"
INPUT_FASTA="amplicons.fasta"
NUMBER_OF_CLUSTERS=3
OUTPUT_FOLDER="swarms_fasta"
AMPLICONS=$(mktemp)

# clean/create a folder
[[ -d "${OUTPUT_FOLDER}" ]] && \
    rm -r "${OUTPUT_FOLDER}"
mkdir -p "${OUTPUT_FOLDER}"

# sort clusters by number of reads, pick the top ones
sort -k2,2nr "${INPUT_STATS}" | \
    head -n "${NUMBER_OF_CLUSTERS}" | \
    cut -f 2,3 | \
    while read NUMBER_OF_READS SEED ; do
        # list the amplicons grouped with that seed
        grep -m 1 "^${SEED}" "${INPUT_SWARM}" | \
            tr " " "\n" | sed -e 's/^/>/' > "${AMPLICONS}"
        # skip if the seed is not found
        [[ -s "${AMPLICONS}" ]] || continue
        # get the fasta sequences
        grep -A 1 -F -f "${AMPLICONS}" "${INPUT_FASTA}" | \
            sed -e '/^--$/d' > "./${OUTPUT_FOLDER}/${SEED}_${NUMBER_OF_READS}.fasta"
        # reinitialize the list of amplicons
        printf "" > "${AMPLICONS}"
    done
rm "${AMPLICONS}"

only clusters containing at least n reads and less than m reads

INPUT_SWARM="amplicons.swarms"
INPUT_STATS="amplicons.stats"
INPUT_FASTA="amplicons.fasta"
MAX_READS=100000
MIN_READS=10000
OUTPUT_FOLDER="swarms_fasta"
AMPLICONS=$(mktemp)

# clean/create a folder
[[ -d "${OUTPUT_FOLDER}" ]] && \
    rm -r "${OUTPUT_FOLDER}"
mkdir -p "${OUTPUT_FOLDER}"

# sort and select clusters by number of reads
sort -k2,2nr "${INPUT_STATS}" | \
    awk \
        -v MIN_READS="${MIN_READS}" \
        -v MAX_READS="${MAX_READS}" \
        '{if ($2 < MAX_READS && $2 >= MIN_READS) {print $2, $3}}' | \
    while read NUMBER_OF_READS SEED ; do
        # list the amplicons grouped with that seed
        grep -m 1 "^${SEED}" "${INPUT_SWARM}" | \
            tr " " "\n" | sed -e 's/^/>/' > "${AMPLICONS}"
        # skip if the seed is not found
        [[ -s "${AMPLICONS}" ]] || continue
        # get the fasta sequences
        grep -A 1 -F -f "${AMPLICONS}" "${INPUT_FASTA}" | \
            sed -e '/^--$/d' > "./${OUTPUT_FOLDER}/${SEED}_${NUMBER_OF_READS}.fasta"
        # reinitialize the list of amplicons
        printf "" > "${AMPLICONS}"
    done
rm "${AMPLICONS}"

only clusters present in a list

INPUT_SWARM="amplicons.swarms"
INPUT_STATS="amplicons.stats"
INPUT_FASTA="amplicons.fasta"
SEED_LIST="targets.list"
OUTPUT_FOLDER="swarms_fasta"
AMPLICONS=$(mktemp)

# clean/create a folder
[[ -d "${OUTPUT_FOLDER}" ]] && \
    rm -r "${OUTPUT_FOLDER}"
mkdir -p "${OUTPUT_FOLDER}"

# read a list of seeds and select clusters
grep --no-group-separator \
     -F -f "${SEED_LIST}" "${INPUT_STATS}" | \
    cut -f 2,3 | \
    while read NUMBER_OF_READS SEED ; do
        # list the amplicons grouped with that seed
        grep -m 1 "^${SEED}" "${INPUT_SWARM}" | \
            tr " " "\n" | sed -e 's/^/>/' > "${AMPLICONS}"
        # skip if the seed is not found
        [[ -s "${AMPLICONS}" ]] || continue
        # get the fasta sequences
        grep --no-group-separator \
             -A 1 -F -f "${AMPLICONS}" "${INPUT_FASTA}" \
             > "./${OUTPUT_FOLDER}/${SEED}_${NUMBER_OF_READS}.fasta"
        # reinitialize the list of amplicons
        printf "" > "${AMPLICONS}"
    done
rm "${AMPLICONS}"

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Get fasta sequences for all amplicons in a cluster

only the n largest clusters

only clusters containing at least n reads and less than m reads

only clusters present in a list

Clone this wiki locally