-
Notifications
You must be signed in to change notification settings - Fork 23
Get fasta sequences for all amplicons in a cluster
Frédéric Mahé edited this page Nov 27, 2022
·
1 revision
Swarm's README page indicates how to obtain a fasta file containing all the amplicons in a given cluster, for all clusters.
INPUT_SWARM="amplicons.swarms"
INPUT_FASTA="amplicons.fasta"
OUTPUT_FOLDER="swarms_fasta"
AMPLICONS=$(mktemp)
mkdir "${OUTPUT_FOLDER}"
while read cluster ; do
tr " " "\n" <<< "${cluster}" | sed -e 's/^/>/' > "${AMPLICONS}"
seed=$(head -n 1 "${AMPLICONS}")
grep -A 1 -F -f "${AMPLICONS}" "${INPUT_FASTA}" | \
sed -e '/^--$/d' > "./${OUTPUT_FOLDER}/${seed/>/}.fasta"
done < "${INPUT_SWARM}"
rm "${AMPLICONS}"
Potentially, that's a lot of files. Let's see how to be more restrictive. Please note that the examples below assume you used swarm with the option -s
to get statistics on your clusters, and that your fasta sequences are not wrapped.
INPUT_SWARM="amplicons.swarms"
INPUT_STATS="amplicons.stats"
INPUT_FASTA="amplicons.fasta"
NUMBER_OF_CLUSTERS=3
OUTPUT_FOLDER="swarms_fasta"
AMPLICONS=$(mktemp)
# clean/create a folder
[[ -d "${OUTPUT_FOLDER}" ]] && \
rm -r "${OUTPUT_FOLDER}"
mkdir -p "${OUTPUT_FOLDER}"
# sort clusters by number of reads, pick the top ones
sort -k2,2nr "${INPUT_STATS}" | \
head -n "${NUMBER_OF_CLUSTERS}" | \
cut -f 2,3 | \
while read NUMBER_OF_READS SEED ; do
# list the amplicons grouped with that seed
grep -m 1 "^${SEED}" "${INPUT_SWARM}" | \
tr " " "\n" | sed -e 's/^/>/' > "${AMPLICONS}"
# skip if the seed is not found
[[ -s "${AMPLICONS}" ]] || continue
# get the fasta sequences
grep -A 1 -F -f "${AMPLICONS}" "${INPUT_FASTA}" | \
sed -e '/^--$/d' > "./${OUTPUT_FOLDER}/${SEED}_${NUMBER_OF_READS}.fasta"
# reinitialize the list of amplicons
printf "" > "${AMPLICONS}"
done
rm "${AMPLICONS}"
INPUT_SWARM="amplicons.swarms"
INPUT_STATS="amplicons.stats"
INPUT_FASTA="amplicons.fasta"
MAX_READS=100000
MIN_READS=10000
OUTPUT_FOLDER="swarms_fasta"
AMPLICONS=$(mktemp)
# clean/create a folder
[[ -d "${OUTPUT_FOLDER}" ]] && \
rm -r "${OUTPUT_FOLDER}"
mkdir -p "${OUTPUT_FOLDER}"
# sort and select clusters by number of reads
sort -k2,2nr "${INPUT_STATS}" | \
awk \
-v MIN_READS="${MIN_READS}" \
-v MAX_READS="${MAX_READS}" \
'{if ($2 < MAX_READS && $2 >= MIN_READS) {print $2, $3}}' | \
while read NUMBER_OF_READS SEED ; do
# list the amplicons grouped with that seed
grep -m 1 "^${SEED}" "${INPUT_SWARM}" | \
tr " " "\n" | sed -e 's/^/>/' > "${AMPLICONS}"
# skip if the seed is not found
[[ -s "${AMPLICONS}" ]] || continue
# get the fasta sequences
grep -A 1 -F -f "${AMPLICONS}" "${INPUT_FASTA}" | \
sed -e '/^--$/d' > "./${OUTPUT_FOLDER}/${SEED}_${NUMBER_OF_READS}.fasta"
# reinitialize the list of amplicons
printf "" > "${AMPLICONS}"
done
rm "${AMPLICONS}"
INPUT_SWARM="amplicons.swarms"
INPUT_STATS="amplicons.stats"
INPUT_FASTA="amplicons.fasta"
SEED_LIST="targets.list"
OUTPUT_FOLDER="swarms_fasta"
AMPLICONS=$(mktemp)
# clean/create a folder
[[ -d "${OUTPUT_FOLDER}" ]] && \
rm -r "${OUTPUT_FOLDER}"
mkdir -p "${OUTPUT_FOLDER}"
# read a list of seeds and select clusters
grep --no-group-separator \
-F -f "${SEED_LIST}" "${INPUT_STATS}" | \
cut -f 2,3 | \
while read NUMBER_OF_READS SEED ; do
# list the amplicons grouped with that seed
grep -m 1 "^${SEED}" "${INPUT_SWARM}" | \
tr " " "\n" | sed -e 's/^/>/' > "${AMPLICONS}"
# skip if the seed is not found
[[ -s "${AMPLICONS}" ]] || continue
# get the fasta sequences
grep --no-group-separator \
-A 1 -F -f "${AMPLICONS}" "${INPUT_FASTA}" \
> "./${OUTPUT_FOLDER}/${SEED}_${NUMBER_OF_READS}.fasta"
# reinitialize the list of amplicons
printf "" > "${AMPLICONS}"
done
rm "${AMPLICONS}"