How to process a filtered QIIME fasta
Imagine your collaborator has given you a file lots_o_poo.fna, which contains lines like
>806rcbc0_0 M02171:6:000000000-A6FUV:1:1101:21715:4030 1:N:0:1 orig_bc=GGAGACAAGGGA new_bc=GGAGACAAGGGA bc_diffs=0
TACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTACGTAGGTGGTTTGTTAAGTTGGATGTGAAAGCCCAGGGCTCAACCTTGGAACTGCATT
This is a quality-filtered QIIME fasta file. The first line identifies the sample name (806rcbc0), the ID for this read in that sample (the 0 after the _), and a whole bunch of other information. The second line is the actual sequence.
Before boarding the Smile Train, you'll need to create a file q.fst that has usearch-style labels (>sample=sample_name;X where X means that this is the X-th read in sample sample_name). For this example, it would look like >sample=806rcbc0;0. The next read from this sample will have an ID line >sample=806rcbc0;1.
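To make the relabeling concrete, here is a minimal Python sketch of the transformation described above. It is not the SmileTrain script itself; the function name and the shortened example reads are hypothetical. It assumes the sample name is everything before the last underscore in the first token of the header, and that X is a per-sample counter starting at 0.

```python
from collections import defaultdict


def relabel_qiime_headers(lines):
    """Yield lines with QIIME headers rewritten as '>sample=NAME;X'.

    Assumption: X counts reads per sample, starting at 0.
    """
    counts = defaultdict(int)  # reads seen so far for each sample
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # The first whitespace-delimited token looks like '806rcbc0_0'.
            read_id = line[1:].split()[0]
            # Split on the last underscore so sample names may contain '_'.
            sample = read_id.rsplit("_", 1)[0]
            yield ">sample={};{}".format(sample, counts[sample])
            counts[sample] += 1
        else:
            # Sequence lines pass through unchanged.
            yield line


# Hypothetical, shortened input: two reads from the same sample.
example = [
    ">806rcbc0_0 M02171:6:000000000-A6FUV:1:1101:21715:4030 1:N:0:1 bc_diffs=0",
    "TACGGAGGGTCC",
    ">806rcbc0_1 (rest of header omitted)",
    "TACGTAGGTGGT",
]

for out_line in relabel_qiime_headers(example):
    print(out_line)
```

The two header lines come out as >sample=806rcbc0;0 and >sample=806rcbc0;1, matching the labels described above, and the sequence lines are left alone.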
There is a script SmileTrain/util/qiime_to_st_labels.py that can do this for you! Make sure you are not on the head node (maybe start an interactive job with qsub -I) and then run
/path/to/SmileTrain/util/qiime_to_st_labels.py /path/to/lots_o_poo.fna > /path/to/q.fst
Take a look to see that q.fst looks the way you expect.
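If the file is too big to eyeball, a small check like the following might help. It is only a sketch: the function name is hypothetical, and it assumes every header should match >sample=sample_name;X with X a non-negative integer.

```python
import re

# Expected header shape: '>sample=NAME;X', where X is a non-negative integer.
LABEL_RE = re.compile(r"^>sample=[^;]+;\d+$")


def bad_labels(lines):
    """Return (line_number, text) for each header not in the expected form."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith(">") and not LABEL_RE.match(line):
            problems.append((number, line))
    return problems


# Hypothetical example: the second read still has a QIIME-style header.
sample_lines = [
    ">sample=806rcbc0;0",
    "TACGGAGGGTCC",
    ">806rcbc0_1 leftover QIIME header",
    "TACGTAGGTGGT",
]
print(bad_labels(sample_lines))
```

To check the real file, pass an open file handle instead of the list, for example bad_labels(open("/path/to/q.fst")). An empty list means every header has the expected shape.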
Et voilà.