-
Notifications
You must be signed in to change notification settings - Fork 6
Preflight checks
Before you start processing your data, you might try a couple things. Imagine you have a fastq foo.fastq
you're interested in processing.
SmileTrain requires files to be in Illumina 1.8 format. You can check the filetype with path/to/SmileTrain/check_format.py foo.fastq
. If the file is in Illumina 1.3 format, you'll need to use convert_fastq.py
to change format.
At any point in processing, you might wonder about the read length distribution. tools/read_length_distribution.py --fasta foo.fasta
or --fastq foo.fastq
will tell you how many reads are of each length.
How far out do you have good quality reads? tools/fastq_quality_report.py foo.fastq
will give a horizontal boxplot showing the distribution of qualities at each position along the reads.
For example, here I'm looking at some reads that are 250 bp long. They look OK at the beginning but pretty bad at the end.
$ ~/lib/SmileTrain/tools/fastq_quality_report.py for.fastq
|"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ|
| ---=0= | 1
| ---=0= | 2
| -0= | 3
| -0= | 4
...
| ===============================0==== | 247
| ==============================0===== | 248
| ==============================0===== | 249
| 0============= | 250
If you plan on merging forward and reverse reads, you might want to check that the forward and reverse reads intersect, that is, that there are matching forward and reverse reads. Try path/to/SmileTrain/check_intersect.py forward.fastq reverse.fastq
.
You can use path/to/SmileTrain/tools/check_barcodes.py foo.fastq barcodes.txt
to make sure that your barcodes actually appear in the fastq. If they aren't there, you might try reverse complementing your barcodes with tools/reverse_complement_barcode_map.py
.
You should check that your primer actually appears in your fastq, but the first few reads tend to be of poor quality. tools/check_primers.py foo.fastq ACGTACGT
will look through every Nth entry in foo.fastq
for the primer ACGTACGT
and tell you about the success rate. If you can't find it, you might consider reverse-complementing the primer. (tools/reverse_complement.py --seq ACGTACGT
would help there.)
Here's an example where I wanted to double-check if I had a primer or its reverse complement:
$ ~/lib/SmileTrain/tools/check_primers.py for.fastq TGGTCGTGCCAGCMGCCGCGGTAA
of 25000 entries, taking ~0.0001, or every 10-th
Primer removal:
3070 successes and 2193 failures (58% success rate)
2.222471 seconds elapsed
$ ~/lib/SmileTrain/tools/check_primers.py --reverse_complement for.fastq TGGTCGTGCCAGCMGCCGCGGTAA
of 25000 entries, taking ~0.0001, or every 10-th
Primer removal:
0 successes and 2500 failures (0% success rate)
1.660770 seconds elapsed
Looks like I had it right!