Skip to content

Preflight checks

swo edited this page Sep 17, 2014 · 4 revisions

Before you start processing your data, you might try a couple things. Imagine you have a fastq foo.fastq you're interested in processing.

Check the fastq format

SmileTrain requires files to be in Illumina 1.8 format. You can check the filetype with path/to/SmileTrain/check_format.py foo.fastq. If the file is in Illumina 1.3 format, you'll need to use convert_fastq.py to change format.

Check the read length distribution

At any point in processing, you might wonder about the read length distribution. tools/read_length_distribution.py --fasta foo.fasta or --fastq foo.fastq will tell you how many reads are of each length.

Check your fastq quality

How far out do you have good quality reads? tools/fastq_quality_report.py foo.fastq will give a horizontal boxplot showing the distribution of qualities at each position along the reads.

For example, here I'm looking at some reads that are 250 bp long. They look OK at the beginning but pretty bad at the end.

$ ~/lib/SmileTrain/tools/fastq_quality_report.py for.fastq  
|"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ|
|                            ---=0=       | 1
|                            ---=0=       | 2
|                               -0=       | 3
|                               -0=       | 4
...
| ===============================0====    | 247
| ==============================0=====    | 248
| ==============================0=====    | 249
| 0=============                          | 250

Check for intersection

If you plan on merging forward and reverse reads, you might want to check that the forward and reverse reads intersect, that is, that there are matching forward and reverse reads. Try path/to/SmileTrain/check_intersect.py forward.fastq reverse.fastq.

Check for barcodes

You can use path/to/SmileTrain/tools/check_barcodes.py foo.fastq barcodes.txt to make sure that your barcodes actually appear in the fastq. If they aren't there, you might try reverse complementing your barcodes with tools/reverse_complement_barcode_map.py.

Check for primers

You should check that your primer actually appears in your fastq, but the first few reads tend to be of poor quality. tools/check_primers.py foo.fastq ACGTACGT will look through every Nth entry in foo.fastq for the primer ACGTACGT and tell you about the success rate. If you can't find it, you might consider reverse-complementing the primer. (tools/reverse_complement.py --seq ACGTACGT would help there.)

Here's an example where I wanted to double-check if I had a primer or its reverse complement:

$ ~/lib/SmileTrain/tools/check_primers.py for.fastq TGGTCGTGCCAGCMGCCGCGGTAA
of 25000 entries, taking ~0.0001, or every 10-th
Primer removal:
3070 successes and 2193 failures (58% success rate)
2.222471 seconds elapsed

$ ~/lib/SmileTrain/tools/check_primers.py --reverse_complement for.fastq TGGTCGTGCCAGCMGCCGCGGTAA
of 25000 entries, taking ~0.0001, or every 10-th
Primer removal:
0 successes and 2500 failures (0% success rate)
1.660770 seconds elapsed

Looks like I had it right!