Filter out sequences in Illumina HiSeq/MiSeq gzipped FASTQ files (single or paired-end) that match a reference genome, e.g., PhiX (Control Libraries)
- For removal of contamination it is recommended that you use other programs, e.g., JGI's BBDuk "Kmer filtering" feature as part of Data Preprocessing
filterfastq.pl
was originally written as a proof of concept in 2015 (and used in production on a limited basis) due to Illumina's elimination of the dedicated PhiX control lane (#8) on their HiSeq 3000/4000 instruments- e.g., Sequencing a non-indexed sample (whole genome) with a large PhiX spike-in (e.g., 10%) may have required filtering of the final set of FASTQ reads
filterfastq.pl
uses NCBI's command-line BLAST (blastn
) to compare sequences in a semi-serial fashion, i.e., sets of 4 million sequences, using multipleblastn
threads, against the reference genome (as a BLAST database), which can take a long time- e.g., Approximately 11.75 days to process 264,724,686 single end reads, which had a recorded PhiX spike-in at 0.6%; 0.521% reads were filtered out by the script
- Reference genome to filter against (in FASTA format), e.g., phiX174 (Phi X 174 bacteriophage)
- NCBI BLAST+ (
makeblastdb
andblastn
) - FASTX-Toolkit (
fastq_to_fasta
) - GNU core utilities (e.g.,
split
) - GNU sed
$ ./filterfastq.pl -h
Usage: ./filterfastq.pl -r reference_genome.fa -i fastq_input_dir -o fastq_output_dir [-t threads] [-e evalue]
-r, --reference
FASTA reference genome for sequences to be filtered against
-i, --input
Directory containing the gzipped FASTQ file(s) to be filtered
Sets of paired-end reads need to be in the same directory and are
identified by their filenames containing either *_R1_* and *_R2_*
-o, --output
Directory to store the filtered FASTQ and intermediary files
Script will create output directory but not parents, e.g., mkdir -p
-t, --threads
Optional number of threads for blastn to use (default is 1)
-e, --evalue
Optional evalue for blastn to use (default is 1e-15)
-q, --quiet
Do not write script progress to standard output
-h, --help
This usage information.