meta-pipe

Metagenomic pipeline

Usage:

pipeline.py [-d|--inputdir] [-i|-inputfilename] [-o|--outputdir] [-n|--outputfilename] [-p|--phredcutoff 18] [-b|--bwa 1] [-a|--anchor 1] [-r|--derep] [-m|--derepmin 2]

Command Line Arguments

Example:

pipeline.py -d ./input/ -i 142404_0_input -o ./output -n 142405_out -p 18 -b 0 -a 0-r 1 -m 1
pipeline.py --inputdir ./input/ --inputfilename 142404_0_input --outputdir ./output --outputfilename 142405_out --phredcutoff 18 --derep 1 --derepmin 1

-d|--inputdir input directory.

-i|--inputfilename name of input file. Should be a .fastq file. Ex. if the file is named "test.fastq", input "test"

-o|--outputdir output directory.

-n|--outputfilename base name of all output files.

-p|--phredcutoff (used in SolexaQA++ dynamictrim) Phred quality score (between 0 and 41) at which base-calling error is considered too high

-b|--bwa (used in SolexaQA++ dynamictrim) use BWA trimming algorithm (1 for true/0 for false)

-a|--anchor (used in SolexaQA++ dynamictrim) Reads will only be trimmed from the 3' end (1 for true/0 for false)

-r|--derep (used in dereplication) Type of duplicates to filter. Allowed values are 1, 2, 3, 4 and 5. Use integers for multiple selections (e.g. 124 to use type 1, 2 and 4). The order does not matter. Option 2 and 3 will set 1 and option 5 will set 4 as these are subsets of the other option. 1 (exact duplicate), 2 (5' duplicate), 3 (3' duplicate), 4 (reverse complement exact duplicate), 5 (reverse complement 5'/3' duplicate)

-m|--derepmin (used in dereplication) This option specifies the number of allowed duplicates. If you want to remove sequence duplicates that occur more than x times, then you would specify x+1 as the -derep_min values. For examples, to remove sequences that occur more than 5 times, you would specify -derep_min 6. This option can only be used in combination with -derep 1 and/or 4 (forward and/or reverse exact duplicates).

Before you run analysis

Make sure to run this script on a genius by running 'ssh -X super-genius'

Modules that need to be added on coyote ('module add [module name]'):

SolexaQA++
rmblast
rmblast
bmtools
srprism
prinseq-lite
bowtie2
python/2.7.8
numpy/1.6.2

Analysis Process

Trimming - done using SolexaQA++ dynamictrim.
Dereplication - done using prinseq-lite
Filtering - done using BMtagger
Identifying organisms - done using metaphlan2

For an initial input of 4548554 sequences, the complete analysis took 78 minutes.

Output files

6 files with the name you provided in -n|--outputfilename with theses suffixes,

_1_trim.fastq
_2_dereplication_good.fastq
_2_dereplication_bad.fastq
_3_screening.fastq
_4_organism.fastq (contains the ID of each sequence and the corresponding GI number)
_4_organism (a percent breakdown of organisms)

in addition to a file called numberOfSequences.txt, which contains the number of sequences left after each step.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

meta-pipe

Command Line Arguments

Before you run analysis

Analysis Process

Output files

Files

README.md

Latest commit

History

README.md

File metadata and controls

meta-pipe

Command Line Arguments

Before you run analysis

Analysis Process

Output files