READ ME
This software automates bulk downloading of SRA files from NCBI onto a SLURM cluster using the SRA Toolkit.
STEP 1
Create a conda environment called 'sra_env' with Python 3 & install the SRA Toolkit.
conda create --name sra_env python=3
conda activate sra_env
conda install -c bioconda sra-tools
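To confirm the install worked, the toolkit binaries can be checked from inside the environment (a quick sanity check, not part of the pipeline):

```shell
# Activate the environment & print the toolkit versions;
# both commands should report a version rather than "command not found".
conda activate sra_env
prefetch --version
fasterq-dump --version
```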
STEP 2
Place all scripts in the same directory & run the setup.
python sra_pipe.py -setup
STEP 3
Move the accession list file to the accessions directory.
~/sra_download/accessions/SraAccList.txt
Example list format:
SRR0000001
SRR0000002
SRR0000003
...
N.B. At the time of writing, such lists could easily be obtained via https://www.ncbi.nlm.nih.gov/sra/ by searching for samples & then clicking "Send to" -> "File" -> "Accession list" -> "Create File" in the drop-down menu.
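Accession lists occasionally pick up blank lines or stray headers; a quick format check before submitting can save a failed run (an illustrative snippet, not part of the pipeline — the path is the default from this step):

```shell
# Flag any line that is not a plain run accession
# (SRR/ERR/DRR followed by digits).
ACC_LIST=~/sra_download/accessions/SraAccList.txt
if grep -qvE '^(SRR|ERR|DRR)[0-9]+$' "$ACC_LIST"; then
    echo "WARNING: malformed line(s) in $ACC_LIST"
else
    echo "accession list looks OK"
fi
```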
STEP 4
Run stage 1 to download sra files via SRA Toolkit's prefetch command.
python sra_pipe.py -stage 1
N.B. Downloads sometimes encounter a "Cannot KStreamRead" error; submitting with the optional "-buddy" flag will also submit sra_buddy.py, which regularly monitors tasks & automatically re-submits those that fail with this error.
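For reference, the per-accession work that stage 1 wraps in SLURM job scripts amounts to one prefetch call per run (a sketch using the setup's default paths; sra_pipe.py does this inside parallel SLURM tasks rather than serially):

```shell
# Fetch each accession in the list with the SRA Toolkit's prefetch,
# writing .sra files under the pipeline's working directory.
while read -r acc; do
    prefetch "$acc" -O ~/sra_download/
done < ~/sra_download/accessions/SraAccList.txt
```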
STEP 5
Run stage 2 to convert sra files to fastq files via SRA Toolkit's fasterq-dump command with default settings (equivalent to --split-3 & --skip-technical). A .fastq extension will also be added to any extension-less files & all fastqs will then be compressed via gzip.
python sra_pipe.py -stage 2
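The post-processing in stage 2 is roughly equivalent to the following (illustrative; the file names & working directory are assumptions):

```shell
# Add a .fastq suffix to any extension-less output file,
# then compress everything with gzip.
cd ~/sra_download
for f in SRR*; do
    case "$f" in
        *.*) ;;                     # already has an extension
        *)   mv "$f" "$f.fastq" ;;  # add the missing .fastq
    esac
done
gzip -f *.fastq
```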
CHECKING TASK PROGRESS / ACCOUNTING DATA
While tasks are being processed, the progress of output files & SLURM accounting data can be monitored via the appropriate flags.
python sra_pipe.py -stage 1 -progress
python sra_pipe.py -stage 1 -accounting
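The accounting data comes from SLURM itself, so it can also be queried directly with sacct when more detail is needed (standard SLURM fields; not specific to this pipeline):

```shell
# Show recent jobs for the current user with a few useful fields.
sacct -u "$USER" --format=JobID,JobName,State,Elapsed,MaxRSS
```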
ADDITIONAL SETTINGS
Various other settings can also be changed from their defaults:
FLAG                   DEFAULT / EFFECT
-overwrite             overwrite existing files
-pipe                  submit the pipeline itself so new tasks can be submitted from the hpcc when possible (useful when there are many accessions)
-u <name>              current user (found automatically)
-p </path/to/>         home directory (found automatically)
-d <name>              sra_download as working directory
-a <name>              SraAccList.txt as accession list file
-e <name>              sra_env as conda environment name
-l <number>            100 parallel task limit
-pa <name>             defq partition (specific to the UoN hpcc augusta)
-no <number>           1 node
-nt <number>           1 ntask
-me <number[k|m|g|t]>  4g memory
-wt <HH:MM:SS>         24:00:00 walltime
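Putting some of these together, a stage 1 submission with a lower task limit, a longer walltime & the download buddy enabled would look like this (flag values are illustrative):

```shell
python sra_pipe.py -stage 1 -l 50 -wt 48:00:00 -buddy
```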