forked from johnomics/RADtools
-
Notifications
You must be signed in to change notification settings - Fork 0
/
RADmanual.tex
523 lines (325 loc) · 24.1 KB
/
RADmanual.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
\documentclass[a4paper]{article}
\title{RADtools: Software for RAD Sequencing, Version 1.2.4}
\author{John Davey\\{\tt john.davey@ed.ac.uk}}
\begin{document}
\maketitle
\section{Overview}
This manual describes RADtools, a set of command-line software tools for processing and analysing RAD Sequencing data. The tools are designed to process de novo RAD data, that is, data from species without a reference genome. Together, the tools are intended to form a pipeline from raw Illumina reads to candidate genetic markers.
The three core tools are RADpools, RADtags and RADmarkers. RADpools separates raw Illumina reads into separate pools according to the Multiplex IDentifier (MID) sequences given to each individual sample in the RAD library. RADtags clusters the reads for each pool into candidate RAD tags for that pool. RADmarkers clusters tags across all pools into candidate loci with alleles. The fourth tool, RADMIDs, can be used to design a set of MIDs for use in RAD adapters.
\section{Installation}
RADtools should run on any platform that has Perl 5.10 or above available. They have been tested on Mac OS X 10.5 and up and Ubuntu 10.4. The tools have low memory requirements but will benefit from having multiple cores available. As they are only Perl scripts underneath, they are likely to be very slow with recent (>100m read) data sets.
RADpools and RADtags process multiple pools in parallel using the CPAN module \verb|Parallel::ForkManager|. Please install this module before using RADtools. If you have configured CPAN, this should just be a case of entering
\begin{verbatim}
cpan Parallel::ForkManager
\end{verbatim}
RADpools and RADtags also make use of the common module file RADtools.pm. Please ensure this file is in your Perl path (ie in your @INC list). For example, you could add the directory containing this file to the PERL5LIB environment variable:
\begin{verbatim}
export PERL5LIB=$PERL5LIB:RADtools_dirname
\end{verbatim}
where \verb|RADtools_dirname| is the name of the directory containing the RADtools.pm file. Add this line to your profile (for example, \verb|~/.bash_profile| on systems running the bash shell) to add this directory to your Perl path every time you open a shell.
\section{General options}
All tools have standard options for information.
\begin{verbatim}
-h, --help - Print a brief help message
--usage - Print concise usage
--man - Print the manual page
--version - Print version number
\end{verbatim}
\section{RADpools}
RADpools takes as input raw Illumina reads and a list of pools and creates a directory containing a set of files, one for each pool, each file containing the reads for that pool. It outputs all qualities in Sanger format.
\subsection{Basic execution}
RADpools requires at least one Illumina read file in FASTQ format as input and a pools file, listing the pools in the data and the MIDs associated with the pools.
For example, suppose we are studying a population of four Beatles. We have used four different adapters, one for each Beatle, each with a different MID. Our pools file, called Beatles.pools, will contain the following lines:
\begin{verbatim}
John ATATC
Paul CAACT
George GACTA
Ringo TATAC
\end{verbatim}
Each line contains the name of a pool and its corresponding MID, separated by a space.
We can then run RADpools as follows:
\begin{verbatim}
RADpools -i s_1_1_sequence.fastq -d Beatles
\end{verbatim}
RADpools only accepts one FASTQ file per pool (although it will accept paired end reads - see next section). Reads from multiple lanes should be concatenated into one FASTQ file.
RADpools will create a directory called Beatles in the current directory and write a file for each pool into this directory. It will also write out invalid reads to a datestamped file that records reasons for rejecting the reads. For example, our Beatles data should be processed into files like this:
\begin{verbatim}
Beatles
- Beatles_20100825_150000_invalid.txt
- John.reads
- Paul.reads
- George.reads
- Ringo.reads
\end{verbatim}
Reads can be rejected for several reasons:
\begin{itemize}
\item FASTQ: read is not in valid FASTQ format
\item RAD: not enough high quality sequence to find a MID and restriction site
\item Res: restriction site is not valid
\item MID: MID is not found in pools file
\item Short: high quality sequence shorter than trim length
\end{itemize}
RADpools always fuzzy matches restriction sites. One basepair error is allowed, including Ns. The same can be done for MIDs using the \verb|-f| option.
RADpools will process P2 MIDs (barcodes at the start of the paired end read). Specify valid P1:P2 combinations by separating them with colons:
\begin{verbatim}
John ATATC:ATATC
Paul CAACT:CAACT
George GACTA:GACTA
Ringo TATAC:TATAC
\end{verbatim}
\subsection{Basic options}
\begin{itemize}
\item \verb|-v, --verbose|
Output status messages during processing of reads, documenting how many reads have been processed, accepted and rejected.
\item \verb|-p, --paired|
RADpools can accept paired end reads with this option, eg:
\begin{verbatim}
RADpools -i s_1_1_sequence.fastq -p s_1_2_sequence.fastq -d Beatles
\end{verbatim}
\item \verb|-e, --enzyme|
By default, RADpools looks for SbfI overhangs (TGCAGG). Use this option to specify a different enzyme overhang. Note that the overhang should be used, not the whole sequence, as it is the overhang sequence that is present in the reads. For example, EcoRI has sequence G*AATTC; RADpools will search for EcoRI-cut RAD reads with
\begin{verbatim}
RADpools -i s_1_1_sequence.fastq -d Beatles -e AATTC
\end{verbatim}
RADpools understands IUPAC codes, so ambiguous enzymes such as SgrAI (CRCCGGYG) can be used.
\item \verb|-m, --max_processes|
By default, RADpools will run on a single core, but before it completes it must sort all of the pools files it creates. This can be done in parallel. If you have multiple cores available, using this option is highly recommended.
\end{itemize}
\subsection{Further options}
\begin{itemize}
\item \verb|-t, --trim|
Trim all reads to this length. Trimming is done after MID and restriction site have been removed from read, and the length of these sequences is not included in the trim length. By default, the entire read is retained.
\item \verb|-q, --quality|
When a base with quality lower than this value is found, throw the rest of the read away. If the read is now shorter than the specified trim length, the whole read is thrown away. Default is -5, so no reads are thrown away (-5 for historical compatibility with old Illumina reads).
\item \verb|-f, --fuzzy_MIDs|
If a MID sequence in a read contains an error, the read is usually thrown away. With this option, these reads will be accepted and assigned to the nearest pool. If the MID could be assigned to more than one pool, a new pool is created, named after all the possible pools for the ambiguous MID.
\item \verb|-s, --sanger|
RADpools only loads FASTQ format, but can process Illumina or Sanger qualities. If your input file is in fastq-sanger format, use this option.
\item \verb|-o, --output_fastq|
Output FASTQ format, rather than READS format. Use this option if you want to align your reads to a reference genome rather than use the rest of the RADtools pipeline (although RADtags will load FASTQ too).
\end{itemize}
\section{RADtags}
RADtags takes as input the directory of files created by RADpools and generates candidate RAD tags for each pool.
\subsection{Basic execution}
By default, RADtags loads reads files output by RADpools, but it can also load FASTQ format.
RADtags can be run from within a directory containing the reads files generated by RADpools with no options.
\begin{verbatim}
RADtags
\end{verbatim}
RADtags will produce a tags file for each reads file. If it was run within the Beatles directory, that directory would then contain the following:
\begin{verbatim}
Beatles
- Beatles_20100825_150000_invalid.txt
- John.reads
- John.tags
- Paul.reads
- Paul.tags
- George.reads
- George.tags
- Ringo.reads
- Ringo.tags
\end{verbatim}
To load FASTQ data, specify the \verb|-z| option. The \verb|-s| option is also required when loading FASTQ data:
\begin{verbatim}
RADtags -z -s TGCAGG
\end{verbatim}
RADtags assumes input files are in fastq-sanger format. If you have fastq-illumina FASTQ data, please specify the \verb|-i| option.
RADtags finds FASTQ files as it does reads files. If you have paired end data, then the filenames must be in the following format:
\begin{verbatim}
samplename_1.fastq
samplename_2.fastq
\end{verbatim}
RADtags will match up files with data in this format but otherwise will process each file as if it were a separate sample.
\subsection{Basic options}
\begin{itemize}
\item \verb|-v, --verbose|
Verbose output. Show names of files being processed.
\item \verb|-f, --file_extension|
By default, RADtags loads all files that end with `.reads'. If -z is specified, RADtags will search for files ending in `.fastq'. Any other extension can be specified with this option. Separators such as `.' must be specified if present.
\item \verb|-d, --directory|
By default, RADtags searches for files in the current directory. To specify another directory, use this option. For example, to load the read files from the Beatles directory, run the following:
\begin{verbatim}
RADtags -d Beatles
\end{verbatim}
\item \verb|-s, --site|
RADpools retains the restriction site overhang on the front of every read, as this sequence can be useful for aligning reads to a reference genome. This option is required for loading FASTQ files so that FASTQ files can be sorted correctly. By default, the restriction site is retained even after sorting; set -t to remove it.
\item \verb|-t, --trim_site|
Trims the restriction site from the start of each read.
\item \verb|-i, --illumina|
By default, RADtags expects quality scores to be in fastq-sanger format. Use this option to load fastq-illumina data. Without this option, fastq-illumina data will produce incorrect output that will not be processed correctly by RADmarkers. This option is ignored if data is in `.reads' format, where Sanger qualities are assumed. Loading fastq-sanger data with this option turned on will produce incorrect output.
\item \verb|-c, --cluster_distance|
The number of mismatches allowed for a tag to be added to a cluster. RADtags will cluster together tags that are less than or equal to -c bases apart, but will start new clusters if a new tag has more than -c mismatches with the canonical tag in the current cluster. Default 5.
\item \verb|-m, --max_processes|
By default, RADtags will run on a single core, processing each tags file in series. To process tags files in parallel, increase the value of this option. If you have multiple cores available, using this option is highly recommended.
\end{itemize}
\subsection{Further options}
RADtags currently produces somewhat noisy output by default, because it is not yet clear what filters are appropriate for this data. However, several options are provided as blunt filters if the noise is getting in the way of the signal.
\begin{itemize}
\item \verb|-q, --quality_threshold|
RADtags clusters together groups of similar unique sequences and then calls particular uniques as real tags. When comparing uniques to assess their similarity, only bases with quality scores above this threshold are used (default 20). Bases with qualities below the threshold are considered to be errors and are ignored.
\item \verb|-r, --read_threshold|
Tags with less than this many reads associated with them will be thrown away (default 2). If your tags have high coverage, and there are many low count tags cluttering the output, increase this threshold.
\end{itemize}
\section{RADmarkers}
RADmarkers takes as input the set of files produced by RADtags containing candidate RAD tags and generates candidate RAD loci and alleles across all RAD pools, calling SNPs and assessing tag qualities in the process. RADmarkers will cluster together clusters from different pools where those clusters share a particular tag sequence. RADmarkers will also look for mismatches to tag sequences if the \verb|-m| option is used.
\subsection{Basic execution}
RADmarkers can be run from within a directory containing the tags files generated by RADtags with no options:
\begin{verbatim}
RADmarkers
\end{verbatim}
The tool will output the tags to standard output. To save the output, pipe it to a file, eg:
\begin{verbatim}
RADmarkers > RADmarkers.output
\end{verbatim}
\subsection{Basic options}
\begin{itemize}
\item \verb|-e, --end_tagfiles|
By default, RADmarkers loads all files that end with `.tags'. Use this option to change this.
\item \verb|-d, --directory|
By default, RADmarkers searches for files in the current directory. To specify another directory, use this option. For example, to load the tag files from the Beatles directory, run the following:
\begin{verbatim}
RADmarkers -d Beatles
\end{verbatim}
RADmarkers also loads the corresponding pools file to preserve the order of individuals in the pools file in its output. It will look in the directory given by -d and the directory one level above for a pools file.
\item \verb|-v, --verbose|
Output status messages and summary statistics during run.
\item \verb|-s, --snpsout|
Output SNPs for clusters containing more than one tag.
\item \verb|-q, --qualsout|
Output qualities for each tag, for each pool that features the tag.
\item \verb|-f, --fragments|
By default, RADmarkers outputs read counts, but where paired end data is available, if this option is set RADmarkers will use the fragment counts output by RADtags instead.
\end{itemize}
\subsection{Further options}
\begin{itemize}
\item \verb|-t, --tag_count_threshold|
Remove tags with counts below this threshold. Default is 0, ie include everything. This is likely to result in very noisy data, so increase this threshold to reduce the noise. There is a tradeoff, because each pool will have a different number of reads, and so the threshold does not mean the same thing for each pool. Be careful not to raise this so high that real tags are getting thrown away in pools with few reads. Better normalisation of the count data across pools is forthcoming.
\item \verb|-m, --mismatches|
By default, RADmarkers will join clusters across multiple pools where two clusters share an identical tag (-m is 0). By increasing the value of this option, RADmarkers will also search for mismatches in the tag sequences. This will cluster together many alleles that otherwise would be separated because they don't all appear in one individual, but using values of more than 3 is not recommended. Many tags will be absorbed into repeat regions, and processing can take a very long time.
\item \verb|-i, --include_singletons|
Tags that are present in only one pool are rejected by default, as they are likely to be errors and unlikely to be very useful. Set this option to include these tags.
\item \verb|-o, --old_output|
Produces the older output format, now deprecated.
\end{itemize}
\section{RADMIDs}
RADMIDs generates a set of MIDs of a certain length with a certain number of differences between each pair of MIDs, for use in RAD adapters. MIDs are generated with several differences between any pair so that they can be distinguished bioinformatically even if the MID for a particular read contains a sequencing error.
\subsection{Basic execution}
\begin{verbatim}
RADMIDs
\end{verbatim}
Running RADMIDs with no options will generate all 5bp MIDs with 3bp difference between each pair of MIDs and write them out to a file called mids.out.
\subsection{Options}
\begin{itemize}
\item \verb|-o, --outfile|
Name of output file (default mids.out).
\item \verb|-l, --length|
Length of generated MIDs (default 5).
\item \verb|-d, --difference|
Number of differences between any pair of MIDs (default 3).
\end{itemize}
\section{Output formats}
At present, the RADtools use a number of ad hoc text output formats suitable for hacking with Unix tools. RADmarkers output can now be loaded directly into R or Excel. Suggestions for improvements are welcome.
\subsection{RADpools}
Each read file output by RADpools contains one read per line in the following format:
\begin{verbatim}
<read1_sequence> <read1_quality> <read2_sequence> <read2_quality>
\end{verbatim}
Here, read2 refers to paired end data. Once both ends have been joined together, there is little use for the FASTQ headers, and so they are not preserved. Storing one read per line makes it easy to filter, sort and summarise each pool in many different ways with Unix tools, tasks which are mostly unpleasant with FASTQ data.
If FASTQ output is desired, use the -o option. Separate files will be output for read 1 and read 2.
RADpools outputs all qualities in fastq-sanger format, regardless of input format.
\subsection{RADtags}
Each tags file output by RADtags contains tag sequences and counts in the following format:
\begin{verbatim}
<tag1_sequence> <tag1_quality> <read_count> <fragment_count>
<tag1_pair1_sequence> <tag1_pair1_read_count>
<tag1_pair2_sequence> <tag1_pair2_read_count>
...
<tag2_sequence> <tag2_quality> <read_count> <fragment_count>
<tag2_pair1_sequence> <tag2_pair1_read_count>
<tag2_pair2_sequence> <tag2_pair2_read_count>
...
<tag3_sequence> <tag1_quality> <read_count> <fragment_count>
<tag1_pair1_sequence> <tag1_pair1_read_count>
<tag1_pair2_sequence> <tag1_pair2_read_count>
...
...
\end{verbatim}
Tags are displayed with median qualities for each base and a number of reads associated with each tag (including error-corrected reads). The fragment count is the number of unique paired end sequences found for each tag, the assumption being that duplicate paired ends are the result of PCR amplification and so are not representative of DNA quantity in the original sample. Paired ends for each tag are found directly after the tag, each one indented by 8 spaces.
Tags are clustered together by sequence similarity. Clusters are separated by empty lines. In the above example, tag1 and tag2 are in the same cluster, whereas tag3 is in a different cluster. For `tag' and `cluster', we would like to say `candidate allele' and `candidate locus' (although this is not strictly accurate, as there may be multiple variations in a single tag, and the clustering is not so good that we can be sure a particular tag comes from just one region of the genome, or that a group of tags are from the same region).
\subsection{RADmarkers}
The RADmarkers output is in tab separated values (tsv) format, suitable for loading in R or Excel. Each record represents a particular tag, or candidate allele, and has the following fields:
\begin{itemize}
\item \verb|ClusterID| - unique ID for a cluster of tags
\item \verb|ClusterTags| - the number of tags in this cluster
\item \verb|SegPattern| - the presence/absence segregation pattern of this tag across all individuals
\item \verb|Tag| - the sequence of the tag
\end{itemize}
Following these fields, the read or fragment counts for each individual are given.
This format can be loaded into R with, for example,
\begin{verbatim}
read.delim("filename")
\end{verbatim}
\subsubsection{Old RADmarkers output}
The RADmarkers output used to be considerably less user-friendly, but also incorporated SNP and quality information that may be useful. It can be accessed using the \verb|-o| option. Below is the original description of this output format.
The clustering in the RADmarkers output is somewhat complex, so let's start with the core objects, the tags. With no options added, RADmarkers will output tags like this:
\begin{verbatim}
John Paul George Ringo
<tag1_sequence> 20.00 - 10.00 15.00
<tag2_sequence> 22.00 17.00 - -
<tag3_sequence> - - 19.00 25.00
\end{verbatim}
This shows that both tags 1 and 2 are present in John, with read counts 20 and 22 respectively, but tag1 is not present in Paul and tag2 is not present in George and Ringo, etc. Clusters are separated by a blank line.
With options selected for outputting SNPs and qualities, the first cluster above would appear like this:
\begin{verbatim}
John Paul George Ringo
<tag1_sequence > 20.00 - 10.00 15.00
<tag1_qual_John > John
<tag1_qual_George> George
<tag1_qual_Ringo > Ringo
<tag2_sequence > 22.00 17.00 - -
<tag1_qual_John > John
<tag1_qual_Paul > Paul
<snp1_tag1var > 20.00 - 10.00 15.00
<snp1_tag2var > 22.00 17.00 - -
\end{verbatim}
Here, a summary of qualities for each individual is output for every tag sequence. Rather than output the full median Illumina qualities, quality scores are summarised as follows: ` ' means $Q>30$, `?' means $Q<30$ and `!' means $Q<20$. This allows simple visual inspection of the qualities to see if a tag is credible and if particular SNPs are likely to be real.
SNPs are reported on their own lines as single characters, with counts for each SNP given separately, as shown above.
Moving to a higher level of abstraction, clusters of tags are grouped by segregation pattern, by the presence or absence of each tag in each pool. For example, the segregation pattern for tag1 above would be \verb|1011|, for tag2, \verb|1100|, and so for the cluster as a whole, the segregation pattern would be \verb|1011 1100|.
Clusters with a particular segregation pattern are output as follows:
\begin{verbatim}
<segregation_pattern> <cluster_count>
<cluster1_tag1>
<cluster1_tag2>
<cluster2_tag1>
<cluster2_tag2>
...
\end{verbatim}
The number of clusters featuring a particular segregation pattern is output, followed by each cluster with the given segregation pattern, in the format shown above.
At the top level of abstraction, segregation patterns are grouped by the number of tags in each cluster. For example, the cluster containing tag1 and tag2 above has two tags, whereas the cluster containing tag3 has just one tag. If there were, say, four other tags with tag3's segregation pattern, six other tags with tag1/tag2's segregation pattern, and overall there were 500 1 tag clusters and 300 2 tag clusters, these clusters would be output as follows (ignoring SNP and quality output):
\begin{verbatim}
1 500
...
0011 5
...
<tag3_sequence> - - 19.00 25.00
...
...
2 300
...
1011 1100 7
...
<tag1_sequence> 20.00 - 10.00 15.00
<tag2_sequence> 22.00 17.00 - -
...
...
\end{verbatim}
This permits searching for particular segregation patterns of interest (according to phenotypes, for example) using Unix tools, and for tags with a particular segregation pattern to be browsed easily.
These output formats have evolved ad hoc for handling individual RAD projects and are clearly unsuitable for many purposes. They will be replaced in time; however, it is not clear what the most suitable format is, given the complexity of the data and the wide range of RAD applications. If you have suggestions or requirements for RAD output, please let me know.
\section{Contact details}
Please send comments, suggestions, questions and bug reports to John Davey, \verb|john.davey@ed.ac.uk|. If sending a bug report, please send the command you ran, any error messages produced and a sample of your input data.
The RADtools algorithms are described in more detail in the following paper:
Baxter SW, Davey JW, Johnston JS, Shelton AM, Heckel DG, Jiggins CD, Blaxter ML (2011). Linkage mapping and comparative genomics using next-generation RAD sequencing of a non-model organism. PLoS ONE 6(4):e19315.
UK RAD Sequencing wiki: \verb|https://www.wiki.ed.ac.uk/display/RADSequencing/|.
UK RAD Sequencing mailing list: \verb|rad-sequencing@jiscmail.ac.uk|.
RADtools github repository: \verb|git://github.com/johnomics/RADtools.git|.
\end{document}