<!DOCTYPE html><html prefix="dcterms: http://purl.org/dc/terms/">
<head>
<title>Metabarcoding and DNA barcoding for Ecologists: Sequence analysis</title>
<!--Generated on Sat Jun 22 08:40:06 2019 by LaTeXML (version 0.8.3) http://dlmf.nist.gov/LaTeXML/.-->
<!--Document created on June 22, 2019.-->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="stylesheet" href="LaTeXML.css" type="text/css">
<link rel="stylesheet" href="ltx-book.css" type="text/css">
</head>
<body>
<div class="ltx_page_main">
<div class="ltx_page_content">
<article class="ltx_document">
<h1 class="ltx_title ltx_title_document">Metabarcoding and DNA barcoding for Ecologists: Sequence analysis</h1>
<div class="ltx_authors">
<span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Akifumi S. Tanabe
</span></span>
</div>
<div class="ltx_date ltx_role_creation">June 22, 2019</div>
<section id="Chx1" class="ltx_chapter">
<h2 class="ltx_title ltx_title_chapter">Preface</h2>
<div id="Chx1.p1" class="ltx_para">
<p class="ltx_p">This book is distributed under a Creative Commons Attribution-ShareAlike 4.0 International License.
You may copy, redistribute, and display this text provided that you credit the author.
You may also modify this text and distribute the modified version provided that you credit the author and apply this license or a compatible license to the modified version.
To view a copy of this license, visit
<br class="ltx_break"><a href="https://creativecommons.org/licenses/by-sa/4.0/" title="" class="ltx_ref ltx_href">https://creativecommons.org/licenses/by-sa/4.0/</a>
<br class="ltx_break">or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.</p>
</div>
<div id="Chx1.p2" class="ltx_para">
<p class="ltx_p">I hope that this text helps you.
I am grateful to Dr. Hirokazu Toju (Center for Ecological Research, Kyoto University), Dr. Satoshi Nagai (National Research Institute of Fisheries Science, Japan Fisheries Research and Education Agency), Dr. Hiroki Yamanaka (Ryukoku University), and you.</p>
</div>
</section>
<section id="Chx2" class="ltx_chapter">
<h2 class="ltx_title ltx_title_chapter">Legends</h2>
<div id="Chx2.p1" class="ltx_para">
<p class="ltx_p">In this text, commands entered at the terminal and their displayed output are shown as below.</p>
</div>
<div id="Chx2.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># comments</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> command argument1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">argument2 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">argument3↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">output of command</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> command argument1 argument2 argument3↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">output of command</span></span>
</span>
</div>
<div id="Chx2.p3" class="ltx_para">
<p class="ltx_p">In the above example, the same command <span class="ltx_text ltx_font_typewriter">command argument1 argument2 argument3</span> was executed twice.
Each time, the output <span class="ltx_text ltx_font_typewriter">output of command</span> was displayed after execution.
The characters between # and the line feed are a comment and do not need to be typed.
<span class="ltx_text ltx_font_typewriter">></span> followed by a space at the head of a line indicates the terminal prompt.
Do not type these characters.
↓ marks the end of a command and its arguments; do not type it, but press the Enter key at that point to input a line feed.
I insert line feeds within commands or arguments for readability.
Such a line feed is preceded by <span class="ltx_text ltx_font_typewriter">\</span>.
Therefore, a line feed preceded by <span class="ltx_text ltx_font_typewriter">\</span> does not mean the end of the command or arguments, nor an instruction to press the Enter key.
Additional line breaks may be introduced by word wrapping, depending on your reading environment; these also do not mean the end of the command or arguments, nor an instruction to press the Enter key.</p>
</div>
<div id="Chx2.p4" class="ltx_para">
<p class="ltx_p">File contents are shown as below in this text.</p>
</div>
<div id="Chx2.p5" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| The content of first line</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| The content of second line</span></span>
</p>
</div>
<div id="Chx2.p6" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_typewriter">|</span> followed by a space at the head of a line marks the start of a line in the file; these characters do not exist in the file and do not need to be typed.
This marker helps you distinguish true line feeds from line breaks introduced by word wrapping.</p>
</div>
</section>
<section id="Ch0" class="ltx_chapter">
<h2 class="ltx_title ltx_title_chapter">
<span class="ltx_tag ltx_tag_chapter">Chapter 0 </span>Installing software and preparing the analysis environment</h2>
<div id="Ch0.p1" class="ltx_para">
<p class="ltx_p">In this text, I assume Debian GNU/Linux 9 (stretch) (hereafter Debian) or Ubuntu Linux 18.04 LTS (hereafter Ubuntu) as the operating system.
If you use a Windows PC, please install Debian or Ubuntu.
Cygwin or the Windows Subsystem for Linux provided for Windows 10 can be used for the following analysis, but the programs run much more slowly.
You can use a CD, DVD or USB memory to boot the Linux installer.
If your PC has only one storage device, you need to shrink the Windows partition using partition-resizing software such as EaseUS Partition Master or the partition resize function included in the installer.
You can also use a newly added internal storage device or an external storage device connected by USB.
There are several variants of Ubuntu, and I recommend Xubuntu rather than standard Ubuntu.</p>
</div>
<div id="Ch0.p2" class="ltx_para">
<p class="ltx_p">Debian and Ubuntu can be installed on a Mac.
If there is not enough space, you need to resize the OSX partition with Disk Utility or add a storage device.
The rEFIt or rEFInd boot selector may be required to boot Debian, Ubuntu or their installers on a Mac.
If you install rEFIt or rEFInd on your Mac, you can boot the installer of Debian or Ubuntu from a CD, DVD or USB memory.
Do not delete the existing OSX partition.
If you have enough free space, you do not need to use Disk Utility to resize the existing partition.
You can also install Debian or Ubuntu on an external storage device on a Mac.</p>
</div>
<div id="Ch0.p3" class="ltx_para">
<p class="ltx_p">I assume a machine with an Intel64/AMD64 (x86_64) CPU as the analysis environment.
Machines with other CPUs can be used for analysis, but you will need to solve any problems by yourself.
The 64-bit version of Debian or Ubuntu is also required because the 32-bit version cannot use large amounts of memory.</p>
</div>
<section id="Ch0.S1" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">0.1 </span>Installation of Claident, Assams, databases, and other required programs</h3>
<div id="Ch0.S1.p1" class="ltx_para">
<p class="ltx_p">Run the following commands in a terminal or console as a user that can use <span class="ltx_text ltx_font_typewriter">sudo</span>.
All of the required software will then be installed.
The installer will ask you for your password when <span class="ltx_text ltx_font_typewriter">sudo</span> is used.</p>
</div>
<div id="Ch0.S1.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> mkdir -p ~/workingdirectory↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> cd ~/workingdirectory↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> wget https://www.claident.org/installClaident_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> sh installClaident_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> wget https://www.claident.org/installOptions_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> sh installOptions_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> wget https://www.claident.org/installDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> sh installDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> wget https://www.claident.org/installUCHIMEDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> sh installUCHIMEDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> cd ..↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> rm -r workingdirectory↓</span></span>
</span>
</div>
<div id="Ch0.S1.p3" class="ltx_para">
<p class="ltx_p">By default, the software will be installed to <span class="ltx_text ltx_font_typewriter">/usr/local</span>.
During installation, you will see a <span class="ltx_text ltx_font_typewriter">Permission denied</span> error, and the installer will then ask you for your password.
If the installer continues after you enter the password, you can ignore the error.
The installer first tries to install without <span class="ltx_text ltx_font_typewriter">sudo</span>, which produces the above error.
It then retries the installation using <span class="ltx_text ltx_font_typewriter">sudo</span>.</p>
</div>
<div id="Ch0.S1.p4" class="ltx_para">
<p class="ltx_p">If you need a proxy to connect to the Internet, execute the following commands to set environment variables before running the installer.</p>
</div>
<div id="Ch0.S1.p5" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> export http_proxy=http://server.address:portnumber↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> export https_proxy=$http_proxy↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> export ftp_proxy=$http_proxy↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> export all_proxy=$http_proxy↓</span></span>
</span>
</div>
<div id="Ch0.S1.p6" class="ltx_para">
<p class="ltx_p">If the proxy requires a username and password, execute the following commands instead of the above commands.</p>
</div>
<div id="Ch0.S1.p7" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> export http_proxy=http://username:password@server.address:portnumber↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> export https_proxy=$http_proxy↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> export ftp_proxy=$http_proxy↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> export all_proxy=$http_proxy↓</span></span>
</span>
</div>
<section id="Ch0.S1.SS1" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">0.1.1 </span>Upgrading to a new version</h4>
<div id="Ch0.S1.SS1.p1" class="ltx_para">
<p class="ltx_p">If you want to upgrade all of the software and the databases, run the same commands as for the initial installation.
By this procedure, Assams, Claident, PEAR, VSEARCH, Metaxa and ITSx will be installed to <span class="ltx_text ltx_font_typewriter">/usr/local</span>, and NCBI BLAST+, BLAST databases for molecular identification, taxonomy databases and other required programs will be installed to <span class="ltx_text ltx_font_typewriter">/usr/local/share/claident</span>.
The NCBI BLAST+ and BLAST databases used by Claident can coexist with a system-wide installation of NCBI BLAST+ and BLAST databases.</p>
</div>
<div id="Ch0.S1.SS1.p2" class="ltx_para">
<p class="ltx_p">You can skip part of the upgrade as shown below.</p>
</div>
<div id="Ch0.S1.SS1.p3" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> mkdir -p ~/workingdirectory↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> cd ~/workingdirectory↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of Assams</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> touch .assams↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of Claident</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> touch .claident↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of PEAR</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> touch .pear↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of VSEARCH</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> touch .vsearch↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of NCBI BLAST+</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> touch .blast↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># execute upgrade</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> wget https://www.claident.org/installClaident_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> sh installClaident_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of sff_extract</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> touch .sffextract↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of HMMer</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> touch .hmmer↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of MAFFT</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> touch .mafft↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of Metaxa</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> touch .metaxa↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of ITSx</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> touch .itsx↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># execute upgrade</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> wget https://www.claident.org/installOptions_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> sh installOptions_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of "overall" BLAST and taxonomy databases</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> touch .overall↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># execute upgrade</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> wget https://www.claident.org/installDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> sh installDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of "Claident Databases for UCHIME"</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> touch .cdu↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of "rdp" reference database for chimera detection</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> touch .rdp↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of "silva" reference databases for chimera detection</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> touch .silva↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of "unite" reference databases for chimera detection</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> touch .unite↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># execute upgrade</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> wget https://www.claident.org/installUCHIMEDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> sh installUCHIMEDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> cd ..↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> rm -r workingdirectory↓</span></span>
</span>
</div>
</section>
<section id="Ch0.S1.SS2" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">0.1.2 </span>Installing to non-default path</h4>
<div id="Ch0.S1.SS2.p1" class="ltx_para">
<p class="ltx_p">If you install the software following the above procedure, it will be installed to <span class="ltx_text ltx_font_typewriter">/usr/local</span>.
The executable commands will be installed to <span class="ltx_text ltx_font_typewriter">/usr/local/bin</span>.
You can change the install path so that the software can coexist with other programs, such as older versions, as shown below.</p>
</div>
<div id="Ch0.S1.SS2.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> mkdir -p ~/workingdirectory↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> cd ~/workingdirectory↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> export PREFIX=install_path↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> wget https://www.claident.org/installClaident_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> sh installClaident_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> wget https://www.claident.org/installOptions_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> sh installOptions_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> wget https://www.claident.org/installDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> sh installDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> wget https://www.claident.org/installUCHIMEDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> sh installUCHIMEDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> cd ..↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> rm -r workingdirectory↓</span></span>
</span>
</div>
<div id="Ch0.S1.SS2.p3" class="ltx_para">
<p class="ltx_p">In this case, the following command needs to be executed before analysis.</p>
</div>
<div id="Ch0.S1.SS2.p4" class="ltx_para">
<p class="ltx_p ltx_align_left ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> export PATH=install_path/bin:$PATH↓</span></p>
</div>
<div id="Ch0.S1.SS2.p5" class="ltx_para">
<p class="ltx_p">You can omit the above command if it has been added to <span class="ltx_text ltx_font_typewriter">~/.bash_profile</span> or <span class="ltx_text ltx_font_typewriter">~/.bashrc</span>.</p>
</div>
</section>
<section id="Ch0.S1.SS3" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">0.1.3 </span>How to install multiple versions on one computer</h4>
<div id="Ch0.S1.SS3.p1" class="ltx_para">
<p class="ltx_p">If you install Claident and the other software to the default install path on a computer where Claident is already installed, all of the software will be overwritten.
As noted above, multiple versions of Claident can coexist if you install Claident to a non-default path.
Note, however, that the configuration file, <span class="ltx_text ltx_font_typewriter">.claident</span> in the home directory of the login user (<span class="ltx_text ltx_font_typewriter">/home/username</span>) or <span class="ltx_text ltx_font_typewriter">/etc/claident</span>, cannot exist in multiple versions at the same path.
You need to replace this file before switching the version of Claident.
The configuration file in the home directory of the login user is used preferentially.
To use multiple versions, I recommend creating a user account for each version and installing Claident to the home directory of each user.
The version of Claident can then be switched by switching the login user.</p>
</div>
</section>
</section>
</section>
<section id="Ch1" class="ltx_chapter">
<h2 class="ltx_title ltx_title_chapter">
<span class="ltx_tag ltx_tag_chapter">Chapter 1 </span>Sequencing of multiple samples by next-generation sequencers</h2>
<div id="Ch1.p1" class="ltx_para">
<p class="ltx_p">In this chapter, I give a brief overview of the tagged multiplex sequencing method using Roche GS series sequencers, Ion PGM and Illumina MiSeq.
These sequencers can read more than 400 bp contiguously and are suitable for metabarcoding and DNA barcoding.
Note that MiSeq requires concatenation of paired-end reads.
Therefore, PCR amplicons should be 500 bp or shorter (400 bp is recommended) so that the paired-end reads can be concatenated.
Forward and reverse reads can be analyzed separately, but I do not recommend such analysis because reverse reads are usually of low quality.</p>
</div>
<div id="Ch1.p2" class="ltx_para">
<p class="ltx_p">Next-generation sequencers output an extremely large number of nucleotide sequences in a single run.
The running cost of a single run is much higher than that of Sanger method-based sequencers.
To use such sequencers efficiently, the multiplex sequencing method was developed.
In this method, multiplex identifier tag sequences are added to the target sequences to identify the sample of origin, and the multiple tagged samples are mixed and sequenced in a single run.
This method can greatly reduce per-sample sequencing costs.
The multiplex identifier tag is also called a “barcode”.
However, the nucleotide sequence used for DNA barcoding is called a “barcode sequence”.
This is very confusing, and “multiplex identifier tag” is too long.
Thus, I call the multiplex identifier tag sequence just “tag” in this text.
Please note that the tag is often called an “index”.</p>
</div>
<div id="Ch1.p3" class="ltx_para">
<p class="ltx_p">In the following analysis, chimeric sequences formed during PCR and erroneous sequences can cause misinterpretation of the analysis results.
If multiple PCR replicates are prepared, tagged and sequenced separately, sequences shared among all replicates can be considered nonchimeric and less erroneous.
This is because chimeras can form from a huge number of sequence combinations and junction points, and errors can take many forms, whereas each true sequence has only one error-free pattern; nonchimeric, error-free sequences are therefore likely to be observed in all replicates.
Programs alone cannot remove chimeras and errors sufficiently, but we can expect that combining PCR replicates with programs improves the removal efficiency of chimeras and errors.
After removal of chimeras and errors, the sequence counts of the PCR replicates can be summed and used in subsequent analysis.</p>
</div>
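<div class="ltx_para">
<p class="ltx_p">The replicate-based filtering described above can be sketched as follows.
This is only a minimal illustration, not the procedure implemented by any particular program: the function name <span class="ltx_text ltx_font_typewriter">shared_sequence_counts</span> and the read counts are hypothetical, sequences are compared by exact identity, and every stored count is assumed to be positive.</p>
</div>

```python
from collections import Counter


def shared_sequence_counts(replicates):
    """Keep sequences observed in every replicate and sum their read counts.

    `replicates` is a list of Counter objects mapping sequence to read count,
    one Counter per PCR replicate of the same sample.
    """
    if not replicates:
        return Counter()
    # Start from the sequences of the first replicate and keep only those
    # that are also present in every other replicate.
    shared = set(replicates[0])
    for counts in replicates[1:]:
        shared = shared.intersection(counts)
    # Sum the read counts of the retained sequences over all replicates.
    summed = Counter()
    for counts in replicates:
        for sequence in shared:
            summed[sequence] += counts[sequence]
    return summed


# Hypothetical read counts for three PCR replicates of one sample.
replicate_a = Counter({"ACGTACGT": 120, "ACGTTCGT": 3, "ACGGACGT": 1})
replicate_b = Counter({"ACGTACGT": 98, "ACGTTCGT": 5, "TCGTACGA": 2})
replicate_c = Counter({"ACGTACGT": 110, "ACGTTCGT": 4})

result = shared_sequence_counts([replicate_a, replicate_b, replicate_c])
for sequence in sorted(result):
    print(sequence, result[sequence])
```

<div class="ltx_para">
<p class="ltx_p">Sequences that appear in only one or two replicates are dropped, and the counts of the remaining sequences are summed across replicates for use in later steps.</p>
</div>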
<section id="Ch1.S1" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">1.1 </span>PCR using tag- and adapter-jointed-primers</h3>
<div id="Ch1.S1.p1" class="ltx_para">
<p class="ltx_p">The easiest way to add a tag to an amplicon is PCR using a tag-jointed primer.
This method requires a set of tag-jointed primers.
In addition, library preparation kits for next-generation sequencers usually presume that the adapter sequences specified by the manufacturer are added to both ends of the target sequences.
Thus, the following tag- and adapter-jointed primer is used for PCR.</p>
</div>
<div id="Ch1.S1.p2" class="ltx_para">
<p class="ltx_p ltx_align_left ltx_framed_left" style="border-color: #000000;"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">5’ ― [adapter] ― [tag] ― [specific primer] ― 3’</span></p>
</div>
<div id="Ch1.S1.p3" class="ltx_para">
<p class="ltx_p">If this kind of primer is used for both the forward and reverse primers, the following amplicon sequences will be constructed.</p>
</div>
<div id="Ch1.S1.p4" class="ltx_para">
<p class="ltx_p ltx_align_left ltx_framed_left" style="border-color: #000000;"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">5’ ― [adapter-F] ― [tag-F] ― [specific primer-F] ― [target sequence] ― [specific primer-R (reverse complement)] ― [tag-R (reverse complement)] ― [adapter-R (reverse complement)] ― 3’</span></p>
</div>
<div id="Ch1.S1.p5" class="ltx_para">
<p class="ltx_p">In the case of single-end reads, tag-F precedes specific primer-F and the target sequence in the sequence data.</p>
</div>
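<div class="ltx_para">
<p class="ltx_p">The amplicon layout above can be reproduced with a short Python sketch.
All sequences below are short placeholders rather than real adapters, tags or primers, and the function names are hypothetical; degenerate bases are not handled.</p>
</div>

```python
# Complement table for unambiguous bases only (a simplifying assumption).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def fusion_primer(adapter, tag, specific_primer):
    """5' - [adapter] - [tag] - [specific primer] - 3'"""
    return adapter + tag + specific_primer

def expected_amplicon(adapter_f, tag_f, primer_f, target,
                      primer_r, tag_r, adapter_r):
    """Forward-strand amplicon produced by a pair of fusion primers."""
    forward = fusion_primer(adapter_f, tag_f, primer_f)
    reverse = fusion_primer(adapter_r, tag_r, primer_r)
    # The reverse fusion primer appears as its reverse complement at the 3' end,
    # i.e. [primer-R (rc)] - [tag-R (rc)] - [adapter-R (rc)].
    return forward + target + reverse_complement(reverse)

amplicon = expected_amplicon("AAAA", "CC", "GGT", "TTTT", "ACG", "GA", "TTCC")
print(amplicon)
```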
<div id="Ch1.S1.p6" class="ltx_para">
<p class="ltx_p">The supplement of <cite class="ltx_cite ltx_citemacro_citet">Hamady <span class="ltx_text ltx_font_italic">et al.</span> (<a href="#bib.bib7" title="" class="ltx_ref">2008</a>)</cite> may be useful for picking tag sequences.
In the case of single-end sequencing, the 3’-side tag is not required, and a tagless primer can be used for PCR.
In the case of paired-end sequencing, a single index (tag) can be applied, but a dual index (tag) is recommended because it allows detection of unlikely tag combinations, which indicate that forward and reverse sequences are mispaired.</p>
</div>
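<div class="ltx_para">
<p class="ltx_p">Two simple checks related to tag design can be sketched in Python.
The first computes the smallest pairwise Hamming distance within a candidate tag set, which is a generic sanity check and not the procedure of any particular paper; the second lists observed forward/reverse tag pairs that were never used in the experiment.
The function names, tag sequences and tag names are hypothetical.</p>
</div>

```python
from itertools import combinations

def hamming(a, b):
    """Number of differing positions between two equal-length tags."""
    if len(a) != len(b):
        raise ValueError("tags must have equal length")
    return sum(x != y for x, y in zip(a, b))

def min_tag_distance(tags):
    """Smallest pairwise Hamming distance in a set of candidate tags."""
    return min(hamming(a, b) for a, b in combinations(tags, 2))

def unexpected_pairs(observed_pairs, expected_pairs):
    """Observed (tag-F, tag-R) pairs that were not used in the experiment."""
    expected = set(expected_pairs)
    return [pair for pair in observed_pairs if pair not in expected]

candidate_tags = ["ACGTACGT", "ACGTTGCA", "TGCAACGT"]
print(min_tag_distance(candidate_tags))

used = [("F1", "R1"), ("F2", "R2")]
seen = [("F1", "R1"), ("F1", "R2")]
print(unexpected_pairs(seen, used))
```

<div class="ltx_para">
<p class="ltx_p">A larger minimum distance means that more sequencing errors are needed to turn one tag into another, and an unexpected pair suggests mispaired forward and reverse sequences.</p>
</div>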
<div id="Ch1.S1.p7" class="ltx_para">
<p class="ltx_p">When the above primer sets are used for PCR, the primers anneal to the templates in a “Y” formation, and amplicon sequences that have tags and adapters at both ends are constructed.
Then, the amplicon solutions are mixed at equal concentrations and sequenced according to the manufacturer’s protocol.
A spectrophotometer (including Nanodrop) is inappropriate for measuring the concentration of the solutions because spectrophotometric measurement of dsDNA is likely to be affected by contaminants.
I recommend Qubit (ThermoFisher) for measuring dsDNA concentration.
A quantitative PCR-based method can also be recommended, but it is more expensive and time-consuming.</p>
</div>
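<div class="ltx_para">
<p class="ltx_p">The pooling step can be sketched as simple arithmetic in Python.
The sketch assumes that the dsDNA concentrations (ng/µL) were measured with a fluorometer and that all amplicons have similar lengths, so that equal mass approximates equal molarity; the function name and the target mass of 50 ng per sample are hypothetical.</p>
</div>

```python
def pooling_volumes(concentrations_ng_per_ul, target_ng=50.0):
    """Volume (µL) of each amplicon solution that contributes the same DNA mass."""
    volumes = {}
    for sample, conc in concentrations_ng_per_ul.items():
        if conc <= 0:
            raise ValueError(f"non-positive concentration for {sample}")
        # volume = mass / concentration
        volumes[sample] = target_ng / conc
    return volumes

measured = {"sample1": 10.0, "sample2": 20.0, "sample3": 5.0}
volumes = pooling_volumes(measured, target_ng=50.0)
for sample, volume in volumes.items():
    print(f"{sample}: {volume:.1f} uL")
```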
<div id="Ch1.S1.p8" class="ltx_para">
<p class="ltx_p">The sequence at the primer annealing position can also be used to recognize the sample of origin.
Therefore, the sequences of multiple loci, for example plant <span class="ltx_text ltx_font_italic">rbcL</span> and <span class="ltx_text ltx_font_italic">matK</span>, from the same sample set tagged with the same tag set can be multiplexed and sequenced together.
Of course, the sequences of multiple loci can also be distinguished by the target sequences themselves.
A smaller number of cycles and a longer extension time are recommended for PCR.
Because the amount of DNA required for sequencing sample preparation is not so high, a large number of PCR cycles is not needed.
A larger number of cycles and a shorter extension time generate more incompletely extended amplicons, and these incompletely extended amplicons are re-extended on different template sequences in the next cycle.
Such sequences are called “chimeric DNA”.
Chimeric DNA causes false discovery of non-existent novel species or overestimation of species diversity.
To reduce chimeric DNA construction, using a high-fidelity DNA polymerase such as Phusion (Finnzymes) or KOD (TOYOBO) is effective.
<cite class="ltx_cite ltx_citemacro_citet">Stevens <span class="ltx_text ltx_font_italic">et al.</span> (<a href="#bib.bib17" title="" class="ltx_ref">2013</a>)</cite> reported that slowing the cooling from denaturation temperature to annealing temperature reduced chimeric DNA construction.
If your thermal cycler allows the cooling speed to be changed, slowing the cooling from denaturation temperature to annealing temperature can be recommended.
Chimeric DNA sequences can also be eliminated by computer programs after sequencing.
However, because chimera removal by programs is incomplete and also reduces the number of nonchimeric sequences, the best approach is to reduce chimeric DNA construction in the first place.</p>
</div>
<div id="Ch1.S1.p9" class="ltx_para">
<p class="ltx_p">For templates that are difficult to amplify, using Ampdirect Plus (Shimadzu) as the PCR buffer, or crushing the sample with a homogenizer or beads before DNA extraction, is recommended.
Deep freezing before crushing can also be recommended.
Removal of polyphenols or polysaccharides might be required if your samples contain those chemicals.
If PCR amplification using tag- and adapter-jointed primers fails, try two-step PCR that consists of primary PCR (20–30 cycles) using primers without tags and adapters, purification of the amplicons with ExoSAP-IT, and secondary PCR (5–10 cycles) using the amplicons of the primary PCR as templates and tag- and adapter-jointed primers.</p>
</div>
<section id="Ch1.S1.SS1" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">1.1.1 </span>Decreasing costs by interim adapters</h4>
<div id="Ch1.S1.SS1.p1" class="ltx_para">
<p class="ltx_p">Tag- and adapter-jointed primers are very long and expensive.
In addition, we need to buy tag- and adapter-jointed primers for each locus.
To reduce the cost of tag- and adapter-jointed primers, interim adapter-jointed primers and two-step PCR are useful.
The following primer set is used in the primary PCR.</p>
</div>
<div id="Ch1.S1.SS1.p2" class="ltx_para">
<p class="ltx_p ltx_align_left ltx_framed_left" style="border-color: #000000;"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">5’ ― [interim adapter] ― [specific primer] ― 3’</span></p>
</div>
<div id="Ch1.S1.SS1.p3" class="ltx_para">
<p class="ltx_p">This PCR product has interim adapter sequences at both ends.
After purification, this PCR product is used as the template in the secondary PCR.
The following primer set is used in the secondary PCR.</p>
</div>
<div id="Ch1.S1.SS1.p4" class="ltx_para">
<p class="ltx_p ltx_align_left ltx_framed_left" style="border-color: #000000;"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">5’ ― [adapter specified by manufacturer] ― [tag] ― [interim adapter] ― 3’</span></p>
</div>
<div id="Ch1.S1.SS1.p5" class="ltx_para">
<p class="ltx_p">This two-step PCR enables us to reuse the secondary PCR primers across loci.
However, this two-step PCR may increase PCR errors and PCR amplification biases, and decrease the length of target sequence that can be read.
Note that the final PCR product has the following structure.</p>
</div>
<div id="Ch1.S1.SS1.p6" class="ltx_para">
<p class="ltx_p ltx_align_left ltx_framed_left" style="border-color: #000000;"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">5’ ― [adapter-F specified by manufacturer] ― [tag-F] ― [interim adapter-F] ― [specific primer-F] ― [target sequence] ― [specific primer-R (reverse complement)] ― [interim adapter-R (reverse complement)] ― [tag-R (reverse complement)] ― [adapter-R (reverse complement) specified by manufacturer] ― 3’</span></p>
</div>
<div id="Ch1.S1.SS1.p7" class="ltx_para">
<p class="ltx_p">Illumina’s multiplex sequencing method <cite class="ltx_cite ltx_citemacro_citep">(Illumina corporation, <a href="#bib.bib9" title="" class="ltx_ref">2013</a>)</cite> using the Nextera XT Index Kit is the same as the above method.
In dual-index paired-end sequencing based on this method, the first read starts just behind interim adapter-F (i.e. at the head of specific primer-F) and extends into the target sequence.
The second read starts just behind interim adapter-R and contains the tag-R (index1) sequence.
The third read starts just behind adapter-F and contains the tag-F (index2) sequence.
The last read starts just ahead of interim adapter-R (i.e. at the tail of specific primer-R) and extends into the target sequence.
The first, second, third and last reads are saved as <span class="ltx_text ltx_font_typewriter">*_R1_*.fastq.gz</span>, <span class="ltx_text ltx_font_typewriter">*_R2_*.fastq.gz</span>, <span class="ltx_text ltx_font_typewriter">*_R3_*.fastq.gz</span> and <span class="ltx_text ltx_font_typewriter">*_R4_*.fastq.gz</span>, respectively.
The first, second and third reads are on the same strand, but the last read is on the reverse strand.
Because the sequencing primers for the first and the last reads target interim adapter-F and interim adapter-R, respectively, the first and the last reads contain the sequences of specific primer-F and specific primer-R, respectively.
Thus, the portion of the target sequence contained in the first and the last reads is shortened.
If the length of the target sequence is 500 bp or longer, there might be no overlap, and the paired-end reads cannot be concatenated.
If specific primer-F and specific primer-R are used as the sequencing primers for the first and the last reads, you can exclude the sequences of specific primer-F and specific primer-R from the first and the last reads.
However, the following quality improvement method by insertion of N cannot be applied in such a case.</p>
</div>
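<div class="ltx_para">
<p class="ltx_p">Whether the first and the last reads can overlap is a matter of simple arithmetic, sketched below in Python.
The sketch assumes that both reads have the same length, that each read starts at the head of its specific primer, and that no bases are lost to quality trimming; the read length of 250 bp and the primer lengths of 20 bp are illustrative assumptions only.</p>
</div>

```python
def paired_end_overlap(read_length, target_length,
                       primer_f_length, primer_r_length):
    """Expected overlap (bp) between the forward and reverse reads.

    A value of zero or less means the reads do not overlap.
    """
    # The sequenced insert spans primer-F, the target and primer-R.
    insert_length = primer_f_length + target_length + primer_r_length
    return 2 * read_length - insert_length

# Illustrative numbers: 250 bp reads and 20 bp primers.
long_target = paired_end_overlap(250, 500, 20, 20)
short_target = paired_end_overlap(250, 400, 20, 20)
print(long_target, short_target)
```

<div class="ltx_para">
<p class="ltx_p">Under these assumptions a 500 bp target leaves a negative value (no overlap), whereas a 400 bp target leaves 60 bp of overlap before any quality trimming.</p>
</div>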
</section>
<section id="Ch1.S1.SS2" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">1.1.2 </span>Quality improvement by insertion of N</h4>
<div id="Ch1.S1.SS2.p1" class="ltx_para">
<p class="ltx_p">On the Illumina platform, luminescence from DNA synthesis on a flowcell is detected by an optical sensor.
PCR amplicons used in metabarcoding come from a single locus and are much more homogeneous than genome shotgun or RNA-seq library sequences.
In such a case, neighboring sequences on a flowcell are difficult to distinguish from one another.
In addition, if most sequences have the same nucleotide at a given position (especially within the first 12 nucleotides) and that nucleotide gives no luminescence, the Illumina platform sequencer will judge the run to be a failure and abort.
To avoid this problem, insertion of <span class="ltx_text ltx_font_typewriter">NNNNNN</span> between the specific primer and the interim adapter is effective.
<span class="ltx_text ltx_font_typewriter">NNNNNN</span> at the head of the sequences enables the sequencer to distinguish neighboring sequences and prevents such blackouts, and the sequencing quality is therefore improved <cite class="ltx_cite ltx_citemacro_citep">(Nelson <span class="ltx_text ltx_font_italic">et al.</span>, <a href="#bib.bib15" title="" class="ltx_ref">2014</a>)</cite>.
Varying the length of <span class="ltx_text ltx_font_typewriter">NNNNNN</span> causes an artificial frameshift and is also effective <cite class="ltx_cite ltx_citemacro_citep">(Fadrosh <span class="ltx_text ltx_font_italic">et al.</span>, <a href="#bib.bib6" title="" class="ltx_ref">2014</a>)</cite>.
By using the above methods, the proportion of PhiX control can be reduced, and the number of sequences from your own samples increases accordingly.</p>
</div>
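<div class="ltx_para">
<p class="ltx_p">The primer layouts with N spacers can be written out with a small Python sketch.
The interim adapter and specific primer below are placeholders, the spacer lengths are arbitrary examples, and the function name is hypothetical.</p>
</div>

```python
def spacer_primers(interim_adapter, specific_primer, spacer_lengths):
    """Primary PCR primers laid out as [interim adapter][N spacer][specific primer]."""
    primers = []
    for length in spacer_lengths:
        if length < 0:
            raise ValueError("spacer length must not be negative")
        # 'N' denotes an equimolar mixture of A, C, G and T at synthesis.
        primers.append(interim_adapter + "N" * length + specific_primer)
    return primers

# A fixed spacer of six Ns.
fixed = spacer_primers("AAAA", "CCCC", [6])
# Spacers of varied length, which shift the reading frame among molecules.
varied = spacer_primers("AAAA", "CCCC", [0, 2, 4, 6])
print(fixed)
print(varied)
```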
</section>
</section>
</section>
<section id="Ch2" class="ltx_chapter">
<h2 class="ltx_title ltx_title_chapter">
<span class="ltx_tag ltx_tag_chapter">Chapter 2 </span>Preprocessing of nucleotide sequence data</h2>
<div id="Ch2.p1" class="ltx_para">
<p class="ltx_p">Roche GS series sequencers and Ion PGM output raw sequencing data as <span class="ltx_text ltx_font_typewriter">*.sff</span> files.
Illumina platform sequencers output <span class="ltx_text ltx_font_typewriter">*.fastq</span> files.
This chapter explains the procedures of demultiplexing, quality-trimming and quality-filtering.
The <span class="ltx_text ltx_font_typewriter">clsplitseq</span> command of Claident is recommended for demultiplexing because the programs provided by manufacturers ignore the quality of the tag positions.
The following commands should be executed in a terminal or console.
Fundamental knowledge of terminal operations is required.
If you are unfamiliar with terminal operations, you need to understand the contents of appendix <a href="#A2" title="Appendix B Terminal command examples ‣ Metabarcoding and DNA barcoding for Ecologists: Sequence analysis" class="ltx_ref"><span class="ltx_text ltx_ref_tag">B</span></a> first.</p>
</div>
<section id="Ch2.S1" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2.1 </span>Importing sequence data deposited to SRA/DRA/ERA or demultiplexed FASTQ</h3>
<div id="Ch2.S1.p1" class="ltx_para">
<p class="ltx_p">Claident assumes the format <span class="ltx_text ltx_font_typewriter">SequenceID__RunID__TagID__PrimerID</span> for definition lines of sequences, and <span class="ltx_text ltx_font_typewriter">RunID__TagID__PrimerID</span> for file names (without extension).
Therefore, sequence data deposited to SRA/DRA/ERA, or already demultiplexed FASTQ files, cannot be used as is.
The <span class="ltx_text ltx_font_typewriter">climportfastq</span> command of Claident can convert such data.
If your data are paired-end, you need to concatenate and filter the sequences before conversion (see section <a href="#Ch2.S3.SS3" title="2.3.3 Concatenating forward and reverse sequences ‣ 2.3 For Illumina platform sequences ‣ Chapter 2 Preprocessing of nucleotide sequence data ‣ Metabarcoding and DNA barcoding for Ecologists: Sequence analysis" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2.3.3</span></a>).
The following plain text file is required for conversion.</p>
</div>
<div id="Ch2.S1.p2" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| SequenceFileName1 RunID__TagID__PrimerID</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| SequenceFileName2 RunID__TagID__PrimerID</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| SequenceFileName3 RunID__TagID__PrimerID</span></span>
</p>
</div>
<div id="Ch2.S1.p3" class="ltx_para">
<p class="ltx_p">A dummy RunID and PrimerID are acceptable.
PrimerID needs to be the same among samples amplified with the same primer set.
TagID needs to be different among different sample files.
TagID can be the same as the sequence file name.</p>
</div>
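<div class="ltx_para">
<p class="ltx_p">When many files must be listed, the text file can be generated by a script.
The following Python sketch builds one line per FASTQ file using the file name (up to the first dot) as TagID.
The single-space column separator, the dummy IDs and the function name are assumptions made for illustration; please check them against the format expected by your version of Claident.</p>
</div>

```python
from pathlib import Path

def import_list_lines(fastq_names, run_id="dummyrun", primer_id="dummyprimer"):
    """Build 'SequenceFileName RunID__TagID__PrimerID' lines, one per file."""
    lines = []
    seen_tags = set()
    for name in fastq_names:
        # Use the part of the file name before the first dot as TagID.
        tag_id = Path(name).name.split(".")[0]
        if "__" in tag_id:
            raise ValueError(f"TagID must not contain '__': {tag_id}")
        if tag_id in seen_tags:
            raise ValueError(f"TagID is not unique: {tag_id}")
        seen_tags.add(tag_id)
        lines.append(f"{name} {run_id}__{tag_id}__{primer_id}")
    return lines

lines = import_list_lines(["sampleA.fastq.gz", "sampleB.fastq.gz"])
print("\n".join(lines))
```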
<div id="Ch2.S1.p4" class="ltx_para">
<p class="ltx_p">After the above file has been prepared, execute <span class="ltx_text ltx_font_typewriter">climportfastq</span> as follows, giving the above file as the input file.</p>
</div>
<div id="Ch2.S1.p5" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> climportfastq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch2.S1.p6" class="ltx_para">
<p class="ltx_p">The converted files are then saved in the output folder.
If your sequence data are single-end, the quality filtering explained in section <a href="#Ch2.S2.SS3" title="2.2.3 Trimming low quality tail and filtering low quality sequences ‣ 2.2 For Roche GS series sequencers and Ion PGM ‣ Chapter 2 Preprocessing of nucleotide sequence data ‣ Metabarcoding and DNA barcoding for Ecologists: Sequence analysis" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2.2.3</span></a> is recommended.</p>
</div>
</section>
<section id="Ch2.S2" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2.2 </span>For Roche GS series sequencers and Ion PGM</h3>
<section id="Ch2.S2.SS1" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.2.1 </span>Converting SFF to FASTQ</h4>
<div id="Ch2.S2.SS1.p1" class="ltx_para">
<p class="ltx_p">First of all, the raw SFF format file needs to be converted to a FASTQ file as follows.</p>
</div>
<div id="Ch2.S2.SS1.p2" class="ltx_para">
<p class="ltx_p ltx_align_left ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> sff_extract -c inputfile(SFF)↓</span></p>
</div>
<div id="Ch2.S2.SS1.p3" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">-c</span> argument enables trimming of <span class="ltx_text ltx_font_typewriter">TCAG</span> at the head of sequences.
If you add <span class="ltx_text ltx_font_typewriter">TCAG</span> to the head of the tag sequences, do not use this argument.
Assuming your SFF file name is <span class="ltx_text ltx_font_typewriter">HOGEHOGE.sff</span>, the FASTQ file will be saved as <span class="ltx_text ltx_font_typewriter">HOGEHOGE.fastq</span>.
<span class="ltx_text ltx_font_typewriter">HOGEHOGE.xml</span> will also be generated, but this file is not required.
The output sequences have the tag sequence at the beginning, followed by primer-F and the target sequence, and primer-R (reverse complement) at the end.
Note that not all sequences are read completely from the beginning to the end; incomplete sequences are included.
The <span class="ltx_text ltx_font_typewriter">sff_extract</span> command is used in this book, but any other program that can clip <span class="ltx_text ltx_font_typewriter">TCAG</span> at the beginning can be used.
If your SFF to FASTQ converter cannot clip <span class="ltx_text ltx_font_typewriter">TCAG</span> at the beginning, adding <span class="ltx_text ltx_font_typewriter">TCAG</span> to the beginning of the tag sequences given to <span class="ltx_text ltx_font_typewriter">clsplitseq</span> also works well, but note that the quality values of these positions will then also be strictly checked.</p>
</div>
</section>
<section id="Ch2.S2.SS2" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.2.2 </span>Demultiplexing of sequences</h4>
<div id="Ch2.S2.SS2.p1" class="ltx_para">
<p class="ltx_p">A FASTQ file that contains the sequences from multiple samples needs to be demultiplexed based on tag sequences and primer sequences before subsequent analysis.
To do this, a FASTA file that contains the tag sequences and another FASTA file that contains the primer-F sequences are required.</p>
</div>
<div id="Ch2.S2.SS2.p2" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| >TagID</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| [tag sequence]</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| >examplesample1</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| ACGTACGT</span></span>
</p>
</div>
<div id="Ch2.S2.SS2.p3" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| >PrimerID</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| [primer sequence]</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| >exampleprimer1</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| ACGTACGTACGTACGTACGT</span></span>
</p>
</div>
<div id="Ch2.S2.SS2.p4" class="ltx_para">
<p class="ltx_p">Degenerate nucleotide codes are not allowed in tag sequences, but they are allowed in primer sequences.
Both the tag and primer FASTA files can contain multiple sequences.
If you use the interim adapter explained in section <a href="#Ch1.S1.SS1" title="1.1.1 Decreasing costs by interim adapters ‣ 1.1 PCR using tag- and adapter-jointed-primers ‣ Chapter 1 Sequencing of multiple samples by next-generation sequencers ‣ Metabarcoding and DNA barcoding for Ecologists: Sequence analysis" class="ltx_ref"><span class="ltx_text ltx_ref_tag">1.1.1</span></a>, primer sequences should be written as follows.</p>
</div>
<div id="Ch2.S2.SS2.p5" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| >PrimerID</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| [interim adapter][primer sequence]</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| >exampleprimer1</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| TGATACTCGATACGTACGTACGTACGTACGTACGT</span></span>
</p>
</div>
<div id="Ch2.S2.SS2.p6" class="ltx_para">
<p class="ltx_p">In other words, the whole sequence located between the tag and the target sequence should be written in the primer FASTA file.</p>
</div>
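<div class="ltx_para">
<p class="ltx_p">The tag and primer FASTA files can also be prepared and checked by a script.
The following Python sketch rejects tags that contain degenerate codes and prepends an interim adapter to a primer.
The function names are hypothetical, and the split of the example sequence into a 15-base interim adapter and a 20-base primer is an assumption made only for illustration.</p>
</div>

```python
def fasta_text(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)

def check_tags(tags):
    """Raise ValueError if any tag contains a character other than A, C, G or T."""
    for name, seq in tags:
        invalid = set(seq.upper()) - set("ACGT")
        if invalid:
            raise ValueError(f"tag {name} contains {sorted(invalid)}")

def primer_record(primer_id, primer, interim_adapter=""):
    """Everything between the tag and the target goes into the primer FASTA file."""
    return (primer_id, interim_adapter + primer)

tags = [("examplesample1", "ACGTACGT")]
check_tags(tags)
tag_fasta = fasta_text(tags)

primer = primer_record("exampleprimer1", "ACGTACGTACGTACGTACGT",
                       interim_adapter="TGATACTCGATACGT")
primer_fasta = fasta_text([primer])
print(tag_fasta + primer_fasta, end="")
```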
<div id="Ch2.S2.SS2.p7" class="ltx_para">
<p class="ltx_p">Once all the above files have been prepared, the following command demultiplexes the nucleotide sequences into one FASTQ file per sample.</p>
</div>
<div id="Ch2.S2.SS2.p8" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> clsplitseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--runname=RunID \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--tagfile=TagSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--primerfile=PrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqualtag=27 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch2.S2.SS2.p9" class="ltx_para">
<p class="ltx_p">RunID must differ among different sequencing runs.
In many cases a RunID is assigned by the sequencer, and you can use such a sequencer-generated RunID.
The RunID is usually contained in the sequence file name or in the sequence names within the file, but the naming rules differ among sequencing platforms.
Therefore, <span class="ltx_text ltx_font_typewriter">clsplitseq</span> requires a RunID given by the user.
<span class="ltx_text ltx_font_typewriter">--minqualtag</span> is an argument that specifies the minimum quality threshold for the tag position.
If the tag position of a sequence contains one or more nucleotides whose quality is lower than this threshold, that sequence is omitted from the output.
A minimum quality threshold of 27 was proposed by <cite class="ltx_cite ltx_citemacro_citet">Kunin <span class="ltx_text ltx_font_italic">et al.</span> (<a href="#bib.bib11" title="" class="ltx_ref">2010</a>)</cite> for 3’-tail trimming of the sequences of Roche GS series sequencers.
A different value might be more suitable for other sequencers.
In many cases, 30 is used as the minimum quality threshold and can be recommended.</p>
</div>
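<div class="ltx_para">
<p class="ltx_p">The idea behind the tag quality threshold can be illustrated with a minimal Python sketch.
It assumes that quality values are encoded as Phred+33 characters and that the tag occupies the first positions of the read; the function name is hypothetical, and this is not the implementation used by <span class="ltx_text ltx_font_typewriter">clsplitseq</span>.</p>
</div>

```python
def tag_quality_ok(quality_line, tag_length, min_qual=27, offset=33):
    """True if every base in the tag position has quality of min_qual or higher."""
    if tag_length <= 0:
        raise ValueError("tag_length must be positive")
    # Decode the quality characters that cover the tag position.
    tag_quals = [ord(ch) - offset for ch in quality_line[:tag_length]]
    if len(tag_quals) != tag_length:
        return False  # the read is shorter than the tag
    return min(tag_quals) >= min_qual

# 'I' encodes quality 40, '5' encodes 20 and '<' encodes 27 in Phred+33.
print(tag_quality_ok("IIIIIIIIFFFF", 8))
print(tag_quality_ok("IIII5IIIFFFF", 8))
print(tag_quality_ok("<<<<<<<<FFFF", 8))
```

<div class="ltx_para">
<p class="ltx_p">A single low-quality base within the tag position is enough to reject the read, whereas low-quality bases after the tag do not affect this particular check.</p>
</div>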
<div id="Ch2.S2.SS2.p10" class="ltx_para">
<p class="ltx_p">If the multiplex sequencing technique is not used, the <span class="ltx_text ltx_font_typewriter">--tagfile</span> argument can be omitted.
However, simply omitting <span class="ltx_text ltx_font_typewriter">--tagfile</span> generates FASTQ files that are incompatible with Claident.
In such a case, you should give an identifier for the tag (a dummy is acceptable) using the <span class="ltx_text ltx_font_typewriter">--indexname=TagID</span> argument.</p>
</div>
<div id="Ch2.S2.SS2.p11" class="ltx_para">
<p class="ltx_p">The tag and primer positions are trimmed from the output sequences.
The tag position is evaluated by strict, exact matching.
There is no argument to tolerate mismatches in the tag.
The primer position is aligned based on the Needleman-Wunsch algorithm and evaluated allowing 14% mismatches (the threshold can be changed).
The output files are named <span class="ltx_text ltx_font_typewriter">RunID__TagID__PrimerID.fastq.gz</span> and saved in the output folder.
<span class="ltx_text ltx_font_typewriter">clsplitseq</span> can use multiple CPUs for faster processing.
If your computer has 4 CPU cores, 4 should be specified for the <span class="ltx_text ltx_font_typewriter">--numthreads</span> argument.
Note that the operating system and/or the writing speed of storage devices might limit the processing speed.
By default, the output files are compressed by GZIP.
Therefore, decompression is required before the files can be read by programs that do not support gzipped FASTQ files.
The Claident commands used below can handle gzipped FASTQ files.</p>
</div>
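<div class="ltx_para">
<p class="ltx_p">The matching rules described above (exact tag match, primer match tolerating 14% mismatches, then trimming of both) can be sketched in Python.
This is only an illustration: it uses a small dynamic-programming alignment of the primer against the start of the read instead of the actual alignment code of <span class="ltx_text ltx_font_typewriter">clsplitseq</span>, and the function names, tag, primer and target sequences are hypothetical examples.</p>
</div>

```python
# IUPAC degenerate codes that may appear in primers.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def primer_prefix_alignment(primer, read):
    """Edit distance between the whole primer and the best-matching read prefix.

    Returns (distance, prefix_length).
    """
    prev = list(range(len(read) + 1))  # empty primer versus read[:j]
    for i, p in enumerate(primer, start=1):
        cur = [i]
        for j, r in enumerate(read, start=1):
            cost = 0 if r in IUPAC.get(p, p) else 1
            cur.append(min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    # Prefer the smallest distance, then the prefix closest to the primer length.
    end = min(range(len(prev)), key=lambda j: (prev[j], abs(j - len(primer))))
    return prev[end], end

def strip_tag_and_primer(read, tag, primer, max_mismatch=0.14):
    """Return the read without tag and primer, or None if either does not match."""
    if not read.startswith(tag):  # the tag must match exactly
        return None
    rest = read[len(tag):]
    distance, end = primer_prefix_alignment(primer, rest[: 2 * len(primer)])
    if distance / len(primer) > max_mismatch:
        return None
    return rest[end:]

tag, primer, target = "ACGTACGT", "GTGCCAGCMGCCGCGGTAA", "ACTGACTGAC"
exact = strip_tag_and_primer(tag + "GTGCCAGCAGCCGCGGTAA" + target, tag, primer)
one_off = strip_tag_and_primer(tag + "ATGCCAGCAGCCGCGGTAA" + target, tag, primer)
bad_tag = strip_tag_and_primer("ACGTACGA" + "GTGCCAGCAGCCGCGGTAA" + target, tag, primer)
print(exact, one_off, bad_tag)
```

<div class="ltx_para">
<p class="ltx_p">In this example, one mismatch in a 19-base primer (about 5%) is tolerated, whereas a single mismatch in the tag causes the read to be discarded.</p>
</div>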
<div id="Ch2.S2.SS2.p12" class="ltx_para">
<p class="ltx_p">Before submitting a manuscript, the sequence data need to be deposited in a public database such as the DDBJ Sequence Read Archive (DRA).
The gzipped FASTQ files produced in this step can be used for the data deposition.</p>
</div>
<section id="Ch2.S2.SS2.SSSx1" class="ltx_subsubsection">
<h5 class="ltx_title ltx_title_subsubsection">If you sequenced a number of samples by multiple sequencing runs</h5>
<div id="Ch2.S2.SS2.SSSx1.p1" class="ltx_para">
<p class="ltx_p">In this case, <span class="ltx_text ltx_font_typewriter">clsplitseq</span> must be run multiple times.
However, <span class="ltx_text ltx_font_typewriter">clsplitseq</span> cannot write to an already existing folder by default.
The second and subsequent runs of <span class="ltx_text ltx_font_typewriter">clsplitseq</span> require the <span class="ltx_text ltx_font_typewriter">--append</span> argument as shown below.</p>
</div>
<div id="Ch2.S2.SS2.SSSx1.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> clsplitseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--runname=RunID \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--tagfile=TagSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--primerfile=PrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqualtag=27 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> clsplitseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--runname=RunID \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--tagfile=TagSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--primerfile=PrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqualtag=27 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--append \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile2 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
</section>
<section id="Ch2.S2.SS2.SSSx2" class="ltx_subsubsection">
<h5 class="ltx_title ltx_title_subsubsection">If your tag sequence lengths are unequal</h5>
<div id="Ch2.S2.SS2.SSSx2.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_typewriter">clsplitseq</span> assumes that all tag sequences have the same length for faster processing.
Tags of unequal length must be split into multiple tag sequence files, one per length, and <span class="ltx_text ltx_font_typewriter">clsplitseq</span> must be run once for each file as follows.</p>
</div>
<div id="Ch2.S2.SS2.SSSx2.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> clsplitseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--runname=RunID \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--tagfile=TagSequenceFile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--primerfile=PrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqualtag=27 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> clsplitseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--runname=RunID \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--tagfile=TagSequenceFile2 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--primerfile=PrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqualtag=27 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--append \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
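<div class="ltx_para">
<p class="ltx_p">The splitting of a tag sequence file by tag length can be sketched as below. This is only an illustration: the function name <span class="ltx_text ltx_font_typewriter">split_tags_by_length</span> and the output naming scheme are hypothetical and not part of Claident, and each FASTA record is assumed to be one header line followed by one sequence line.</p>
</div>

```shell
# Hypothetical helper (not part of Claident): split a tag FASTA file into
# one file per tag length, named PREFIX.lenN.fasta.
# Assumes each record is a header line followed by a single sequence line.
split_tags_by_length() {
    input=$1
    prefix=$2
    awk -v prefix="$prefix" '
        /^>/ { header = $0; next }      # remember the header line
        NF   {                          # non-empty sequence line
            out = prefix ".len" length($0) ".fasta"
            print header >> out         # append, so remove old outputs first
            print $0 >> out
        }
    ' "$input"
}
```

<div class="ltx_para">
<p class="ltx_p">Each resulting file can then be given to <span class="ltx_text ltx_font_typewriter">--tagfile</span> in a separate run, with <span class="ltx_text ltx_font_typewriter">--append</span> added from the second run onward as in the commands above.</p>
</div>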
</section>
<section id="Ch2.S2.SS2.SSSx3" class="ltx_subsubsection">
<h5 class="ltx_title ltx_title_subsubsection">Recognition and elimination of reverse primer positions</h5>
<div id="Ch2.S2.SS2.SSSx3.p1" class="ltx_para">
<p class="ltx_p">The above procedure does not remove the reverse primer position or the sequence that follows it.
Both are artificial and should be removed if possible.
To do so, a reverse primer sequence file like the following is required.</p>
</div>
<div id="Ch2.S2.SS2.SSSx3.p2" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| >PrimerID</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| [primer sequence]</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| >exampleprimer1</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| TCAGTCAGTCAGTCAGTCAG</span></span>
</p>
</div>
<div id="Ch2.S2.SS2.SSSx3.p3" class="ltx_para">
<p class="ltx_p">Multiple reverse primers can be written in this file.
Note that the N-th reverse primer sequence is assumed to be paired with the N-th forward primer sequence.
Therefore, an error occurs if the forward and reverse primer sequence files contain different numbers of sequences.
If some samples share the same forward (or reverse) primer but differ in the other primer, each combination of forward and reverse primers must be written as a separate primer entry in both files.</p>
</div>
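<div class="ltx_para">
<p class="ltx_p">Because a count mismatch between the two files causes an error, it can be convenient to check the files before running <span class="ltx_text ltx_font_typewriter">clsplitseq</span>. The following is a minimal sketch of such a check; the function names are hypothetical and not part of Claident.</p>
</div>

```shell
# Hypothetical pre-flight check (not part of Claident): confirm that the
# forward and reverse primer FASTA files hold the same number of records.
count_fasta_records() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    grep -c '^>' "$1" || true
}

check_primer_pairing() {
    n_forward=$(count_fasta_records "$1")
    n_reverse=$(count_fasta_records "$2")
    if [ "$n_forward" -ne "$n_reverse" ]; then
        echo "mismatch: $n_forward forward vs $n_reverse reverse primers" >&2
        return 1
    fi
    echo "ok: $n_forward primer pairs"
}
```

<div class="ltx_para">
<p class="ltx_p">The check only compares record counts. Whether the N-th entries of the two files really belong together still has to be verified by eye.</p>
</div>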
<div id="Ch2.S2.SS2.SSSx3.p4" class="ltx_para">
<p class="ltx_p">After preparing the above file, run <span class="ltx_text ltx_font_typewriter">clsplitseq</span> as follows.</p>
</div>
<div id="Ch2.S2.SS2.SSSx3.p5" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> clsplitseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--runname=RunID \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--tagfile=TagSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--primerfile=ForwardPrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--reverseprimerfile=ReversePrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--reversecomplement \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqualtag=27 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch2.S2.SS2.SSSx3.p6" class="ltx_para">
<p class="ltx_p">In this run, the reverse-complement sequence of the reverse primer is searched for using the Needleman-Wunsch algorithm, allowing 15% mismatches (this value can be changed), and the reverse primer position and the subsequent sequence are removed in addition to the processing described above.
If the reverse-complement sequence of the reverse primer is not found but the other requirements are fulfilled, the sequence is still saved to the output file by default.
Add the <span class="ltx_text ltx_font_typewriter">--needreverseprimer</span> argument to filter out sequences that do not contain the reverse-complement sequence of the reverse primer.</p>
</div>
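<div class="ltx_para">
<p class="ltx_p">To see which sequence is actually searched for in the reads, the reverse complement of a reverse primer can be computed as sketched below. The function <span class="ltx_text ltx_font_typewriter">revcomp</span> is a hypothetical helper, not a Claident command, and it relies on the common <span class="ltx_text ltx_font_typewriter">rev</span> and <span class="ltx_text ltx_font_typewriter">tr</span> utilities.</p>
</div>

```shell
# Hypothetical helper (not part of Claident): reverse-complement a DNA
# sequence, including IUPAC degenerate codes (S, W and N map to themselves).
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTRYKMBDHVacgtrykmbdhv' 'TGCAYRMKVHDBtgcayrmkvhdb'
}

# The example reverse primer from the file shown above.
revcomp TCAGTCAGTCAGTCAGTCAG
```

<div class="ltx_para">
<p class="ltx_p">For the 20 bp example primer, a 15% mismatch allowance corresponds to about 3 positions. How Claident rounds this value internally is not stated here, so treat the figure as approximate.</p>
</div>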
</section>
</section>
<section id="Ch2.S2.SS3" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.2.3 </span>Trimming low quality tail and filtering low quality sequences</h4>
<div id="Ch2.S2.SS3.p1" class="ltx_para">
<p class="ltx_p">FASTQ sequences carry read quality information.
Based on these quality values, the low-quality 3’-tail can be trimmed and low-quality sequences can be filtered out.
The <span class="ltx_text ltx_font_typewriter">clfilterseq</span> command performs this processing as follows.</p>
</div>
<div id="Ch2.S2.SS3.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> clfilterseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqual=27 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minquallen=3 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minlen=350 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--maxlen=400 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--maxplowqual=0.1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
<div id="Ch2.S2.SS3.p3" class="ltx_para">
<p class="ltx_p">The values of <span class="ltx_text ltx_font_typewriter">--minqual</span> and <span class="ltx_text ltx_font_typewriter">--minquallen</span> specify the minimum read quality value and the size of the sliding window, respectively.
The above command trims positions from the 3’-tail until a 3 bp window in which all 3 positions have a read quality of 27 or higher is found.
In addition, trimmed sequences shorter than <span class="ltx_text ltx_font_typewriter">--minlen</span> are filtered out, and trimmed sequences longer than <span class="ltx_text ltx_font_typewriter">--maxlen</span> are truncated to <span class="ltx_text ltx_font_typewriter">--maxlen</span>.
Remaining sequences in which the proportion of positions with quality lower than <span class="ltx_text ltx_font_typewriter">--minqual</span> is <span class="ltx_text ltx_font_typewriter">--maxplowqual</span> or more are also filtered out.
By default the output is a single file, but it can instead be saved as a file in a new folder by using the <span class="ltx_text ltx_font_typewriter">--output=folder</span> argument.
In that case, the output file name is the same as the input file name.
To save output files to an existing folder, add the <span class="ltx_text ltx_font_typewriter">--append</span> argument.</p>
</div>
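<div class="ltx_para">
<p class="ltx_p">The sliding-window trimming described above can be illustrated for a single read as follows. This is a sketch of the idea only and not Claident's implementation: the function <span class="ltx_text ltx_font_typewriter">trim_3prime_tail</span> is hypothetical, and quality values are assumed to be Phred+33 encoded.</p>
</div>

```shell
# Illustrative sketch only (not Claident's implementation): trim the
# 3-prime tail of one read until a window of consecutive positions that
# all have the minimum quality or higher is found (Phred+33 assumed).
trim_3prime_tail() {
    # $1 = sequence, $2 = quality string, $3 = minimum quality, $4 = window size
    printf '%s\n%s\n' "$1" "$2" |
    awk -v minqual="$3" -v window="$4" '
        BEGIN { for (i = 33; i != 127; i++) ord[sprintf("%c", i)] = i }
        NR == 1 { seq = $0 }
        NR == 2 {
            run = 0; keep = 0
            # Walk from the 3-prime end towards the 5-prime end.
            for (pos = length($0); pos >= 1; pos--) {
                q = ord[substr($0, pos, 1)] - 33
                run = (q >= minqual) ? run + 1 : 0
                if (run == window) { keep = pos + window - 1; break }
            }
            # Keep everything up to the last base of that window.
            print substr(seq, 1, keep)
        }'
}

# "I" encodes quality 40 and "#" encodes quality 2.
trim_3prime_tail ACGTACGT 'IIII#II#' 27 3
```

<div class="ltx_para">
<p class="ltx_p">In this example the two high-quality bases near the 3’-end do not form a full 3 bp window, so trimming continues until the first four bases, which do, are reached. If no window qualifies, the sketch returns an empty sequence.</p>
</div>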
<div id="Ch2.S2.SS3.p4" class="ltx_para">
<p class="ltx_p">To apply <span class="ltx_text ltx_font_typewriter">clfilterseq</span> to all files in the output folder of <span class="ltx_text ltx_font_typewriter">clsplitseq</span>, run the following command.</p>
</div>
<div id="Ch2.S2.SS3.p5" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> for f in OutputFolderOfclsplitseq/*.fastq.gz↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">do clfilterseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--output=folder \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--append \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqual=27 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minquallen=3 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minlen=350 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--maxlen=400 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--maxplowqual=0.1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">$f \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">done↓</span></span>
</span>
</div>
</section>
</section>
<section id="Ch2.S3" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2.3 </span>For Illumina platform sequences</h3>
<section id="Ch2.S3.SS1" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.3.1 </span>Converting from BCL to FASTQ</h4>
<div id="Ch2.S3.SS1.p1" class="ltx_para">
<p class="ltx_p">The analysis software of the Illumina platform can demultiplex sequencing reads, but it ignores the read quality of tag positions.
Therefore, sequences with low-quality tag positions may be saved to the demultiplexed FASTQ files.
To filter out such sequences, pre-demultiplexed FASTQ files (FASTQ files that have not yet been demultiplexed) are required; these can be converted from BCL files using bcl2fastq.
bcl2fastq is available in 1.x and 2.x series, and both can be used with Claident.
However, a given sequencer may be compatible with only 1.x or only 2.x, so you need to select the proper version.
Pre-demultiplexed FASTQ files can then be demultiplexed by <span class="ltx_text ltx_font_typewriter">clsplitseq</span> in Claident.
See the appendix for how to install bcl2fastq.</p>
</div>
<div id="Ch2.S3.SS1.p2" class="ltx_para">
<p class="ltx_p">To convert BCL to FASTQ, the run data folder (the folder above the BaseCalls folder) needs to be copied to the PC on which bcl2fastq is installed.
If there is a <span class="ltx_text ltx_font_typewriter">SampleSheet.csv</span> in the run data folder, this file must be renamed or deleted.</p>
</div>
<div id="Ch2.S3.SS1.p3" class="ltx_para">
<p class="ltx_p">With bcl2fastq 1.x, the following commands generate FASTQ files from the BCL files of 8 bp dual-indexed 300PE sequencing data.</p>
</div>
<div id="Ch2.S3.SS1.p4" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> cd RunDataFolder↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> configureBclToFastq.pl \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--fastq-cluster-count 0 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--use-bases-mask Y300n,Y8,Y8,Y300n \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--input-dir BaseCalls \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--output-dir outputfolder↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> cd outputfolder↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> make -j4↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS1.p5" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">--fastq-cluster-count 0</span> argument disables splitting of large output files.
The <span class="ltx_text ltx_font_typewriter">--use-bases-mask Y300n,Y8,Y8,Y300n</span> argument saves the forward 300 bp read (last base trimmed), the 8 bp index 1 (reverse-complement of tag-R), the 8 bp index 2 (tag-F) and the reverse 300 bp read (last base trimmed) to <span class="ltx_text ltx_font_typewriter">*_R1_001.fastq.gz</span>, <span class="ltx_text ltx_font_typewriter">*_R2_001.fastq.gz</span>, <span class="ltx_text ltx_font_typewriter">*_R3_001.fastq.gz</span> and <span class="ltx_text ltx_font_typewriter">*_R4_001.fastq.gz</span>, respectively.
The value of the <span class="ltx_text ltx_font_typewriter">--use-bases-mask</span> argument needs to be changed for other sequencing settings.
For 6 bp single-indexed 250SE and 8 bp dual-indexed 300SE sequencing data, <span class="ltx_text ltx_font_typewriter">--use-bases-mask Y250n,Y6</span> and <span class="ltx_text ltx_font_typewriter">--use-bases-mask Y300n,Y8,Y8</span> should be suitable, respectively.
<span class="ltx_text ltx_font_typewriter">make -j4</span> executes the conversion using 4 CPUs.
The output files are compressed by GZIP, as indicated by their <span class="ltx_text ltx_font_typewriter">.gz</span> extension.
Claident reads gzipped FASTQ files directly, so decompression is not required.</p>
</div>
<div id="Ch2.S3.SS1.p6" class="ltx_para">
<p class="ltx_p">With bcl2fastq 2.x, run the following command.</p>
</div>
<div id="Ch2.S3.SS1.p7" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> bcl2fastq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--processing-threads NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--create-fastq-for-index-reads \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--use-bases-mask Y300n,I8,I8,Y300n \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--runfolder-dir RunDataFolder \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--output-dir outputfolder↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS1.p8" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">--processing-threads</span>, <span class="ltx_text ltx_font_typewriter">--use-bases-mask</span> and <span class="ltx_text ltx_font_typewriter">--runfolder-dir</span> arguments specify the number of processors used for the conversion, the masking option (almost the same as in 1.x, but index lengths must be given as <span class="ltx_text ltx_font_typewriter">I[number]</span> instead of <span class="ltx_text ltx_font_typewriter">Y[number]</span>) and the run data folder, respectively.</p>
</div>
</section>
<section id="Ch2.S3.SS2" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.3.2 </span>Demultiplexing of sequences</h4>
<div id="Ch2.S3.SS2.p1" class="ltx_para">
<p class="ltx_p">FASTA files containing tag (index) sequences and primer sequences, like the following, are needed for demultiplexing.
For paired-end sequencing data, FASTA files containing the secondary tag (index) sequences and the reverse primer sequences are also required.</p>
</div>
<div id="Ch2.S3.SS2.p2" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| >TagID</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| [tag sequence]</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| >examplesample1</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| ACGTACGT</span></span>
</p>
</div>
<div id="Ch2.S3.SS2.p3" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| >PrimerID</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| [primer sequence]</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| >exampleprimer1</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| ACGTACGTACGTACGTACGT</span></span>
</p>
</div>
<div id="Ch2.S3.SS2.p4" class="ltx_para">
<p class="ltx_p">Degenerate codes are not allowed in tag sequences, but can be used in primer sequences.
Multiple tags and primers can be written in the files, but the N-th reverse tag/primer sequence is assumed to be paired with the N-th forward tag/primer sequence.
Therefore, an error occurs if the forward and reverse tag/primer sequence files contain different numbers of sequences.
If some samples share the same forward (or reverse) tag/primer but differ in the other tag/primer, each combination of forward and reverse tag/primer sequences must be written as a separate tag/primer entry in the files.
If you added <span class="ltx_text ltx_font_typewriter">N</span> in front of the primer, the <span class="ltx_text ltx_font_typewriter">N</span> also needs to be included in the primer sequence.
If the lengths of the added <span class="ltx_text ltx_font_typewriter">N</span> differ, only the longest <span class="ltx_text ltx_font_typewriter">N</span> should be written in the file.</p>
</div>
<div id="Ch2.S3.SS2.p5" class="ltx_para">
<p class="ltx_p">Once all the required files are prepared, the following command demultiplexes the sequences into per-sample files.</p>
</div>
<div id="Ch2.S3.SS2.p6" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> clsplitseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--runname=RunID \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--index1file=Index1Sequence(tag-Rrevcomp)File \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--index2file=Index2Sequence(tag-F)File \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--primerfile=ForwardPrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--reverseprimerfile=ReversePrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqualtag=30 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile2 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile3 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile4 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS2.p7" class="ltx_para">
<p class="ltx_p">The input files should be specified in the order forward read file, index 1 read file, index 2 read file and reverse read file.
The <span class="ltx_text ltx_font_typewriter">--index1file</span> and <span class="ltx_text ltx_font_typewriter">--index2file</span> arguments require the FASTA sequence files of index 1 (reverse-complement of tag-R) and index 2 (tag-F), respectively.
By default, the acceptable mismatch rates are 14% and 15% for forward and reverse primers, respectively.
If you added <span class="ltx_text ltx_font_typewriter">N</span> in front of the primer, the <span class="ltx_text ltx_font_typewriter">--truncateN=enable</span> argument needs to be given.
This argument excludes the <span class="ltx_text ltx_font_typewriter">N</span> of the primer and the matched positions of the sequences from the calculation of the mismatch rate.
Therefore, only the longest <span class="ltx_text ltx_font_typewriter">N</span> needs to be written to find <span class="ltx_text ltx_font_typewriter">N</span>-added primers, even if the lengths of <span class="ltx_text ltx_font_typewriter">N</span> are unequal.
After the processing, the number of sequences in the demultiplexed files should be compared with that in the demultiplexed files generated by the Illumina software.
Correctly demultiplexed files should contain fewer sequences than those generated by the Illumina software, because sequences with low-quality tag positions have been removed.
If you used the specific primers as sequencing primers, the forward and reverse sequences do not contain the specific primer positions.
In such cases, the <span class="ltx_text ltx_font_typewriter">--primerfile</span> and <span class="ltx_text ltx_font_typewriter">--reverseprimerfile</span> arguments are not required, but the <span class="ltx_text ltx_font_typewriter">--primername=PrimerID</span> argument needs to be given so that sequence names are converted to be compliant with Claident.
A dummy PrimerID is acceptable, but omitting the PrimerID is not.</p>
</div>
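<div class="ltx_para">
<p class="ltx_p">The comparison of sequence counts mentioned above can be done with a small helper such as the following. The function <span class="ltx_text ltx_font_typewriter">count_fastq_reads</span> is hypothetical and not part of Claident, and it assumes the usual FASTQ layout of exactly 4 lines per record.</p>
</div>

```shell
# Hypothetical helper (not part of Claident): count reads in a gzipped
# FASTQ file, assuming exactly 4 lines per record.
count_fastq_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}
```

<div class="ltx_para">
<p class="ltx_p">Run it once on a file demultiplexed by <span class="ltx_text ltx_font_typewriter">clsplitseq</span> and once on the corresponding file demultiplexed by the Illumina software. The first count is expected to be the smaller of the two.</p>
</div>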
<div id="Ch2.S3.SS2.p8" class="ltx_para">
<p class="ltx_p">If you did not perform multiplexed sequencing using tags/indexes, the <span class="ltx_text ltx_font_typewriter">--index1file</span> and <span class="ltx_text ltx_font_typewriter">--index2file</span> arguments are not needed, but the <span class="ltx_text ltx_font_typewriter">--indexname=TagID</span> argument must be given so that sequence names are converted to be compliant with Claident.
A dummy TagID is acceptable, but omitting the TagID is not.</p>
</div>
<div id="Ch2.S3.SS2.p9" class="ltx_para">
<p class="ltx_p">After demultiplexing, <span class="ltx_text ltx_font_typewriter">RunID__TagID__PrimerID.forward.fastq.gz</span> and <span class="ltx_text ltx_font_typewriter">RunID__TagID__PrimerID.reverse.fastq.gz</span> will be generated.
These gzipped FASTQ files can be deposited in sequence read archives such as the DDBJ Sequence Read Archive (DRA).
The DRA deposition process requires you to specify whether the sequence lengths are equal.
Because the primer positions, whose lengths can vary even when only one primer set was used, have been removed from the demultiplexed sequence files, do not specify that the sequence lengths are equal.</p>
</div>
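<div class="ltx_para">
<p class="ltx_p">The output file names encode the run, tag and primer, so they can be taken apart again when preparing a sample list, for instance for a deposition. The sketch below uses a hypothetical function <span class="ltx_text ltx_font_typewriter">parse_sample_name</span> (not part of Claident) and assumes that none of the IDs contains <span class="ltx_text ltx_font_typewriter">__</span> or a period.</p>
</div>

```shell
# Hypothetical helper (not part of Claident): split a file name of the form
# RunID__TagID__PrimerID.forward.fastq.gz into its three fields.
# Assumes that none of the IDs contains "__" or ".".
parse_sample_name() {
    base=$(basename "$1")
    sample=${base%%.*}      # drop everything from the first period onward
    printf '%s\n' "$sample" | awk -F'__' '{ print $1, $2, $3 }'
}

# Placeholder run name combined with the example tag and primer IDs above.
parse_sample_name outputfolder/RunID__examplesample1__exampleprimer1.forward.fastq.gz
```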
</section>
<section id="Ch2.S3.SS3" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.3.3 </span>Concatenating forward and reverse sequences</h4>
<section id="Ch2.S3.SS3.SSSx1" class="ltx_subsubsection">
<h5 class="ltx_title ltx_title_subsubsection">In the case of overlapped paired-end</h5>
<div id="Ch2.S3.SS3.SSSx1.p1" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">clconcatpair</span> command in Claident can be used to concatenate overlapped paired-end sequence data.
With the following command, <span class="ltx_text ltx_font_typewriter">clconcatpair</span> concatenates forward and reverse sequences based on their overlap positions using VSEARCH.</p>
</div>
<div id="Ch2.S3.SS3.SSSx1.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> clconcatpair \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--mode=OVL \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfolder \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS3.SSSx1.p3" class="ltx_para">
<p class="ltx_p">This command finds pairs of <span class="ltx_text ltx_font_typewriter">*.forward.fastq</span> and <span class="ltx_text ltx_font_typewriter">*.reverse.fastq</span> in inputfolder and concatenates them automatically.
Gzipped (<span class="ltx_text ltx_font_typewriter">.gz</span>) and bzip2-compressed (<span class="ltx_text ltx_font_typewriter">.bz2</span>) files are also found and concatenated.
Concatenated sequence files are generated as <span class="ltx_text ltx_font_typewriter">*.fastq.gz</span> in outputfolder.</p>
</div>
<div id="Ch2.S3.SS3.SSSx1.p4" class="ltx_para">
<p class="ltx_p">If the input file names do not match <span class="ltx_text ltx_font_typewriter">*.forward.fastq</span> and <span class="ltx_text ltx_font_typewriter">*.reverse.fastq</span>, the following command can be used to concatenate a single pair of files.</p>
</div>
<div id="Ch2.S3.SS3.SSSx1.p5" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> clconcatpair \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--mode=OVL \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile2 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS3.SSSx1.p6" class="ltx_para">
<p class="ltx_p">The forward and reverse sequence FASTQ files should be given as inputfile1 and inputfile2, respectively.
To compress the output file, add <span class="ltx_text ltx_font_typewriter">.gz</span> or <span class="ltx_text ltx_font_typewriter">.bz2</span> to the output file name.</p>
</div>
</section>
<section id="Ch2.S3.SS3.SSSx2" class="ltx_subsubsection">
<h5 class="ltx_title ltx_title_subsubsection">In the case of non-overlapped paired-end</h5>
<div id="Ch2.S3.SS3.SSSx2.p1" class="ltx_para">
<p class="ltx_p">If the forward and reverse sequences do not overlap, first perform quality trimming and quality filtering using <span class="ltx_text ltx_font_typewriter">clfilterseq</span> as follows.</p>
</div>
<div id="Ch2.S3.SS3.SSSx2.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> clfilterseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqual=30 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minquallen=3 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minlen=100 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--maxplowqual=0.1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile2 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS3.SSSx2.p3" class="ltx_para">
<p class="ltx_p">The values of <span class="ltx_text ltx_font_typewriter">--minqual</span> and <span class="ltx_text ltx_font_typewriter">--minquallen</span> specify the minimum read quality value and the size of the sliding window, respectively.
The above command trims positions from the 3’-tail until a 3 bp window in which all 3 positions have a read quality of 30 or higher is found.
In addition, trimmed sequences shorter than <span class="ltx_text ltx_font_typewriter">--minlen</span> are filtered out.
Remaining sequences in which the proportion of positions with quality lower than <span class="ltx_text ltx_font_typewriter">--minqual</span> is <span class="ltx_text ltx_font_typewriter">--maxplowqual</span> or more are also filtered out.
In this process, if one sequence of a pair is filtered out, the other sequence of the pair is filtered out as well.
The output files are written to outputfolder under the same names as the input files.
To write output to an existing folder, add the <span class="ltx_text ltx_font_typewriter">--append</span> argument.
To apply the above command to all pairs of <span class="ltx_text ltx_font_typewriter">*.forward.fastq</span> and <span class="ltx_text ltx_font_typewriter">*.reverse.fastq</span> in the current folder, execute the following commands.</p>
</div>
<div id="Ch2.S3.SS3.SSSx2.p4" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> for f in `ls *.forward.fastq.gz | grep -P -o '^[^\.]+'`↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">do clfilterseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqual=30 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minquallen=3 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minlen=100 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--maxplowqual=0.1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">$f.forward.fastq.gz \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">$f.reverse.fastq.gz \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">done↓</span></span>
</span>
</div>
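<div class="ltx_para">
<p class="ltx_p">The trimming and filtering rules described above can be sketched in plain bash as follows. This is only an illustration of the logic as stated in the text, not the actual implementation of <span class="ltx_text ltx_font_typewriter">clfilterseq</span>. The function names <span class="ltx_text ltx_font_typewriter">trimmed_length</span>, <span class="ltx_text ltx_font_typewriter">filter_read</span> and <span class="ltx_text ltx_font_typewriter">filter_pair</span> are hypothetical, and the quality strings are assumed to use the Phred+33 encoding.</p>
</div>

```shell
#!/usr/bin/env bash
# Illustrative sketch of the rules described above; not Claident's actual code.
# Assumption: quality strings use the Phred+33 encoding.

# Print the number of bases kept after 3'-end trimming.
# Usage: trimmed_length QUALITY MINQUAL WINDOW
trimmed_length() {
  local qual=$1 minqual=$2 window=$3
  local run=0 i code
  # Walk from the 3' end toward the 5' end, counting consecutive
  # positions whose quality is at least minqual.
  for ((i = ${#qual} - 1; i >= 0; i--)); do
    printf -v code '%d' "'${qual:i:1}"   # character code of one quality symbol
    if ((code - 33 >= minqual)); then
      run=$((run + 1))
      if ((run == window)); then
        echo $((i + window))   # the window spans positions i .. i+window-1
        return
      fi
    else
      run=0
    fi
  done
  echo 0   # no qualifying window: nothing is kept
}

# Print "keep LENGTH" or "drop" for one read.
# Usage: filter_read QUALITY MINQUAL WINDOW MINLEN MAXPLOWQUAL
filter_read() {
  local qual=$1 minqual=$2 window=$3 minlen=$4 maxplowqual=$5
  local len low=0 i code
  len=$(trimmed_length "$qual" "$minqual" "$window")
  # Drop reads that are empty or shorter than minlen after trimming.
  if ((len == 0 || minlen > len)); then
    echo drop
    return
  fi
  # Count positions below minqual in the trimmed read.
  for ((i = 0; i != len; i++)); do
    printf -v code '%d' "'${qual:i:1}"
    if ((minqual > code - 33)); then low=$((low + 1)); fi
  done
  # Keep the read only while the low-quality proportion stays below maxplowqual.
  if awk -v n="$low" -v total="$len" -v p="$maxplowqual" 'BEGIN { exit !(p > n / total) }'; then
    echo "keep $len"
  else
    echo drop
  fi
}

# A pair survives only if both mates survive.
# Usage: filter_pair FWD_QUALITY REV_QUALITY MINQUAL WINDOW MINLEN MAXPLOWQUAL
filter_pair() {
  local fwd rev
  fwd=$(filter_read "$1" "$3" "$4" "$5" "$6")
  rev=$(filter_read "$2" "$3" "$4" "$5" "$6")
  case "$fwd/$rev" in
    *drop*) echo drop ;;
    *) echo keep ;;
  esac
}
```

<div class="ltx_para">
<p class="ltx_p">For example, <span class="ltx_text ltx_font_typewriter">filter_read 'IIIII?##I#' 30 3 5 0.1</span> prints <span class="ltx_text ltx_font_typewriter">keep 6</span>: the last 3 bp window with all quality values of 30 or higher ends at the sixth position, so the four positions after it are trimmed. The short minimum length of 5 is used only to keep the example small.</p>
</div>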
<div id="Ch2.S3.SS3.SSSx2.p5" class="ltx_para">
<p class="ltx_p">After quality trimming and quality filtering as described above, concatenate the sequence pairs with <span class="ltx_text ltx_font_typewriter">clconcatpair</span> as follows.</p>
</div>
<div id="Ch2.S3.SS3.SSSx2.p6" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">> clconcatpair \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--mode=NON \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfolder \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS3.SSSx2.p7" class="ltx_para">
<p class="ltx_p">This process assumes that the input forward and reverse sequences are oriented as follows.</p>
</div>
<div id="Ch2.S3.SS3.SSSx2.p8" class="ltx_para">
<span class="ltx_inline-block ltx_framed_left" style="border-color: #000000;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">5’ ― forward sequence ― 3’</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">5’ ― reverse sequence ― 3’</span></span>
</span>
</div>
<div id="Ch2.S3.SS3.SSSx2.p9" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">clconcatpair --mode=NON</span> command concatenates each of these sequence pairs into a single sequence of the following form.</p>
</div>
<div id="Ch2.S3.SS3.SSSx2.p10" class="ltx_para">
<p class="ltx_p ltx_align_left ltx_framed_left" style="border-color: #000000;"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">5’ ― reverse sequence (reverse-complement) ― ACGTACGTACGTACGT ― forward sequence ― 3’</span></p>
</div>
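<div class="ltx_para">
<p class="ltx_p">The layout above can be reproduced for a single pair with a small bash sketch. It is meant only to make the orientation concrete and is not the implementation of <span class="ltx_text ltx_font_typewriter">clconcatpair</span>. The function names <span class="ltx_text ltx_font_typewriter">revcomp</span> and <span class="ltx_text ltx_font_typewriter">concat_pair</span> are hypothetical, and the sketch handles only the bases, not the quality values.</p>
</div>

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not clconcatpair's actual code.

# Print the reverse complement of a DNA sequence (A, C, G, T only;
# other symbols such as N are reversed but left unchanged).
revcomp() {
  local seq=$1 out="" i
  # Reverse the sequence one character at a time.
  for ((i = ${#seq} - 1; i >= 0; i--)); do
    out+=${seq:i:1}
  done
  # Complement each base.
  printf '%s\n' "$out" | tr 'ACGTacgt' 'TGCAtgca'
}

# Join one pair: reverse-complemented reverse read, padding, forward read.
# Usage: concat_pair FORWARD REVERSE
concat_pair() {
  local forward=$1 reverse=$2
  local padding=ACGTACGTACGTACGT   # padding sequence shown in the layout above
  printf '%s%s%s\n' "$(revcomp "$reverse")" "$padding" "$forward"
}
```

<div class="ltx_para">
<p class="ltx_p">For example, <span class="ltx_text ltx_font_typewriter">concat_pair AAAC GGTT</span> prints <span class="ltx_text ltx_font_typewriter">AACCACGTACGTACGTACGTAAAC</span>: the reverse read GGTT becomes AACC, followed by the padding and then the forward read.</p>
</div>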