<!DOCTYPE html><html prefix="dcterms: http://purl.org/dc/terms/">
<head>
<title>Metabarcoding and DNA barcoding for Ecologists: Sequence analysis</title>
<!--Generated on Sat Jun 22 08:40:06 2019 by LaTeXML (version 0.8.3) http://dlmf.nist.gov/LaTeXML/.-->
<!--Document created on June 22, 2019.-->

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="stylesheet" href="LaTeXML.css" type="text/css">
<link rel="stylesheet" href="ltx-book.css" type="text/css">
</head>
<body>
<div class="ltx_page_main">
<div class="ltx_page_content">
<article class="ltx_document">
<h1 class="ltx_title ltx_title_document">Metabarcoding and DNA barcoding for Ecologists: Sequence analysis</h1>
<div class="ltx_authors">
<span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Akifumi S. Tanabe
</span></span>
</div>
<div class="ltx_date ltx_role_creation">June 22, 2019</div>

<section id="Chx1" class="ltx_chapter">
<h2 class="ltx_title ltx_title_chapter">Preface</h2>

<div id="Chx1.p1" class="ltx_para">
<p class="ltx_p">This book is distributed under a Creative Commons Attribution-ShareAlike 4.0 International License.
You may copy, redistribute, and display this text provided that you credit the author.
You may also modify this text and distribute the modified version, provided that you credit the author and apply this license or a compatible license to the modified version.
To view a copy of this license, visit 
<br class="ltx_break"><a href="https://creativecommons.org/licenses/by-sa/4.0/" title="" class="ltx_ref ltx_href">https://creativecommons.org/licenses/by-sa/4.0/</a>
<br class="ltx_break">or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.</p>
</div>
<div id="Chx1.p2" class="ltx_para">
<p class="ltx_p">I hope that this text helps you.
I am grateful to Dr. Hirokazu Toju (Center for Ecological Research, Kyoto University), Dr. Satoshi Nagai (National Research Institute of Fisheries Science, Japan Fisheries Research and Education Agency), Dr. Hiroki Yamanaka (Ryukoku University), and you.</p>
</div>
</section>
<section id="Chx2" class="ltx_chapter">
<h2 class="ltx_title ltx_title_chapter">Legends</h2>

<div id="Chx2.p1" class="ltx_para">
<p class="ltx_p">In this text, commands entered in a terminal and the output they display are written as follows.</p>
</div>
<div id="Chx2.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># comments</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; command argument1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">argument2 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">argument3↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">output of command</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; command argument1 argument2 argument3↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">output of command</span></span>
</span>
</div>
<div id="Chx2.p3" class="ltx_para">
<p class="ltx_p">In the above example, the same command <span class="ltx_text ltx_font_typewriter">command argument1 argument2 argument3</span> is executed twice.
The output <span class="ltx_text ltx_font_typewriter">output of command</span> is displayed after each execution.
The characters from # to the end of the line are a comment and do not need to be typed.
The <span class="ltx_text ltx_font_typewriter">&gt;</span> and the following space at the beginning of a line represent the terminal prompt.
Do not type these characters.
↓ marks the end of a command and its arguments; do not type it, but press the Enter key at that point.
For readability, I break long commands or arguments across lines.
Such a line break is preceded by <span class="ltx_text ltx_font_typewriter">\</span>.
A line break preceded by <span class="ltx_text ltx_font_typewriter">\</span> therefore does not mean the end of the command or its arguments, nor an instruction to press the Enter key.
Depending on your reading environment, word wrapping may also introduce unintended line breaks; these likewise do not mean the end of the command or its arguments, nor an instruction to press the Enter key.</p>
</div>
<div id="Chx2.p4" class="ltx_para">
<p class="ltx_p">File contents are shown as follows in this text.</p>
</div>
<div id="Chx2.p5" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| The content of first line</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| The content of second line</span></span>
</p>
</div>
<div id="Chx2.p6" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">|</span> and the following space mark the beginning of each line in the file; they are not part of the file and do not need to be typed.
This notation helps you distinguish true line breaks from line breaks introduced by word wrapping.</p>
</div>
</section>
<section id="Ch0" class="ltx_chapter">
<h2 class="ltx_title ltx_title_chapter">
<span class="ltx_tag ltx_tag_chapter">Chapter 0 </span>Installing software and preparing the analysis environment</h2>

<div id="Ch0.p1" class="ltx_para">
<p class="ltx_p">In this text, I assume that the operating system is Debian GNU/Linux 9 (stretch) (hereafter Debian) or Ubuntu Linux 18.04 LTS (hereafter Ubuntu).
If you use a Windows PC, please install Debian or Ubuntu.
Cygwin or the Windows Subsystem for Linux provided for Windows 10 can be used for the following analyses, but the programs run much more slowly.
You can boot the Linux installer from a CD, DVD or USB memory.
If your PC has only one storage device, you need to shrink the Windows partition, either with partition resizing software such as EaseUS Partition Master or with the partition resizing function of the installer.
Alternatively, you can install to a newly added internal storage device or to an external storage device connected by USB.
There are several variants of Ubuntu, and I recommend Xubuntu rather than standard Ubuntu.</p>
</div>
<div id="Ch0.p2" class="ltx_para">
<p class="ltx_p">Debian and Ubuntu can also be installed on a Mac.
If there is not enough free space, you need to resize the OS X partition with Disk Utility or add a storage device.
The rEFIt or rEFInd boot selector may be required to boot Debian, Ubuntu or their installers on a Mac.
If you install rEFIt or rEFInd on your Mac, you can boot the installer of Debian or Ubuntu from a CD, DVD or USB memory.
Do not delete the existing OS X partition.
If you have enough free space, you do not need to resize the existing partition with Disk Utility.
You can also install Debian or Ubuntu to an external storage device on a Mac.</p>
</div>
<div id="Ch0.p3" class="ltx_para">
<p class="ltx_p">I assume a machine with an Intel64/AMD64 (x86_64) CPU as the analysis environment.
Machines with other CPUs can be used for analysis, but you will need to solve any problems by yourself.
The 64-bit version of Debian or Ubuntu is also required because the 32-bit version cannot use large amounts of memory.</p>
</div>
<section id="Ch0.S1" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">0.1 </span>Installation of Claident, Assams, databases, and other required programs</h3>

<div id="Ch0.S1.p1" class="ltx_para">
<p class="ltx_p">Run the following commands in a terminal or console as a user who can use <span class="ltx_text ltx_font_typewriter">sudo</span>.
All of the required software will then be installed.
The installer will ask you for your password when <span class="ltx_text ltx_font_typewriter">sudo</span> is used.</p>
</div>
<div id="Ch0.S1.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; mkdir -p ~/workingdirectory↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; cd ~/workingdirectory↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; wget https://www.claident.org/installClaident_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; sh installClaident_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; wget https://www.claident.org/installOptions_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; sh installOptions_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; wget https://www.claident.org/installDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; sh installDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; wget https://www.claident.org/installUCHIMEDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; sh installUCHIMEDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; cd ..↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; rm -r workingdirectory↓</span></span>
</span>
</div>
<div id="Ch0.S1.p3" class="ltx_para">
<p class="ltx_p">By default, the software will be installed to <span class="ltx_text ltx_font_typewriter">/usr/local</span>.
During the installation, you will see a <span class="ltx_text ltx_font_typewriter">Permission denied</span> error, and then the installer will ask you for your password.
If the installer continues after you enter the password, you can ignore the error.
The installer first tries to install without <span class="ltx_text ltx_font_typewriter">sudo</span>, which produces the above error.
It then retries the installation using <span class="ltx_text ltx_font_typewriter">sudo</span>.</p>
</div>
<div id="Ch0.S1.p4" class="ltx_para">
<p class="ltx_p">If you need a proxy to connect to the internet, execute the following commands to set environment variables before running the installer.</p>
</div>
<div id="Ch0.S1.p5" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; export http_proxy=http://server.address:portnumber↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; export https_proxy=$http_proxy↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; export ftp_proxy=$http_proxy↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; export all_proxy=$http_proxy↓</span></span>
</span>
</div>
<div id="Ch0.S1.p6" class="ltx_para">
<p class="ltx_p">If the proxy requires a username and password, execute the following commands instead of the above commands.</p>
</div>
<div id="Ch0.S1.p7" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; export http_proxy=http://username:password@server.address:portnumber↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; export https_proxy=$http_proxy↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; export ftp_proxy=$http_proxy↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; export all_proxy=$http_proxy↓</span></span>
</span>
</div>
<section id="Ch0.S1.SS1" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">0.1.1 </span>Upgrading to new version</h4>

<div id="Ch0.S1.SS1.p1" class="ltx_para">
<p class="ltx_p">If you want to upgrade all of the software and the databases, run the same commands as for the initial installation.
By this procedure, Assams, Claident, PEAR, VSEARCH, Metaxa and ITSx will be installed to <span class="ltx_text ltx_font_typewriter">/usr/local</span>, and NCBI BLAST+, BLAST databases for molecular identification, taxonomy databases and the other required programs will be installed to <span class="ltx_text ltx_font_typewriter">/usr/local/share/claident</span>.
The NCBI BLAST+ and BLAST databases used by Claident can coexist with a system-wide installation of NCBI BLAST+ and BLAST databases.</p>
</div>
<div id="Ch0.S1.SS1.p2" class="ltx_para">
<p class="ltx_p">You can skip the upgrade of individual components by creating the corresponding marker files before running the installers, as shown below.</p>
</div>
<div id="Ch0.S1.SS1.p3" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; mkdir -p ~/workingdirectory↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; cd ~/workingdirectory↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of Assams</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; touch .assams↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of Claident</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; touch .claident↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of PEAR</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; touch .pear↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of VSEARCH</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; touch .vsearch↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of NCBI BLAST+</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; touch .blast↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># execute upgrade</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; wget https://www.claident.org/installClaident_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; sh installClaident_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of sff_extract</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; touch .sffextract↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of HMMer</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; touch .hmmer↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of MAFFT</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; touch .mafft↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of Metaxa</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; touch .metaxa↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of ITSx</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; touch .itsx↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># execute upgrade</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; wget https://www.claident.org/installOptions_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; sh installOptions_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of "overall" BLAST and taxonomy databases</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; touch .overall↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># execute upgrade</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; wget https://www.claident.org/installDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; sh installDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of "Claident Databases for UCHIME"</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; touch .cdu↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of "rdp" reference database for chimera detection</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; touch .rdp↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of "silva" reference databases for chimera detection</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; touch .silva↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># disable upgrade of "unite" reference databases for chimera detection</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; touch .unite↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># execute upgrade</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; wget https://www.claident.org/installUCHIMEDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; sh installUCHIMEDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; cd ..↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; rm -r workingdirectory↓</span></span>
</span>
</div>
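<div class="ltx_para">
<p class="ltx_p">The marker files above can also be created in a loop. The following is a minimal sketch and not part of the Claident installers: <span class="ltx_text ltx_font_typewriter">skip_upgrades</span> is a hypothetical helper name, and I assume that an empty file named after the component in the working directory is all that is needed, as the commands above suggest. Define the function in your shell, move to the working directory, and call it before running the installers.</p>
</div>
<div class="ltx_para">
<pre class="ltx_verbatim ltx_font_typewriter">
```shell
# skip_upgrades is a hypothetical helper, not a Claident command.
# For each component name given as an argument, create an empty marker
# file ".name" in the current directory (for example ".pear").
skip_upgrades() {
  for name in "$@"; do
    touch ".$name"
  done
}

# Example (run inside ~/workingdirectory before the installers):
# skip_upgrades pear vsearch blast
```
</pre>
</div>
<div class="ltx_para">
<p class="ltx_p">The component names passed to the function should be the ones listed in the commands above, without the leading dot.</p>
</div>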
</section>
<section id="Ch0.S1.SS2" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">0.1.2 </span>Installing to non-default path</h4>

<div id="Ch0.S1.SS2.p1" class="ltx_para">
<p class="ltx_p">If you follow the above procedure, the software will be installed to <span class="ltx_text ltx_font_typewriter">/usr/local</span>.
The executable commands will be installed to <span class="ltx_text ltx_font_typewriter">/usr/local/bin</span>.
You can change the install path as shown below, for example so that the software can coexist with other programs such as older versions.</p>
</div>
<div id="Ch0.S1.SS2.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; mkdir -p ~/workingdirectory↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; cd ~/workingdirectory↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; export PREFIX=install_path↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; wget https://www.claident.org/installClaident_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; sh installClaident_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; wget https://www.claident.org/installOptions_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; sh installOptions_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; wget https://www.claident.org/installDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; sh installDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; wget https://www.claident.org/installUCHIMEDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; sh installUCHIMEDB_Debian.sh↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; cd ..↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; rm -r workingdirectory↓</span></span>
</span>
</div>
<div id="Ch0.S1.SS2.p3" class="ltx_para">
<p class="ltx_p">In this case, the following command needs to be executed before analysis.</p>
</div>
<div id="Ch0.S1.SS2.p4" class="ltx_para">
<p class="ltx_p ltx_align_left ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; export PATH=install_path/bin:$PATH↓</span></p>
</div>
<div id="Ch0.S1.SS2.p5" class="ltx_para">
<p class="ltx_p">You can omit this step if the above command has been added to <span class="ltx_text ltx_font_typewriter">~/.bash_profile</span> or <span class="ltx_text ltx_font_typewriter">~/.bashrc</span>.</p>
</div>
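<div class="ltx_para">
<p class="ltx_p">Adding the line to a startup file only needs to be done once. The following is a minimal sketch under stated assumptions: <span class="ltx_text ltx_font_typewriter">add_to_path</span> is a hypothetical helper name, and the directory of executables and the startup file are passed as arguments. The line is appended only if the file does not already contain exactly that line, so calling the function twice does not duplicate it.</p>
</div>
<div class="ltx_para">
<pre class="ltx_verbatim ltx_font_typewriter">
```shell
# add_to_path is a hypothetical helper, not a Claident command.
# Arguments: 1) directory that contains the executables
#            2) startup file to edit (for example ~/.bashrc)
add_to_path() {
  bindir=$1
  rcfile=$2
  # The backslash keeps $PATH literal so that it is expanded at login time.
  line="export PATH=$bindir:\$PATH"
  # grep -q: quiet, -x: match the whole line, -F: fixed string (no regex).
  # If the line is missing (or the file does not exist yet), append it.
  grep -qxF "$line" "$rcfile" 2>/dev/null || printf '%s\n' "$line" >> "$rcfile"
}

# Example:
# add_to_path install_path/bin ~/.bashrc
```
</pre>
</div>
<div class="ltx_para">
<p class="ltx_p">The new setting takes effect in terminals opened after the startup file has been edited.</p>
</div>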
</section>
<section id="Ch0.S1.SS3" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">0.1.3 </span>How to install multiple versions on one computer</h4>

<div id="Ch0.S1.SS3.p1" class="ltx_para">
<p class="ltx_p">If you install Claident and the other software to the default install path on a computer where Claident is already installed, all of the existing software will be overwritten.
As noted above, multiple versions of Claident can coexist if you install each version to a different non-default path.
Note, however, that only one configuration file can exist at each path: <span class="ltx_text ltx_font_typewriter">.claident</span> in the home directory of the login user (<span class="ltx_text ltx_font_typewriter">/home/username</span>) or <span class="ltx_text ltx_font_typewriter">/etc/claident</span>.
You need to replace this file before switching to another version of Claident.
The configuration file in the home directory of the login user takes precedence.
To use multiple versions, I recommend creating a user account for each version and installing Claident to the home directory of each user.
The version of Claident can then be switched by switching the login user.</p>
</div>
</section>
</section>
</section>
<section id="Ch1" class="ltx_chapter">
<h2 class="ltx_title ltx_title_chapter">
<span class="ltx_tag ltx_tag_chapter">Chapter 1 </span>Sequencing of multiple samples by next-generation sequencers</h2>

<div id="Ch1.p1" class="ltx_para">
<p class="ltx_p">In this chapter, I give a brief overview of the tagged multiplex sequencing method for Roche GS series sequencers, Ion PGM and Illumina MiSeq.
These sequencers can read more than 400 bp contiguously and are suitable for metabarcoding and DNA barcoding.
Note that MiSeq requires concatenation of paired-end reads.
Therefore, PCR amplicons should be 500 bp or shorter (400 bp is recommended) so that the paired-end reads can be concatenated.
Forward and reverse reads can be analyzed separately, but I do not recommend such analysis because reverse reads are usually of low quality.</p>
</div>
<div id="Ch1.p2" class="ltx_para">
<p class="ltx_p">Next-generation sequencers output an extremely large number of nucleotide sequences in a single run.
The running cost of a single run is much higher than that of Sanger method-based sequencers.
The multiplex sequencing method was developed to use such sequencers efficiently.
In this method, multiplex identifier tag sequences are added to the target sequences to identify the sample of origin, and multiple tagged samples are mixed and sequenced in a single run.
This method can greatly reduce per-sample sequencing costs.
A multiplex identifier tag is also called a “barcode”.
However, the nucleotide sequence used for DNA barcoding is called a “barcode sequence”.
This is very confusing, and “multiplex identifier tag” is too long.
Thus, I simply call the multiplex identifier tag sequence a “tag” in this text.
Please note that a tag is often also called an “index”.</p>
</div>
<div id="Ch1.p3" class="ltx_para">
<p class="ltx_p">In the following analysis, chimeric sequences formed during PCR and erroneous sequences can cause misinterpretation of the results.
If multiple PCR replicates are prepared, tagged and sequenced separately, sequences shared among all replicates can be considered nonchimeric and less erroneous.
This is because chimeras and errors can arise from a huge number of sequence combinations and junction points, so the same chimeric or erroneous sequence is unlikely to recur in independent replicates, whereas each true sequence has only one error-free pattern, and nonchimeric, error-free sequences are likely to be observed in all replicates.
Programs alone cannot remove chimeras and errors completely, but combining PCR replicates with programs can be expected to improve the removal of chimeras and errors.
After removal of chimeras and errors, the sequence counts of the PCR replicates can be summed and used in subsequent analysis.</p>
</div>
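<div class="ltx_para">
<p class="ltx_p">The idea of keeping only the sequences shared among all PCR replicates can be sketched with standard text tools. This is an illustration of the principle only, not the procedure that Claident uses: <span class="ltx_text ltx_font_typewriter">shared_sequences</span> is a hypothetical helper name, each replicate is assumed to be a plain text file with one sequence per line, and sequences are compared by exact match.</p>
</div>
<div class="ltx_para">
<pre class="ltx_verbatim ltx_font_typewriter">
```shell
# shared_sequences is a hypothetical helper, not a Claident command.
# Each argument is a text file with one sequence per line
# (one file per PCR replicate).
# Prints the sequences that occur in every replicate file.
shared_sequences() {
  n=$#
  for f in "$@"; do
    sort -u "$f"    # count each sequence at most once per replicate
  done |
    sort |
    uniq -c |       # number of replicates in which each sequence occurs
    awk -v n="$n" '$1 == n { print $2 }'
}

# Example:
# shared_sequences replicate1.txt replicate2.txt replicate3.txt
```
</pre>
</div>
<div class="ltx_para">
<p class="ltx_p">A sequence that appears in only some of the replicates is dropped, which is the behavior expected for most chimeric and erroneous sequences.</p>
</div>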
<section id="Ch1.S1" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">1.1 </span>PCR using tag- and adapter-jointed-primers</h3>

<div id="Ch1.S1.p1" class="ltx_para">
<p class="ltx_p">The easiest way to add a tag to an amplicon is PCR using a tag-jointed primer.
This method requires a set of tag-jointed primers.
In addition, library preparation kits for next-generation sequencers usually presume that the adapter sequences specified by the manufacturer are added to both ends of the target sequences.
Thus, the following tag- and adapter-jointed primer is used for PCR.</p>
</div>
<div id="Ch1.S1.p2" class="ltx_para">
<p class="ltx_p ltx_align_left ltx_framed_left" style="border-color: #000000;"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">5’ ― [adapter] ― [tag] ― [specific primer] ― 3’</span></p>
</div>
<div id="Ch1.S1.p3" class="ltx_para">
<p class="ltx_p">If primers of this kind are used as both the forward and reverse primers, the following amplicon sequence will be constructed.</p>
</div>
<div id="Ch1.S1.p4" class="ltx_para">
<p class="ltx_p ltx_align_left ltx_framed_left" style="border-color: #000000;"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">5’ ― [adapter-F] ― [tag-F] ― [specific primer-F] ― [target sequence] ― [specific primer-R (reverse complement)] ― [tag-R (reverse complement)] ― [adapter-R (reverse complement)] ― 3’</span></p>
</div>
<div id="Ch1.S1.p5" class="ltx_para">
<p class="ltx_p">In single-end reads, tag-F precedes specific primer-F and the target sequence in the sequence data.</p>
</div>
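<div class="ltx_para">
<p class="ltx_p">The primer and amplicon layouts above can be written out as string operations. The following is a minimal sketch: the function names <span class="ltx_text ltx_font_typewriter">revcomp</span>, <span class="ltx_text ltx_font_typewriter">build_primer</span> and <span class="ltx_text ltx_font_typewriter">build_amplicon</span> are hypothetical, all sequences are assumed to be written 5’ to 3’ using only A, C, G and T (degenerate bases are not handled), and the sequences in the example are made up.</p>
</div>
<div class="ltx_para">
<pre class="ltx_verbatim ltx_font_typewriter">
```shell
# Hypothetical helpers that illustrate the layouts above.

# Reverse complement of a sequence made of A, C, G and T.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# 5' - [adapter] - [tag] - [specific primer] - 3'
# Arguments: adapter, tag, specific primer.
build_primer() {
  printf '%s%s%s\n' "$1" "$2" "$3"
}

# Expected amplicon on the forward strand, 5' to 3'.
# Arguments: forward primer, target sequence, reverse primer.
# The reverse primer appears as its reverse complement, so the order
# at the 3' end becomes specific primer-R, tag-R, adapter-R.
build_amplicon() {
  printf '%s%s%s\n' "$1" "$2" "$(revcomp "$3")"
}

# Example with made-up sequences:
# fwd=$(build_primer AAA CC GT)
# rev_primer=$(build_primer GGG AA C)
# build_amplicon "$fwd" TTTT "$rev_primer"
```
</pre>
</div>
<div class="ltx_para">
<p class="ltx_p">In single-end data read from the forward side, the read therefore starts with the tag-F portion of the forward primer, followed by specific primer-F and the target sequence.</p>
</div>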
<div id="Ch1.S1.p6" class="ltx_para">
<p class="ltx_p">The supplement of <cite class="ltx_cite ltx_citemacro_citet">Hamady <span class="ltx_text ltx_font_italic">et al.</span> (<a href="#bib.bib7" title="" class="ltx_ref">2008</a>)</cite> may be useful for choosing tag sequences.
In the case of single-end sequencing, the 3’-side tag is not required, and a tagless primer can be used for PCR.
In the case of paired-end sequencing, a single index (tag) can be used, but dual indexes (tags) are recommended because they allow detection of unexpected tag combinations, which indicate that forward and reverse sequences are mispaired.</p>
</div>
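<div class="ltx_para">
<p class="ltx_p">How dual indexes expose mispaired reads can be illustrated with a lookup against the list of tag pairs that were actually used. This is only a sketch under stated assumptions: <span class="ltx_text ltx_font_typewriter">check_tag_pair</span> is a hypothetical helper name, and the sample sheet is assumed to be a plain text file in which each line holds a forward tag, a reverse tag and a sample name separated by spaces.</p>
</div>
<div class="ltx_para">
<pre class="ltx_verbatim ltx_font_typewriter">
```shell
# check_tag_pair is a hypothetical helper, not a Claident command.
# Arguments: 1) forward tag found in the read pair
#            2) reverse tag found in the read pair
#            3) sample sheet with lines "forward-tag reverse-tag sample"
# Prints the sample name, or "unexpected" if the combination was not used.
check_tag_pair() {
  awk -v fwd="$1" -v rev="$2" '
    $1 == fwd { if ($2 == rev) { print $3; found = 1 } }
    END { if (!found) print "unexpected" }
  ' "$3"
}

# Example:
# check_tag_pair ACGT TTGA samplesheet.txt
```
</pre>
</div>
<div class="ltx_para">
<p class="ltx_p">With a single index, a mispaired read cannot be detected in this way because there is no second tag to contradict the first one.</p>
</div>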
<div id="Ch1.S1.p7" class="ltx_para">
<p class="ltx_p">When the above primer sets are used for PCR, the primers anneal to the templates in a “Y” formation, and amplicon sequences that have tags and adapters at both ends are constructed.
Then, the amplicon solutions are mixed at equal concentrations and sequenced according to the manufacturer’s protocol.
A spectrophotometer (including Nanodrop) is inappropriate for measuring the concentration of these solutions because spectrophotometric measurement of dsDNA is easily affected by contaminants.
I recommend Qubit (ThermoFisher) for measuring dsDNA concentration.
Quantitative PCR-based methods can also be recommended, but they are more expensive and more time-consuming.</p>
</div>
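<div class="ltx_para">
<p class="ltx_p">Once the dsDNA concentrations have been measured, the volume of each amplicon solution to mix can be calculated by simple division. The following sketch assumes that the goal is to pool an equal mass of DNA from every sample and that the amplicons are of similar length; <span class="ltx_text ltx_font_typewriter">pooling_volumes</span> is a hypothetical helper name, and the input is assumed to be a text file with a sample name and a concentration in ng/µL on each line.</p>
</div>
<div class="ltx_para">
<pre class="ltx_verbatim ltx_font_typewriter">
```shell
# pooling_volumes is a hypothetical helper, not a Claident command.
# Arguments: 1) mass of DNA to take from each sample (ng)
#            2) text file with lines "sample concentration", in ng/uL
# Prints "sample volume" with the volume in uL (volume = mass / concentration).
# Lines whose concentration is not a positive number are skipped.
pooling_volumes() {
  awk -v mass="$1" '$2 + 0 > 0 { printf "%s %.2f\n", $1, mass / $2 }' "$2"
}

# Example:
# pooling_volumes 50 concentrations.txt
```
</pre>
</div>
<div class="ltx_para">
<p class="ltx_p">If the amplicon lengths differ greatly among samples, equal mass does not correspond to an equal number of molecules, and the calculation would need to take length into account.</p>
</div>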
<div id="Ch1.S1.p8" class="ltx_para">
<p class="ltx_p">The sequence at the primer annealing position can also be used to recognize the sample of origin.
Therefore, the sequences of multiple loci, for example plant <span class="ltx_text ltx_font_italic">rbcL</span> and <span class="ltx_text ltx_font_italic">matK</span>, from the same sample set tagged with the same tag set can be multiplexed and sequenced together.
Of course, the sequences of multiple loci can also be distinguished by the sequences themselves.
A smaller number of cycles and a longer extension time are recommended for PCR.
Because the amount of DNA required for sequencing sample preparation is not very high, a large number of PCR cycles is not needed.
A larger number of cycles and a shorter extension time generate more incompletely extended amplicons, and these incompletely extended amplicons are re-extended on different template sequences in the next cycle.
Such sequences are called “chimeric DNA”.
Chimeric DNA causes the false discovery of non-existent novel species or an overestimation of species diversity.
Using a high-fidelity DNA polymerase such as Phusion (Finnzymes) or KOD (TOYOBO) is effective in reducing chimeric DNA formation.
<cite class="ltx_cite ltx_citemacro_citet">Stevens <span class="ltx_text ltx_font_italic">et al.</span> (<a href="#bib.bib17" title="" class="ltx_ref">2013</a>)</cite> reported that slowing the cooling from denaturation temperature to annealing temperature reduced chimeric DNA formation.
If your thermal cycler allows the cooling speed to be changed, slowing the cooling from denaturation temperature to annealing temperature is recommended.
Chimeric DNA sequences can also be eliminated by computer programs after sequencing.
However, because chimera removal by programs is incomplete and also reduces the number of nonchimeric sequences, the best approach is to reduce chimeric DNA formation in the first place.</p>
</div>
<div id="Ch1.S1.p9" class="ltx_para">
<p class="ltx_p">For templates that are hard to amplify, using Ampdirect Plus (Shimadzu) as the PCR buffer, or crushing the sample with a homogenizer or beads before DNA extraction, is recommended.
Deep freezing before crushing can also be recommended.
Removal of polyphenols or polysaccharides might be required if your samples contain those chemicals.
If PCR amplification using tag- and adapter-jointed-primers fails, try two-step PCR, which consists of a primary PCR (20–30 cycles) using primers without tags and adapters, purification of the amplicons with ExoSAP-IT, and a secondary PCR (5–10 cycles) using the primary PCR amplicons as templates and tag- and adapter-jointed-primers.</p>
</div>
<section id="Ch1.S1.SS1" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">1.1.1 </span>Decreasing costs by interim adapters</h4>

<div id="Ch1.S1.SS1.p1" class="ltx_para">
<p class="ltx_p">Tag- and adapter-jointed-primers are very long and expensive.
In addition, we need to buy tag- and adapter-jointed-primers for each locus.
To reduce the cost of tag- and adapter-jointed-primers, interim adapter-jointed primers and two-step PCR are useful.
The following primer set is used in primary PCR.</p>
</div>
<div id="Ch1.S1.SS1.p2" class="ltx_para">
<p class="ltx_p ltx_align_left ltx_framed_left" style="border-color: #000000;"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">5’ ― [interim adapter] ― [specific primer] ― 3’</span></p>
</div>
<div id="Ch1.S1.SS1.p3" class="ltx_para">
<p class="ltx_p">This PCR product has interim adapter sequences at both ends.
After purification, it is used as the template in the secondary PCR.
The following primer set is used in secondary PCR.</p>
</div>
<div id="Ch1.S1.SS1.p4" class="ltx_para">
<p class="ltx_p ltx_align_left ltx_framed_left" style="border-color: #000000;"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">5’ ― [adapter specified by manufacturer] ― [tag] ― [interim adapter] ― 3’</span></p>
</div>
<div id="Ch1.S1.SS1.p5" class="ltx_para">
<p class="ltx_p">This two-step PCR enables us to reuse secondary PCR primers.
However, this two-step PCR may increase PCR errors and PCR amplification biases, and decrease target sequence lengths.
Note that the final PCR product has the following structure.</p>
</div>
<div id="Ch1.S1.SS1.p6" class="ltx_para">
<p class="ltx_p ltx_align_left ltx_framed_left" style="border-color: #000000;"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">5’ ― [adapter-F specified by manufacturer] ― [tag-F] ― [interim adapter-F] ― [specific primer-F] ― [target sequence] ― [specific primer-R (reverse complement)] ― [interim adapter-R (reverse complement)] ― [tag-R (reverse complement)] ― [adapter-R (reverse complement) specified by manufacturer] ― 3’</span></p>
</div>
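<div class="ltx_para">
<p class="ltx_p">The structure above can be reproduced in a few lines of Python, which may help when checking primer designs.
This is only a sketch under the assumption that all parts are unambiguous DNA strings written 5’ to 3’ as ordered; the function names and the toy sequences are hypothetical.</p>
</div>

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def final_amplicon(adapter_f, tag_f, interim_f, primer_f, target,
                   primer_r, interim_r, tag_r, adapter_r):
    """Concatenate the parts of the final two-step PCR product, 5' to 3'."""
    forward_side = adapter_f + tag_f + interim_f + primer_f
    # On this strand the reverse-side parts appear as reverse complements,
    # in the order primer-R, interim adapter-R, tag-R, adapter-R.
    reverse_side = "".join(
        reverse_complement(part)
        for part in (primer_r, interim_r, tag_r, adapter_r)
    )
    return forward_side + target + reverse_side


# Toy sequences, far shorter than real adapters, tags and primers.
print(final_amplicon("AA", "C", "GG", "T", "ACGT", "A", "CC", "G", "TT"))
```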
<div id="Ch1.S1.SS1.p7" class="ltx_para">
<p class="ltx_p">Illumina’s multiplex sequencing method <cite class="ltx_cite ltx_citemacro_citep">(Illumina corporation, <a href="#bib.bib9" title="" class="ltx_ref">2013</a>)</cite> using Nextera XT Index Kit is the same as the above method.
In dual-index paired-end sequencing based on this method, the first read starts just behind interim adapter-F (i.e. at the head of specific primer-F) and runs toward the target sequence.
The second read starts just behind interim adapter-R and contains the tag-R (index1) sequence.
The third read starts just behind adapter-F and contains the tag-F (index2) sequence.
The last read starts just ahead of interim adapter-R (i.e. at the tail of specific primer-R) and runs toward the target sequence.
The first, second, third and last reads are saved as <span class="ltx_text ltx_font_typewriter">*_R1_*.fastq.gz</span>, <span class="ltx_text ltx_font_typewriter">*_R2_*.fastq.gz</span>, <span class="ltx_text ltx_font_typewriter">*_R3_*.fastq.gz</span> and <span class="ltx_text ltx_font_typewriter">*_R4_*.fastq.gz</span>, respectively.
The first, second and third reads are on the same strand, but the last read is on the reverse strand.
Because the sequencing primers for the first and the last reads target interim adapter-F and interim adapter-R, respectively, the first and the last reads contain the sequences of specific primer-F and specific primer-R, respectively.
Thus, the target sequences contained in the first and the last reads are shortened.
If the target sequence is 500 bp or longer, there might be no overlap, and the paired-end reads cannot be concatenated.
If specific primer-F and specific primer-R are used as sequencing primers for the first and the last reads, you can exclude the sequences of specific primer-F and specific primer-R from the first and the last reads.
However, the quality improvement method by insertion of N described in the next section cannot be applied in such a case.</p>
</div>
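<div class="ltx_para">
<p class="ltx_p">The effect on the overlap can be checked with simple arithmetic.
The following is a minimal sketch, assuming that both reads have the same length and start at the head of their specific primer as described above.
The read length of 300 nt and the 20-nt primers are hypothetical values chosen for illustration, not values taken from this text.</p>
</div>

```python
def paired_end_overlap(read_length, primer_f_length, primer_r_length, target_length):
    """Return the overlap (nt) between the first and the last reads.

    Both reads are assumed to start at the head of their specific primer, so
    the sequenced fragment is primer-F + target + primer-R. A zero or negative
    value means that the reads do not overlap and cannot be concatenated.
    """
    fragment_length = primer_f_length + target_length + primer_r_length
    return 2 * read_length - fragment_length


# Hypothetical 300-nt reads and 20-nt primers for several target lengths.
for target in (300, 400, 500, 600):
    print(target, paired_end_overlap(300, 20, 20, target))
```

<div class="ltx_para">
<p class="ltx_p">Trimming of low quality 3’-tails shortens the usable reads further, so the practical limit is lower than this arithmetic alone suggests.</p>
</div>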
</section>
<section id="Ch1.S1.SS2" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">1.1.2 </span>Quality improvement by insertion of N</h4>

<div id="Ch1.S1.SS2.p1" class="ltx_para">
<p class="ltx_p">On the Illumina platform, the luminescence of DNA synthesis on a flowcell is detected by an optical sensor.
PCR amplicons from metagenomes come from a single locus and are much more homogeneous than genome shotgun or RNA-seq library sequences.
In such a case, neighboring sequences on a flowcell are difficult to distinguish from one another.
In addition, if the nucleotides of most sequences (especially within the first 12 nucleotides) are the same and give no luminescence, the Illumina sequencer will judge the run as a failure and abort.
To avoid this problem, insertion of <span class="ltx_text ltx_font_typewriter">NNNNNN</span> between the specific primer and the interim adapter is effective.
<span class="ltx_text ltx_font_typewriter">NNNNNN</span> at the head of sequences enables sequencers to distinguish neighboring sequences and prevents blackout, and the sequencing quality is therefore improved <cite class="ltx_cite ltx_citemacro_citep">(Nelson <span class="ltx_text ltx_font_italic">et al.</span>, <a href="#bib.bib15" title="" class="ltx_ref">2014</a>)</cite>.
Varying the length of <span class="ltx_text ltx_font_typewriter">NNNNNN</span> causes an artificial frameshift and is also effective <cite class="ltx_cite ltx_citemacro_citep">(Fadrosh <span class="ltx_text ltx_font_italic">et al.</span>, <a href="#bib.bib6" title="" class="ltx_ref">2014</a>)</cite>.
The amount of PhiX control can be reduced by using the above methods, so the number of sequences obtained from your own samples increases.</p>
</div>
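<div class="ltx_para">
<p class="ltx_p">To illustrate why a varied spacer length helps, the following sketch builds read starts with 0 to <span class="ltx_text ltx_font_typewriter">max_spacer_length</span> N bases ahead of the same primer and counts which primer-derived bases fall on each sequencing cycle.
The function names and the toy primer are hypothetical; real designs should follow the cited papers.</p>
</div>

```python
from collections import Counter


def staggered_read_starts(primer, max_spacer_length):
    """Return read starts with 0..max_spacer_length N bases placed before the primer."""
    return ["N" * n + primer for n in range(max_spacer_length + 1)]


def bases_per_cycle(read_starts, cycles):
    """Count the primer-derived bases seen at each of the first `cycles` positions.

    N positions are skipped because they are random in the real primers.
    """
    counts = []
    for i in range(cycles):
        column = [s[i] for s in read_starts if len(s) > i and s[i] != "N"]
        counts.append(Counter(column))
    return counts


# Toy primer: with staggered spacers, later cycles see a mixture of bases
# even though every read carries the same primer sequence.
starts = staggered_read_starts("ACGT", 2)
for cycle, counter in enumerate(bases_per_cycle(starts, 3), start=1):
    print(cycle, dict(counter))
```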
</section>
</section>
</section>
<section id="Ch2" class="ltx_chapter">
<h2 class="ltx_title ltx_title_chapter">
<span class="ltx_tag ltx_tag_chapter">Chapter 2 </span>Preprocessing of nucleotide sequence data</h2>

<div id="Ch2.p1" class="ltx_para">
<p class="ltx_p">Roche GS series sequencers and Ion PGM output raw sequencing data as <span class="ltx_text ltx_font_typewriter">*.sff</span>.
Illumina platform sequencers output <span class="ltx_text ltx_font_typewriter">*.fastq</span> files.
This chapter explains the procedures of demultiplexing, quality trimming and quality filtering.
The <span class="ltx_text ltx_font_typewriter">clsplitseq</span> command of Claident is recommended for demultiplexing because the programs provided by manufacturers ignore the quality of tag positions.
The following commands should be executed in the terminal or console.
Fundamental knowledge of terminal operations is required.
If you are unfamiliar with terminal operations, you need to understand the contents of appendix <a href="#A2" title="Appendix B Terminal command examples ‣ Metabarcoding and DNA barcoding for Ecologists: Sequence analysis" class="ltx_ref"><span class="ltx_text ltx_ref_tag">B</span></a> first.</p>
</div>
<section id="Ch2.S1" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2.1 </span>Importing sequence data deposited to SRA/DRA/ERA or demultiplexed FASTQ</h3>

<div id="Ch2.S1.p1" class="ltx_para">
<p class="ltx_p">Claident assumes <span class="ltx_text ltx_font_typewriter">SequenceID__RunID__TagID__PrimerID</span> for definition lines of sequences, and <span class="ltx_text ltx_font_typewriter">RunID__TagID__PrimerID</span> for file names (without extension).
Therefore, the sequence data deposited to SRA/DRA/ERA or demultiplexed FASTQ cannot be used as is.
The <span class="ltx_text ltx_font_typewriter">climportfastq</span> command of Claident can convert such data.
If your data is paired-end, you need to concatenate and filter the sequences before conversion (see section <a href="#Ch2.S3.SS3" title="2.3.3 Concatenating forward and reverse sequences ‣ 2.3 For Illumina platform sequences ‣ Chapter 2 Preprocessing of nucleotide sequence data ‣ Metabarcoding and DNA barcoding for Ecologists: Sequence analysis" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2.3.3</span></a>).
The following plain text file is required for conversion.</p>
</div>
<div id="Ch2.S1.p2" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| SequenceFileName1 RunID__TagID__PrimerID</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| SequenceFileName2 RunID__TagID__PrimerID</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| SequenceFileName3 RunID__TagID__PrimerID</span></span>
</p>
</div>
<div id="Ch2.S1.p3" class="ltx_para">
<p class="ltx_p">Dummy RunID and PrimerID are acceptable.
PrimerID needs to be the same among the samples that used the same primer set.
TagID needs to differ among different sample files.
TagID can be the same as the sequence file name.</p>
</div>
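<div class="ltx_para">
<p class="ltx_p">Such a list can be generated rather than typed by hand.
The following is a minimal sketch, assuming that the two columns are separated by one space as displayed above (the exact delimiter accepted by <span class="ltx_text ltx_font_typewriter">climportfastq</span> is not stated here and should be checked against the Claident documentation), that the TagID is derived from the file name, and that dummy RunID and PrimerID are used.
The function name and the dummy identifiers are hypothetical.</p>
</div>

```python
from pathlib import Path


def import_list_lines(fastq_names, run_id="dummyrun", primer_id="dummyprimer"):
    """Build 'SequenceFileName RunID__TagID__PrimerID' lines, one per file.

    The TagID is the file name without its .fastq/.fq/.gz extensions, so it
    differs among sample files as long as the file names differ.
    """
    lines = []
    for name in fastq_names:
        tag_id = Path(name).name
        for suffix in (".gz", ".fastq", ".fq"):
            if tag_id.endswith(suffix):
                tag_id = tag_id[: -len(suffix)]
        # Assumption: the double underscore is reserved as the field delimiter.
        if "__" in tag_id:
            raise ValueError(f"double underscore in file name: {name}")
        lines.append(f"{name} {run_id}__{tag_id}__{primer_id}")
    return lines


for line in import_list_lines(["sampleA.fastq.gz", "sampleB.fastq.gz"]):
    print(line)
```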
<div id="Ch2.S1.p4" class="ltx_para">
<p class="ltx_p">After the above file has been prepared, execute <span class="ltx_text ltx_font_typewriter">climportfastq</span> as follows, giving the above file as the input file.</p>
</div>
<div id="Ch2.S1.p5" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; climportfastq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch2.S1.p6" class="ltx_para">
<p class="ltx_p">Then, you can find converted files in the output folder.
If your sequence data is single-end, quality filtering explained in section <a href="#Ch2.S2.SS3" title="2.2.3 Trimming low quality tail and filtering low quality sequences ‣ 2.2 For Roche GS series sequencers and Ion PGM ‣ Chapter 2 Preprocessing of nucleotide sequence data ‣ Metabarcoding and DNA barcoding for Ecologists: Sequence analysis" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2.2.3</span></a> is recommended.</p>
</div>
</section>
<section id="Ch2.S2" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2.2 </span>For Roche GS series sequencers and Ion PGM</h3>

<section id="Ch2.S2.SS1" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.2.1 </span>Converting SFF to FASTQ</h4>

<div id="Ch2.S2.SS1.p1" class="ltx_para">
<p class="ltx_p">First of all, the raw SFF file needs to be converted to a FASTQ file as follows.</p>
</div>
<div id="Ch2.S2.SS1.p2" class="ltx_para">
<p class="ltx_p ltx_align_left ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; sff_extract -c inputfile(SFF)↓</span></p>
</div>
<div id="Ch2.S2.SS1.p3" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">-c</span> argument enables trimming of <span class="ltx_text ltx_font_typewriter">TCAG</span> at the head of sequences.
If you add <span class="ltx_text ltx_font_typewriter">TCAG</span> to the head of tag sequences, do not use this argument.
Assuming your SFF file name is <span class="ltx_text ltx_font_typewriter">HOGEHOGE.sff</span>, <span class="ltx_text ltx_font_typewriter">HOGEHOGE.fastq</span> will be saved as FASTQ file.
<span class="ltx_text ltx_font_typewriter">HOGEHOGE.xml</span> will also be generated, but this is not required.
The output sequences have tag sequences at the beginning, followed by primer-F and target sequences, and primer-R (reverse complement) at the end.
Note that not all sequences are read completely from beginning to end; incomplete sequences are included.
The <span class="ltx_text ltx_font_typewriter">sff_extract</span> command is used in this book, but any other program which can clip <span class="ltx_text ltx_font_typewriter">TCAG</span> at the beginning can be used.
If the SFF to FASTQ converter program cannot clip <span class="ltx_text ltx_font_typewriter">TCAG</span> at the beginning, adding <span class="ltx_text ltx_font_typewriter">TCAG</span> to the beginning of the tag sequences given to <span class="ltx_text ltx_font_typewriter">clsplitseq</span> also works well, but the quality values of those positions will then be strictly checked as well.</p>
</div>
</section>
<section id="Ch2.S2.SS2" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.2.2 </span>Demultiplexing of sequences</h4>

<div id="Ch2.S2.SS2.p1" class="ltx_para">
<p class="ltx_p">The FASTQ file that contains the sequences from multiple samples needs to be demultiplexed based on tag sequences and primer sequences before the subsequent analysis.
To do this, a FASTA file which contains tag sequences and another FASTA file which contains primer-F sequences are required.</p>
</div>
<div id="Ch2.S2.SS2.p2" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| &gt;TagID</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| [tag sequence]</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| &gt;examplesample1</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| ACGTACGT</span></span>
</p>
</div>
<div id="Ch2.S2.SS2.p3" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| &gt;PrimerID</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| [primer sequence]</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| &gt;exampleprimer1</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| ACGTACGTACGTACGTACGT</span></span>
</p>
</div>
<div id="Ch2.S2.SS2.p4" class="ltx_para">
<p class="ltx_p">Degenerate nucleotide codes are not allowed in tag sequences, but they are allowed in primer sequences.
Both the tag and primer FASTA files can contain multiple sequences.
If you use interim adapter explained in section <a href="#Ch1.S1.SS1" title="1.1.1 Decreasing costs by interim adapters ‣ 1.1 PCR using tag- and adapter-jointed-primers ‣ Chapter 1 Sequencing of multiple samples by next-generation sequencers ‣ Metabarcoding and DNA barcoding for Ecologists: Sequence analysis" class="ltx_ref"><span class="ltx_text ltx_ref_tag">1.1.1</span></a>, primer sequences should be written like the following.</p>
</div>
<div id="Ch2.S2.SS2.p5" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| &gt;PrimerID</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| [interim adapter][primer sequence]</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| &gt;exampleprimer1</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| TGATACTCGATACGTACGTACGTACGTACGTACGT</span></span>
</p>
</div>
<div id="Ch2.S2.SS2.p6" class="ltx_para">
<p class="ltx_p">Thus, the sequences between tag and target sequences should be written in primer FASTA file.</p>
</div>
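<div class="ltx_para">
<p class="ltx_p">Before running the demultiplexing, the tag FASTA file can be checked against the rules described in this section: no degenerate codes in tags, and tags of equal length within one file.
The following is a minimal sketch with hypothetical function names; it is not part of Claident.</p>
</div>

```python
def read_fasta(text):
    """Parse FASTA text into a list of (identifier, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line.upper()
    return [(name, seq) for name, seq in records]


def check_tags(records):
    """Return a list of problems found in tag records (empty if none)."""
    problems = []
    for name, seq in records:
        # Only A, C, G and T are accepted; anything else is reported.
        if set(seq) - set("ACGT"):
            problems.append(f"{name}: degenerate or invalid code in tag")
    if len({len(seq) for _, seq in records}) > 1:
        problems.append("tag lengths are unequal")
    return problems


tags = read_fasta(">examplesample1\nACGTACGT\n>examplesample2\nACGTNCGT\n")
print(check_tags(tags))
```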
<div id="Ch2.S2.SS2.p7" class="ltx_para">
<p class="ltx_p">Once all the above files are prepared, the following command demultiplexes the nucleotide sequences into one FASTQ file per sample.</p>
</div>
<div id="Ch2.S2.SS2.p8" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clsplitseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--runname=RunID \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--tagfile=TagSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--primerfile=PrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqualtag=27 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch2.S2.SS2.p9" class="ltx_para">
<p class="ltx_p">RunID must differ among different sequencing runs.
RunID is given by the sequencer in many cases, and you can use such a sequencer-generated RunID.
RunID is usually contained in sequence file name or sequence name in sequence file, but the naming rules are different among sequencing platforms.
Therefore, <span class="ltx_text ltx_font_typewriter">clsplitseq</span> requires RunID given by user.
<span class="ltx_text ltx_font_typewriter">--minqualtag</span> is an argument that specifies the minimum quality threshold for the tag position.
If a sequence contains one or more nucleotides in the tag position whose quality is lower than this threshold, the sequence is omitted from the output.
A minimum quality threshold of 27 was proposed by <cite class="ltx_cite ltx_citemacro_citet">Kunin <span class="ltx_text ltx_font_italic">et al.</span> (<a href="#bib.bib11" title="" class="ltx_ref">2010</a>)</cite> for 3’-tail trimming of the sequences of Roche GS series sequencers.
A different value might be more suitable for other sequencers.
In many cases, 30 is used as the minimum quality threshold and can be recommended.</p>
</div>
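<div class="ltx_para">
<p class="ltx_p">The rule applied by <span class="ltx_text ltx_font_typewriter">--minqualtag</span> can be illustrated as follows.
This is only a sketch of the rule described above, not the code of Claident; it assumes that the quality string is Phred+33 encoded and that the tag occupies the first positions of the read.
The function name and its parameters are hypothetical.</p>
</div>

```python
def tag_passes_quality(quality_string, tag_length, min_qual_tag=27, offset=33):
    """Return True if every base in the tag position has quality >= min_qual_tag.

    quality_string is the FASTQ quality line of one read (assumed Phred+33).
    """
    tag_qualities = [ord(ch) - offset for ch in quality_string[:tag_length]]
    if len(tag_qualities) != tag_length:
        return False  # the read is shorter than the tag
    # A single base below the threshold is enough to reject the read.
    return all(q >= min_qual_tag for q in tag_qualities)


# "I" encodes quality 40 and "5" encodes quality 20 under Phred+33.
print(tag_passes_quality("IIIIIIII55", tag_length=8))
print(tag_passes_quality("III5IIIIII", tag_length=8))
```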
<div id="Ch2.S2.SS2.p10" class="ltx_para">
<p class="ltx_p">If multiplex sequencing technique is not used, <span class="ltx_text ltx_font_typewriter">--tagfile</span> argument can be omitted.
However, simply omitting <span class="ltx_text ltx_font_typewriter">--tagfile</span> generates FASTQ files that are incompatible with Claident.
In such a case, you should add an identifier (a dummy is acceptable) for the tag sequences using the <span class="ltx_text ltx_font_typewriter">--indexname=TagID</span> argument.</p>
</div>
<div id="Ch2.S2.SS2.p11" class="ltx_para">
<p class="ltx_p">The tag and primer position sequences are trimmed from the output sequences.
Tag position sequence match is evaluated exactly and strictly.
There are no arguments to tolerate a mismatch.
Primer position sequence is aligned based on Needleman-Wunsch algorithm and evaluated allowing 14% of mismatches (the threshold can be changed).
The output files are named as <span class="ltx_text ltx_font_typewriter">RunID__TagID__PrimerID.fastq.gz</span> and saved in the output folder.
<span class="ltx_text ltx_font_typewriter">clsplitseq</span> can use multiple CPUs for faster processing.
If your computer has 4 CPU cores, 4 should be specified for the <span class="ltx_text ltx_font_typewriter">--numthreads</span> argument.
Note that operating system and/or writing speed of storage devices might limit processing speed.
By default, the output files are compressed by GZIP.
Therefore, decompression is required before the files can be read or written by programs that cannot handle gzipped FASTQ files.
The Claident commands used below can handle gzipped FASTQ files.</p>
</div>
<div id="Ch2.S2.SS2.p12" class="ltx_para">
<p class="ltx_p">Before submission of manuscripts, sequence data need to be deposited in a public database such as the DDBJ Sequence Read Archive (DRA).
Gzipped FASTQ files in this step can be used for the data deposition.</p>
</div>
<section id="Ch2.S2.SS2.SSSx1" class="ltx_subsubsection">
<h5 class="ltx_title ltx_title_subsubsection">If you sequenced a number of samples by multiple sequencing runs</h5>

<div id="Ch2.S2.SS2.SSSx1.p1" class="ltx_para">
<p class="ltx_p">Multiple demultiplexing runs of <span class="ltx_text ltx_font_typewriter">clsplitseq</span> are required.
However, <span class="ltx_text ltx_font_typewriter">clsplitseq</span> cannot write to an already existing folder by default.
The second run of <span class="ltx_text ltx_font_typewriter">clsplitseq</span> requires the <span class="ltx_text ltx_font_typewriter">--append</span> argument as shown below.</p>
</div>
<div id="Ch2.S2.SS2.SSSx1.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clsplitseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--runname=RunID1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--tagfile=TagSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--primerfile=PrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqualtag=27 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clsplitseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--runname=RunID2 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--tagfile=TagSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--primerfile=PrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqualtag=27 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--append \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile2 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
</section>
<section id="Ch2.S2.SS2.SSSx2" class="ltx_subsubsection">
<h5 class="ltx_title ltx_title_subsubsection">If your tag sequence lengths are unequal</h5>

<div id="Ch2.S2.SS2.SSSx2.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_typewriter">clsplitseq</span> assumes that all tag sequence lengths are equal for faster processing.
Tags of unequal length must be split into multiple tag sequence files, one per length, and multiple demultiplexing runs of <span class="ltx_text ltx_font_typewriter">clsplitseq</span> are required as follows.</p>
</div>
<div id="Ch2.S2.SS2.SSSx2.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clsplitseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--runname=RunID \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--tagfile=TagSequenceFile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--primerfile=PrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqualtag=27 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clsplitseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--runname=RunID \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--tagfile=TagSequenceFile2 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--primerfile=PrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqualtag=27 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--append \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
</section>
<section id="Ch2.S2.SS2.SSSx3" class="ltx_subsubsection">
<h5 class="ltx_title ltx_title_subsubsection">Recognition and elimination of reverse primer positions</h5>

<div id="Ch2.S2.SS2.SSSx3.p1" class="ltx_para">
<p class="ltx_p">In the above procedure, reverse primer position and subsequent sequences are not eliminated.
Reverse primer position and subsequent sequences are artificial and should be eliminated if possible.
To do so, a reverse primer sequence file like the following is required.</p>
</div>
<div id="Ch2.S2.SS2.SSSx3.p2" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| &gt;PrimerID</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| [primer sequence]</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| &gt;exampleprimer1</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| TCAGTCAGTCAGTCAGTCAG</span></span>
</p>
</div>
<div id="Ch2.S2.SS2.SSSx3.p3" class="ltx_para">
<p class="ltx_p">Multiple reverse primers can be written in this file.
Note that the N-th reverse primer sequence is assumed to be associated with the N-th forward primer sequence.
Therefore, a different number of primer sequences in the forward and reverse primer sequence files causes an error.
If there are samples whose forward or reverse primer sequence is the same but whose other primer sequence is different, both combinations of forward and reverse primer sequences need to be written as different primers in the files.</p>
</div>
<div id="Ch2.S2.SS2.SSSx3.p4" class="ltx_para">
<p class="ltx_p">After preparing the above file, run <span class="ltx_text ltx_font_typewriter">clsplitseq</span> as follows.</p>
</div>
<div id="Ch2.S2.SS2.SSSx3.p5" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clsplitseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--runname=RunID \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--tagfile=TagSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--primerfile=ForwardPrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--reverseprimerfile=ReversePrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--reversecomplement \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqualtag=27 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch2.S2.SS2.SSSx3.p6" class="ltx_para">
<p class="ltx_p">In this process, in addition to the processing described above, the reverse-complement sequence of the reverse primer is searched for using the Needleman-Wunsch algorithm, allowing 15% mismatches (this value can be changed), and the reverse primer position and the subsequent sequence are removed.
If the reverse-complement sequence of the reverse primer is not found but the other requirements are fulfilled, the sequence is still saved to the output file by default.
Add the <span class="ltx_text ltx_font_typewriter">--needreverseprimer</span> argument to filter out sequences that do not contain the reverse-complement sequence of the reverse primer.</p>
</div>
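<div class="ltx_para">
<p class="ltx_p">The effect of the reverse primer search can be illustrated with a small Python sketch.
It is not Claident's implementation: instead of a Needleman-Wunsch alignment it uses a simpler ungapped sliding comparison, and the function names (<span class="ltx_text ltx_font_typewriter">find_primer</span>, <span class="ltx_text ltx_font_typewriter">trim_reverse_primer</span>) are hypothetical.
Only the 15% mismatch allowance and the default behaviour of keeping sequences without the primer are taken from the text above.</p>
</div>

```python
# Illustrative sketch only: ungapped search, NOT Claident's Needleman-Wunsch code.

# Bases matched by each IUPAC code (degenerate codes are allowed in primers).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Return the reverse complement, keeping degenerate codes consistent."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_primer(read, primer, max_mismatch_rate=0.15):
    """Return the leftmost start at which primer matches read, or None."""
    read = read.upper()
    primer = primer.upper()
    allowed = int(len(primer) * max_mismatch_rate)
    for start in range(len(read) - len(primer) + 1):
        window = read[start:start + len(primer)]
        mismatches = sum(base not in IUPAC[code]
                         for base, code in zip(window, primer))
        if mismatches <= allowed:
            return start
    return None


def trim_reverse_primer(read, reverse_primer, need_reverse_primer=False):
    """Cut the read at the reverse-complemented reverse primer.

    Without a match the read is kept unchanged, unless need_reverse_primer
    is True, in which case None is returned (the read is discarded).
    """
    pos = find_primer(read, reverse_complement(reverse_primer))
    if pos is None:
        return None if need_reverse_primer else read
    return read[:pos]
```

<div class="ltx_para">
<p class="ltx_p">Because the comparison is ungapped, this sketch misses primers that contain insertions or deletions, which an alignment-based search can tolerate.</p>
</div>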
</section>
</section>
<section id="Ch2.S2.SS3" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.2.3 </span>Trimming low quality tail and filtering low quality sequences</h4>

<div id="Ch2.S2.SS3.p1" class="ltx_para">
<p class="ltx_p">FASTQ sequences carry read quality information.
The low-quality 3’-tail can be trimmed and low-quality sequences can be filtered out based on the quality values.
The <span class="ltx_text ltx_font_typewriter">clfilterseq</span> command performs this processing as follows.</p>
</div>
<div id="Ch2.S2.SS3.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clfilterseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqual=27 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minquallen=3 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minlen=350 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--maxlen=400 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--maxplowqual=0.1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
<div id="Ch2.S2.SS3.p3" class="ltx_para">
<p class="ltx_p">The values of <span class="ltx_text ltx_font_typewriter">--minqual</span> and <span class="ltx_text ltx_font_typewriter">--minquallen</span> specify the minimum read quality value and the size of the sliding window, respectively.
The above command trims positions from the 3’-tail until it finds a 3 bp window in which all 3 positions have a read quality of 27 or higher.
In addition, trimmed sequences shorter than <span class="ltx_text ltx_font_typewriter">--minlen</span> are filtered out, and trimmed sequences longer than <span class="ltx_text ltx_font_typewriter">--maxlen</span> are truncated to <span class="ltx_text ltx_font_typewriter">--maxlen</span>.
Remaining sequences in which the proportion of positions with quality lower than <span class="ltx_text ltx_font_typewriter">--minqual</span> is <span class="ltx_text ltx_font_typewriter">--maxplowqual</span> or more are also filtered out.
By default the output is a single file, but it can be saved as a file in a new folder by adding the <span class="ltx_text ltx_font_typewriter">--output=folder</span> argument.
In that case, the output file name is the same as the input file name.
If you want to save the output files to an existing folder, add the <span class="ltx_text ltx_font_typewriter">--append</span> argument.</p>
</div>
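<div class="ltx_para">
<p class="ltx_p">The trimming and filtering rules described above can be sketched in Python as follows.
This is a minimal illustration of the stated rules, not the code of <span class="ltx_text ltx_font_typewriter">clfilterseq</span>; the function names are hypothetical, and a Phred+33 quality encoding is assumed.</p>
</div>

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality line into integer scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def trim_tail(seq, quals, minqual=27, minquallen=3):
    """Drop 3'-end positions until the last minquallen positions all reach minqual."""
    end = len(seq)
    while end >= minquallen and min(quals[end - minquallen:end]) < minqual:
        end -= 1
    if end < minquallen:
        # No window of sufficient quality exists in the read.
        return "", []
    return seq[:end], quals[:end]


def filter_read(seq, quals, minqual=27, minquallen=3,
                minlen=350, maxlen=400, maxplowqual=0.1):
    """Return the trimmed (seq, quals), or None if the read is filtered out."""
    seq, quals = trim_tail(seq, quals, minqual, minquallen)
    if not seq or len(seq) < minlen:
        return None
    # Reads longer than maxlen are truncated, not discarded.
    seq, quals = seq[:maxlen], quals[:maxlen]
    low = sum(q < minqual for q in quals)
    if low / len(seq) >= maxplowqual:
        return None
    return seq, quals
```

<div class="ltx_para">
<p class="ltx_p">For example, a 10 bp read whose last three quality values are 10, 30 and 10 loses its final three positions, because the first acceptable 3 bp window ends at the seventh position.</p>
</div>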
<div id="Ch2.S2.SS3.p4" class="ltx_para">
<p class="ltx_p">If you want to apply <span class="ltx_text ltx_font_typewriter">clfilterseq</span> to all the files in the output folder of <span class="ltx_text ltx_font_typewriter">clsplitseq</span>, run the following command.</p>
</div>
<div id="Ch2.S2.SS3.p5" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; for f in OutputFolderOfclsplitseq/*.fastq.gz↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">do clfilterseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--output=folder \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--append \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqual=27 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minquallen=3 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minlen=350 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--maxlen=400 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--maxplowqual=0.1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">$f \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">done↓</span></span>
</span>
</div>
</section>
</section>
<section id="Ch2.S3" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2.3 </span>For Illumina platform sequences</h3>

<section id="Ch2.S3.SS1" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.3.1 </span>Converting from BCL to FASTQ</h4>

<div id="Ch2.S3.SS1.p1" class="ltx_para">
<p class="ltx_p">The analysis software of the Illumina platform can demultiplex sequencing reads, but it ignores the read quality of tag positions.
Therefore, sequences with low-quality tag positions may be saved to the demultiplexed FASTQ files.
To filter out such sequences, pre-demultiplexed FASTQ files (files that have not yet been demultiplexed) are required; they can be converted from BCL files with the aid of bcl2fastq.
There are 1.x and 2.x series of bcl2fastq, and both can be used with Claident.
However, a sequencer may be compatible with only one of the two series, so you need to select the proper version.
Pre-demultiplexed FASTQ files can then be demultiplexed by <span class="ltx_text ltx_font_typewriter">clsplitseq</span> in Claident.
See the appendix for how to install bcl2fastq.</p>
</div>
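<div class="ltx_para">
<p class="ltx_p">The kind of check that the Illumina software skips can be sketched as follows.
The sketch assumes that a read is rejected when any position of its index reads has a quality below the threshold, and that qualities are Phred+33 encoded; both are assumptions made for illustration, and the function names are hypothetical.</p>
</div>

```python
def min_phred(quality_line, offset=33):
    """Lowest quality score in a FASTQ quality line (Phred+33 assumed)."""
    return min(ord(ch) - offset for ch in quality_line)


def passes_tag_quality(index_quality_lines, minqualtag=30):
    """True if every position of every index read reaches minqualtag.

    index_quality_lines holds the quality lines of the index reads that
    belong to one cluster, e.g. [index1_quality, index2_quality].
    """
    return all(min_phred(line) >= minqualtag for line in index_quality_lines)
```

<div class="ltx_para">
<p class="ltx_p">A single low-quality position in either index is enough to reject the read, which reflects the idea that an unreliable tag may assign a sequence to the wrong sample.</p>
</div>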
<div id="Ch2.S3.SS1.p2" class="ltx_para">
<p class="ltx_p">To convert BCL to FASTQ, the run data folder (the folder above the BaseCalls folder) needs to be copied to the PC on which bcl2fastq is installed.
If there is a <span class="ltx_text ltx_font_typewriter">SampleSheet.csv</span> in the run data folder, this file must be renamed or deleted.</p>
</div>
<div id="Ch2.S3.SS1.p3" class="ltx_para">
<p class="ltx_p">In the case of bcl2fastq 1.x, the following commands make FASTQ files from BCL files of 8 bp dual indexed 300PE sequencing data.</p>
</div>
<div id="Ch2.S3.SS1.p4" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; cd RunDataFolder↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; configureBclToFastq.pl \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--fastq-cluster-count 0 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--use-bases-mask Y300n,Y8,Y8,Y300n \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--input-dir BaseCalls \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--output-dir outputfolder↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; cd outputfolder↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; make -j4↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS1.p5" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">--fastq-cluster-count 0</span> argument disables the splitting of large output files.
The <span class="ltx_text ltx_font_typewriter">--use-bases-mask Y300n,Y8,Y8,Y300n</span> argument saves the forward 300 bp read (the last base is trimmed), the 8 bp index 1 (reverse-complement of tag-R), the 8 bp index 2 (tag-F) and the reverse 300 bp read (the last base is trimmed) to <span class="ltx_text ltx_font_typewriter">*_R1_001.fastq.gz</span>, <span class="ltx_text ltx_font_typewriter">*_R2_001.fastq.gz</span>, <span class="ltx_text ltx_font_typewriter">*_R3_001.fastq.gz</span> and <span class="ltx_text ltx_font_typewriter">*_R4_001.fastq.gz</span>, respectively.
The value of the <span class="ltx_text ltx_font_typewriter">--use-bases-mask</span> argument needs to be changed for other sequencing settings.
For 6 bp single-indexed 250SE and 8 bp dual-indexed 300SE sequencing data, <span class="ltx_text ltx_font_typewriter">--use-bases-mask Y250n,Y6</span> and <span class="ltx_text ltx_font_typewriter">--use-bases-mask Y300n,Y8,Y8</span> should be suitable, respectively.
<span class="ltx_text ltx_font_typewriter">make -j4</span> executes the conversion using 4 CPUs.
The output files are compressed by GZIP, as indicated by the <span class="ltx_text ltx_font_typewriter">.gz</span> extension.
Claident reads gzipped FASTQ files directly, so decompression is not required.</p>
</div>
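<div class="ltx_para">
<p class="ltx_p">When adapting the mask to other sequencing settings, it can help to check how many cycles each comma-separated part covers.
The following Python sketch parses only the simple forms used in this section (<span class="ltx_text ltx_font_typewriter">Y</span>, <span class="ltx_text ltx_font_typewriter">I</span> and <span class="ltx_text ltx_font_typewriter">n</span> with an optional count); it is not part of bcl2fastq, and the function names are hypothetical.</p>
</div>

```python
import re


def parse_bases_mask(mask):
    """Split a mask such as 'Y300n,Y8,Y8,Y300n' into one tuple per read.

    Each tuple is (kind, kept_cycles, skipped_cycles), where kind is the
    first letter of the part ('Y' or 'I') and 'n' cycles count as skipped.
    """
    segments = []
    for part in mask.split(","):
        kept = skipped = 0
        for letter, count in re.findall(r"([YyIiNn])(\d*)", part):
            cycles = int(count) if count else 1
            if letter in "Nn":
                skipped += cycles
            else:
                kept += cycles
        segments.append((part[0].upper(), kept, skipped))
    return segments


def total_cycles(mask):
    """Total number of sequencing cycles described by the mask."""
    return sum(kept + skipped for _, kept, skipped in parse_bases_mask(mask))
```

<div class="ltx_para">
<p class="ltx_p">For the mask <span class="ltx_text ltx_font_typewriter">Y300n,Y8,Y8,Y300n</span>, the sketch reports 300 kept cycles and 1 skipped cycle for each of the two long reads and 8 kept cycles for each index, that is, 618 cycles in total, which should agree with the run configuration.</p>
</div>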
<div id="Ch2.S3.SS1.p6" class="ltx_para">
<p class="ltx_p">In the case of bcl2fastq 2.x, run the following command.</p>
</div>
<div id="Ch2.S3.SS1.p7" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; bcl2fastq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--processing-threads NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--create-fastq-for-index-reads \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--use-bases-mask Y300n,I8,I8,Y300n \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--runfolder-dir RunDataFolder \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--output-dir outputfolder↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS1.p8" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">--processing-threads</span>, <span class="ltx_text ltx_font_typewriter">--use-bases-mask</span> and <span class="ltx_text ltx_font_typewriter">--runfolder-dir</span> arguments specify the number of processors used in the conversion, the masking option (almost the same as in 1.x, but index lengths must be given as <span class="ltx_text ltx_font_typewriter">I[number]</span> instead of <span class="ltx_text ltx_font_typewriter">Y[number]</span>) and the run data folder, respectively.</p>
</div>
</section>
<section id="Ch2.S3.SS2" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.3.2 </span>Demultiplexing of sequences</h4>

<div id="Ch2.S3.SS2.p1" class="ltx_para">
<p class="ltx_p">FASTA files containing tag (index) sequences and primer sequences like the following are needed for demultiplexing.
FASTA files containing secondary tag (index) sequences and reverse primer sequences are also required for paired-end sequencing data.</p>
</div>
<div id="Ch2.S3.SS2.p2" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| &gt;TagID</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| [tag sequence]</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| &gt;examplesample1</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| ACGTACGT</span></span>
</p>
</div>
<div id="Ch2.S3.SS2.p3" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| &gt;PrimerID</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| [primer sequence]</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| &gt;exampleprimer1</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| ACGTACGTACGTACGTACGT</span></span>
</p>
</div>
<div id="Ch2.S3.SS2.p4" class="ltx_para">
<p class="ltx_p">Degenerate codes are not allowed in tag sequences, but can be used in primer sequences.
Multiple tags and primers can be written in the files, but the N-th reverse tag/primer sequence is assumed to be associated with the N-th forward tag/primer sequence.
Therefore, an error occurs if the forward and reverse tag/primer sequence files contain different numbers of sequences.
If some samples share the same forward (or reverse) tag/primer sequence but differ in the other one, each combination of forward and reverse tag/primer sequences needs to be written as a separate tag/primer in the files.
If you added <span class="ltx_text ltx_font_typewriter">N</span> in front of the primer, <span class="ltx_text ltx_font_typewriter">N</span> needs to be added to the primer sequence in the file as well.
If the lengths of <span class="ltx_text ltx_font_typewriter">N</span> are unequal among your primers, only the longest <span class="ltx_text ltx_font_typewriter">N</span> should be written in the file.</p>
</div>
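<div class="ltx_para">
<p class="ltx_p">Before running the demultiplexing, the rules above can be checked with a short Python sketch.
It reads the FASTA text of the forward and reverse files, pairs the N-th records, and rejects tags that contain degenerate codes.
The function names are hypothetical, and the check only mirrors the rules stated in this section; it is not how Claident validates its input.</p>
</div>

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) in file order."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def pair_by_position(forward, reverse):
    """Pair the N-th forward record with the N-th reverse record."""
    if len(forward) != len(reverse):
        raise ValueError(
            f"{len(forward)} forward but {len(reverse)} reverse sequences")
    return list(zip(forward, reverse))


def is_valid_tag(seq):
    """Tags must consist of A, C, G and T only (no degenerate codes)."""
    return bool(seq) and set(seq.upper()) <= set("ACGT")
```

<div class="ltx_para">
<p class="ltx_p">Pairing by position rather than by name is the reason why a shared primer has to be repeated in the file for every combination in which it is used.</p>
</div>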
<div id="Ch2.S3.SS2.p5" class="ltx_para">
<p class="ltx_p">Once all the required files are prepared, the following command demultiplexes the sequences into one file per sample.</p>
</div>
<div id="Ch2.S3.SS2.p6" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clsplitseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--runname=RunID \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--index1file=Index1Sequence(tag-Rrevcomp)File \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--index2file=Index2Sequence(tag-F)File \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--primerfile=ForwardPrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--reverseprimerfile=ReversePrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqualtag=30 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberofCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile2 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile3 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile4 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS2.p7" class="ltx_para">
<p class="ltx_p">The input files should be specified in the order of the forward read file, the index 1 read file, the index 2 read file and the reverse read file.
The <span class="ltx_text ltx_font_typewriter">--index1file</span> and <span class="ltx_text ltx_font_typewriter">--index2file</span> arguments require the FASTA sequence files of index 1 (reverse-complement of tag-R) and index 2 (tag-F), respectively.
By default, the acceptable mismatch rates are 14% and 15% for forward and reverse primers, respectively.
If you added <span class="ltx_text ltx_font_typewriter">N</span> in front of the primer, the <span class="ltx_text ltx_font_typewriter">--truncateN=enable</span> argument needs to be given.
This argument excludes the <span class="ltx_text ltx_font_typewriter">N</span> of the primer and the matched positions of the sequences from the calculation of the mismatch rate.
Therefore, only the longest <span class="ltx_text ltx_font_typewriter">N</span> needs to be written to find <span class="ltx_text ltx_font_typewriter">N</span>-added primers even if the lengths of <span class="ltx_text ltx_font_typewriter">N</span> are unequal.
After the processing, the number of sequences in the demultiplexed files should be compared with that in the demultiplexed files generated by the Illumina software.
Correctly demultiplexed files should contain fewer sequences than the demultiplexed files generated by the Illumina software, because sequences with low-quality tag positions have been discarded.
If you used the specific primers as sequencing primers, the forward and reverse sequences do not contain the specific primer positions.
In such cases, the <span class="ltx_text ltx_font_typewriter">--primerfile</span> and <span class="ltx_text ltx_font_typewriter">--reverseprimerfile</span> arguments are not required, but the <span class="ltx_text ltx_font_typewriter">--primername=PrimerID</span> argument needs to be given so that the sequence names are converted to be compliant with Claident.
A dummy PrimerID is acceptable, but omitting the PrimerID is not.</p>
</div>
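<div class="ltx_para">
<p class="ltx_p">The comparison of sequence counts can be done with any FASTQ-aware tool.
As a minimal sketch, the following Python functions count the records of a plain or gzipped FASTQ file, assuming the usual layout of exactly four lines per record; the function names are hypothetical.</p>
</div>

```python
import gzip


def count_fastq_records(lines):
    """Count FASTQ records in an iterable of lines (4 lines per record assumed)."""
    total = sum(1 for _ in lines)
    if total % 4:
        raise ValueError("number of lines is not a multiple of 4")
    return total // 4


def count_reads_in_file(path):
    """Count the records of a FASTQ file, gzipped or not."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_fastq_records(handle)
```

<div class="ltx_para">
<p class="ltx_p">Applying <span class="ltx_text ltx_font_typewriter">count_reads_in_file</span> to a file written by <span class="ltx_text ltx_font_typewriter">clsplitseq</span> and to the corresponding file from the Illumina software gives the two numbers to compare for each sample.</p>
</div>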
<div id="Ch2.S3.SS2.p8" class="ltx_para">
<p class="ltx_p">If you did not perform multiplex sequencing using tags/indexes, the <span class="ltx_text ltx_font_typewriter">--index1file</span> and <span class="ltx_text ltx_font_typewriter">--index2file</span> arguments are not needed, but the <span class="ltx_text ltx_font_typewriter">--indexname=TagID</span> argument must be given so that the sequence names are converted to be compliant with Claident.
A dummy TagID is acceptable, but omitting the TagID is not.</p>
</div>
<div id="Ch2.S3.SS2.p9" class="ltx_para">
<p class="ltx_p">After demultiplexing, <span class="ltx_text ltx_font_typewriter">RunID__TagID__PrimerID.forward.fastq.gz</span> and <span class="ltx_text ltx_font_typewriter">RunID__TagID__PrimerID.reverse.fastq.gz</span> will be generated.
These gzipped FASTQ files can be used for data deposition to sequence read archive sites such as DDBJ Sequence Read Archive (DRA).
In the deposition process to DRA, you are asked to declare whether the sequence lengths are equal or not.
Do not declare that the sequence lengths are equal: the primer positions, which can differ in length even if only one primer set was used, have been removed from the demultiplexed sequence files, so the remaining sequence lengths can vary.</p>
</div>
</section>
<section id="Ch2.S3.SS3" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.3.3 </span>Concatenating forward and reverse sequences</h4>

<section id="Ch2.S3.SS3.SSSx1" class="ltx_subsubsection">
<h5 class="ltx_title ltx_title_subsubsection">In the case of overlapped paired-end</h5>

<div id="Ch2.S3.SS3.SSSx1.p1" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">clconcatpair</span> command in Claident can be used for concatenating overlapped paired-end sequence data.
It concatenates forward and reverse sequences based on their overlap positions using VSEARCH, and is run as follows.</p>
</div>
<div id="Ch2.S3.SS3.SSSx1.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clconcatpair \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--mode=OVL \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfolder \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS3.SSSx1.p3" class="ltx_para">
<p class="ltx_p">This command finds <span class="ltx_text ltx_font_typewriter">*.forward.fastq</span> and <span class="ltx_text ltx_font_typewriter">*.reverse.fastq</span> in inputfolder and concatenates the pairs automatically.
Gzipped <span class="ltx_text ltx_font_typewriter">.gz</span> and/or bzip2-compressed <span class="ltx_text ltx_font_typewriter">.bz2</span> files are also found and concatenated.
Concatenated sequence files will be generated as <span class="ltx_text ltx_font_typewriter">*.fastq.gz</span> in outputfolder.</p>
</div>
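<div class="ltx_para">
<p class="ltx_p">The idea of overlap-based concatenation can be illustrated with a toy Python sketch.
It is deliberately much simpler than VSEARCH, which tolerates mismatches and takes quality values into account: here the reverse read is reverse-complemented and joined to the forward read at the longest exact overlap.
The function names and the <span class="ltx_text ltx_font_typewriter">min_overlap</span> parameter are hypothetical.</p>
</div>

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=10):
    """Join a read pair at the longest exact overlap, or return None.

    The 3' end of the forward read is compared with the 5' end of the
    reverse-complemented reverse read; mismatches are not tolerated.
    """
    forward = forward.upper()
    rc = reverse_complement(reverse)
    for overlap in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        if forward[-overlap:] == rc[:overlap]:
            return forward + rc[overlap:]
    return None
```

<div class="ltx_para">
<p class="ltx_p">Pairs for which no overlap of at least <span class="ltx_text ltx_font_typewriter">min_overlap</span> positions is found return <span class="ltx_text ltx_font_typewriter">None</span> in this sketch, corresponding to pairs that cannot be concatenated.</p>
</div>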
<div id="Ch2.S3.SS3.SSSx1.p4" class="ltx_para">
<p class="ltx_p">If input file names are not compliant with <span class="ltx_text ltx_font_typewriter">*.forward.fastq</span> and <span class="ltx_text ltx_font_typewriter">*.reverse.fastq</span>, the following command can be used for concatenating a pair of files.</p>
</div>
<div id="Ch2.S3.SS3.SSSx1.p5" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clconcatpair \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--mode=OVL \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile2 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS3.SSSx1.p6" class="ltx_para">
<p class="ltx_p">The forward and reverse sequence FASTQ files should be given as inputfile1 and inputfile2, respectively.
To compress the output file, add <span class="ltx_text ltx_font_typewriter">.gz</span> or <span class="ltx_text ltx_font_typewriter">.bz2</span> to the output file name.</p>
</div>
</section>
<section id="Ch2.S3.SS3.SSSx2" class="ltx_subsubsection">
<h5 class="ltx_title ltx_title_subsubsection">In the case of non-overlapped paired-end</h5>

<div id="Ch2.S3.SS3.SSSx2.p1" class="ltx_para">
<p class="ltx_p">If there are no overlaps between forward and reverse sequences, first perform quality-trimming and quality-filtering using <span class="ltx_text ltx_font_typewriter">clfilterseq</span> as follows.</p>
</div>
<div id="Ch2.S3.SS3.SSSx2.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clfilterseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqual=30 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minquallen=3 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minlen=100 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--maxplowqual=0.1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile2 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS3.SSSx2.p3" class="ltx_para">
<p class="ltx_p">The values of <span class="ltx_text ltx_font_typewriter">--minqual</span> and <span class="ltx_text ltx_font_typewriter">--minquallen</span> specify the minimum read quality value and the size of the sliding window, respectively.
The above command trims positions from the 3’-tail until it finds a 3 bp window in which all 3 positions have a read quality of 30 or higher.
In addition, trimmed sequences shorter than <span class="ltx_text ltx_font_typewriter">--minlen</span> are filtered out.
Remaining sequences in which the proportion of positions with quality lower than <span class="ltx_text ltx_font_typewriter">--minqual</span> is <span class="ltx_text ltx_font_typewriter">--maxplowqual</span> or more are also filtered out.
In this process, if one sequence of a pair is filtered out, the other sequence of the pair is filtered out as well.
The output files are generated in outputfolder with the same names as the input files.
If you want to write the output to an existing folder, add the <span class="ltx_text ltx_font_typewriter">--append</span> argument.
To apply the above command to all the pairs of <span class="ltx_text ltx_font_typewriter">*.forward.fastq</span> and <span class="ltx_text ltx_font_typewriter">*.reverse.fastq</span> in the current folder, execute the following commands.</p>
</div>
<div id="Ch2.S3.SS3.SSSx2.p4" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; for f in ‘ls *.forward.fastq.gz | grep -P -o ’^[^\.]+’‘↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">do clfilterseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqual=30 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minquallen=3 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minlen=100 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--maxplowqual=0.1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">$f.forward.fastq.gz \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">$f.reverse.fastq.gz \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">done↓</span></span>
</span>
</div>
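<div class="ltx_para">
<p class="ltx_p">The file name handling of the loop above can be sketched in Python as follows.
Like the <span class="ltx_text ltx_font_typewriter">grep</span> pattern, the sketch takes everything before the first dot of each forward file name as the sample prefix, so it assumes that the prefix itself contains no dot; the function names are hypothetical.</p>
</div>

```python
import re


def sample_prefixes(filenames):
    """Prefix (text before the first dot) of every *.forward.fastq.gz file."""
    prefixes = []
    for name in sorted(filenames):
        if name.endswith(".forward.fastq.gz"):
            match = re.match(r"[^.]+", name)
            if match:
                prefixes.append(match.group(0))
    return prefixes


def paired_files(filenames):
    """Return (forward, reverse) file name pairs that are both present."""
    available = set(filenames)
    pairs = []
    for prefix in sample_prefixes(filenames):
        forward = f"{prefix}.forward.fastq.gz"
        reverse = f"{prefix}.reverse.fastq.gz"
        # Skip samples whose partner file is missing or whose prefix has a dot.
        if forward in available and reverse in available:
            pairs.append((forward, reverse))
    return pairs
```

<div class="ltx_para">
<p class="ltx_p">Unlike the shell loop, this sketch silently skips a forward file whose reverse partner is missing, which makes such cases easy to overlook; printing the skipped names would be a reasonable addition.</p>
</div>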
<div id="Ch2.S3.SS3.SSSx2.p5" class="ltx_para">
<p class="ltx_p">After the quality-trimming and quality-filtering described above, concatenate the sequences with the aid of <span class="ltx_text ltx_font_typewriter">clconcatpair</span> as follows.</p>
</div>
<div id="Ch2.S3.SS3.SSSx2.p6" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clconcatpair \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--mode=NON \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfolder \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS3.SSSx2.p7" class="ltx_para">
<p class="ltx_p">In this process, the forward and reverse sequences like the following are assumed as input.</p>
</div>
<div id="Ch2.S3.SS3.SSSx2.p8" class="ltx_para">
<span class="ltx_inline-block ltx_framed_left" style="border-color: #000000;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">5’ ― forward sequence ― 3’</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">5’ ― reverse sequence ― 3’</span></span>
</span>
</div>
<div id="Ch2.S3.SS3.SSSx2.p9" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">clconcatpair --mode=NON</span> command will concatenate these sequence pairs and make sequences like the following.</p>
</div>
<div id="Ch2.S3.SS3.SSSx2.p10" class="ltx_para">
<p class="ltx_p ltx_align_left ltx_framed_left" style="border-color: #000000;"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">5’ ― reverse sequence (reverse-complement) ― ACGTACGTACGTACGT ― forward sequence ― 3’</span></p>
</div>
<div id="Ch2.S3.SS3.SSSx2.p11" class="ltx_para">
<p class="ltx_p">Remove noisy and/or chimeric sequences in the same way as for concatenated overlapped paired-end sequence data.
In the sequence clustering by <span class="ltx_text ltx_font_typewriter">clclassseqv</span> and the mapping of raw reads to centroid sequences by <span class="ltx_text ltx_font_typewriter">clrecoverseqv</span>, add the <span class="ltx_text ltx_font_typewriter">--paddinglen=16</span> argument.
Sequences concatenated as above cause overestimation of sequence similarity because the artificial padding sequence <span class="ltx_text ltx_font_typewriter">ACGTACGTACGTACGT</span> is identical in every sequence.
The <span class="ltx_text ltx_font_typewriter">--paddinglen=16</span> argument offsets this overestimation by excluding <span class="ltx_text ltx_font_typewriter">ACGTACGTACGTACGT</span> from the sequence similarity calculation, so that concatenated sequences are clustered based on the correct sequence similarity.</p>
</div>
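<div class="ltx_para">
<p class="ltx_p">The effect of the padding can be illustrated with a small Python sketch. This is not Claident's implementation: the function names are hypothetical, and the position-by-position identity assumes equal-length, already aligned sequences, whereas real clustering relies on alignment.</p>
</div>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
PADDING = "ACGTACGTACGTACGT"  # 16-base padding described in the text

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse-complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def concat_non_overlapping(forward: str, reverse: str) -> str:
    """Join a read pair as revcomp(reverse) + padding + forward."""
    return reverse_complement(reverse) + PADDING + forward


def identity(a: str, b: str, skip: range = range(0)) -> float:
    """Fraction of matching positions, ignoring the positions in `skip`.

    Hypothetical helper: both sequences must have the same length.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    compared = [i for i in range(len(a)) if i not in skip]
    matches = sum(1 for i in compared if a[i] == b[i])
    return matches / len(compared)


if __name__ == "__main__":
    x = concat_non_overlapping("AAAAAAAA", "CCCCCCCC")
    y = concat_non_overlapping("AAAATTTT", "CCCCAAAA")
    pad = range(8, 8 + len(PADDING))  # the padding starts after 8 bases
    print(identity(x, y))        # 0.75, padding counted as matches
    print(identity(x, y, pad))   # 0.5, padding excluded
```
</pre>
<div class="ltx_para">
<p class="ltx_p">In this toy pair, half of the biological positions differ, but the 16 identical padding bases raise the apparent identity from 0.5 to 0.75. Excluding them restores the value based only on the reads.</p>
</div>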
<div id="Ch2.S3.SS3.SSSx2.p12" class="ltx_para">
<p class="ltx_p">To estimate host organisms, split the concatenated sequences at <span class="ltx_text ltx_font_typewriter">ACGTACGTACGTACGT</span> and assign taxonomy to the forward and reverse sequences separately.
Then, merge the two taxonomic assignments (see section <a href="#Ch7.S3" title="7.3 Merging multiple taxonomic assignments based on consensus ‣ Chapter 7 Estimation of host organisms of nucleotide sequences (a.k.a. DNA barcoding) ‣ Metabarcoding and DNA barcoding for Ecologists: Sequence analysis" class="ltx_ref"><span class="ltx_text ltx_ref_tag">7.3</span></a>).
Generally speaking, forward sequences show higher quality than reverse sequences, so preferring the taxonomy of the forward sequence is recommended if there is no <span class="ltx_text ltx_font_italic">a priori</span> information about the identification power and variability of the forward and reverse sequences.
The following command splits the sequences at <span class="ltx_text ltx_font_typewriter">ACGTACGTACGTACGT</span>.</p>
</div>
<div id="Ch2.S3.SS3.SSSx2.p13" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; cldivseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--query=ACGTACGTACGTACGT \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile2↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS3.SSSx2.p14" class="ltx_para">
<p class="ltx_p">The outputfile1 and outputfile2 contain the reverse-complements of the reverse sequences and the forward sequences, respectively.</p>
</div>
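<div class="ltx_para">
<p class="ltx_p">What the division amounts to can be sketched in Python as follows. This is only an illustration and not the code of <span class="ltx_text ltx_font_typewriter">cldivseq</span>: the function name is hypothetical, and splitting at the first occurrence of the padding is an assumption that fails if a read happens to contain the same 16 bases.</p>
</div>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
from typing import Tuple

PADDING = "ACGTACGTACGTACGT"  # padding inserted between the two reads


def split_at_padding(seq: str, padding: str = PADDING) -> Tuple[str, str]:
    """Split a concatenated sequence at the first occurrence of the padding.

    Returns (part before the padding, part after the padding).
    Raises ValueError if the padding is absent.
    """
    before, found, after = seq.partition(padding)
    if not found:
        raise ValueError("padding not found in sequence")
    return before, after


if __name__ == "__main__":
    revcomp_reverse, forward = split_at_padding("GGGG" + PADDING + "AAAA")
    print(revcomp_reverse)  # GGGG
    print(forward)          # AAAA
```
</pre>
<div class="ltx_para">
<p class="ltx_p">The first returned part corresponds to the content of outputfile1 and the second to that of outputfile2.</p>
</div>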
</section>
<section id="Ch2.S3.SS3.SSSx3" class="ltx_subsubsection">
<h5 class="ltx_title ltx_title_subsubsection">Concatenating overlapping paired-end sequences using PEAR</h5>

<div id="Ch2.S3.SS3.SSSx3.p1" class="ltx_para">
<p class="ltx_p">PEAR <cite class="ltx_cite ltx_citemacro_citep">(Zhang <span class="ltx_text ltx_font_italic">et al.</span>, <a href="#bib.bib19" title="" class="ltx_ref">2014</a>)</cite> can also be used for concatenation of overlapped paired-end sequences.
The following command will concatenate the pairs of sequences.</p>
</div>
<div id="Ch2.S3.SS3.SSSx3.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; pear \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">-p 0.0001 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">-u 0 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">-j NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">-f RunID__TagID__PrimerID.forward.fastq.gz \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">-r RunID__TagID__PrimerID.reverse.fastq.gz \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">-o RunID__TagID__PrimerID↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS3.SSSx3.p3" class="ltx_para">
<p class="ltx_p">If the process finishes successfully, the following files are generated.</p>
</div>
<div id="Ch2.S3.SS3.SSSx3.p4" class="ltx_para">
<dl id="Ch2.S3.I1" class="ltx_description">
<dt id="Ch2.S3.I1.ix1"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">RunID__TagID__PrimerID.assembled.fastq</span></span></dt>
<dd>
<div id="Ch2.S3.I1.ix1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Concatenated sequences.</span></p>
</div>
</dd>
<dt id="Ch2.S3.I1.ix2"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">RunID__TagID__PrimerID.unassembled.forward.fastq</span></span></dt>
<dd>
<div id="Ch2.S3.I1.ix2.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Unconcatenated forward sequences.</span></p>
</div>
</dd>
<dt id="Ch2.S3.I1.ix3"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">RunID__TagID__PrimerID.unassembled.reverse.fastq</span></span></dt>
<dd>
<div id="Ch2.S3.I1.ix3.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Unconcatenated reverse sequences.</span></p>
</div>
</dd>
<dt id="Ch2.S3.I1.ix4"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">RunID__TagID__PrimerID.discarded.fastq</span></span></dt>
<dd>
<div id="Ch2.S3.I1.ix4.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Sequences discarded by the statistical test.</span></p>
</div>
</dd>
</dl>
</div>
<div id="Ch2.S3.SS3.SSSx3.p5" class="ltx_para">
<p class="ltx_p">Only <span class="ltx_text ltx_font_typewriter">RunID__TagID__PrimerID.assembled.fastq</span> is required in subsequent procedures.
These output files are not compressed and consume a large amount of storage, so compressing them with GZIP or BZIP2 is recommended.</p>
</div>
<div id="Ch2.S3.SS3.SSSx3.p6" class="ltx_para">
<p class="ltx_p">To apply concatenation to all the pairs of forward and reverse sequence files by PEAR, execute the following command.</p>
</div>
<div id="Ch2.S3.SS3.SSSx3.p7" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; for f in `ls *.forward.fastq.gz | grep -P -o '^[^\.]+'`↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">do pear \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">-p 0.0001 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">-u 0 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">-j NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">-f $f.forward.fastq.gz \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">-r $f.reverse.fastq.gz \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">-o $f↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">done↓</span></span>
</span>
</div>
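<div class="ltx_para">
<p class="ltx_p">The loop above derives each file name prefix by taking everything before the first dot of the forward file name. The same pairing logic can be sketched in Python. The function names are hypothetical, and the sketch assumes that the RunID__TagID__PrimerID part of the file names contains no dot.</p>
</div>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
def sample_prefix(filename: str) -> str:
    """Return the part of the file name before the first dot."""
    return filename.split(".", 1)[0]


def read_pairs(filenames):
    """Yield (prefix, forward, reverse) for every complete read pair.

    A pair is complete when both prefix.forward.fastq.gz and
    prefix.reverse.fastq.gz are present in `filenames`.
    """
    names = set(filenames)
    for name in sorted(names):
        if not name.endswith(".forward.fastq.gz"):
            continue
        prefix = sample_prefix(name)
        forward = prefix + ".forward.fastq.gz"
        reverse = prefix + ".reverse.fastq.gz"
        if forward in names and reverse in names:
            yield prefix, forward, reverse


if __name__ == "__main__":
    import os

    for prefix, forward, reverse in read_pairs(os.listdir(".")):
        print(prefix, forward, reverse)
```
</pre>
<div class="ltx_para">
<p class="ltx_p">Unlike the shell loop, this sketch skips forward files whose reverse counterpart is missing, which may be preferable to letting the merging program fail on an incomplete pair.</p>
</div>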
</section>
<section id="Ch2.S3.SS3.SSSx4" class="ltx_subsubsection">
<h5 class="ltx_title ltx_title_subsubsection">Concatenating overlapping paired-end sequences using VSEARCH</h5>

<div id="Ch2.S3.SS3.SSSx4.p1" class="ltx_para">
<p class="ltx_p">VSEARCH can also be used directly for concatenation of overlapped paired-end sequences.
The following command will concatenate the pairs of sequences.</p>
</div>
<div id="Ch2.S3.SS3.SSSx4.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; vsearch \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--threads NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--fastq_mergepairs RunID__TagID__PrimerID.forward.fastq.gz \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--reverse RunID__TagID__PrimerID.reverse.fastq.gz \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--fastq_allowmergestagger \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--fastqout RunID__TagID__PrimerID.assembled.fastq↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS3.SSSx4.p3" class="ltx_para">
<p class="ltx_p">If the amplicon is shorter than the read length, the tail of the forward read may extend beyond the head of the reverse read, or the tail of the reverse read may extend beyond the head of the forward read.
VSEARCH does not concatenate such sequences by default.
The <span class="ltx_text ltx_font_typewriter">--fastq_allowmergestagger</span> argument enables concatenation of such sequences.
The overhanging positions that extend beyond the head of the other read are removed because such positions are artificial.
The <span class="ltx_text ltx_font_typewriter">--fastq_allowmergestagger</span> argument is not required if there are no such sequences.
PEAR performs the same processing as VSEARCH with <span class="ltx_text ltx_font_typewriter">--fastq_allowmergestagger</span>.
The unconcatenated forward and reverse sequences can be obtained with the <span class="ltx_text ltx_font_typewriter">--fastqout_notmerged_fwd outputfile</span> and <span class="ltx_text ltx_font_typewriter">--fastqout_notmerged_rev outputfile</span> arguments, respectively.
The minimum overlap length, the minimum length of the concatenated sequence, the maximum length of the concatenated sequence, the maximum number of allowed mismatches and the maximum allowed expected errors in the concatenated sequence can be specified by <span class="ltx_text ltx_font_typewriter">--fastq_minovlen</span>, <span class="ltx_text ltx_font_typewriter">--fastq_minmergelen</span>, <span class="ltx_text ltx_font_typewriter">--fastq_maxmergelen</span>, <span class="ltx_text ltx_font_typewriter">--fastq_maxdiffs</span> and <span class="ltx_text ltx_font_typewriter">--fastq_maxee</span>, respectively.
When overlapped paired-end sequences are concatenated by <span class="ltx_text ltx_font_typewriter">clconcatpair</span>, <span class="ltx_text ltx_font_typewriter">--fastq_minovlen 20 --fastq_maxdiffs 20</span> is used by default, whereas VSEARCH itself uses <span class="ltx_text ltx_font_typewriter">--fastq_minovlen 10 --fastq_maxdiffs 5</span> by default.</p>
</div>
<div id="Ch2.S3.SS3.SSSx4.p4" class="ltx_para">
<p class="ltx_p">To apply concatenation by VSEARCH to all the files in current folder, execute the following command.</p>
</div>
<div id="Ch2.S3.SS3.SSSx4.p5" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; for f in `ls *.forward.fastq.gz | grep -P -o '^[^\.]+'`↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">do vsearch \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--threads NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--fastq_mergepairs $f.forward.fastq.gz \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--reverse $f.reverse.fastq.gz \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--fastq_allowmergestagger \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--fastqout $f.assembled.fastq↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">done↓</span></span>
</span>
</div>
</section>
</section>
<section id="Ch2.S3.SS4" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.3.4 </span>Filtering potentially erroneous sequences</h4>

<div id="Ch2.S3.SS4.p1" class="ltx_para">
<p class="ltx_p">FASTQ files contain a quality value for every read position.
Therefore, we can filter out potentially erroneous sequences using these quality values.
The following command performs such filtering.</p>
</div>
<div id="Ch2.S3.SS4.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clfilterseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqual=30 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--maxplowqual=0.1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS4.p3" class="ltx_para">
<p class="ltx_p">The above command filters out sequences in which the proportion of positions with quality lower than <span class="ltx_text ltx_font_typewriter">--minqual</span> is <span class="ltx_text ltx_font_typewriter">--maxplowqual</span> or more.
The output is a file by default, but adding the <span class="ltx_text ltx_font_typewriter">--output=folder</span> argument saves the output as a file with the same name as the input in the output folder.
To save output files to an existing folder, the <span class="ltx_text ltx_font_typewriter">--append</span> argument is needed.
In concatenated overlapped paired-end sequences generated by Illumina platform sequencers, positions close to both ends are usually of high quality, and overlapped positions are also of high quality if the forward and reverse sequences match at the same positions.
Therefore, trimming low-quality positions close to both ends is unnecessary.
Instead, filtering out sequences containing low-quality positions is recommended for concatenated overlapped paired-end sequences.
Existing sequence filtering programs such as FastQC <cite class="ltx_cite ltx_citemacro_citep">(Andrews, <a href="#bib.bib2" title="" class="ltx_ref">2010</a>)</cite> or PRINSEQ <cite class="ltx_cite ltx_citemacro_citep">(Schmieder &amp; Edwards, <a href="#bib.bib16" title="" class="ltx_ref">2011</a>)</cite> are also recommended.</p>
</div>
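<div class="ltx_para">
<p class="ltx_p">The filtering rule described above can be written down as a short Python sketch. It reflects one reading of the description, not the source code of <span class="ltx_text ltx_font_typewriter">clfilterseq</span>: the function names are hypothetical, and the Phred+33 quality encoding is an assumption that holds for recent Illumina FASTQ files.</p>
</div>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
from typing import List


def phred_scores(quality_line: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def passes_filter(quality_line: str, minqual: int = 30,
                  maxplowqual: float = 0.1) -> bool:
    """Return True if the read should be kept.

    A read is discarded when the proportion of positions whose quality
    is below `minqual` is `maxplowqual` or more.
    """
    scores = phred_scores(quality_line)
    if not scores:
        return False  # an empty read carries no usable information
    low = sum(1 for q in scores if minqual > q)
    return maxplowqual > low / len(scores)


if __name__ == "__main__":
    print(passes_filter("I" * 20))        # True: every position is Q40
    print(passes_filter("I" * 9 + "#"))   # False: 1 of 10 positions is Q2
```
</pre>
<div class="ltx_para">
<p class="ltx_p">With the thresholds used in the command above, a 10-base read with a single low-quality position already reaches the rate of 0.1 and is removed, whereas a 20-base read with one such position (rate 0.05) is kept.</p>
</div>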
<div id="Ch2.S3.SS4.p4" class="ltx_para">
<p class="ltx_p">To apply the same processing to the concatenated sequences by PEAR, execute the following command.</p>
</div>
<div id="Ch2.S3.SS4.p5" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; for f in *.assembled.fastq↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">do clfilterseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--output=folder \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--append \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minqual=30 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--maxplowqual=0.1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">$f \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">done↓</span></span>
</span>
</div>
<div id="Ch2.S3.SS4.p6" class="ltx_para">
<p class="ltx_p">Using the quality values, we can calculate the expected number of read errors in each sequence.
VSEARCH can also apply quality-filtering based on the maximum allowed expected errors in the input sequences.
The following command applies such quality-filtering to the input sequences.</p>
</div>
<div id="Ch2.S3.SS4.p7" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; vsearch \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--threads NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--fastq_filter inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--fastq_maxee 1.0 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--fastqout outputfile↓</span></span>
</span>
</div>
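<div class="ltx_para">
<p class="ltx_p">The expected number of errors is the sum of the per-position error probabilities, where a Phred quality value Q corresponds to an error probability of 10<sup>-Q/10</sup>. The following Python sketch shows the calculation. The function names are hypothetical, the Phred+33 encoding is assumed, and the code is an illustration rather than VSEARCH's implementation.</p>
</div>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
def expected_errors(quality_line: str, offset: int = 33) -> float:
    """Sum of per-position error probabilities, p = 10 ** (-Q / 10)."""
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality_line)


def passes_maxee(quality_line: str, maxee: float = 1.0) -> bool:
    """Keep the read if its expected number of errors does not exceed maxee."""
    return maxee >= expected_errors(quality_line)


if __name__ == "__main__":
    good = "I" * 100  # 100 positions at Q40, p = 0.0001 each
    poor = "+" * 20   # 20 positions at Q10, p = 0.1 each
    print(passes_maxee(good))  # True, about 0.01 expected errors
    print(passes_maxee(poor))  # False, about 2 expected errors
```
</pre>
<div class="ltx_para">
<p class="ltx_p">Because the probabilities are summed over the whole read, longer reads accumulate more expected errors at the same per-position quality, so a fixed threshold such as 1.0 is stricter for long sequences.</p>
</div>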
<div id="Ch2.S3.SS4.p8" class="ltx_para">
<p class="ltx_p">For single-end sequence data and unconcatenated sequence data, the same quality-trimming and quality-filtering as for Roche GS series sequencers and Ion PGM are recommended (see section <a href="#Ch2.S2.SS3" title="2.2.3 Trimming low quality tail and filtering low quality sequences ‣ 2.2 For Roche GS series sequencers and Ion PGM ‣ Chapter 2 Preprocessing of nucleotide sequence data ‣ Metabarcoding and DNA barcoding for Ecologists: Sequence analysis" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2.2.3</span></a>).
Note that the thresholds for quality values and read lengths should be adjusted.</p>
</div>
</section>
</section>
<section id="Ch2.S4" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2.4 </span>If you sequenced the same PCR amplicons multiple times or replicated PCR amplicons of the same templates</h3>

<div id="Ch2.S4.p1" class="ltx_para">
<p class="ltx_p">Because chimeric sequences are formed within each PCR tube, the analysis programs need to know whether sequences came from the same tube or not.
Therefore, several steps are required to give such information to the programs when the same PCR amplicons were sequenced repeatedly or when replicated PCR amplicons of the same templates were sequenced.</p>
</div>
<section id="Ch2.S4.SS1" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.4.1 </span>In the case of sequencing of replicated PCR amplicons of same templates using same tags in the same run</h4>

<div id="Ch2.S4.SS1.p1" class="ltx_para">
<p class="ltx_p">In this case, chimera removal based on PCR replicates cannot be applied.
This case can be treated in the same way as unreplicated PCR.</p>
</div>
</section>
<section id="Ch2.S4.SS2" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.4.2 </span>In the case of sequencing of replicated PCR amplicons of same templates using different tags in the same run</h4>

<div id="Ch2.S4.SS2.p1" class="ltx_para">
<p class="ltx_p">In Claident, RunID__TagID__PrimerID is used as sample IDs.
Therefore, there are multiple samples from the same templates in this case.
It may be a good idea to treat sequences common to the replicated samples as noiseless and nonchimeric, and sequences not shared among them as noisy and/or chimeric, if noise occurrence and chimera formation can be assumed to be random.
However, noise occurrence and chimera formation are likely to be nonrandom.
Noisy and/or chimeric sequences might occur across all replicates.
The effectiveness of this method has not been sufficiently confirmed, so combining the removal of noisy and/or chimeric sequences based on PCR replicates with removal based on algorithms is recommended (see also <cite class="ltx_cite ltx_citemacro_citet">Lange <span class="ltx_text ltx_font_italic">et al.</span> (<a href="#bib.bib12" title="" class="ltx_ref">2015</a>)</cite>).
Claident supports such a combination, which is explained later.
One of the final outputs of Claident is a sample x OTU table containing the number of sequences in every cell.
The table modification command <span class="ltx_text ltx_font_typewriter">clfiltersum</span> can be used to integrate multiple samples into one, and the cell values will be summed up.</p>
</div>
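<div class="ltx_para">
<p class="ltx_p">The replicate-based idea and the final integration of samples can be sketched in Python as follows. This is a simplified illustration, not the procedure implemented in Claident: the function names are hypothetical, each replicate is represented as a mapping from sequence to read count, and a sequence is regarded as shared only if it appears in every replicate.</p>
</div>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
from collections import Counter


def shared_across_replicates(replicates):
    """Return the set of sequences observed in every replicate."""
    sets = [set(counts) for counts in replicates]
    return set.intersection(*sets) if sets else set()


def merge_replicates(replicates, keep=None):
    """Sum read counts over replicates, optionally restricted to `keep`."""
    total = Counter()
    for counts in replicates:
        for seq, n in counts.items():
            if keep is None or seq in keep:
                total[seq] += n
    return total


if __name__ == "__main__":
    rep1 = {"AAA": 10, "CCC": 1}
    rep2 = {"AAA": 7, "GGG": 2}
    shared = shared_across_replicates([rep1, rep2])
    print(shared)                                  # {'AAA'}
    print(merge_replicates([rep1, rep2], shared))  # Counter({'AAA': 17})
```
</pre>
<div class="ltx_para">
<p class="ltx_p">As noted above, a noisy or chimeric sequence that arises in every replicate passes this intersection, which is why the replicate-based criterion alone is not considered sufficient.</p>
</div>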
</section>
<section id="Ch2.S4.SS3" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.4.3 </span>In the case of sequencing of same PCR amplicons using same tags in different runs</h4>

<div id="Ch2.S4.SS3.p1" class="ltx_para">
<p class="ltx_p">In Claident, RunID__TagID__PrimerID is used as sample IDs.
Therefore, there are multiple samples from the same templates in this case.
Such samples can be treated as separate samples or integrated into one sample.
It is recommended to use the separate samples in subsequent analyses without any change, to give all filtered and denoised sequences of the multiple runs to the clustering commands, and finally to integrate such samples in the sample x OTU table.
The table generation command <span class="ltx_text ltx_font_typewriter">clsumclass</span> with the <span class="ltx_text ltx_font_typewriter">--runname</span> argument or the table modification command <span class="ltx_text ltx_font_typewriter">clfiltersum</span> with the <span class="ltx_text ltx_font_typewriter">--runname</span> argument can be used to replace RunIDs and to integrate multiple samples into one, and the cell values will be summed up.</p>
</div>
<div id="Ch2.S4.SS3.p2" class="ltx_para">
<p class="ltx_p">If you want to integrate multiple samples from the same templates at the demultiplexing stage instead, execute the following commands.</p>
</div>
<div id="Ch2.S4.SS3.p3" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clsplitseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">*snip* \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--runname=FOO \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clsplitseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">*snip* \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--runname=FOO \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--append \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile2 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch2.S4.SS3.p4" class="ltx_para">
<p class="ltx_p">The inputfile1 and inputfile2 are the FASTQ files of the first and second runs, respectively.
The <span class="ltx_text ltx_font_typewriter">--runname</span> argument replaces the RunID in the sequence names with <span class="ltx_text ltx_font_typewriter">FOO</span>.
Thus, all the sequences of both input files will be saved to <span class="ltx_text ltx_font_typewriter">FOO__TagID__PrimerID.fastq.gz</span> in the output folder.</p>
</div>
<div id="Ch2.S4.SS3.p5" class="ltx_para">
<p class="ltx_p">In Claident, sequences whose names contain the same RunID, TagID and PrimerID are treated as sequences from the same sample.
If the tag and primer sequence files are the same, the TagID and PrimerID of the output sequences are the same.
By replacing the RunID with <span class="ltx_text ltx_font_typewriter">FOO</span>, the RunID, TagID and PrimerID all become the same in the sequences from the same sample.</p>
</div>
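<div class="ltx_para">
<p class="ltx_p">How replacing the RunID makes samples from different runs coincide can be shown with a short Python sketch. The function names and the example IDs are hypothetical, and the sketch assumes that the three parts of a sample ID themselves contain no double underscore.</p>
</div>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
from collections import defaultdict


def rename_run(sample_id: str, runname: str) -> str:
    """Replace the RunID part of a RunID__TagID__PrimerID sample ID."""
    run_id, tag_id, primer_id = sample_id.split("__")
    return "__".join([runname, tag_id, primer_id])


def pool_by_sample(sample_ids, runname: str):
    """Group the original sample IDs by the ID obtained after renaming."""
    pooled = defaultdict(list)
    for sample_id in sample_ids:
        pooled[rename_run(sample_id, runname)].append(sample_id)
    return dict(pooled)


if __name__ == "__main__":
    ids = ["Run1__TagA__PrimerX", "Run2__TagA__PrimerX", "Run1__TagB__PrimerX"]
    for new_id, originals in pool_by_sample(ids, "FOO").items():
        print(new_id, originals)
```
</pre>
<div class="ltx_para">
<p class="ltx_p">In this example, the two runs of TagA collapse into the single sample ID FOO__TagA__PrimerX, whereas TagB remains a separate sample.</p>
</div>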
</section>
<section id="Ch2.S4.SS4" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.4.4 </span>In the case of sequencing of replicated PCR amplicons of same templates using same tags in different runs</h4>

<div id="Ch2.S4.SS4.p1" class="ltx_para">
<p class="ltx_p">You should have multiple FASTQ files.
Run demultiplexing by <span class="ltx_text ltx_font_typewriter">clsplitseq</span> and quality-trimming and quality-filtering by <span class="ltx_text ltx_font_typewriter">clfilterseq</span> separately, and save processed sequences to different folders.
After separate processing of noisy and/or chimeric sequence removal, give all processed sequence files to clustering programs at once.
One of the final outputs of Claident is a sample x OTU table containing the number of sequences in every cell.
The table modification command <span class="ltx_text ltx_font_typewriter">clfiltersum</span> can be used to integrate multiple samples into one, and the cell values will be summed up.</p>
</div>
</section>
<section id="Ch2.S4.SS5" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.4.5 </span>In the case of sequencing of replicated PCR amplicons of same templates using different tags in different runs</h4>

<div id="Ch2.S4.SS5.p1" class="ltx_para">
<p class="ltx_p">In Claident, RunID__TagID__PrimerID is used as sample IDs.
Therefore, there are multiple replicated samples from the same templates in this case.
Such samples can be treated as separate samples or integrated into one sample.
It is recommended to use the separate samples in subsequent analyses without any change and finally to integrate such samples in the sample x OTU table.
It may be a good idea to treat sequences common to the replicated samples as noiseless and nonchimeric, and sequences not shared among them as noisy and/or chimeric, if noise occurrence and chimera formation can be assumed to be random.
Noisy and/or chimeric sequences might occur across all replicates.
The effectiveness of this method has not been sufficiently confirmed, so combining the removal of noisy and/or chimeric sequences based on PCR replicates with removal based on algorithms is recommended (see also <cite class="ltx_cite ltx_citemacro_citet">Lange <span class="ltx_text ltx_font_italic">et al.</span> (<a href="#bib.bib12" title="" class="ltx_ref">2015</a>)</cite>).
Claident supports such a combination, which is explained later.
One of the final outputs of Claident is a sample x OTU table containing the number of sequences in every cell.
The table modification command <span class="ltx_text ltx_font_typewriter">clfiltersum</span> can be used to integrate multiple samples into one, and the cell values will be summed up.</p>
</div>
<div id="Ch2.S4.SS5.p2" class="ltx_para">
<p class="ltx_p">In the demultiplexing using <span class="ltx_text ltx_font_typewriter">clsplitseq</span>, process multiple FASTQ files like the following.</p>
</div>
<div id="Ch2.S4.SS5.p3" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clsplitseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">*snip* \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--tagfile=TagSequenceFile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--primerfile=PrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clsplitseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">*snip* \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--tagfile=TagSequenceFile2 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--primerfile=PrimerSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--append \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile2 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch2.S4.SS5.p4" class="ltx_para">
<p class="ltx_p">The primer sequence file and the output folder should be the same in both commands.
A different RunID should be given to the <span class="ltx_text ltx_font_typewriter">--runname</span> argument for each run.
Different tag sequence files should be prepared and given.
Note that the same TagID should be specified for replicated samples from the same template even if the tag/index sequences differ.
In addition, different TagIDs should be specified for samples from different templates even if the tag/index sequence is the same.
In the case of the following tag sequence files, <span class="ltx_text ltx_font_typewriter">ACGTACGT</span> was added to the sequences of sample1 as a tag in the first run, and <span class="ltx_text ltx_font_typewriter">ATGCATGC</span> in the second run.</p>
</div>
<div id="Ch2.S4.SS5.p5" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| &gt;sample1</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| ACGTACGT</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| &gt;sample2</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| ATGCATGC</span></span>
</p>
</div>
<div id="Ch2.S4.SS5.p6" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| &gt;sample1</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| ATGCATGC</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| &gt;sample2</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| ACGTACGT</span></span>
</p>
</div>
<div id="Ch2.S4.SS5.p7" class="ltx_para">
<p class="ltx_p">The sequences of sample2 were tagged with <span class="ltx_text ltx_font_typewriter">ATGCATGC</span> in the first run and with <span class="ltx_text ltx_font_typewriter">ACGTACGT</span> in the second run.
The subsequent analysis is explained later.</p>
</div>
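<div class="ltx_para">
<p class="ltx_p">The relationship between TagIDs and tag sequences across the two runs can be checked with a short script.
The following is only a minimal Python sketch and is not part of Claident; the function names <span class="ltx_text ltx_font_typewriter">parse_tag_fasta</span> and <span class="ltx_text ltx_font_typewriter">tags_per_sample</span> are hypothetical, and the two strings simply reproduce the tag sequence files shown above.</p>
</div>

```python
def parse_tag_fasta(text):
    """Return {TagID: tag sequence} from FASTA-formatted tag file content."""
    tags = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            tags[name] = ""
        elif name is not None:
            tags[name] += line
    return tags


# Contents of the two tag sequence files shown above (first and second run).
RUN1 = ">sample1\nACGTACGT\n>sample2\nATGCATGC\n"
RUN2 = ">sample1\nATGCATGC\n>sample2\nACGTACGT\n"


def tags_per_sample(*runs):
    """Return {TagID: [tag in first run, tag in second run, ...]}."""
    table = {}
    for run in runs:
        for tag_id, seq in parse_tag_fasta(run).items():
            table.setdefault(tag_id, []).append(seq)
    return table


TABLE = tags_per_sample(RUN1, RUN2)
for tag_id, seqs in sorted(TABLE.items()):
    print(tag_id, seqs)
```

<div class="ltx_para">
<p class="ltx_p">Each TagID is listed once with the tag it carried in each run, which makes it easy to confirm that the TagID follows the template and not the tag sequence.</p>
</div>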
</section>
</section>
</section>
<section id="Ch3" class="ltx_chapter">
<h2 class="ltx_title ltx_title_chapter">
<span class="ltx_tag ltx_tag_chapter">Chapter 3 </span>Noisy and/or chimeric sequence removal</h2>

<div id="Ch3.p1" class="ltx_para">
<p class="ltx_p">Claident can detect noisy sequences containing read errors and/or copy errors based on sequence abundance.
This method is similar to the method implemented in CD-HIT-OTU <cite class="ltx_cite ltx_citemacro_citep">(Li <span class="ltx_text ltx_font_italic">et al.</span>, <a href="#bib.bib14" title="" class="ltx_ref">2012</a>)</cite>.
In the old pipeline, which uses Assams for dereplication and clustering, chimera removal based on the UCHIME <cite class="ltx_cite ltx_citemacro_citep">(Edgar <span class="ltx_text ltx_font_italic">et al.</span>, <a href="#bib.bib5" title="" class="ltx_ref">2011</a>)</cite> algorithm can be applied.
In the new pipeline, which uses VSEARCH for dereplication and clustering, chimera removal based on the UCHIME <cite class="ltx_cite ltx_citemacro_citep">(Edgar <span class="ltx_text ltx_font_italic">et al.</span>, <a href="#bib.bib5" title="" class="ltx_ref">2011</a>)</cite> algorithm is applied after OTU picking.</p>
</div>
<div id="Ch3.p2" class="ltx_para">
<p class="ltx_p">Run the following command to detect and remove noisy sequences.
Multiple input files can be given, and all sequence files from the same run should be given at once because sequencing quality varies among runs.
If there are so many sequences that processing takes a long time, give one sequence file at a time and run the command once for each file.</p>
</div>
<div id="Ch3.p3" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clcleanseqv \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--derepmode=PREFIX \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--primarymaxnmismatch=0 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--secondarymaxnmismatch=1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--pnoisycluster=0.5 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">*snip* \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfileN \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch3.p4" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">--derepmode</span> argument specifies whether full-length perfect matching (<span class="ltx_text ltx_font_typewriter">FULLLENGTH</span>) or prefix search (<span class="ltx_text ltx_font_typewriter">PREFIX</span>) is applied in dereplication.
<span class="ltx_text ltx_font_typewriter">FULLLENGTH</span> is recommended for concatenated sequences from overlapped paired-end sequencing.
<span class="ltx_text ltx_font_typewriter">PREFIX</span> is recommended for single-end sequences or concatenated sequences from non-overlapped paired-end sequencing.
The <span class="ltx_text ltx_font_typewriter">--primarymaxnmismatch</span> argument sets the number of mismatches in primary clustering; 0 is recommended in most cases.
The <span class="ltx_text ltx_font_typewriter">--secondarymaxnmismatch</span> argument sets the number of mismatches in secondary clustering; 1 is recommended in most cases.
For noisy data, use <span class="ltx_text ltx_font_typewriter">--primarymaxnmismatch=1 --secondarymaxnmismatch=3</span> or <span class="ltx_text ltx_font_typewriter">--primarymaxnmismatch=2 --secondarymaxnmismatch=5</span>.
In general, set <span class="ltx_text ltx_font_typewriter">--secondarymaxnmismatch</span> to twice the value of <span class="ltx_text ltx_font_typewriter">--primarymaxnmismatch</span> plus one.
The <span class="ltx_text ltx_font_typewriter">--pnoisycluster</span> argument determines the sensitivity of noise detection.
A decimal value larger than 0 and smaller than 1 must be specified for this argument.
A larger value gives higher sensitivity.
If you use a 97% identity cutoff in clustering, 0.5 is recommended; this is the default value.
If you use an identity cutoff of 99% or higher in clustering, 0.9 or a larger value might be more suitable.
Because a larger value excludes more low-abundance sequences, more sequences should be obtained per sample when a larger value is used.</p>
</div>
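<div class="ltx_para">
<p class="ltx_p">The rule of thumb for the two mismatch arguments and the valid range of <span class="ltx_text ltx_font_typewriter">--pnoisycluster</span> can be written down as a small check.
This is only an illustrative Python sketch under the rules stated above; <span class="ltx_text ltx_font_typewriter">recommended_secondary</span> and <span class="ltx_text ltx_font_typewriter">check_pnoisycluster</span> are hypothetical helper names and are not part of Claident.</p>
</div>

```python
def recommended_secondary(primary):
    """Return twice the primary mismatch count plus one."""
    if primary < 0:
        raise ValueError("the number of mismatches must not be negative")
    return 2 * primary + 1


def check_pnoisycluster(value):
    """Return value if it is larger than 0 and smaller than 1, else raise."""
    if not 0 < value < 1:
        raise ValueError("--pnoisycluster must be larger than 0 and smaller than 1")
    return value


# The three combinations mentioned in the text: 0/1, 1/3 and 2/5.
for primary in (0, 1, 2):
    print("--primarymaxnmismatch=%d --secondarymaxnmismatch=%d"
          % (primary, recommended_secondary(primary)))
```

<div class="ltx_para">
<p class="ltx_p">The loop reproduces the three argument pairs given in the text, so the sketch can serve as a quick sanity check before a command line is assembled.</p>
</div>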
<div id="Ch3.p5" class="ltx_para">
<p class="ltx_p">The following files should be saved in the output folder.</p>
</div>
<div id="Ch3.p6" class="ltx_para">
<dl id="Ch3.S4.I1" class="ltx_description">
<dt id="Ch3.S4.I1.ix5"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">parameter.txt</span></span></dt>
<dd>
<div id="Ch3.S4.I1.ix5.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">The minimum size of the primary clusters that were retained</span></p>
</div>
</dd>
<dt id="Ch3.S4.I1.ix6"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">primarycluster.denoised.fasta.gz</span></span></dt>
<dd>
<div id="Ch3.S4.I1.ix6.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Representative sequences of primary clusters determined as non-noisy</span></p>
</div>
</dd>
<dt id="Ch3.S4.I1.ix7"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">primarycluster.fasta.gz</span></span></dt>
<dd>
<div id="Ch3.S4.I1.ix7.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Representative sequences of primary clusters</span></p>
</div>
</dd>
<dt id="Ch3.S4.I1.ix8"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">secondarycluster.fasta.gz</span></span></dt>
<dd>
<div id="Ch3.S4.I1.ix8.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Representative sequences of secondary clusters</span></p>
</div>
</dd>
<dt id="Ch3.S4.I1.ix9"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">RunID__TagID__PrimerID.noisyreads.txt.gz</span></span></dt>
<dd>
<div id="Ch3.S4.I1.ix9.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">List of sequences determined as noisy</span></p>
</div>
</dd>
<dt id="Ch3.S4.I1.ix10"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">RunID__TagID__PrimerID.singletons.txt.gz</span></span></dt>
<dd>
<div id="Ch3.S4.I1.ix10.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">List of singletons after primary clustering</span></p>
</div>
</dd>
</dl>
</div>
<div id="Ch3.p7" class="ltx_para">
<p class="ltx_p">Many other files may also be generated. Do not delete them, because they may be required in subsequent analyses.</p>
</div>
<section id="Ch3.S1" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">3.1 </span>Noisy and/or chimeric sequence removal based on PCR replicates</h3>

<div id="Ch3.S1.p1" class="ltx_para">
<p class="ltx_p">To detect and remove noisy and/or chimeric sequences using PCR replicates, you need to provide a text file, like the one below, that states which samples are derived from the same template.</p>
</div>
<div id="Ch3.S1.p2" class="ltx_para">
<p class="ltx_p ltx_align_left" style="background-color:#E6E6E6;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| sample1  sample2  sample3</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| sample4  sample5</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">| sample6  sample7</span></span>
</p>
</div>
<div id="Ch3.S1.p3" class="ltx_para">
<p class="ltx_p">Each line is a tab-delimited list of samples derived from the same template.
Samples from different templates must be placed on different lines.
Three or more replicates per template are allowed.
The number of replicates can vary among templates.</p>
</div>
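<div class="ltx_para">
<p class="ltx_p">Before the replicate list is passed to Claident, its layout can be checked with a few lines of code.
The following Python sketch is an assumption-laden illustration and not part of Claident: the function names are hypothetical, and treating a sample name that appears more than once as an error is our own reading of the rule that samples from different templates belong on different lines.</p>
</div>

```python
def parse_replicate_list(text):
    """Return one list of sample names per non-empty line."""
    groups = []
    for line in text.splitlines():
        names = [name.strip() for name in line.split("\t") if name.strip()]
        if names:
            groups.append(names)
    return groups


def check_replicate_groups(groups):
    """Raise ValueError if a sample name is listed more than once (assumption)."""
    seen = set()
    for group in groups:
        for name in group:
            if name in seen:
                raise ValueError("sample listed more than once: " + name)
            seen.add(name)


# The example replicate list shown above, with tab-delimited sample names.
EXAMPLE = "sample1\tsample2\tsample3\nsample4\tsample5\nsample6\tsample7\n"
GROUPS = parse_replicate_list(EXAMPLE)
check_replicate_groups(GROUPS)
for group in GROUPS:
    print(len(group), group)
```

<div class="ltx_para">
<p class="ltx_p">Each printed line corresponds to one template and shows how many replicates it has, which also illustrates that the number of replicates may differ among templates.</p>
</div>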
<div id="Ch3.S1.p4" class="ltx_para">
<p class="ltx_p">Once the above file has been prepared, the following command removes primary clusters that are not shared among replicates as noisy and/or chimeric.</p>
</div>
<div id="Ch3.S1.p5" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clcleanseqv \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--replicatelist=ListOfPCRreplicates \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--derepmode=PREFIX \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--primarymaxnmismatch=0 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--secondarymaxnmismatch=1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--pnoisycluster=0.5 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">*snip* \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfileN \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch3.S1.p6" class="ltx_para">
<p class="ltx_p">If 3 or more replicates are available, only sequences common to all replicates are judged to be non-noisy and nonchimeric.
If a primary cluster occurred in multiple templates and was judged to be noisy or chimeric in one template, it is also judged to be noisy or chimeric in the other templates and is excluded from all samples by default.
However, the following arguments can change this decision method.</p>
</div>
<div id="Ch3.S1.p7" class="ltx_para">
<dl id="Ch3.S1.I1" class="ltx_description">
<dt id="Ch3.S1.I1.ix11"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--minnreplicate</span></span></dt>
<dd>
<div id="Ch3.S1.I1.ix11.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify an integer larger than 1.
If the number of replicates in which a primary cluster occurred is equal to or larger than this value, the cluster is judged to be non-noisy and nonchimeric.
The default value is </span><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">2</span><span class="ltx_text" style="font-size:100%;">.</span></p>
</div>
</dd>
<dt id="Ch3.S1.I1.ix12"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--minpreplicate</span></span></dt>
<dd>
<div id="Ch3.S1.I1.ix12.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify a decimal larger than 0.
If the proportion of replicates in which a primary cluster occurred (number of replicates in which it occurred / total number of replicates of the same template) is equal to or larger than this value, the cluster is judged to be non-noisy and nonchimeric.
The default value is </span><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">1</span><span class="ltx_text" style="font-size:100%;">.</span></p>
</div>
</dd>
<dt id="Ch3.S1.I1.ix13"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--minnpositive</span></span></dt>
<dd>
<div id="Ch3.S1.I1.ix13.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify an integer larger than 0.
If the number of sequences of a primary cluster that were judged to be noisy or chimeric is equal to or larger than this value, the cluster is judged to be noisy or chimeric in all samples.
The default value is </span><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">1</span><span class="ltx_text" style="font-size:100%;">.</span></p>
</div>
</dd>
<dt id="Ch3.S1.I1.ix14"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--minppositive</span></span></dt>
<dd>
<div id="Ch3.S1.I1.ix14.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify a decimal equal to or larger than 0.
If the proportion of sequences of a primary cluster that were judged to be noisy or chimeric (number of sequences of the cluster judged to be noisy or chimeric / total number of sequences of the cluster) is equal to or larger than this value, the cluster is judged to be noisy or chimeric in all samples.
The default value is </span><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">0</span><span class="ltx_text" style="font-size:100%;">.</span></p>
</div>
</dd>
<dt id="Ch3.S1.I1.ix15"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--runname</span></span></dt>
<dd>
<div id="Ch3.S1.I1.ix15.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify the RunID.
The RunIDs in all sample names are replaced with this RunID.
If multiple sample names become identical as a result, those samples are merged.</span></p>
</div>
</dd>
</dl>
</div>
<div id="Ch3.S1.p8" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">--minnreplicate</span> and <span class="ltx_text ltx_font_typewriter">--minpreplicate</span> arguments control the intrasample judgement.
The <span class="ltx_text ltx_font_typewriter">--minnpositive</span> and <span class="ltx_text ltx_font_typewriter">--minppositive</span> arguments control the intersample judgement.
If both <span class="ltx_text ltx_font_typewriter">--minnreplicate</span> and <span class="ltx_text ltx_font_typewriter">--minpreplicate</span> are fulfilled, the primary cluster is judged to be non-noisy and nonchimeric.
If both <span class="ltx_text ltx_font_typewriter">--minnpositive</span> and <span class="ltx_text ltx_font_typewriter">--minppositive</span> are fulfilled, the primary cluster is judged to be noisy or chimeric.
The decision is shared by all samples; different decisions for different samples are not allowed.
Samples that are not written in the replicate list file are not used in this decision.</p>
</div>
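<div class="ltx_para">
<p class="ltx_p">The two-level judgement described above can be illustrated for a single primary cluster.
The following Python sketch is our own reading of the text and not Claident's actual implementation: the function name <span class="ltx_text ltx_font_typewriter">judge_cluster</span> and the layout of <span class="ltx_text ltx_font_typewriter">counts</span> are hypothetical, and we assume that the "sequences determined as noisy or chimeric" are the reads of the cluster in templates that failed the intrasample judgement.</p>
</div>

```python
def judge_cluster(counts, minnreplicate=2, minpreplicate=1.0,
                  minnpositive=1, minppositive=0.0):
    """Return True if the cluster is kept, False if it is rejected.

    counts maps a template name to the read counts of this cluster in
    each PCR replicate of that template, e.g. {"t1": [12, 9, 0]}.
    """
    total_reads = 0
    rejected_reads = 0
    for replicate_counts in counts.values():
        reads = sum(replicate_counts)
        if reads == 0:
            continue  # the cluster does not occur in this template
        n_occurred = sum(1 for c in replicate_counts if c > 0)
        p_occurred = n_occurred / len(replicate_counts)
        # Intrasample judgement: both conditions must be fulfilled.
        passed = n_occurred >= minnreplicate and p_occurred >= minpreplicate
        total_reads += reads
        if not passed:
            rejected_reads += reads  # assumption: counted in reads
    if total_reads == 0:
        return False  # arbitrary choice for a cluster that never occurs
    # Intersample judgement: both conditions must be fulfilled to reject.
    rejected = (rejected_reads >= minnpositive
                and rejected_reads / total_reads >= minppositive)
    return not rejected


# Template t2 has the cluster in only one of its two replicates.
print(judge_cluster({"t1": [10, 8, 5], "t2": [3, 0]}))
print(judge_cluster({"t1": [10, 8, 5], "t2": [3, 0]}, minppositive=0.5))
```

<div class="ltx_para">
<p class="ltx_p">With the default values, a single failing template is enough to reject the cluster in all samples; raising the hypothetical <span class="ltx_text ltx_font_typewriter">minppositive</span> parameter in the second call keeps the cluster because only a small proportion of its reads failed.</p>
</div>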
<div id="Ch3.S1.p9" class="ltx_para">
<p class="ltx_p">When <span class="ltx_text ltx_font_typewriter">clcleanseqv</span> is executed with a replicate list as described above, the following files are additionally saved.</p>
</div>
<div id="Ch3.S1.p10" class="ltx_para">
<dl id="Ch3.S1.I2" class="ltx_description">
<dt id="Ch3.S1.I2.ix16"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">primarycluster.chimeraremoved.fasta.gz</span></span></dt>
<dd>
<div id="Ch3.S1.I2.ix16.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Representative sequences determined as nonchimeric.</span></p>
</div>
</dd>
<dt id="Ch3.S1.I2.ix17"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">primarycluster.cleaned.fasta.gz</span></span></dt>
<dd>
<div id="Ch3.S1.I2.ix17.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Representative sequences determined as non-noisy and nonchimeric</span></p>
</div>
</dd>
<dt id="Ch3.S1.I2.ix18"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">RunID__TagID__PrimerID.chimericreads.txt.gz</span></span></dt>
<dd>
<div id="Ch3.S1.I2.ix18.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">List of sequences determined as chimeric</span></p>
</div>
</dd>
</dl>
</div>
</section>
</section>
<section id="Ch4" class="ltx_chapter">
<h2 class="ltx_title ltx_title_chapter">
<span class="ltx_tag ltx_tag_chapter">Chapter 4 </span>OTU picking based on nucleotide sequence clustering</h2>

<section id="Ch4.S1" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">4.1 </span>Inter-sample clustering</h3>

<div id="Ch4.S1.p1" class="ltx_para">
<p class="ltx_p">To pick OTUs by clustering, run the following command.</p>
</div>
<div id="Ch4.S1.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clclassseqv \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minident=0.97 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">*snip* \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfileN \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch4.S1.p3" class="ltx_para">
<p class="ltx_p">As input files, give <span class="ltx_text ltx_font_typewriter">primarycluster.cleaned.fasta.gz</span> (if you applied noisy and/or chimeric sequence removal using PCR replicates) or <span class="ltx_text ltx_font_typewriter">primarycluster.denoised.fasta.gz</span> (if you did not apply noisy and/or chimeric sequence removal using PCR replicates), both of which are generated by <span class="ltx_text ltx_font_typewriter">clcleanseqv</span>.
If you are using non-overlapped paired-end sequence data concatenated by <span class="ltx_text ltx_font_typewriter">clconcatpair</span>, additionally give the <span class="ltx_text ltx_font_typewriter">--paddinglen=16</span> argument.</p>
</div>
<div id="Ch4.S1.p4" class="ltx_para">
<p class="ltx_p">In the output folder, the following files should be saved.</p>
</div>
<div id="Ch4.S1.p5" class="ltx_para">
<dl id="Ch4.S1.I1" class="ltx_description">
<dt id="Ch4.S1.I1.ix19"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">clustered.otu.gz</span></span></dt>
<dd>
<div id="Ch4.S1.I1.ix19.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Compressed file which records affiliation of raw sequences to OTUs</span></p>
</div>
</dd>
<dt id="Ch4.S1.I1.ix20"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">clustered.fasta</span></span></dt>
<dd>
<div id="Ch4.S1.I1.ix20.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Representative sequences of OTUs</span></p>
</div>
</dd>
</dl>
</div>
</section>
<section id="Ch4.S2" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">4.2 </span>Mapping raw sequencing reads to representative sequences of the clusters</h3>

<div id="Ch4.S2.p1" class="ltx_para">
<p class="ltx_p">Raw sequences that were judged to be noisy or chimeric can be recovered in this step if their identity to a representative sequence is equal to or higher than the specified identity threshold.
This process reduces the number of excluded sequences.</p>
</div>
<div id="Ch4.S2.p2" class="ltx_para">
<p class="ltx_p">To perform this process, run the following command.</p>
</div>
<div id="Ch4.S2.p3" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clrecoverseqv \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minident=0.97 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--centroid=RepresentativeSequenceFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">*snip* \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfileN \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch4.S2.p4" class="ltx_para">
<p class="ltx_p">Specify <span class="ltx_text ltx_font_typewriter">clustered.fasta</span> generated by <span class="ltx_text ltx_font_typewriter">clclassseqv</span> as the representative sequence file and <span class="ltx_text ltx_font_typewriter">primarycluster.fasta.gz</span> generated by <span class="ltx_text ltx_font_typewriter">clcleanseqv</span> as the input file.
If you are using non-overlapped paired-end sequence data concatenated by <span class="ltx_text ltx_font_typewriter">clconcatpair</span>, additionally give the <span class="ltx_text ltx_font_typewriter">--paddinglen=16</span> argument.
The same output files as those of <span class="ltx_text ltx_font_typewriter">clclassseqv</span> are generated in the output folder.</p>
</div>
</section>
</section>
<section id="Ch5" class="ltx_chapter">
<h2 class="ltx_title ltx_title_chapter">
<span class="ltx_tag ltx_tag_chapter">Chapter 5 </span>Summarizing and post-processing of OTU picking results</h2>

<section id="Ch5.S1" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">5.1 </span>Making summary table</h3>

<div id="Ch5.S1.p1" class="ltx_para">
<p class="ltx_p">The following command generates a sample × OTU table from the OTU picking results; each cell contains the number of sequences.</p>
</div>
<div id="Ch5.S1.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clsumclass \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--output=Matrix \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
<div id="Ch5.S1.p3" class="ltx_para">
<p class="ltx_p">Give <span class="ltx_text ltx_font_typewriter">clustered.otu.gz</span> generated by <span class="ltx_text ltx_font_typewriter">clclassseqv</span> or <span class="ltx_text ltx_font_typewriter">clrecoverseqv</span> as the input file.
The output file is a tab-delimited text file like Table <a href="#Ch5.T1" title="Table 5.1 ‣ 5.1 Making summary table ‣ Chapter 5 Summarizing and post-processing of OTU picking results ‣ Metabarcoding and DNA barcoding for Ecologists: Sequence analysis" class="ltx_ref"><span class="ltx_text ltx_ref_tag">5.1</span></a> and can be edited with spreadsheet software such as Microsoft Excel.
Note that spreadsheet software may be unable to open very large tables.
This file can be used for community ecological analysis in R.</p>
</div>
<figure id="Ch5.T1" class="ltx_table">
<table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle">
<thead class="ltx_thead">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">samplename</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column"><span class="ltx_text" style="font-size:100%;">OTU1</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column"><span class="ltx_text" style="font-size:100%;">OTU2</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column"><span class="ltx_text" style="font-size:100%;">OTU3</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column"><span class="ltx_text" style="font-size:100%;">OTU4</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column"><span class="ltx_text" style="font-size:100%;">OTU5</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column"><span class="ltx_text" style="font-size:100%;">OTU6</span></th>
</tr>
</thead>
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r ltx_border_tt"><span class="ltx_text" style="font-size:100%;">sampleA</span></th>
<td class="ltx_td ltx_align_right ltx_border_tt"><span class="ltx_text" style="font-size:100%;">2371</span></td>
<td class="ltx_td ltx_align_right ltx_border_tt"><span class="ltx_text" style="font-size:100%;">0</span></td>
<td class="ltx_td ltx_align_right ltx_border_tt"><span class="ltx_text" style="font-size:100%;">0</span></td>
<td class="ltx_td ltx_align_right ltx_border_tt"><span class="ltx_text" style="font-size:100%;">12</span></td>
<td class="ltx_td ltx_align_right ltx_border_tt"><span class="ltx_text" style="font-size:100%;">3</span></td>
<td class="ltx_td ltx_align_right ltx_border_tt"><span class="ltx_text" style="font-size:100%;">0</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">sampleB</span></th>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">0</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">1518</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">0</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">25</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">0</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">1</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">sampleC</span></th>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">1398</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">0</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">0</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">8</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">77</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">6</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">sampleD</span></th>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">0</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">1436</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">0</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">10</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">0</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">0</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">sampleE</span></th>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">0</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">0</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">1360</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">0</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">15</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">3</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">sampleF</span></th>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">0</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">0</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">977</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">55</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">6</span></td>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:100%;">8</span></td>
</tr>
</tbody>
</table>
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table  5.1: </span>An example of a summary table. Numbers in cells indicate the observed numbers of raw sequences.</figcaption>
</figure>
</section>
<section id="Ch5.S2" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">5.2 </span>Excluding specified OTUs and/or samples from summary table</h3>

<div id="Ch5.S2.p1" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">clsumclass</span> command outputs all OTUs and samples to the summary table.
The following command filters samples and/or OTUs according to the specified conditions.</p>
</div>
<div id="Ch5.S2.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clfiltersum \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">arguments \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
<div id="Ch5.S2.p3" class="ltx_para">
<p class="ltx_p">Both the input file and the output file are tab-delimited text files containing summary tables.
The accepted arguments are listed below.</p>
</div>
<div id="Ch5.S2.p4" class="ltx_para">
<dl id="Ch5.S2.I1" class="ltx_description">
<dt id="Ch5.S2.I1.ix21"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--minnseqotu</span></span></dt>
<dd>
<div id="Ch5.S2.I1.ix21.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify an integer greater than or equal to 0.
OTUs whose number of raw sequencing reads is lower than this value in every sample will be excluded.</span></p>
</div>
</dd>
<dt id="Ch5.S2.I1.ix22"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--minpseqotu</span></span></dt>
<dd>
<div id="Ch5.S2.I1.ix22.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify a decimal number from 0 to 1.
OTUs whose proportion of raw sequencing reads (number of raw reads of the OTU in a sample / total number of raw reads of that sample) is lower than this value in every sample will be excluded.</span></p>
</div>
</dd>
<dt id="Ch5.S2.I1.ix23"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--minntotalseqotu</span></span></dt>
<dd>
<div id="Ch5.S2.I1.ix23.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify an integer greater than or equal to 0.
OTUs whose total number of raw sequencing reads is lower than this value will be excluded.</span></p>
</div>
</dd>
<dt id="Ch5.S2.I1.ix24"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--minnseqsample</span></span></dt>
<dd>
<div id="Ch5.S2.I1.ix24.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify an integer greater than or equal to 0.
Samples whose number of raw sequencing reads is lower than this value for every OTU will be excluded.</span></p>
</div>
</dd>
<dt id="Ch5.S2.I1.ix25"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--minpseqsample</span></span></dt>
<dd>
<div id="Ch5.S2.I1.ix25.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify a decimal number from 0 to 1.
Samples whose proportion of raw sequencing reads (number of raw reads of an OTU in the sample / total number of raw reads of that OTU) is lower than this value for every OTU will be excluded.</span></p>
</div>
</dd>
<dt id="Ch5.S2.I1.ix26"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--minntotalseqsample</span></span></dt>
<dd>
<div id="Ch5.S2.I1.ix26.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify an integer greater than or equal to 0.
Samples whose total number of raw sequencing reads is lower than this value will be excluded.</span></p>
</div>
</dd>
<dt id="Ch5.S2.I1.ix27"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--otu</span></span></dt>
<dd>
<div id="Ch5.S2.I1.ix27.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify the names of the OTUs you want to keep as a comma-delimited list.</span></p>
</div>
</dd>
<dt id="Ch5.S2.I1.ix28"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--negativeotu</span></span></dt>
<dd>
<div id="Ch5.S2.I1.ix28.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify the names of the OTUs you want to eliminate as a comma-delimited list.</span></p>
</div>
</dd>
<dt id="Ch5.S2.I1.ix29"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--otulist</span></span></dt>
<dd>
<div id="Ch5.S2.I1.ix29.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify a file name.
In the file, write the names of the OTUs you want to keep, one per line.</span></p>
</div>
</dd>
<dt id="Ch5.S2.I1.ix30"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--negativeotulist</span></span></dt>
<dd>
<div id="Ch5.S2.I1.ix30.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify a file name.
In the file, write the names of the OTUs you want to eliminate, one per line.</span></p>
</div>
</dd>
<dt id="Ch5.S2.I1.ix31"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--otuseq</span></span></dt>
<dd>
<div id="Ch5.S2.I1.ix31.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify the name of a FASTA sequence file.
The file should contain the names of the OTUs you want to keep. The sequences themselves are ignored.</span></p>
</div>
</dd>
<dt id="Ch5.S2.I1.ix32"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--negativeotuseq</span></span></dt>
<dd>
<div id="Ch5.S2.I1.ix32.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify the name of a FASTA sequence file.
The file should contain the names of the OTUs you want to eliminate. The sequences themselves are ignored.</span></p>
</div>
</dd>
<dt id="Ch5.S2.I1.ix33"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--sample</span></span></dt>
<dd>
<div id="Ch5.S2.I1.ix33.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify the names of the samples you want to keep as a comma-delimited list.</span></p>
</div>
</dd>
<dt id="Ch5.S2.I1.ix34"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--negativesample</span></span></dt>
<dd>
<div id="Ch5.S2.I1.ix34.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify the names of the samples you want to eliminate as a comma-delimited list.</span></p>
</div>
</dd>
<dt id="Ch5.S2.I1.ix35"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--samplelist</span></span></dt>
<dd>
<div id="Ch5.S2.I1.ix35.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify a file name.
In the file, write the names of the samples you want to keep, one per line.</span></p>
</div>
</dd>
<dt id="Ch5.S2.I1.ix36"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--negativesamplelist</span></span></dt>
<dd>
<div id="Ch5.S2.I1.ix36.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify a file name.
In the file, write the names of the samples you want to eliminate, one per line.</span></p>
</div>
</dd>
<dt id="Ch5.S2.I1.ix37"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--replicatelist</span></span></dt>
<dd>
<div id="Ch5.S2.I1.ix37.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify a file name.
In the file, write the sample names of each set of PCR replicates on one line, separated by tabs.
The PCR replicates will be merged in the output file,
and their numbers of raw reads will be summed.</span></p>
</div>
</dd>
<dt id="Ch5.S2.I1.ix38"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_typewriter ltx_font_bold" style="font-size:100%;">--runname</span></span></dt>
<dd>
<div id="Ch5.S2.I1.ix38.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Specify a RunID.
The RunIDs in all sample names will be replaced with this RunID.
If the replacement produces samples whose names match completely, those samples will be merged in the output file,
and their numbers of raw reads will be summed.</span></p>
</div>
</dd>
</dl>
</div>
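<div class="ltx_para">
<p class="ltx_p">The OTU thresholds listed above can be read as simple rules on the rows and columns of the summary table. The following Python sketch shows one reading of <span class="ltx_text ltx_font_typewriter">--minnseqotu</span>, <span class="ltx_text ltx_font_typewriter">--minpseqotu</span> and <span class="ltx_text ltx_font_typewriter">--minntotalseqotu</span>. It is only an illustration of the descriptions above, not the code used by <span class="ltx_text ltx_font_typewriter">clfiltersum</span>. The function name <span class="ltx_text ltx_font_typewriter">filter_otus</span> and the nested-dictionary form of the table are assumptions made for this example.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
def filter_otus(table, minnseqotu=0, minpseqotu=0.0, minntotalseqotu=0):
    """Return a copy of table keeping only OTUs that pass all three thresholds.

    table maps each sample name to a dict of OTU name to raw read count
    (a hypothetical in-memory form of the tab-delimited summary table).
    """
    sample_totals = {sample: sum(counts.values()) for sample, counts in table.items()}
    otus = sorted({otu for counts in table.values() for otu in counts})
    kept = []
    for otu in otus:
        reads = [table[sample].get(otu, 0) for sample in table]
        # Proportion of each sample's reads assigned to this OTU (0 for empty samples).
        props = [
            table[sample].get(otu, 0) / sample_totals[sample] if sample_totals[sample] else 0.0
            for sample in table
        ]
        # An OTU is excluded only when it is below a threshold in every sample,
        # so it is kept when its maximum over samples reaches the threshold.
        if (
            max(reads) >= minnseqotu
            and max(props) >= minpseqotu
            and sum(reads) >= minntotalseqotu
        ):
            kept.append(otu)
    return {
        sample: {otu: counts.get(otu, 0) for otu in kept}
        for sample, counts in table.items()
    }
```
</pre>
<p class="ltx_p">In this sketch, an OTU with 10 reads in one sample and 4 reads in another is dropped by a per-sample threshold of 20 and by a total threshold of 15, but it is kept by a total threshold of 14. The sample-side arguments would follow the same pattern with the roles of samples and OTUs exchanged.</p>
</div>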
<div id="Ch5.S2.p5" class="ltx_para">
<p class="ltx_p">After applying the post-OTU-picking chimeric sequence removal or nontarget sequence removal explained later, the summary table can be filtered to the remaining sequences by using the <span class="ltx_text ltx_font_typewriter">--otuseq</span> argument.
If you want to apply additional filtering to the summary table, apply this OTU filtering first.</p>
</div>
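<div class="ltx_para">
<p class="ltx_p">The idea behind filtering with <span class="ltx_text ltx_font_typewriter">--otuseq</span> can be sketched in a few lines of Python: only the names in the FASTA file are read, and OTUs not named there are dropped from the table. This is a minimal illustration, not the implementation used by <span class="ltx_text ltx_font_typewriter">clfiltersum</span>. The function names, the dictionary form of the table, and the choice to treat the whole header line as the OTU name are assumptions made for this example.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
def read_fasta_names(lines):
    """Collect sequence names from FASTA header lines; sequence lines are ignored."""
    # Assumption: the whole header line after the leading marker is the OTU name.
    return [line[1:].strip() for line in lines if line.startswith(">")]


def keep_listed_otus(table, names):
    """Keep only the OTUs found in names.

    table maps each sample name to a dict of OTU name to raw read count
    (a hypothetical in-memory form of the summary table).
    """
    wanted = set(names)
    return {
        sample: {otu: n for otu, n in counts.items() if otu in wanted}
        for sample, counts in table.items()
    }
```
</pre>
<p class="ltx_p">For example, passing the lines of a <span class="ltx_text ltx_font_typewriter">nonchimeras.fasta</span> file to <span class="ltx_text ltx_font_typewriter">read_fasta_names</span> and the result to <span class="ltx_text ltx_font_typewriter">keep_listed_otus</span> leaves only the nonchimeric OTUs in the table.</p>
</div>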
</section>
<section id="Ch5.S3" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">5.3 </span>Chimera removal based on UCHIME algorithm</h3>

<div id="Ch5.S3.p1" class="ltx_para">
<p class="ltx_p">Post-OTU-picking chimera removal based on the UCHIME algorithm can additionally be applied, either with reference sequences or without them (<span class="ltx_text ltx_font_italic">de novo</span>).
If you want to apply both, apply <span class="ltx_text ltx_font_italic">de novo</span> chimera removal first, and then apply reference-based chimera removal.
To perform chimera removal, use the <span class="ltx_text ltx_font_typewriter">clrunuchime</span> command.
This command uses the UCHIME algorithm as implemented in VSEARCH.</p>
</div>
<section id="Ch5.S3.SS1" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">5.3.1 </span><span class="ltx_text ltx_font_italic">De novo</span> chimera removal</h4>

<div id="Ch5.S3.SS1.p1" class="ltx_para">
<p class="ltx_p">Execute <span class="ltx_text ltx_font_typewriter">clrunuchime</span> as follows.</p>
</div>
<div id="Ch5.S3.SS1.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clrunuchime \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--otufile=*.otu.gz \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch5.S3.SS1.p3" class="ltx_para">
<p class="ltx_p">The output files in the output folder are explained later.</p>
</div>
</section>
<section id="Ch5.S3.SS2" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">5.3.2 </span>Reference-based chimera removal</h4>

<div id="Ch5.S3.SS2.p1" class="ltx_para">
<p class="ltx_p">Execute <span class="ltx_text ltx_font_typewriter">clrunuchime</span> as follows.</p>
</div>
<div id="Ch5.S3.SS2.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clrunuchime \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--referencedb=ReferenceDatabase \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch5.S3.SS2.p3" class="ltx_para">
<p class="ltx_p">If you have already applied <span class="ltx_text ltx_font_italic">de novo</span> chimera removal, the output folder of that run contains <span class="ltx_text ltx_font_typewriter">nonchimeras.fasta</span>.
This file is recommended as the input file for reference-based chimera removal.
The output files in the output folder are explained later.</p>
</div>
<div id="Ch5.S3.SS2.p4" class="ltx_para">
<p class="ltx_p">The following ready-made reference databases are provided and installed.</p>
</div>
<div id="Ch5.S3.SS2.p5" class="ltx_para">
<dl id="Ch5.S3.I1" class="ltx_description">
<dt id="Ch5.S3.I1.ix39"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">cdu12s</span></span></dt>
<dd>
<div id="Ch5.S3.I1.ix39.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Claident Database for UCHIME for animal 12S ver.20180412</span></p>
</div>
</dd>
<dt id="Ch5.S3.I1.ix40"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">cdu16s</span></span></dt>
<dd>
<div id="Ch5.S3.I1.ix40.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Claident Database for UCHIME for animal 16S ver.20180412</span></p>
</div>
</dd>
<dt id="Ch5.S3.I1.ix41"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">cducox1</span></span></dt>
<dd>
<div id="Ch5.S3.I1.ix41.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Claident Database for UCHIME for animal COX1(COI) ver.20180412</span></p>
</div>
</dd>
<dt id="Ch5.S3.I1.ix42"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">cducytb</span></span></dt>
<dd>
<div id="Ch5.S3.I1.ix42.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Claident Database for UCHIME for animal Cyt-b ver.20180412</span></p>
</div>
</dd>
<dt id="Ch5.S3.I1.ix43"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">cdudloop</span></span></dt>
<dd>
<div id="Ch5.S3.I1.ix43.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Claident Database for UCHIME for animal D-loop ver.20180412</span></p>
</div>
</dd>
<dt id="Ch5.S3.I1.ix44"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">cdumatk</span></span></dt>
<dd>
<div id="Ch5.S3.I1.ix44.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Claident Database for UCHIME for plant matK ver.20180412</span></p>
</div>
</dd>
<dt id="Ch5.S3.I1.ix45"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">cdurbcl</span></span></dt>
<dd>
<div id="Ch5.S3.I1.ix45.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Claident Database for UCHIME for plant rbcL ver.20180412</span></p>
</div>
</dd>
<dt id="Ch5.S3.I1.ix46"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">cdutrnhpsba</span></span></dt>
<dd>
<div id="Ch5.S3.I1.ix46.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Claident Database for UCHIME for plant trnH-psbA ver.20180412</span></p>
</div>
</dd>
<dt id="Ch5.S3.I1.ix47"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">rdpgoldv9</span></span></dt>
<dd>
<div id="Ch5.S3.I1.ix47.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">RDP Gold v9 for prokaryotic 16S</span></p>
</div>
</dd>
<dt id="Ch5.S3.I1.ix48"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">silva132LSUref</span></span></dt>
<dd>
<div id="Ch5.S3.I1.ix48.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">SILVA Release 132 for LSU rRNA</span></p>
</div>
</dd>
<dt id="Ch5.S3.I1.ix49"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">silva132SSUrefnr99</span></span></dt>
<dd>
<div id="Ch5.S3.I1.ix49.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">SILVA Release 132 Nr99 for SSU rRNA</span></p>
</div>
</dd>
<dt id="Ch5.S3.I1.ix50"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">unite20170628</span></span></dt>
<dd>
<div id="Ch5.S3.I1.ix50.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">UNITE ver.20170628 for fungal ITS</span></p>
</div>
</dd>
<dt id="Ch5.S3.I1.ix51"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">unite20170628untrim</span></span></dt>
<dd>
<div id="Ch5.S3.I1.ix51.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">UNITE ver.20170628 without trimming for fungal ITS</span></p>
</div>
</dd>
<dt id="Ch5.S3.I1.ix52"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">unite20170628its1</span></span></dt>
<dd>
<div id="Ch5.S3.I1.ix52.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">UNITE ver.20170628 for fungal ITS1</span></p>
</div>
</dd>
<dt id="Ch5.S3.I1.ix53"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">unite20170628its2</span></span></dt>
<dd>
<div id="Ch5.S3.I1.ix53.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">UNITE ver.20170628 for fungal ITS2</span></p>
</div>
</dd>
</dl>
</div>
</section>
<section id="Ch5.S3.SS3" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">5.3.3 </span>About contents of output folder</h4>

<div id="Ch5.S3.SS3.p1" class="ltx_para">
<p class="ltx_p">In the output folder of <span class="ltx_text ltx_font_typewriter">clrunuchime</span>, the following files will be saved.</p>
</div>
<div id="Ch5.S3.SS3.p2" class="ltx_para">
<dl id="Ch5.S3.I2" class="ltx_description">
<dt id="Ch5.S3.I2.ix54"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">chimeras.fasta</span></span></dt>
<dd>
<div id="Ch5.S3.I2.ix54.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Sequences determined as chimeric</span></p>
</div>
</dd>
<dt id="Ch5.S3.I2.ix55"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">nonchimeras.fasta</span></span></dt>
<dd>
<div id="Ch5.S3.I2.ix55.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Sequences determined as nonchimeric</span></p>
</div>
</dd>
<dt id="Ch5.S3.I2.ix56"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">uchimealns.txt</span></span></dt>
<dd>
<div id="Ch5.S3.I2.ix56.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Alignment used in chimera detection</span></p>
</div>
</dd>
<dt id="Ch5.S3.I2.ix57"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">uchimeout.txt</span></span></dt>
<dd>
<div id="Ch5.S3.I2.ix57.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Detected parent sequences, chimera scores and other information</span></p>
</div>
</dd>
</dl>
</div>
<div id="Ch5.S3.SS3.p3" class="ltx_para">
<p class="ltx_p">For the meaning of each element in <span class="ltx_text ltx_font_typewriter">uchimeout.txt</span>, see the following URL.
<br class="ltx_break"><a href="http://drive5.com/usearch/manual/uchimeout.html" title="" class="ltx_ref ltx_href">http://drive5.com/usearch/manual/uchimeout.html</a></p>
</div>
</section>
</section>
<section id="Ch5.S4" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">5.4 </span>Excluding low-abundance OTUs from OTU sequences</h3>

<div id="Ch5.S4.p1" class="ltx_para">
<p class="ltx_p">In the output folder of OTU picking, <span class="ltx_text ltx_font_typewriter">clustered.fasta</span> is saved as the representative sequence file.
This file can be used for taxonomic assignment.
However, if rare OTUs are kept, the number of OTUs is sometimes too large for taxonomic assignment to be practical.
In such cases, it is useful to extract only the OTUs whose abundance is at least a specified value.
The following command excludes OTUs observed fewer than 5 times.</p>
</div>
<div id="Ch5.S4.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clfilterseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--otufile=*.otu.gz \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minnseq=5 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
</section>
<section id="Ch5.S5" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">5.5 </span>Sequence splitting based on conservative motif recognition</h3>

<div id="Ch5.S5.p1" class="ltx_para">
<p class="ltx_p">If your sequences span multiple loci (e.g. ITS1–5.8S rRNA–ITS2), splitting them by locus may improve taxonomic assignment.
If a conserved motif exists at or close to the border between loci, that motif can be used for splitting.
Universal primer annealing positions are recommended as such conserved motifs.
The following command divides sequences into the parts anterior and posterior to the matched position.</p>
</div>
<div id="Ch5.S5.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; cldivseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--query=Sequence \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--border=start \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile2↓</span></span>
</span>
</div>
<div id="Ch5.S5.p3" class="ltx_para">
<p class="ltx_p">By default, the given query sequence is searched for in the sequences contained in inputfile using the Needleman-Wunsch algorithm, allowing 15% mismatches.
With <span class="ltx_text ltx_font_typewriter">--border=start</span> as in the command above, a matched sequence is divided at the head of the matched position, and the anterior and posterior parts are written to outputfile1 and outputfile2, respectively.
Unmatched sequences are written to outputfile1.
The query sequence must be on the same strand as the target sequences.
To use a query on the opposite strand, add the <span class="ltx_text ltx_font_typewriter">--reversecomplement</span> argument.</p>
</div>
<div id="Ch5.S5.p4" class="ltx_para">
<p class="ltx_p">If <span class="ltx_text ltx_font_typewriter">--border=end</span> is specified, sequences are divided at the tail of the matched position.
If <span class="ltx_text ltx_font_typewriter">--border=both</span> is given (this is the default), the part anterior to the head of the matched position and the part posterior to its tail are saved, and the matched position itself is excluded from both outputs.
If the query is unmatched, the undivided sequence is saved to outputfile1, and outputfile2 will not contain that sequence.
If the <span class="ltx_text ltx_font_typewriter">--makedummy</span> argument is specified, a dummy sequence <span class="ltx_text ltx_font_typewriter">A</span> is saved to outputfile2 in that case.
This argument is useful when merging the taxonomic assignment results of multiple loci.
To divide sequences into 3 or more subsequences, run <span class="ltx_text ltx_font_typewriter">cldivseq</span> repeatedly.</p>
</div>
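<div class="ltx_para">
<p class="ltx_p">The three <span class="ltx_text ltx_font_typewriter">--border</span> modes described above can be illustrated with a short Python sketch. It is a simplification and not the implementation used by <span class="ltx_text ltx_font_typewriter">cldivseq</span>: it looks for an exact match of the query, whereas <span class="ltx_text ltx_font_typewriter">cldivseq</span> uses Needleman-Wunsch alignment allowing mismatches. The function names and the use of <span class="ltx_text ltx_font_typewriter">None</span> for a missing second part are assumptions made for this example.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a sequence of unambiguous bases (IUPAC codes are not handled)."""
    return seq.translate(COMPLEMENT)[::-1]


def divide(seq, query, border="both", reversecomplement=False, makedummy=False):
    """Split seq around the first exact occurrence of query.

    Returns (anterior, posterior). If query is not found, the undivided
    sequence is returned as the first part and the second part is None,
    or the dummy sequence "A" when makedummy is True.
    """
    if reversecomplement:
        query = reverse_complement(query)
    start = seq.find(query)  # exact match only; a stand-in for alignment
    if start == -1:
        return seq, ("A" if makedummy else None)
    end = start + len(query)
    if border == "start":
        # Divide at the head of the match: the match goes to the posterior part.
        return seq[:start], seq[start:]
    if border == "end":
        # Divide at the tail of the match: the match stays in the anterior part.
        return seq[:end], seq[end:]
    if border == "both":
        # The matched position is excluded from both parts.
        return seq[:start], seq[end:]
    raise ValueError("border must be 'start', 'end' or 'both'")
```
</pre>
<p class="ltx_p">Applying such a split a second time to the posterior part, with a motif near the next border, yields three subsequences. This corresponds to running <span class="ltx_text ltx_font_typewriter">cldivseq</span> repeatedly.</p>
</div>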
</section>
<section id="Ch5.S6" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">5.6 </span>ITS, SSU rRNA or LSU rRNA sequence extraction using ITSx or Metaxa</h3>

<div id="Ch5.S6.p1" class="ltx_para">
<p class="ltx_p">ITSx can detect ITS1 and ITS2 and extract only those regions <cite class="ltx_cite ltx_citemacro_citep">(Bengtsson-Palme <span class="ltx_text ltx_font_italic">et al.</span>, <a href="#bib.bib4" title="" class="ltx_ref">2013</a>)</cite>.
In many taxa, ITS is highly variable, whereas SSU, 5.8S and LSU rRNA are much more conserved.
Such conserved regions cause misidentifications or unidentified results because the conserved regions of distantly related taxa will match the query sequences.
Extracting the ITS regions with ITSx may solve this problem.</p>
</div>
<div id="Ch5.S6.p2" class="ltx_para">
<p class="ltx_p">Metaxa can detect SSU (12S/16S/18S) rRNA and LSU (26S/28S) rRNA and extract only those regions <cite class="ltx_cite ltx_citemacro_citep">(Bengtsson <span class="ltx_text ltx_font_italic">et al.</span>, <a href="#bib.bib3" title="" class="ltx_ref">2011</a>)</cite>.
For SSU rRNA, Metaxa can also distinguish among eukaryotic nuclear, mitochondrial, chloroplast and prokaryotic sequences.
SSU rRNA is widely used for DNA barcoding and metabarcoding of eukaryotes, but it is present not only in the eukaryotic nuclear genome but also in mitochondria, chloroplasts and contaminating prokaryotes.
Therefore, nontarget SSU rRNA sequences frequently contaminate the data.
Such nontarget SSU sequences should be removed before community ecological analysis.</p>
</div>
<div id="Ch5.S6.p3" class="ltx_para">
<p class="ltx_p">After filtering the sequences with ITSx or Metaxa, <span class="ltx_text ltx_font_typewriter">clfiltersum</span> with the <span class="ltx_text ltx_font_typewriter">--otuseq</span> argument can be used to make the summary table consistent with the sequence file (see section <a href="#Ch5.S2" title="5.2 Excluding specified OTUs and/or samples from summary table ‣ Chapter 5 Summarizing and post-processing of OTU picking results ‣ Metabarcoding and DNA barcoding for Ecologists: Sequence analysis" class="ltx_ref"><span class="ltx_text ltx_ref_tag">5.2</span></a>).</p>
</div>
</section>
<section id="Ch5.S7" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">5.7 </span>Searching and excluding nontarget sequences</h3>

<div id="Ch5.S7.p1" class="ltx_para">
<p class="ltx_p">ITSx and Metaxa are applicable only to ITS and SSU/LSU rRNA, not to other loci.
For other loci, gene prediction and annotation programs and multiple alignment programs can be used to find nontarget sequences.
ClustalW2, ClustalX2 <cite class="ltx_cite ltx_citemacro_citep">(Larkin <span class="ltx_text ltx_font_italic">et al.</span>, <a href="#bib.bib13" title="" class="ltx_ref">2007</a>)</cite> and MAFFT <cite class="ltx_cite ltx_citemacro_citep">(Katoh &amp; Standley, <a href="#bib.bib10" title="" class="ltx_ref">2013</a>)</cite> provide a function that sorts sequences by phylogenetic relatedness.
If you align the sequences with this sorting function enabled, you can easily find some types of nontarget sequences by eye with the aid of a multiple alignment viewer.
After filtering the sequences by this method, <span class="ltx_text ltx_font_typewriter">clfiltersum</span> with the <span class="ltx_text ltx_font_typewriter">--otuseq</span> argument can be used to make the summary table consistent with the sequence file (see section <a href="#Ch5.S2" title="5.2 Excluding specified OTUs and/or samples from summary table ‣ Chapter 5 Summarizing and post-processing of OTU picking results ‣ Metabarcoding and DNA barcoding for Ecologists: Sequence analysis" class="ltx_ref"><span class="ltx_text ltx_ref_tag">5.2</span></a>).</p>
</div>
</section>
</section>
<section id="Ch6" class="ltx_chapter">
<h2 class="ltx_title ltx_title_chapter">
<span class="ltx_tag ltx_tag_chapter">Chapter 6 </span>Data deposition to DRA</h2>

<div id="Ch6.p1" class="ltx_para">
<p class="ltx_p">Representative sequences can be deposited to DDBJ, EMBL or GenBank if required, but raw sequencing reads should be deposited to a sequence read archive such as the DDBJ Sequence Read Archive (DRA).
If you sequenced multiple samples with tag sequences in a single NGS run as described above, demultiplexed sequence files (with tag and primer positions trimmed) should be deposited.
The FASTQ files made by the <span class="ltx_text ltx_font_typewriter">clsplitseq</span> command can be used for deposition.</p>
</div>
<div id="Ch6.p2" class="ltx_para">
<p class="ltx_p">XML files recording metadata about the samples need to be created.
DRA provides an XML creation support tool, but it is not easy to use when there are many samples.
To reduce the time and effort needed to make the XML files, <span class="ltx_text ltx_font_typewriter">clmaketsv</span> and <span class="ltx_text ltx_font_typewriter">clmakexml</span> can be used.</p>
</div>
<div id="Ch6.p3" class="ltx_para">
<p class="ltx_p">To deposit raw sequences to DRA, a D-way user account at DDBJ and public key registration are required.
Read the DRA Handbook
<br class="ltx_break"><a href="http://trace.ddbj.nig.ac.jp/dra/submission_e.shtml" title="" class="ltx_ref ltx_href">http://trace.ddbj.nig.ac.jp/dra/submission_e.shtml</a>
<br class="ltx_break">for the detailed procedures.
During deposition, the concepts of Submission, Study, Experiment, Sample and Run, and the associations among them, are important and need to be understood.</p>
</div>
<div id="Ch6.p4" class="ltx_para">
<p class="ltx_p">In DRA, a Study needs to be registered in the BioProject database, which is a research project database, and is referred to by its BioProject ID (accession number).
Read the BioProject Handbook
<br class="ltx_break"><a href="http://trace.ddbj.nig.ac.jp/bioproject/submission_e.html" title="" class="ltx_ref ltx_href">http://trace.ddbj.nig.ac.jp/bioproject/submission_e.html</a>
<br class="ltx_break">and register the Study in BioProject before depositing data to DRA.</p>
</div>
<div id="Ch6.p5" class="ltx_para">
<p class="ltx_p">A Sample also needs to be registered in the BioSample database, which is a biological specimen database, and is referred to by its BioSample ID (accession number).
Read the BioSample Handbook
<br class="ltx_break"><a href="http://trace.ddbj.nig.ac.jp/biosample/submission_e.html" title="" class="ltx_ref ltx_href">http://trace.ddbj.nig.ac.jp/biosample/submission_e.html</a>
<br class="ltx_break">and register the Sample in BioSample before depositing data to DRA.
For metabarcoding using amplicon sequences amplified with a universal primer set, MIMarks-Survey should be specified as the MIxS type.
In the BioSample database, sampling locality, elevation, depth, temperature, humidity, pH, etc. can be added as metadata.
Add as much sample information as possible, for future generations and for yourself.
The meaning of each item is explained in the checklist
<br class="ltx_break"><a href="http://wiki.gensc.org/index.php?title=MIMARKS" title="" class="ltx_ref ltx_href">http://wiki.gensc.org/index.php?title=MIMARKS</a>
<br class="ltx_break">provided by the Genomic Standards Consortium.
If you cannot understand the meaning of an item, ask DDBJ.
For underground and sub-seafloor samples, three quantities need to be distinguished: the height or depth of the sampling point relative to mean sea level, the height or depth of the sampling point relative to the ground/seafloor surface, and the height or depth of the ground/seafloor surface relative to mean sea level.
These are easily confused, so please be careful.</p>
</div>
<section id="Ch6.S1" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">6.1 </span>Preparing tab-delimited text file for XML generation</h3>

<div id="Ch6.S1.p1" class="ltx_para">
<p class="ltx_p">In the XML files, information about the contained sequences needs to be provided for each FASTQ file.
Because preparing this by hand is laborious, <span class="ltx_text ltx_font_typewriter">clmaketsv</span> generates a simple tab-delimited text file, and <span class="ltx_text ltx_font_typewriter">clmakexml</span> creates the XML files from the edited tab-delimited text file.
The <span class="ltx_text ltx_font_typewriter">clmaketsv</span> command can be run as below.</p>
</div>
<div id="Ch6.S1.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clmaketsv \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">*snip* \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfileN \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
<div id="Ch6.S1.p3" class="ltx_para">
<p class="ltx_p">The FASTQ files to be deposited to DRA should be specified as input files.
Wildcards can be used in input file names.
Once the tab-delimited text file has been generated, use spreadsheet software such as Microsoft Excel to edit it and fill in the blank cells.
Note that Microsoft Excel automatically converts several types of characters and this function cannot be disabled.
If <span class="ltx_text ltx_font_typewriter">[Foo,Bar]</span> is found in a cell, select <span class="ltx_text ltx_font_typewriter">Foo</span> or <span class="ltx_text ltx_font_typewriter">Bar</span> and delete the brackets, the comma and the alternative you did not choose.
If you find <span class="ltx_text ltx_font_typewriter">&lt;Fill in this cell&gt;</span> in a cell, fill in the cell according to the instruction written inside <span class="ltx_text ltx_font_typewriter">&lt;&gt;</span> and delete the <span class="ltx_text ltx_font_typewriter">&lt;&gt;</span>.
Edit the other cells as needed.
If you do not yet have a BioProject ID and/or BioSample ID because of an assignment delay, you can use the Submission ID with the prefix PSUB for BioProject and SSUB for BioSample.</p>
</div>
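<div class="ltx_para">
<p class="ltx_p">Before passing the edited file on, it can help to confirm that no placeholder of the two kinds described above was left behind.
The following Python script is only a minimal sketch and is not part of Claident; the function name <span class="ltx_text ltx_font_typewriter">find_unedited_cells</span> and the sample rows are hypothetical, and the two patterns are assumptions based solely on the <span class="ltx_text ltx_font_typewriter">[Foo,Bar]</span> and <span class="ltx_text ltx_font_typewriter">&lt;Fill in this cell&gt;</span> forms shown above.</p>
</div>

```python
import csv
import io
import re

# The angle brackets are written by Unicode name so that this listing
# contains no raw markup characters.
LT = "\N{LESS-THAN SIGN}"
GT = "\N{GREATER-THAN SIGN}"

# A [Foo,Bar]-style choice list: square brackets containing at least one comma.
CHOICE = re.compile(r"\[[^\[\]]*,[^\[\]]*\]")
# An instruction placeholder: any text wrapped in angle brackets.
INSTRUCTION = re.compile(re.escape(LT) + "[^" + LT + GT + "]+" + re.escape(GT))


def find_unedited_cells(handle):
    """Return (row, column, text) for each cell that still holds a placeholder."""
    problems = []
    for row_number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        for column_number, cell in enumerate(row, start=1):
            if CHOICE.search(cell) or INSTRUCTION.search(cell):
                problems.append((row_number, column_number, cell))
    return problems


# Hypothetical rows standing in for an edited tab-delimited file.
example = io.StringIO(
    "sample1\t[Foo,Bar]\tforest soil\n"
    "sample2\tFoo\t" + LT + "Fill in this cell" + GT + "\n"
)
for row_number, column_number, cell in find_unedited_cells(example):
    print(f"row {row_number}, column {column_number}: {cell}")
```

<div class="ltx_para">
<p class="ltx_p">For a real file, open it with <span class="ltx_text ltx_font_typewriter">open(path, newline="")</span> and pass the file object to the function.
The sketch does not flag blank cells, because it cannot tell which blanks are intentional.</p>
</div>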
</section>
<section id="Ch6.S2" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">6.2 </span>XML generation from tab-delimited text</h3>

<div id="Ch6.S2.p1" class="ltx_para">
<p class="ltx_p">After editing the tab-delimited text file, run <span class="ltx_text ltx_font_typewriter">clmakexml</span> as below.</p>
</div>
<div id="Ch6.S2.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clmakexml \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">Tab-DelimitedTextFile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">Submission-ID↓</span></span>
</span>
</div>
<div id="Ch6.S2.p3" class="ltx_para">
<p class="ltx_p">The XML files required to deposit the FASTQ files will then be created.
Multiple tab-delimited text files can be given to <span class="ltx_text ltx_font_typewriter">clmakexml</span>.
In that case, the first 3 lines of the second and subsequent files are ignored.</p>
</div>
<div id="Ch6.S2.p4" class="ltx_para">
<p class="ltx_p">The Submission-ID is assigned by DRA when the [Create new submission(s)] button is pushed.
The Submission-ID has the form <span class="ltx_text ltx_font_typewriter">UserID-0001</span> and should be given as the last argument of <span class="ltx_text ltx_font_typewriter">clmakexml</span>.
When the execution of <span class="ltx_text ltx_font_typewriter">clmakexml</span> is completed, 3 <span class="ltx_text ltx_font_typewriter">Submission-ID.*.xml</span> files will have been generated.
These files can be uploaded to DRA using XML Upload.
After all processes are completed, a DRA accession number is assigned and sent from DRA to your e-mail address.
Note that in most cases it is the BioProject ID (accession number) that is written in the manuscript.</p>
</div>
</section>
</section>
<section id="Ch7" class="ltx_chapter">
<h2 class="ltx_title ltx_title_chapter">
<span class="ltx_tag ltx_tag_chapter">Chapter 7 </span>Estimation of host organisms of nucleotide sequences (a.k.a. DNA barcoding)</h2>

<div id="Ch7.p1" class="ltx_para">
<p class="ltx_p">DNA barcoding is a method for taxonomic identification of biological specimens using nucleotide sequences, and it is becoming widely applied in many fields.
However, current reference sequence databases are insufficient for species-level identification, and an algorithm suited to incomplete reference databases has been lacking.
To solve this problem, I proposed a new criterion that the sequence distance between the query and its nearest neighbor must be smaller than the maximum distance within the resulting taxon, developed the QCauto algorithm, which fulfills this criterion <cite class="ltx_cite ltx_citemacro_citep">(Tanabe &amp; Toju, <a href="#bib.bib18" title="" class="ltx_ref">2013</a>)</cite>, and implemented it in Claident.
Note that when the potentially observable species are completely described and their barcode DNA sequences are completely sequenced and registered in the reference database, we can expect the host organism of the query to be the same as that of its nearest neighbor.
For such cases, the 1-nearest-neighbor (closest match) method is also implemented in Claident.</p>
</div>
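<div class="ltx_para">
<p class="ltx_p">The criterion itself can be written as a one-line comparison.
The following Python function is only an illustrative sketch of that comparison, not the QCauto algorithm and not Claident code; the function name and the distance values are hypothetical, and how the candidate taxon and its pairwise distances are obtained is outside the sketch.</p>
</div>

```python
def satisfies_distance_criterion(query_to_nearest, within_taxon_distances):
    """Check the criterion for one candidate taxon.

    query_to_nearest: distance between the query and its nearest neighbor.
    within_taxon_distances: pairwise distances among the reference sequences
    of the candidate taxon.
    Returns True when the largest within-taxon distance exceeds the distance
    from the query to its nearest neighbor.
    """
    if not within_taxon_distances:
        # No pairwise distance means no within-taxon variation to compare with.
        return False
    return max(within_taxon_distances) > query_to_nearest


# Hypothetical distances: the query lies 0.02 from its nearest neighbor.
print(satisfies_distance_criterion(0.02, [0.01, 0.015]))  # taxon too narrow
print(satisfies_distance_criterion(0.02, [0.01, 0.05]))   # taxon wide enough
```

<div class="ltx_para">
<p class="ltx_p">In the first call the candidate taxon varies less than the gap to the query, so a broader taxon would be needed; in the second call the criterion holds.</p>
</div>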
<section id="Ch7.S1" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">7.1 </span>Retrieval of neighborhood sequences based on BLAST search</h3>

<div id="Ch7.S1.p1" class="ltx_para">
<p class="ltx_p">The GenBank IDs of the neighborhood sequences required to fulfill the criterion “sequence distance between query and nearest-neighbor must be smaller than maximum distance within resulting taxon” can be retrieved by the following command.</p>
</div>
<div id="Ch7.S1.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clidentseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--blastdb=overall_genus \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
<div id="Ch7.S1.p3" class="ltx_para">
<p class="ltx_p">Give a FASTA-formatted nucleotide sequence file as the input file.
Specify the reference sequence database for the BLAST search with the <span class="ltx_text ltx_font_typewriter">--blastdb</span> argument.
The following databases should have been installed with Claident.</p>
</div>
<div id="Ch7.S1.p4" class="ltx_para">
<dl id="Ch7.S1.I1" class="ltx_description">
<dt id="Ch7.S1.I1.ix58"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">animals_COX1_genus</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix58.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Animal (Metazoa) </span><span class="ltx_text ltx_font_italic" style="font-size:100%;">COX1</span><span class="ltx_text" style="font-size:100%;"> sequences which have genus or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix59"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">animals_COX1_species</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix59.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Animal (Metazoa) </span><span class="ltx_text ltx_font_italic" style="font-size:100%;">COX1</span><span class="ltx_text" style="font-size:100%;"> sequences which have species or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix60"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">animals_mt_genus</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix60.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Animal (Metazoa) mitochondrial sequences which have genus or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix61"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">animals_mt_species</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix61.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Animal (Metazoa) mitochondrial sequences which have species or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix62"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">eukaryota_LSU_genus</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix62.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Eukaryotic LSU (28S) rRNA sequences which have genus or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix63"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">eukaryota_LSU_species</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix63.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Eukaryotic LSU (28S) rRNA sequences which have species or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix64"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">eukaryota_SSU_genus</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix64.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Eukaryotic SSU (18S) rRNA sequences which have genus or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix65"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">eukaryota_SSU_species</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix65.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Eukaryotic SSU (18S) rRNA sequences which have species or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix66"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">fungi_ITS_genus</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix66.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Fungal ITS sequences which have genus or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix67"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">fungi_ITS_species</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix67.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Fungal ITS sequences which have species or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix68"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">overall_class</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix68.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">NCBI nt sequences which have class or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix69"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">overall_order</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix69.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">NCBI nt sequences which have order or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix70"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">overall_family</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix70.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">NCBI nt sequences which have family or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix71"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">overall_genus</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix71.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">NCBI nt sequences which have genus or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix72"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">overall_species</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix72.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">NCBI nt sequences which have species or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix73"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">plants_matK_genus</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix73.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Plant </span><span class="ltx_text ltx_font_italic" style="font-size:100%;">matK</span><span class="ltx_text" style="font-size:100%;"> sequences which have genus or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix74"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">plants_matK_species</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix74.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Plant </span><span class="ltx_text ltx_font_italic" style="font-size:100%;">matK</span><span class="ltx_text" style="font-size:100%;"> sequences which have species or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix75"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">plants_rbcL_genus</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix75.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Plant </span><span class="ltx_text ltx_font_italic" style="font-size:100%;">rbcL</span><span class="ltx_text" style="font-size:100%;"> sequences which have genus or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix76"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">plants_rbcL_species</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix76.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Plant </span><span class="ltx_text ltx_font_italic" style="font-size:100%;">rbcL</span><span class="ltx_text" style="font-size:100%;"> sequences which have species or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix77"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">plants_trnH-psbA_genus</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix77.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Plant </span><span class="ltx_text ltx_font_italic" style="font-size:100%;">trnH</span><span class="ltx_text" style="font-size:100%;">–</span><span class="ltx_text ltx_font_italic" style="font-size:100%;">psbA</span><span class="ltx_text" style="font-size:100%;"> sequences which have genus or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix78"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">plants_trnH-psbA_species</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix78.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Plant </span><span class="ltx_text ltx_font_italic" style="font-size:100%;">trnH</span><span class="ltx_text" style="font-size:100%;">–</span><span class="ltx_text ltx_font_italic" style="font-size:100%;">psbA</span><span class="ltx_text" style="font-size:100%;"> sequences which have species or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix79"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">prokaryota_16S_genus</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix79.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Prokaryotic (Bacterial and Archaeal) 16S rRNA sequences which have genus or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix80"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">prokaryota_16S_species</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix80.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Prokaryotic (Bacterial and Archaeal) 16S rRNA sequences which have species or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix81"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">prokaryota_all_genus</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix81.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Prokaryotic (Bacterial and Archaeal) sequences which have genus or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix82"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">prokaryota_all_species</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix82.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Prokaryotic (Bacterial and Archaeal) sequences which have species or lower level taxonomic information</span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix83"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">semiall_class</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix83.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Reduced database excluding all sequences of vertebrates, </span><span class="ltx_text ltx_font_italic" style="font-size:100%;">Caenorhabditis</span><span class="ltx_text" style="font-size:100%;"> and </span><span class="ltx_text ltx_font_italic" style="font-size:100%;">Drosophila</span><span class="ltx_text" style="font-size:100%;"> from </span><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">overall_class</span><span class="ltx_text" style="font-size:100%;"></span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix84"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">semiall_order</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix84.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Reduced database excluding all sequences of vertebrates, </span><span class="ltx_text ltx_font_italic" style="font-size:100%;">Caenorhabditis</span><span class="ltx_text" style="font-size:100%;"> and </span><span class="ltx_text ltx_font_italic" style="font-size:100%;">Drosophila</span><span class="ltx_text" style="font-size:100%;"> from </span><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">overall_order</span><span class="ltx_text" style="font-size:100%;"></span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix85"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">semiall_family</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix85.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Reduced database excluding all sequences of vertebrates, </span><span class="ltx_text ltx_font_italic" style="font-size:100%;">Caenorhabditis</span><span class="ltx_text" style="font-size:100%;"> and </span><span class="ltx_text ltx_font_italic" style="font-size:100%;">Drosophila</span><span class="ltx_text" style="font-size:100%;"> from </span><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">overall_family</span><span class="ltx_text" style="font-size:100%;"></span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix86"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">semiall_genus</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix86.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Reduced database excluding all sequences of vertebrates, </span><span class="ltx_text ltx_font_italic" style="font-size:100%;">Caenorhabditis</span><span class="ltx_text" style="font-size:100%;"> and </span><span class="ltx_text ltx_font_italic" style="font-size:100%;">Drosophila</span><span class="ltx_text" style="font-size:100%;"> from </span><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">overall_genus</span><span class="ltx_text" style="font-size:100%;"></span></p>
</div>
</dd>
<dt id="Ch7.S1.I1.ix87"><span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" style="font-size:100%;">semiall_species</span></span></dt>
<dd>
<div id="Ch7.S1.I1.ix87.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:100%;">Reduced database excluding all sequences of vertebrates, </span><span class="ltx_text ltx_font_italic" style="font-size:100%;">Caenorhabditis</span><span class="ltx_text" style="font-size:100%;"> and </span><span class="ltx_text ltx_font_italic" style="font-size:100%;">Drosophila</span><span class="ltx_text" style="font-size:100%;"> from </span><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">overall_species</span><span class="ltx_text" style="font-size:100%;"></span></p>
</div>
</dd>
</dl>
</div>
<div id="Ch7.S1.p5" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">overall_*</span> databases are very large and require a large amount of memory, but they are versatile and robust to contamination by sequences of unexpected taxa or loci.
Thus, the <span class="ltx_text ltx_font_typewriter">overall_*</span> databases are strongly recommended in most cases.
In the <span class="ltx_text ltx_font_typewriter">overall_genus</span> database, only the sequences that have genus or lower level taxonomic information are kept and the other sequences are excluded, so a nearest neighbor often cannot be found for minor taxa.
The <span class="ltx_text ltx_font_typewriter">overall_class</span> database is recommended for such cases.
The other databases are much smaller than <span class="ltx_text ltx_font_typewriter">overall_*</span> and require much less memory.</p>
</div>
<section id="Ch7.S1.SS1" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">7.1.1 </span>Accelerating BLAST search using cache databases</h4>

<div id="Ch7.S1.SS1.p1" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">clidentseq</span> command runs the BLAST search multiple times and takes a long time.
To reduce the runtime of <span class="ltx_text ltx_font_typewriter">clidentseq</span>, a cache sequence database containing the 10,000 highest-scoring sequences can be constructed for each query sequence.
Using a cache database greatly accelerates <span class="ltx_text ltx_font_typewriter">clidentseq</span> and is strongly recommended.
To construct a cache database, run <span class="ltx_text ltx_font_typewriter">clmakecachedb</span> as follows.</p>
</div>
<div id="Ch7.S1.SS1.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clmakecachedb \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--blastdb=overall_genus \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfolder↓</span></span>
</span>
</div>
<div id="Ch7.S1.SS1.p3" class="ltx_para">
<p class="ltx_p">When running <span class="ltx_text ltx_font_typewriter">clidentseq</span>, give the output folder of <span class="ltx_text ltx_font_typewriter">clmakecachedb</span> to the <span class="ltx_text ltx_font_typewriter">--blastdb</span> argument; the cache database will then be used.
Note that the input files of <span class="ltx_text ltx_font_typewriter">clmakecachedb</span> and <span class="ltx_text ltx_font_typewriter">clidentseq</span> must be the same file.
The required amount of memory is also greatly reduced.</p>
</div>
</section>
<section id="Ch7.S1.SS2" class="ltx_subsection">
<h4 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">7.1.2 </span>In the case of perfect reference database</h4>

<div id="Ch7.S1.SS2.p1" class="ltx_para">
<p class="ltx_p">If the potentially observable species are completely described and their barcode DNA sequences are completely sequenced and registered in the reference database, retrieving the top-1 and tied sequences whose percent identity to the query is 99% or more is recommended. This can be performed by the following command.</p>
</div>
<div id="Ch7.S1.SS2.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clidentseq \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">blastn -task megablast -word_size 16 end \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--method=1,99% \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--blastdb=overall_genus \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--numthreads=NumberOfCPUs \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
<div id="Ch7.S1.SS2.p3" class="ltx_para">
<p class="ltx_p">When searching for the top-1 and tied sequences whose percent identity to the query is 99% or more, we can expect that hits are hardly ever overlooked even if <span class="ltx_text ltx_font_typewriter">-task megablast -word_size 16</span> is given as the BLAST search arguments, so these arguments are specified.</p>
</div>
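<div class="ltx_para">
<p class="ltx_p">The selection rule can be illustrated with a short Python sketch.
It is not Claident code and reflects only one reading of the rule: the hits are ranked by a score, every hit sharing the best score is kept, and the kept hits must also reach the identity threshold.
The function name, the tuple layout and the sample values are hypothetical.</p>
</div>

```python
def top_hits_with_ties(hits, min_identity=99.0):
    """Select the best-scoring hit(s) that also reach the identity threshold.

    hits: list of (sequence_id, percent_identity, score) tuples.
    Returns the IDs of all hits tied for the best score whose percent
    identity is min_identity or more.
    """
    if not hits:
        return []
    best_score = max(score for _, _, score in hits)
    return [
        sequence_id
        for sequence_id, identity, score in hits
        if score == best_score and identity >= min_identity
    ]


# Hypothetical hits: two references tie for the best score.
hits = [
    ("ref_A", 99.6, 540.0),
    ("ref_B", 99.6, 540.0),
    ("ref_C", 98.1, 512.0),
]
print(top_hits_with_ties(hits))
```

<div class="ltx_para">
<p class="ltx_p">Here <span class="ltx_text ltx_font_typewriter">ref_A</span> and <span class="ltx_text ltx_font_typewriter">ref_B</span> are both returned, while <span class="ltx_text ltx_font_typewriter">ref_C</span> is dropped because it is neither tied for the best score nor at 99% identity.</p>
</div>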
</section>
</section>
<section id="Ch7.S2" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">7.2 </span>Taxonomic assignment based on neighborhood sequences</h3>

<div id="Ch7.S2.p1" class="ltx_para">
<p class="ltx_p">The following command assigns a taxon to each query by ascending the taxonomic levels until the taxonomic information of all neighborhood sequences matches completely.
This is called the lowest common ancestor (LCA) algorithm <cite class="ltx_cite ltx_citemacro_citep">(Huson <span class="ltx_text ltx_font_italic">et al.</span>, <a href="#bib.bib8" title="" class="ltx_ref">2007</a>)</cite>.</p>
</div>
<div id="Ch7.S2.p2" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; classigntax \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--taxdb=overall_genus \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
<div id="Ch7.S2.p3" class="ltx_para">
<p class="ltx_p">Give the output file of <span class="ltx_text ltx_font_typewriter">clidentseq</span> as the input file.
The <span class="ltx_text ltx_font_typewriter">--taxdb</span> argument specifies the taxonomy database of the reference sequences.
The taxonomy database with the same name as the reference sequence database should be installed and specified.</p>
</div>
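<div class="ltx_para">
<p class="ltx_p">The idea behind the LCA assignment can be sketched in a few lines of Python.
This is only an illustration under simple assumptions, not the implementation used by <span class="ltx_text ltx_font_typewriter">classigntax</span>: each neighborhood sequence is represented by a hypothetical dictionary from rank to taxon name, the rank list is fixed, and a rank that no sequence records is skipped.</p>
</div>

```python
RANKS = ["phylum", "class", "order", "family", "genus", "species"]


def lowest_common_ancestor(lineages, ranks=RANKS):
    """Return the (rank, name) pairs on which every lineage agrees.

    Ranks are compared from the highest downwards and the comparison stops at
    the first rank where the lineages disagree. In a consistent taxonomy this
    gives the same result as ascending from species until all lineages match.
    """
    shared = []
    for rank in ranks:
        names = {lineage.get(rank) for lineage in lineages}
        if names == {None}:
            # No lineage records this rank, so it carries no information.
            continue
        if len(names) != 1:
            # Disagreement, a missing name in some lineage, or no lineage at all.
            break
        shared.append((rank, names.pop()))
    return shared


# Hypothetical neighborhood sequences of one query.
neighbors = [
    {"phylum": "Basidiomycota", "family": "Russulaceae",
     "genus": "Russula", "species": "Russula emetica"},
    {"phylum": "Basidiomycota", "family": "Russulaceae",
     "genus": "Russula", "species": "Russula nobilis"},
    {"phylum": "Basidiomycota", "family": "Russulaceae",
     "genus": "Lactarius"},
]
print(lowest_common_ancestor(neighbors))
```

<div class="ltx_para">
<p class="ltx_p">The three neighbors agree on phylum and family but not on genus, so the query in this example would be assigned only down to the family Russulaceae.</p>
</div>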
<div id="Ch7.S2.p4" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">classigntax</span> command requires 2 or more neighborhood sequences by default.
If <span class="ltx_text ltx_font_typewriter">--method=1,99%</span> was given to <span class="ltx_text ltx_font_typewriter">clidentseq</span> as in section <a href="#Ch7.S1.SS2" title="7.1.2 In the case of perfect reference database ‣ 7.1 Retrieval of neighborhood sequences based on BLAST search ‣ Chapter 7 Estimation of host organisms of nucleotide sequences (a.k.a. DNA barcoding) ‣ Metabarcoding and DNA barcoding for Ecologists: Sequence analysis" class="ltx_ref"><span class="ltx_text ltx_ref_tag">7.1.2</span></a>, taxonomic assignment is expected to fail for all sequences.
In such a case, reduce the required minimum number of neighborhood sequences as below.</p>
</div>
<div id="Ch7.S2.p5" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; classigntax \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--taxdb=overall_genus \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minnsupporter=1 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
<div id="Ch7.S2.p6" class="ltx_para">
<p class="ltx_p">The output file of <span class="ltx_text ltx_font_typewriter">classigntax</span> is a tab-delimited text file like Table <a href="#Ch7.T1" title="Table 7.1 ‣ 7.2 Taxonomic assignment based on neighborhood sequences ‣ Chapter 7 Estimation of host organisms of nucleotide sequences (a.k.a. DNA barcoding) ‣ Metabarcoding and DNA barcoding for Ecologists: Sequence analysis" class="ltx_ref"><span class="ltx_text ltx_ref_tag">7.1</span></a>.</p>
</div>
<figure id="Ch7.T1" class="ltx_table">
<table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle">
<thead class="ltx_thead">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">query</span></th>
<th class="ltx_td ltx_align_left ltx_th ltx_th_column"><span class="ltx_text" style="font-size:100%;">phylum</span></th>
<th class="ltx_td ltx_align_left ltx_th ltx_th_column"><span class="ltx_text" style="font-size:100%;">genus</span></th>
<th class="ltx_td ltx_align_left ltx_th ltx_th_column"><span class="ltx_text" style="font-size:100%;">species</span></th>
</tr>
</thead>
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r ltx_border_tt"><span class="ltx_text" style="font-size:100%;">seqA</span></th>
<td class="ltx_td ltx_align_left ltx_border_tt"><span class="ltx_text" style="font-size:100%;">Ascomycota</span></td>
<td class="ltx_td ltx_align_left ltx_border_tt"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Chloridium</span></td>
<td class="ltx_td ltx_align_left ltx_border_tt"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Chloridium virescens</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">seqB</span></th>
<td class="ltx_td ltx_align_left"><span class="ltx_text" style="font-size:100%;">Ascomycota</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Chloridium</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Chloridium virescens</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">seqC</span></th>
<td class="ltx_td ltx_align_left"><span class="ltx_text" style="font-size:100%;">Ascomycota</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Chloridium</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Chloridium virescens</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">seqD</span></th>
<td class="ltx_td ltx_align_left"><span class="ltx_text" style="font-size:100%;">Basidiomycota</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Amanita</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Amanita fuliginea</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">seqE</span></th>
<td class="ltx_td ltx_align_left"><span class="ltx_text" style="font-size:100%;">Basidiomycota</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Coltriciella</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Coltriciella dependens</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">seqF</span></th>
<td class="ltx_td ltx_align_left"><span class="ltx_text" style="font-size:100%;">Basidiomycota</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Filobasidium</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Filobasidium uniguttulatum</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">seqG</span></th>
<td class="ltx_td ltx_align_left"><span class="ltx_text" style="font-size:100%;">Basidiomycota</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Laccaria</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Laccaria bicolor</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">seqH</span></th>
<td class="ltx_td ltx_align_left"><span class="ltx_text" style="font-size:100%;">Basidiomycota</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Lactarius</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Lactarius quietus</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">seqI</span></th>
<td class="ltx_td ltx_align_left"><span class="ltx_text" style="font-size:100%;">Basidiomycota</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Russula</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Russula densifolia</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">seqJ</span></th>
<td class="ltx_td ltx_align_left"><span class="ltx_text" style="font-size:100%;">Basidiomycota</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Russula</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Russula densifolia</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">seqK</span></th>
<td class="ltx_td ltx_align_left"><span class="ltx_text" style="font-size:100%;">Basidiomycota</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Russula</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Russula densifolia</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">seqL</span></th>
<td class="ltx_td ltx_align_left"><span class="ltx_text" style="font-size:100%;">Basidiomycota</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Russula</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Russula vesca</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">seqM</span></th>
<td class="ltx_td ltx_align_left"><span class="ltx_text" style="font-size:100%;">Basidiomycota</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Agaricus</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">seqN</span></th>
<td class="ltx_td ltx_align_left"><span class="ltx_text" style="font-size:100%;">Basidiomycota</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Amanita</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">seqO</span></th>
<td class="ltx_td ltx_align_left"><span class="ltx_text" style="font-size:100%;">Basidiomycota</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Amanita</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">seqP</span></th>
<td class="ltx_td ltx_align_left"><span class="ltx_text" style="font-size:100%;">Ascomycota</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Bisporella</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">seqQ</span></th>
<td class="ltx_td ltx_align_left"><span class="ltx_text" style="font-size:100%;">Ascomycota</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Capronia</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">seqR</span></th>
<td class="ltx_td ltx_align_left"><span class="ltx_text" style="font-size:100%;">Ascomycota</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Capronia</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:100%;">seqS</span></th>
<td class="ltx_td ltx_align_left"><span class="ltx_text" style="font-size:100%;">Ascomycota</span></td>
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_italic" style="font-size:100%;">Cenococcum</span></td>
<td class="ltx_td"></td>
</tr>
</tbody>
</table>
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table  7.1: </span>Examples of taxonomic assignment. A blank cell means unidentified. Several taxonomic levels are omitted to reduce the table width.</figcaption>
</figure>
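<div class="ltx_para">
<p class="ltx_p">As a rough illustration of how such a tab-delimited table can be inspected, the following Python sketch counts how many queries are identified at a given taxonomic level. It is not part of Claident. The function name <span class="ltx_text ltx_font_typewriter">count_identified</span> and the example rows are assumptions made for this illustration, and the column names are taken from Table 7.1; a real output file contains more taxonomic levels.</p>
</div>

```python
import csv
import io


def count_identified(tsv_text, rank):
    """Return (identified, total) for one rank column of a tab-delimited table.

    Hypothetical helper for illustration only; it assumes a header row whose
    column names match those shown in Table 7.1.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    total = 0
    identified = 0
    for row in reader:
        total += 1
        # A blank (or missing) cell means the query is unidentified at this rank.
        if (row.get(rank) or "").strip():
            identified += 1
    return identified, total


# Four rows copied from Table 7.1 (seqA, seqD, seqM and seqS).
example = (
    "query\tphylum\tgenus\tspecies\n"
    "seqA\tAscomycota\tChloridium\tChloridium virescens\n"
    "seqD\tBasidiomycota\tAmanita\tAmanita fuliginea\n"
    "seqM\tBasidiomycota\tAgaricus\t\n"
    "seqS\tAscomycota\tCenococcum\t\n"
)

print(count_identified(example, "species"))
print(count_identified(example, "genus"))
```

<div class="ltx_para">
<p class="ltx_p">To use the sketch on a real file, read the file contents into a string first and pass a column name that appears in its header row.</p>
</div>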
<div id="Ch7.S2.p7" class="ltx_para">
<p class="ltx_p">By default, the <span class="ltx_text ltx_font_typewriter">classigntax</span> command assigns a taxonomy to each query by ascending the taxonomic levels until the taxonomic information of all neighborhood sequences matches completely.
If the reference database is contaminated by a misidentified sequence and that sequence is among the neighborhood sequences, the lower-level taxonomy is likely to remain unidentified.
This “strict consensus” method is robust but too conservative in some cases.
The following command allows up to 5% of the neighborhood sequences to have taxonomic information that differs from the resulting taxonomy.</p>
</div>
<div id="Ch7.S2.p8" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; classigntax \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--taxdb=overall_genus \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--maxpopposer=0.05 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--minsoratio=19 \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">inputfile \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
<div id="Ch7.S2.p9" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">classigntax</span> command treats neighborhood sequences whose taxonomic information agrees with the resulting taxonomy as “supporters” and neighborhood sequences whose taxonomic information differs from it as “opposers”.
The <span class="ltx_text ltx_font_typewriter">--maxpopposer</span> argument specifies the maximum allowed proportion of opposer sequences.
The <span class="ltx_text ltx_font_typewriter">--minsoratio</span> argument sets the minimum required ratio of supporters to opposers.
Both arguments are required because some neighborhood sequences may be unidentified, having no taxonomic information, and such sequences are neither supporters nor opposers.
With the above command, up to 5% opposers are tolerated and at least 19 times as many supporters as opposers are required.</p>
</div>
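<div class="ltx_para">
<p class="ltx_p">The interplay of the two thresholds can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the actual implementation of <span class="ltx_text ltx_font_typewriter">classigntax</span>: the function <span class="ltx_text ltx_font_typewriter">consensus_assign</span>, the three-level rank list, and the choice to compute the opposer proportion over all neighborhood sequences (including unidentified ones) are assumptions made for this example. The parameters only mimic the roles of <span class="ltx_text ltx_font_typewriter">--maxpopposer</span>, <span class="ltx_text ltx_font_typewriter">--minsoratio</span> and <span class="ltx_text ltx_font_typewriter">--minnsupporter</span>.</p>
</div>

```python
from collections import Counter

# Illustrative subset of ranks, ordered from the lowest level upward.
RANKS_LOW_TO_HIGH = ["species", "genus", "phylum"]


def consensus_assign(neighbors, max_p_opposer=0.0, min_so_ratio=None,
                     min_n_supporter=2):
    """Return (rank, taxon) for the lowest rank passing all thresholds, else None.

    neighbors is a list of dicts mapping rank names to taxon names; a missing
    or empty entry means the neighbor is unidentified at that rank.
    """
    total = len(neighbors)
    for rank in RANKS_LOW_TO_HIGH:
        names = [n.get(rank) for n in neighbors if n.get(rank)]
        if not names:
            continue
        # The most frequent name is the candidate; the others are opposers.
        taxon, supporters = Counter(names).most_common(1)[0]
        opposers = len(names) - supporters
        if supporters < min_n_supporter:
            continue
        # Assumption: the proportion is taken over all neighbors.
        if opposers / total > max_p_opposer:
            continue
        if (min_so_ratio is not None and opposers > 0
                and supporters / opposers < min_so_ratio):
            continue
        return rank, taxon
    return None


def neighbor(species=None):
    """Build one hypothetical Russula neighbor, optionally without a species."""
    return {"species": species, "genus": "Russula", "phylum": "Basidiomycota"}


# 39 consistent neighbors plus one possibly misidentified reference sequence.
contaminated = [neighbor("Russula densifolia")] * 39 + [neighbor("Russula vesca")]
strict = consensus_assign(contaminated)
relaxed = consensus_assign(contaminated, max_p_opposer=0.05, min_so_ratio=19)

# Mostly unidentified neighbors: opposers are rare, but so are supporters.
sparse = ([neighbor("Russula densifolia")] * 6 + [neighbor("Russula vesca")] * 4
          + [neighbor()] * 90)
both_limits = consensus_assign(sparse, max_p_opposer=0.05, min_so_ratio=19)
proportion_only = consensus_assign(sparse, max_p_opposer=0.05)

print(strict, relaxed, both_limits, proportion_only)
```

<div class="ltx_para">
<p class="ltx_p">In the first data set the strict setting stops at the genus level because of a single disagreeing neighbor, whereas the relaxed setting reaches the species level. In the second data set the opposers make up only 4% of all neighbors, so the proportion limit alone would accept the species, but the supporter to opposer ratio of 1.5 is far below 19 and the assignment falls back to the genus level. This is why both limits are needed when unidentified sequences are present.</p>
</div>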
</section>
<section id="Ch7.S3" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">7.3 </span>Merging multiple taxonomic assignments based on consensus</h3>

<div id="Ch7.S3.p1" class="ltx_para">
<p class="ltx_p">In plant chloroplast loci, query sequences from a single locus often cannot be identified at the lower taxonomic levels because of a lack of variation.
Merging multiple results based on <span class="ltx_text ltx_font_typewriter">overall_genus</span> (more reliable database) and <span class="ltx_text ltx_font_typewriter">overall_class</span> (less reliable database) is often useful.
Merging multiple results based on “strict consensus” LCA (strict LCA) and “95%-majority rule consensus” LCA (relaxed LCA) is also useful in some cases.
Such merging of multiple assignment results can be performed with <span class="ltx_text ltx_font_typewriter">clmergeassign</span>.</p>
</div>
<div id="Ch7.S3.p2" class="ltx_para">
<p class="ltx_p">Multiple assignment results of <span class="ltx_text ltx_font_italic">rbcL</span> and <span class="ltx_text ltx_font_italic">matK</span> can be merged, preferring the lower-level identification, by the following command.</p>
</div>
<div id="Ch7.S3.p3" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clmergeassign \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--priority=equal \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--preferlower \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">ResultOfrbcL \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">ResultOfmatK \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
<div id="Ch7.S3.p4" class="ltx_para">
<p class="ltx_p">The following command merges multiple results, accepting an assignment when the query is identified as the same taxon in both results, or when it is identified in one result and unidentified in the other.
Note that a mismatch at a higher taxonomic level is not allowed.</p>
</div>
<div id="Ch7.S3.p5" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clmergeassign \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--priority=equal \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">ResultOfrbcL \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">ResultOfmatK \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
<div id="Ch7.S3.p6" class="ltx_para">
<p class="ltx_p">Three assignment results of <span class="ltx_text ltx_font_italic">rbcL</span>, <span class="ltx_text ltx_font_italic">matK</span> and <span class="ltx_text ltx_font_italic">trnH</span>–<span class="ltx_text ltx_font_italic">psbA</span> can be merged, preferring the lower-level identification, by the following command.</p>
</div>
<div id="Ch7.S3.p7" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clmergeassign \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--priority=equal \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--preferlower \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">ResultOfrbcL \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">ResultOfmatK \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">ResultOftrnH-psbA \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
<div id="Ch7.S3.p8" class="ltx_para">
<p class="ltx_p">The following command merges two results, preferring the result of <span class="ltx_text ltx_font_typewriter">overall_genus</span>.</p>
</div>
<div id="Ch7.S3.p9" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clmergeassign \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--priority=descend \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">ResultOfoverall_genus \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">ResultOfoverall_class \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
<div id="Ch7.S3.p10" class="ltx_para">
<p class="ltx_p">In this case, the result of <span class="ltx_text ltx_font_typewriter">overall_class</span> is accepted if it matches the result of <span class="ltx_text ltx_font_typewriter">overall_genus</span> and is identified at a lower level than the result of <span class="ltx_text ltx_font_typewriter">overall_genus</span>.</p>
</div>
<div id="Ch7.S3.p11" class="ltx_para">
<p class="ltx_p">The following command merges two results, preferring the result of strict LCA.</p>
</div>
<div id="Ch7.S3.p12" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; clmergeassign \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">--priority=descend \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">ResultOfStrictLCA \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">ResultOfRelaxedLCA \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">outputfile↓</span></span>
</span>
</div>
<div id="Ch7.S3.p13" class="ltx_para">
<p class="ltx_p">In this case, the result of relaxed LCA is accepted if it matches the result of strict LCA and is identified at a lower level than the result of strict LCA.</p>
</div>
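<div class="ltx_para">
<p class="ltx_p">The acceptance rule described for <span class="ltx_text ltx_font_typewriter">--priority=descend</span> can be sketched in Python as follows. This is only an illustration of the rule as stated in the text, not the actual implementation of <span class="ltx_text ltx_font_typewriter">clmergeassign</span>; the function names, the three-level rank list and the example assignments are assumptions made for this example.</p>
</div>

```python
# Illustrative subset of ranks, ordered from the highest level downward.
RANKS_HIGH_TO_LOW = ["phylum", "genus", "species"]


def depth(assignment):
    """Count the ranks, from the top, that are identified without a gap."""
    n = 0
    for rank in RANKS_HIGH_TO_LOW:
        if not assignment.get(rank):
            break
        n += 1
    return n


def merge_descend(primary, secondary):
    """Prefer primary; take secondary only if it agrees and goes deeper.

    Hypothetical helper: secondary is accepted when it matches primary at
    every rank primary has identified and is identified at a lower level.
    """
    d = depth(primary)
    agrees = all(primary[r] == secondary.get(r) for r in RANKS_HIGH_TO_LOW[:d])
    if agrees and depth(secondary) > d:
        return dict(secondary)
    return dict(primary)


strict_lca = {"phylum": "Basidiomycota", "genus": "Russula"}
relaxed_lca = {"phylum": "Basidiomycota", "genus": "Russula",
               "species": "Russula densifolia"}
conflicting = {"phylum": "Basidiomycota", "genus": "Amanita",
               "species": "Amanita fuliginea"}

# The relaxed result agrees with the strict one and is deeper, so it is accepted.
print(merge_descend(strict_lca, relaxed_lca))
# The conflicting result disagrees at the genus level, so the strict result is kept.
print(merge_descend(strict_lca, conflicting))
```

<div class="ltx_para">
<p class="ltx_p">The order of the two arguments matters in this sketch, in the same way that the order of the input files determines the priority in the commands above.</p>
</div>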
</section>
</section>
<section id="bib" class="ltx_bibliography">
<h2 class="ltx_title ltx_title_bibliography">References</h2>

<ul class="ltx_biblist" style="padding:0em 0em 0em 1em;text-indent:-1em;">
      
<li id="bib.bib2" class="ltx_bibitem" style="padding:0em 0em 0.2em 0em;">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Andrews (2010)</span>
<span class="ltx_bibblock">
Andrews, S. (2010) Software distributed by the author at
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

</span>
</li>
      
<li id="bib.bib3" class="ltx_bibitem" style="padding:0em 0em 0.2em 0em;">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Bengtsson <span class="ltx_text ltx_font_italic">et al.</span> (2011)</span>
<span class="ltx_bibblock">
Bengtsson, J., Eriksson, K. M., Hartmann, M., Wang, Z., Shenoy, B. D., Grelet,
G.-A., Abarenkov, K., Petri, A., Rosenblad, M. A., Nilsson, R. H. (2011)
Metaxa: a software tool for automated detection and discrimination among
ribosomal small subunit (12S/16S/18S) sequences of archaea, bacteria,
eukaryotes, mitochondria, and chloroplasts in metagenomes and environmental
sequencing datasets. <span class="ltx_text ltx_font_italic">Antonie Van Leeuwenhoek</span>, <span class="ltx_text ltx_font_bold">100</span>, 471–475.

</span>
</li>
      
<li id="bib.bib4" class="ltx_bibitem" style="padding:0em 0em 0.2em 0em;">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Bengtsson-Palme <span class="ltx_text ltx_font_italic">et al.</span> (2013)</span>
<span class="ltx_bibblock">
Bengtsson-Palme, J., Ryberg, M., Hartmann, M., Branco, S., Wang, Z., Godhe, A.,
De Wit, P., Sánchez-García, M., Ebersberger, I., de Sousa, F., Amend,
A., Jumpponen, A., Unterseher, M., Kristiansson, E., Abarenkov, K., Bertrand,
Y. J. K., Sanli, K., Eriksson, K. M., Vik, U., Veldre, V., Nilsson, R. H.
(2013) Improved software detection and extraction of ITS1 and ITS2 from
ribosomal ITS sequences of fungi and other eukaryotes for analysis of
environmental sequencing data. <span class="ltx_text ltx_font_italic">Methods in Ecology and Evolution</span>,
<span class="ltx_text ltx_font_bold">4</span>, 914–919.

</span>
</li>
      
<li id="bib.bib5" class="ltx_bibitem" style="padding:0em 0em 0.2em 0em;">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Edgar <span class="ltx_text ltx_font_italic">et al.</span> (2011)</span>
<span class="ltx_bibblock">
Edgar, R. C., Haas, B. J., Clemente, J. C., Quince, C., Knight, R. (2011)
UCHIME improves sensitivity and speed of chimera detection. <span class="ltx_text ltx_font_italic">Bioinformatics</span>, <span class="ltx_text ltx_font_bold">27</span>, 2194–2200.

</span>
</li>
      
<li id="bib.bib6" class="ltx_bibitem" style="padding:0em 0em 0.2em 0em;">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Fadrosh <span class="ltx_text ltx_font_italic">et al.</span> (2014)</span>
<span class="ltx_bibblock">
Fadrosh, D. W., Ma, B., Gajer, P., Sengamalay, N., Ott, S., Brotman, R. M.,
Ravel, J. (2014) An improved dual-indexing approach for multiplexed 16S rRNA
gene sequencing on the Illumina MiSeq platform. <span class="ltx_text ltx_font_italic">Microbiome</span>,
<span class="ltx_text ltx_font_bold">2</span>, 6.

</span>
</li>
      
<li id="bib.bib7" class="ltx_bibitem" style="padding:0em 0em 0.2em 0em;">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hamady <span class="ltx_text ltx_font_italic">et al.</span> (2008)</span>
<span class="ltx_bibblock">
Hamady, M., Walker, J. J., Harris, J. K., Gold, N. J., Knight, R. (2008)
Error-correcting barcoded primers for pyrosequencing hundreds of samples in
multiplex. <span class="ltx_text ltx_font_italic">Nature Methods</span>, <span class="ltx_text ltx_font_bold">5</span>, 235–237.

</span>
</li>
      
<li id="bib.bib8" class="ltx_bibitem" style="padding:0em 0em 0.2em 0em;">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Huson <span class="ltx_text ltx_font_italic">et al.</span> (2007)</span>
<span class="ltx_bibblock">
Huson, D. H., Auch, A. F., Qi, J., Schuster, S. C. (2007) MEGAN analysis of
metagenomic data. <span class="ltx_text ltx_font_italic">Genome Research</span>, <span class="ltx_text ltx_font_bold">17</span>, 377–386.

</span>
</li>
      
<li id="bib.bib9" class="ltx_bibitem" style="padding:0em 0em 0.2em 0em;">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Illumina corporation (2013)</span>
<span class="ltx_bibblock">
Illumina corporation (2013) 16S metagenomic sequencing library preparation.

</span>
</li>
      
<li id="bib.bib10" class="ltx_bibitem" style="padding:0em 0em 0.2em 0em;">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Katoh &amp; Standley (2013)</span>
<span class="ltx_bibblock">
Katoh, K., Standley, D. M. (2013) MAFFT multiple sequence alignment software
version 7: improvements in performance and usability. <span class="ltx_text ltx_font_italic">Molecular Biology
and Evolution</span>, <span class="ltx_text ltx_font_bold">30</span>, 772–780.

</span>
</li>
      
<li id="bib.bib11" class="ltx_bibitem" style="padding:0em 0em 0.2em 0em;">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kunin <span class="ltx_text ltx_font_italic">et al.</span> (2010)</span>
<span class="ltx_bibblock">
Kunin, V., Engelbrektson, A., Ochman, H., Hugenholtz, P. (2010) Wrinkles in the
rare biosphere: pyrosequencing errors can lead to artificial inflation of
diversity estimates. <span class="ltx_text ltx_font_italic">Environmental Microbiology</span>, <span class="ltx_text ltx_font_bold">12</span>,
118–123.

</span>
</li>
      
<li id="bib.bib12" class="ltx_bibitem" style="padding:0em 0em 0.2em 0em;">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Lange <span class="ltx_text ltx_font_italic">et al.</span> (2015)</span>
<span class="ltx_bibblock">
Lange, A., Jost, S., Heider, D., Bock, C., Budeus, B., Schilling, E.,
Strittmatter, A., Boenigk, J., Hoffmann, D. (2015) AmpliconDuo: A
Split-Sample Filtering Protocol for High-Throughput Amplicon Sequencing of
Microbial Communities. <span class="ltx_text ltx_font_italic">PLoS One</span>, <span class="ltx_text ltx_font_bold">10</span>, e0141590.

</span>
</li>
      
<li id="bib.bib13" class="ltx_bibitem" style="padding:0em 0em 0.2em 0em;">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Larkin <span class="ltx_text ltx_font_italic">et al.</span> (2007)</span>
<span class="ltx_bibblock">
Larkin, M. A., Blackshields, G., Brown, N. P., Chenna, R., McGettigan, P. A.,
McWilliam, H., Valentin, F., Wallace, I. M., Wilm, A., Lopez, R., Thompson,
J. D., Gibson, T. J., Higgins, D. G. (2007) Clustal W and Clustal X version
2.0. <span class="ltx_text ltx_font_italic">Bioinformatics</span>, <span class="ltx_text ltx_font_bold">23</span>, 2947–2948.

</span>
</li>
      
<li id="bib.bib14" class="ltx_bibitem" style="padding:0em 0em 0.2em 0em;">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li <span class="ltx_text ltx_font_italic">et al.</span> (2012)</span>
<span class="ltx_bibblock">
Li, W., Fu, L., Niu, B., Wu, S., Wooley, J. (2012) Ultrafast clustering
algorithms for metagenomic sequence analysis. <span class="ltx_text ltx_font_italic">Briefings in
Bioinformatics</span>, <span class="ltx_text ltx_font_bold">13</span>, 656–668.

</span>
</li>
      
<li id="bib.bib15" class="ltx_bibitem" style="padding:0em 0em 0.2em 0em;">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Nelson <span class="ltx_text ltx_font_italic">et al.</span> (2014)</span>
<span class="ltx_bibblock">
Nelson, M. C., Morrison, H. G., Benjamino, J., Grim, S. L., Graf, J. (2014)
Analysis, optimization and verification of Illumina-generated 16S rRNA gene
amplicon surveys. <span class="ltx_text ltx_font_italic">PLoS One</span>, <span class="ltx_text ltx_font_bold">9</span>, e94249.

</span>
</li>
      
<li id="bib.bib16" class="ltx_bibitem" style="padding:0em 0em 0.2em 0em;">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Schmieder &amp; Edwards (2011)</span>
<span class="ltx_bibblock">
Schmieder, R., Edwards, R. (2011) Quality control and preprocessing of
metagenomic datasets. <span class="ltx_text ltx_font_italic">Bioinformatics</span>, <span class="ltx_text ltx_font_bold">27</span>, 863–864.

</span>
</li>
      
<li id="bib.bib17" class="ltx_bibitem" style="padding:0em 0em 0.2em 0em;">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Stevens <span class="ltx_text ltx_font_italic">et al.</span> (2013)</span>
<span class="ltx_bibblock">
Stevens, J. L., Jackson, R. L., Olson, J. B. (2013) Slowing PCR ramp speed
reduces chimera formation from environmental samples. <span class="ltx_text ltx_font_italic">Journal of
Microbiological Methods</span>, <span class="ltx_text ltx_font_bold">93</span>, 203–205.

</span>
</li>
      
<li id="bib.bib18" class="ltx_bibitem" style="padding:0em 0em 0.2em 0em;">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Tanabe &amp; Toju (2013)</span>
<span class="ltx_bibblock">
Tanabe, A. S., Toju, H. (2013) Two new computational methods for universal DNA
barcoding: a benchmark using barcode sequences of bacteria, archaea, animals,
fungi, and land plants. <span class="ltx_text ltx_font_italic">PLoS One</span>, <span class="ltx_text ltx_font_bold">8</span>, e76910.

</span>
</li>
      
<li id="bib.bib19" class="ltx_bibitem" style="padding:0em 0em 0.2em 0em;">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhang <span class="ltx_text ltx_font_italic">et al.</span> (2014)</span>
<span class="ltx_bibblock">
Zhang, J., Kobert, K., Flouri, T., Stamatakis, A. (2014) PEAR: a fast and
accurate Illumina Paired-End reAd mergeR. <span class="ltx_text ltx_font_italic">Bioinformatics</span>, <span class="ltx_text ltx_font_bold">30</span>,
614–620.

</span>
</li>
    
</ul>
</section>
<section id="A1" class="ltx_appendix">
<h2 class="ltx_title ltx_title_appendix">
<span class="ltx_tag ltx_tag_appendix">Appendix  A </span>Installation of optional programs</h2>

<section id="A1.S1" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">A.1 </span>Installation of bcl2fastq</h3>

<div id="A1.S1.p1" class="ltx_para">
<p class="ltx_p">Bcl2fastq is a program provided by Illumina that converts BCL-formatted base call data to FASTQ.
Bcl2fastq is available at
<br class="ltx_break"><a href="http://support.illumina.com/downloads/bcl2fastq_conversion_software.html" title="" class="ltx_ref ltx_href">http://support.illumina.com/downloads/bcl2fastq_conversion_software.html</a></p>
</div>
<div id="A1.S1.p2" class="ltx_para">
<p class="ltx_p">Bcl2fastq v1.8.4, which is intended for Illumina sequencing systems using RTA versions earlier than 1.18.54, can be installed as shown below.</p>
</div>
<div id="A1.S1.p3" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Install alien</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">$ sudo apt-get install alien</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Download bcl2fastq v1.8.4 from Illumina</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">$ wget \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">ftp://webdata:webdata@ussd-ftp.illumina.com/Downloads/Software/bcl2fastq/bcl2fastq-1.8.4-Linux-x86_64.rpm</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Convert .rpm to .deb and install .deb</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">$ sudo alien -i \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">bcl2fastq-1.8.4-Linux-x86_64.rpm</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Install required packages</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">$ sudo apt-get install libxml-simple-perl xsltproc</span></span>
</span>
</div>
<div id="A1.S1.p4" class="ltx_para">
<p class="ltx_p">Newer versions of Ubuntu ship a version of Perl that is incompatible with bcl2fastq v1.8.4.
To work around this problem, run the following commands.</p>
</div>
<div id="A1.S1.p5" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">$ sudo perl -i -npe \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">'s/qw\(ELAND_FASTQ_FILES_PER_PROCESS\)/\("ELAND_FASTQ_FILES_PER_PROCESS"\)/' \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">/usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment/Config.pm</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">$ sudo perl -i -npe \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">'s/qw\(ELAND_GENOME\)/\("ELAND_GENOME"\)/' \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">/usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment/Config.pm</span></span>
</span>
</div>
<div id="A1.S1.p6" class="ltx_para">
<p class="ltx_p">Alternatively, you can open the following file in a text editor, go to lines 747 and 751,
<br class="ltx_break"><span class="ltx_text ltx_font_typewriter">/usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment/Config.pm
<br class="ltx_break"></span>and replace <span class="ltx_text ltx_font_typewriter">qw(FOO)</span> with <span class="ltx_text ltx_font_typewriter">("FOO")</span>.</p>
</div>
<div id="A1.S1.p7" class="ltx_para">
<p class="ltx_p">To install bcl2fastq2 v2.20.0.422 for all Illumina sequencing systems running RTA version 1.18.54 and above, run the following commands.</p>
</div>
<div id="A1.S1.p8" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Install rpm2cpio and cpio</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; sudo apt-get install rpm2cpio cpio↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Download bcl2fastq2 v2.20.0.422 from Illumina</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; wget -O bcl2fastq2-v2-20-0-linux-x86-64.zip https://bit.ly/2NEy9qG↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Decompress zip file</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; unzip -qq bcl2fastq2-v2-20-0-linux-x86-64.zip↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Extract executable command from .rpm</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; rpm2cpio \</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">bcl2fastq2-v2.20.0.422-Linux-x86_64.rpm | cpio -id↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Install executable command</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; sudo mv usr/local/bin/bcl2fastq /usr/local/bin/↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Install the other resources</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; sudo cp -R usr/local/share/css /usr/local/share/↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; sudo cp -R usr/local/share/xsl /usr/local/share/↓</span></span>
</span>
</div>
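<div class="ltx_para">
<p class="ltx_p">After these steps, it may be worth confirming that the shell can find the executable and that the resource folders were copied. The sketch below assumes the install locations used above; <span class="ltx_text ltx_font_typewriter">check_command</span> is a hypothetical helper name introduced only for this illustration.</p>
</div>

```shell
# Report whether a command can be found on PATH (hypothetical helper).
check_command() {
    if command -v "$1" > /dev/null 2>&1; then
        printf '%s: %s\n' "$1" "$(command -v "$1")"
    else
        printf '%s: not found on PATH\n' "$1"
        return 1
    fi
}

# bcl2fastq is only found after the installation steps have been run.
check_command bcl2fastq || echo "bcl2fastq is not installed yet"

# The css and xsl folders copied in the last two installation steps.
for d in /usr/local/share/css /usr/local/share/xsl; do
    if [ -d "$d" ]; then
        echo "present: $d"
    else
        echo "missing: $d"
    fi
done
```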
</section>
</section>
<section id="A2" class="ltx_appendix">
<h2 class="ltx_title ltx_title_appendix">
<span class="ltx_tag ltx_tag_appendix">Appendix  B </span>Terminal command examples</h2>

<section id="A2.S1" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">B.1 </span>Counting sequences in a file</h3>

<div id="A2.S1.p1" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Count seqs in FASTQ</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; grep -P -c '^\+\r?\n?$' inputfile↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Count seqs in gzipped FASTQ</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; gzip -dc inputfile | grep -P -c '^\+\r?\n?$'↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Count seqs in bzip2ed FASTQ</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; bzip2 -dc inputfile | grep -P -c '^\+\r?\n?$'↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Count seqs in xzed FASTQ</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; xz -dc inputfile | grep -P -c '^\+\r?\n?$'↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Count seqs in FASTA</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; grep -P -c '^&gt;' inputfile↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Count seqs in gzipped FASTA</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; gzip -dc inputfile | grep -P -c '^&gt;'↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Count seqs in bzip2ed FASTA</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; bzip2 -dc inputfile | grep -P -c '^&gt;'↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Count seqs in xzed FASTA</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; xz -dc inputfile | grep -P -c '^&gt;'↓</span></span>
</span>
</div>
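<div class="ltx_para">
<p class="ltx_p">An alternative for FASTQ that does not depend on the content of the separator line is to divide the number of lines by four. The sketch below assumes strict four-line records (no wrapped sequence or quality lines) and chooses the decompressor from the file extension; <span class="ltx_text ltx_font_typewriter">count_fastq_records</span> is a hypothetical helper name, and the demo reads are made up.</p>
</div>

```shell
# Count FASTQ records as (number of lines) / 4.
# Assumes strict 4-line records; a non-integer result signals a malformed file.
count_fastq_records() {
    case "$1" in
        *.gz)  gzip -dc "$1" ;;
        *.bz2) bzip2 -dc "$1" ;;
        *.xz)  xz -dc "$1" ;;
        *)     cat "$1" ;;
    esac | awk 'END { print NR / 4 }'
}

# Demo on a made-up two-record file, plain and gzipped.
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\n@III\n' > "$demo_dir/demo.fastq"
gzip -c "$demo_dir/demo.fastq" > "$demo_dir/demo.fastq.gz"
count_fastq_records "$demo_dir/demo.fastq"
count_fastq_records "$demo_dir/demo.fastq.gz"
```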
</section>
<section id="A2.S2" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">B.2 </span>Viewing sequence files</h3>

<div id="A2.S2.p1" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Output contents of uncompressed file to screen</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; cat inputfile↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># View contents of uncompressed file</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; less inputfile↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Output contents of gzipped file to screen</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; gzip -dc inputfile↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># View contents of gzipped file</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; gzip -dc inputfile | less↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Extract first seq from FASTQ and output to screen</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; head -n 4 inputfile↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Extract first seq from gzipped FASTQ and output to screen</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; gzip -dc inputfile | head -n 4↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Extract seqs whose names match a regular expression from FASTQ and output to screen</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; grep -P -A 3 '^\@RegularExpressionOfSequenceName' inputfile↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Extract seqs whose names match a regular expression from gzipped FASTQ and output to screen</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; gzip -dc inputfile | grep -P -A 3 '^\@RegularExpressionOfSequenceName'↓</span></span>
</span>
</div>
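<div class="ltx_para">
<p class="ltx_p">Two properties of <span class="ltx_text ltx_font_typewriter">grep -A 3</span> are worth keeping in mind: it prints <span class="ltx_text ltx_font_typewriter">--</span> separator lines between non-adjacent matches, and it tests every line, so a quality line that happens to begin with <span class="ltx_text ltx_font_typewriter">@</span> can also match. A record-aware variant can be sketched with awk, which tests only the first line of each four-line record. This assumes strict four-line records; <span class="ltx_text ltx_font_typewriter">extract_fastq_by_name</span> is a hypothetical helper name, and the pattern uses awk's extended regular expressions rather than Perl syntax.</p>
</div>

```shell
# Print whole 4-line records whose header line matches a regular expression.
# $1: extended regular expression (awk syntax), $2: uncompressed FASTQ file
extract_fastq_by_name() {
    awk -v pattern="$1" '
        NR % 4 == 1 { keep = ($0 ~ pattern) }  # decide once per record, on the header
        keep                                   # print every line of a kept record
    ' "$2"
}

# Demo on made-up reads; the first quality line deliberately starts with "@".
demo_dir=$(mktemp -d)
printf '@sampleA_1\nACGT\n+\n@III\n@sampleB_1\nTTGA\n+\nIIII\n@sampleA_2\nGGCC\n+\nIIII\n' \
    > "$demo_dir/demo.fastq"
extract_fastq_by_name '^@sampleB' "$demo_dir/demo.fastq"
```

<div class="ltx_para">
<p class="ltx_p">For a gzipped file, the decompressed stream can be passed by giving <span class="ltx_text ltx_font_typewriter">-</span> as the file name, which awk reads as standard input.</p>
</div>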
</section>
<section id="A2.S3" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">B.3 </span>Compression and decompression</h3>

<div id="A2.S3.p1" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Decompress all .fastq.gz files in current folder</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; for f in *.fastq.gz↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">do gzip -d $f↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">done↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Decompress all .fastq.gz files in current folder and subfolders</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; for f in $(find . -name '*.fastq.gz')↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">do gzip -d $f↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">done↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Decompress all .fastq.gz files in current folder using 4 CPUs</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; ls *.fastq.gz | xargs -L 1 -P 4 gzip -d↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Decompress all .fastq.gz files in current folder and subfolders using 4 CPUs</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; find . -name '*.fastq.gz' | xargs -L 1 -P 4 gzip -d↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Compress all .fastq files in current folder</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; for f in *.fastq↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">do gzip $f↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">done↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Compress all .fastq files in current folder and subfolders</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; for f in $(find . -name '*.fastq')↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">do gzip $f↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">done↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Compress all .fastq files in current folder using 4 CPUs</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; ls *.fastq | xargs -L 1 -P 4 gzip↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Compress all .fastq files in current folder and subfolders using 4 CPUs</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; find . -name '*.fastq' | xargs -L 1 -P 4 gzip↓</span></span>
</span>
</div>
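<div class="ltx_para">
<p class="ltx_p">The loops and pipelines above split file names on whitespace, so they can fail on folders or files whose names contain spaces. A minimal sketch of a whitespace-safe variant passes NUL-terminated names from find to xargs. It assumes GNU or BSD versions of both tools, since <span class="ltx_text ltx_font_typewriter">-print0</span>, <span class="ltx_text ltx_font_typewriter">-0</span>, and <span class="ltx_text ltx_font_typewriter">-P</span> are extensions to POSIX; the folder and file names are made up for the demo.</p>
</div>

```shell
# Made-up demo tree, including names that contain spaces.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/run 1"
printf '@r1\nACGT\n+\nIIII\n' > "$demo_dir/run 1/sample a.fastq"
printf '@r2\nTTGA\n+\nIIII\n' > "$demo_dir/b.fastq"

# Compress every .fastq file below demo_dir, one gzip per file, 4 at a time.
# -print0 and -0 separate names with NUL bytes, so spaces are preserved.
find "$demo_dir" -type f -name '*.fastq' -print0 | xargs -0 -n 1 -P 4 gzip

# List the result.
find "$demo_dir" -type f | sort
```

<div class="ltx_para">
<p class="ltx_p">Decompression works the same way with <span class="ltx_text ltx_font_typewriter">-name '*.fastq.gz'</span> and <span class="ltx_text ltx_font_typewriter">gzip -d</span>.</p>
</div>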
</section>
<section id="A2.S4" class="ltx_section">
<h3 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">B.4 </span>Extraction and output of sequences</h3>

<div id="A2.S4.p1" class="ltx_para">
<span class="ltx_p ltx_framed_rectangle" style="border-color: #000000;padding-top:12pt;padding-bottom:12pt;">
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Extract first 10,000 seqs from FASTQ and save those seqs to a file</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; head -n 40000 inputfile &gt; outputfile↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Extract last 10,000 seqs from FASTQ and save those seqs to a file</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; tail -n 40000 inputfile &gt; outputfile↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Extract first 10,000 seqs from gzipped FASTQ and save those seqs to gzipped FASTQ</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; gzip -dc inputfile | head -n 40000 | gzip -c &gt; outputfile↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Extract last 10,000 seqs from gzipped FASTQ and save those seqs to gzipped FASTQ</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; gzip -dc inputfile | tail -n 40000 | gzip -c &gt; outputfile↓</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;"># Extract seqs whose names match a regular expression from gzipped FASTQ and save to gzipped FASTQ</span></span>
<span class="ltx_p ltx_align_left"><span class="ltx_text ltx_font_typewriter" style="font-size:100%;">&gt; gzip -dc inputfile | grep -P -A 3 --no-group-separator '^\@RegularExpressionOfSequenceName' | gzip -c &gt; outputfile↓</span></span>
</span>
</div>
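<div class="ltx_para">
<p class="ltx_p">The line counts in these commands are four times the number of sequences, because each FASTQ record occupies four lines (10,000 × 4 = 40,000). The arithmetic can be left to the shell, as in the sketch below. It assumes strict four-line records; <span class="ltx_text ltx_font_typewriter">fastq_head</span> and <span class="ltx_text ltx_font_typewriter">fastq_tail</span> are hypothetical helper names, and the demo reads are made up.</p>
</div>

```shell
# Print the first or last N records of an uncompressed FASTQ stream on stdin.
# Assumes strict 4-line records, so N records are N * 4 lines.
fastq_head() { head -n "$(( $1 * 4 ))"; }
fastq_tail() { tail -n "$(( $1 * 4 ))"; }

# Demo on three made-up records: keep the first two, then the last one.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' | fastq_head 2
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' | fastq_tail 1
```

<div class="ltx_para">
<p class="ltx_p">With a gzipped input, the helpers slot into the same pipelines as above, for example <span class="ltx_text ltx_font_typewriter">gzip -dc inputfile | fastq_head 10000 | gzip -c &gt; outputfile</span>.</p>
</div>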
</section>
</section><div class="ltx_rdf" about="" property="dcterms:creator" content="Akifumi S. Tanabe"></div>
<div class="ltx_rdf" about="" property="dcterms:subject" content="Metagenome"></div>
<div class="ltx_rdf" about="" property="dcterms:title" content="Metabarcoding and DNA barcoding for Ecologists: Sequence analysis"></div>

</article>
</div>
</div>
</body>
</html>