Skip to content

Names Joining into Ordered Database with R for MJOLNIR3 pipeline

License

Notifications You must be signed in to change notification settings

adriantich/NJORDR-MJOLNIR3

Repository files navigation

NJORDR-MJOLNIR3

Names Joining into Ordered Database with R for MJOLNIR3 pipeline

This project to create a Reference DataBase for THOR-MJOLNIR3 function to perform the taxonomic assignment of metabarcoding Data.

This project is intended to create a pipeline extrapolable to any barcode but we provide both the databases and the scripts to create them for the following barcodes:

  • COI
    • Leray-XT segment of the COI marker.
      • DUFA v202305

        This database is created with sequences from BOLD, NCBI and local sequences of four different projects. Initial files contain the following sequences

        • DUFA -> 174,544 sequences
        • AWI -> 306 sequences
        • CESC -> 764 sequences
        • NIS -> 6,008 sequences

        The total number of sequences was 181,622 sequences. However, after cleaning steps, it ended up in 148,932 sequences.

        https://drive.google.com/drive/folders/1ZFKwXjAlhUy-6ChN5hqGjM6LLUi9JPNn?usp=drive_link

        Instructions: download the full TAXO_DUFA directory and set the following parameters from mjolnir5_THOR function to:

        • tax_dir = 'PATH/TO/DIRECTORY/TAXO_DUFA'
        • tax_dms_name = 'COI_NJORDR'
        • minimum_circle = 0.7
      • NJODRD_COI -> Under developing

  • 18S
    • v7 with All_shorts primers (Guardiola et al. 2015)
      • NJORDR_18Sv7 -> Under developing
    • v1v2 with primers SSU_F04 (Blaxter et al., 1998) and SSURmod (Sinniger et al., 2016)
      • NJORDR_18Sv1v2 -> Future project
  • 12S
    • MiFish fragment with MiFish primers (Miya et al. 2015)
      • NJORDR_12S -> Future project

The NJORDR Workflow

This is a simplified scheme of the NJORDR workflow:

DESCRIPTION

The forkflow of the NJORDR pipeline consists in 5 steps descrived below. However, as NJORDR is intended to be a package to create reference databases for different markers, there are some steps that will be standard for all markers but the first part of the step-0 and the step-2 will require special adaptation to each case.

STEP-0 - To obtain a formated sequence database and the phylogenetic tree.

In this step the final goal is to obtain a uniq .tsv file that contains the taxonomic information of the database based on the tax_id and parent_tax_id to assign each sequence to a branch of the phylogenetic tree. The whole taxonomic tree is then used both for the following steps to create the Taxdump object (The object with all the tree in the required format dor obitools3) and to complete or correct the information in our database.
  • To obtain the taxonomic tree you can create it or download it directly (recomended). NJORDR will follow the format of mkCOInr taxonomy.tsv file to only use an standardised intitial file. This format also allows to distinguish between tax_id's that are in the NCBI and the new tax_id's that are not present (at this point, negative tax_id's). You can download the taxonomy.csv from drive or as specified in mkCOInr.

    cd NJORDR-MJOLNIR3
    mkdir TAXONOMY_TREE
    cd TAXONOMY_TREE
    wget https://zenodo.org/record/6555985/files/COInr_2022_05_06.tar.gz
    tar -zxvf COInr_2022_05_06.tar.gz
    rm COInr_2022_05_06.tar.gz
    mv COInr_2022_05_06/taxonomy.tsv .
    rm -r COInr_2022_05_06
    
  • To format the prefered database. This point consists in two separate substeps.

    • The first, NJORDR0.1, has to be adjusted to each database but can be done also manually. See readme of each one of them for more detail. This process consist on creating a csv file with the following columns seq_id / name_txt / tax_id / parent_tax_id / rank / sequence. This fields can be empty for some sequences but don't worry, during the following substep this will be completed. However, be aware that seq_id must be a unique identifier for each sequence.

    • The second, NJORDR0.2, is a standard script for all markers. Here, if one sequence has the tax_id but misses the name_txt, the parent_tax_id or the rank, those will be completed. Also if the tax_id has been deprecated, it will be updated to the version of the taxonomy.tsv file. If the tax_id is missing, this script will use the name to find the tax_id in the taxonomic tree or will create a new tax_id and attach it to the genus tax_id. If no match is found, then the output file will have the sufix _to_correct.tsv and it will require checking for correct_manually tags and correct them manually. At the end is possible that some rows contains only the fields name_txt / tax_id / parent_tax_id / rank; these are intermediate taxonomic levels.

      # help
      # Syntax: Rscript NJORDR0.2_complete_taxonomy.R [-t] [taxonomy_input] [-i] [sequence_input] [-c] [cores] [-o] [sequence_output]
      -t --taxonomy_input    taxonomy tab file with colnames [tax_id	parent_tax_id	rank	name_txt	old_tax_id	taxlevel	synonyms]
      -i --sequence_input    name of the input file for the sequences in tsv format to be completed
      -c --cores             number or cores to run the process in parallel. 1 default
      -o --sequence_output   name of the output file for the sequences in tsv format
      

STEP-1 - To obtain two separated files, the sequence file and the taxdump.

This step is the number one because from here on, the database will be build on the obitools3 format but no more sequences will be added. This means that the previous step can be seen as the standarisation of the reference database into a single file and the folloring steps as the ones to build the database to be used. Until new updates, all changes to the database will have to be done in the step-0 and then run the following four.
  • NJORDR1 will take the database_taxonomy_complete.tsv file and the taxonomy.tsv file with the phylogenetic tree and will create the sequence_db.tsv, with only the sequences, their seq_id and their tax_id, and the Taxdump object, this contains four files required by obitools3 to build the taxonomy tree.

    # help
    # Syntax: Rscript NJORDR1_split.R [-s] [sequence_input] [-t] [taxonomy_input] [-T] [taxdump_output] [-n] [new_taxids]
    -s --sequence_input      sequence tab file in NJORDR format.
                                    colnames: seq_id name_txt tax_id parent_tax_id parent_name_txt(optional) rank sequence
    -t --taxonomy_input      taxonomy tab file with colnames [tax_id	parent_tax_id	rank	name_txt	old_tax_id	taxlevel	synonyms]
    -T --taxdump_output      output taxdump
    -n --new_taxids          Number from which start the notation for new taxids. Since new taxids (not represented in NCBI) are set as negative,
                                    they must be turned into positive and to be sure they do not have the same number as an existing taxid, they
                                    will be higher than this value. However, the highest taxid cannot exced 2,147,483,647 (obitools3 limitations)
    

STEP-2 - To select the region of the fragment.

This step is still in development and is not working properly. It has to have special inputs depending on the primers used.
  • NJORDR2 has as input the sequence_db.tsv file and will retrieve a trimmed version of it, sequence_db_trimmed.tsv

    #help
    #Syntax: bash NJORDR_3_select_region.sh [-h] [help] [-s] [scripts_dir] [-f] [forward] [-r] [reverse] [-c] [sequences] [-d] [out_dir] [-m] [min_amplicon_length] [-M] [max_amplicon_length] [-e] [eco_pcr]
    -h --help	  Print this Help.
    -s --scripts_dir Path to the mkCOInr scripts directory: <PATH>/mkCOInr/scripts
    -f --forward	  Forward primer
    -r --reverse	  Reverse primer
    -c --sequences	File with sequences for the database
    -d --out_dir	  Optional, directory path of the output files. If not specified, ouput files will be "
               		  printed in the current directory
    -m		          Minimum length for an amplicon after trimming. 299 default.
    --min_amplicon_length
    -M		          Maximum length for an amplicon after trimming
    --max_amplicon_length
    -e		          perform eco pcr
    --eco_pcr
    

STEP-3 - To select the region of the fragment.

The intention in this step is to transform the sequence_db_trimmed.tsv into a fasta file. In addition, this step is suposed to remove the duplicated sequences after trimming if they belong to the same tax_id, otherwise they are kept separately. However, this step has to be updated. Rigth now has an extra step in which the database is reduced for those cases in which the size of the database was to big to be run efficiently by the step-4. Untill future updates, this step is necessary, otherwise the step-4 can last many weeks. Rigth now it also creates the Taxdump output but this is no longer necessary since it is done in the step-1.
  • NJORDR3 has as input the sequence_db_trimmed.tsv and returns a reduced and format final_sequence_db.fasta file

     -s --sequence_input    sequence tab file with colnames [seqID	taxID	sequence]"),
     -t --taxonomy_input    taxonomy tab file with colnames [tax_id	parent_tax_id	rank	name_txt	old_tax_id	taxlevel	synonyms]"),
     -c --cores             number or cores to run the process in parallel. 1 default"),
     -f --fasta_output      output fasta file reduced"),
     -T --taxdump_output    output taxdump"),
     -n --seqs_taxids       maximum number of sequences per taxid"),
     -x --new_taxids        Number from which start the notation for new taxids. Since new taxids (not represented in NCBI) are set as negative, 
                                 they must be turned into positive and to be sure they do not have the same number as an existing taxid, 
                                 they will be higher than this value. However, the highest taxid cannot exced 2,147,483,647.
     -k --keep_tag          Initial pattern for which a sequence must be retained. If Additional sequences are meant to be retained, they must 
                                 have this pattern at the beggining of the sequence Id. The default pattern is 'NS_'
    

STEP-4 - To create the obidms object for taxonomy assignament with ecotag.

In this last step, the object obidms is created. This step is not finished yet but it will be soon.

INSTALLATION

NJORDR runs on linux using conda environment. Most of the code is in R and bash. To install the depencies follow the next steps:

  • Create a conda environment (click here to see how to install miniconda). If you have installed MJOLNIR3 following the tutorial, you can use the same environment.

    # with conda previously installed
    # Create conda enviroment
    conda create -n mjolnir3 python=3.9
    # to activate the environment tipe
    conda activate mjolnir3
    
  • Clone NJORDR repository

    git clone https://github.com/adriantich/NJORDR-MJOLNIR3.git
    
  • Install mkCOInr and dependencies

    cd NJORDR-MJOLNIR3
    python3 -m pip install cutadapt
    conda install -c bioconda blast -y
    conda install -c bioconda vsearch -y
    pip install nsdpy
    git clone https://github.com/meglecz/mkCOInr.git
    

Previous versions:

https://drive.google.com/drive/folders/1GUmPjNlBobIv4OMdB1i1Yf0_HKVkN0T6?usp=share_link

  • v20210723 -> 139,082 sequences
  • DUFA_COLR_20210723 -> 140,638 sequences

WARNING! This project is under developing. In future versions it is expected to be a Tutorial to run the scripts for any barcode

About

Names Joining into Ordered Database with R for MJOLNIR3 pipeline

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published