Manuel Mendoza edited this page Dec 12, 2021 · 16 revisions

Introduction

The expression of many genes is strongly influenced by sex, which introduces bias into the results of gene expression analyses and makes it necessary to sex each sample individually before sequencing. However, sex determination in some marine invertebrates requires the observation of gametes or mature gonads, which are not always available. Here we present a novel tool to determine the sex of marine bivalves based on a special mechanism of mitochondrial inheritance. Our tool separates the reads that belong to the mitogenomes from the rest of the RNA-Seq reads and then quantifies multiple metrics to infer the sex of the individual. We also implemented two additional analyses to further support the results: the first is a dimensionality reduction of the metrics quantified previously, and the second is a phylogenetic analysis of protein-coding genes extracted from the samples and from other species.

The metrics used as input for the neural network to predict the sex of the samples are the following:

  • Genome coverage: Percentage of positions covered by at least one read.
  • Mean sequencing depth: Average number of reads that support each sequence position.
  • Sequencing depth uniformity: Dispersion of sequencing depth along the sequence.

More information about these metrics can be found in the following article: Sims et al., 2014.

Example

In this case, we have a reference genome 25 bp long and four sequenced reads of 10 bp each. The alignment of these reads is shown below, with the sequencing depth of each position calculated at the bottom. The sequencing depth of each nucleotide can vary from 0 to the number of reads sequenced.

In this case, the genome coverage is 100% because every position is covered by at least one read, and the mean sequencing depth is 1.6 (the mean of the depth at each position). Finally, the uniformity of the depth was estimated using the Gini coefficient, whose value is 0.24, which indicates a relatively even depth along the sequence.

seq:	AACTTGAAACATAAGCGTGTGGCTA
r01:	AACTTGAAAC
r02:	      AAACATAAGC
r03:	       AACATAAGCG
r04:	               CGTGTGGCTA
sqd:	1111112333222223211111111
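
The three metrics for this toy alignment can be reproduced with a few lines of Python. This is only a minimal sketch and not the code used by MyToSex: the function names are hypothetical, the reads are given as (start, length) pairs with 0-based starts, and the Gini coefficient is computed with the standard rank-based formula, which is an assumption about how the uniformity is estimated.

```python
def depth_per_position(ref_length, reads):
    """Count how many reads overlap each position of the reference.

    `reads` is a list of (start, length) tuples with 0-based starts.
    """
    depth = [0] * ref_length
    for start, length in reads:
        for pos in range(start, min(start + length, ref_length)):
            depth[pos] += 1
    return depth


def genome_coverage(depth):
    """Percentage of positions covered by at least one read."""
    return 100 * sum(1 for d in depth if d > 0) / len(depth)


def mean_depth(depth):
    """Average number of reads supporting each position."""
    return sum(depth) / len(depth)


def gini(values):
    """Rank-based Gini coefficient (0 means perfectly even depth)."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if total == 0:
        return 0.0
    weighted = sum(rank * value for rank, value in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# The four 10 bp reads of the example (r01 to r04), as 0-based start offsets.
reads = [(0, 10), (6, 10), (7, 10), (15, 10)]
depth = depth_per_position(25, reads)

print("".join(map(str, depth)))    # 1111112333222223211111111
print(genome_coverage(depth))      # 100.0
print(mean_depth(depth))           # 1.6
print(round(gini(depth), 2))       # 0.24
```

The printed depth string matches the `sqd` line of the alignment, and the three values match the ones quoted in the text.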

Implementation and usage

MyToSex is an open-source tool written in Python 3 that requires multiple modules. It can be installed as follows.

# Clone the repository
git clone https://github.com/manuelsmendoza/mytosex.git

# Create the conda environment
cd mytosex && conda env create --file environment.yaml

# Add it to the PATH
export PATH=$PATH:$PWD

To use MyToSex we only need a YAML file containing the settings required to run the analysis.

# Usage example
python mytosex.py settings.yaml

These settings are grouped into five sections:

  1. Resources (Optional): Assign the computational resources (number of threads and maximum memory) to use. If this information is not provided, all the threads and all the available memory will be used.
    numb_threads: 24
    max_memory: 32
    
  2. Output directory: Path of the directory in which to store the results.
    output_dir: /abs_path/output_directory 
    
  3. Reference mitotypes: Both mitotypes (mtF and mtM) of the species being analysed. The sequences can be stored locally, in which case only the path is required, or fetched from the NCBI Nucleotide database by providing the sequence accession. In addition to the sequences themselves, the user has to provide the annotation (GFF format) and the coordinates of the CDS (BED format).
    reference:
       alias: ref_abbrev
       mtf: NCBI_ACCESSION
       mtm: /abs_path/mtm_ref
    
  4. Sample reads: Specify the sample alias and the path to the reads. The samples can be single-end or paired-end, and can be fetched from the NCBI SRA database if required.
    samples:
       sample_01:
          alias: sample_one
          accession: NCBI_ACCESSION
       sample_02:
          alias: sample_two
          forward: /abs_path/sample_two_1.fastq.gz
          reverse: /abs_path/sample_two_2.fastq.gz
       sample_03:
          alias: sample_three
          single: /abs_path/sample_three.fastq.gz
       ...
    
  5. Other species mitogenomes (Optional): Mitogenome sequences from other species, which may or may not have doubly uniparental inheritance.
    other_spp:
       specie_01:
          alias: specie_one
          mtf: NCBI_ACCESSION 
          mtm: NCBI_ACCESSION
       specie_02:
          alias: specie_two
          mt: NCBI_accession
       ...
    

We present an example of a settings file here. All the samples and sequences, including the references and the mitogenomes from other species, require an alias to identify them. The alias can be any user-friendly name for the sample or sequence. The reads and mitogenomes can be fetched from the corresponding NCBI databases if their accessions are provided (accessions are detected automatically).
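
As an illustration of how the sections fit together, the sketch below checks a settings structure before running the analysis. It is not part of MyToSex: `read_source` and `check_settings` are hypothetical helpers, the settings are written as the Python dictionary a YAML parser would typically return, and treating `output_dir`, `reference` and `samples` as required is inferred from the sections not marked as optional above.

```python
def read_source(sample):
    """Classify one entry of the `samples` section by the keys it defines."""
    if "accession" in sample:
        return "sra"
    if "forward" in sample and "reverse" in sample:
        return "paired-end"
    if "single" in sample:
        return "single-end"
    raise ValueError("sample entry defines no reads")


def check_settings(settings):
    """Return a list of problems found (empty if the settings look complete)."""
    problems = []
    # Sections assumed to be required (the ones not marked as optional).
    for section in ("output_dir", "reference", "samples"):
        if section not in settings:
            problems.append(f"missing section: {section}")
    # Every reference, sample and other-species entry needs an alias.
    entries = {}
    if "reference" in settings:
        entries["reference"] = settings["reference"]
    entries.update(settings.get("samples", {}))
    entries.update(settings.get("other_spp", {}))
    for name, entry in entries.items():
        if "alias" not in entry:
            problems.append(f"{name}: missing alias")
    return problems


# Placeholder values only, mirroring the structure of the settings file.
settings = {
    "output_dir": "/abs_path/output_directory",
    "reference": {"alias": "ref_abbrev", "mtf": "NCBI_ACCESSION", "mtm": "/abs_path/mtm_ref"},
    "samples": {
        "sample_01": {"alias": "sample_one", "accession": "NCBI_ACCESSION"},
        "sample_02": {"alias": "sample_two",
                      "forward": "/abs_path/sample_two_1.fastq.gz",
                      "reverse": "/abs_path/sample_two_2.fastq.gz"},
        "sample_03": {"alias": "sample_three", "single": "/abs_path/sample_three.fastq.gz"},
    },
}

print(check_settings(settings))
print({name: read_source(entry) for name, entry in settings["samples"].items()})
```

Running a check like this first makes a missing alias or an incomplete sample entry visible before any reads are downloaded or aligned.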

Mitogenomes annotation

We recommend using a common nomenclature for the mitochondrial genes, which is not followed by all the mitogenomes deposited in the NCBI Nucleotide database. If an accession is provided to fetch the sequence and its annotation, these names will be used by default. The abbreviations that we use for the different genes are the following:

  • alias_mt#_CYTB: Cytochrome B.
  • alias_mt#_COX#: Cytochrome C Oxidase (1, 2 and 3).
  • alias_mt#_ND#: NADH:Ubiquinone Oxidoreductase Core Subunit (1, 2, 3, 4, 5 and 6).
  • alias_mt#_ATP#: ATP Synthase Membrane Subunit (6 and 8).
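
The naming pattern can be expanded programmatically, as in the following sketch. It is illustrative only and not taken from MyToSex: `gene_names` is a hypothetical helper, and it assumes that `alias` is the alias of the sequence, that the `#` after `mt` is the mitotype letter (F or M), and that the `#` after the gene abbreviation is the subunit number listed above.

```python
# Gene abbreviations and subunit numbers from the list above
# (None marks a gene without a subunit number).
GENE_SUBUNITS = {
    "CYTB": [None],
    "COX": [1, 2, 3],
    "ND": [1, 2, 3, 4, 5, 6],
    "ATP": [6, 8],
}


def gene_names(alias, mitotype):
    """Build the gene names for one mitogenome, e.g. alias_mtF_COX1."""
    prefix = f"{alias}_mt{mitotype}"
    names = []
    for gene, subunits in GENE_SUBUNITS.items():
        for subunit in subunits:
            suffix = gene if subunit is None else f"{gene}{subunit}"
            names.append(f"{prefix}_{suffix}")
    return names


for name in gene_names("ref_abbrev", "F"):
    print(name)
```

With the alias `ref_abbrev` and mitotype `F`, this prints twelve names, from `ref_abbrev_mtF_CYTB` to `ref_abbrev_mtF_ATP8`.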

Artificial neural network training

Doubly uniparental inheritance of mitogenomes is a mechanism present in some marine mussels and clams (Orders Mytiloida and Venerida) and in freshwater mussels (Order Unionida).
