Skip to content
Manuel Mendoza edited this page Jan 16, 2022 · 16 revisions

Introduction

There are many genes whose expression is strongly influenced by the sex, introducing some bias in gene expression analysis and making necessary the individual sexation of each sample previously to sequencing. However, the sex determination in some marine invertebrates involved the observation of gametes or mature gonads, which are not always available, so here we present a novel tool to determine the sex of marine bivalves based on a special mechanism of mitochondria inheritance. Our tool filters the reads that belong to the mitogenomes from the rest of RNA-Seq reads and then quantifies multiple metrics to infer the sex of the individual sex. Furthermore, we also implemented additional analysis to bring more support to the results. The first is based on dimensional reduction of multiple metrics quantified previously and, the other is a phylogenetic analysis of protein-coding genes extracted from the samples and other species.

The metrics that we use to predict the sex of the samples as input for the neural network are the following:

  • Genome coverage: Percentage of positions covered by at least one read.
  • Mean sequencing depth: Number of reads that support each sequence position (in average).
  • Sequencing depth uniformity: Dispersion of sequencing depth along the sequence.

More information about these metrics are discussed in the following article: Sims et al., 2014.

Example

In this case, we have a genome of reference with 25bp of length, and we have sequenced four reads of 10bp of length. The alignment of these reads is showed below. In the bottom is calculated the sequencing depth of each position. The sequencing depth of each nucleotide can vary from 0 to the number of reads sequenced.

In this case, the genome coverage was 100% because all the positions are covered by at least one read, and the sequencing depth in average was 1.6 (the mean os depth at each position). Finally, the depth uniformity was estimated using the Gini coefficient, whose value was 0.24; that indicates a relative equality of the coverage.

seq:	AACTTGAAACATAAGCGTGTGGCTA
r01:	AACTTGAAAC
r02:	      AAACATAAGC
r03:	       AACATAAGCG
r04:	               CGTGTGGCTA
sqd:	1111112333222223211111111

Implementation and usage

MyToSex is an open-source tools written in Python3 that requires multiple modules. This tool can be installed as follows.

# Clone the repository
git clone https://github.com/manuelsmendoza/mytosex.git

# Create the conda environment
cd mytosex && conda create --name mytosex_test --file environment.txt 

# Activate the environment
conda activate mytosex_test 

To use MyToSex, create a file containing the information required for the analysis and pass it as a positional argument as we have specified below. To see more information about the settings file, check it in the wiki.

python3 mytosex.py settings.yaml

Settings file

These values can be classified in five different statements:

  1. Resources (Optional): Assign the computational resources (number of threads and maximum memory). If this information is not provided, all the threads and the whole memory will be employed.
    numb_threads: 24
    max_memory: 32
    
  2. Output directory: The directory path to store the result.
    output_dir: /abs_path/output_directory 
    
  3. Mitotypes of references: Both mitotypes (mtF and mtM) from the specie that we are analysing. This information can be stored locally so, only the path is required; but it also can be fetched from the NCBI Nucleotide database, providing the sequence accession. In addition to the sequences itself, the user have to provide the annotation (gff format) and the coordinates of the CDS (bed format).
    reference:
       alias: ref_abbrev
       mtf: NCBI_ACCESSION
       mtm: /abs_path/mtm_ref
    
  4. Samples reads: Specify the sample alias and the path to the reads. The samples can be single-end or paired-end and can ben fetched from the NCBI SRA database if required.
    samples:
       sample_01:
          alias: sample_one
          accession: NCBI_ACCESSION
       sample_02:
          alias: sample_two
          forward: /abs_path/sample_one_1.fastq.gz
          reverse: /abs_path/sample_one_2.fastq.gz
       sample_03:
          alias: sample_three
          single: /abs_path/sample_three.fastq.gz
       ...
    
  5. Other species mitogenomes (Optional): Sequence of other mitogenomes from different species, these species can have or not double uniparental inheritance also, or not.
    other_spp:
       specie_01:
          alias: specie_one
          mtf: NCBI_ACCESSION 
          mtm: NCBI_ACCESSION
       specie_02:
          alias: specie_two
          mt: NCBI_accession
       ...
    

We present an example of settings file here. All the samples and sequences, including other species mitogenomes and the references, requires an alias to identify them. It may be a user-friendly name to call the sample or sequence. The reads and mitogenomes can be fetched from the different NCBI databases is their accession are provided (them are detected automatically).

Mitogenomes annotation

We recommend use a common nomenclature for the mitogenes which is not implemented in all the mitogenomes deposited in the NCBI Nucleotide database, if the accession is provided to fetch the sequence and its annotation, these names will be used by default. The abbreviations that use for the different genes are the following:

  • alias_mt#_CYTB: Cytochrome B.
  • alias_mt#_COX#: Cytochrome C Oxidase (1, 2 and 3).
  • alias_mt#_ND#: NADH:Ubiquinone Oxidoreductase Core Subunit (1, 2, 3, 4, 5 and 6).
  • alias_mt#_ATP#: ATP Synthase Membrane Subunit (6 and 8).

Artificial neural network

The double uniparental inheritance of mitogenomes is a mechanism present in some marine mussels and clams. (from Orders Mytiloida and Venerida) and freshwater mussels (Order Unionida). We found information (HTS reads) from four species: manila clam (Ruditapes philippinarum), grooved carpet-shell clam (Ruditapes decussatus), blue mussel (Mytilus edulis) and Asian green mussel (Perna viridis) after filtering by taxonomy, individual samples with sex information and sequencing strategy. Only a couple of these species have both mitogenomes sequenced completely, i.e. M. edulis and R. phillipinarum so we used them to train and test the artificial neural network.

There was a huge difference in the number of reads sequenced for each species. The libraries of R. phillipinarum have 1.9M of reads. By contrast, the samples of M. edulis had 48.3M. This fact made it impossible to perform the phylogenetic analysis of manila clam genes because there were too few genes to assemble the complete coding sequences.

Analysis output

The output generate has the following strcutre:

mytosex_result
|-- data
|  |-- align_stats.tsv
|  |-- msa
`-- results.tsv
  • data (directory): A directory containing information about the alignments stats and the gene-trees.
    • align_stats.tsv (file): Summary of the alignment statistics used for the prediction using the neural network.
      sample	mtfcov	mtmcov	mtfmd	mtmmd	mtfgi	mtmgi
      MedF45	96.5781	33.8203	7141.78	17.58	0.5593	0.9425
      MedF49	99.3192	86.6763	8889.41	74.20	0.4935	0.8512
      MedF52	99.6656	33.6640	9342.07	24.19	0.5026	0.9458
      MedM31	98.3637	93.9153	3596.86	2107.04	0.6348	0.8801
      MedM37	98.9907	98.4007	2539.29	5712.72	0.6593	0.4964
      MedM40	97.7784	85.7744	2782.13	2128.21	0.6374	0.8808
      
    • msa (directory): The results of the phylogenetic analysis i.e. the sequences of the genes (geneName.fasta) and the phylogenetic tree in newick format (geneName.fasta.raxml.bestTreeCollapsed).
      ((((MedF49_mtm_COX1:0.008557,(MedM37_mtm_COX1:0.005141,(med_mtm_COX1:0.007449,MedF49_mtm_COX1.1:0.005022):0.003096):0.002674):0.006137,((mtr_mtm_COX1:0.282718,((mtr_mtf_COX1:0.141422,((cgi_mt_COX1:13.836890,mca_mtf_COX1:0.000001):0.136354,((mse_mtm_COX1:0.283001,mse_mtf_COX1:0.349553):3.468455,mca_mtm_COX1:0.405446):0.336197):0.184282):0.043661,(MedM31_mtf_COX1:0.000001,MedM37_mtf_COX1:0.000001,(mga_mtf_COX1:0.007411,(MedF49_mtf_COX1:0.001381,(MedM40_mtf_COX1:0.000671,(med_mtf_COX1:0.000670,(MedF52_mtf_COX1:0.000001,MedF45_mtf_COX1:0.000001):0.002000):0.002702):0.001986):0.000626):0.002060):0.140925):0.153585):0.181751,mga_mtm_COX1:0.059057):0.078585):0.013026,MedF49_mtm_COX1.2:0.000001):0.004609,MedM40_mtm_COX1:0.003872,MedF45_mtm_COX1:0.000001,MedF52_mtm_COX1:0.000001,MedM31_mtm_COX1:0.001732);
      
  • results.tsv (file): It is a summary containinig the name of the sample, the prediction of the sex, the sexual index and the components of UMAP.
sample	sex	sindex	umapx	umapy
MedF45	Female	0.3502	6.3447	-4.2188
MedF49	Female	0.8727	7.4416	-3.2908
MedF52	Female	0.3378	7.3536	-4.1654
MedM31	Male	0.9548	5.7352	-2.5487
MedM37	Male	0.9940	6.4773	-2.4543
MedM40	Male	0.8772	6.3211	-3.2528

Plotting the output

# Attach the packages
library(readr)
library(ggplot2)
library(treeio)
library(ggtree)
library(ape)
library(patchwork)

# Load the data to plot
result_tbl <- read_delim(file = "~/Desktop/mytosex_result/results.tsv", delim = "\t", col_types = "cfddd")
tree_text <- "(((mtr_mtf_ATP6:0.133345,(mca_mtf_ATP6:0.246735,(MedM40_mtf_ATP6:0.005459,MedF49_mtf_ATP6:0.001354,(MedF45_mtf_ATP6:0.020092,med_mtf_ATP6:0.002737,((mse_mtf_ATP6:0.504083,mse_mtm_ATP6:0.122829):3.315627,rph_mtf_ATP6:8.503156):1.680810,MedF52_mtf_ATP6:0.000001,MedF45_mtf_ATP6.1:0.000001):0.004096,(mga_mtf_ATP6:0.002767,(MedM37_mtf_ATP6:0.000001,MedM31_mtf_ATP6:0.000001):0.002779):0.001342):0.085134):0.040679):0.041445,mca_mtm_ATP6:0.907819):0.066568,mtr_mtm_ATP6:0.188816,(cgi_mt_ATP6:12.670376,(((((med_mtm_ATP6:0.007078,(MedF49_mtm_ATP6:0.000001,MedM31_mtm_ATP6.1:0.001900):0.004279,MedM37_mtm_ATP6:0.000001):0.020253,MedM40_mtm_ATP6:0.000001):0.006472,MedM40_mtm_ATP6.1:0.000001):0.003552,MedM31_mtm_ATP6:0.002806):0.070388,mga_mtm_ATP6:0.040895):0.055934):0.079141);"
tree <- read.tree(text = tree_text)

# Plot the sexual index
sindex_plot <- ggplot(data = result_tbl) +
  geom_rect(mapping = aes(xmin = -Inf, xmax = Inf, ymin = 0.6, ymax = 0.9), fill = "lightgrey", alpha = 0.1) +
  geom_boxplot(mapping = aes(x = sex, y = sindex, fill = sex), alpha = 0.25) +
  geom_point(mapping = aes(x = sex, y = sindex, fill = sex), shape = 21, size = 2.5, position = position_jitter(width = 0.25)) +
  ggtitle(label = "A") +
  theme(panel.background = element_blank(), 
        panel.border = element_rect(fill = NA, colour = "black", size = 0.5),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        legend.position = "none")

# Plot the UMAP result
umap_plot <- ggplot(data = result_tbl) +
  geom_point(mapping = aes(x = umapx, y = umapy, fill = sex), shape = 21, size = 2.5) +
  ggtitle(label = "B") +
  theme(panel.background = element_blank(), 
        panel.border = element_rect(fill = NA, colour = "black", size = 0.5), 
        axis.ticks = element_blank(), 
        axis.text = element_blank(),
        axis.title = element_blank(), 
        legend.key = element_blank(), 
        legend.title = element_blank())

# Plot ATP6 genetree
genetree_plot <- ggplot(tree, aes(x, y)) + 
  ggtitle(label = "C") +
  geom_tree() + 
  geom_tiplab(size = 3) + 
  geom_treescale(linesize = 0.5) + 
  theme_tree()


# Make a pane with all the plots
sindex_plot + umap_plot + genetree_plot + plot_layout(ncol = 3, nrow = 1, widths = c(0.4, 1, 1.6), guides = "collect")