-
Notifications
You must be signed in to change notification settings - Fork 0
Home
There are many genes whose expression is strongly influenced by the sex, introducing some bias in gene expression analysis and making necessary the individual sexation of each sample previously to sequencing. However, the sex determination in some marine invertebrates involved the observation of gametes or mature gonads, which are not always available, so here we present a novel tool to determine the sex of marine bivalves based on a special mechanism of mitochondria inheritance. Our tool filters the reads that belong to the mitogenomes from the rest of RNA-Seq reads and then quantifies multiple metrics to infer the sex of the individual sex. Furthermore, we also implemented additional analysis to bring more support to the results. The first is based on dimensional reduction of multiple metrics quantified previously and, the other is a phylogenetic analysis of protein-coding genes extracted from the samples and other species.
The metrics that we use to predict the sex of the samples as input for the neural network are the following:
- Genome coverage: Percentage of positions covered by at least one read.
- Mean sequencing depth: Number of reads that support each sequence position (in average).
- Sequencing depth uniformity: Dispersion of sequencing depth along the sequence.
More information about these metrics are discussed in the following article: Sims et al., 2014.
In this case, we have a genome of reference with 25bp of length, and we have sequenced four reads of 10bp of length. The alignment of these reads is showed below. In the bottom is calculated the sequencing depth of each position. The sequencing depth of each nucleotide can vary from 0 to the number of reads sequenced.
In this case, the genome coverage was 100% because all the positions are covered by at least one read, and the sequencing depth in average was 1.6 (the mean os depth at each position). Finally, the depth uniformity was estimated using the Gini coefficient, whose value was 0.24; that indicates a relative equality of the coverage.
seq: AACTTGAAACATAAGCGTGTGGCTA
r01: AACTTGAAAC
r02: AAACATAAGC
r03: AACATAAGCG
r04: CGTGTGGCTA
sqd: 1111112333222223211111111
MyToSex
is an open-source tools written in Python3 that requires multiple modules.
This tool can be installed as follows.
# Clone the repository
git clone https://github.com/manuelsmendoza/mytosex.git
# Create the conda environment
cd mytosex && conda create --name mytosex_test --file environment.txt
# Activate the environment
conda activate mytosex_test
To use MyToSex, create a file containing the information required for the analysis and pass it as a positional argument as we have specified below. To see more information about the settings file, check it in the wiki.
python3 mytosex.py settings.yaml
These values can be classified in five different statements:
-
Resources (Optional): Assign the computational resources (number of threads and maximum memory). If this
information is not provided, all the threads and the whole memory will be employed.
numb_threads: 24 max_memory: 32
-
Output directory: The directory path to store the result.
output_dir: /abs_path/output_directory
-
Mitotypes of references: Both mitotypes (mtF and mtM) from the specie that we are analysing.
This information can be stored locally so, only the path is required; but it also can be fetched from the NCBI
Nucleotide database, providing the sequence accession. In addition to the sequences itself, the user have to
provide the annotation (gff format) and the coordinates
of the CDS (bed format).
reference: alias: ref_abbrev mtf: NCBI_ACCESSION mtm: /abs_path/mtm_ref
-
Samples reads: Specify the sample alias and the path to the reads. The samples can be single-end or paired-end
and can ben fetched from the NCBI SRA database if required.
samples: sample_01: alias: sample_one accession: NCBI_ACCESSION sample_02: alias: sample_two forward: /abs_path/sample_one_1.fastq.gz reverse: /abs_path/sample_one_2.fastq.gz sample_03: alias: sample_three single: /abs_path/sample_three.fastq.gz ...
-
Other species mitogenomes (Optional): Sequence of other mitogenomes from different species, these species can
have or not double uniparental inheritance also, or not.
other_spp: specie_01: alias: specie_one mtf: NCBI_ACCESSION mtm: NCBI_ACCESSION specie_02: alias: specie_two mt: NCBI_accession ...
We present an example of settings file here. All the samples and sequences, including other species mitogenomes and the references, requires an alias to identify them. It may be a user-friendly name to call the sample or sequence. The reads and mitogenomes can be fetched from the different NCBI databases is their accession are provided (them are detected automatically).
We recommend use a common nomenclature for the mitogenes which is not implemented in all the mitogenomes deposited in the NCBI Nucleotide database, if the accession is provided to fetch the sequence and its annotation, these names will be used by default. The abbreviations that use for the different genes are the following:
- alias_mt#_CYTB: Cytochrome B.
- alias_mt#_COX#: Cytochrome C Oxidase (1, 2 and 3).
- alias_mt#_ND#: NADH:Ubiquinone Oxidoreductase Core Subunit (1, 2, 3, 4, 5 and 6).
- alias_mt#_ATP#: ATP Synthase Membrane Subunit (6 and 8).
The double uniparental inheritance of mitogenomes is a mechanism present in some marine mussels and clams. (from Orders Mytiloida and Venerida) and freshwater mussels (Order Unionida). We found information (HTS reads) from four species: manila clam (Ruditapes philippinarum), grooved carpet-shell clam (Ruditapes decussatus), blue mussel (Mytilus edulis) and Asian green mussel (Perna viridis) after filtering by taxonomy, individual samples with sex information and sequencing strategy. Only a couple of these species have both mitogenomes sequenced completely, i.e. M. edulis and R. phillipinarum so we used them to train and test the artificial neural network.
There was a huge difference in the number of reads sequenced for each species. The libraries of R. phillipinarum have 1.9M of reads. By contrast, the samples of M. edulis had 48.3M. This fact made it impossible to perform the phylogenetic analysis of manila clam genes because there were too few genes to assemble the complete coding sequences.
The output generate has the following strcutre:
mytosex_result
|-- data
| |-- align_stats.tsv
| |-- msa
`-- results.tsv
-
data (directory): A directory containing information about the alignments stats and the gene-trees.
- align_stats.tsv (file): Summary of the alignment statistics used for the prediction using the neural network.
sample mtfcov mtmcov mtfmd mtmmd mtfgi mtmgi MedF45 96.5781 33.8203 7141.78 17.58 0.5593 0.9425 MedF49 99.3192 86.6763 8889.41 74.20 0.4935 0.8512 MedF52 99.6656 33.6640 9342.07 24.19 0.5026 0.9458 MedM31 98.3637 93.9153 3596.86 2107.04 0.6348 0.8801 MedM37 98.9907 98.4007 2539.29 5712.72 0.6593 0.4964 MedM40 97.7784 85.7744 2782.13 2128.21 0.6374 0.8808
- msa (directory): The results of the phylogenetic analysis i.e. the sequences of the genes (
geneName.fasta
) and the phylogenetic tree in newick format (geneName.fasta.raxml.bestTreeCollapsed
).((((MedF49_mtm_COX1:0.008557,(MedM37_mtm_COX1:0.005141,(med_mtm_COX1:0.007449,MedF49_mtm_COX1.1:0.005022):0.003096):0.002674):0.006137,((mtr_mtm_COX1:0.282718,((mtr_mtf_COX1:0.141422,((cgi_mt_COX1:13.836890,mca_mtf_COX1:0.000001):0.136354,((mse_mtm_COX1:0.283001,mse_mtf_COX1:0.349553):3.468455,mca_mtm_COX1:0.405446):0.336197):0.184282):0.043661,(MedM31_mtf_COX1:0.000001,MedM37_mtf_COX1:0.000001,(mga_mtf_COX1:0.007411,(MedF49_mtf_COX1:0.001381,(MedM40_mtf_COX1:0.000671,(med_mtf_COX1:0.000670,(MedF52_mtf_COX1:0.000001,MedF45_mtf_COX1:0.000001):0.002000):0.002702):0.001986):0.000626):0.002060):0.140925):0.153585):0.181751,mga_mtm_COX1:0.059057):0.078585):0.013026,MedF49_mtm_COX1.2:0.000001):0.004609,MedM40_mtm_COX1:0.003872,MedF45_mtm_COX1:0.000001,MedF52_mtm_COX1:0.000001,MedM31_mtm_COX1:0.001732);
- align_stats.tsv (file): Summary of the alignment statistics used for the prediction using the neural network.
- results.tsv (file): It is a summary containinig the name of the sample, the prediction of the sex, the sexual index and the components of UMAP.
sample sex sindex umapx umapy
MedF45 Female 0.3502 6.3447 -4.2188
MedF49 Female 0.8727 7.4416 -3.2908
MedF52 Female 0.3378 7.3536 -4.1654
MedM31 Male 0.9548 5.7352 -2.5487
MedM37 Male 0.9940 6.4773 -2.4543
MedM40 Male 0.8772 6.3211 -3.2528
# Attach the packages
library(readr)
library(ggplot2)
library(treeio)
library(ggtree)
library(ape)
library(patchwork)
# Load the data to plot
result_tbl <- read_delim(file = "~/Desktop/mytosex_result/results.tsv", delim = "\t", col_types = "cfddd")
tree_text <- "(((mtr_mtf_ATP6:0.133345,(mca_mtf_ATP6:0.246735,(MedM40_mtf_ATP6:0.005459,MedF49_mtf_ATP6:0.001354,(MedF45_mtf_ATP6:0.020092,med_mtf_ATP6:0.002737,((mse_mtf_ATP6:0.504083,mse_mtm_ATP6:0.122829):3.315627,rph_mtf_ATP6:8.503156):1.680810,MedF52_mtf_ATP6:0.000001,MedF45_mtf_ATP6.1:0.000001):0.004096,(mga_mtf_ATP6:0.002767,(MedM37_mtf_ATP6:0.000001,MedM31_mtf_ATP6:0.000001):0.002779):0.001342):0.085134):0.040679):0.041445,mca_mtm_ATP6:0.907819):0.066568,mtr_mtm_ATP6:0.188816,(cgi_mt_ATP6:12.670376,(((((med_mtm_ATP6:0.007078,(MedF49_mtm_ATP6:0.000001,MedM31_mtm_ATP6.1:0.001900):0.004279,MedM37_mtm_ATP6:0.000001):0.020253,MedM40_mtm_ATP6:0.000001):0.006472,MedM40_mtm_ATP6.1:0.000001):0.003552,MedM31_mtm_ATP6:0.002806):0.070388,mga_mtm_ATP6:0.040895):0.055934):0.079141);"
tree <- read.tree(text = tree_text)
# Plot the sexual index
sindex_plot <- ggplot(data = result_tbl) +
geom_rect(mapping = aes(xmin = -Inf, xmax = Inf, ymin = 0.6, ymax = 0.9), fill = "lightgrey", alpha = 0.1) +
geom_boxplot(mapping = aes(x = sex, y = sindex, fill = sex), alpha = 0.25) +
geom_point(mapping = aes(x = sex, y = sindex, fill = sex), shape = 21, size = 2.5, position = position_jitter(width = 0.25)) +
ggtitle(label = "A") +
theme(panel.background = element_blank(),
panel.border = element_rect(fill = NA, colour = "black", size = 0.5),
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.title.x = element_blank(),
legend.position = "none")
# Plot the UMAP result
umap_plot <- ggplot(data = result_tbl) +
geom_point(mapping = aes(x = umapx, y = umapy, fill = sex), shape = 21, size = 2.5) +
ggtitle(label = "B") +
theme(panel.background = element_blank(),
panel.border = element_rect(fill = NA, colour = "black", size = 0.5),
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title = element_blank(),
legend.key = element_blank(),
legend.title = element_blank())
# Plot ATP6 genetree
genetree_plot <- ggplot(tree, aes(x, y)) +
ggtitle(label = "C") +
geom_tree() +
geom_tiplab(size = 3) +
geom_treescale(linesize = 0.5) +
theme_tree()
# Make a pane with all the plots
sindex_plot + umap_plot + genetree_plot + plot_layout(ncol = 3, nrow = 1, widths = c(0.4, 1, 1.6), guides = "collect")