CoverM_Test.Rmd

---
title: "CoverM_Test"
output: html_document
date: "2023-08-17"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Background

This script is for understanding the "relative_abundance" calculation method of CoverM more clearly. For various calculation methods of CoverM, refer to <https://github.com/wwood/CoverM#calculation-methods>.

What I want to calculate below is the relative abundance of genome A (my focal genome).\
Specifically, I'd like to know what would happen to the relative abundance of genome A, if I include other genomes with variable sizes that recruit variable number of metagenome reads.

## Analysis

```{r, message = FALSE}
library(tidyverse)
```

### Metagenome

Set the total number of reads = 1000.\
Set the length of each read = 100 bp. To make it simple, assume that all reads have the same length and are from single-end sequencing.

```{r metagenome, message = FALSE}
tot_reads <- 1000 
read_len <- 100 
```

### Genomes

#### Genome A (my focus)

Genome size = 1000 bp\
The number of mapped reads = 10 reads\
To make it simple, assume that all matches are perfect; all mapped reads are aligned to genomes across their whole length with 100% sequence identity.\
"mean" is calculated according to the CoverM formula. For simplicity, end problem is not considered.

```{r focal_genome, message = FALSE}
size_A <- 1000 
mapped_reads_A <- 10 
mapped_bases_A <- mapped_reads_A * read_len 
mean_A <- mapped_bases_A / size_A
```

#### Genomes B, C, and D (other genomes)

Set the size of other genomes.\
Three genomes with different sizes will be used here to test the effects of the size of other genomes.\

-   B is smaller than A.
-   C is equal to A.
-   D is larger than A.

The number of metagenome reads mapped to the other genomes is set to change from 0 to 50.\
This number will be used as a variable for x-axis in the plotting.\
The same range is used for the three genomes.\

```{r other_genome, message = FALSE}
size_B <- 500
size_C <- 1000
size_D <- 2000

mapped_reads_other <- 0:50
```

### Calculate the "relative_abundance" of genomes

Three genome sets will be assumed, each including Genome A and one other genome (B, C, or D).\
"relative_abundance" of genomes will be calculated according to the CoverM formula.\

```{r relative_abundance, message = FALSE}
coverm <- data.frame(mapped_reads_other) %>%
  mutate(mapped_bases_other = mapped_reads_other * read_len,
         tot_mapped_reads = mapped_reads_A + mapped_reads_other,
         mean_B = mapped_bases_other / size_B,
         mean_C = mapped_bases_other / size_C,
         mean_D = mapped_bases_other / size_D,
         relcov_A_with_B = mean_A / (mean_A + mean_B),
         relcov_B_with_A = mean_B / (mean_A + mean_B),
         relcov_A_with_C = mean_A / (mean_A + mean_C),
         relcov_C_with_A = mean_C / (mean_A + mean_C),
         relcov_A_with_D = mean_A / (mean_A + mean_D),
         relcov_D_with_A = mean_D / (mean_A + mean_D),
         relabu_A_with_B = relcov_A_with_B * tot_mapped_reads / tot_reads,
         relabu_B_with_A = relcov_B_with_A * tot_mapped_reads / tot_reads,
         relabu_A_with_C = relcov_A_with_C * tot_mapped_reads / tot_reads,
         relabu_C_with_A = relcov_C_with_A * tot_mapped_reads / tot_reads,
         relabu_A_with_D = relcov_A_with_D * tot_mapped_reads / tot_reads,
         relabu_D_with_A = relcov_D_with_A * tot_mapped_reads / tot_reads)
```

### Plotting (focusing on genome A)

Create a data frame for plotting the relative abundance of genome A based on the above calculation.

```{r data_for_plotting_only_A, message = FALSE}
coverm_A <- coverm %>%
  select(mapped_reads_other, starts_with("relabu_A")) %>% 
  rename_with(str_replace, 
              pattern = "relabu_A_", replacement = "") %>% 
  pivot_longer(-mapped_reads_other, names_to = "Other_Genome", values_to = "Rel_Abu_A")
```

Draw a plot.\
x-axis shows the number of reads mapped to the other genomes (B, C, or D).\
Therefore, "x=0" means that no other genomes are included.\
y-axis shows the "relative_abundance" of genome A with the other three genomes of variable sizes.\
Colors correspond to the other genomes.

```{r plotting_only_A, message = FALSE}
ggplot(coverm_A, aes(x = mapped_reads_other, y = Rel_Abu_A)) +
  geom_point(aes(color = Other_Genome))
```

It is clear that the relative abundance of genome A is affected by the presence of other genomes unless the other genome has the same size as A.\
If the other genome is smaller than the focal genome ("with_B"), "relative_abundance" of the focal genome decreases.\
If the other genome is larger than the focal genome ("with_D"), "relative_abundance" of the focal genome increases.\
Further, the number of reads mapped to the other genomes also have effects.

### Plotting (focusing on both genomes comprising the three genome sets: A+B, A+C, or A+D)

Create the three data frames, each corresponding to a genome set.

```{r data_for_plotting_both, message = FALSE}
coverm_AB <- coverm %>%
  select(mapped_reads_other, relabu_A_with_B, relabu_B_with_A) %>% 
  rename_with(str_replace, pattern = "_with.*", replacement = "") %>% 
  mutate(relabu_sum = relabu_A + relabu_B) %>% 
  pivot_longer(-mapped_reads_other, names_to = "Genome", values_to = "Rel_Abu")

coverm_AC <- coverm %>%
  select(mapped_reads_other, relabu_A_with_C, relabu_C_with_A) %>% 
  rename_with(str_replace, pattern = "_with.*", replacement = "") %>% 
  mutate(relabu_sum = relabu_A + relabu_C) %>%
  pivot_longer(-mapped_reads_other, names_to = "Genome", values_to = "Rel_Abu")

coverm_AD <- coverm %>%
  select(mapped_reads_other, relabu_A_with_D, relabu_D_with_A) %>% 
  rename_with(str_replace, pattern = "_with.*", replacement = "") %>% 
  mutate(relabu_sum = relabu_A + relabu_D) %>%
  pivot_longer(-mapped_reads_other, names_to = "Genome", values_to = "Rel_Abu")
```

Draw plots.\
y-axis shows the "relative_abundance" of genome A and the other genomes (B, C, or D; variable sizes.)\
Sum of the two relative abundances are also plotted.

```{r plotting, message = FALSE}
ggplot(coverm_AB, aes(x = mapped_reads_other, y = Rel_Abu)) +
  geom_point(aes(color = Genome))

ggplot(coverm_AC, aes(x = mapped_reads_other, y = Rel_Abu)) +
  geom_point(aes(color = Genome))

ggplot(coverm_AD, aes(x = mapped_reads_other, y = Rel_Abu)) +
  geom_point(aes(color = Genome))
```

In all cases, the sum of relative abundance corresponds exactly to the total number of mapped reads (A & other genome).\

The relative abundance of each genome is determined by their genome sizes in addition to the number of mapped reads.\

I think this calculation method (relative abundance) makes sense. With the sum reflecting the total number of mapped reads faithfully, relative abundance of each genome is calculated by taking genome sizes into consideration.


### RPKM for metagenomics
But, in some cases, the reference genome sets are constructed a little arbitrarily.\
E.g., when I have to analyze environmental distribution of my new marine bacterial isolate, I may want to include some well-known genomes (e.g., SAR11) in analyses just for comparison, to get more clear insights on environmental prevalence of my isolate. In this case, I can decide arbitrarily what and how many other genomes I'll include.\
As shown above, this decision will impact "relative abundance" of my isolate, with the direction (increase/decrease) and degree putatively dependent on "what and how many other genomes".

Therefore, if we want to analyze environmental distribution of focal genomes without the effect of other genomes included in the analyses for comparison purpose, then "relative abundance" may be not suitable.\

I think that RPKM for metagenomics can be a solution in these cases.\
Here, RPKM is defined as "the number of mapped reads / genome size (in Kbp) / number of metagenome reads used for mapping (in million reads).\
\
Note that this RPKM is different from the "rpkm" method available in CoverM, which follows a formula widely used in RNA-seq studies.\
As far as I know, in RNA-seq studies and the "rpkm" in CoverM, "M" in RPKM stands for **the total number of reads mapped to the reference genome(s)**. Reads that are not mapped to the reference genome(s) are disregarded.\

But, here, "M" stands for **the total number of reads that were used for mapping, irrespective of whether the reads were mapped to the reference genome(s) or not**.

Create a data frame of RPKM of the genomes.\
Then, make another data frame for plotting.
```{r rpkm_data, message = FALSE}
rpkm <- data.frame(mapped_reads_other) %>%
  mutate(rpkm_A = mapped_reads_A / (size_A / 1000) / (tot_reads / 1000000),
         rpkm_B = mapped_reads_other / (size_B / 1000) / (tot_reads / 1000000),
         rpkm_C = mapped_reads_other / (size_C / 1000) / (tot_reads / 1000000),
         rpkm_D = mapped_reads_other / (size_D / 1000) / (tot_reads / 1000000))

rpkm_plot <- rpkm %>%
  rename_with(str_replace, pattern = "rpkm_", replacement = "") %>%
  pivot_longer(-mapped_reads_other, names_to = "Genome", values_to = "RPKM")
```

Draw a plot.
```{r rpkm_plot, message = FALSE}
ggplot(rpkm_plot, aes(x = mapped_reads_other, y = RPKM)) +
  geom_point(aes(color = Genome))  
```

RPKM value of any genomes, including genome A, are not affected by the other genomes.
RPKM values are affected only by the genome sizes and and the number of reads mapped to each genome.

## Conclusion

1.  "relative_abundance" of a genome can be affected by the inclusion of another genome in the analysis.
2.  If the other genome is smaller than the focal genome, "relative_abundance" of the focal genome decreases.
3.  If the other genome is larger than the focal genome, "relative_abundance" of the focal genome increases.
4.  The degree of increase/decrease is dependent on the number of reads mapped to the other genomes.
5. The sum of "relative_abundance" reflects the number of reads mapped to any of reference genome(s) exactly.
6. If you want a metric that is not affected by other genomes, then you may consider calculating RPKM for metagenomics, where "M" is the total number of reads used for mapping.  To this end, you may want to run CoverM using "count" or "reads_per_base" as calculation options. In this case, normalization should be performed manually.