This tool simulates G2G analyses.
G2G, or genome-to-genome, analysis is a joint analysis of host and pathogen genomes that systematically tests for correlations between host and pathogen variation.
source("G2G_simulator.R")
As in the simplified G2G, a study design must be defined in terms of host populations (populations P1, P2, ...) and pathogen strain distribution (strains A, B, ...).
G2G_conf defines the G2G data structure through a composition of SNP, AA and association function calls.
G2G_conf(SNP, AA, association, ...)
G2G_conf = G2G_conf(
association(
AA(
size=1,
stratified = c("A","B"),
fst_strat = 0.2,
biased = c("P1","P2"),
fst_bias = 0.01,
beta = 0.3,
bio_tag = "Asso_Stratified_Biased_AA_PG2"),
SNP(
size=1,
stratified = c("P2","P1"),
biased = c("B","A"),
fst_strat = 0.2,
fst_bias = 0.016,
bio_tag = "Stratified_biased_SNP"),
replicate = 100),
association(
AA(
size=1,
beta = 0.3,
bio_tag = "Asso_Unstratified_AA"),
SNP(
size=1,
bio_tag = "Unstratified_SNP"),
replicate = 100),
AA(
size=100,
stratified = c("A","B"),
biased = c("P1","P2"),
fst_strat = 0.2,
fst_bias = 0.005,
bio_tag = "Stratified_biased_AA"),
SNP(
size=100,
stratified = c("P1","P2"),
biased = c("A","B"),
fst_strat = 0.2,
fst_bias = 0.016,
bio_tag = "Stratified_biased_SNP"),
AA(
size=100,
stratified = c("A","B"),
fst_strat = 0.2,
bio_tag = "Stratified_AA"),
SNP(
size=10000,
stratified = c("P1","P2"),
fst_strat = 0.2,
bio_tag = "Stratified_SNP"),
SNP(
size = 40000,
bio_tag = "Unstratified_SNP")))
- SNP, fun: SNP function call
  - description: defines one or more SNPs. SNPs correspond to variants on the host side
  - Usage: SNP(size, stratified = NA, fst_strat = NA, biased = NA, fst_bias = NA, bio_tag = NA)
  - Arguments:
    - size, int: the number of SNPs
    - stratified, vector of strings: host population groups; the order gives the direction of the stratification, from higher MAF to lower MAF
    - fst_strat, numeric: the fixation coefficient that sets the magnitude of the stratification defined by stratified
    - biased, vector of strings: adds a bias such that pathogen strains are associated with host stratification (regardless of the defined populations). The order gives the direction from higher MAF to lower MAF
    - fst_bias, numeric: the fixation coefficient that sets the magnitude of the stratification defined by biased
- AA, fun: AA function call
  - description: defines one or more AAs. AAs (amino acids) correspond to variants on the pathogen side
  - Usage: AA(size, stratified = NA, fst_strat = NA, biased = NA, fst_bias = NA, beta = NA, bio_tag = NA)
  - Arguments:
    - size, int: the number of pathogen variants
    - stratified, vector of strings: pathogen strains; the order gives the direction of the stratification, from higher MAF to lower MAF
    - fst_strat, numeric: the fixation coefficient that sets the magnitude of the stratification defined by stratified
    - biased, vector of strings: adds a bias such that host populations are associated with pathogen stratification (regardless of the defined pathogen strains). The order gives the direction from higher MAF to lower MAF
    - fst_bias, numeric: the fixation coefficient that sets the magnitude of the stratification defined by biased
    - beta, numeric: in case of association (and therefore inside an association() function call, see below), the log odds ratio
- association, fun: association function call
  - description: defines an association between one or more SNPs and one or more AAs
  - Usage: association(SNP, AA, replicate)
  - Arguments:
    - SNP, fun: the outcome of a SNP function call; its number of SNPs defines how many are associated with the AA function call
    - AA, fun: the outcome of an AA function call; its number of AAs defines how many are associated with the SNP function call
    - replicate, int: the number of times such an association is added
- ..., fun: other AA, SNP or association function calls
- bio_tag, string: a tag that will be added to the generated dataset.
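The README does not say how the fixation coefficients and beta are turned into simulated genotypes. As a rough illustration only, the sketch below assumes a Balding-Nichols model for Fst-driven allele frequencies and treats beta as a shift on the logit scale. The helpers `draw_stratified_freqs` and `shift_probability`, and the ancestral frequency of 0.3, are hypothetical and are not part of G2G_simulator.R.

```r
# Hypothetical helper (not from G2G_simulator.R): draw one allele frequency per
# group under the Balding-Nichols model, then sort so that the first-listed
# group receives the highest frequency (the "direction" of the stratification).
draw_stratified_freqs <- function(groups, fst, ancestral = 0.3) {
  shape1 <- ancestral * (1 - fst) / fst
  shape2 <- (1 - ancestral) * (1 - fst) / fst
  freqs <- sort(rbeta(length(groups), shape1, shape2), decreasing = TRUE)
  setNames(freqs, groups)
}

# Hypothetical helper: apply a log odds ratio (beta) to a baseline probability.
shift_probability <- function(p0, beta) {
  plogis(qlogis(p0) + beta)
}

odds <- function(p) p / (1 - p)

set.seed(42)
freqs <- draw_stratified_freqs(c("P1", "P2"), fst = 0.2)

# With beta = 0.3, the odds of the pathogen variant are multiplied by exp(0.3).
p_baseline <- 0.2
p_shifted <- shift_probability(p_baseline, beta = 0.3)
odds_ratio <- odds(p_shifted) / odds(p_baseline)
```

A larger fst spreads the group frequencies further apart on average, which is one way to read "stratification magnitude"; the simulator's actual model may differ.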
get_study_design defines the host populations and pathogen strains distributions
get_study_design(structure)
study_design = get_study_design(list(
`P1` = c(`A` = 250, `B` = 250),
`P2` = c(`A` = 250, `B` = 250)))
structure, list of named vectors of named ints: defines the study design, with the host populations P1 and P2 and, for each, the number of samples carrying pathogen strains A and B
e.g.: here each host population has the same number of samples (500), 250 with strain A and 250 with strain B
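To make the structure argument concrete, the following base R sketch expands the same nested list into one row per sample. It does not call get_study_design; the names `design_list`, `counts` and `samples` are illustrative only.

```r
# The same design as above, written as a plain named list.
design_list <- list(
  P1 = c(A = 250, B = 250),
  P2 = c(A = 250, B = 250))

# Rows are host populations, columns are pathogen strains.
design_matrix <- do.call(rbind, design_list)

# Long format: one row per (population, strain) cell with its sample count.
counts <- as.data.frame(as.table(design_matrix))
names(counts) <- c("population", "strain", "n")

# One row per simulated sample.
samples <- counts[rep(seq_len(nrow(counts)), counts$n), c("population", "strain")]

# Cross-tabulating the samples recovers the design.
table(samples$population, samples$strain)
```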
get_G2G_data generates the G2G data
get_G2G_data(study_design, G2G_conf)
G2G_data = get_G2G_data(
study_design,
G2G_conf)
study_design, fun: get_study_design function call
G2G_conf, fun: G2G_conf function call
analyse_G2G runs the G2G analysis
analyse_G2G(G2G_data, correction, nb_cpu = 40)
analyse_G2G(G2G_data,
get_correction(WO_correction = T, W_host_PC = T, W_pathogen_group = T, W_pathogen_groups_host_PC = T),
nb_cpu = 40)
- G2G_data: get_G2G_data function call
- correction, fun: get_correction function call
  - description: defines the series of corrections to assess
  - Usage: get_correction(WO_correction = F, W_host_PC = F, W_pathogen_group = F, W_pathogen_groups_host_PC = F)
  - Arguments:
    - WO_correction, bool: no correction
    - W_pathogen_group, bool: with pathogen strains
    - W_host_PC, bool: with the first 5 PCs from SNP data (imputed human groups)
    - W_pathogen_groups_host_PC, bool: with the first 5 PCs from host data and pathogen strains
- nb_cpu, int: number of available CPUs to use
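The correction options can be pictured as covariates added to a per-pair association model. The sketch below is an assumption about the general approach, not the code used by analyse_G2G: it simulates a small genotype matrix with made-up allele frequencies, extracts the first 5 principal components with prcomp, and fits a logistic regression with and without them.

```r
set.seed(1)
n <- 200   # samples (illustrative size)
m <- 50    # host SNPs (illustrative size)
pop <- rep(c("P1", "P2"), each = n / 2)

# Made-up allele frequencies that differ between the two host populations.
freq <- ifelse(pop == "P1", 0.4, 0.2)
geno <- matrix(rbinom(n * m, 2, rep(freq, times = m)), nrow = n)

# A pathogen variant whose frequency depends on host population only.
aa <- rbinom(n, 1, ifelse(pop == "P1", 0.6, 0.3))

# First 5 principal components of the host genotypes.
pcs <- prcomp(geno)$x[, 1:5]

snp <- geno[, 1]
fit_raw <- glm(aa ~ snp, family = binomial)        # without correction
fit_pc  <- glm(aa ~ snp + pcs, family = binomial)  # with host PCs

p_raw <- coef(summary(fit_raw))["snp", "Pr(>|z|)"]
p_pc  <- coef(summary(fit_pc))["snp", "Pr(>|z|)"]
```

Comparing `p_raw` and `p_pc` over many SNPs is one way to see whether a correction absorbs the population signal.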
See here for the results visualization
study_design = get_study_design(structure = list(
`P1` = c(`A` = 1500, `B` = 1000),
`P2` = c(`A` = 1000, `B` = 1500)))
get_study_design defines the host and pathogen structure
get_study_design(structure)
structure, list of named vectors of named ints: defines the study design, with the host populations P1 and P2 and, for each, the number of samples carrying strains A and B
e.g.: here each host population has the same number of samples (2500), but in P1 1500 samples have strain A and 1000 have strain B, and conversely in P2 1000 samples have strain A and 1500 have strain B.
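A quick base R check, independent of the simulator, shows why this design matters: strain is no longer independent of host population. The variable names below are illustrative.

```r
# The unbalanced design from the example above, as a 2 x 2 count matrix.
design_matrix <- rbind(
  P1 = c(A = 1500, B = 1000),
  P2 = c(A = 1000, B = 1500))

# Samples per host population.
per_population <- rowSums(design_matrix)

# Strain proportions within each host population.
strain_props <- prop.table(design_matrix, margin = 1)

# Chi-squared test of independence between host population and strain.
independence_test <- chisq.test(design_matrix)
independence_test$p.value
```

Any variant that tracks host population will therefore also tend to track pathogen strain, which is the confounding the simplified G2G is meant to expose.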
G2G_setup = get_G2G_setup(rep = 1000,
s_stratified = c("P1","P2"),
s_biased = c("A","B"),
a_stratified = c("A","B"))
get_G2G_setup specifies the stratification direction
get_G2G_setup(rep, s_stratified = NA, s_biased = NA, a_stratified = NA, a_biased = NA)
rep, int: the number of repetitions to run in order to draw the p-value distribution.
s_stratified, vector of strings: host population groups; the order gives the direction of the stratification, from higher MAF to lower MAF
e.g.: here the minor allele frequency (MAF) will be higher in population P1 than in population P2.
s_biased, vector of strings: adds a bias such that pathogen strains are associated with host stratification (regardless of the defined sub-population groups). The order gives the direction from higher MAF to lower MAF
e.g.: here the MAF will be higher for hosts carrying strain A than for hosts carrying strain B. In conclusion, the MAF decreases from a maximum for P1 with strain A (P1.A) through P1.B and P2.A to P2.B
Similarly for the variants on the pathogen side...
a_stratified, vector of strings: pathogen strains; the order gives the direction of the stratification, from higher MAF to lower MAF
a_biased, vector of strings: adds a bias such that host population groups are associated with pathogen stratification (regardless of the defined pathogen strains). The order gives the direction from higher MAF to lower MAF
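The ordering P1.A, P1.B, P2.A, P2.B described for the host side can be derived mechanically from the two direction vectors. This is only a sketch of that bookkeeping in base R; how get_G2G_setup represents it internally is not documented here.

```r
s_stratified <- c("P1", "P2")  # host populations, higher MAF first
s_biased <- c("A", "B")        # pathogen strains, higher MAF first

# expand.grid varies its first argument fastest, so strains cycle within
# each population, giving the order from highest to lowest MAF.
groups <- expand.grid(strain = s_biased, population = s_stratified)
maf_order <- paste(groups$population, groups$strain, sep = ".")
maf_order
```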
test_G2G_setup(study_design, G2G_setup,
fst_host_strat = 0.2,
fst_host_bias = 0.2,
fst_pathogen_strat = 0.2,
tag = 'demo')
test_G2G_setup runs the simplified G2G
test_G2G_setup(study_design, G2G_setup, fst_host_strat = NA, fst_host_bias = NA, fst_pathogen_strat = NA, fst_pathogen_bias=NA, tag = 'unnamed')
study_design, fun: get_study_design function call
G2G_setup, fun: get_G2G_setup function call
fst_host_strat, numeric: the fixation coefficient that sets the magnitude of the stratification defined by s_stratified
fst_host_bias, numeric: the fixation coefficient that sets the magnitude of the stratification defined by s_biased
fst_pathogen_strat, numeric: the fixation coefficient that sets the magnitude of the stratification defined by a_stratified
fst_pathogen_bias, numeric: the fixation coefficient that sets the magnitude of the stratification defined by a_biased
tag, string: name of the folder in which results are saved
The results are automatically plotted in the tag folder.
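To illustrate what "drawing the p-value distribution" over repetitions means, here is a self-contained sketch that repeats a SNP-versus-AA logistic regression on independently simulated data. The sample size, allele frequencies and the use of glm are assumptions made for the example; test_G2G_setup may use a different test.

```r
set.seed(1)
n_rep <- 200     # number of repetitions (the README example uses rep = 1000)
n_samples <- 500 # illustrative sample size

pvals <- replicate(n_rep, {
  snp <- rbinom(n_samples, 2, 0.3)  # host genotype, no stratification
  aa <- rbinom(n_samples, 1, 0.4)   # pathogen variant, independent of snp
  fit <- glm(aa ~ snp, family = binomial)
  coef(summary(fit))["snp", "Pr(>|z|)"]
})

# Share of repetitions below the nominal 5% threshold.
false_positive_rate <- mean(pvals < 0.05)
```

With no stratification or bias the p-values should look roughly uniform; `hist(pvals)` gives a quick visual check.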
my_population = generate_population_for_GWAS(list(
`P1` = c(`case` = 200, `control` = 400),
`P2` = c(`case` = 400, `control` = 200)))
Here we want two sub-populations P1 and P2.
- From P1, 200 individuals are in case group and 400 in control group.
- From P2, 400 individuals are in case group and 200 in control group.
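The confounding built into this design can be quantified with a plain 2 x 2 table. The sketch below uses base R only and does not call generate_population_for_GWAS.

```r
# Case/control counts per sub-population, as in the example above.
population_counts <- rbind(
  P1 = c(case = 200, control = 400),
  P2 = c(case = 400, control = 200))

# Proportion of cases within each sub-population.
case_rate <- population_counts[, "case"] / rowSums(population_counts)

# Odds ratio of being a case in P1 relative to P2.
odds_ratio <- (population_counts["P1", "case"] * population_counts["P2", "control"]) /
  (population_counts["P1", "control"] * population_counts["P2", "case"])
odds_ratio
```

Because case status depends on sub-population, any SNP whose frequency differs between P1 and P2 can appear associated with the phenotype unless a correction is applied.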
GWAS_result = GWAS_scenario(populations = my_population,
neutral = 100000,
neutral_S_rate = 0.05,
causal_NS = seq(1,2, by = 0.05),
causal_S = seq(1,2, by = 0.05),
fst_strat = 0.2)
Here we want 100,042 SNPs, neutral and causal, stratified or not
- The number of neutral SNPs is 100,000
- Of these, 5% will be stratified
- 21 non-stratified causal SNPs will be added, with R coefficients between 1 and 2 (seq(1, 2, by = 0.05) yields 21 values)
- 21 stratified causal SNPs will be added, with R coefficients between 1 and 2
- The fixation coefficient setting the stratification strength is 0.2
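The SNP counts implied by the call can be checked directly in base R. The sketch assumes that GWAS_scenario adds one causal SNP per element of causal_NS and causal_S, which this README suggests but does not state outright.

```r
neutral <- 100000
neutral_S_rate <- 0.05
causal_NS <- seq(1, 2, by = 0.05)
causal_S <- seq(1, 2, by = 0.05)

# Number of neutral SNPs that are stratified.
n_neutral_stratified <- neutral * neutral_S_rate

# Number of causal SNPs in each class (one per coefficient, by assumption).
n_causal_NS <- length(causal_NS)
n_causal_S <- length(causal_S)

n_total <- neutral + n_causal_NS + n_causal_S
```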
Plot the results under 3 different conditions:
- Without correction
- With human groups
- With the first 5 PCs
On Manhattan plots
plot_GWAS_manhattan(GWAS_result)
On QQ plots
plot_GWAS_QQ(GWAS_result)
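For readers unfamiliar with QQ plots of p-values, the sketch below computes the coordinates such a plot usually shows, plus the genomic inflation factor. It is a generic illustration on simulated p-values; `qq_points` and `inflation_factor` are hypothetical helpers and say nothing about how plot_GWAS_QQ is implemented.

```r
# Hypothetical helper: expected vs observed -log10(p) pairs for a QQ plot.
qq_points <- function(pvals) {
  pvals <- sort(pvals)
  data.frame(
    expected = -log10(ppoints(length(pvals))),
    observed = -log10(pvals))
}

# Hypothetical helper: genomic inflation factor (lambda), the ratio of the
# median observed 1-df chi-squared statistic to its expected median.
inflation_factor <- function(pvals) {
  median(qchisq(pvals, df = 1, lower.tail = FALSE)) / qchisq(0.5, df = 1)
}

set.seed(1)
null_pvals <- runif(1000)  # p-values under no association and no confounding
qq <- qq_points(null_pvals)
lambda <- inflation_factor(null_pvals)
```

Calling `plot(qq$expected, qq$observed); abline(0, 1)` draws the plot; points rising above the diagonal across the whole range typically indicate uncorrected stratification.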