code for producing figures in bayNorm
# Purpose of this repository

The main purpose of this repository is to provide the analysis procedure used in the paper.
Source code of bayNorm can be found here
This paper involves the following 8 studies:
- Klein study (https://www.cell.com/cell/abstract/S0092-8674%2815%2900500-0)
- Grün study (https://www.nature.com/articles/nmeth.2930)
- Torre study (https://www.cell.com/cell-systems/abstract/S2405-4712(18)30051-6)
- Bacher study (https://www.nature.com/articles/nmeth.4263)
- Islam study (https://www.ncbi.nlm.nih.gov/pubmed/21543516)
- Soumillon study (https://www.biorxiv.org/content/early/2014/03/05/003236)
- Tung study (https://www.nature.com/articles/srep39921)
- Patel study (http://science.sciencemag.org/content/344/6190/1396)
There are 4 simulated datasets with DE genes (code in `\Simulations\SIM_DE`). Each of them consists of two groups of cells, with 100 cells in each group. 2000 out of 10000 genes were simulated to be DE genes in the first group, and half of the 2000 genes were upregulated.
- SIM DE I: mean capture efficiency $\langle\beta\rangle = 10\%$ for the two groups.
- SIM DE II: mean capture efficiency $\langle\beta\rangle = 5\% \text{ and } 10\%$ for the two groups respectively.
- SIM DE III: mean capture efficiency $\langle\beta\rangle = 10\% \text{ and } 5\%$ for the two groups respectively.
- SIM DE IV: mean capture efficiency $\langle\beta\rangle = 5\% \text{ and } 5\%$ for the two groups respectively.
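To make the role of the mean capture efficiency concrete, here is a minimal sketch. It is not the code in `\Simulations\SIM_DE`. It assumes that each observed count is a binomial thinning of a true count with a cell-specific capture efficiency, and that the cell-specific efficiencies scatter uniformly around the group mean. The function names, the uniform spread and the toy true counts are all hypothetical.

```python
import random


def draw_capture_efficiencies(n_cells, mean_beta, spread=0.5, rng=random):
    """Draw one capture efficiency per cell, uniform around mean_beta (assumed)."""
    low = mean_beta * (1 - spread)
    high = mean_beta * (1 + spread)
    return [rng.uniform(low, high) for _ in range(n_cells)]


def binomial_thin(true_count, beta, rng=random):
    """Keep each of true_count molecules independently with probability beta."""
    return sum(1 for _ in range(true_count) if rng.random() < beta)


def simulate_group(true_counts, n_cells, mean_beta, rng=random):
    """Return a genes x cells matrix (list of rows) of observed counts."""
    betas = draw_capture_efficiencies(n_cells, mean_beta, rng=rng)
    return [[binomial_thin(count, beta, rng) for beta in betas]
            for count in true_counts]


rng = random.Random(1)
true_counts = [50, 200, 1000]  # toy true counts for 3 genes (hypothetical)

# SIM DE II-like setting: mean capture efficiency 5% and 10% for the two groups.
group1 = simulate_group(true_counts, n_cells=100, mean_beta=0.05, rng=rng)
group2 = simulate_group(true_counts, n_cells=100, mean_beta=0.10, rng=rng)
```

In this sketch the two groups share identical true counts, yet the observed counts in the lower-efficiency group are roughly half those in the other group. This is the kind of technical difference that a normalization method has to separate from true differential expression.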
There are another 2 simulated datasets without DE genes (code in `\Simulations\SIM_noDE`). Mean capture efficiency
- SIM Bacher I: Parameters were estimated from Klein study.
- SIM Bacher II: Parameters were estimated from H1_P24 cells from Bacher study.

The figures associated with each dataset are:

- Klein study: Fig 1b-e, Fig 3a-b; Fig S2, S8a-b.
- Grün study: Fig 2a, c, e and g; Fig S11a-b, S12-S13.
- Torre study: Fig 2b, d, f and h; Fig S6, S8e-f, S10a, S11c, S14.
- Bacher study: Fig S7, S9a-d, S10e, S16, S19a, S23a.
- Islam study: Fig 3c; Fig S9e-f, S23b.
- Soumillon study: Fig 3d; Fig S21.
- Tung study: Fig 4; Fig S3-S5, S8c-d, S10b-d, S25-S26.
- Patel study: Fig S10f.
- SIM DE I: Fig S15a, e, i, S20c-d, S22a, S24a, S27-S29.
- SIM DE II: Fig S15b, f, j, S20c-d, S22b, S24b, S27-S29.
- SIM DE III: Fig S15c, g, k, S20c-d, S22c, S24c, S27-S29.
- SIM DE IV: Fig S15d, h, l, S20c-d, S22d, S24d, S27-S29.
- SIM Bacher I: Fig S17, S19b, S20a-b.
- SIM Bacher II: Fig S18, S19c.

Notes on running the code:

- You cannot run all the code at once. The paths in each R file need to be modified to match your setup.
- The normalization and DE detection can take a long time, depending on the size of the raw data, so run the code step by step to avoid bugs.
- Useful functions are stored in the folder `\Functions`; some of them need to be loaded in advance.
- The normalization method `DCA` is developed in Python. The Jupyter Notebooks for running DCA are stored in the folder `\DCA`. Make sure to run the DCA normalization and the corresponding DE detection first, and then feed the DCA-normalized data into the other R files.
- Some R files need several `.RData` files as input and also output `.RData` files that are used elsewhere. Hence make sure the first step is completed, so that the necessary `.RData` files exist to begin with.

1. Klein study: firstly, run `\RealData\Klein_study\Klein_bayNorm.R` (output `Klein_bayNorm.RData`).
2. Grün study: run `LOAD_Grun_smFISH.R` (output `smFISH_norm_load.RData`), `LOAD_Grun_2i.R` (output `Grun_2014_RAW.RData`) and `LOAD_Grun_serum.R` (output `Grun_2014_RAW_serum.RData`). Then run `Grun_2i_norms.R` (output `Grun_2i_norms.RData`) and `Grun_serum_norms.R` (output `Grun_serum_norms.RData`) for normalizing data. Note that the other method, DCA, needs to be run separately.
3. Torre study: run `Load_Torre.R` (output `Load_Torre.RData`). Then run `Torre_many_normalizations.R` (output `Torre_many_normalizations.RData`) for normalizing data.
4. Bacher study: run `LOAD_Bacher.R` (output `RAW_INITIATE.RData`) to load the H1 and H9 datasets. Then run `H1_many_normalizations.R` (output `H1_many_normalizations.RData`) and `H9_many_normalizations.R` (output `H9_many_normalizations.RData`) respectively.
5. Islam study: run `Load_Islam.R` (output `Load_Islam.RData`). Then run `Islam_many_normalizations.R` (output `Islam_many_normalizations.RData`).
6. Soumillon study: run `LOAD_Soumillon.R` (output `Soumillon_2014.RData`). Then run `Soumillon_norms.R` (output `Soumillon_analysis.RData`).
7. Tung study: run `Load_Tung.R` (output `Load_Tung.RData`). Then run `Tung_many_normalizations.R` (output `Tung_norms.RData`).
8. Patel study: run `Load_Patel.R` (output `Patel2014_bay_out.RData`).
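
Because later scripts rely on `.RData` files written by earlier ones, it can help to check which outputs already exist before running anything. The sketch below is a hypothetical helper and not part of the repository. The `data_dir` argument and the function name are assumptions, and only some of the script/output pairs from the steps for the individual studies are included.

```python
import tempfile
from pathlib import Path

# Script -> expected output, copied from the per-study steps (partial list).
EXPECTED_OUTPUTS = {
    "Klein_bayNorm.R": "Klein_bayNorm.RData",
    "Load_Torre.R": "Load_Torre.RData",
    "Torre_many_normalizations.R": "Torre_many_normalizations.RData",
    "Load_Islam.R": "Load_Islam.RData",
    "Islam_many_normalizations.R": "Islam_many_normalizations.RData",
    "Load_Tung.R": "Load_Tung.RData",
    "Tung_many_normalizations.R": "Tung_norms.RData",
}


def scripts_still_to_run(data_dir, expected=EXPECTED_OUTPUTS):
    """Return the scripts whose expected output file is not yet in data_dir."""
    data_dir = Path(data_dir)
    return [script for script, output in expected.items()
            if not (data_dir / output).is_file()]


# Demonstration in a throwaway directory where only the Klein output exists.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Klein_bayNorm.RData").touch()
    remaining = scripts_still_to_run(tmp)
```

In practice you would call `scripts_still_to_run` on the directory where your modified paths write the `.RData` files. The helper only checks that a file exists, not that the run which produced it finished correctly.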

Firstly, we need to estimate parameters from the real data. The relevant code is stored in `\bayNorm_papercode\Figure1`.

- For the Klein dataset, if you have completed step 1 as shown above, then `Klein_bayNorm.RData` stores the parameters you need. `Klein_bayNorm.RData` is needed in SIM DE I-IV and SIM Bacher I.
- For the Bacher dataset (H1_P24), run the section named `REAL DATA 6: Bacher study (H1_P24 cells)` in the file `Simulations_realdata.R`, which outputs `H1p24_bay_sim_allgene.RData`, used in SIM Bacher II.

The code for the simulations with DE genes is stored in `\Simulations\SIM_DE`:

- SIM DE I: run `DE_sim_01_01.R` (output `SIM_1.RData` and `GG_SIM_1.RData`).
- SIM DE II: run `SIM_005_01.R` (output `SIM_005_01.RData` and `GG_SIM_005_01.RData`).
- SIM DE III: run `SIM_01_005.R` (output `SIM_01_005.RData` and `GG_SIM_01_005.RData`).
- SIM DE IV: run `SIM_005_005.r` (output `SIM_005_005.RData` and `GG_SIM_005_005.RData`).

The code for the simulations without DE genes is stored in `\Simulations\SIM_noDE`:

- SIM Bacher I: run `SIM_noDE_01_005.R` (output `SIM_noDE_01_005.RData`).
- SIM Bacher II: run `SIM_noDE_01_005_H1.R` (output `SIM_noDE_01_005_H1.RData`).
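
The dependencies between the parameter-estimation step and the simulations can be written down explicitly. The following sketch is an illustration only. It encodes the dependencies stated in this README (the Klein parameters feed SIM DE I-IV and SIM Bacher I, and the H1_P24 parameters feed SIM Bacher II) as a graph, then uses Python's standard `graphlib` module (Python 3.9+) to derive one valid run order. Using script names as node labels is a choice made for this example.

```python
from graphlib import TopologicalSorter

# Each key is a script; its value is the set of scripts that must run first.
DEPENDS_ON = {
    "DE_sim_01_01.R": {"Klein_bayNorm.R"},               # SIM DE I
    "SIM_005_01.R": {"Klein_bayNorm.R"},                 # SIM DE II
    "SIM_01_005.R": {"Klein_bayNorm.R"},                 # SIM DE III
    "SIM_005_005.r": {"Klein_bayNorm.R"},                # SIM DE IV
    "SIM_noDE_01_005.R": {"Klein_bayNorm.R"},            # SIM Bacher I
    "SIM_noDE_01_005_H1.R": {"Simulations_realdata.R"},  # SIM Bacher II
}

# static_order() yields every script after all of its prerequisites.
run_order = list(TopologicalSorter(DEPENDS_ON).static_order())
```

Any order returned this way places `Klein_bayNorm.R` before the five simulations that need `Klein_bayNorm.RData`, and `Simulations_realdata.R` before SIM Bacher II.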
After the above steps, you can try the other R files which include various code for analysing the data.