This is the original code repository for the project "Network-centric Framework for the Evaluation of Mutual Exclusivity Tests". The project evaluates the following mutual exclusivity methods: Discover, Discover_strat, Fisher's Exact Test, MEGSA, MEMO, and WExT. Each method produces pairwise mutual exclusivity p-values, to which we apply our network-centric epistatic evaluation.
Python: 3.8-3.9
Using GitHub clone
git clone https://github.com/abu-compbio/netcentric
cd NetCentric
pip install -r requirements.txt
You can also run this project locally by following these steps:
- Download the repo
- Unzip NetCentric-main
- Open cmd/terminal and cd into the project
- Execute python -m pip install -r requirements.txt
Some of the datasets are provided as .zip files. To unzip them, run script_unzip_data.py, which is located in the NetCentric directory.
python script_unzip_data.py
The file is located at data/intact_nodupl_index_file.txt where the first column contains gene identifiers and the second column corresponds to gene symbols.
0 ""CHEBI
1 100147744
2 1B
3 1EFV
...
The file is located at data/intact_nodupl_edge_file.txt where the first two columns contain the gene identifiers and the third column denotes the confidence level of the interaction.
7589 13441 0.99
9123 10446 0.98
4248 1740 0.98
3776 5279 0.98
...
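The index and edge files can be combined into a symbol-level edge list. The following is a minimal sketch, assuming both files are whitespace-delimited and have no header row. The function names and the inline sample rows are hypothetical and are not taken from the repository.

```python
import io


def read_index(handle):
    """Map gene identifier -> gene symbol from a two-column index file."""
    index = {}
    for line in handle:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        index[parts[0]] = parts[1]
    return index


def read_edges(handle, index):
    """Yield (symbol1, symbol2, confidence) for edges whose ends are both indexed."""
    for line in handle:
        parts = line.split()
        if len(parts) < 2:
            continue
        a, b = parts[0], parts[1]
        # The confidence column is optional: some edge files have only two columns.
        conf = float(parts[2]) if len(parts) > 2 else None
        if a in index and b in index:
            yield index[a], index[b], conf


# Hypothetical sample rows in the documented column layout.
index_text = "1 A1BG\n2 A1CF\n3 A2BP1\n"
edge_text = "1 2 0.99\n2 3 0.98\n3 9 0.50\n"

index = read_index(io.StringIO(index_text))
edges = list(read_edges(io.StringIO(edge_text), index))
print(edges)
```

The last sample edge is dropped because identifier 9 has no entry in the index.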
The file is located at data/hint_index_file.txt where the first column contains gene identifiers and the second column corresponds to gene symbols.
1 A1BG
2 A1CF
3 A2BP1
...
The file is located at data/hint_edge_file.txt where the first two columns contain the gene identifiers.
2988 7255
7255 9111
7255 9109
...
These index files are located at data/intact_index_file_0.25.txt and data/intact_index_file_0.45.txt where the first column contains gene identifiers and the second column corresponds to gene symbols.
1 ""CHEBI
2 100147744
3 1B
4 1C
...
These edge files are located at data/intact_edge_file_0.25.txt and data/intact_edge_file_0.45.txt where the first two columns contain the gene identifiers.
MDM2 TP53
MYC MAX
BRAF MAP2K1
...
The file is located at data/string_network.txt. Each row is an edge between two genes (gene1 and gene2).
gene1 gene2
0 A1CF APOBEC3H
1 A1CF APOBEC3G
2 A1CF APOBEC3A
3 A1CF APOBEC3C
...
The mutation data includes pairwise mutual exclusivity p-values for each method (discover, discover_strat, fishers, megsa, memo and wext). Files named mutations_all_genes include all genes; files named intact_filtered include only the genes present in the IntAct network.
The file is located at data/{method}mutation_filtered_ep_data/{cancerType}{method}result_mutations_all_genes{threshold}.txt
gene1 gene2 pvalue
0 A2M A2ML1 0.6584654889330113
1 A2M ABCA1 0.5332913581418495
2 A2M ABCA10 0.8971732886956303
...
The file is located at data/{method}mutation_filtered_ep_data/{cancerType}{method}_pairs_intact_filtered_subset{threshold}.txt
gene1 gene2 pvalue oddsratio
0 TCF7L2 CTNNB1 0.9015805073650888 1.6786858974358974
1 SMAD4 SMAD3 0.839665475908354 1.4567901234567902
2 EP300 TP53 0.0742406168447767 0.5221052631578947
...
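The result tables above can be read into a dictionary keyed by gene pair. This is a minimal sketch, assuming the files are whitespace-delimited, start with a header line, and prefix each data row with a row index as in the samples shown. The function name read_pvalues is hypothetical.

```python
import io


def read_pvalues(handle):
    """Return {(gene1, gene2): pvalue} from a result table.

    Assumes a header line ("gene1 gene2 pvalue ...") followed by rows
    that start with a row index.
    """
    header = next(handle).split()
    p_col = header.index("pvalue")
    pvalues = {}
    for line in handle:
        fields = line.split()
        if len(fields) != len(header) + 1:
            continue  # skip "..." or malformed lines
        row = fields[1:]  # drop the leading row index
        pvalues[(row[0], row[1])] = float(row[p_col])
    return pvalues


sample = io.StringIO(
    "gene1 gene2 pvalue\n"
    "0 A2M A2ML1 0.6584654889330113\n"
    "1 A2M ABCA1 0.5332913581418495\n"
    "2 A2M ABCA10 0.8971732886956303\n"
)
pvalues = read_pvalues(sample)

# Keep the pairs below a significance threshold; 0.05 is an example value.
threshold = 0.05
significant = [pair for pair, p in pvalues.items() if p < threshold]
print(len(pvalues), significant)
```

Because the p-value column is looked up by name in the header, the same function also handles the tables that carry an extra oddsratio column.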
The file is located in the data/binary_matrices_all_genes_ep_mutation_filtered/ directory. Each row is a TCGA patient id and each column is a gene. An entry is 1 if the gene is mutated in the corresponding patient and 0 otherwise. Here, we only provide the mutation matrix for COADREAD.
patients A1BG A1CF A2M ...
TCGA-3L-AA1B-01A 0 0 0
TCGA-4N-A93T-01A 0 0 0
TCGA-4T-AA8H-01A 0 0 0
...
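To illustrate how a pairwise mutual exclusivity p-value can be derived from such a binary matrix, the sketch below builds a 2x2 contingency table for two genes and computes a one-sided Fisher's exact test. It is not the repository's implementation; the in-memory matrix layout, the function names, and the sample patients and genes (P1..P8, GENE_X, GENE_Y) are hypothetical.

```python
from math import comb


def contingency(matrix, gene_a, gene_b):
    """Count patients by mutation status of two genes.

    `matrix` maps patient id -> {gene: 0 or 1}.
    Returns (both, only_a, only_b, neither).
    """
    both = only_a = only_b = neither = 0
    for row in matrix.values():
        a, b = row[gene_a], row[gene_b]
        if a and b:
            both += 1
        elif a:
            only_a += 1
        elif b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def fisher_less(both, only_a, only_b, neither):
    """One-sided Fisher's exact p-value for fewer co-mutations than expected.

    Sums hypergeometric probabilities over co-mutation counts <= `both`.
    """
    n_a = both + only_a  # patients with gene A mutated
    n_b = both + only_b  # patients with gene B mutated
    n = both + only_a + only_b + neither
    lowest = max(0, n_a + n_b - n)  # smallest feasible co-mutation count
    tail = sum(
        comb(n_a, k) * comb(n - n_a, n_b - k) for k in range(lowest, both + 1)
    )
    return tail / comb(n, n_b)


# Hypothetical matrix: GENE_X mutated in P1..P4, GENE_Y in P5..P8, no overlap.
matrix = {
    f"P{i}": {"GENE_X": int(i <= 4), "GENE_Y": int(i > 4)} for i in range(1, 9)
}
counts = contingency(matrix, "GENE_X", "GENE_Y")
p_value = fisher_less(*counts)
print(counts, p_value)
```

A small p-value here indicates that the two genes are co-mutated less often than expected by chance, which is the pattern a mutual exclusivity test looks for.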
MLA: The file contains the mutation load association (MLA) value of each gene.
The file is located at data/MLA_ep_mutation_filtered_all_genes
A1BG 4.261253658699028
A1CF 5.095042391780406
A2M 5.539871662596874
...
The files are located in the data/known_cancer_genes directory.
- CGC genes: We download all the genes from the Cancer Gene Census (CGC) in the COSMIC database.
- CGC_SNV genes: We use a subset of the CGC genes that includes only those with SNV-type mutations in cancer (378 out of 723 genes). To this end, we filter out the genes whose mutation type column consists of only A (amplification), D (large deletion) or T (translocation).
- IntOGen genes: We download the Unfiltered driver results 05.tsv file (2020-02-02 release) from https://www.intogen.org and include the genes where the FILTER column is PASS, which results in 503 genes.
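The CGC_SNV filtering rule described above can be sketched as follows. This is an illustration under assumptions: the mutation types are taken to be a comma-separated string of codes, and the gene names and rows are hypothetical placeholders, not entries from the Census file.

```python
# Mutation-type codes that, on their own, do not indicate an SNV-type mutation.
NON_SNV = {"A", "D", "T"}


def has_snv_type(mutation_types):
    """True if a comma-separated mutation-type string has any code outside A, D, T."""
    codes = {code.strip() for code in mutation_types.split(",") if code.strip()}
    # An empty set (no codes listed) yields False, so such genes are filtered out.
    return bool(codes - NON_SNV)


# Hypothetical rows: (gene symbol, mutation types). Values are illustrative only.
rows = [
    ("GENE1", "A"),
    ("GENE2", "T, Mis"),
    ("GENE3", "D, T"),
    ("GENE4", "Mis, N, F"),
]
cgc_snv = [gene for gene, types in rows if has_snv_type(types)]
print(cgc_snv)
```

GENE1 and GENE3 are removed because every code they list is A, D or T; GENE2 is kept because it lists at least one other code.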
Rather than using a common, nonspecific network for all cancer types, in this component of our evaluation framework we employ tissue-specific networks (TSN) based on the tissue in which the tumor develops. To construct the TSN for a particular tissue, we start with the original PPI network and remove the edges between pairs of genes that are not co-expressed in that tissue. For this purpose, we download RNA-seq datasets from the GTEx portal. This is discussed in the main article under the section "Network-centric ME Evaluations in Relation to TSN".
The file is located at data/gtex_tsn_fractions_intact_filtered_applied_threshold
MDM2 TP53 1.0
PAK1 RAC1 1.0
FADD CASP8 0.9987163029525032
...
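The edge-removal step can be sketched as a simple filter over the PPI edge list. The sketch assumes that the third column of the file above is a per-pair co-expression fraction and that an edge is kept when this fraction is strictly greater than the TSN threshold; both points are assumptions, and build_tsn and the GENE_A/GENE_B pair are hypothetical.

```python
def build_tsn(ppi_edges, coexpr_fraction, threshold=0.0):
    """Keep PPI edges whose co-expression fraction is strictly above `threshold`.

    Edges are treated as undirected; pairs missing from `coexpr_fraction`
    are dropped.
    """
    tsn = []
    for a, b in ppi_edges:
        frac = coexpr_fraction.get(frozenset((a, b)))
        if frac is not None and frac > threshold:
            tsn.append((a, b))
    return tsn


# Fractions keyed by unordered gene pair; the last entry is a made-up pair.
coexpr_fraction = {
    frozenset(("MDM2", "TP53")): 1.0,
    frozenset(("PAK1", "RAC1")): 1.0,
    frozenset(("FADD", "CASP8")): 0.9987163029525032,
    frozenset(("GENE_A", "GENE_B")): 0.0,
}
ppi_edges = [("TP53", "MDM2"), ("PAK1", "RAC1"), ("GENE_A", "GENE_B")]

tsn_edges = build_tsn(ppi_edges, coexpr_fraction, threshold=0.0)
print(tsn_edges)
```

Using a frozenset as the key makes the lookup independent of the order in which the two genes are listed.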
The Mutual Exclusivity results will be available in the folder ME_results. The TSN results will be available in the folder tsn_results. The MLA results will be available in the folder MLA_results. These folders will appear under the main directory, when the results are ready.
The commands to run the mutual exclusivity algorithms are shown in the running_mex_methods section.
The code for the various analyses presented in the main article is described below.
The main source code for evaluating the ME tests. This is discussed in the main article under the section "ME Evaluations Based on Defined Metrics". The same code was also used for the section "Robustness Analysis of Evaluations Based on Defined Metrics"; the number of robustness iterations is passed as the parameter -i.
The output consists of tables with all analysis results in NetCentric/ME_results. To run the analysis for a given input, run the following script (-c: cancer type, -t: threshold, -i: number of iterations, -m: methods, -p: p-value threshold, -ni: network index file, -e: network edge file, -r: reference cancer genes)
cd src
python evaluations_on_metrics.py -c COADREAD -t 20 -i 100 -m discover discover_strat fishers megsa memo wext -p 0.05 -ni intact_nodupl_index_file.txt -e intact_nodupl_edge_file.txt -r Census_allFri_Apr_26_12_49_57_2019.tsv
To run the code with the STRING network, omit the index and edge file arguments (-ni and -e) and use the command given below. (-str: STRING network file)
cd src
python evaluations_on_metrics.py -c COADREAD -t 20 -i 100 -m discover discover_strat fishers megsa memo wext -p 0.05 -str string_network.txt -r Census_allFri_Apr_26_12_49_57_2019.tsv
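For readers who want to see how the documented flags fit together, they could be declared with argparse roughly as follows. This is a sketch only: the dest names, types, and defaults are assumptions, and the repository's script may define its arguments differently.

```python
import argparse


def build_parser():
    """Declare the flags listed in this README (names after dest= are hypothetical)."""
    parser = argparse.ArgumentParser(description="Sketch of the documented flags")
    parser.add_argument("-c", dest="cancer_type", required=True)
    parser.add_argument("-t", dest="threshold", type=int, default=20)
    parser.add_argument("-i", dest="iterations", type=int, default=100)
    parser.add_argument("-m", dest="methods", nargs="+", required=True)
    parser.add_argument("-p", dest="p_value", type=float, default=0.05)
    parser.add_argument("-ni", dest="index_file")
    parser.add_argument("-e", dest="edge_file")
    parser.add_argument("-str", dest="string_file")
    parser.add_argument("-r", dest="reference_genes")
    return parser


def uses_string_network(args):
    """True when the STRING file is given and no index/edge files are set."""
    return args.string_file is not None and not (args.index_file or args.edge_file)


command = (
    "-c COADREAD -t 20 -i 100 -m discover discover_strat fishers megsa memo wext "
    "-p 0.05 -str string_network.txt -r Census_allFri_Apr_26_12_49_57_2019.tsv"
)
args = build_parser().parse_args(command.split())
print(args.methods, uses_string_network(args))
```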
Scatterplots of percentage significance of mutual exclusivity runs vs mutation load association (MLA). This is discussed in the main article under the section "ME Evaluations Based on Corrections via MLA". The output is written to NetCentric/MLA_results/percent_sig_figures
cd src
python evaluations_via_mla.py -c COADREAD -t 20 -m discover discover_strat fishers megsa memo wext
Scatterplots of percentage significance of mutual exclusivity runs vs mutation load association (MLA) when only CGC genes that have more than one neighbor are included. The output is written to NetCentric/MLA_results/perc_sig_figures_for_multiple_neighbors
cd src
python evaluations_via_mla_neighbors.py -c COADREAD -t 20 -m discover discover_strat fishers megsa memo wext
Evaluation of the ME tests on tissue-specific networks (TSN). This is discussed in the main article under the section "ME Evaluations Based on Corrections via TSN". (-c: cancer type, -t: threshold, -m: methods, -ti: tissue, -th: TSN threshold). The output consists of tables with all analysis results in NetCentric/tsn_results.
cd src
python evaluations_via_tsn.py -c COADREAD -t 20 -m discover discover_strat fishers megsa memo wext -ti Colon -th 0.0
ROC analysis based on tissue-specificity. The output is written to NetCentric/tsn_results/figure_tsn_AUROC (-c: cancer type, -t: threshold, -m: methods, -th: TSN threshold, -p: percentage)
cd src
python me_on_tsn_ntsn_roc_curve.py -c COADREAD -t 20 -m discover discover_strat fishers megsa memo wext -th 0.0 -p 0.25
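As an illustration of what an AUROC value measures in this setting, the sketch below computes it from p-values and binary labels. It assumes that gene pairs on TSN edges are the positive class, that the remaining pairs are negatives, and that smaller p-values should rank higher. These choices, the function name, and the sample values are assumptions and may differ from what the script does.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Equals the probability that a random positive scores higher than a
    random negative, with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical p-values for five gene pairs; label 1 marks a pair on a TSN edge.
p_values = [0.01, 0.20, 0.03, 0.60, 0.50]
labels = [1, 1, 0, 0, 0]

# Negate the p-values so that a smaller p-value gives a higher score.
scores = [-p for p in p_values]
auc = auroc(scores, labels)
print(round(auc, 4))
```

The pairwise form is quadratic in the number of pairs, which is fine for a small illustration; a rank-based computation would scale better on full result tables.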