This code was generated in the context of the study “Next generation pan-cancer blood proteome profiling using proximity extension assay”, in which we performed a comprehensive analysis of the plasma proteome of a pan-cancer cohort representing the major cancer types.
In this study, the plasma profiles of 1,463 proteins were measured for 1,477 cancer patients representing 12 common cancer types, including the most prevalent types such as colorectal, breast, lung and prostate cancer. The plasma proteome was measured in minute amounts of blood plasma collected at the time of diagnosis and before treatment, using the antibody-based PEA technology combined with next generation sequencing (developed by Olink). The plasma profiles of patients with a specific cancer type were compared to those of patients with other cancer diagnoses in order to find cancer-specific signatures that can distinguish each type of cancer from the other cancer types. Both differential expression and disease prediction models were used as tools for the identification of specific cancer signatures.
The results from the study are published in Nature Communications: Bueno Álvez, M., Edfors, F., von Feilitzen, K. et al. Next generation pan-cancer blood proteome profiling using proximity extension assay. Nat Commun 14, 4308 (2023). https://doi.org/10.1038/s41467-023-39765-y
This repository includes the code to generate the results described above, as well as a synthetic dataset to test the code:
- /data: contains example data to test the code, as well as additional data files to reproduce the analysis. Note that this is not real data and should not be used in any research.
- /scripts: contains all necessary scripts to reproduce the analysis.
- /results: all the plots resulting from the analysis will be stored in this directory. Note that the results are based on synthetic data and should not be interpreted as valid biological results.
- Pan-cancer-profiling.Rproj: R project file.
Before running the code, make sure you have R, RStudio and the packages indicated in data/processed/Sessions installed.
The code has been developed on the following system:
- Processor: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
- Installed RAM: 32.0 GB
- System type: 64-bit operating system, x64-based processor
The provided run times apply to computers with similar specifications.
- Clone the repository (should take ~15 seconds).
- Open RStudio and open Pan-cancer-profiling.Rproj.
- Start by running through the differential expression analysis using the Differential_expression.rmd markdown script located in the scripts folder. This will perform differential expression analysis based on the example data located in the data folder.
- Continue by running the Disease_prediction.rmd markdown script located in the scripts folder. This script will find protein signatures for the different groups of patients using prediction models based on the glmnet and random forest algorithms. It will also combine the results from the differential expression analysis to select a panel of upregulated proteins relevant for the prediction of the disease groups.
- Explore the generated results.
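For readers who prefer the R console to clicking through RStudio, the two scripts can probably also be rendered in order from the project root. The following is only a sketch: run_pipeline is a hypothetical helper that is not part of this repository, it assumes the rmarkdown package is installed, and it assumes the scripts resolve their input paths correctly when rendered non-interactively. Running the scripts in RStudio as described above remains the documented route.

```r
# Scripts named in the steps above, in the order they are meant to be run.
pipeline_scripts <- file.path(
  "scripts",
  c("Differential_expression.rmd", "Disease_prediction.rmd")
)

# Hypothetical helper (not part of the repository): render each script in turn.
run_pipeline <- function(paths = pipeline_scripts) {
  # Fail early if a script cannot be found from the current working directory.
  missing <- paths[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("Script(s) not found: ", paste(missing, collapse = ", "),
         ". Is the working directory the project root?")
  }
  # Assumption: rmarkdown is available; it is not named in this README.
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    stop("The rmarkdown package is needed to render the scripts.")
  }
  for (path in paths) {
    message("Rendering ", path)
    rmarkdown::render(path)
  }
  invisible(paths)
}

# Usage, from the project root:
# run_pipeline()
```

The differential expression script is listed first because the prediction script combines its results.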
The expected runtimes are:
- Differential_expression.rmd: ~3 minutes
- Disease_prediction.rmd: ~17 minutes
If you want to run the code using your own data, make sure to format it according to the data provided here or adjust the scripts accordingly.
All plots resulting from the analysis are stored in results/YYYY-MM-DD (a new folder will be created if you re-run the analysis on a different date).
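The dated folder naming can be illustrated with a few lines of base R. This is a sketch of the convention only, not the code the scripts actually use; results_dir_for and ensure_results_dir are hypothetical names introduced here.

```r
# Hypothetical helper: build the dated output path results/YYYY-MM-DD.
results_dir_for <- function(date = Sys.Date(), root = "results") {
  file.path(root, format(as.Date(date), "%Y-%m-%d"))
}

# Create the folder on demand; an existing folder is left untouched.
ensure_results_dir <- function(date = Sys.Date(), root = "results") {
  out_dir <- results_dir_for(date, root)
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  out_dir
}

results_dir_for("2023-07-18")
# "results/2023-07-18"
```

Because the folder name depends only on the date, re-running on the same day writes to the same folder.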
All resulting data files (e.g. differential expression analysis, results from prediction models …) are stored in subfolders of the data/processed directory as R objects.
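To inspect those saved objects afterwards, something along the following lines may help. The on-disk format is not stated here, so this sketch assumes the files are either .rds files or .rda/.RData workspaces; list_processed and read_r_object are hypothetical helpers, not functions from this repository.

```r
# Hypothetical helper: list every file saved under data/processed.
list_processed <- function(root = file.path("data", "processed")) {
  list.files(root, recursive = TRUE, full.names = TRUE)
}

# Hypothetical helper: read one saved file based on its extension.
read_r_object <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (ext == "rds") {
    # A single object written with saveRDS().
    readRDS(path)
  } else if (ext %in% c("rda", "rdata")) {
    # One or more objects written with save(); returned as a named list.
    env <- new.env()
    load(path, envir = env)
    as.list(env)
  } else {
    stop("Unrecognised file type: ", path)
  }
}

# Usage, from the project root:
# files <- list_processed()
# obj <- read_r_object(files[1])
```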
To use this code in your own research, please cite our code and/or our study:
Bueno Álvez, M. buenoalvezm/Pan-cancer-profiling: pan-cancer-profiling (Version v2). Zenodo. https://doi.org/10.5281/zenodo.8012430 (2023).
Bueno Álvez, M., Edfors, F., von Feilitzen, K. et al. Next generation pan-cancer blood proteome profiling using proximity extension assay. Nat Commun 14, 4308 (2023). https://doi.org/10.1038/s41467-023-39765-y