R package to prepare proteomics datasets for the peptide level quantification

bshashikadze/pepquantify

pepquantify

pepquantify (v2.1.3) takes the precursor- or peptide-level output of a proteomics search and offers several options to pre-filter and process the data for further analysis. pepquantify can impute missing values with user-defined settings (left-censored imputation, similar to Perseus), so that peptides with a high detection rate in one condition but a very low rate in the other can be included in the quantitative analysis (peptide level only). Currently, DDA results generated by MaxQuant (both LFQ and TMT) and DIA results generated by DIA-NN are supported.

The goal of the package is two-fold:

  • prepare proteomics data for peptide level quantitative analysis (MS-EmpiRe)
  • process precursor-level output of DIA-NN for peptide- or protein-level quantitative analysis

How to install?

You can install the development version of pepquantify from GitHub with:

if(!require(devtools)){
  install.packages("devtools")}

devtools::install_github("bshashikadze/pepquantify")

Example script

load libraries

library(pepquantify)
library(msEmpiRe)

define the function which performs data loading, normalization and quantification (MS-EmpiRe)

Run this definition once per session so that the function is available in the global environment

Comment

See https://github.com/zimmerlab/MS-EmpiRe to learn more about the MS-EmpiRe package (also doi:10.1074/mcp.RA119.001509).
Note 1: this function is built from code that can be found in https://github.com/zimmerlab/MS-EmpiRe/blob/master/example.R
Note 2: the regex used for unlisting in msEmpiRe::read.standard has been changed. This is important for removing the unique number at the end of the protein IDs (after the LAST dot), which is added by the pepquantify read functions and is necessary for the MS-EmpiRe read.standard function. This is especially relevant for NCBI RefSeq IDs, as they themselves contain a dot and a version number.

msempire_calculation <- function(data, data2 = data_raw, seed=1234, fc_threshold = 1.5) {
  
  require(magrittr)
  # read the data in the expressionset format and perform msempire normalization and quantification  
  # (https://github.com/zimmerlab/MS-EmpiRe/blob/master/example.R)
  msempiredata  <- msEmpiRe::read.standard(data[[1]], data[[2]],
                                            prot.id.generator = function(pep) unlist(strsplit(pep, "\\.[0-9]*$"))[1],
                                            signal_pattern="Intensity.*")
  
  # msempire calculations: normalize, test for differential abundance,
  # and write the raw result table into the comparison folder (data[[3]])
  set.seed(seed)
  msempiredata %>%
    msEmpiRe::normalize() %>%
    msEmpiRe::de.ana() %>%
    write.table(file.path(data[[3]], "msempire_results_raw.txt"), sep = "\t", row.names = FALSE)
  
  
  # tidy results (pepquantify package)
  pepquantify::resultstidy(data, data2,  fc_threshold = fc_threshold)}
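
The effect of the modified prot.id.generator regex can be checked in isolation. The sketch below reuses the exact expression from the function above; the identifiers are invented for illustration and only follow the "protein ID plus trailing number" shape described in Note 2.

```r
# Same generator as in msempire_calculation(): drop the trailing ".<digits>"
# that follows the LAST dot, and keep everything before it.
strip_suffix <- function(pep) unlist(strsplit(pep, "\\.[0-9]*$"))[1]

# Hypothetical identifiers of the form "<protein id>.<running number>".
# The first one mimics an NCBI RefSeq ID, which itself contains ".<version>".
ids <- c("NP_001234.2.17", "P12345.3", "INS.120")

protein_ids <- vapply(ids, strip_suffix, character(1), USE.NAMES = FALSE)
print(protein_ids)
```

Because the pattern is anchored at the end of the string, only the last dot and the digits after it are removed, so the RefSeq version number survives.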

read the data

DDA results generated by MaxQuant

Important: this function expects the directory to contain peptides.txt and proteinGroups.txt from the MaxQuant txt folder (unmodified)
Arguments and default values
  • exclude_samples: if not empty, excludes the specified sample(s) from further analysis (only if necessary, e.g. after inspecting a PCA; the sample name should be written as "Intensity.samplename")

  • lfq: set to TRUE for label-free data and to FALSE if labelling was performed (e.g. TMT). For TMT, Reporter.intensity.corrected is used for quantification

data_raw <- pepquantify::read_mqdda()
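
Before calling read_mqdda(), it can help to confirm that the two MaxQuant files are present in the directory. The helper below is a hypothetical convenience function (check_mq_inputs is not part of pepquantify) and uses base R only; it matches file names case-insensitively, which is an assumption made here for robustness.

```r
# Hypothetical pre-flight check (not part of pepquantify): verify that the
# directory holds the two MaxQuant tables the read function expects.
check_mq_inputs <- function(dir = ".") {
  required <- c("peptides.txt", "proteinGroups.txt")
  present  <- tolower(list.files(dir))
  missing  <- required[!tolower(required) %in% present]
  if (length(missing) > 0) {
    stop("Missing MaxQuant file(s) in '", dir, "': ",
         paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

# Demonstration on a temporary directory with two empty placeholder files
demo_dir <- file.path(tempdir(), "mq_txt_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("peptides.txt", "proteinGroups.txt")))
check_mq_inputs(demo_dir)
```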

DIA results generated by DIA-NN

Important: set experimental_library to FALSE for a library-free search with MBR enabled, and to TRUE when an empirical library is used (e.g. pre-fractionation, GPF, etc.). The other parameters can be left at their defaults

Comment: this package has only been tested for searches against a UniProt database with the "Genes" column used for protein inference. In other scenarios it may be possible to adapt this function by changing "id_column" and "quantity_column" (the latter is not important for MS-EmpiRe)
Arguments and default values
  • exclude_samples: if not empty, excludes specified sample/s from further analysis (only if necessary, e.g. after inspecting PCA)

  • Q_Val: (Q.Value) refer to https://github.com/vdemichev/DiaNN

  • Global_Q_Val: (Global.Q.Value) refer to https://github.com/vdemichev/DiaNN

  • Global_PG_Q_Val: (Global.PG.Q.Value) refer to https://github.com/vdemichev/DiaNN

  • Lib_Q_Val: (Lib.Q.Value) refer to https://github.com/vdemichev/DiaNN

  • Lib_PG_Q_Val: (Lib.PG.Q.Value) refer to https://github.com/vdemichev/DiaNN

  • for_msempire: default TRUE. If FALSE, pepquantify only writes the peptide and protein group files and not the files for MS-EmpiRe

  • experimental_library: set to TRUE if you use an empirical library (e.g. pre-fractionation or GPF), and to FALSE for a library-free search with MBR enabled

  • unique_peptides_only: if TRUE, only unique peptides are used for quantification (recommended)

  • Quant_Qual: (Quantity.Quality) refer to https://github.com/vdemichev/DiaNN; pepquantify by default sets it to 0.5

  • id_column: default "Genes"

  • include_mod_in_pepreport: default FALSE. If TRUE, modifications are included in the output peptide file (currently only Carbamidomethyl (C))

  • quantity_column: default "Genes.MaxLFQ.Unique", not important for MS-EmpiRe

data_raw <- pepquantify::read_diann(experimental_library = FALSE)  # use TRUE when an empirical library was used
Important

Keep the name "data_raw". If you change it, make sure to pass the new name to the next function as well (see pepquantify_funs())

A conditions file will be generated, which you should edit to match your experimental conditions. For more information, refer to the function documentation (type ?read_mqdda in the R console)
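
To make the q-value arguments of read_diann() more concrete, here is a rough base-R sketch of this kind of filtering on a toy precursor table. It is not pepquantify's code: the column names follow the DIA-NN report, but the 0.01 cut-offs, the toy values, and the rule that experimental_library switches between the Global.* and Lib.* columns are assumptions made for illustration (only the 0.5 default for Quantity.Quality is stated in the argument list above).

```r
# Toy precursor table with DIA-NN style column names (all values invented)
report <- data.frame(
  Precursor.Id      = c("A", "B", "C", "D"),
  Q.Value           = c(0.001, 0.020, 0.004, 0.002),
  Global.Q.Value    = c(0.002, 0.005, 0.003, 0.030),
  Global.PG.Q.Value = c(0.001, 0.001, 0.001, 0.001),
  Lib.Q.Value       = c(0.002, 0.005, 0.003, 0.004),
  Lib.PG.Q.Value    = c(0.001, 0.001, 0.001, 0.001),
  Quantity.Quality  = c(0.90, 0.80, 0.30, 0.95)
)

# Hypothetical filter; thresholds other than quant_qual are assumed values
filter_report <- function(x, experimental_library = FALSE, q_val = 0.01,
                          run_wide_q = 0.01, run_wide_pg_q = 0.01,
                          quant_qual = 0.5) {
  # Assumption: empirical library -> Global.* columns, library-free + MBR -> Lib.*
  prec_col <- if (experimental_library) "Global.Q.Value" else "Lib.Q.Value"
  pg_col   <- if (experimental_library) "Global.PG.Q.Value" else "Lib.PG.Q.Value"
  keep <- x$Q.Value <= q_val &
    x[[prec_col]] <= run_wide_q &
    x[[pg_col]] <= run_wide_pg_q &
    x$Quantity.Quality >= quant_qual
  x[keep, , drop = FALSE]
}

kept_lib_free  <- filter_report(report, experimental_library = FALSE)$Precursor.Id
kept_empirical <- filter_report(report, experimental_library = TRUE)$Precursor.Id
print(kept_lib_free)
print(kept_empirical)
```

In this toy table, precursor D passes only in the library-free setting because its Global.Q.Value exceeds the assumed cut-off.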

prepare the dataset and perform normalization and quantification

Important

The two functions below should be run once per comparison. Each run creates a dedicated folder for that comparison: e.g. if condition1 = "disease" and condition2 = "healthy", a folder named disease_vs_healthy is created automatically in the working directory and all outputs are stored there. If you have other group(s), e.g. treated, copy the two lines of code (below) and run them with e.g. condition1 = "disease", condition2 = "treated"; a folder named disease_vs_treated will then be created.

The order also determines the fold-change direction: proteins that are more abundant in condition1 will have a positive l2fc. It is therefore preferable that condition1 is always the disease/treatment group, e.g. condition1 = "diabetes", condition2 = "control".
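
The folder naming and the sign convention described above can be illustrated with a few lines of base R. This is a toy calculation with invented intensities; the ratio of group means is used only to show the direction of the log2 fold change and is not the estimator that MS-EmpiRe applies.

```r
# Planned comparisons and the folder names they would produce
comparisons <- data.frame(
  condition1 = c("disease", "disease"),
  condition2 = c("healthy", "treated")
)
comparisons$folder <- paste0(comparisons$condition1, "_vs_", comparisons$condition2)
print(comparisons$folder)

# Toy intensities of one protein (invented values)
disease <- c(2000, 2200, 1800)
healthy <- c(1000, 1100, 900)

# condition1 = disease, condition2 = healthy: higher in disease -> positive l2fc
l2fc_disease_vs_healthy <- log2(mean(disease) / mean(healthy))

# Swapping the conditions flips the sign
l2fc_healthy_vs_disease <- log2(mean(healthy) / mean(disease))

print(c(l2fc_disease_vs_healthy, l2fc_healthy_vs_disease))
```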

Arguments and default values

Options in italics are not (usually) necessary to change

  • data: list of two elements containing the peptide and protein group data generated by the read functions of the pepquantify package (default data_raw)

  • imputation: if TRUE, imputation is performed; if FALSE, it is not (default FALSE). Statistics and fold-changes generated with imputation should be interpreted with caution. Imputation helps to discover proteins that are missing in one of the conditions while detected in the other. That said, imputation is better avoided in experiments with a low number of samples (for very small datasets, also consider setting second_condition to 0, see below)

  • n_element_peptide: peptide data is the nth element of the list (change only if data is loaded manually as a list without using pepquantify read function) (default 1)

  • condition1: name of the first condition that should be compared (note that the order matters for the fold-change direction)

  • condition2: name of the second condition that should be compared (note that the order matters for the fold-change direction)

  • n_condition_1: minimum number of valid values in the first condition (should be at least two; the pepquantify default is three)

  • n_condition_2: minimum number of valid values in the second condition (should be at least two; the pepquantify default is three)

  • min_pep: minimum number of peptides for each protein (default 2)

  • downshift: see the perseus documentation "Replace missing values from normal distribution" (default 1.8) http://www.coxdocs.org/doku.php?id=perseus:user:activities:matrixprocessing:imputation:replacemissingfromgaussian

  • width: see the perseus documentation "Replace missing values from normal distribution" (default 0.3) http://www.coxdocs.org/doku.php?id=perseus:user:activities:matrixprocessing:imputation:replacemissingfromgaussian

  • n_ko_like: minimum number of peptides that must show the missing/valid value pattern (all valid in one condition and at most 1 valid in the other, or user-defined criteria, see fraction_valid and second_condition) to be included in quantification. "ko" here does not necessarily have a biological meaning; the term refers to peptides that are consistently detected in one condition and not (or only at an extremely low rate) in the other (default 2)

  • fraction_valid: between 0 and 1. 1 means that imputed peptides are taken into account if they are present in all samples of one of the conditions (and in at most 1 sample of the second condition, see also option "second_condition"); 0.5 means they must be present in half of the samples of one of the conditions (default 1)

  • second_condition: maximum acceptable number of valid values in one condition when fraction_valid is met in the other (default 1)

  • seed: values for imputation are drawn randomly, so setting a seed ensures reproducibility (default 1234)

  • fc_threshold: minimum fold change for a protein to be considered differentially abundant (on the natural, i.e. non-log, scale) (default 1.5)
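
The downshift, width and seed arguments listed above follow the Perseus idea of replacing missing values with draws from a narrowed, down-shifted normal distribution. The function below is a minimal sketch of that idea on a single vector of log2 intensities; impute_left_censored is a hypothetical name, and whether pepquantify applies the procedure per sample or to the whole matrix is not stated here, so treat the details as assumptions.

```r
# Perseus-style left-censored imputation on one vector of log2 intensities.
# Missing values are drawn from N(mean - downshift * sd, (width * sd)^2),
# where mean and sd are taken from the observed values.
impute_left_censored <- function(x, downshift = 1.8, width = 0.3, seed = 1234) {
  set.seed(seed)  # note: this resets the global RNG state for reproducibility
  observed  <- x[!is.na(x)]
  n_missing <- sum(is.na(x))
  x[is.na(x)] <- rnorm(n_missing,
                       mean = mean(observed) - downshift * sd(observed),
                       sd   = width * sd(observed))
  x
}

# Toy log2 intensities with two missing values (invented numbers)
log2_intensity <- c(24.1, 25.3, NA, 23.8, 26.0, NA, 24.7)
imputed <- impute_left_censored(log2_intensity)
print(round(imputed, 2))
```

With the same seed the imputed values are identical between runs, which is why the seed is exposed as an argument.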

msempire_data <- pepquantify::pepquantify_funs(data_raw, condition1 = "name_of_condition_one", condition2 = "name_of_condition_two", imputation = FALSE)
msempire_calculation(msempire_data, fc_threshold = 1.5)
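
The "ko-like" rule controlled by n_ko_like, fraction_valid and second_condition can be sketched as follows. This is one reading of the argument descriptions above, not the package's implementation: is_ko_like is a hypothetical helper, the counts are invented, and applying n_ko_like per protein is an assumption.

```r
# A peptide is "ko-like" if it reaches the required fraction of valid values
# in one condition while having at most `second_condition` valid values in
# the other condition (checked in both directions).
is_ko_like <- function(valid1, valid2, n1, n2,
                       fraction_valid = 1, second_condition = 1) {
  one_way   <- valid1 >= fraction_valid * n1 & valid2 <= second_condition
  other_way <- valid2 >= fraction_valid * n2 & valid1 <= second_condition
  one_way | other_way
}

# Toy table: number of valid values per peptide in two conditions of 4 samples
peptides <- data.frame(
  protein = c("A", "A", "A", "B", "B", "C"),
  valid1  = c(4, 4, 3, 4, 2, 0),
  valid2  = c(0, 1, 0, 4, 2, 4)
)
peptides$ko_like <- is_ko_like(peptides$valid1, peptides$valid2, n1 = 4, n2 = 4)

# Assumed per-protein rule: keep proteins with at least n_ko_like such peptides
n_ko_like     <- 2
ko_counts     <- tapply(peptides$ko_like, peptides$protein, sum)
proteins_kept <- names(ko_counts)[ko_counts >= n_ko_like]
print(proteins_kept)
```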
Output
  • msempire_results_raw:
    the raw results of MS-EmpiRe

  • msempire_results_tidy:
    cleaned-up results that can be used for supplementary tables

  • msempire_results_volcano:
    results with some columns adjusted to make them suitable for a volcano plot

References:

  1. Ammar, C., et al., MS-EmpiRe Utilizes Peptide-level Noise Distributions for Ultra-sensitive Detection of Differentially Expressed Proteins. Mol Cell Proteomics, 2019. 18(9): p. 1880-1892.
  2. Flenkenthaler, F., et al., Differential Effects of Insulin-Deficient Diabetes Mellitus on Visceral vs. Subcutaneous Adipose Tissue-Multi-omics Insights From the Munich MIDY Pig Model. Front Med, 2021. 8: p. 751277.
  3. Stirm, M., et al., A scalable, clinically severe pig model for Duchenne muscular dystrophy. Disease Models & Mechanisms, 2021. 14(12).
  4. Demichev, V., et al., DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nat Methods, 2020. 17(1): p. 41-44.
  5. Tyanova, S., T. Temu, and J. Cox, The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat Protoc, 2016. 11(12): p. 2301-2319.
