small changes in the main document
ypriverol committed Aug 28, 2024
1 parent 9554911 commit b0e90d9
Showing 1 changed file with 56 additions and 38 deletions.
94 changes: 56 additions & 38 deletions sdrf-proteomics/README.adoc
@@ -19,18 +19,21 @@ ifdef::env-github[]
:warning-caption: :warning:
endif::[]

[[status]]
== Status of this document

This document provides information to the proteomics community about a proposed standard for sample metadata annotation in public repositories, the Sample and Data Relationship File (SDRF)-Proteomics format. Distribution is unlimited.

**Version 1.0.1** - 2023-05-24

[[abstract]]
== Abstract

The Human Proteome Organisation (HUPO) Proteomics Standards Initiative (PSI) defines community standards for data representation in proteomics to facilitate data comparison, exchange, and verification. This document presents a specification for a sample metadata annotation of proteomics experiments.

Further detailed information, including any updates to this document, implementations, and examples is available at https://github.com/bigbio/proteomics-metadata-standard. The official PSI web page for the document is the following: http://psidev.info/sdrf.

[[introduction]]
== Introduction

Many resources have emerged that provide raw or integrated proteomics data in the public domain. While these are valuable individually, their integration through re-analysis represents a huge asset for the community [1]. Unfortunately, proteomics experimental design and sample-related information are often missing in public repositories or stored in very diverse ways and formats. For example, the CPTAC consortium (https://cptac-data-portal.georgetown.edu/) provides for every dataset a set of Excel files with the information on each sample (e.g. https://cptac-data-portal.georgetown.edu/study-summary/S048), including tumor size and origin, but also how every sample is related to a specific raw file (e.g. instrument configuration parameters). ProteomicsDB, a resource that routinely re-analyses public datasets, captures for each sample in the database a minimum number of properties to describe the sample and the related experimental protocol, such as tissue, digestion method and instrument (e.g. https://www.proteomicsdb.org/#projects/4267/6228). Such heterogeneity often prevents data interpretation, reproducibility, and integration of data from different resources. This is why we propose a homogeneous standard for proteomics metadata annotation. For every proteomics dataset we propose to capture at least three levels of metadata: (i) the dataset description; (ii) the sample and data file related information; and (iii) the technical/proteomics-specific information in standard data file formats (e.g. the PSI formats mzIdentML, mzML, or mzTab, among others).
@@ -43,6 +46,7 @@ image::https://github.com/bigbio/proteomics-metadata-standard/raw/master/sdrf-pr

**Figure 1**: SDRF-Proteomics file format stores the information of the sample and its relation to the data files in the dataset. The file format includes not only information about the sample but also about how the data was acquired and processed.

[[requirements]]
=== Requirements

The SDRF-Proteomics format describes the sample characteristics and the relationships between samples and data files included in a dataset. The information in SDRF files is organised so that it follows the natural flow of a proteomics experiment. The main requirements to be fulfilled for SDRF-Proteomics format are:
@@ -53,17 +57,20 @@ The SDRF-Proteomics format describes the sample characteristics and the relation
- The file MUST begin with columns describing the samples of origin and continue with the data files generated from their MS analyses.
- Support for handling unknown values/characteristics.

[[issues-addressed]]
=== Issues to be addressed

The main issues to be addressed by the SDRF are:

- It MUST be able to represent the sample metadata and the data files generated by the instruments or the analyses.
- It MUST be able to represent the experimental design including the way samples and data have been collected.

[[notation-conventions]]
== Notational Conventions

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMEND/RECOMMENDED”, “MAY”, “COULD BE”, and “OPTIONAL” are to be interpreted as described in RFC 2119 (2).

[[document-structure]]
== Documentation

The official website for the SDRF-Proteomics project is https://github.com/bigbio/proteomics-metadata-standard. New use cases, changes to the specification and examples can be added by using pull requests or issues in GitHub (see introduction to GitHub - https://lab.github.com/githubtraining/introduction-to-github).
@@ -76,6 +83,7 @@ Multiple tools have been implemented to validate SDRF-Proteomics files for users

- jsdrf (Java - https://github.com/bigbio/jsdrf ): This Java library and tool validates SDRF-Proteomics files. It also includes a generic data model that can be used by Java applications.

[[relationship-specifications]]
== Relationship to other specifications

SDRF-Proteomics is fully compatible with the SDRF file format that is part of https://www.ebi.ac.uk/arrayexpress/help/magetab_spec.html[MAGE-TAB]. MAGE-TAB is the file format used to store metadata and sample information for transcriptomics experiments. When a ProteomeXchange project file is converted to an IDF file (the project description in MAGE-TAB) and combined with the SDRF-Proteomics file, a valid MAGE-TAB is obtained.
@@ -118,6 +126,7 @@ image::https://github.com/bigbio/proteomics-metadata-standard/raw/master/sdrf-pr

**Figure 2**: SDRF-Proteomics in a nutshell. The file format is a tab-delimited one where columns are properties of the sample, the data file or the variables under study. The rows are the samples of origin and the cells are the values for one property in a specific sample.
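
The tab-delimited layout described above can be read with any standard delimited-text parser. The following is a minimal sketch in Python and is not part of the specification; the function name `read_sdrf` and the example rows are assumptions made for illustration only.

```python
import csv
import io


def read_sdrf(text):
    """Parse tab-delimited SDRF text into (header, rows).

    The header is kept as a plain list (not dictionary keys) so that
    repeated column names, if a file has any, are not collapsed.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    table = [row for row in reader if row]  # skip completely empty lines
    header, rows = table[0], table[1:]
    return header, rows


# Hypothetical two-run example: columns are properties, rows are samples.
example = (
    "source name\tcharacteristics[organism]\tassay name\tcomment[data file]\n"
    "sample 1\thomo sapiens\trun 1\tFILE_R1.RAW\n"
    "sample 1\thomo sapiens\trun 2\tFILE_R2.RAW\n"
)

header, rows = read_sdrf(example)
for row in rows:
    # Each cell is the value of one property for one sample/run.
    print(dict(zip(header, row)))
```

For real files, the validators listed in this document should be preferred over ad hoc parsing.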

[[sdrf-file-rules]]
=== SDRF-Proteomics format rules

There are general scenarios/use cases that are addressed by the following rules:
@@ -157,6 +166,7 @@ The value for each property, (e.g. characteristics, comment) corresponding to ea

NT=Glu->pyro-Glu;MT=fixed;PP=Anywhere;AC=Unimod:27;TA=E
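
A value such as the one above can be split into its key=value pairs with a few lines of code. This is a minimal sketch, assuming that pairs are separated by `;` and that the first `=` in each pair separates the key from the value; the helper name `parse_modification` is hypothetical and not defined by the specification.

```python
def parse_modification(value):
    """Split 'KEY=value;KEY=value' into a dict.

    Only the first '=' of each pair is treated as the separator, so
    values such as 'Glu->pyro-Glu' or 'Unimod:27' are kept intact.
    """
    fields = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing ';'
        key, sep, val = part.partition("=")
        if not sep:
            raise ValueError(f"expected KEY=value, got {part!r}")
        fields[key.strip()] = val.strip()
    return fields


mod = parse_modification("NT=Glu->pyro-Glu;MT=fixed;PP=Anywhere;AC=Unimod:27;TA=E")
print(mod["NT"], mod["AC"])
```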

[[from-sample-metadata]]
== SDRF-Proteomics: Samples metadata

The sample metadata uses different categories (headings) to organize the attributes (column headers) of a given sample. Each sample contains a _source name_ (accession) and a set of _characteristics_. Any proteomics sample MUST contain the following characteristics:
@@ -225,45 +235,14 @@ Examples:
• https://github.com/bigbio/proteomics-sample-metadata/blob/master/annotated-projects/PXD011799/PXD011799.sdrf.tsv[TMT]
• https://github.com/bigbio/proteomics-sample-metadata/blob/master/annotated-projects/PXD017710/PXD017710-silac.sdrf.tsv[SILAC]

[[dda-dia]]
=== Data acquisition method: DDA and DIA and others

Proteomics data are typically acquired using either Data Dependent Acquisition (DDA) or Data Independent Acquisition (DIA); other acquisition methods are also supported. The SDRF-Proteomics file format captures the method used for data acquisition in the _comment[proteomics data acquisition method]_ column. The following values are RECOMMENDED:

- data-dependent acquisition
- data-independent acquisition
- parallel reaction monitoring
- selected reaction monitoring

TIP: If the SDRF does not specify the acquisition method in the _comment[proteomics data acquisition method]_ column, the method is assumed to be DDA, which is the most common method used in proteomics.

You can find an example of a DIA experiment in the following link: https://github.com/bigbio/proteomics-sample-metadata/blob/master/annotated-projects/PXD018830/PXD018830-DIA.sdrf.tsv[DIA example]

[[dia]]
==== Data Independent Acquisition - Scan window limits

In addition to the general _comment[proteomics data acquisition method]_ column, the SDRF-Proteomics file format can capture other properties of the DIA method. The following properties are RECOMMENDED for DIA:

- _comment[scan window lower limit]_
- _comment[scan window upper limit]_

The scan window lower and upper limits are the m/z range used for the DIA acquisition. The values are expressed in m/z units.

Example:

|===
| | assay name | comment[scan window lower limit] | comment[scan window upper limit] | comment[data file]
|sample 1| run 1 | 400 m/z | 1200 m/z | FILE_R1.RAW
|sample 1| run 2 | 400 m/z | 1200 m/z | FILE_R2.RAW
|===

[[instrument]]
=== Type and Model of Mass Spectrometer

The model of the mass spectrometer SHOULD be specified as _comment[instrument]_. Possible values are listed under https://www.ebi.ac.uk/ols/ontologies/ms/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMS_1000031&viewMode=All&siblings=false[instrument model term].

Additionally, it is strongly RECOMMENDED to include comment[MS2 analyzer type]. This is important, e.g., for Orbitrap models where MS2 scans can be acquired either in the Orbitrap or in the ion trap. Setting this value allows differentiating high-resolution MS/MS data. Possible values of _comment[MS2 analyzer type]_ are mass analyzer types.

[[additional-data-files]]
=== Additional Data files technical properties

It is RECOMMENDED to encode some of the technical parameters of the MS experiment as comments, including the following parameters:
@@ -276,7 +255,7 @@ It is RECOMMENDED to encode some of the technical parameters of the MS experimen
[[ptms]]
==== Protein Modifications

Sample modifications (including both chemical modifications and post-translational modifications, PTMs) originate from multiple sources: artifactual modifications, isotope labeling, adducts that are encoded as PTMs (e.g. sodium) or the most biologically relevant PTMs.
Sample modifications (including both chemical modifications and post-translational modifications, PTMs) originate from multiple sources: artifact modifications, isotope labeling, adducts that are encoded as PTMs (e.g. sodium) or the most biologically relevant PTMs.

It is RECOMMENDED to provide the modifications expected in the sample, including the amino acid affected and whether the modification is Variable or Fixed (Custom and Annotated modifications are also supported), and to include other properties such as the mass shift/delta mass and the position (e.g. anywhere in the sequence).

@@ -286,7 +265,6 @@ The modification parameters are the name of the ontology term MS:1001055.

For each modification, different properties are captured using a key=value pair structure including name, position, etc. All the possible (optional) features available for modification parameters are:


|===
|Property |Key |Example | Mandatory(:white_check_mark:)/Optional(:zero:) |comment

@@ -332,6 +310,7 @@ An example of an SDRF-Proteomics with annotated endopeptidase:

NOTE: If no endopeptidase is used, for example, in the case of Top-down/intact protein experiments, the value SHOULD be ‘not applicable’.

[[mass-tolerances]]
==== Precursor and Fragment mass tolerances

For proteomics experiments, it is important to encode different mass tolerances (for precursor and fragment ions).
@@ -344,6 +323,7 @@ For proteomics experiments, it is important to encode different mass tolerances

Units for the mass tolerances (either Da or ppm) MUST be provided.

[[study-variables]]
== SDRF-Proteomics study variables

The variable/property under study SHOULD be highlighted using the factor value category. For example, the _factor value[tissue]_ is used when the user wants to compare expression across different tissues. You can add multiple variables under study by providing multiple factor values.
@@ -359,6 +339,7 @@ Conventions define how to encode some particular information in the file format

In the convention section <<conventions>>, the columns are described and defined, while in the section use cases and templates <<use-cases>> the columns needed to describe a use case are specified.

[[age-encoding]]
=== How to encode age

One of the characteristics of a patient sample can be the age of an individual. It is RECOMMENDED to provide the age in the following format: {X}Y{X}M{X}D. Some valid examples are:
@@ -406,6 +387,7 @@ One possible exception is made for the case when one channel e.g., in a TMT/iTRA

Another possible value for _characteristics[pooled sample]_ is a string `pooled` for cases when it is known that a sample is pooled but the individual samples cannot be annotated.

[[derived-samples]]
=== Derived samples (such as patient-derived xenografts)

In cancer research, patient-derived xenografts (PDX) are commonly used. In those, the patient’s tumor is transplanted into another organism, usually a mouse. In these cases, the metadata, such as age and sex, MUST refer to the original patient and not the mouse.
@@ -414,6 +396,7 @@ PDX samples SHOULD be annotated by using the column name _characteristics[xenogr

For experiments where both the PDX and the original tumor are measured, the PDX entry SHOULD reference the respective tumor sample’s source name in the _characteristics[source name]_ column. Non-PDX samples SHOULD contain the “not applicable” value in the _characteristics[xenograft]_ and the characteristics[source name] column. Both tumor and PDX samples SHOULD reference the patient using the characteristics[individual] column. This column SHOULD contain some sort of patient identifier.

[[spiked-in]]
=== Spiked-in samples

There are multiple scenarios when a sample is spiked with additional analytes. Peptides, proteins, or mixtures can be added to the sample as controlled amounts to provide a standard or ground truth for quantification, or for retention time alignment, etc.
@@ -447,6 +430,7 @@ For multiple spiked components, the column _characteristics[spiked compound]_ ma

If the spiked component is another biological sample (e.g. __E. coli__ lysate spiked into human sample), then the spiked component MUST be annotated in its own row. Both components of the sample SHOULD have `characteristics[mass]` specified. Inclusion of _characteristics[spiked compound]_ is optional in this case; if provided, it SHOULD be the string `spiked` for the spiked sample.

[[synthetic-peptide]]
=== Synthetic peptide libraries

It is common to use synthetic peptide libraries for proteomics, and MS use cases include:
@@ -458,6 +442,7 @@ When describing synthetic peptide libraries, most of the sample metadata can be

It is important to annotate that the sample is a synthetic peptide library; this can be done by adding the _characteristics[synthetic peptide]_ column. The possible values are “synthetic” or “not synthetic”.

[[normal-healthy]]
=== Normal and healthy samples

Samples from healthy patients or individuals normally appear in manuscripts and annotations as healthy or normal. We RECOMMEND using the word “normal”, mapped to the PATO term PATO_0000461 (normal), which is included in EFO. Example:
@@ -469,6 +454,7 @@ Samples from healthy patients or individuals normally appear in manuscripts and
|sample_control | homo sapiens | Whole Organism | normal | none | normal
|===

[[sample-technical-biological-replicates]]
=== Encoding sample technical and biological replicates

Different measurements of the same biological sample are often categorized as (i) Technical or (ii) Biological replicates, based on whether they are (i) matched on all variables, e.g. same sample and same protocol; or (ii) different samples matched on explanatory variable(s), e.g. different patients receiving a placebo, in a placebo vs. drug trial. Technical and biological replicates have different levels of independence, which must be taken into account during data interpretation.
@@ -483,10 +469,10 @@ In the following example, only if the technical replicate column is provided, on

|===
| source name | assay name | comment[label] | comment[fraction identifier] | comment[technical replicate] | comment[data file]
| Sample 1 | run 1 | label free sample | 1 | 1 | 000261_C05_P0001563_A00_B00K_F1_TR1.RAW
| Sample 1 | run 2 | label free sample | 2 | 1 | 000261_C05_P0001563_A00_B00K_F2_TR1.RAW
| Sample 1 | run 3 | label free sample | 1 | 2 | 000261_C05_P0001563_A00_B00K_F1_TR2.RAW
| Sample 1 | run 4 | label free sample | 2 | 2 | 000261_C05_P0001563_A00_B00K_F2_TR2.RAW
| Sample 1 | run 1 | label free sample | 1 | 1 | F1_TR1.RAW
| Sample 1 | run 2 | label free sample | 2 | 1 | F2_TR1.RAW
| Sample 1 | run 3 | label free sample | 1 | 2 | F1_TR2.RAW
| Sample 1 | run 4 | label free sample | 2 | 2 | F2_TR2.RAW
|===

The _comment[technical replicate]_ column is MANDATORY. Please fill it with 1 if technical replicates are not performed in a study.
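
The rule above can be checked mechanically. The following is a minimal sketch rather than a reference validator: the function name `check_technical_replicates` is hypothetical, and the requirement that each value is a positive integer is an assumption inferred from the example table, not a statement of the specification.

```python
COLUMN = "comment[technical replicate]"


def check_technical_replicates(header, rows):
    """Return a list of problems found with the technical replicate column."""
    if COLUMN not in header:
        return [f"missing mandatory column {COLUMN}"]
    idx = header.index(COLUMN)
    errors = []
    for line_no, row in enumerate(rows, start=1):
        value = row[idx].strip()
        # Assumption: replicates are numbered with positive integers (1, 2, ...).
        if not (value.isascii() and value.isdigit()) or int(value) < 1:
            errors.append(f"row {line_no}: invalid technical replicate {value!r}")
    return errors


# Rows taken from the example table above.
header = [
    "source name", "assay name", "comment[label]",
    "comment[fraction identifier]", "comment[technical replicate]",
    "comment[data file]",
]
rows = [
    ["Sample 1", "run 1", "label free sample", "1", "1", "F1_TR1.RAW"],
    ["Sample 1", "run 2", "label free sample", "2", "1", "F2_TR1.RAW"],
    ["Sample 1", "run 3", "label free sample", "1", "2", "F1_TR2.RAW"],
    ["Sample 1", "run 4", "label free sample", "2", "2", "F2_TR2.RAW"],
]
print(check_technical_replicates(header, rows))
```
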
@@ -535,6 +521,38 @@ We RECOMMEND including the public URI of the file if available. For example, for

Curators can decide to annotate multiple ProteomeXchange datasets into one large SDRF-Proteomics file for reanalysis purposes. If that is the case, it is RECOMMENDED to use the comment[proteomexchange accession number] to differentiate between different datasets.

[[data-acquisition-method]]
=== Data acquisition method: DDA and DIA and others

Proteomics data are typically acquired using either Data Dependent Acquisition (DDA) or Data Independent Acquisition (DIA); other acquisition methods are also supported. The SDRF-Proteomics file format captures the method used for data acquisition in the _comment[proteomics data acquisition method]_ column. The following values are RECOMMENDED:

- data-dependent acquisition
- data-independent acquisition
- parallel reaction monitoring
- selected reaction monitoring

TIP: If the SDRF does not specify the acquisition method in the _comment[proteomics data acquisition method]_ column, the method is assumed to be DDA, which is the most common method used in proteomics.

You can find an example of a DIA experiment in the following link: https://github.com/bigbio/proteomics-sample-metadata/blob/master/annotated-projects/PXD018830/PXD018830-DIA.sdrf.tsv[DIA example]
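
The default described in the TIP can be expressed as a small lookup. This is an illustrative sketch only: the helper names `acquisition_method` and `is_recommended` are hypothetical, and each row is assumed to be available as a dictionary keyed by column name.

```python
COLUMN = "comment[proteomics data acquisition method]"

# Values listed as RECOMMENDED in this section.
RECOMMENDED_METHODS = {
    "data-dependent acquisition",
    "data-independent acquisition",
    "parallel reaction monitoring",
    "selected reaction monitoring",
}

DEFAULT_METHOD = "data-dependent acquisition"


def acquisition_method(row):
    """Return the annotated method, or DDA if the column is absent or empty."""
    value = (row.get(COLUMN) or "").strip()
    return value or DEFAULT_METHOD


def is_recommended(method):
    """True if the method is one of the RECOMMENDED values (case-insensitive)."""
    return method.strip().lower() in RECOMMENDED_METHODS


dia_row = {"assay name": "run 1", COLUMN: "data-independent acquisition"}
unannotated_row = {"assay name": "run 2"}
print(acquisition_method(dia_row))
print(acquisition_method(unannotated_row))
```

Because the listed values are RECOMMENDED rather than mandatory, `is_recommended` is meant for warnings, not for rejecting a file.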

[[dia]]
==== Data Independent Acquisition - Scan window limits

In addition to the general _comment[proteomics data acquisition method]_ column, the SDRF-Proteomics file format can capture other properties of the DIA method. The following properties are RECOMMENDED for DIA:

- _comment[scan window lower limit]_
- _comment[scan window upper limit]_

The scan window lower and upper limits are the m/z range used for the DIA acquisition. The values are expressed in m/z units.

Example:

|===
| | assay name | comment[scan window lower limit] | comment[scan window upper limit] | comment[data file]
|sample 1| run 1 | 400 m/z | 1200 m/z | FILE_R1.RAW
|sample 1| run 2 | 400 m/z | 1200 m/z | FILE_R2.RAW
|===
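
The scan window cells in the example above can be turned into numeric limits as follows. This is a minimal sketch under stated assumptions: values are written as a number followed by the unit `m/z`, as in the table, and the helper names `parse_mz` and `scan_window` are hypothetical.

```python
LOWER = "comment[scan window lower limit]"
UPPER = "comment[scan window upper limit]"


def parse_mz(value):
    """Convert a cell such as '400 m/z' into a float."""
    parts = value.split()
    if len(parts) != 2 or parts[1] != "m/z":
        raise ValueError(f"expected '<number> m/z', got {value!r}")
    return float(parts[0])


def scan_window(row):
    """Return (lower, upper) for a row, checking that the range is not inverted."""
    lower = parse_mz(row[LOWER])
    upper = parse_mz(row[UPPER])
    if lower >= upper:
        raise ValueError(f"lower limit {lower} is not below upper limit {upper}")
    return lower, upper


# First run of the example table, as a dictionary keyed by column name.
run_1 = {
    "assay name": "run 1",
    LOWER: "400 m/z",
    UPPER: "1200 m/z",
    "comment[data file]": "FILE_R1.RAW",
}
print(scan_window(run_1))
```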

[[use-cases]]
== SDRF-Proteomics use-cases representation (templates)
