Differential expression format#

Use cases#

  • Store the differentially expressed proteins between two contrasts, with the corresponding fold changes and p-values.

  • Enable easy visualization using tools like Volcano Plot.

  • Enable easy integration with other omics data resources.

  • Store metadata information about the project, the workflow and the columns in the file.

Format#

The differential expression format by quantms is based on the MSstats output. The MSstats format is a tab-delimited file that contains the following fields - see example file:

  • protein -> Protein Accession

  • label -> Label for the contrast on which the fold changes and p-values are based

  • log2fc -> Log2 Fold Change

  • se -> Standard error of the log2 fold change

  • df -> Degrees of freedom of the Student's t-test

  • pvalue -> Raw p-values

  • adj.pvalue -> P-values adjusted among all the proteins in the specific comparison using the approach by Benjamini and Hochberg

  • issue -> Indicates whether there is any inference issue for the corresponding protein and comparison, for example OneConditionMissing or CompleteMissing.

Example:

| protein | label | log2fc | se | df | pvalue | adj.pvalue | issue |
|---|---|---|---|---|---|---|---|
| LV861_HUMAN | normal-squamous cell carcinoma | 0.60 | 0.87 | 8 | 0.51 | 0.62 | NA |
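As an illustration of how such a file could be consumed, the following is a minimal Python sketch using only the standard library. It assumes the file has a header row carrying the field names listed above, that comment lines start with #, and that missing values are written as NA; the function name `read_de_rows` and the inline example content are hypothetical and not part of quantms or MSstats.

```python
import csv
import io

# Hypothetical file content: '#' comment lines, then a tab-separated table.
# The header row with these exact column names is an assumption.
EXAMPLE = (
    "#project_accession=PXD000000\n"
    "protein\tlabel\tlog2fc\tse\tdf\tpvalue\tadj.pvalue\tissue\n"
    "LV861_HUMAN\tnormal-squamous cell carcinoma\t0.60\t0.87\t8\t0.51\t0.62\tNA\n"
)

FLOAT_COLUMNS = ("log2fc", "se", "pvalue", "adj.pvalue")


def to_number(text, cast):
    """Convert a cell to a number, mapping 'NA' to None."""
    return None if text == "NA" else cast(text)


def read_de_rows(handle):
    """Yield one dict per data row, skipping '#' comment lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    for row in csv.DictReader(data_lines, delimiter="\t"):
        for column in FLOAT_COLUMNS:
            row[column] = to_number(row[column], float)
        row["df"] = to_number(row["df"], int)
        # 'NA' in the issue column is treated as "no issue reported".
        row["issue"] = None if row["issue"] == "NA" else row["issue"]
        yield row


rows = list(read_de_rows(io.StringIO(EXAMPLE)))
```

The same generator can be passed an open file handle instead of the `io.StringIO` wrapper used here.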

DE Header#

By default, the MSstats format has no metadata header. We suggest adding one to the output to make the file easier to interpret. MSstats allows comment lines, which start with #, so the quantms output begins with a set of key-value pairs that describe the project, the workflow and the columns in the file. For example:

#project_accession=PXD000000

In addition, for each default column of the matrix, the following information should be added:

#INFO=<ID=protein, Number=inf, Type=String, Description="Protein Accession">
#INFO=<ID=label, Number=1, Type=String, Description="Label for the Conditions combination">
#INFO=<ID=log2fc, Number=1, Type=Double, Description="Log2 Fold Change">
#INFO=<ID=se, Number=1, Type=Double, Description="Standard error of the log2 fold change">
#INFO=<ID=df, Number=1, Type=Integer, Description="Degree of freedom of the Student test">
#INFO=<ID=pvalue, Number=1, Type=Double, Description="Raw p-values">
#INFO=<ID=adj.pvalue, Number=1, Type=Double, Description="P-values adjusted among all the proteins in the specific comparison using the approach by Benjamini and Hochberg">
#INFO=<ID=issue, Number=1, Type=String, Description="Issue column shows if there is any issue for inference in corresponding protein and comparison">
  • ID is the column name in the matrix, Number is the number of values in a cell of that column (separated by ;), Type is the data type of the values, and Description is a description of the column. Number can range from 1 to inf (infinity).

  • Protein groups are written as a list of protein accessions separated by ; (e.g. P12345;P12346)
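The #INFO lines are regular enough to be read with a small regular expression. The sketch below is one possible way to do this in Python, not an official quantms parser; the function names `parse_info_line` and `split_values` are hypothetical, and the code assumes that only the Description value is quoted.

```python
import re

# Matches a whole '#INFO=<...>' line and captures what is between the brackets.
INFO_PATTERN = re.compile(r"^#INFO=<(?P<body>.*)>$")
# Matches key=value pairs; quoted values may contain commas.
FIELD_PATTERN = re.compile(r'(\w+)=("[^"]*"|[^,]+)')


def parse_info_line(line):
    """Return the INFO fields as a dict, or None if the line is not an INFO line."""
    match = INFO_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = {}
    for key, value in FIELD_PATTERN.findall(match.group("body")):
        fields[key] = value.strip().strip('"')
    return fields


def split_values(cell):
    """Split a multi-valued cell, such as a protein group, on ';'."""
    return cell.split(";")


info = parse_info_line(
    '#INFO=<ID=protein, Number=inf, Type=String, Description="Protein Accession">'
)
group = split_values("P12345;P12346")
```

Here `info["Number"]` is kept as the string "inf" or "1"; converting it to a number is left to the caller.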

We suggest including the following properties in the header:

  • project_accession: The project accession in PRIDE Archive

  • project_title: The project title in PRIDE Archive

  • project_description: The project description in PRIDE Archive

  • quantms_version: The version of the quantms workflow used to generate the file

  • factor_value: The factor values used in the analysis (e.g. phenotype)

  • fdr_threshold: The FDR threshold used to filter the protein lists (e.g. adj.pvalue < 0.05)
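A minimal sketch of how these key-value properties could be written and read back is shown below. The helper names `format_header` and `parse_header` are hypothetical, and the property values other than the PXD000000 accession are placeholders taken from the examples in the list above, not real project data.

```python
def format_header(properties):
    """Turn a dict of properties into '#key=value' comment lines."""
    return [f"#{key}={value}" for key, value in properties.items()]


def parse_header(lines):
    """Collect '#key=value' properties, ignoring '#INFO=' and data lines."""
    properties = {}
    for line in lines:
        if not line.startswith("#") or line.startswith("#INFO="):
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, separator, value = line[1:].rstrip("\n").partition("=")
        if separator:
            properties[key.strip()] = value.strip()
    return properties


header = format_header(
    {
        "project_accession": "PXD000000",
        "factor_value": "phenotype",            # placeholder value
        "fdr_threshold": "adj.pvalue < 0.05",   # placeholder value
    }
)
round_trip = parse_header(header + ["protein\tlabel\tlog2fc"])
```

Writing the properties before the #INFO lines and the table keeps all metadata at the top of the file, where a reader that skips # lines will ignore it.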

A complete example of a quantms output file can be seen here.