Differential expression format

Use cases

Store the differentially expressed proteins for a contrast between two conditions, with the corresponding fold changes and p-values.
Enable easy visualization using tools like Volcano Plot.
Enable easy integration with other omics data resources.
Store metadata information about the project, the workflow and the columns in the file.

Format

The quantms differential expression format is based on the MSstats output, a tab-delimited file that contains the following fields (see example file):

protein -> Protein Accession
label -> Label for the contrast on which the fold changes and p-values are based
log2fc -> Log2 Fold Change
se -> Standard error of the log2 fold change
df -> Degrees of freedom of the Student's t-test
pvalue -> Raw p-values
adj.pvalue -> P-values adjusted among all the proteins in the specific comparison using the approach by Benjamini and Hochberg
issue -> Flags any issue with inference for the corresponding protein and comparison, for example OneConditionMissing or CompleteMissing.

Example:

protein	label	log2fc	se	df	pvalue	adj.pvalue	issue
LV861_HUMAN	normal-squamous cell carcinoma	0.60	0.87	8	0.51	0.62	NA
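
As an illustration of the visualization use case, the following minimal Python sketch reads the tab-delimited fields listed above and derives volcano plot coordinates (log2fc on the x axis, -log10 of the raw p-value on the y axis). The function names `to_number`, `read_de_table` and `volcano_point` are hypothetical helpers, not part of quantms or MSstats, and the sketch assumes that missing numeric values are written as NA and that df is a whole number.

```python
import csv
import io
import math


def to_number(text, cast=float):
    """Convert a cell to a number; 'NA' or empty cells become None (assumed convention)."""
    return None if text in ("NA", "") else cast(text)


def read_de_table(handle):
    """Return the data rows of a differential expression file as dicts."""
    # Drop comment lines (starting with '#') and blank lines before parsing.
    data_lines = [line for line in handle if line.strip() and not line.startswith("#")]
    rows = []
    for row in csv.DictReader(data_lines, delimiter="\t"):
        for column in ("log2fc", "se", "pvalue", "adj.pvalue"):
            row[column] = to_number(row[column])
        row["df"] = to_number(row["df"], int)
        rows.append(row)
    return rows


def volcano_point(row):
    """Return (x, y) = (log2fc, -log10(pvalue)), or None if either value is missing."""
    if row["log2fc"] is None or row["pvalue"] is None or row["pvalue"] <= 0:
        return None
    return row["log2fc"], -math.log10(row["pvalue"])


# The example row from this section, with tab-separated cells.
example = io.StringIO(
    "protein\tlabel\tlog2fc\tse\tdf\tpvalue\tadj.pvalue\tissue\n"
    "LV861_HUMAN\tnormal-squamous cell carcinoma\t0.60\t0.87\t8\t0.51\t0.62\tNA\n"
)
rows = read_de_table(example)
print(rows[0]["protein"], volcano_point(rows[0]))
```

Because comment lines are filtered out before parsing, the same reader should also work on files that carry a metadata header.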

DE Header

By default, the MSstats format does not include a metadata header. We suggest adding one to the output to make the file easier to interpret. MSstats allows comments in the file on any line that starts with #. The quantms output therefore starts with key-value pairs that describe the project, the workflow and the columns in the file. For example:

#project_accession=PXD000000
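
A minimal sketch of how such key-value lines could be collected in Python is shown here. The function name `parse_properties` is hypothetical, and the sketch assumes one `#key=value` pair per line; lines that start with `#INFO=` are column descriptions and are skipped.

```python
def parse_properties(lines):
    """Collect '#key=value' header lines into a dict, ignoring '#INFO=' lines."""
    properties = {}
    for line in lines:
        if not line.startswith("#") or line.startswith("#INFO="):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, separator, value = line[1:].strip().partition("=")
        if separator:
            properties[key.strip()] = value.strip()
    return properties


header_lines = [
    "#project_accession=PXD000000",
    "protein\tlabel\tlog2fc\tse\tdf\tpvalue\tadj.pvalue\tissue",
]
properties = parse_properties(header_lines)
print(properties)
```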

In addition, the following information should be added for each default column of the matrix:

#INFO=<ID=protein, Number=inf, Type=String, Description="Protein Accession">
#INFO=<ID=label, Number=1, Type=String, Description="Label for the Conditions combination">
#INFO=<ID=log2fc, Number=1, Type=Double, Description="Log2 Fold Change">
#INFO=<ID=se, Number=1, Type=Double, Description="Standard error of the log2 fold change">
#INFO=<ID=df, Number=1, Type=Integer, Description="Degree of freedom of the Student test">
#INFO=<ID=pvalue, Number=1, Type=Double, Description="Raw p-values">
#INFO=<ID=adj.pvalue, Number=1, Type=Double, Description="P-values adjusted among all the proteins in the specific comparison using the approach by Benjamini and Hochberg">
#INFO=<ID=issue, Number=1, Type=String, Description="Issue column shows if there is any issue for inference in corresponding protein and comparison">

ID is the column name in the matrix, Number is the number of values each cell of the column can hold (multiple values are separated by ;), Type is the data type of the values and Description describes the column. Number can range from 1 to inf (infinity).
Protein groups are written as a list of protein accessions separated by ; (e.g. P12345;P12346).
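
The #INFO lines can be parsed with a small regular expression. The sketch below is one possible approach, not the quantms implementation; `parse_info` and `split_protein_group` are hypothetical names, and the pattern assumes that only the Description value is quoted and that quoted values contain no double quotes.

```python
import re

# Matches the whole line and captures the text between '<' and '>'.
INFO_LINE = re.compile(r"^#INFO=<(.*)>\s*$")
# Matches key=value pairs; a value is either a quoted string or runs up to the next comma.
FIELD = re.compile(r'(\w+)=("[^"]*"|[^,]+)')


def parse_info(line):
    """Return the fields of one '#INFO=<...>' line as a dict, or None if it is not an INFO line."""
    match = INFO_LINE.match(line)
    if match is None:
        return None
    return {key: value.strip().strip('"') for key, value in FIELD.findall(match.group(1))}


def split_protein_group(value):
    """Split a protein group such as 'P12345;P12346' into its accessions."""
    return [accession for accession in value.split(";") if accession]


info = parse_info('#INFO=<ID=protein, Number=inf, Type=String, Description="Protein Accession">')
print(info)
print(split_protein_group("P12345;P12346"))
```

Keeping Number as a string avoids special handling for inf; a consumer can compare it with "1" to decide whether a column needs to be split on ;.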

We suggest including the following properties in the header:

project_accession: The project accession in PRIDE Archive
project_title: The project title in PRIDE Archive
project_description: The project description in PRIDE Archive
quantms_version: The version of the quantms workflow used to generate the file
factor_value: The factor values used in the analysis (e.g. phenotype)
fdr_threshold: The FDR threshold used to filter the protein lists (e.g. adj.pvalue < 0.05)
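
To show how these properties and the column descriptions fit together, here is a minimal sketch that assembles header lines in Python. The function `build_header` is a hypothetical helper, and the property values are placeholders taken from the examples in this section rather than from a real project.

```python
INFO_TEMPLATE = '#INFO=<ID={id}, Number={number}, Type={type}, Description="{description}">'


def build_header(properties, columns):
    """Return header lines: one '#key=value' per property, then one '#INFO' line per column."""
    lines = [f"#{key}={value}" for key, value in properties.items()]
    lines.extend(INFO_TEMPLATE.format(**column) for column in columns)
    return lines


# Placeholder values for illustration only.
properties = {
    "project_accession": "PXD000000",
    "factor_value": "phenotype",
    "fdr_threshold": "adj.pvalue < 0.05",
}
columns = [
    {"id": "protein", "number": "inf", "type": "String", "description": "Protein Accession"},
    {"id": "log2fc", "number": "1", "type": "Double", "description": "Log2 Fold Change"},
]
header = build_header(properties, columns)
print("\n".join(header))
```

The resulting lines would be written before the tab-delimited column names, so that tools which skip lines starting with # can still read the table.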

A complete example of a quantms output file can be seen here.