PSM table format

Use cases

The PSM table captures PSM-level detail for AI/ML training and other use cases.
Most of the content is similar to mzTab; a PSM is a peptide identification in a specific msrun file.
The representation should be a Parquet file. *
Store details at the PSM level, including spectrum m/z and intensity values, for specific use cases such as AI/ML training.
Fast and easy visualization and scanning at the PSM level.

*Parquet is a columnar storage format that supports nested data. For large-scale analyses, Parquet has helped its users reduce storage requirements by at least one-third on large datasets. It has also greatly improved scan and deserialization time (web use cases), and hence overall costs. The following table compares the savings and the speedup obtained by converting data from CSV to Parquet.

Dataset	Size on Amazon S3	Query Run Time	Data Scanned
Data stored as CSV files	1 TB	236 seconds	1.15 TB
Data stored in Apache Parquet Format	130 GB	6.78 seconds	2.51 GB
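
The ratios implied by the table can be worked out directly. The following is a minimal sketch that only restates the figures above; it assumes 1 TB = 1000 GB, and the variable and function names are illustrative.

```python
# Figures copied from the comparison table above.
# Assumption: 1 TB is treated as 1000 GB (decimal units).
GB_PER_TB = 1000

csv_run = {"size_gb": 1 * GB_PER_TB, "query_s": 236.0, "scanned_gb": 1.15 * GB_PER_TB}
parquet_run = {"size_gb": 130.0, "query_s": 6.78, "scanned_gb": 2.51}


def percent_saved(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


storage_saved = percent_saved(csv_run["size_gb"], parquet_run["size_gb"])
speedup = csv_run["query_s"] / parquet_run["query_s"]
scan_saved = percent_saved(csv_run["scanned_gb"], parquet_run["scanned_gb"])

print(f"storage saved: {storage_saved:.0f}%")
print(f"query speedup: {speedup:.1f}x")
print(f"less data scanned: {scan_saved:.1f}%")
```

Under that unit assumption, the table corresponds to roughly 87% less storage, a query that runs about 35 times faster, and about 99.8% less data scanned.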

Format

The PSM table format is defined by the Avro PSM schema. The fields of the PSM table are described below.

sequence: The peptide’s sequence corresponding to the PSM -> string
protein_accessions: A list of protein accessions -> list[string]
protein_start_positions: A list of protein start positions -> list[int]
protein_end_positions: A list of protein end positions -> list[int]
protein_global_qvalue: The global q-value of the associated protein or protein group -> double
unique: Indicates whether the peptide sequence (coming from the PSM) is unique for this protein with respect to the searched database -> boolean (0/1)
modifications: A list of modifications for a given peptide [modification1, modification2, ...]. Each modification should be recorded as a string that follows the modification definition -> list[string]
retention_time: The retention time of the spectrum -> float
charge: The charge assigned by the search engine/software -> integer
exp_mass_to_charge: The PSM’s experimental mass to charge (m/z) -> double
calc_mass_to_charge: The PSM’s calculated (theoretical) mass to charge (m/z) -> double
reference_file_name: The reference file name that contains the spectrum. -> string
scan_number: The scan number or index of the spectrum in the file -> string
peptidoform: Peptidoform of the PSM. See more documentation here. -> string
posterior_error_probability: Posterior Error Probability score from quantms -> double
global_qvalue: Global q-value from quantms -> double
is_decoy: Indicates whether the peptide sequence (coming from the PSM) is decoy -> boolean (0/1)
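
For readers who handle PSM rows in Python, the required fields above could be mirrored by a small record type. This is only a sketch, not the Avro schema itself: the class name `PsmRecord`, both helper methods, and all example values are hypothetical placeholders, and the check that the three protein lists are parallel is an assumption the field descriptions do not state explicitly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PsmRecord:
    """Hypothetical in-memory mirror of the required PSM fields."""

    sequence: str
    protein_accessions: List[str]
    protein_start_positions: List[int]
    protein_end_positions: List[int]
    protein_global_qvalue: float
    unique: bool
    modifications: List[str]
    retention_time: float
    charge: int
    exp_mass_to_charge: float
    calc_mass_to_charge: float
    reference_file_name: str
    scan_number: str
    peptidoform: str
    posterior_error_probability: float
    global_qvalue: float
    is_decoy: bool

    def mass_error_ppm(self) -> float:
        """Relative difference between experimental and calculated m/z, in ppm."""
        delta = self.exp_mass_to_charge - self.calc_mass_to_charge
        return delta / self.calc_mass_to_charge * 1e6

    def protein_lists_are_parallel(self) -> bool:
        # Assumption: one start and one end position per protein accession.
        n = len(self.protein_accessions)
        return len(self.protein_start_positions) == n and len(self.protein_end_positions) == n


# Placeholder values for illustration only; not a real identification.
example = PsmRecord(
    sequence="PEPTIDEK",
    protein_accessions=["P12345"],
    protein_start_positions=[10],
    protein_end_positions=[17],
    protein_global_qvalue=0.001,
    unique=True,
    modifications=[],
    retention_time=1234.5,
    charge=2,
    exp_mass_to_charge=464.7405,
    calc_mass_to_charge=464.7400,
    reference_file_name="run_01",
    scan_number="2001",
    peptidoform="PEPTIDEK",
    posterior_error_probability=0.0005,
    global_qvalue=0.001,
    is_decoy=False,
)

print(f"mass error: {example.mass_error_ppm():.2f} ppm")
print("protein lists parallel:", example.protein_lists_are_parallel())
```

A typed record like this makes it easy to catch a missing field or a mismatched list length before rows are written out.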

Optional fields:

gene_accessions: A list of gene accessions -> list[string]
gene_names: A list of gene names -> list[string]
consensus_support: Global consensus support scores for multiple search engines -> float
mz_array: A list of mz values for the spectrum -> list[double]
intensity_array: A list of intensity values for the spectrum -> list[float]
num_peaks: The number of peaks in the spectrum; this is the length of the mz_array and intensity_array lists -> integer
id_scores: A list of identification scores from search engines, Percolator, etc. Each search engine score will be a key/value pair (e.g. "MS-GF:RawScore": 78.9) -> list[string]
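
Two of the optional fields lend themselves to simple sanity checks. The sketch below is illustrative: both function names are hypothetical, and the parser assumes each `id_scores` entry is a string such as `"MS-GF:RawScore": 78.9`, with the numeric value after the last colon. The exact string serialization is not specified above, so treat that as an assumption.

```python
from typing import Dict, List


def spectrum_is_consistent(
    mz_array: List[float], intensity_array: List[float], num_peaks: int
) -> bool:
    """Check that num_peaks matches the length of both spectrum arrays."""
    return len(mz_array) == len(intensity_array) == num_peaks


def parse_id_scores(id_scores: List[str]) -> Dict[str, float]:
    """Split each entry into a score name and a float value.

    Assumption: the value follows the LAST colon, because score names
    such as "MS-GF:RawScore" contain a colon themselves.
    """
    scores: Dict[str, float] = {}
    for entry in id_scores:
        name, _, value = entry.rpartition(":")
        # Drop surrounding whitespace and optional double quotes from the name.
        scores[name.strip().strip('"')] = float(value)
    return scores


# Placeholder spectrum values for illustration only.
print(spectrum_is_consistent([100.1, 200.2, 300.3], [10.0, 20.0, 5.0], 3))
print(parse_id_scores(['"MS-GF:RawScore": 78.9']))
```

Splitting on the last colon rather than the first keeps score names that contain a colon intact.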