PSM table format
Use cases
The PSM table captures PSM-level detail for AI/ML training and other use cases.
Most of the content is similar to mzTab; a PSM is a peptide identification in a specific msrun file.
The representation should be a Parquet file. *
Store PSM-level details, including spectrum m/z and intensity values, for specific use cases such as AI/ML training.
Fast and easy visualization and scanning at the PSM level.
*Parquet is a columnar storage format that supports nested data. For large-scale analyses, Parquet has helped its users reduce storage requirements by at least one-third on large datasets; in addition, it greatly improves scan and deserialization time (web use cases) and hence reduces overall costs. The following table compares the savings and the speedup obtained by converting data from CSV to Parquet.
Dataset | Size on Amazon S3 | Query Run Time | Data Scanned |
---|---|---|---|
Data stored as CSV files | 1 TB | 236 seconds | 1.15 TB |
Data stored in Apache Parquet Format | 130 GB | 6.78 seconds | 2.51 GB |
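
To make the comparison concrete, the ratios implied by the table can be worked out directly. The short sketch below only restates the table's figures; it assumes 1 TB = 1000 GB, and the variable and function names are illustrative.

```python
# Figures copied from the comparison table; 1 TB is taken as 1000 GB (assumption).
GB_PER_TB = 1000

csv_run = {"size_gb": 1 * GB_PER_TB, "query_s": 236.0, "scanned_gb": 1.15 * GB_PER_TB}
parquet_run = {"size_gb": 130.0, "query_s": 6.78, "scanned_gb": 2.51}


def reduction_pct(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


storage_saving = reduction_pct(csv_run["size_gb"], parquet_run["size_gb"])
scan_saving = reduction_pct(csv_run["scanned_gb"], parquet_run["scanned_gb"])
speedup = csv_run["query_s"] / parquet_run["query_s"]

print(f"storage reduced by {storage_saving:.0f}%")
print(f"data scanned reduced by {scan_saving:.2f}%")
print(f"query speedup: {speedup:.1f}x")
```

Under that unit assumption, the table corresponds to roughly 87% less storage and a query that runs about 35 times faster.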
Format
The Avro PSM schema is used to define the PSM table format. The following table describes the fields of the PSM table.
sequence
: The peptide’s sequence corresponding to the PSM ->string
protein_accessions
: A list of protein accessions ->list[string]
protein_start_positions
: A list of the peptide start positions in the corresponding proteins ->list[int]
protein_end_positions
: A list of the peptide end positions in the corresponding proteins ->list[int]
protein_global_qvalue
: The global q-value of the associated protein or protein group ->double
unique
: Indicates whether the peptide sequence (coming from the PSM) is unique for this protein with respect to the searched database ->boolean (0/1)
modifications
: A list of modifications for a given peptide: [modification1, modification2, ...]. Each modification should be recorded as a string, as described in the modification definition ->list[string]
retention_time
: The retention time of the spectrum ->float
charge
: The charge assigned by the search engine/software ->integer
exp_mass_to_charge
: The PSM’s experimental mass to charge (m/z) ->double
calc_mass_to_charge
: The PSM’s calculated (theoretical) mass to charge (m/z) ->double
reference_file_name
: The reference file name that contains the spectrum. ->string
scan_number
: The scan number or index of the spectrum in the file ->string
peptidoform
: Peptidoform of the PSM. See more documentation here. ->string
posterior_error_probability
: Posterior Error Probability score from quantms ->double
global_qvalue
: Global q-value from quantms ->double
is_decoy
: Indicates whether the peptide sequence (coming from the PSM) is decoy ->boolean (0/1)
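
As a reading aid, the required fields can be mirrored in a small Python record. This is a minimal sketch, not the Avro schema itself: the `PSM` class, its two helper methods, and the example values are hypothetical. It also assumes that the three protein lists are parallel (one entry per accession). The ppm error is a derived quantity, not a field of the format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PSM:
    """Hypothetical in-memory mirror of the required PSM fields."""
    sequence: str
    protein_accessions: List[str]
    protein_start_positions: List[int]
    protein_end_positions: List[int]
    protein_global_qvalue: float
    unique: bool
    modifications: List[str]
    retention_time: float
    charge: int
    exp_mass_to_charge: float
    calc_mass_to_charge: float
    reference_file_name: str
    scan_number: str
    peptidoform: str
    posterior_error_probability: float
    global_qvalue: float
    is_decoy: bool

    def positions_consistent(self) -> bool:
        # Assumption: accessions, start and end positions are parallel lists.
        return (len(self.protein_accessions)
                == len(self.protein_start_positions)
                == len(self.protein_end_positions))

    def mz_error_ppm(self) -> float:
        # Derived value: relative m/z deviation in parts per million.
        delta = self.exp_mass_to_charge - self.calc_mass_to_charge
        return delta / self.calc_mass_to_charge * 1e6


# Made-up values, for illustration only.
example = PSM(
    sequence="PEPTIDEK",
    protein_accessions=["P12345"],
    protein_start_positions=[10],
    protein_end_positions=[17],
    protein_global_qvalue=0.001,
    unique=True,
    modifications=[],
    retention_time=1234.5,
    charge=2,
    exp_mass_to_charge=500.2510,
    calc_mass_to_charge=500.2500,
    reference_file_name="run_01",
    scan_number="1024",
    peptidoform="PEPTIDEK",
    posterior_error_probability=0.0005,
    global_qvalue=0.001,
    is_decoy=False,
)

print(example.positions_consistent())
print(f"{example.mz_error_ppm():.2f} ppm")
```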
Optional fields:
gene_accessions
: A list of gene accessions ->list[string]
gene_names
: A list of gene names ->list[string]
consensus_support
: Global consensus support scores for multiple search engines ->float
mz_array
: A list of m/z values for the spectrum ->list[double]
intensity_array
: A list of intensity values for the spectrum ->list[float]
num_peaks
: The number of peaks in the spectrum; this is the length of the mz_array and intensity_array lists ->integer
id_scores
: A list of identification scores from search engines, Percolator, etc. Each score is recorded as a key/value pair (e.g. "MS-GF:RawScore": 78.9) ->list[string]
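
Two of the optional fields lend themselves to simple consistency checks. The sketch below assumes each `id_scores` entry is a string shaped exactly like the example above (a quoted name, a colon, a number); the text does not spell out the encoding beyond that example, and the helper names `parse_id_scores` and `peaks_consistent` are hypothetical.

```python
from typing import Dict, List


def parse_id_scores(id_scores: List[str]) -> Dict[str, float]:
    """Turn entries such as '"MS-GF:RawScore": 78.9' into a name -> value map."""
    scores: Dict[str, float] = {}
    for entry in id_scores:
        # Split on the LAST colon, because score names may contain colons.
        name, sep, value = entry.rpartition(":")
        if not sep:
            raise ValueError(f"not a key/value pair: {entry!r}")
        scores[name.strip().strip('"')] = float(value)
    return scores


def peaks_consistent(mz_array: List[float],
                     intensity_array: List[float],
                     num_peaks: int) -> bool:
    """num_peaks should equal the length of both spectrum arrays."""
    return len(mz_array) == len(intensity_array) == num_peaks


print(parse_id_scores(['"MS-GF:RawScore": 78.9']))
print(peaks_consistent([100.1, 200.2], [10.0, 20.0], 2))
```

Splitting on the last colon rather than the first keeps score names such as MS-GF:RawScore intact.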