Peptide table format#

Use cases#

The Peptide table aims to cover detail on peptide level including peptide intensity. The most of content are from peptide part of mzTab.

Store peptide intensity to perform down-stream analysis and integration.
Enable easy visualization and scanning on peptide level.

Format#

For large-scale datasets, a peptide section would be very large. Therefore, a Parquet format is adopted and its data section mainly consists of the following column:

sequence: Peptide sequence -> string
protein_accessions: A list protein’s accessions -> list[string] (e.g. [P02768, P02769])
unique: Indicates whether the peptide is unique for this protein in respect to the searched database -> boolean (0/1)
best_id_score: A key value pair of the best search engine score selected by the algorithm (e.g. "MS-GF:RawScore": 234.0) -> string
posterior_error_probability: Posterior Error Probability scores -> double
modifications: A list of modifications for a give peptide -> [modification1, modification2, ...]. A modification should be recorded as string similarly to mztab like: - {position}({Probabilistic Score:0.9})|{position2}|..-{modification accession or name}

-> e.g 1(Probabilistic Score:0.9)|2|3-UNIMOD:35
charge: Precursor charge -> int
exp_mass_to_charge: The precursor’s experimental mass to charge (m/z) -> double
peptidoform: Peptidoform of the peptide PEPTIDE[+80.0]FORM -> string
sample_accession: A unique sample accession corresponding to the source name in the SDRF-> string
abundance: The peptide’s abundance in the given sample -> float
is_decoy: Indicates whether the peptide sequence is decoy -> boolean (0/1)

Optional fields:

number_of_psms: Number of PSMs for the peptide in the given sample sample_accession -> int
retention_time: Retention time (seconds), it can be the median across all retention times in the Peptide quantification features -> float
gene_accessions: A list of gene accessions -> list[string] (e.g. [ENSG00000139618, ENSG00000139618])
gene_names: A list of gene names -> list[string] (e.g. [APOA1, APOA1])
consensus_support: Global consensus support scores for multiple search engines -> float
id_scores: A list of identification scores, search engine, percolator etc. Each search engine score will be a key/value pair (e.g. "MS-GF:RawScore": 78.9) -> list[string]
reference_file_name: The reference file name that contains the spectrum. -> string
scan_number: The scan number of the spectrum. The scan number or index of the spectrum in the file -> string