Peptide table format

Use cases

The peptide table covers peptide-level detail, including peptide intensity. Most of its content is taken from the peptide section of mzTab.

  • Store peptide intensity for downstream analysis and integration.

  • Enable easy visualization and scanning at the peptide level.

Format

For large-scale datasets the peptide section can become very large, so the table is stored in Parquet format. Its data section consists mainly of the following columns:

  • sequence: Peptide sequence -> string

  • protein_accessions: A list of protein accessions -> list[string] (e.g. [P02768, P02769])

  • unique: Indicates whether the peptide is unique to this protein with respect to the searched database -> boolean (0/1)

  • best_id_score: A key/value pair holding the best search engine score selected by the algorithm (e.g. "MS-GF:RawScore": 234.0) -> string

  • posterior_error_probability: Posterior Error Probability scores -> double

  • modifications: A list of modifications for a given peptide -> [modification1, modification2, ...]. Each modification is recorded as a string, similar to mzTab: {position}({Probabilistic Score:0.9})|{position2}|..-{modification accession or name}

    -> e.g. 1(Probabilistic Score:0.9)|2|3-UNIMOD:35

  • charge: Precursor charge -> int

  • exp_mass_to_charge: The precursor’s experimental mass to charge (m/z) -> double

  • peptidoform: Peptidoform of the peptide (e.g. PEPTIDE[+80.0]FORM) -> string

  • sample_accession: A unique sample accession corresponding to the source name in the SDRF -> string

  • abundance: The peptide’s abundance in the given sample -> float

  • is_decoy: Indicates whether the peptide sequence is a decoy -> boolean (0/1)
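
The modification string described above can be split into its parts with a short parser. The following is a minimal sketch rather than a reference implementation: the names parse_modification and ModSite are hypothetical, and it assumes that positions are integers, that an optional score appears in parentheses as "label:value", and that the first hyphen after the position list separates it from the modification accession or name.

```python
import re
from typing import NamedTuple, Optional

# One site: an integer position, optionally followed by "(label:score)".
_SITE = r"\d+(?:\([^)]*\))?"
# Whole string: sites joined by "|", then "-", then the accession or name.
_MOD_RE = re.compile(rf"^(?P<sites>{_SITE}(?:\|{_SITE})*)-(?P<name>.+)$")
_SITE_RE = re.compile(r"(?P<pos>\d+)(?:\((?P<detail>[^)]*)\))?")


class ModSite(NamedTuple):
    """Hypothetical container for one candidate modification site."""
    position: int
    score: Optional[float]  # None when the site carries no score


def parse_modification(text: str) -> tuple[str, list[ModSite]]:
    """Return (accession or name, candidate sites) for one modification string."""
    match = _MOD_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized modification string: {text!r}")
    sites = []
    for site in _SITE_RE.finditer(match["sites"]):
        detail = site["detail"]
        # Assumption: the score is whatever follows the last ":" in the parentheses.
        score = float(detail.rpartition(":")[2]) if detail is not None else None
        sites.append(ModSite(int(site["pos"]), score))
    return match["name"], sites


name, sites = parse_modification("1(Probabilistic Score:0.9)|2|3-UNIMOD:35")
```

For the example string, name holds the accession UNIMOD:35 and sites holds three entries, of which only position 1 carries a score.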

Optional fields:

  • number_of_psms: Number of PSMs for the peptide in the sample given by sample_accession -> int

  • retention_time: Retention time in seconds; this can be the median of all retention times in the peptide quantification features -> float

  • gene_accessions: A list of gene accessions -> list[string] (e.g. [ENSG00000139618, ENSG00000139618])

  • gene_names: A list of gene names -> list[string] (e.g. [APOA1, APOA1])

  • consensus_support: Global consensus support scores for multiple search engines -> float

  • id_scores: A list of identification scores from search engines, Percolator, etc. Each score is a key/value pair (e.g. "MS-GF:RawScore": 78.9) -> list[string]

  • reference_file_name: The name of the reference file that contains the spectrum -> string

  • scan_number: The scan number or index of the spectrum in the file -> string
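
The best_id_score and id_scores columns store each score as a key/value string. The sketch below shows one way to turn such a string into a name and a number. The function name parse_score is hypothetical, and the code assumes the string is serialized exactly as in the examples above: a quoted score name, a colon, then a numeric value. Because the score name can itself contain a colon, the split is made on the last colon.

```python
def parse_score(entry: str) -> tuple[str, float]:
    """Split one '"name": value' score entry into (name, value)."""
    # Split on the LAST colon, since names such as MS-GF:RawScore contain one.
    name, sep, value = entry.rpartition(":")
    if not sep:
        raise ValueError(f"no key/value separator in score entry: {entry!r}")
    # Drop surrounding whitespace and the double quotes around the name.
    return name.strip().strip('"'), float(value)


def parse_scores(entries: list[str]) -> dict[str, float]:
    """Convert an id_scores list into a name -> value mapping."""
    return dict(parse_score(entry) for entry in entries)


best_name, best_value = parse_score('"MS-GF:RawScore": 234.0')
```

Applied to a whole id_scores list, parse_scores gives a dictionary keyed by score name, which is convenient when scores from several search engines are compared.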