Peptide table format
====================

Use cases
---------

The Peptide table aims to cover detail on peptide level including
peptide intensity. The most of content are from peptide part of mzTab.

-  Store peptide intensity to perform down-stream analysis and
   integration.
-  Enable easy visualization and scanning on peptide level.

Format
---------

For large-scale datasets, a peptide section would be very large.
Therefore, a Parquet format is adopted and its data section mainly
consists of the following column:

-  ``sequence``: Peptide sequence -> ``string``
-  ``protein_accessions``: A list protein’s accessions -> ``list[string] (e.g. [P02768, P02769])``
-  ``unique``: Indicates whether the peptide is unique for this protein in respect to the searched database -> ``boolean (0/1)``
-  ``best_id_score``: A key value pair of the best search engine score selected by the algorithm ``(e.g. "MS-GF:RawScore": 234.0)`` -> ``string``
-  ``posterior_error_probability``: Posterior Error Probability scores -> ``double``
-  ``modifications``: A list of modifications for a give peptide -> ``[modification1, modification2, ...]``. A modification should be recorded as string similarly to mztab like:
   -  ``{position}({Probabilistic Score:0.9})|{position2}|..-{modification accession or name}``
      -> e.g ``1(Probabilistic Score:0.9)|2|3-UNIMOD:35``

-  ``charge``: Precursor charge -> ``int``
-  ``exp_mass_to_charge``: The precursor’s experimental mass to charge (m/z) -> ``double``
-  ``peptidoform``: Peptidoform of the peptide ``PEPTIDE[+80.0]FORM`` -> ``string``
-  ``sample_accession``: A unique sample accession corresponding to the source name in the SDRF-> ``string``
-  ``abundance``: The peptide’s abundance in the given sample -> ``float``
-  ``is_decoy``: Indicates whether the peptide sequence is decoy -> ``boolean (0/1)``

Optional fields:

-  ``number_of_psms``: Number of PSMs for the peptide in the given sample ``sample_accession`` -> ``int``
-  ``retention_time``: Retention time (seconds), it can be the median across all retention times in the :doc:`feature` -> ``float``
-  ``gene_accessions``: A list of gene accessions -> ``list[string] (e.g. [ENSG00000139618, ENSG00000139618])``
-  ``gene_names``: A list of gene names -> ``list[string] (e.g. [APOA1, APOA1])``
-  ``consensus_support``: Global consensus support scores for multiple search engines -> ``float``
-  ``id_scores``: A list of identification scores, search engine, percolator etc. Each search engine score will be a key/value pair ``(e.g. "MS-GF:RawScore": 78.9)`` -> ``list[string]``
-  ``reference_file_name``: The reference file name that contains the spectrum. -> ``string``
-  ``scan_number``: The scan number of the spectrum. The scan number or index of the spectrum in the file -> ``string``