quantms.io tools

quantms.io tools provides a standardized set of commands for generating the files of a project. Its main job is to consolidate the data from each input file into a standardized file representation.


You can generate separate files or a complete project, depending on your needs. A complete project contains the following files:

  • project.json -> contains descriptive information about the entire project.

Example:

{
 "project_accession": "PXD014414",
 "project_title": "",
 "project_sample_description": "",
 "project_data_description": "",
 "project_pubmed_id": 32265444,
 "organisms": [
     "Homo sapiens"
 ],
 "organism_parts": [
     "mammary gland",
     "adjacent normal tissue"
 ],
 "diseases": [
     "metaplastic breast carcinomas",
     "Triple-negative breast cancer",
     "Normal",
     "not applicable"
 ],
 "cell_lines": [
     "not applicable"
 ],
 "instruments": [
     "Orbitrap Fusion"
 ],
 "enzymes": [
     "Trypsin"
 ],
 "experiment_type": [
     "Triple-negative breast cancer",
     "Wisp3",
     "Tandem mass tag (tmt) labeling",
     "Ccn6",
     "Metaplastic breast carcinoma",
     "Precision therapy",
     "Lc-ms/ms shotgun proteomics"
 ],
 "acquisition_properties": [
     {"proteomics data acquisition method": "TMT"},
     {"proteomics data acquisition method": "Data-dependent acquisition"},
     {"dissociation method": "HCD"},
     {"precursor mass tolerance": "20 ppm"},
     {"fragment mass tolerance": "0.6 Da"}
 ],
 "quantms_files": [
     {"feature_file": "PXD014414-943a8f02-0527-4528-b1a3-b96de99ebe75.featrue.parquet"},
     {"sdrf_file": "PXD014414-f05eca35-9381-40d8-a7da-2fe57745afaf.sdrf.tsv"},
     {"psm_file": "PXD014414-f4fb88f6-0a45-451d-a8a6-b6d58fb83670.psm.parquet"},
     {"differential_file": "PXD014414-3026e5d5-fb0e-45e9-a4f0-c97d86536716.differential.tsv"}
 ],
 "quantms_version": "1.1.1",
 "comments": []
}
  • absolute_expression.tsv or differential_expression.tsv

The differential expression format by quantms is based on the MSstats output.

Example:

protein      label                           log2fc  se    df  pvalue  adj.pvalue  issue
LV861_HUMAN  normal-squamous cell carcinoma  0.60    0.87  8   0.51    0.62        NA

The absolute expression format by quantms contains iBAQ values.

Example:

protein      sample_accession  condition  ibaq    ribaq
LV861_HUMAN  Sample-1          heart      1234.1  12.34

  • feature.parquet

feature.parquet covers details at the peptide level.

Example:

Column                        Example value
sequence                      ASPDWGYDDK
protein_accessions            ['sp|CONTAMINANT_P00915|CONTAMINANT_CAH1_HUMAN','sp|P00915|CAH1_HUMAN']
protein_start_positions       [1 2]
protein_end_positions         [10 11]
protein_global_qvalue         0.001882796
unique                        0
modifications                 ['0-UNIMOD:1' '10-UNIMOD:737']
retention_time                7522.223146
charge                        2
exp_mass_to_charge            712.831298
calc_mass_to_charge           712.8302134
peptidoform                   [Acetyl]-ASPDWGYDDK[TMT6plex]
posterior_error_probability   4.97E-05
global_qvalue                 0
is_decoy                      0
intensity                     454585.3
spectral_count                1
sample_accession              PXD014414-Sample-10
condition                     Norm
fraction                      1
biological_replicate          10
fragment_ion                  None
isotope_label_type            L
run                           1_1_1
channel                       TMT131
id_scores                     ["'OpenMS:Best PSM Score':0.0",'Best PSM PEP:4.96872e-05']
reference_file_name           UM_F_50cm_2019_0414
best_psm_reference_file_name  UM_F_50cm_2019_0430
best_psm_scan_number          53434
mz_array
intensity_array
num_peaks
gene_accessions
gene_names

  • psm.parquet

psm.parquet stores details at the PSM level, including spectrum m/z and intensity values, for specific use cases such as AI/ML training.

Example:

Columns, in order:

sequence, protein_accessions, protein_start_positions, protein_end_positions, protein_global_qvalue, unique, modifications, retention_time, charge, exp_mass_to_charge, calc_mass_to_charge, peptidoform, posterior_error_probability, global_qvalue, is_decoy, id_scores, consensus_support, reference_file_name, scan_number, mz_array, intensity_array, num_peaks, gene_accessions, gene_names

Example row (non-empty cells only, in column order):

SSPGHR; ['sp|P29692|EF1D_HUMAN']; [118]; [123]; 0.001882796; 1; ['1-UNIMOD:737']; 1258.2; 2; 435.2432855; 435.2431809; S[TMT6plex]SPGHR; 0.35875; 0; ["'OpenMS:Target-decoy PSM q-value': 0.040626999360205",'Posterior error probability: 0.35875']; UM_F_50cm_2019_0428; 2193

  • sdrf.tsv

sdrf.tsv is the sample metadata file (SDRF) that quantms uses to set up the database search.

Example:

Column                                 Example value
source name                            PXD014414-Sample-1
characteristics[organism]              Homo sapiens
characteristics[organism part]         mammary gland
characteristics[developmental stage]   adult
characteristics[disease]               metaplastic breast carcinomas
characteristics[histologic subtype]    Chondroid
characteristics[sex]                   female
characteristics[age]                   43Y
characteristics[cell type]             not applicable
characteristics[cell line]             not applicable
characteristics[biological replicate]  1
characteristics[individual]            C1
Material Type                          tissue
assay name                             run 1
Technology Type                        proteomic profiling by mass spectrometry
comment[label]                         TMT126
comment[data file]                     UM_F_50cm_2019_0414.raw
comment[file uri]                      ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2020/04/PXD014414/UM_F_50cm_2019_0414.raw
comment[technical replicate]           1
comment[fraction identifier]           1
comment[cleavage agent details]        AC=MS:1001251;NT=Trypsin
comment[instrument]                    NT=Orbitrap Fusion;AC=MS:1002416
comment[modification parameters]       NT=Oxidation;MT=Variable;TA=M;AC=UNIMOD:35
comment[modification parameters]       NT=Acetyl;AC=UNIMOD:1;PP=Protein N-term;MT=variable
comment[modification parameters]       NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4
comment[modification parameters]       NT=TMT6plex;AC=UNIMOD:737;TA=K;MT=Fixed
comment[modification parameters]       NT=TMT6plex;AC=UNIMOD:737;PP=Protein N-term;MT=Variable
comment[modification parameters]       NT=TMT6plex;AC=UNIMOD:737;TA=S;MT=Variable
comment[dissociation method]           NT=HCD;AC=PRIDE:0000590
comment[collision energy]              55 NCE
comment[precursor mass tolerance]      20 ppm
comment[fragment mass tolerance]       0.6 Da
factor value[disease]                  metaplastic breast carcinomas

  • If you want to see a full example, please click here.
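As a small illustration of how the files listed above hang together, the following sketch reads a project.json and groups the entries of quantms_files by category. It uses only the Python standard library; the helper name files_by_category is hypothetical and not part of quantms.io, and the embedded JSON is a trimmed copy of the example above.

```python
import json

# Trimmed copy of the project.json example shown above.
PROJECT_JSON = """
{
  "project_accession": "PXD014414",
  "quantms_files": [
    {"feature_file": "PXD014414-943a8f02-0527-4528-b1a3-b96de99ebe75.featrue.parquet"},
    {"sdrf_file": "PXD014414-f05eca35-9381-40d8-a7da-2fe57745afaf.sdrf.tsv"},
    {"psm_file": "PXD014414-f4fb88f6-0a45-451d-a8a6-b6d58fb83670.psm.parquet"},
    {"differential_file": "PXD014414-3026e5d5-fb0e-45e9-a4f0-c97d86536716.differential.tsv"}
  ]
}
"""


def files_by_category(project: dict) -> dict:
    """Group the single-key entries of quantms_files by their category."""
    grouped = {}
    for entry in project.get("quantms_files", []):
        for category, file_name in entry.items():
            grouped.setdefault(category, []).append(file_name)
    return grouped


project = json.loads(PROJECT_JSON)
files = files_by_category(project)
print(files["psm_file"])
```

Each category maps to a list because a category may hold more than one file.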

Project converter tool

If your project comes from the PRIDE database, you can use its PRIDE accession to generate a project.json that contains descriptive information about the entire project. Alternatively, you can supply a custom project accession to generate an entirely new project.

  • If you want to know more, please read Project file.

  • If your project is not from PRIDE, you can skip this step.

quantmsio_cli generate-pride-project-json \
   --project_accession PXD014414 \
   --sdrf PXD014414.sdrf.tsv \
   --output_folder result
  • Optional parameter

--quantms_version   Quantms version
--delete_existing   Delete existing files in the output folder(default False)

DE converter tool

The differential expression file stores the proteins that are differentially expressed between two contrasts, with the corresponding fold changes and p-values. It can be visualized with tools such as a volcano plot and integrated with other omics data resources.

  • If you have generated project.json, you can use this parameter --project_file to add project information for DE files.

  • If you want to know more, please read Differential expression format.

Example:

quantmsio_cli convert-de \
   --msstats_file PXD014414.sdrf_openms_design_msstats_in_comparisons.csv \
   --sdrf_file PXD014414.sdrf.tsv \
   --output_folder result
  • Optional parameter

--project_file   Descriptive information from project.json(project json path)
--fdr_threshold   FDR threshold to use to filter the results(default 0.05)
--output_prefix_file   Prefix of the differential expression file(like {prefix}-{uu.id}-{extension})
--delete_existing   Delete existing files in the output folder(default True)
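To make the effect of an FDR threshold concrete, here is a minimal sketch that keeps only the rows of a differential expression table whose adjusted p-value is at or below the threshold. It is not the converter's implementation: the column name adj.pvalue is taken from the example table, the second data row is invented for illustration, the input is assumed to be a plain TSV with a single header line, and whether the tool uses "<=" or "<" is an assumption.

```python
import csv
import io

# The first row is the example row from this page; the second row is hypothetical.
DE_TSV = (
    "protein\tlabel\tlog2fc\tse\tdf\tpvalue\tadj.pvalue\tissue\n"
    "LV861_HUMAN\tnormal-squamous cell carcinoma\t0.60\t0.87\t8\t0.51\t0.62\tNA\n"
    "PROT2_HUMAN\tnormal-squamous cell carcinoma\t2.10\t0.30\t8\t0.0004\t0.01\tNA\n"
)


def significant_rows(handle, fdr_threshold=0.05):
    """Return the rows whose adjusted p-value does not exceed the threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row["adj.pvalue"]) <= fdr_threshold]


hits = significant_rows(io.StringIO(DE_TSV))
print([row["protein"] for row in hits])
```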

AE converter tool

The absolute expression (AE) format is designed for visualizing AE results based on iBAQ values; it stores the AE result of each protein in each sample.

  • If you have generated project.json, you can use this parameter --project_file to add project information for AE files.

  • If you want to know more about iBAQ, please read ibaqpy.

  • If you want to know more, please read Absolute expression format.

Example:

quantmsio_cli convert-ae \
   --ibaq_file PXD004452-ibaq.csv \
   --sdrf_file PXD014414.sdrf.tsv \
   --output_folder result
  • Optional parameter

--project_file   Descriptive information from project.json(project json path)
--output_prefix_file    Prefix of the absolute expression file(like {prefix}-{uu.id}-{extension})
--delete_existing    Delete existing files in the output folder(default True)
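The absolute expression table carries an ibaq and a ribaq column per protein and sample. One common definition of relative iBAQ divides each protein's iBAQ value by the total iBAQ of its sample; the sketch below applies that definition to a few invented rows. Whether the ribaq column produced by ibaqpy follows exactly this normalization is an assumption here, and the helper name add_relative_ibaq is hypothetical.

```python
from collections import defaultdict

# Hypothetical rows; only the column names follow the example table.
rows = [
    {"protein": "P1", "sample_accession": "Sample-1", "ibaq": 30.0},
    {"protein": "P2", "sample_accession": "Sample-1", "ibaq": 10.0},
    {"protein": "P1", "sample_accession": "Sample-2", "ibaq": 5.0},
]


def add_relative_ibaq(rows):
    """Return copies of the rows with ribaq = ibaq / total ibaq of the sample."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["sample_accession"]] += row["ibaq"]
    return [
        dict(row, ribaq=row["ibaq"] / totals[row["sample_accession"]])
        for row in rows
    ]


for row in add_relative_ibaq(rows):
    print(row["sample_accession"], row["protein"], row["ribaq"])
```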

Feature converter tool

The peptide table covers details at the peptide level, including peptide intensity. Most of its content comes from the peptide section of the mzTab file. It stores peptide intensities for downstream analysis and integration.

In some projects the mzTab files can be very large, so the tool comes in both a diskcache and a no-diskcache version. Choose the version that suits your server configuration.

Example:

quantmsio_cli convert-feature \
   --sdrf_file PXD014414.sdrf.tsv \
   --msstats_file PXD014414.sdrf_openms_design_msstats_in.csv \
   --mztab_file PXD014414.sdrf_openms_design_openms.mzTab \
   --output_folder result
  • Optional parameter

--use_cache    Whether to use diskcache instead of memory(default True)
--output_prefix_file   The prefix of the result file(like {prefix}-{uu.id}-{extension})
--consensusxml_file   The consensusXML file used to retrieve the mz/rt(default None)
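The output file names on this page combine a prefix, a UUID and an extension. The sketch below builds and splits such names so that scripts can recover the prefix and file type. It follows the example file names shown here, which put a dot before the extension, rather than the literal {prefix}-{uu.id}-{extension} wording; build_name and split_name are hypothetical helpers, not quantms.io functions.

```python
import re
import uuid

# prefix, "-", canonical lowercase UUID (8-4-4-4-12 hex digits), extension starting with "."
FILE_NAME = re.compile(
    r"^(?P<prefix>.+)-"
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
    r"(?P<extension>\..+)$"
)


def build_name(prefix: str, extension: str) -> str:
    """Create a name such as PXD014414-<uuid>.psm.parquet."""
    return f"{prefix}-{uuid.uuid4()}{extension}"


def split_name(file_name: str) -> tuple:
    """Split a generated file name into (prefix, uuid, extension)."""
    match = FILE_NAME.match(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name}")
    return match.group("prefix"), match.group("uuid"), match.group("extension")


print(split_name("PXD014414-f4fb88f6-0a45-451d-a8a6-b6d58fb83670.psm.parquet"))
```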

Psm converter tool

The PSM table covers details at the PSM level, including spectrum m/z and intensity values, for AI/ML training and other use cases.

Example:

quantmsio_cli convert-psm \
   --mztab_file PXD014414.sdrf_openms_design_openms.mzTab \
   --output_folder result
  • Optional parameter

--use_cache    Whether to use diskcache instead of memory(default True)
--output_prefix_file   The prefix of the result file(like {prefix}-{uu.id}-{extension})
--verbose  Output debug information(default True)

DiaNN convert

For DiaNN, the command supports generating feature.parquet and psm.parquet directly from diann_report files.

Example:

quantmsio_cli convert-diann \
   --report_path diann_report.tsv \
   --design_file PXD037682.sdrf_openms_design.tsv \
   --qvalue_threshold 0.05 \
   --mzml_info_folder mzml \
   --sdrf_path PXD037682.sdrf.tsv \
   --output_folder result \
   --output_prefix_file PXD037682
  • Optional parameter

--duckdb_max_memory   The maximum amount of memory allocated by the DuckDB engine (e.g 4GB)
--duckdb_threads  The number of threads for the DuckDB engine (e.g 4)
--file_num The number of files being processed at the same time (default 100)

Inject some messages for DiaNN

For DiaNN, some field information is not available and needs to be filled with other commands.

  • bset-psm-scan-number

Example:

quantmsio_cli inject-bset-psm-scan-number \
   --diann_psm_path PXD010154-f75fbb29-4419-455f-a011-e4f776bcf73b.psm.parquet \
   --diann_feature_path PXD010154_map_protein_accession-88d63fca-3ae6-4eab-9262-6e7a68184432.feature.parquet \
   --output_path PXD010154.feature.parquet
  • start-and-end-position

Example:

quantmsio_cli inject-start-and-end-from-fasta \
   --parquet_path PXD010154_map_protein_accession-88d63fca-3ae6-4eab-9262-6e7a68184432.feature.parquet \
   --fasta_path Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta \
   --label feature \
   --output_path PXD010154.feature.parquet
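Filling in start and end positions from a FASTA file amounts to locating each peptide inside its protein sequence. The sketch below shows that idea with 1-based, inclusive coordinates, which matches the examples on this page (a six-residue peptide spanning 118 to 123). It is an illustration only: read_fasta and peptide_positions are hypothetical helpers, the toy protein is invented, and the real command may treat ambiguous residues or decoys differently.

```python
def read_fasta(lines):
    """Return {header: sequence} for FASTA text given as an iterable of lines."""
    records, header = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None and line:
            records[header] += line
    return records


def peptide_positions(peptide, protein_sequence):
    """Return 1-based inclusive (start, end) pairs for every occurrence."""
    positions = []
    offset = protein_sequence.find(peptide)
    while offset != -1:
        positions.append((offset + 1, offset + len(peptide)))
        offset = protein_sequence.find(peptide, offset + 1)
    return positions


# Invented toy protein that contains the peptide twice.
FASTA_TEXT = ">sp|TOY1|TOY1_HUMAN toy protein\nMKSSPGHRAA\nSSPGHR\n"
proteins = read_fasta(FASTA_TEXT.splitlines())
sequence = proteins["sp|TOY1|TOY1_HUMAN toy protein"]
print(peptide_positions("SSPGHR", sequence))
```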

Compare psm.parquet

This tool is used to compare peptide information in result files obtained by different search engines.

  • --tags or -t are used to specify the tags of the PSM table.

Example:

quantmsio_cli compare-set-psms \
   -p PXD014414-comet.parquet \
   -p PXD014414-sage.parquet \
   -p PXD014414-msgf.parquet \
   -t comet \
   -t sage \
   -t msgf
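Conceptually, comparing search engines comes down to set operations on the peptides each engine identified, keyed by the tag given with -t. The following sketch shows one way to compute the peptides shared by all engines and those unique to each one. The peptide lists are invented apart from the two sequences used on this page, and compare_peptide_sets is a hypothetical helper; the real command's output is not reproduced here.

```python
def compare_peptide_sets(peptides_by_tag):
    """Return (peptides shared by all tags, {tag: peptides unique to that tag})."""
    sets = {tag: set(peptides) for tag, peptides in peptides_by_tag.items()}
    shared = set.intersection(*sets.values())
    unique = {}
    for tag, peptides in sets.items():
        others = set()
        for other_tag, other_peptides in sets.items():
            if other_tag != tag:
                others |= other_peptides
        unique[tag] = peptides - others
    return shared, unique


# Hypothetical identifications per search engine tag.
identified = {
    "comet": ["SSPGHR", "ASPDWGYDDK", "PEPTIDEK"],
    "sage": ["SSPGHR", "ASPDWGYDDK"],
    "msgf": ["SSPGHR", "ELVISK"],
}
shared, unique = compare_peptide_sets(identified)
print(sorted(shared), {tag: sorted(p) for tag, p in unique.items()})
```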

Generate spectra message

generate_spectra_message supports psm and feature. Its output can be used directly for spectral clustering.

  • --label contains two options: psm and feature.

  • --partition contains two options: charge and reference_file_name.

Because the result file can be very large, you can specify --partition to split it.

Example:

quantmsio_cli map-spectrum-message-to-parquet \
   --parquet_path PXD014414-f4fb88f6-0a45-451d-a8a6-b6d58fb83670.psm.parquet \
   --mzml_directory mzmls \
   --output_path psm/PXD014414.parquet \
   --label psm \
   --file_num 10 \
   --partition charge
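Splitting a result by charge or by reference_file_name is a plain group-by on one column. The sketch below groups a handful of invented PSM records by the chosen key; it only illustrates the idea of partitioning and says nothing about how the command names or writes the resulting files. The helper partition_records is hypothetical.

```python
from collections import defaultdict

ALLOWED_PARTITIONS = ("charge", "reference_file_name")


def partition_records(records, partition):
    """Group records by the value of the chosen partition column."""
    if partition not in ALLOWED_PARTITIONS:
        raise ValueError(f"partition must be one of {ALLOWED_PARTITIONS}")
    groups = defaultdict(list)
    for record in records:
        groups[record[partition]].append(record)
    return dict(groups)


# Hypothetical PSM records with only the columns needed here.
records = [
    {"sequence": "SSPGHR", "charge": 2, "reference_file_name": "UM_F_50cm_2019_0428"},
    {"sequence": "ASPDWGYDDK", "charge": 2, "reference_file_name": "UM_F_50cm_2019_0414"},
    {"sequence": "PEPTIDEK", "charge": 3, "reference_file_name": "UM_F_50cm_2019_0414"},
]
by_charge = partition_records(records, "charge")
print({charge: len(group) for charge, group in by_charge.items()})
```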

Generate gene message

generate_gene_message supports psm and feature.

  • --label contains two options: psm and feature.

  • --map_parameter contains two options: map_protein_name or map_protein_accession.

Example:

quantmsio_cli map-gene-msg-to-parquet \
   --parquet_path PXD000672-0beee055-ae78-4d97-b6ac-1f191e91bdd4.featrue.parquet \
   --fasta_path Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta \
   --output_path PXD000672-gene.parquet \
   --label feature \
   --map_parameter map_protein_name
  • Optional parameter

--species species type(default human)
  • species

Common name  Genus name
human        Homo sapiens
mouse        Mus musculus
rat          Rattus norvegicus
fruitfly     Drosophila melanogaster
nematode     Caenorhabditis elegans
zebrafish    Danio rerio
thale-cress  Arabidopsis thaliana
frog         Xenopus tropicalis
pig          Sus scrofa
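Gene names can be read from UniProt-style FASTA headers, where the gene appears after GN=. The sketch below builds a lookup from either the protein name or the protein accession to the gene name, mirroring the two --map_parameter options. This reading of the options is an assumption, the helpers parse_uniprot_header and build_gene_map are hypothetical, and the command itself may obtain gene information differently.

```python
import re


def parse_uniprot_header(header):
    """Split a header such as '>sp|P29692|EF1D_HUMAN ... GN=EEF1D ...'."""
    _database, accession, rest = header.lstrip(">").split("|", 2)
    name = rest.split(None, 1)[0]
    gene = re.search(r"\bGN=(\S+)", rest)
    return {
        "accession": accession,
        "name": name,
        "gene": gene.group(1) if gene else None,
    }


def build_gene_map(headers, map_parameter="map_protein_name"):
    """Map protein names or accessions to gene names, skipping entries without GN=."""
    key = {"map_protein_name": "name", "map_protein_accession": "accession"}[map_parameter]
    mapping = {}
    for header in headers:
        entry = parse_uniprot_header(header)
        if entry["gene"] is not None:
            mapping[entry[key]] = entry["gene"]
    return mapping


headers = [
    ">sp|P29692|EF1D_HUMAN Elongation factor 1-delta OS=Homo sapiens OX=9606 GN=EEF1D PE=1 SV=5",
    ">sp|P00915|CAH1_HUMAN Carbonic anhydrase 1 OS=Homo sapiens OX=9606 GN=CA1 PE=1 SV=2",
]
print(build_gene_map(headers, "map_protein_accession"))
```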

Map protein accessions

get_unanimous_name supports parquet and tsv. For parquet, --map_parameter has two options (map_protein_name or map_protein_accession), and --label controls whether the file is treated as PSM or feature.

  • parquet

  • --label contains two options: psm and feature

Example:

quantmsio_cli labels convert-accession \
   --parquet_path PXD014414-f4fb88f6-0a45-451d-a8a6-b6d58fb83670.psm.parquet \
   --fasta "Reference fasta database" \
   --output_path psm/PXD014414.psm.parquet \
   --map_parameter map_protein_name \
   --label psm
  • tsv

Example:

quantmsio_cli labels get-unanimous-for-tsv \
   --path PXD014414-c2a52d63-ea64-4a64-b241-f819a3157b77.differential.tsv \
   --fasta "Reference fasta database" \
   --output_path psm/PXD014414.de.tsv \
   --map_parameter map_protein_name
   --map_parameter map_protein_name

Compare two parquet files

This tool is used to compare the feature.parquet files generated by the two versions of the feature converter (diskcache or no-diskcache).

Example:

quantmsio_cli compare-parquet \
   --parquet_path_one res_lfq2_discache.parquet \
   --parquet_path_two res_lfq2_no_cache.parquet \
   --report_path report.txt

Generate report about files

This tool is used to generate a report about the whole project.

Example:

quantmsio_cli generate-project-report \
   --project_folder PXD014414

Register file

This tool is used to register a file in project.json. If your project comes from the PRIDE database, you can use this command to add file information to project.json.

  • The parameter --category has four options: feature_file, psm_file, differential_file, absolute_file. You can add any of these file types.

  • If the parameter --replace_existing is enabled, the old file is removed and this one is added. Otherwise, a category can hold a list of files.

Example:

quantmsio_cli attach-file \
   --project_file PXD014414/project.json \
   --attach_file PXD014414-943a8f02-0527-4528-b1a3-b96de99ebe75.featrue.parquet \
   --category feature_file \
   --replace_existing
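The behaviour of --replace_existing described above can be pictured as an update of the quantms_files list in project.json. The sketch below is one possible reading, not the tool's code: it assumes that every registered file is stored as its own single-key entry, as in the project.json example on this page, and attach_file is a hypothetical helper.

```python
ALLOWED_CATEGORIES = {"feature_file", "psm_file", "differential_file", "absolute_file"}


def attach_file(project, category, file_name, replace_existing=False):
    """Register file_name under category in project['quantms_files']."""
    if category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    files = project.setdefault("quantms_files", [])
    if replace_existing:
        # Drop every entry already registered for this category.
        files[:] = [entry for entry in files if category not in entry]
    files.append({category: file_name})
    return project


project = {"project_accession": "PXD014414", "quantms_files": []}
attach_file(project, "feature_file", "old.feature.parquet")
attach_file(project, "feature_file", "second.feature.parquet")
print(len(project["quantms_files"]))  # both files are kept without replace_existing
attach_file(project, "feature_file", "new.feature.parquet", replace_existing=True)
print(project["quantms_files"])
```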

Convert file to json

This tool is used to convert a file to JSON.

  • parquet

  • --data_type contains two options: psm and feature

Example:

quantmsio_cli convert-parquet-json \
   --data_type feature \
   --parquet_path PXD014414-943a8f02-0527-4528-b1a3-b96de99ebe75.featrue.parquet \
   --json_path PXD014414.featrue.json
  • tsv

Example:

quantmsio_cli json convert-tsv-to-json \
   --file PXD010154-51b34353-227f-4d38-a181-6d42824de9f7.absolute.tsv \
   --json_path PXD010154.ae.json
  • sdrf

Example:

quantmsio_cli json convert-sdrf-to-json \
   --file MSV000079033-Blood-Plasma-iTRAQ.sdrf.tsv \
   --json_path MSV000079033.sdrf.json
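For readers who want to see what a TSV-to-JSON conversion involves, here is a minimal standard-library sketch that turns each table row into a JSON object keyed by the column names. The JSON layout written by the quantms.io commands is not documented on this page, so the one-object-per-row layout is an assumption, and tsv_to_json is a hypothetical helper. All values stay strings in this sketch.

```python
import csv
import io
import json


def tsv_to_json(handle):
    """Serialize a tab-separated table as a JSON array with one object per row."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    return json.dumps(rows, indent=2)


# Header and row taken from the absolute expression example on this page.
AE_TSV = (
    "protein\tsample_accession\tcondition\tibaq\tribaq\n"
    "LV861_HUMAN\tSample-1\theart\t1234.1\t12.34\n"
)
print(tsv_to_json(io.StringIO(AE_TSV)))
```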

Statistics

This tool is used for statistics.

Example:

quantmsio_cli project-ae-statistics \
   --absolute_path PXD010154-51b34353-227f-4d38-a181-6d42824de9f7.absolute.tsv \
   --parquet_path PXD010154-51b34353-227f-4d38-a181-6d42824de9f7.featrue.parquet \
   --save_path PXD014414.statistic.txt
quantmsio_cli parquet-psm-statistics \
   --parquet_path PXD010154-51b34353-227f-4d38-a181-6d42824de9f7.psm.parquet \
   --save_path PXD014414.statistic.txt

Plots

This tool is used for visualization.

  • plot-psm-peptides

quantmsio_cli plot plot-psm-peptides \
   --psm_parquet_path PXD010154-51b34353-227f-4d38-a181-6d42824de9f7.psm.parquet \
   --sdrf_path PXD010154.sdrf.tsv \
   --save_path PXD014414_psm_peptides.svg
  • plot-ibaq-distribution

quantmsio_cli plot plot-ibaq-distribution \
   --ibaq_path PXD010154-51b34353-227f-4d38-a181-6d42824de9f7.ibaq.tsv \
   --select_column IbaqLog \
   --save_path PXD014414_psm_peptides.svg
  • plot-kde-intensity-distribution

quantmsio_cli plot plot-kde-intensity-distribution \
   --feature_path PXD010154-51b34353-227f-4d38-a181-6d42824de9f7.featrue.parquet \
   --num_samples 10 \
   --save_path PXD014414_psm_peptides.svg
  • plot-bar-peptide-distribution

quantmsio_cli plot plot-bar-peptide-distribution \
   --feature_path PXD010154-51b34353-227f-4d38-a181-6d42824de9f7.featrue.parquet \
   --num_samples 10 \
   --save_path PXD014414_psm_peptides.svg
  • plot-box-intensity-distribution

quantmsio_cli plot plot-box-intensity-distribution \
   --feature_path PXD010154-51b34353-227f-4d38-a181-6d42824de9f7.featrue.parquet \
   --num_samples 10 \
   --save_path PXD014414_psm_peptides.svg