Query Parquet
The query module provides the ability to quickly search through Parquet files.
Basic query
Basic query operations list all samples, peptides, proteins, genes, and mzML files in a Parquet file. The results are deduplicated and returned as a list.
from quantmsio.core.query import Parquet
P = Parquet('PXD007683.feature.parquet')
P.get_unique_samples()
"""
['PXD007683-Sample-3',
'PXD007683-Sample-9',
'PXD007683-Sample-4',
'PXD007683-Sample-6',
'PXD007683-Sample-11',
'PXD007683-Sample-7',
'PXD007683-Sample-1',
'PXD007683-Sample-10',
'PXD007683-Sample-8',
'PXD007683-Sample-2',
'PXD007683-Sample-5']
"""
P.get_unique_peptides()
P.get_unique_proteins()
P.get_unique_genes()
P.get_unique_references()
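To make the "deduplicated list" behaviour concrete, here is a minimal pure-Python sketch of the idea. It does not call quantmsio: the `rows` data and the `unique_values` helper are hypothetical, and the real methods may return values in a different order.

```python
def unique_values(rows, column):
    """Return the distinct values of `column`, in first-seen order."""
    seen = {}
    for row in rows:
        # dict keys preserve insertion order, so duplicates are dropped
        # while the first occurrence keeps its place.
        seen.setdefault(row[column], None)
    return list(seen)


# Hypothetical rows standing in for a feature table.
rows = [
    {"sequence": "QPAYVSK", "sample_accession": "PXD007683-Sample-3"},
    {"sequence": "QPAYVSK", "sample_accession": "PXD007683-Sample-9"},
    {"sequence": "QPCPSQYSAIK", "sample_accession": "PXD007683-Sample-3"},
]

samples = unique_values(rows, "sample_accession")
peptides = unique_values(rows, "sequence")
```

Each value appears once in the result, no matter how many rows carry it.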
Specific query
Specific queries look up individual peptides, proteins, samples, or mzML files by value.
The results are returned as a DataFrame.
P.query_peptide('QPAYVSK')
"""
sequence protein_accessions protein_start_positions ...
QPAYVSK [sp|P36016|LHS1_YEAST] [739]
QPAYVSK [sp|P36016|LHS1_YEAST] [739]
QPAYVSK [sp|P36016|LHS1_YEAST] [739]
...
"""
P.query_peptide('QPAYVSK',columns=['protein_start_positions','protein_end_positions'])
P.query_peptides(['QPAYVSK','QPCPSQYSAIK'],columns=None)
"""
sequence protein_accessions protein_start_positions ...
QPAYVSK [sp|P36016|LHS1_YEAST] [739]
QPAYVSK [sp|P36016|LHS1_YEAST] [739]
QPAYVSK [sp|P36016|LHS1_YEAST] [739]
...
QPCPSQYSAIK [sp|O95861|BPNT1_HUMAN] [98]
...
"""
P.query_protein('P36016',columns=None)
P.query_proteins(['P36016','O95861'],columns=None)
"""
sequence protein_accessions protein_start_positions ...
QPAYVSK [sp|P36016|LHS1_YEAST] [739]
QPAYVSK [sp|P36016|LHS1_YEAST] [739]
QPAYVSK [sp|P36016|LHS1_YEAST] [739]
...
QPCPSQYSAIK [sp|O95861|BPNT1_HUMAN] [98]
...
"""
P.get_samples_from_database(['PXD007683-Sample-3','PXD007683-Sample-9'],columns=None)
"""
sequence protein_accessions sample_accession
AAAAAAAAAAAAAAAGAGAGAK [sp|P55011|S12A2_HUMAN] PXD007683-Sample-3
AAAAAAAAAAAAAAAGAGAGAK [sp|P55011|S12A2_HUMAN] PXD007683-Sample-3
AAAAAAAAAK [sp|Q99453|PHX2B_HUMAN] PXD007683-Sample-3
"""
P.get_report_from_database(['a05063','a05059'],columns=None)  # query by mzML reference file name
"""
sequence protein_accessions reference_file_name
AAAAAAALQAK [sp|P36578|RL4_HUMAN] a05063
AAAAAAALQAK [sp|P36578|RL4_HUMAN] a05063
AAAAAAALQAK [sp|P36578|RL4_HUMAN] a05063
"""
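The query calls above share a pattern: match rows against one or more values, then optionally restrict the returned columns. The sketch below illustrates that pattern in plain Python. It is not quantmsio's implementation; `select_rows` and the sample rows are hypothetical, and it assumes that `columns=None` means "keep every column".

```python
def select_rows(rows, column, wanted, columns=None):
    """Keep rows whose `column` value is in `wanted`, optionally projecting columns."""
    wanted = set(wanted)
    matched = [row for row in rows if row[column] in wanted]
    if columns is None:
        # Assumption: None means "return all columns".
        return matched
    return [{name: row[name] for name in columns} for row in matched]


# Hypothetical rows standing in for a feature table.
rows = [
    {"sequence": "QPAYVSK", "protein_start_positions": [739]},
    {"sequence": "QPCPSQYSAIK", "protein_start_positions": [98]},
    {"sequence": "AAAAAAALQAK", "protein_start_positions": [2]},
]

all_columns = select_rows(rows, "sequence", ["QPAYVSK", "QPCPSQYSAIK"])
starts_only = select_rows(
    rows, "sequence", ["QPAYVSK"], columns=["protein_start_positions"]
)
```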
Iterate in batches
The following methods yield results in batches instead of loading everything at once.
for samples, df in P.iter_samples(file_num=10, columns=None):
    # Each batch contains ten samples.
    print(samples, df)

for df in P.iter_chunk(batch_size=500000, columns=None):
    # Each batch contains 500,000 rows.
    print(df)

for refs, df in P.iter_file(file_num=20, columns=None):  # mzML
    # Each batch contains 20 mzML files.
    print(refs, df)
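The grouping behind `file_num` can be pictured as splitting the list of sample (or file) names into fixed-size groups, with a shorter final group when the count does not divide evenly. This is a minimal standard-library sketch of that idea; the `batched` helper is hypothetical and is not part of quantmsio.

```python
from itertools import islice


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Eleven sample names, mirroring the PXD007683 example above.
samples = [f"PXD007683-Sample-{i}" for i in range(1, 12)]
batches = list(batched(samples, 10))
```

With eleven samples and a group size of ten, the loop runs twice: once with ten samples and once with the remaining one.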
Inject message
The following methods add spectrum, protein position, and gene information to a query result.
df = P.get_report_from_database(['a05063','a05059'],columns=None)
df = P.inject_spectrum_msg(df, mzml_directory='./mzml')
fasta = './Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta'
protein_dict = P.get_protein_dict(fasta_path=fasta)
df = P.inject_position_msg(df, protein_dict)
df = P.inject_gene_msg(df,fasta,map_parameter = "map_protein_accession",species = "human")
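As a rough illustration of what a protein dictionary and position lookup might involve, the sketch below parses FASTA text into an accession-to-sequence mapping and finds where a peptide occurs in a protein (1-based, inclusive). This rests on an assumption: the document does not describe how quantmsio computes positions, so `parse_fasta`, `peptide_positions`, and the demo protein `X00001` are all hypothetical.

```python
def parse_fasta(text):
    """Map accession -> sequence for UniProt-style headers (>db|ACCESSION|NAME)."""
    chunks = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()[0]
            parts = header.split("|")
            # Fall back to the whole header token if it is not pipe-delimited.
            accession = parts[1] if len(parts) >= 3 else header
            chunks[accession] = []
        elif accession is not None:
            chunks[accession].append(line)
    return {acc: "".join(parts) for acc, parts in chunks.items()}


def peptide_positions(peptide, protein_sequence):
    """Return (start, end) pairs, 1-based and inclusive, for every occurrence."""
    positions = []
    start = protein_sequence.find(peptide)
    while start != -1:
        positions.append((start + 1, start + len(peptide)))
        start = protein_sequence.find(peptide, start + 1)
    return positions


# Hypothetical FASTA record; the sequence is split over two lines on purpose.
fasta_text = """>sp|X00001|DEMO_HUMAN hypothetical demo protein
MKTAYQPAYVSKLLG
QPAYVSKAA
"""

protein_dict = parse_fasta(fasta_text)
hits = peptide_positions("QPAYVSK", protein_dict["X00001"])
```

A peptide that occurs more than once yields several pairs, which would be consistent with the list-valued `protein_start_positions` column shown in the query output.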