Feature/speedup parquet export #126

jcharkow · 2024-10-09T17:28:10Z

Re-implementation of PR #111

This uses duckdb instead of pandas for the parquet export. Memory management and SQL queries are handled by duckdb, which is faster and uses less memory.

Add a new script that exports to the parquet file format. Parquet can be easier to work with because fetching results does not require SQL queries.

Currently not many customizations are supported; only the input and output files can be specified.

Add boolean masks for feature, precursor and peptide. This allows for easier filtering of the parquet file.

Store the osw version and the pyprophet weight table as metadata in pandas JSON format.

Make transition output optional and clean up the code.

Remove IM specific information from export tsv.

Ensure the precursor mask has the same number of entries as the Precursor table in the .osw file. Pick the top-ranked (rank 1) feature for the precursor mask if possible. This is not always possible: a precursor may have no features, or SCORE_MS2.RANK may be null for its features. Also add storage for the XGB model.
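The selection rule described above can be sketched with plain SQL. This is only an illustration using the standard-library `sqlite3` module and a heavily reduced, assumed schema (three tables with just the columns needed); it also assumes at most one rank 1 feature per precursor. A `LEFT JOIN` keeps one row per precursor, with `NULL` where no rank 1 feature exists.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE PRECURSOR (ID INTEGER PRIMARY KEY);
CREATE TABLE FEATURE (ID INTEGER PRIMARY KEY, PRECURSOR_ID INTEGER);
CREATE TABLE SCORE_MS2 (FEATURE_ID INTEGER, RANK INTEGER);
INSERT INTO PRECURSOR VALUES (1), (2), (3);
-- precursor 1: two ranked features; precursor 2: one unranked feature; precursor 3: no features
INSERT INTO FEATURE VALUES (100, 1), (101, 1), (200, 2);
INSERT INTO SCORE_MS2 VALUES (100, 2), (101, 1), (200, NULL);
""")

# One output row per precursor: the rank 1 feature if there is one, otherwise NULL.
rows = con.execute("""
SELECT p.ID, top.FEATURE_ID
FROM PRECURSOR AS p
LEFT JOIN (
    SELECT f.PRECURSOR_ID, f.ID AS FEATURE_ID
    FROM FEATURE AS f
    JOIN SCORE_MS2 AS s ON s.FEATURE_ID = f.ID
    WHERE s.RANK = 1
) AS top ON top.PRECURSOR_ID = p.ID
ORDER BY p.ID
""").fetchall()

n_precursors = con.execute("SELECT COUNT(*) FROM PRECURSOR").fetchone()[0]
print(rows, n_precursors)
```

The result has exactly as many rows as the Precursor table, which is the invariant the mask needs.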

Remove the try/except clause and the argparse code for calling the script directly.

XGB model metadata version metadata

FEATURE_ID values in the tests are updated to long integers, matching what OpenSwath produces.

Right now it is not memory efficient to append the metadata and bitmasks. This has to be fixed later; for now the functionality is simply removed.

Single file multithreading uses a lot less memory

Allow writing temporary files in batches for all cases (not just the single-file case).
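The general idea of batched writing can be sketched as follows. This is not the PR's implementation: it uses the standard-library `sqlite3` and `csv` modules, with CSV files standing in for the temporary parquet files, and `write_in_batches`, the `FEATURE` table, and the file naming are all hypothetical.

```python
import csv
import os
import sqlite3
import tempfile


def write_in_batches(con, query, out_dir, batch_size):
    """Write the result of `query` as numbered temporary files of at most `batch_size` rows."""
    cur = con.execute(query)
    header = [col[0] for col in cur.description]
    paths = []
    while True:
        batch = cur.fetchmany(batch_size)  # only one batch is held in memory at a time
        if not batch:
            break
        path = os.path.join(out_dir, f"batch_{len(paths):04d}.csv")
        with open(path, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)
            writer.writerows(batch)
        paths.append(path)
    return paths


# Assumption: a small in-memory table stands in for the real input.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE FEATURE (ID INTEGER, EXP_RT REAL)")
con.executemany("INSERT INTO FEATURE VALUES (?, ?)", [(i, i * 1.5) for i in range(10)])

batch_paths = write_in_batches(
    con, "SELECT * FROM FEATURE ORDER BY ID", tempfile.mkdtemp(), batch_size=4
)
print(len(batch_paths))
```

Peak memory is bounded by `batch_size` rather than by the size of the full result, which is the motivation for batching in every case.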

score_peptide and score_protein tables missing

Fix improper joining occurring on transitions that do not have features.

add files and scripts used to create dummy .osw files used for parquet testing

experiment wide context bug fix

For some reason there are changes to the export and data_handling files which are not used. Revoke these changes.

jcharkow · 2024-10-09T18:09:54Z

Currently, this method only supports a combined output file (no split based on runs) and no IPF

The --only_features option exports only precursors that contain a feature. Ensure that previous tests still work.
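One way such a switch could work, shown purely as a sketch: toggle between an `INNER JOIN` (precursors without features are dropped) and a `LEFT JOIN` (they are kept with `NULL` feature columns). The `precursor_feature_query` function and the two-table schema are assumptions for illustration; the source does not show how the option is implemented.

```python
import sqlite3


def precursor_feature_query(only_features):
    """Build the precursor/feature query; an INNER JOIN drops precursors without features."""
    join = "INNER JOIN" if only_features else "LEFT JOIN"
    return (
        "SELECT p.ID AS PRECURSOR_ID, f.ID AS FEATURE_ID "
        "FROM PRECURSOR AS p "
        f"{join} FEATURE AS f ON f.PRECURSOR_ID = p.ID "
        "ORDER BY p.ID, f.ID"
    )


con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE PRECURSOR (ID INTEGER PRIMARY KEY);
CREATE TABLE FEATURE (ID INTEGER PRIMARY KEY, PRECURSOR_ID INTEGER);
INSERT INTO PRECURSOR VALUES (1), (2);
INSERT INTO FEATURE VALUES (100, 1);  -- precursor 2 has no feature
""")

all_rows = con.execute(precursor_feature_query(only_features=False)).fetchall()
feature_rows = con.execute(precursor_feature_query(only_features=True)).fetchall()
print(all_rows, feature_rows)
```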

jcharkow · 2024-10-09T18:36:01Z

@singjc can you review this when you get a chance?

grosenberger · 2024-10-09T18:46:46Z

Looks great, thank you! Please feel free to merge when you think that it is ready.

singjc

Looks great, thanks! Just had some minor comments regarding the FEATURE_{table} score names, if some of these could be retrieved dynamically for cases when new scores are added or old ones removed.

There are some click args that are no longer used/needed. Should probably remove those.

pyprophet/export_parquet.py

pyprophet/main.py

so if more var columns are added in the future they will be parsed automatically
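A minimal sketch of retrieving the score columns dynamically, assuming the scores are the columns whose names start with `VAR_`. It uses the standard-library `sqlite3` module and SQLite's `PRAGMA table_info`; the `get_var_columns` helper, the example table, and the `FEATURE_MS2_` alias prefix are illustrative and not taken from the PR.

```python
import sqlite3


def get_var_columns(con, table):
    """Return the names of all columns in `table` that start with VAR_."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    # `table` is interpolated directly, so it must come from trusted code, not user input.
    info = con.execute(f"PRAGMA table_info({table})").fetchall()
    return [row[1] for row in info if row[1].upper().startswith("VAR_")]


# Assumption: a reduced FEATURE_MS2 table with two example score columns.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE FEATURE_MS2 ("
    "FEATURE_ID INTEGER, AREA_INTENSITY REAL, "
    "VAR_XCORR_SHAPE REAL, VAR_LOG_SN_SCORE REAL)"
)

var_cols = get_var_columns(con, "FEATURE_MS2")
# Build the SELECT list from whatever score columns exist in this file.
select_list = ", ".join(f"FEATURE_MS2.{c} AS FEATURE_MS2_{c}" for c in var_cols)
print(var_cols)
print(select_list)
```

With this pattern the export query does not need to change when scores are added or removed, since the column list is read from the file itself.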

jcharkow · 2024-10-09T21:03:07Z

@singjc suggestions should be addressed now

singjc

Thanks for the changes. Will merge now.

jcharkow and others added 30 commits March 23, 2022 12:17

add export to parquet

1a5c25a

add new script which allows exporting to parquet file format. This can be easier to deal with because does not require sql queries to fetch results.

add export_parquet to command line interface

dad88bb

currently not many customizations are supported, just supports the input/output of file.

Add bool masks for feature, precursor and peptide

be3ea9c

this allows for easier filtering of the parquet file.

enable storage for metadata

eec9a31

store the osw version and the pyprophet weight table in pandas json format

[FEATURE] make transition output optional

c530b4f

also clean up the code

add pyarrow to list of requirements

4fb20a9

add custom IM columns to export

fae0135

[FEATURE] add transition indices to speed export

3f168b9

Merge branch 'master' into feature/export_parquet

4fbed39

Remove IM specific information from export tsv.

[FIX] Move print statement after method description for click to read…

19e03df

… docstring

[UPDATE] print statements to click.echo statements for consistency

4f53ee5

remove debugging code

89f4059

Remove the try/except clause and the argparse code for calling the script directly.

[REFACTOR] chunksize osw reading and multiprocessing

de2a590

[ADD] method timer

45c75cd

[ADD] addition params

a4cd2fa

[FIX] typo

8b01dcb

[ADD] absolute outfile paths for when writing to temporary files for …

4a42efb

…threading

[ADD] check if outfile already exists, raise error

5ce3f2c

[FIX] remove check for if con is file

8259892

[FIX] typo from previous commit, left out the not

648cda6

[FIX] Reestablish connection to input file after multiprocessing case

3fe22eb

[FIX] Explicitly using warnings instead of np attribute warnings

3c7e851

[FIX] bug with writing metadata

95dd95c

XGB model metadata version metadata

[REFACTOR] Restructure sql queries into small queries

bed416d

Merge branch 'feature/export_parquet' of github.com:Roestlab/pyprophe…

c45c891

…t into feature/export_parquet

[ADD] method for valid filename conversion

c8a1c81

[ADD] flag for separate run parquet files

7d6a53c

[FIX] Restructuring queries and parallel processing

e4f6c37

[FIX] validate specific column datatypes that changes due to NAs

d89e8f9

jcharkow added 14 commits March 23, 2023 15:05

update test for improper joining with datatypes

cffcd62

FEATURE_ID values in the tests are updated to long integers, matching what OpenSwath produces.

[FIX] var_mass columns not exported

5a60eb7

Merge branch 'master' of github.com:PyProphet/pyprophet into feature/…

32e46c5

…export_parquet

[TMPFIX] remove metadata and bitmasks for memory efficiency

d2aee6e

Right now it is not memory efficient to append the metadata and bitmasks. This has to be fixed later; for now the functionality is simply removed.

Merge branch 'master' into feature/export_parquet

cdf3104

[FIX] multithreading now uses a lot less memory

b726a63

Single file multithreading uses a lot less memory

Add function for writing by batches

e337d0c

allow for writing temporary files by batches for all cases (not just single file)

speedup parquet export with duckdb

5ebdbac

[FIX] fix bugs

3c5826b

score_peptide and score_protein tables missing

fix: minor bug fixes

85acf53

fix: featureless transition not appearing properly

67990b4

Fix improper joining occurring on transitions that do not have features.

test: add test files for .osw

8febb99

add files and scripts used to create dummy .osw files used for parquet testing

FIX: bug with experiment wide context

9469854

experiment wide context bug fix

minor cleanup spacing, remove print statements

8cf98c2

jcharkow mentioned this pull request Oct 9, 2024

Parquet Export #111

Closed

jcharkow added 3 commits October 9, 2024 13:32

add duckdb as a dependency

99c55cd

revoke unnecessary changes

f0524d9

For some reason there are changes to the export and data_handling files which are not used. Revoke these changes.

doc: fix doc strings

aa69949

feat: option to only export precursors with features

312e849

The --only_features option exports only precursors that contain a feature. Ensure that previous tests still work.

singjc requested changes Oct 9, 2024

pyprophet/export_parquet.py (outdated, resolved)

pyprophet/export_parquet.py (outdated, resolved)

pyprophet/main.py (outdated, resolved)

jcharkow added 3 commits October 9, 2024 16:58

feature: auto populate var columns

a8bdae6

so if more var columns are added in the future they will be parsed automatically

also auto fetch var in feature_transition table

9792251

remove unneeded arguments

d89a63b

singjc approved these changes Oct 9, 2024

singjc merged commit a5b0984 into PyProphet:master Oct 9, 2024

jcharkow deleted the feature/speedup_parquet_export branch October 10, 2024 14:32
