Feature/speedup parquet export #126

Merged: 64 commits, Oct 9, 2024


@jcharkow (Contributor) commented Oct 9, 2024

Re-implementation of PR #111

This uses DuckDB instead of pandas for the parquet export. DuckDB handles both the SQL queries and the memory management, which is faster and uses less memory.

jcharkow and others added 30 commits March 23, 2022 12:17
Add a new script that exports to the parquet file format. Parquet can be
easier to work with because it does not require SQL queries to fetch
results.
Currently few customizations are supported; only the input and output
file can be specified.
this allows for easier filtering of the parquet file.
store the osw version and the pyprophet weight table in pandas json
format
Remove IM specific information from export tsv.
Ensure the precursor mask has the same number of entries as the Precursor
table in the .osw file. Pick the top-ranked feature for the precursor mask
if possible. (This is not always possible: a precursor may have no
features, or SCORE_MS2.RANK may be null for its features.)

Also add storage for XGB model
remove try/except clause

remove argparse for calling script directly
XGB model metadata
version metadata
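The precursor-mask commit above describes selecting one top-ranked feature per precursor while keeping one entry for every row of the Precursor table. A minimal sketch of that row selection follows, using the standard-library `sqlite3` module on a toy database. The simplified schema, the helper names, and the choice to return `None` when no rank-1 feature exists are assumptions; what the mask itself stores is not shown in this conversation.

```python
import sqlite3

# One row per precursor; FEATURE_ID is NULL when no rank-1 feature exists.
# MIN() collapses the case where several features carry RANK = 1.
TOP_FEATURE_SQL = """
SELECT P.ID AS PRECURSOR_ID, MIN(F.ID) AS FEATURE_ID
FROM PRECURSOR AS P
LEFT JOIN (
    SELECT FEATURE.ID, FEATURE.PRECURSOR_ID
    FROM FEATURE
    INNER JOIN SCORE_MS2 ON SCORE_MS2.FEATURE_ID = FEATURE.ID
    WHERE SCORE_MS2.RANK = 1
) AS F ON F.PRECURSOR_ID = P.ID
GROUP BY P.ID
ORDER BY P.ID
"""


def make_toy_osw():
    """Build an in-memory database with a simplified, hypothetical schema."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE PRECURSOR (ID INTEGER PRIMARY KEY);
        CREATE TABLE FEATURE (ID INTEGER PRIMARY KEY, PRECURSOR_ID INTEGER);
        CREATE TABLE SCORE_MS2 (FEATURE_ID INTEGER, RANK INTEGER);
        INSERT INTO PRECURSOR (ID) VALUES (1), (2), (3);
        -- precursor 1 has two features, precursor 2 one, precursor 3 none
        INSERT INTO FEATURE (ID, PRECURSOR_ID) VALUES (10, 1), (11, 1), (20, 2);
        -- feature 20 has a null rank
        INSERT INTO SCORE_MS2 (FEATURE_ID, RANK) VALUES (10, 2), (11, 1), (20, NULL);
    """)
    return con


def top_ranked_features(con):
    """Return (PRECURSOR_ID, FEATURE_ID or None) for every precursor."""
    return con.execute(TOP_FEATURE_SQL).fetchall()
```

Because the query starts from PRECURSOR and uses a LEFT JOIN, the result always has as many rows as the Precursor table, including precursors with no features or only null ranks.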
jcharkow added 14 commits March 23, 2023 15:05
FEATURE_ID values in tests are updated to long integers, matching what
OpenSwath produces
Appending metadata and bitmasks is currently not memory efficient. This
has to be fixed later; for now, just remove the functionality.
Single file multithreading uses a lot less memory
allow for writing temporary files by batches for all cases (not just
single file)
score_peptide and score_protein tables missing
fix improper joining occurring on transitions that do not have features
add files and scripts used to create dummy .osw files used for parquet
testing
experiment wide context bug fix
@jcharkow jcharkow mentioned this pull request Oct 9, 2024
For some reason there are changes to the export and data_handling files
that are not used. Revert these changes.
@jcharkow (Contributor, Author) commented Oct 9, 2024

Currently, this method supports only a combined output file (no splitting by run) and does not support IPF.

The --only_features option exports only precursors that have a
feature
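One way such a flag can work is by switching the join type between the precursor and feature tables. The sketch below is an assumption about the general technique, not the query used in this PR: `precursor_feature_query`, `demo_rows`, and the two-table toy schema are hypothetical, and the standard-library `sqlite3` module stands in for DuckDB to keep the example self-contained.

```python
import sqlite3


def precursor_feature_query(only_features):
    """Build the query; INNER JOIN drops precursors without features."""
    join = "INNER JOIN" if only_features else "LEFT JOIN"
    return (
        "SELECT PRECURSOR.ID AS PRECURSOR_ID, FEATURE.ID AS FEATURE_ID "
        "FROM PRECURSOR "
        f"{join} FEATURE ON FEATURE.PRECURSOR_ID = PRECURSOR.ID "
        "ORDER BY PRECURSOR.ID, FEATURE.ID"
    )


def demo_rows(only_features):
    """Run the query on a toy database: precursor 2 has no feature."""
    con = sqlite3.connect(":memory:")
    try:
        con.executescript("""
            CREATE TABLE PRECURSOR (ID INTEGER PRIMARY KEY);
            CREATE TABLE FEATURE (ID INTEGER PRIMARY KEY, PRECURSOR_ID INTEGER);
            INSERT INTO PRECURSOR (ID) VALUES (1), (2);
            INSERT INTO FEATURE (ID, PRECURSOR_ID) VALUES (10, 1);
        """)
        return con.execute(precursor_feature_query(only_features)).fetchall()
    finally:
        con.close()
```

With the flag set, only precursor 1 is returned; without it, precursor 2 is kept with a null FEATURE_ID.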

ensure that previous tests still work
@jcharkow (Contributor, Author) commented Oct 9, 2024

@singjc can you review this when you get a chance?

@grosenberger (Contributor)

Looks great, thank you! Please feel free to merge when you think that it is ready.

@singjc (Contributor) left a comment

Looks great, thanks! I just had some minor comments regarding the FEATURE_{table} score names: could some of these be retrieved dynamically, for cases when new scores are added or old ones removed?

There are some click args that are no longer used/needed. Should probably remove those.

pyprophet/export_parquet.py (outdated, resolved)
pyprophet/export_parquet.py (outdated, resolved)
pyprophet/main.py (outdated, resolved)
so if more var columns are added in the future they will be parsed
automatically
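Retrieving the score columns dynamically, as requested in the review, can be sketched by asking the database for a table's column names and keeping those that start with `VAR_`. This is an illustration under stated assumptions rather than the code merged here: `var_columns`, `select_list`, the output alias pattern, and the toy `FEATURE_MS2` columns are hypothetical, and the standard-library `sqlite3` module is used so the example runs on its own.

```python
import sqlite3


def var_columns(con, table):
    """Return the table's column names that start with VAR_, in table order."""
    rows = con.execute(
        "SELECT name FROM pragma_table_info(?) ORDER BY cid", (table,)
    ).fetchall()
    return [name for (name,) in rows if name.upper().startswith("VAR_")]


def select_list(table, columns):
    """Build a SELECT list that prefixes each column with its table name."""
    return ", ".join(f"{table}.{col} AS {table}_{col}" for col in columns)


def make_toy_feature_table():
    """Toy FEATURE_MS2 table with two score columns (hypothetical subset)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE FEATURE_MS2 ("
        "FEATURE_ID INTEGER, AREA_INTENSITY REAL, "
        "VAR_XCORR_SHAPE REAL, VAR_LOG_SN_SCORE REAL)"
    )
    return con
```

If a new `VAR_` column is added to the table later, it shows up in the generated SELECT list without any change to the export code.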
@jcharkow (Contributor, Author) commented Oct 9, 2024

@singjc suggestions should be addressed now

@singjc (Contributor) left a comment

Thanks for the changes. Will merge now.

@singjc merged commit a5b0984 into PyProphet:master on Oct 9, 2024
@jcharkow deleted the feature/speedup_parquet_export branch on October 10, 2024 14:32
3 participants