Feature/speedup parquet export #126
add new script which allows exporting to parquet file format. This can be easier to deal with because it does not require sql queries to fetch results.
currently not many customizations are supported; only the input and output file can be specified.
this allows for easier filtering of the parquet file.
store the osw version and the pyprophet weight table in pandas json format
also clean up the code
Remove IM specific information from export tsv.
Ensure precursor mask has the same number of entries as the Precursor table in the .osw file. Pick the number one ranked feature for the precursor mask if possible. (This is not always possible if a precursor has no features or if SCORE_MS2.RANK is null for its features.) Also add storage for XGB model
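The rank-one selection described above can be sketched with the standard-library `sqlite3` module. This is only an illustration under assumptions: the three tables are a heavily simplified stand-in for an .osw file (single run, only the columns needed here), and the query is not taken from the PR. A `LEFT JOIN` keeps every precursor, so precursors without a rank-one feature still get a row, with `NULL` as the feature.

```python
import sqlite3

# Hypothetical, heavily simplified stand-in for an .osw file (single run).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PRECURSOR (ID INTEGER PRIMARY KEY);
CREATE TABLE FEATURE (ID INTEGER PRIMARY KEY, PRECURSOR_ID INTEGER);
CREATE TABLE SCORE_MS2 (FEATURE_ID INTEGER, RANK INTEGER);
INSERT INTO PRECURSOR VALUES (1), (2), (3);
INSERT INTO FEATURE VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO SCORE_MS2 VALUES (10, 2), (11, 1), (20, NULL);
""")

# One row per precursor: the rank-one feature if it exists, otherwise NULL.
# MIN() with GROUP BY guarantees a single row even if several features tie.
TOP_FEATURE_SQL = """
SELECT p.ID, MIN(top.FEATURE_ID)
FROM PRECURSOR AS p
LEFT JOIN (
    SELECT f.PRECURSOR_ID, f.ID AS FEATURE_ID
    FROM FEATURE AS f
    JOIN SCORE_MS2 AS s ON s.FEATURE_ID = f.ID
    WHERE s.RANK = 1
) AS top ON top.PRECURSOR_ID = p.ID
GROUP BY p.ID
ORDER BY p.ID
"""

rows = conn.execute(TOP_FEATURE_SQL).fetchall()
precursor_count = conn.execute("SELECT COUNT(*) FROM PRECURSOR").fetchone()[0]
```

Here precursor 2 has a feature whose rank is null and precursor 3 has no feature at all, yet `rows` still has one entry per precursor.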
remove try/except clause; remove argparse for calling script directly
XGB model metadata version metadata
…t into feature/export_parquet
FEATURE_ID values in tests are updated to long integers, matching what OpenSwath produces
right now it is not memory efficient to append metadata and bitmasks. Have to fix these later, but for now just remove the functionality.
Single file multithreading uses a lot less memory
allow for writing temporary files by batches for all cases (not just single file)
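The idea of writing temporary files in batches can be sketched as follows. This is a minimal illustration, not the PR's implementation: CSV is used as a stand-in for parquet so the example needs only the standard library, and `export_in_batches`, the table layout, and the batch size are all hypothetical. The point is that `fetchmany` keeps at most one batch of rows in memory at a time.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def export_in_batches(conn, query, out_dir, batch_size=2):
    """Stream query results to numbered temporary files, one file per batch."""
    cursor = conn.execute(query)
    header = [col[0] for col in cursor.description]
    paths = []
    while True:
        rows = cursor.fetchmany(batch_size)  # at most batch_size rows in memory
        if not rows:
            break
        path = Path(out_dir) / f"batch_{len(paths):04d}.csv"
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
        paths.append(path)
    return paths


# Hypothetical toy table with five rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FEATURE (ID INTEGER, PRECURSOR_ID INTEGER)")
conn.executemany("INSERT INTO FEATURE VALUES (?, ?)", [(i, i % 2) for i in range(5)])

tmp_dir = tempfile.mkdtemp()
batch_paths = export_in_batches(
    conn, "SELECT ID, PRECURSOR_ID FROM FEATURE ORDER BY ID", tmp_dir
)
```

With five rows and a batch size of two, three temporary files are written; a final step would then combine them into the single output file.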
score_peptide and score_protein tables missing
fix improper joining occurring on transitions that do not have features
add files and scripts used to create dummy .osw files used for parquet testing
experiment wide context bug fix
for some reason there are changes to the export and data_handling files which are not used. revert these changes
Currently, this method only supports a combined output file (no split based on runs) and no IPF
option --only_features allows for exporting only precursors that contain a feature; ensure that previous tests still work
@singjc can you review this when you get a chance?
Looks great, thank you! Please feel free to merge when you think that it is ready.
Looks great, thanks! Just had some minor comments regarding the FEATURE_{table} score names: could some of these be retrieved dynamically, for cases when new scores are added or old ones removed?
There are some click args that are no longer used/needed. Should probably remove those.
so if more var columns are added in the future they will be parsed automatically
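One way to pick up score columns dynamically is to ask SQLite for the table's columns and keep those with the `VAR_` prefix. The sketch below is an assumption about how this could be done, not the code from the PR: `var_columns` is a hypothetical helper, and the `FEATURE_MS2` table is a toy version with only a few columns.

```python
import sqlite3


def var_columns(conn, table):
    """Return the names of all columns in `table` that start with VAR_."""
    rows = conn.execute(
        "SELECT name FROM pragma_table_info(?)", (table,)
    ).fetchall()
    return [name for (name,) in rows if name.startswith("VAR_")]


# Hypothetical toy table; a real .osw file has many more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE FEATURE_MS2 ("
    "FEATURE_ID INTEGER, AREA_INTENSITY REAL, "
    "VAR_XCORR_SHAPE REAL, VAR_LOG_SN_SCORE REAL)"
)

ms2_scores = var_columns(conn, "FEATURE_MS2")
# Build the SELECT list from whatever VAR_ columns exist in this file.
select_list = ", ".join(f"FEATURE_MS2.{col}" for col in ms2_scores)
```

Because the column list is read from the file itself, a score added or removed upstream changes `select_list` without any edit to the export code.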
@singjc suggestions should be addressed now
Thanks for the changes. Will merge now.
Re-implementation of PR #111
This uses duckdb instead of pandas for the parquet export. Memory management and the SQL queries are handled by duckdb, which is faster and uses less memory.
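As a rough sketch of what a duckdb-based export can look like: duckdb can attach a SQLite file through its `sqlite` extension and write a query result straight to parquet with `COPY ... TO`. The function names, the `osw` alias, and the query passed in are assumptions for illustration only; this is not the script added in the PR. `duckdb` is a third-party package and is imported lazily so that the SQL helper can be used without it.

```python
def build_copy_statement(select_sql: str, out_path: str) -> str:
    """Wrap a SELECT in a DuckDB COPY ... TO statement that writes parquet."""
    escaped = out_path.replace("'", "''")  # escape single quotes in the path
    return f"COPY ({select_sql}) TO '{escaped}' (FORMAT PARQUET)"


def export_osw_to_parquet(osw_path: str, out_path: str, select_sql: str) -> None:
    """Attach a SQLite-based .osw file in DuckDB and export a query to parquet."""
    import duckdb  # third-party; must be installed separately

    con = duckdb.connect()  # in-memory DuckDB database
    try:
        con.execute("INSTALL sqlite")  # may download the extension on first use
        con.execute("LOAD sqlite")
        escaped = osw_path.replace("'", "''")
        con.execute(f"ATTACH '{escaped}' AS osw (TYPE sqlite)")
        # Tables are then addressable as osw.<TABLE>, e.g. osw.PRECURSOR.
        con.execute(build_copy_statement(select_sql, out_path))
    finally:
        con.close()


statement = build_copy_statement("SELECT * FROM osw.PRECURSOR", "out.parquet")
```

A call such as `export_osw_to_parquet("input.osw", "out.parquet", "SELECT * FROM osw.PRECURSOR")` would run the whole export inside duckdb, so no intermediate pandas DataFrame is built.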