-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Continous Integration Tests #129
Conversation
convert to .toml setup for building on github
both are required because of cython, works locally now
I think some of the issues may be related to numpy 2.0. I was looking into updating pyprophet to using numpy 2.0 a month ago or so, but got busy with other work, so had to put it on hold. If you have time, would you be interested in taking this on or working on it together? |
We can work on it together! I will take a look at tests/test_pyprophet_score.py and tests/test_stats.py to start with |
I am having trouble with test_osw_4. It seems to be working fine with xgboost 1.7.3 (even with numpy 2.0) however when I upgrade to xgboost 2.0 it breaks. Any ideas? |
note tests still fail though, possibly because random seed is different?
update the tests to remove old parameters and fix tests that were failing
Looking more into test_osw_4 it seems that it fails due to a change in XGBoost default parameters. From XGBoost 2.0.0 onwards the default tree_method is 'hist' instead of 'exact' which breaks the test. I am not sure if we should hardcode using 'exact' trees instead of the 'hist' tree to preserve the original functionality. I think that hist trees are a lot quicker though. |
Could we just re-run the test to regenerate the reg test outputs with using the latest version of xgboost, and then fix the version to >= 2.0.0? |
The problem is not that the regtest fails but rather that the command crashes. In XGBoost 2.0.0 the classifier outputs np.nan values causing a crash when trying to plot the histogram. |
Does it output other valid values, or is everything np.nan? If it's specifically the plot that's failing, we may need to adjust how that handles the np.nan. |
Upon further investigation, this occurs because of faulty normalization. The outputted XGBoost model does not fail however, I'm assuming because there are very few points, using the hist method results in all of the decoys having the exact same score. Thus when normalization of the decoy distribution is done, the standard deviation is computed as 0, resulting in everything having either positive or negative I suggest either: I am in favor of "B" because it is easier and in real use cases where there are many more features and scores I believe XGBoost |
This is appropriate since mean and standard deviation are always computed in the context of normalization and cannot normalize if the standard deviation is 0
update: |
I updated the snapshots for osw_4 and osw_5 and all tests pass now! I think this is ready to merge? |
this fixes the versions so tests do not fail. Will add dependabot to update the reuqirements file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for setting the CI up! I like the change to pyproject.toml, but kind of don't like that we still have to depend/keep the setup.py for the cython stuff :\
I made some suggestions/comments, should be good to merge after these are addressed.
.github/workflows/ci.yml
Outdated
strategy: | ||
matrix: | ||
os: [ubuntu-latest] | ||
#os: [ubuntu-latest, windows-latest, macos-latest] # remove mac tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to run the CI on windows and mac as well, or does it not work?
"duckdb", | ||
"duckdb-extensions", | ||
"duckdb-extension-sqlite-scanner", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duckdb is currently only used for OSW to parquet exporting right? I'm thinking if we can create a separate dependency version, so that if someone wants to be able to export to parquet, then they can install pyprophet[parquet]
or something? Just so that we reduce the number of dependencies for the main library for just performing regular scoring tsv exporting? What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my initial tests duckdb tends to speed up sqlite statements with many table joins so I was thinking of extending its usage to scoring and tsv exporting as it is minimal changes required to do this.
@@ -184,7 +189,7 @@ def infer_proteins(infile, outfile, context, parametric, pfdr, pi0_lambda, pi0_m | |||
con.close() | |||
|
|||
if context == 'run-specific': | |||
data = data.groupby('run_id').apply(statistics_report, outfile, context, "protein", parametric, pfdr, pi0_lambda, pi0_method, pi0_smooth_df, pi0_smooth_log_pi0, lfdr_truncate, lfdr_monotone, lfdr_transformation, lfdr_adj, lfdr_eps, color_palette).reset_index() | |||
data = data.groupby('run_id').apply(statistics_report, outfile, context, "protein", parametric, pfdr, pi0_lambda, pi0_method, pi0_smooth_df, pi0_smooth_log_pi0, lfdr_truncate, lfdr_monotone, lfdr_transformation, lfdr_adj, lfdr_eps, color_palette) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is reset_index
no longer needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing reset index is required to prevent the error
File "/home/joshua/mambaforge/envs/pyprophet_dev/lib/python3.11/site-packages/pandas/core/frame.py", line 5158, in insert
raise ValueError(f"cannot insert {column}, already exists")
ValueError: cannot insert run_id, already exists
Same as below. Must be a change to pandas groupby functionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah seems like it, some groupby deprecations occured for Pandas v2.2.0
Deprecated the Grouping attributes group_index, result_index, and group_arraylike; these will be removed in a future version of pandas (GH 56148)
If you don't mind, would you be able to test with a version prior to pandas v2.2.0, to see if the old code works with the .reset_index()
, just so we know for sure that is the change.
currently both mac and windows tests are failing so just keep ubuntu
replace semi-supervised learning normalization with sklearn which can handle cases where std = 0
sklearn transofmration does not use degrees of freedom. Keep the normalize_score_by_decoys method but use numpy std() method
…'double' need to cast to np.float_32
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good for the most part. Just had one question/comment about the random_state being set to 42 instead of using the random number generator in the classifiers code? Should be good to merge after that
This adds continuous integration tests (using the existing pyprophet tests) to pyprophet using github actions.
Note this uses pytest instead of nose because nose is no longer being maintained.
Many tests are currently failing so ideally we can address why they are failing before merging.
I also migrated setup to using a .toml file as I was having difficulties with setup.py not having a numpy requirement. Note that because cython is used a setup.py is still required.