[BUG] Sporadic KBinsDiscretizer pytests fail with quantile strategy #2933
Thanks for opening the issue. Since the errors pop up sporadically, I'll put them here as reference. Fails on:
test_kbinsdiscretizer[cudf-quantile-ordinal-20] – cuml.test.test_preprocessing
test_kbinsdiscretizer[cudf-quantile-onehot-dense-5] – cuml.test.test_preprocessing
@wphicks can look at this after current work on Silhouette score.
The reported ValueError comes from an intermittent failure in the percentile calculation here. Occasionally the final element in the percentile array is NaN, and the error propagates from there. Breaking in at that point with a debugger, we see that numpy's […]
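The invariant that fails here can be expressed as a quick defensive check. This is purely illustrative (the real call sites inside cuML's vendored sklearn code use cupy rather than numpy):

```python
import numpy as np

# Mirror the failing computation: 21 evenly spaced quantiles for 20 bins
quantiles = np.linspace(0, 100, 21)
edges = np.percentile(np.random.rand(100), quantiles)

# The reported ValueError traces back to a NaN appearing in the final
# slot of this array; a guard like this at the call site surfaces it early.
assert not np.any(np.isnan(edges)), "percentile output contains NaN"
```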
Something has changed in the past week in terms of reproducing this. I am still able to reproduce the issue, but it takes much longer to do so, and I have only seen a failure with the cudf input format. I can say with some confidence now that I cannot reproduce this with any other data input format except cudf. I've extracted the sequence of conversions that we perform into a standalone example now and am attempting to reproduce this independently of cuML.
Same, it's really hard to reproduce. I wonder if the disappearance of the […]
I have fairly compelling evidence now that the ValueError is a result of some interaction with pytest's […]
It is worth mentioning that I changed the fixture scope to […]
Documenting some miscellaneous findings here:
My leading hypothesis based on the symptoms is a race condition, but I have not been able to further isolate it. I'm still looking into the specific race conditions reported in […]
I have now confirmed that the […]. I still am not certain about the causes of the other observed errors or even if they still apply. I was only able to reproduce them about one in every 100K runs before, and I haven't seen them in quite some time. I'll look into them just a little bit more and see if I can get a reproducer, but otherwise I'm tempted to remove the xfail once the workaround for the […]
Partial fix available here: #3315. This eliminates the ValueError, but I cannot guarantee that the other sporadic failure cases have been addressed.
Ensure that the 100th quantile value returned by cupy.percentile is the maximum of the input array rather than (possibly) NaN due to cupy/cupy#4451. This eliminates an intermittent failure observed in tests of KBinsDiscretizer, which makes use of cupy.percentile. Note that this includes an alteration of the included sklearn code and should be reverted once the upstream cupy issue is resolved. Resolves failure due to ValueError described in #2933.
Authors:
- William Hicks <whicks@nvidia.com>
Approvers:
- Dante Gama Dessavre
- Victor Lafargue
URL: #3315
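The shape of the workaround described above can be sketched as follows. Note that this is a hedged approximation, not the actual cuML patch: numpy stands in for cupy, and `percentile_with_max_fix` is a hypothetical name invented for this sketch.

```python
import numpy as np

def percentile_with_max_fix(col, quantiles):
    """Compute percentiles, but force the 100th-percentile slot to the
    column maximum, guarding against an upstream bug that could leave
    it NaN. Hypothetical sketch of the #3315 workaround, not cuML code."""
    edges = np.percentile(col, quantiles)
    if quantiles[-1] == 100:
        # Pin the last edge to the true maximum of the input column
        edges[-1] = np.max(col)
    return edges

col = np.array([0.1, 0.2, 0.5, 0.9])
edges = percentile_with_max_fix(col, np.linspace(0, 100, 21))
# edges[-1] is 0.9, the column maximum, by construction
```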
Using Hypothesis, I was able to find either some consistent reproducers of at least one of the problems represented by this issue or else consistent reproducers for a new issue. I'll give the simplest one that I've found so far below:

```python
import cupy as cp
import numpy as np
from cuml.common.input_utils import input_to_cupy_array
from cuml.experimental.preprocessing import KBinsDiscretizer as cuKBD
from sklearn.preprocessing import KBinsDiscretizer as skKBD

magic_number = -3 * 2**25

X_np = np.random.rand(5, 2)
X_np[3, 1] = magic_number
X_np[4, 1] = magic_number
print(X_np)
# [[ 5.21711282e-01  3.01647383e-01]
#  [ 9.73775159e-01  3.78120961e-01]
#  [ 5.92897501e-01  6.84050667e-01]
#  [ 2.97061781e-01 -1.00663296e+08]
#  [ 2.09278387e-01 -1.00663296e+08]]

X = input_to_cupy_array(X_np).array

n_bins = 20
encode = 'ordinal'
strategy = 'quantile'

cu_trans = cuKBD(n_bins=n_bins, encode=encode, strategy=strategy)
sk_trans = skKBD(n_bins=n_bins, encode=encode, strategy=strategy)

t_X = cu_trans.fit_transform(X)
print(t_X)
# [[10  5]
#  [19 10]
#  [15 14]
#  [ 5  0]
#  [ 0  0]]

skt_X = sk_trans.fit_transform(X_np)
print(skt_X)
# [[10.  6.]
#  [19. 11.]
#  [15. 15.]
#  [ 5.  1.]
#  [ 0.  1.]]
```

As demonstrated by the random values in the input, the rest of the matrix does not seem to matter so long as this precise number appears (as far as I can tell) at least twice. Tweaking even a single digit of this magic value causes the error to disappear. This is not the only magic value that Hypothesis has found thus far, but it is the first one I've found that reproduces on such a small input matrix. This example obviously does not need the full 20 bins that it is asked to use. Hypothesis also found some much larger (500x20) inputs that reproduced this, though I am not sure if they still had too many bins. The investigation continues...
I have some evidence now that this is the result of a numpy bug, demonstrated by the following:

```python
import numpy as np

arr = np.array([-3*2**25, 1.2, -3*2**25, 2.1, 3.5])
quantiles = np.linspace(0, 100, 21)
percentile = np.percentile(arr, quantiles, interpolation='linear')
print(percentile[0] <= percentile[1])
```

If we follow the sklearn […]
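Independent of where the bug lives, the broken invariant is that the returned bin edges should be non-decreasing. One defensive mitigation (a sketch only, not what cuML or sklearn actually do) is to clamp the edges monotone with a running maximum:

```python
import numpy as np

arr = np.array([-3 * 2**25, 1.2, -3 * 2**25, 2.1, 3.5])
quantiles = np.linspace(0, 100, 21)
edges = np.percentile(arr, quantiles)  # non-monotone on affected numpy versions

# Force bin edges to be non-decreasing regardless of upstream behavior
fixed = np.maximum.accumulate(edges)
assert np.all(np.diff(fixed) >= 0)
```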
Probably related: scikit-learn/scikit-learn#13194 and numpy/numpy#10373
Fixed by numpy/numpy#16273, which is part of numpy 1.20.0. As of 2 days ago, 1.20.0 became available on conda-forge, and it is now the default version installed in our environments. I'm going to do some further testing to see if there is some other issue which is also cropping up here, but I'm beginning to have some hope that this has been resolved.
There is still an issue because of cupy/cupy#4607, which presents the exact same problem as in numpy but for different "magic" input values. I'm going to put in a quick workaround and submit a fix to cupy.
The "quick workaround" was not actually correct, but I think there's a messier workaround available. I'm putting in a PR to xfail the test since we're about to hit burndown, and then I'll try to figure out a better approach until the fix works its way into a version of cupy that we can use.
The plan now is to wait for cupy/cupy#4617 to get in before removing the xfail again. I'll do a couple tests in the meantime to make sure there's not another issue lurking somewhere in here, but […]
Yep, with rapidsai/integration#230 in place, I think it's worth removing the xfail and seeing if we see it again. I was not 100% certain that the cupy issue was the only problem here, but I'm confident enough that I think we should give it a go.
Following the update to cupy 8.5.0, the bad read in the `cupy.percentile` kernel should no longer be an issue, allowing us to remove the xfail on this test. Closes #2933.
Authors:
- William Hicks (https://github.com/wphicks)
Approvers:
- Dante Gama Dessavre (https://github.com/dantegd)
URL: #3804
Reopening the issue, as new test failures were observed with the quantile strategy.
When PR #2932 is merged, it will start marking the KBinsDiscretizer tests as xfail due to sporadic failures observed in CI.