
[BUG] Sporadic KBinsDiscretizer pytests fail with quantile strategy #2933

Open
Tracked by #3483
divyegala opened this issue Oct 7, 2020 · 22 comments · Fixed by #3804
Labels: 0 - Blocked (Cannot progress due to external reasons) · bug (Something isn't working) · Cython / Python (Cython or Python issue)
@divyegala (Member)

When PR #2932 is merged, it will start marking the KBinsDiscretizer tests as xfail due to sporadic failures observed in CI.

@divyegala added the "? - Needs Triage" and "bug" labels on Oct 7, 2020
@divyegala added the "Cython / Python" label and removed "? - Needs Triage" on Oct 7, 2020
@viclafargue (Contributor) commented Oct 8, 2020

Thanks for opening the issue. Since the errors pop up sporadically, I'll put them here for reference.

Fails on:
test_kbinsdiscretizer[cudf-quantile-ordinal-5] – cuml.test.test_preprocessing

cuml/test/test_preprocessing.py:579:
E           AssertionError: 
E           Not equal to tolerance rtol=1e-05, atol=1e-05
E           
E           Mismatched elements: 100 / 10000 (1%)
E           Max absolute difference: 1.
E           Max relative difference: 0.25

test_kbinsdiscretizer[cudf-quantile-ordinal-20] – cuml.test.test_preprocessing

cuml/test/test_preprocessing.py:579:
E           AssertionError: 
E           Not equal to tolerance rtol=1e-05, atol=1e-05
E           
E           Mismatched elements: 25 / 10000 (0.25%)
E           Max absolute difference: 1.
E           Max relative difference: 0.05263158

test_kbinsdiscretizer[cudf-quantile-onehot-dense-5] – cuml.test.test_preprocessing

cuml/test/test_preprocessing.py:563:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
cuml/_thirdparty/sklearn/utils/skl_dependencies.py:359: in fit_transform
    return self.fit(X, **fit_params).transform(X)
cuml/_thirdparty/sklearn/preprocessing/_discretization.py:228: in fit
    categories=np.array([np.arange(i) for i in self.n_bins_]),
/opt/conda/envs/rapids/lib/python3.7/site-packages/cupy/_creation/from_data.py:41: in array
    return core.array(obj, dtype, copy, order, subok, ndmin)
cupy/core/core.pyx:2059: in cupy.core.core.array
    ???
cupy/core/core.pyx:2138: in cupy.core.core.array
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
>   ???
E   ValueError: Unsupported dtype object
cupy/core/core.pyx:2210: ValueError

@JohnZed (Contributor) commented Nov 19, 2020

@wphicks can look at this after his current work on the Silhouette score.

@JohnZed assigned wphicks and unassigned viclafargue on Nov 19, 2020
@wphicks (Contributor) commented Nov 24, 2020

The reported value error comes from an intermittent failure in the percentile calculation here. Occasionally the final element in the percentile array is NaN, and the error propagates from there. Breaking in at that point with a debugger, we see that numpy's percentile call returns the correct value, and moreover that converting the column array to a numpy array and back allows cupy to compute the correct value as well. I am working to determine the root cause, but my leading hunch is uninitialized memory.
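
For reference, here is a minimal sketch of the check described above; the column below is a random placeholder, whereas in the real failure it is the column extracted from the cuDF input inside the discretizer's fit path.

import cupy as cp
import numpy as np

column = cp.random.rand(10000)          # stand-in for the column taken from the cuDF input
quantiles = np.linspace(0, 100, 6)      # e.g. 5 bins -> 6 edges

gpu_edges = cp.percentile(column, quantiles)               # occasionally ends in NaN
cpu_edges = np.percentile(cp.asnumpy(column), quantiles)   # numpy computes the correct value
roundtrip = cp.percentile(cp.asarray(cp.asnumpy(column)), quantiles)  # host round-trip is also correct

print(bool(cp.isnan(gpu_edges).any()), bool(np.isnan(cpu_edges).any()), bool(cp.isnan(roundtrip).any()))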

@wphicks (Contributor) commented Dec 1, 2020

Something has changed in the past week in terms of reproducing this. I am still able to reproduce the issue, but it takes much longer to do so, and I have only seen failures on the cudf-quantile-onehot-dense variants of this test, whereas before I saw it on at least some other cudf-quantile tests. When I began tracking it, the mean time to failure was around 180 iterations; now it is slightly under 500. I don't have any insight into what has changed or whether it's specific to my system.

I can say with some confidence now that I cannot reproduce this with any other data input format except cudf. I've extracted the sequence of conversions that we perform into a standalone example now and am attempting to reproduce this independently of cuML.
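
A rough sketch of the kind of standalone loop used to hunt for this outside of cuML; the exact extracted conversion sequence isn't reproduced here, this only illustrates the cuDF-to-CuPy path plus the NaN check:

import cudf
import cupy as cp
import numpy as np

quantiles = np.linspace(0, 100, 6)
for i in range(100000):
    df = cudf.DataFrame({"a": np.random.rand(10000)})
    column = df["a"].values                  # cuDF Series -> CuPy array
    edges = cp.percentile(column, quantiles)
    if bool(cp.isnan(edges).any()):
        print("NaN in percentile output on iteration", i)
        break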

@viclafargue (Contributor) commented Dec 2, 2020

Same here, it's really hard to reproduce. I wonder whether the disappearance of the ordinal error is linked to the recent change in #3194, which converts cuDF DataFrames to CuPy arrays before the call to input_to_cuml_array.

@wphicks (Contributor) commented Dec 2, 2020

I have fairly compelling evidence now that the ValueError is a result of some interaction with pytest's fixture code. I have not successfully reproduced the ValueError in over 30K runs after refactoring the test to avoid using the fixture (but using exactly the same code to generate the test data). I still saw the AssertionError once.

@wphicks (Contributor) commented Dec 2, 2020

It is worth mentioning that I changed the fixture scope to function rather than session for these tests.
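
For context, the change is along these lines; the fixture name and body below are only illustrative stand-ins for the real data fixture in cuml/test/test_preprocessing.py:

import numpy as np
import pytest

@pytest.fixture(scope="function")   # previously scope="session"
def clf_dataset():
    # Illustrative only: the real fixture builds the preprocessing test data.
    return np.random.rand(10000, 2)

With "function" scope the data is rebuilt for every test instead of being shared across the whole session, which removes any cross-test interaction through the shared object.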

@wphicks (Contributor) commented Dec 15, 2020

Documenting some miscellaneous findings here:

  • This bug appears to reproduce less often (approximately every 1500 runs) with the debug build
  • Running cuda-memcheck's racecheck tool on the debug build shows a race condition in cudf's valid_if_n_kernel
  • Running cuda-memcheck's synccheck on the debug build errors out with rmm::bad_alloc
  • Breaking after the percentile call when it returns a NaN, we can usually (but not always) call cp.percentile repeatedly on the input column and get the NaN every time. Sometimes, however, repeated calls on the very column that just produced the NaN never yield a NaN in the output (see the sketch below).
  • Dropping to cuda-gdb in the debug build after a NaN has been produced in the percentile output and printing the device memory for the column, we see an array of all zeroes. This may be a debugger artifact rather than a real issue.
  • Running cuda-memcheck's initcheck on the release build shows a whole assortment of issues, including some in upstream libraries
  • Running cp.sum on the column input to cp.percentile never yields a NaN

My leading hypothesis based on the symptoms is a race condition, but I have not been able to further isolate it. I'm still looking into the specific race conditions reported in valid_if_n_kernel.
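
The repeated-call check mentioned in the list above looks roughly like this (sketch only; column and quantiles come from the breakpoint in the fit path):

import cupy as cp

def recheck(column, quantiles, n=100):
    # Re-run cp.percentile on the same column and count how often a NaN recurs.
    nan_count = 0
    for _ in range(n):
        edges = cp.percentile(column, quantiles)
        if bool(cp.isnan(edges).any()):
            nan_count += 1
    return nan_count   # usually n, but sometimes 0, as described above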

@wphicks (Contributor) commented Dec 16, 2020

I have now confirmed that the ValueError is the result of this cuPy issue (cupy/cupy#4451) and was able to reproduce it independently of cuML. I'll create a workaround.

I am still not certain about the causes of the other observed errors, or even whether they still apply. I was only able to reproduce them about once in every 100K runs before, and I haven't seen them in quite some time. I'll look into them a little more and see if I can get a reproducer, but otherwise I'm tempted to remove the xfail once the workaround for the percentile issue is in place.

@wphicks (Contributor) commented Dec 17, 2020

Partial fix available here: #3315. This eliminates the ValueError, but I cannot guarantee that the other sporadic failure cases have been addressed.
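
The shape of the workaround, as a sketch under the assumption that the last requested quantile is 100 (the actual change in #3315 lives in the vendored sklearn discretization code and may differ in detail):

import cupy as cp

def safe_percentile(column, quantiles):
    # Pin the 100th-quantile value to the column max to guard against the
    # NaN described in cupy/cupy#4451. Illustrative only.
    edges = cp.percentile(column, quantiles)
    if quantiles[-1] == 100:
        edges[-1] = column.max()
    return edges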

rapids-bot pushed a commit that referenced this issue Dec 17, 2020
Ensure that the 100th quantile value returned by cupy.percentile is the maximum of the input array rather than (possibly) NaN due to cupy/cupy#4451. This eliminates an intermittent failure observed in tests of KBinsDiscretizer, which makes use of cupy.percentile. Note that this includes an alteration of the included sklearn code and should be reverted once the upstream cupy issue is resolved.

Resolve failure due to ValueError described in #2933.

Authors:
  - William Hicks <whicks@nvidia.com>

Approvers:
  - Dante Gama Dessavre
  - Victor Lafargue

URL: #3315
@wphicks (Contributor) commented Feb 1, 2021

Using Hypothesis, I was able to find some consistent reproducers, either for at least one of the problems represented by this issue or for a new issue. I'll give the simplest one I've found so far below:

import cupy as cp
import numpy as np
from cuml.common.input_utils import input_to_cupy_array
from cuml.experimental.preprocessing import KBinsDiscretizer as cuKBD
from sklearn.preprocessing import KBinsDiscretizer as skKBD
magic_number=-3*2**25
X_np = np.random.rand(5, 2)
X_np[3, 1] = magic_number
X_np[4, 1] = magic_number

print(X_np)
# [[ 5.21711282e-01  3.01647383e-01]
#  [ 9.73775159e-01  3.78120961e-01]
#  [ 5.92897501e-01  6.84050667e-01]
#  [ 2.97061781e-01 -1.00663296e+08]
#  [ 2.09278387e-01 -1.00663296e+08]]

X = input_to_cupy_array(X_np).array
n_bins=20
encode='ordinal'
strategy='quantile'
cu_trans = cuKBD(n_bins=n_bins, encode=encode, strategy=strategy)
sk_trans = skKBD(n_bins=n_bins, encode=encode, strategy=strategy)
t_X = cu_trans.fit_transform(X)
print(t_X)
# [[10  5]
#  [19 10]
#  [15 14]
#  [ 5  0]
#  [ 0  0]]
skt_X = sk_trans.fit_transform(X_np)
print(skt_X)
# [[10.  6.]
#  [19. 11.]
#  [15. 15.]
#  [ 5.  1.]
#  [ 0.  1.]]

As demonstrated by the random values in the input, the rest of the matrix does not seem to matter so long as this precise number appears at least twice (as far as I can tell). Tweaking even a single digit of this magic value causes the error to disappear. This is not the only magic value that Hypothesis has found so far, but it is the first one I've found that reproduces on such a small input matrix.

This example obviously does not need the full 20 bins that it is asked to use. Hypothesis also found some much larger (500x20) inputs that reproduced this, though I am not sure whether they still had too many bins.

The investigation continues...
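
For reference, a sketch of the kind of Hypothesis search being used here; the property, strategy bounds, and tolerances are illustrative rather than the exact test:

import cupy as cp
import numpy as np
from hypothesis import given, settings
from hypothesis.extra.numpy import arrays
from hypothesis.strategies import floats
from cuml.experimental.preprocessing import KBinsDiscretizer as cuKBD
from sklearn.preprocessing import KBinsDiscretizer as skKBD

@settings(max_examples=1000, deadline=None)
@given(arrays(np.float64, (5, 2),
              elements=floats(-1e9, 1e9, allow_nan=False, allow_infinity=False)))
def test_kbd_matches_sklearn(X_np):
    # Compare cuML's KBinsDiscretizer against sklearn's on randomly generated inputs.
    kwargs = dict(n_bins=20, encode="ordinal", strategy="quantile")
    cu_out = cuKBD(**kwargs).fit_transform(cp.asarray(X_np))
    sk_out = skKBD(**kwargs).fit_transform(X_np)
    np.testing.assert_allclose(cp.asnumpy(cu_out), sk_out, rtol=1e-5, atol=1e-5)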

@wphicks (Contributor) commented Feb 2, 2021

I have some evidence now that this is the result of a numpy bug. The following returns False:

import numpy as np
arr = np.array([-3*2**25, 1.2, -3*2**25, 2.1, 3.5])
quantiles = np.linspace(0, 100, 21)
percentile = np.percentile(arr, quantiles, interpolation='linear')
print(percentile[0] <= percentile[1])

If we follow the sklearn KBinsDiscretizer code, we can see that its intermediate results diverge from ours at precisely this calculation and ultimately produce output like that shown above. I'm digging into numpy now to see if I can understand precisely where the error emerges in the percentile calculation.
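
The same check extended over all of the bin edges (identical input; on numpy versions affected by the bug, prior to 1.20.0, the assertion fails):

import numpy as np

arr = np.array([-3 * 2**25, 1.2, -3 * 2**25, 2.1, 3.5])
edges = np.percentile(arr, np.linspace(0, 100, 21), interpolation='linear')
# Quantiles of a fixed array must be non-decreasing; a violation here means the
# discretizer's bin edges come back out of order.
assert np.all(np.diff(edges) >= 0), "percentile output is not monotonic"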

@wphicks (Contributor) commented Feb 2, 2021

Probably related: scikit-learn/scikit-learn#13194 and numpy/numpy#10373

@wphicks (Contributor) commented Feb 2, 2021

Fixed by numpy/numpy#16273, which is part of numpy 1.20.0. As of 2 days ago, 1.20.0 became available on conda-forge, and it is now the default version installed in our environments. I'm going to do some further testing to see if there is some other issue which is also cropping up here, but I'm beginning to have some hope that this has been resolved.

@wphicks (Contributor) commented Feb 2, 2021

There is still an issue because of cupy/cupy#4607, which presents the exact same problem as in numpy but for different "magic" input values. I'm going to put in a quick workaround and submit a fix to cupy.
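
For illustration, one generic way to defend against non-monotonic quantile output (not the specific workaround referred to here, just a common mitigation):

import cupy as cp
import numpy as np

def monotonic_percentile(column, quantiles):
    # Compute the quantile bin edges, then force them to be non-decreasing so
    # that a cupy/cupy#4607-style glitch cannot reorder the edges. Sketch only.
    edges = cp.asnumpy(cp.percentile(column, quantiles))
    return np.maximum.accumulate(edges)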

@wphicks (Contributor) commented Feb 3, 2021

The "quick workaround" was not actually correct, but I think there's a messier workaround available. I'm putting in a PR to xfail the test since we're about to hit burndown and then try to figure out a better approach until the fix works its way into a version of cupy that we can use.

@wphicks (Contributor) commented Feb 4, 2021

The plan now is to wait for cupy/cupy#4617 to get in before removing the xfail again. I'll do a couple tests in the meantime to make sure there's not another issue lurking somewhere in here, but cupy.percentile definitely needs to get updated for us to match sklearn output consistently.

@wphicks added the "0 - Blocked" label on Feb 4, 2021
@wphicks changed the title from "[BUG] Sporadic KBinsDiscretizer pytests fail" to "[BUG] Sporadic KBinsDiscretizer pytests fail with quantile strategy" on Feb 10, 2021
@JohnZed (Contributor) commented Mar 17, 2021

@wphicks similar to the other issue (#3481), maybe this one is actually closed by the cupy fix?

@wphicks (Contributor) commented Mar 17, 2021

Yep, with rapidsai/integration#230 in place, I think it's worth removing the xfail and checking whether the failure shows up again. I was not 100% certain that the cupy issue was the only problem here, but I'm confident enough that I think we should give it a go.

wphicks added a commit to wphicks/cuml that referenced this issue Apr 28, 2021
Following the update to cupy 8.5.0, the bad read in the
`cupy.percentile` kernel should no longer be an issue, allowing us to
remove the xfail on this test.

Close rapidsai#2933
rapids-bot pushed a commit that referenced this issue Apr 28, 2021
Following the update to cupy 8.5.0, the bad read in the `cupy.percentile` kernel should no longer be an issue, allowing us to remove the xfail on this test.

Close #2933

Authors:
  - William Hicks (https://github.com/wphicks)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #3804
@viclafargue (Contributor)

Reopening the issue, as new test failures were observed with the quantile strategy.

@viclafargue reopened this on Jun 8, 2021
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this issue Oct 9, 2023