
[BUG] Sporadic KBinsDiscretizer pytests fail with quantile strategy #2933

Open
Tracked by #3483
divyegala opened this issue Oct 7, 2020 · 22 comments · Fixed by #3804
Labels: 0 - Blocked (Cannot progress due to external reasons) · bug (Something isn't working) · Cython / Python (Cython or Python issue)
@divyegala (Member)

When PR #2932 is merged, it will start marking the KBinsDiscretizer tests as xfail due to sporadic failures observed in CI.

@divyegala added the "? - Needs Triage" and "bug" labels on Oct 7, 2020
@divyegala added the "Cython / Python" label and removed "? - Needs Triage" on Oct 7, 2020
@viclafargue (Contributor) commented Oct 8, 2020

Thanks for opening the issue. Since the errors pop up sporadically, I'll put them here for reference.

Fails on:
test_kbinsdiscretizer[cudf-quantile-ordinal-5] – cuml.test.test_preprocessing

cuml/test/test_preprocessing.py:579:
E           AssertionError: 
E           Not equal to tolerance rtol=1e-05, atol=1e-05
E           
E           Mismatched elements: 100 / 10000 (1%)
E           Max absolute difference: 1.
E           Max relative difference: 0.25

test_kbinsdiscretizer[cudf-quantile-ordinal-20] – cuml.test.test_preprocessing

cuml/test/test_preprocessing.py:579:
E           AssertionError: 
E           Not equal to tolerance rtol=1e-05, atol=1e-05
E           
E           Mismatched elements: 25 / 10000 (0.25%)
E           Max absolute difference: 1.
E           Max relative difference: 0.05263158

test_kbinsdiscretizer[cudf-quantile-onehot-dense-5] – cuml.test.test_preprocessing

cuml/test/test_preprocessing.py:563:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
cuml/_thirdparty/sklearn/utils/skl_dependencies.py:359: in fit_transform
    return self.fit(X, **fit_params).transform(X)
cuml/_thirdparty/sklearn/preprocessing/_discretization.py:228: in fit
    categories=np.array([np.arange(i) for i in self.n_bins_]),
/opt/conda/envs/rapids/lib/python3.7/site-packages/cupy/_creation/from_data.py:41: in array
    return core.array(obj, dtype, copy, order, subok, ndmin)
cupy/core/core.pyx:2059: in cupy.core.core.array
    ???
cupy/core/core.pyx:2138: in cupy.core.core.array
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
>   ???
E   ValueError: Unsupported dtype object
cupy/core/core.pyx:2210: ValueError

@JohnZed (Contributor) commented Nov 19, 2020

@wphicks can look at this after his current work on the Silhouette score.

@JohnZed assigned wphicks and unassigned viclafargue on Nov 19, 2020
@wphicks (Contributor) commented Nov 24, 2020

The reported value error comes from an intermittent failure in the percentile calculation here. Occasionally the final element in the percentile array is NaN, and the error propagates from there. Breaking in at that point with a debugger, we see that numpy's percentile call returns the correct value, and moreover that converting the column array to a numpy array and back allows cupy to compute the correct value as well. I am working to determine the root cause, but my leading hunch is uninitialized memory.
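
For reference, here is a minimal sketch of the check described above; the column below is a random placeholder, whereas in the real failure it is the column extracted from the cuDF input inside the discretizer's fit path.

import cupy as cp
import numpy as np

column = cp.random.rand(10000)          # stand-in for the column taken from the cuDF input
quantiles = np.linspace(0, 100, 6)      # e.g. 5 bins -> 6 edges

gpu_edges = cp.percentile(column, quantiles)               # occasionally ends in NaN
cpu_edges = np.percentile(cp.asnumpy(column), quantiles)   # numpy computes the correct value
roundtrip = cp.percentile(cp.asarray(cp.asnumpy(column)), quantiles)  # host round-trip is also correct

print(bool(cp.isnan(gpu_edges).any()), bool(np.isnan(cpu_edges).any()), bool(cp.isnan(roundtrip).any()))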

@wphicks (Contributor) commented Dec 1, 2020

Something has changed in the past week in terms of reproducing this. I am still able to reproduce the issue, but it takes much longer to do so, and I have only seen failures on the cudf-quantile-onehot-dense variants of this test, whereas before I saw it on at least some other cudf-quantile tests. When I began tracking it, the mean time to failure was around 180 iterations; now it is slightly under 500. I don't have any insight into what has changed or whether it's specific to my system.

I can say with some confidence now that I cannot reproduce this with any other data input format except cudf. I've extracted the sequence of conversions that we perform into a standalone example now and am attempting to reproduce this independently of cuML.
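
A rough sketch of the kind of standalone loop used to hunt for this outside of cuML; the exact extracted conversion sequence isn't reproduced here, this only illustrates the cuDF-to-CuPy path plus the NaN check:

import cudf
import cupy as cp
import numpy as np

quantiles = np.linspace(0, 100, 6)
for i in range(100000):
    df = cudf.DataFrame({"a": np.random.rand(10000)})
    column = df["a"].values                  # cuDF Series -> CuPy array
    edges = cp.percentile(column, quantiles)
    if bool(cp.isnan(edges).any()):
        print("NaN in percentile output on iteration", i)
        break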

@viclafargue (Contributor) commented Dec 2, 2020

Same here, it's really hard to reproduce. I wonder whether the disappearance of the ordinal error is linked to the recent change in #3194, which converts cuDF DataFrames to CuPy arrays before the call to input_to_cuml_array.

@wphicks (Contributor) commented Dec 2, 2020

I have fairly compelling evidence now that the ValueError is a result of some interaction with pytest's fixture code. I have not successfully reproduced the ValueError in over 30K runs after refactoring the test to avoid using the fixture (but using exactly the same code to generate the test data). I still saw the AssertionError once.

@wphicks (Contributor) commented Dec 2, 2020

It is worth mentioning that I changed the fixture scope to function rather than session for these tests.
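
For context, the change is along these lines; the fixture name and body below are only illustrative stand-ins for the real data fixture in cuml/test/test_preprocessing.py:

import numpy as np
import pytest

@pytest.fixture(scope="function")   # previously scope="session"
def clf_dataset():
    # Illustrative only: the real fixture builds the preprocessing test data.
    return np.random.rand(10000, 2)

With "function" scope the data is rebuilt for every test instead of being shared across the whole session, which removes any cross-test interaction through the shared object.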

@wphicks (Contributor) commented Dec 15, 2020

Documenting some miscellaneous findings here:

  • This bug appears to reproduce less often (approximately every 1500 runs) with the debug build
  • Running cuda-memcheck's racecheck tool on the debug build shows a race condition in cudf's valid_if_n_kernel
  • Running cuda-memcheck's synccheck on the debug build errors out with rmm::bad_alloc
  • Breaking after the percentile call when it returns a NaN, we can usually (but not always) call cp.percentile repeatedly on the input column and get the NaN every time. Sometimes, however, repeated calls on the very column that just produced the NaN never yield a NaN in the output (see the sketch below).
  • Dropping to cuda-gdb in the debug build after a NaN has been produced in the percentile output and printing the device memory for the column, we see an array of all zeroes. This may be a debugger artifact rather than a real issue.
  • Running cuda-memcheck's initcheck on the release build shows a whole assortment of issues, including some in upstream libraries
  • Running cp.sum on the column input to cp.percentile never yields a NaN

My leading hypothesis based on the symptoms is a race condition, but I have not been able to further isolate it. I'm still looking into the specific race conditions reported in valid_if_n_kernel.
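
The repeated-call check mentioned in the list above looks roughly like this (sketch only; column and quantiles come from the breakpoint in the fit path):

import cupy as cp

def recheck(column, quantiles, n=100):
    # Re-run cp.percentile on the same column and count how often a NaN recurs.
    nan_count = 0
    for _ in range(n):
        edges = cp.percentile(column, quantiles)
        if bool(cp.isnan(edges).any()):
            nan_count += 1
    return nan_count   # usually n, but sometimes 0, as described above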

@wphicks (Contributor) commented Dec 16, 2020

I have now confirmed that the ValueError is the result of this cuPy issue (cupy/cupy#4451) and was able to reproduce it independently of cuML. I'll create a workaround.

I am still not certain about the causes of the other observed errors, or even whether they still apply. I was only able to reproduce them about once in every 100K runs before, and I haven't seen them in quite some time. I'll look into them a little more and see if I can get a reproducer, but otherwise I'm tempted to remove the xfail once the workaround for the percentile issue is in place.

@wphicks (Contributor) commented Dec 17, 2020

Partial fix available here: #3315. This eliminates the ValueError, but I cannot guarantee that the other sporadic failure cases have been addressed.
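
The shape of the workaround, as a sketch under the assumption that the last requested quantile is 100 (the actual change in #3315 lives in the vendored sklearn discretization code and may differ in detail):

import cupy as cp

def safe_percentile(column, quantiles):
    # Pin the 100th-quantile value to the column max to guard against the
    # NaN described in cupy/cupy#4451. Illustrative only.
    edges = cp.percentile(column, quantiles)
    if quantiles[-1] == 100:
        edges[-1] = column.max()
    return edges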

rapids-bot pushed a commit that referenced this issue Dec 17, 2020
Ensure that the 100th quantile value returned by cupy.percentile is the maximum of the input array rather than (possibly) NaN due to cupy/cupy#4451. This eliminates an intermittent failure observed in tests of KBinsDiscretizer, which makes use of cupy.percentile. Note that this includes an alteration of the included sklearn code and should be reverted once the upstream cupy issue is resolved.

Resolve failure due to ValueError described in #2933.

Authors:
  - William Hicks <whicks@nvidia.com>

Approvers:
  - Dante Gama Dessavre
  - Victor Lafargue

URL: #3315
@wphicks (Contributor) commented Feb 1, 2021

Using Hypothesis, I was able to find some consistent reproducers, either for at least one of the problems represented by this issue or for a new issue. I'll give the simplest one I've found so far below:

import cupy as cp
import numpy as np
from cuml.common.input_utils import input_to_cupy_array
from cuml.experimental.preprocessing import KBinsDiscretizer as cuKBD
from sklearn.preprocessing import KBinsDiscretizer as skKBD
magic_number=-3*2**25
X_np = np.random.rand(5, 2)
X_np[3, 1] = magic_number
X_np[4, 1] = magic_number

print(X_np)
# [[ 5.21711282e-01  3.01647383e-01]
#  [ 9.73775159e-01  3.78120961e-01]
#  [ 5.92897501e-01  6.84050667e-01]
#  [ 2.97061781e-01 -1.00663296e+08]
#  [ 2.09278387e-01 -1.00663296e+08]]

X = input_to_cupy_array(X_np).array
n_bins=20
encode='ordinal'
strategy='quantile'
cu_trans = cuKBD(n_bins=n_bins, encode=encode, strategy=strategy)
sk_trans = skKBD(n_bins=n_bins, encode=encode, strategy=strategy)
t_X = cu_trans.fit_transform(X)
print(t_X)
# [[10  5]
#  [19 10]
#  [15 14]
#  [ 5  0]
#  [ 0  0]]
skt_X = sk_trans.fit_transform(X_np)
print(skt_X)
# [[10.  6.]
#  [19. 11.]
#  [15. 15.]
#  [ 5.  1.]
#  [ 0.  1.]]

As demonstrated by the random values in the input, the rest of the matrix does not seem to matter so long as this precise number appears at least twice (as far as I can tell). Tweaking even a single digit of this magic value causes the error to disappear. This is not the only magic value that Hypothesis has found so far, but it is the first one I've found that reproduces on such a small input matrix.

This example obviously does not need the full 20 bins that it is asked to use. Hypothesis also found some much larger (500x20) inputs that reproduced this, though I am not sure whether they still had too many bins.

The investigation continues...
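
For reference, a sketch of the kind of Hypothesis search being used here; the property, strategy bounds, and tolerances are illustrative rather than the exact test:

import cupy as cp
import numpy as np
from hypothesis import given, settings
from hypothesis.extra.numpy import arrays
from hypothesis.strategies import floats
from cuml.experimental.preprocessing import KBinsDiscretizer as cuKBD
from sklearn.preprocessing import KBinsDiscretizer as skKBD

@settings(max_examples=1000, deadline=None)
@given(arrays(np.float64, (5, 2),
              elements=floats(-1e9, 1e9, allow_nan=False, allow_infinity=False)))
def test_kbd_matches_sklearn(X_np):
    # Compare cuML's KBinsDiscretizer against sklearn's on randomly generated inputs.
    kwargs = dict(n_bins=20, encode="ordinal", strategy="quantile")
    cu_out = cuKBD(**kwargs).fit_transform(cp.asarray(X_np))
    sk_out = skKBD(**kwargs).fit_transform(X_np)
    np.testing.assert_allclose(cp.asnumpy(cu_out), sk_out, rtol=1e-5, atol=1e-5)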

@wphicks (Contributor) commented Feb 2, 2021

I have some evidence now that this is the result of a numpy bug. The following returns False:

import numpy as np
arr = np.array([-3*2**25, 1.2, -3*2**25, 2.1, 3.5])
quantiles = np.linspace(0, 100, 21)
percentile = np.percentile(arr, quantiles, interpolation='linear')
print(percentile[0] <= percentile[1])

If we follow the sklearn KBinsDiscretizer code, we can see that its intermediate results diverge from ours at precisely this calculation and ultimately produce output like that shown above. I'm digging into numpy now to see if I can understand precisely where the error emerges in the percentile calculation.
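
The same check extended over all of the bin edges (identical input; on numpy versions affected by the bug, prior to 1.20.0, the assertion fails):

import numpy as np

arr = np.array([-3 * 2**25, 1.2, -3 * 2**25, 2.1, 3.5])
edges = np.percentile(arr, np.linspace(0, 100, 21), interpolation='linear')
# Quantiles of a fixed array must be non-decreasing; a violation here means the
# discretizer's bin edges come back out of order.
assert np.all(np.diff(edges) >= 0), "percentile output is not monotonic"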

@wphicks (Contributor) commented Feb 2, 2021

Probably related: scikit-learn/scikit-learn#13194 and numpy/numpy#10373

@wphicks (Contributor) commented Feb 2, 2021

Fixed by numpy/numpy#16273, which is part of numpy 1.20.0. As of 2 days ago, 1.20.0 became available on conda-forge, and it is now the default version installed in our environments. I'm going to do some further testing to see if there is some other issue which is also cropping up here, but I'm beginning to have some hope that this has been resolved.

@wphicks (Contributor) commented Feb 2, 2021

There is still an issue because of cupy/cupy#4607, which presents the exact same problem as in numpy but for different "magic" input values. I'm going to put in a quick workaround and submit a fix to cupy.
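
For illustration, one generic way to defend against non-monotonic quantile output (not the specific workaround referred to here, just a common mitigation):

import cupy as cp
import numpy as np

def monotonic_percentile(column, quantiles):
    # Compute the quantile bin edges, then force them to be non-decreasing so
    # that a cupy/cupy#4607-style glitch cannot reorder the edges. Sketch only.
    edges = cp.asnumpy(cp.percentile(column, quantiles))
    return np.maximum.accumulate(edges)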

@wphicks (Contributor) commented Feb 3, 2021

The "quick workaround" was not actually correct, but I think there's a messier workaround available. I'm putting in a PR to xfail the test since we're about to hit burndown and then try to figure out a better approach until the fix works its way into a version of cupy that we can use.

@wphicks (Contributor) commented Feb 4, 2021

The plan now is to wait for cupy/cupy#4617 to get in before removing the xfail again. I'll do a couple tests in the meantime to make sure there's not another issue lurking somewhere in here, but cupy.percentile definitely needs to get updated for us to match sklearn output consistently.

@wphicks added the "0 - Blocked" label on Feb 4, 2021
@wphicks changed the title from "[BUG] Sporadic KBinsDiscretizer pytests fail" to "[BUG] Sporadic KBinsDiscretizer pytests fail with quantile strategy" on Feb 10, 2021
@JohnZed (Contributor) commented Mar 17, 2021

@wphicks similar to the other issue (#3481), maybe this one is actually closed by the cupy fix?

@wphicks (Contributor) commented Mar 17, 2021

Yep, with rapidsai/integration#230 in place, I think it's worth removing the xfail and checking whether the failure shows up again. I was not 100% certain that the cupy issue was the only problem here, but I'm confident enough that I think we should give it a go.

wphicks added a commit to wphicks/cuml that referenced this issue Apr 28, 2021
Following the update to cupy 8.5.0, the bad read in the
`cupy.percentile` kernel should no longer be an issue, allowing us to
remove the xfail on this test.

Close rapidsai#2933
rapids-bot pushed a commit that referenced this issue Apr 28, 2021
Following the update to cupy 8.5.0, the bad read in the `cupy.percentile` kernel should no longer be an issue, allowing us to remove the xfail on this test.

Close #2933

Authors:
  - William Hicks (https://github.com/wphicks)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #3804
@viclafargue (Contributor)

Reopening the issue, as new test failures were observed with the quantile strategy.

@viclafargue reopened this on Jun 8, 2021
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this issue Oct 9, 2023