[BUG] Audit cuML source code and add test to catch incorrect use of cuML array #2456

hcho3 · 2020-06-19T23:28:56Z

Recently, we updated all cuML algorithms to use cuML arrays. The convention is that fields of estimators that are of cuML array type should be prefixed with an underscore, like this:

self._coef_ = None   # accessed via estimator.coef_

It is quite easy to leave out the underscore:

self.coef_ = None  # bad!

Leaving out the underscore in the cuML array field is dangerous because, when the user accesses the field, they'll get None. This is what happened in the bug #2438.

Todo. Audit the cuML codebase to find instances of this error.

The text was updated successfully, but these errors were encountered:

JohnZed · 2020-06-22T15:09:29Z

Let's get these shaken out within 0.15. I'm sure some annoying bugs are lurking as a result.

JohnZed · 2020-06-22T16:14:34Z

@dantegd 's comment on the PR #2445:

I think it’d be better to open an issue to create a pytest that checks attributes of classes, where if an attribute is a cumlarray it must start with an _, that way we will prevent issues like this from happening. @hcho3 what do you think?

JohnZed · 2020-06-29T16:19:45Z

Tagging @mdemoret-nv who will look at this

mdemoret-nv · 2020-07-03T00:55:47Z

I'm breaking this down into several steps to incrementally improve the repo:

Modify the tests to identify current issues without a performance impact
Fix the incorrect uses of base in the repo
Develop caching system on CumlArray to avoid unnecessary to_output() calls
Determine if caching system has performance impact
If the caching system is beneficial, incorporate where possible

This should be spread out over a few PRs.

mdemoret-nv · 2020-07-30T04:50:50Z

In order to add testing to detect incorrect uses of CumlArray on the Base class, we will need a set of conventions to adhere by. The tests will then detect any violations of the rules in derived classes during testing and indicate where the violations occurred. Below are the proposed rules/conventions:

Naming Conventions:

Trailing underscore, any type:
1. Estimated attribute. Estimated attributes are expected to be overridden if a method is called multiple times.
2. i.e. self.coef_ stores the coefficients of some regression estimator after fit() has been called.
3. Convention is taken directly from scikit-learn: https://scikit-learn.org/stable/developers/develop.html#estimated-attributes
Leading underscore, any type:
1. Semi-private or protected attribute.
2. Not part of the public API of the object. Can be used by derived classes.
3. i.e. self._my_attribute = 5
Leading and trailing underscore, type = CumlArray
1. Adheres to the above rules and has special processing in cuml.common.base.Base.__getattr__().
2. The attribute itself is considered private, but it will also create a read-only public attribute that converts the CumlArray to the desired output type on access using CumlArray.to_output().
3. Example:
```
foo = MyEstimator(output_type='numpy')
print(type(foo._coeff_)) # CumlArray
print(type(foo.coef_)) # numpy.ndarray
```
1. Internally, only self._coef_ = CumlArray.zeros(10) should have been set. self.coef_ should be undefined.
Double leading underscore, any type:
1. "Super-private" attribute. Cant be used in derived classes.
2. Always considered private and will ignore any special processing, even if its a CumlArray.
3. i.e. self.__y_diff could store a temp array that is needed by one or more methods but should not be publicly accessible or converted.

To support these naming conventions, there are some rules that must be followed when deriving from Base:

Public estimated attributes (that adhere to [REVIEW] - Adding example docstrings for cuML - PCA, TSVD, and DBSCAN algos #1 above) should be stored in the object attribute as a CumlArray type.
1. Other array-like types (such as np.ndarray, cupy.DataFrame and especially cupy.ndarray) should be avoided.
2. This is to provide a consistent and predictable data type for estimated parameters that follows the output_type argument and any global output_type settings.
  1. Otherwise users may find some attributes that are always a fixed type and others that can be changed by output_type
3. Exceptions to this rule can avoid test failures by implementing custom properties. i.e.
```
@property
def coef_(self):
   return np.asarray(self.__custom_coef_) # Double underscore to prevent warnings
```
Classes that derive from Base and have an internal attribute that also derives from Base must follow specific guidelines to ensure proper handling of array-like estimated properties:
1. __init__ should always have an output_type argument that is passed down to both super().__init__() and any child attributes that derive from Base
2. Accessing estimated attributes of child objects should always use the CumlArray private object. i.e. self._coef_ = self._internal_estimator._coef_.
If a method sets estimated properties (i.e. fit()), then self._set_output_type(X) should be called at the beginning of the function.
1. This ensures the to_output() conversion works correctly for the "input" type.

Open to comments/concerns/questions.

mdemoret-nv · 2020-07-30T05:20:08Z

Caching system work is being done here: https://github.com/mdemoret-nv/cuml/tree/bug-audit-cumlarray-use-improved

JohnZed · 2020-11-16T21:04:50Z

In good shape now with the addition of CumlArrayDescriptor

hcho3 added ? - Needs Triage Need team to review and classify bug Something isn't working and removed ? - Needs Triage Need team to review and classify labels Jun 19, 2020

hcho3 self-assigned this Jun 19, 2020

hcho3 mentioned this issue Jun 19, 2020

[REVIEW] Restore access to coef_ property of Lasso #2445

Merged

JohnZed changed the title ~~[BUG] Audit cuML source code to catch incorrect use of cuML array~~ [BUG] Audit cuML source code and add test to catch incorrect use of cuML array Jun 22, 2020

mdemoret-nv self-assigned this Jul 1, 2020

mdemoret-nv mentioned this issue Jul 3, 2020

[REVIEW] Audit cuML source code and add test to catch incorrect use of cuML array #2506

Merged

This was referenced Aug 18, 2020

[REVIEW]Distributed TF-IDF transformer #2698

Merged

[BUG]Multi Partition Naive Bayes fails due to due to cuML CumlArray refactor #2704

Closed

hcho3 removed their assignment Oct 19, 2020

JohnZed closed this as completed Nov 16, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Audit cuML source code and add test to catch incorrect use of cuML array #2456

[BUG] Audit cuML source code and add test to catch incorrect use of cuML array #2456

hcho3 commented Jun 19, 2020

JohnZed commented Jun 22, 2020

JohnZed commented Jun 22, 2020

JohnZed commented Jun 29, 2020

mdemoret-nv commented Jul 3, 2020 •

edited

Loading

mdemoret-nv commented Jul 30, 2020

mdemoret-nv commented Jul 30, 2020

JohnZed commented Nov 16, 2020

[BUG] Audit cuML source code and add test to catch incorrect use of cuML array #2456

[BUG] Audit cuML source code and add test to catch incorrect use of cuML array #2456

Comments

hcho3 commented Jun 19, 2020

JohnZed commented Jun 22, 2020

JohnZed commented Jun 22, 2020

JohnZed commented Jun 29, 2020

mdemoret-nv commented Jul 3, 2020 • edited Loading

mdemoret-nv commented Jul 30, 2020

mdemoret-nv commented Jul 30, 2020

JohnZed commented Nov 16, 2020

mdemoret-nv commented Jul 3, 2020 •

edited

Loading