Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Audit cuML source code and add test to catch incorrect use of cuML array #2456

Closed
hcho3 opened this issue Jun 19, 2020 · 7 comments
Closed
Assignees
Labels
bug Something isn't working

Comments

@hcho3
Copy link
Contributor

hcho3 commented Jun 19, 2020

Recently, we updated all cuML algorithms to use cuML arrays. The convention is that fields of estimators that are of cuML array type should be prefixed with an underscore, like this:

self._coef_ = None   # accessed via estimator.coef_

It is quite easy to leave out the underscore:

self.coef_ = None  # bad!

Leaving out the underscore in the cuML array field is dangerous because, when the user accesses the field, they'll get None. This is what happened in the bug #2438.

Todo. Audit the cuML codebase to find instances of this error.

@hcho3 hcho3 added ? - Needs Triage Need team to review and classify bug Something isn't working and removed ? - Needs Triage Need team to review and classify labels Jun 19, 2020
@hcho3 hcho3 self-assigned this Jun 19, 2020
@JohnZed
Copy link
Contributor

JohnZed commented Jun 22, 2020

Let's get these shaken out within 0.15. I'm sure some annoying bugs are lurking as a result.

@JohnZed
Copy link
Contributor

JohnZed commented Jun 22, 2020

@dantegd 's comment on the PR #2445:

I think it’d be better to open an issue to create a pytest that checks attributes of classes, where if an attribute is a cumlarray it must start with an _, that way we will prevent issues like this from happening. @hcho3 what do you think?

@JohnZed JohnZed changed the title [BUG] Audit cuML source code to catch incorrect use of cuML array [BUG] Audit cuML source code and add test to catch incorrect use of cuML array Jun 22, 2020
@JohnZed
Copy link
Contributor

JohnZed commented Jun 29, 2020

Tagging @mdemoret-nv who will look at this

@mdemoret-nv mdemoret-nv self-assigned this Jul 1, 2020
@mdemoret-nv
Copy link
Contributor

mdemoret-nv commented Jul 3, 2020

I'm breaking this down into several steps to incrementally improve the repo:

  • Modify the tests to identify current issues without a performance impact
  • Fix the incorrect uses of base in the repo
  • Develop caching system on CumlArray to avoid unnecessary to_output() calls
  • Determine if caching system has performance impact
  • If the caching system is beneficial, incorporate where possible

This should be spread out over a few PRs.

@mdemoret-nv
Copy link
Contributor

In order to add testing to detect incorrect uses of CumlArray on the Base class, we will need a set of conventions to adhere by. The tests will then detect any violations of the rules in derived classes during testing and indicate where the violations occurred. Below are the proposed rules/conventions:

Naming Conventions:

  1. Trailing underscore, any type:
    1. Estimated attribute. Estimated attributes are expected to be overridden if a method is called multiple times.
    2. i.e. self.coef_ stores the coefficients of some regression estimator after fit() has been called.
    3. Convention is taken directly from scikit-learn: https://scikit-learn.org/stable/developers/develop.html#estimated-attributes
  2. Leading underscore, any type:
    1. Semi-private or protected attribute.
    2. Not part of the public API of the object. Can be used by derived classes.
    3. i.e. self._my_attribute = 5
  3. Leading and trailing underscore, type = CumlArray
    1. Adheres to the above rules and has special processing in cuml.common.base.Base.__getattr__().
    2. The attribute itself is considered private, but it will also create a read-only public attribute that converts the CumlArray to the desired output type on access using CumlArray.to_output().
    3. Example:
    foo = MyEstimator(output_type='numpy')
    print(type(foo._coeff_)) # CumlArray
    print(type(foo.coef_)) # numpy.ndarray
    1. Internally, only self._coef_ = CumlArray.zeros(10) should have been set. self.coef_ should be undefined.
  4. Double leading underscore, any type:
    1. "Super-private" attribute. Cant be used in derived classes.
    2. Always considered private and will ignore any special processing, even if its a CumlArray.
    3. i.e. self.__y_diff could store a temp array that is needed by one or more methods but should not be publicly accessible or converted.

To support these naming conventions, there are some rules that must be followed when deriving from Base:

  1. Public estimated attributes (that adhere to [REVIEW] - Adding example docstrings for cuML - PCA, TSVD, and DBSCAN algos #1 above) should be stored in the object attribute as a CumlArray type.
    1. Other array-like types (such as np.ndarray, cupy.DataFrame and especially cupy.ndarray) should be avoided.
    2. This is to provide a consistent and predictable data type for estimated parameters that follows the output_type argument and any global output_type settings.
      1. Otherwise users may find some attributes that are always a fixed type and others that can be changed by output_type
    3. Exceptions to this rule can avoid test failures by implementing custom properties. i.e.
    @property
    def coef_(self):
       return np.asarray(self.__custom_coef_) # Double underscore to prevent warnings
  2. Classes that derive from Base and have an internal attribute that also derives from Base must follow specific guidelines to ensure proper handling of array-like estimated properties:
    1. __init__ should always have an output_type argument that is passed down to both super().__init__() and any child attributes that derive from Base
    2. Accessing estimated attributes of child objects should always use the CumlArray private object. i.e. self._coef_ = self._internal_estimator._coef_.
  3. If a method sets estimated properties (i.e. fit()), then self._set_output_type(X) should be called at the beginning of the function.
    1. This ensures the to_output() conversion works correctly for the "input" type.

Open to comments/concerns/questions.

@mdemoret-nv
Copy link
Contributor

Caching system work is being done here: https://github.com/mdemoret-nv/cuml/tree/bug-audit-cumlarray-use-improved

@JohnZed
Copy link
Contributor

JohnZed commented Nov 16, 2020

In good shape now with the addition of CumlArrayDescriptor

@JohnZed JohnZed closed this as completed Nov 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants