Merge pull request #3242 from rapidsai/branch-0.17
[gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
GPUtester authored Dec 3, 2020
2 parents 40d9568 + a76d80f commit ff17722
Showing 9 changed files with 185 additions and 100 deletions.
9 changes: 5 additions & 4 deletions CHANGELOG.md
@@ -10,12 +10,10 @@

## New Features
- PR #3164: Expose silhouette score in Python
- PR #3160: Least Angle Regression (experimental)
- PR #2659: Add initial max inner product sparse knn
- PR #2836: Refactor UMAP to accept sparse inputs
- PR #3126: Experimental versions of GPU accelerated Kernel and Permutation SHAP

## Improvements
- PR #3077: Improve runtime for test_kmeans
@@ -44,7 +42,7 @@
- PR #3115: Speeding up MNMG UMAP testing
- PR #3112: Speed test_array
- PR #3111: Adding Cython to Code Coverage
- PR #3129: Update notebooks README
- PR #3002: Update flake8 Config To Work With Per File Settings
- PR #3135: Add QuasiNewton tests
- PR #3040: Improved Array Conversion with CumlArrayDescriptor and Decorators
@@ -57,9 +55,11 @@
- PR #3155: Eliminate unnecessary warnings from random projection test
- PR #3176: Add probabilistic SVM tests with various input array types
- PR #3180: FIL: `blocks_per_sm` support in Python
- PR #3186: Add gain to RF JSON dump
- PR #3219: Update CI to use XGBoost 1.3.0 RCs
- PR #3221: Update contributing doc for label support
- PR #3177: Make Multinomial Naive Bayes inherit from `ClassifierMixin` and use it for score
- PR #3240: Minor doc updates

## Bug Fixes
- PR #3218: Specify dependency branches in conda dev environment to avoid pip resolver issue
@@ -103,6 +103,7 @@
- PR #3185: Add documentation for Distributed TFIDF Transformer
- PR #3190: Fix Attribute error on ICPA #3183 and PCA input type
- PR #3208: Fix EXITCODE override in notebook test script
- PR #3214: Correct flaky silhouette score test by setting atol
- PR #3216: Ignore splits that do not satisfy constraints

# cuML 0.16.0 (23 Oct 2020)
7 changes: 6 additions & 1 deletion README.md
@@ -108,7 +108,10 @@ repo](https://github.com/rapidsai/notebooks-contrib).
| | Epsilon-Support Vector Regression (SVR) | |
| **Time Series** | Holt-Winters Exponential Smoothing | |
| | Auto-regressive Integrated Moving Average (ARIMA) | Supports seasonality (SARIMA) |
| **Model Explanation** | SHAP Kernel Explainer | [Based on SHAP](https://shap.readthedocs.io/en/latest/) (experimental) |
| | SHAP Permutation Explainer | [Based on SHAP](https://shap.readthedocs.io/en/latest/) (experimental) |
| **Other** | K-Nearest Neighbors (KNN) Search | Multi-node multi-GPU via Dask+[UCX](https://github.com/rapidsai/ucx-py), uses [Faiss](https://github.com/facebookresearch/faiss) for Nearest Neighbors Query. |

---

## Installation
@@ -127,6 +130,8 @@ Please see our [guide for contributing to cuML](CONTRIBUTING.md).

## References

The RAPIDS team has a number of blogs with deeper technical dives and examples. [You can find them here on Medium.](https://medium.com/rapids-ai/tagged/machine-learning)

For additional details on the technologies behind cuML, as well as a broader overview of the Python Machine Learning landscape, see [_Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence_ (2020)](https://arxiv.org/abs/2002.04803) by Sebastian Raschka, Joshua Patterson, and Corey Nolet.

Please consider citing this when using cuML in a project. You can use the citation BibTeX:
2 changes: 1 addition & 1 deletion docs/Makefile
@@ -10,7 +10,7 @@ BUILDDIR = build

# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help -v "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

12 changes: 12 additions & 0 deletions docs/source/api.rst
@@ -170,6 +170,9 @@ Metrics (clustering and trustworthiness)
.. automodule:: cuml.metrics.cluster.homogeneity_score
:members:

.. automodule:: cuml.metrics.cluster.silhouette_score
:members:

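For orientation, here is a minimal sketch of calling the newly documented score on GPU data. This is hedged: the import path follows the automodule directive above, but the exact public symbol name (`cython_silhouette_score`) is an assumption to verify against the rendered API reference.

```python
# Hedged sketch of the newly documented silhouette score. The symbol
# name is assumed from the automodule path above; verify it against the
# rendered API reference.
from cuml.cluster import KMeans
from cuml.datasets import make_blobs
from cuml.metrics.cluster.silhouette_score import cython_silhouette_score

# Synthetic clustered data, generated and kept on the GPU.
X, _ = make_blobs(n_samples=1000, n_features=8, centers=5, random_state=0)
labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)

# Mean silhouette coefficient over all samples; values lie in [-1, 1],
# and higher means better-separated clusters.
print(cython_silhouette_score(X, labels, metric='euclidean'))
```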
.. automodule:: cuml.metrics.cluster.completeness_score
:members:

@@ -504,3 +507,12 @@
Normalizer, RobustScaler, SimpleImputer, StandardScaler,
add_dummy_feature, binarize, minmax_scale, normalize,
PolynomialFeatures, robust_scale, scale


Model Explanation (SHAP)
------------------------
.. autoclass:: cuml.experimental.explainer.KernelExplainer
:members:

.. autoclass:: cuml.experimental.explainer.PermutationExplainer
:members:
27 changes: 27 additions & 0 deletions python/cuml/dask/neighbors/kneighbors_classifier.py
@@ -41,6 +41,33 @@ class KNeighborsClassifier(NearestNeighbors):
K-Nearest Neighbors Classifier is an instance-based learning technique
that keeps training samples around for prediction, rather than trying
to learn a generalizable set of model parameters.

Parameters
----------
n_neighbors : int (default=5)
Default number of neighbors to query
algorithm : string (default='brute')
The query algorithm to use. Currently, only 'brute' is supported.
metric : string (default='euclidean')
Distance metric to use.
weights : string (default='uniform')
Sample weights to use. Currently, only the 'uniform' strategy is
supported.
handle : cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for
computations in this model. Most importantly, this specifies the CUDA
stream that will be used for the model's computations, so users can
run different models concurrently in different streams by creating
handles in several streams.
If it is None, a new one is created.
verbose : int or boolean, default=False
Sets logging level. It must be one of `cuml.common.logger.level_*`.
See :ref:`verbosity-levels` for more info.
output_type : {'input', 'cudf', 'cupy', 'numpy', 'numba'}, default=None
Variable to control output type of the results and attributes of
the estimator. If None, it'll inherit the output type set at the
module level, `cuml.global_output_type`.
See :ref:`output-data-type-configuration` for more info.
"""
def __init__(self, client=None, streams_per_handle=0,
verbose=False, **kwargs):
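A hedged usage sketch for the estimator documented above, assuming a machine with one or more GPUs and `dask_cuda` installed; the tiny inline dataset and `npartitions=2` are illustrative choices, not part of this diff.

```python
# Hedged sketch: requires a GPU and dask_cuda; the data is illustrative.
import cudf
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.neighbors import KNeighborsClassifier

if __name__ == "__main__":
    client = Client(LocalCUDACluster())

    # Distributed training data with two obvious classes.
    X = cudf.DataFrame({"x0": [0.0, 0.1, 1.0, 1.1],
                        "x1": [0.0, 0.1, 1.0, 1.1]})
    y = cudf.Series([0, 0, 1, 1])
    X_d = dask_cudf.from_cudf(X, npartitions=2)
    y_d = dask_cudf.from_cudf(y, npartitions=2)

    # Only algorithm='brute' and weights='uniform' are supported, per
    # the docstring above, so the defaults are left in place.
    knn = KNeighborsClassifier(n_neighbors=2, client=client)
    knn.fit(X_d, y_d)
    print(knn.predict(X_d).compute())  # majority vote of the 2 neighbors
```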
27 changes: 27 additions & 0 deletions python/cuml/dask/neighbors/kneighbors_regressor.py
@@ -39,6 +39,33 @@ class KNeighborsRegressor(NearestNeighbors):
K-Nearest Neighbors Regressor is an instance-based learning technique
that keeps training samples around for prediction, rather than trying
to learn a generalizable set of model parameters.

Parameters
----------
n_neighbors : int (default=5)
Default number of neighbors to query
algorithm : string (default='brute')
The query algorithm to use. Currently, only 'brute' is supported.
metric : string (default='euclidean')
Distance metric to use.
weights : string (default='uniform')
Sample weights to use. Currently, only the 'uniform' strategy is
supported.
handle : cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for
computations in this model. Most importantly, this specifies the CUDA
stream that will be used for the model's computations, so users can
run different models concurrently in different streams by creating
handles in several streams.
If it is None, a new one is created.
verbose : int or boolean, default=False
Sets logging level. It must be one of `cuml.common.logger.level_*`.
See :ref:`verbosity-levels` for more info.
output_type : {'input', 'cudf', 'cupy', 'numpy', 'numba'}, default=None
Variable to control output type of the results and attributes of
the estimator. If None, it'll inherit the output type set at the
module level, `cuml.global_output_type`.
See :ref:`output-data-type-configuration` for more info.
"""
def __init__(self, client=None, streams_per_handle=0,
verbose=False, **kwargs):
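The regressor mirrors the classifier: with the only supported `weights='uniform'` strategy, each prediction is the plain mean of the targets of the `n_neighbors` nearest training samples. A hedged sketch under the same assumptions as the classifier example above:

```python
# Hedged sketch, same assumptions as the classifier example above.
import cudf
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.neighbors import KNeighborsRegressor

if __name__ == "__main__":
    client = Client(LocalCUDACluster())

    X = cudf.DataFrame({"x0": [0.0, 1.0, 2.0, 3.0]})
    y = cudf.Series([0.0, 1.1, 1.9, 3.1])
    X_d = dask_cudf.from_cudf(X, npartitions=2)
    y_d = dask_cudf.from_cudf(y, npartitions=2)

    # Each prediction averages the targets of the 2 nearest neighbors.
    knr = KNeighborsRegressor(n_neighbors=2, client=client)
    knr.fit(X_d, y_d)
    print(knr.predict(X_d).compute())
```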
35 changes: 18 additions & 17 deletions python/cuml/experimental/explainer/kernel_shap.pyx
@@ -70,26 +70,28 @@ cdef extern from "cuml/explainer/kernel_shap.hpp" namespace "ML":

class KernelExplainer(SHAPBase):
"""
GPU accelerated version of SHAP's kernel explainer (experimental).

Based on the SHAP package:
https://github.com/slundberg/shap/blob/master/shap/explainers/_kernel.py
Main differences of the GPU version:
- Data generation and Kernel SHAP calculations are significantly faster,
  at the cost of more model evaluations when both the observation being
  explained and the background data have many 0-valued columns.
- Support for SHAP's new Explanation and API will be available in the
  next version.
- There is a small initialization cost of a few seconds (similar to the
  training time of regular Scikit/cuML models), which is traded for
  faster explanations after that.
- Only tabular data is supported for now, via passing the background
  dataset explicitly. Since the new API of SHAP is still evolving, the
  main supported API right now is the old one
  (i.e. ``explainer.shap_values()``).
- Sparse data support is planned for the near future.
- Further optimizations are in progress.

Parameters
----------
@@ -109,7 +111,7 @@
nsamples : int (default = 2 * data.shape[1] + 2048)
Number of times to re-evaluate the model when explaining each
prediction. More samples lead to lower variance estimates of the SHAP
values. The "auto" setting uses ``nsamples = 2 * X.shape[1] + 2048``.
link : function or str (default = 'identity')
The link function used to map between the output units of the
model and the SHAP value units. From the SHAP package: The link
@@ -159,7 +161,6 @@ class KernelExplainer(SHAPBase):
... n_features=10,
... noise=0.1,
... random_state=42)
>>> X_train, X_test, y_train, y_test = train_test_split(
... X,
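The Examples block above is cut off by the diff view. Below is a hedged, self-contained reconstruction of the same flow; the `SVR` model, the manual 100/2 split, and the `is_gpu_model=True` flag are assumptions consistent with the visible fragment, not the verbatim docstring.

```python
# Hedged reconstruction of the truncated Examples block above. The SVR
# model, the manual 100/2 split, and is_gpu_model=True are assumptions.
from cuml.datasets import make_regression
from cuml.svm import SVR
from cuml.experimental.explainer import KernelExplainer

X, y = make_regression(n_samples=102, n_features=10,
                       noise=0.1, random_state=42)
X_train, X_test = X[:100], X[100:]
y_train = y[:100]

model = SVR().fit(X_train, y_train)

# The background dataset is passed explicitly (tabular data only for
# now), and the classic shap_values() API is used, as noted above.
explainer = KernelExplainer(model=model.predict,
                            data=X_train,
                            is_gpu_model=True)
shap_values = explainer.shap_values(X_test)  # shape (2, 10)
```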
24 changes: 12 additions & 12 deletions python/cuml/experimental/explainer/permutation_shap.pyx
@@ -87,6 +87,7 @@ cdef extern from "cuml/explainer/permutation_shap.hpp" namespace "ML":

class PermutationExplainer(SHAPBase):
"""
GPU accelerated version of SHAP's permutation explainer (experimental).

Initial experimental version of a GPU accelerated SHAP permutation
explainer:
@@ -99,16 +100,16 @@ class PermutationExplainer(SHAPBase):
Current limitations of the GPU version (support in progress):
- Batched execution, both to support larger datasets and to accelerate
  smaller ones, is not implemented yet.
- Only the tabular masker is supported, via passing the background
  dataset explicitly. Since the new API of SHAP is still evolving, the
  supported API for this version is the old one
  (i.e. ``explainer.shap_values()``). The new one, and the new SHAP
  Explanation object, will be supported in the next version.
- Hierarchical clustering for Owen values is planned for the near
  future.
- Sparse data support is not yet implemented.

Parameters
----------
@@ -119,7 +120,7 @@
cuML's permutation SHAP supports tabular data for now, so it expects
a background dataset, as opposed to a shap.masker object. To respect
a hierarchical structure of the data, use the (temporary) parameter
`masker_type`.
Acceptable formats: CUDA array interface compliant objects like
CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas
DataFrame/Series.
@@ -174,7 +175,6 @@
... n_features=10,
... noise=0.1,
... random_state=42)
>>> X_train, X_test, y_train, y_test = train_test_split(
... X,
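Likewise for the permutation explainer, a hedged sketch mirroring the kernel example above; `npermutations=10` is an assumed illustrative value (more permutations lower the variance of the SHAP estimates), and the temporary `masker_type` parameter is left at its default.

```python
# Hedged sketch mirroring the kernel explainer example; npermutations=10
# is an assumed illustrative value, and masker_type keeps its default.
from cuml.datasets import make_regression
from cuml.svm import SVR
from cuml.experimental.explainer import PermutationExplainer

X, y = make_regression(n_samples=102, n_features=10,
                       noise=0.1, random_state=42)
X_train, X_test = X[:100], X[100:]
y_train = y[:100]

model = SVR().fit(X_train, y_train)

# The explainer evaluates the model along forward and backward passes
# over random feature permutations against the background data.
explainer = PermutationExplainer(model=model.predict, data=X_train)
shap_values = explainer.shap_values(X_test, npermutations=10)
```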
