Merge pull request #3242 from rapidsai/branch-0.17
[gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
GPUtester authored Dec 3, 2020
2 parents 40d9568 + a76d80f commit ff17722
Showing 9 changed files with 185 additions and 100 deletions.
9 changes: 5 additions & 4 deletions CHANGELOG.md
@@ -10,12 +10,10 @@

## New Features
- PR #3164: Expose silhouette score in Python
- PR #3160: Least Angle Regression (experimental)
- PR #2659: Add initial max inner product sparse knn
- PR #2836: Refactor UMAP to accept sparse inputs
- PR #3126: Experimental versions of GPU accelerated Kernel and Permutation SHAP

## Improvements
- PR #3077: Improve runtime for test_kmeans
@@ -44,7 +42,7 @@
- PR #3115: Speeding up MNMG UMAP testing
- PR #3112: Speed test_array
- PR #3111: Adding Cython to Code Coverage
- PR #3129: Update notebooks README
- PR #3002: Update flake8 Config To Work With Per File Settings
- PR #3135: Add QuasiNewton tests
- PR #3040: Improved Array Conversion with CumlArrayDescriptor and Decorators
@@ -57,9 +55,11 @@
- PR #3155: Eliminate unnecessary warnings from random projection test
- PR #3176: Add probabilistic SVM tests with various input array types
- PR #3180: FIL: `blocks_per_sm` support in Python
- PR #3186: Add gain to RF JSON dump
- PR #3219: Update CI to use XGBoost 1.3.0 RCs
- PR #3221: Update contributing doc for label support
- PR #3177: Make Multinomial Naive Bayes inherit from `ClassifierMixin` and use it for score
- PR #3240: Minor doc updates

## Bug Fixes
- PR #3218: Specify dependency branches in conda dev environment to avoid pip resolver issue
@@ -103,6 +103,7 @@
- PR #3185: Add documentation for Distributed TFIDF Transformer
- PR #3190: Fix Attribute error on ICPA #3183 and PCA input type
- PR #3208: Fix EXITCODE override in notebook test script
- PR #3214: Correct flaky silhouette score test by setting atol
- PR #3216: Ignore splits that do not satisfy constraints

# cuML 0.16.0 (23 Oct 2020)
7 changes: 6 additions & 1 deletion README.md
@@ -108,7 +108,10 @@ repo](https://github.com/rapidsai/notebooks-contrib).
| | Epsilon-Support Vector Regression (SVR) | |
| **Time Series** | Holt-Winters Exponential Smoothing | |
| | Auto-regressive Integrated Moving Average (ARIMA) | Supports seasonality (SARIMA) |
| **Model Explanation** | SHAP Kernel Explainer | [Based on SHAP](https://shap.readthedocs.io/en/latest/) (experimental) |
| | SHAP Permutation Explainer | [Based on SHAP](https://shap.readthedocs.io/en/latest/) (experimental) |
| **Other** | K-Nearest Neighbors (KNN) Search | Multi-node multi-GPU via Dask+[UCX](https://github.com/rapidsai/ucx-py), uses [Faiss](https://github.com/facebookresearch/faiss) for Nearest Neighbors Query. |

---

## Installation
@@ -127,6 +130,8 @@ Please see our [guide for contributing to cuML](CONTRIBUTING.md).

## References

The RAPIDS team has a number of blogs with deeper technical dives and examples. [You can find them here on Medium.](https://medium.com/rapids-ai/tagged/machine-learning)

For additional details on the technologies behind cuML, as well as a broader overview of the Python Machine Learning landscape, see [_Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence_ (2020)](https://arxiv.org/abs/2002.04803) by Sebastian Raschka, Joshua Patterson, and Corey Nolet.

Please consider citing this when using cuML in a project. You can use the citation BibTeX:
2 changes: 1 addition & 1 deletion docs/Makefile
@@ -10,7 +10,7 @@ BUILDDIR = build

# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help -v "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

12 changes: 12 additions & 0 deletions docs/source/api.rst
@@ -170,6 +170,9 @@ Metrics (clustering and trustworthiness)
.. automodule:: cuml.metrics.cluster.homogeneity_score
:members:

.. automodule:: cuml.metrics.cluster.silhouette_score
:members:

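For orientation, here is a minimal sketch of calling the newly documented score on GPU data. This is hedged: the import path follows the automodule directive above, but the exact public symbol name (`cython_silhouette_score`) is an assumption to verify against the rendered API reference.

```python
# Hedged sketch of the newly documented silhouette score. The symbol
# name is assumed from the automodule path above; verify it against the
# rendered API reference.
from cuml.cluster import KMeans
from cuml.datasets import make_blobs
from cuml.metrics.cluster.silhouette_score import cython_silhouette_score

# Synthetic clustered data, generated and kept on the GPU.
X, _ = make_blobs(n_samples=1000, n_features=8, centers=5, random_state=0)
labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)

# Mean silhouette coefficient over all samples; values lie in [-1, 1],
# and higher means better-separated clusters.
print(cython_silhouette_score(X, labels, metric='euclidean'))
```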
.. automodule:: cuml.metrics.cluster.completeness_score
:members:

@@ -504,3 +507,12 @@
Normalizer, RobustScaler, SimpleImputer, StandardScaler,
add_dummy_feature, binarize, minmax_scale, normalize,
PolynomialFeatures, robust_scale, scale


Model Explanation (SHAP)
------------------------
.. autoclass:: cuml.experimental.explainer.KernelExplainer
:members:

.. autoclass:: cuml.experimental.explainer.PermutationExplainer
:members:
27 changes: 27 additions & 0 deletions python/cuml/dask/neighbors/kneighbors_classifier.py
@@ -41,6 +41,33 @@ class KNeighborsClassifier(NearestNeighbors):
K-Nearest Neighbors Classifier is an instance-based learning technique
that keeps training samples around for prediction, rather than trying
to learn a generalizable set of model parameters.

Parameters
----------
n_neighbors : int (default=5)
Default number of neighbors to query
algorithm : string (default='brute')
The query algorithm to use. Currently, only 'brute' is supported.
metric : string (default='euclidean')
Distance metric to use.
weights : string (default='uniform')
Sample weights to use. Currently, only the 'uniform' strategy is
supported.
handle : cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for
computations in this model. Most importantly, this specifies the CUDA
stream that will be used for the model's computations, so users can
run different models concurrently in different streams by creating
handles in several streams.
If it is None, a new one is created.
verbose : int or boolean, default=False
Sets logging level. It must be one of `cuml.common.logger.level_*`.
See :ref:`verbosity-levels` for more info.
output_type : {'input', 'cudf', 'cupy', 'numpy', 'numba'}, default=None
Variable to control output type of the results and attributes of
the estimator. If None, it'll inherit the output type set at the
module level, `cuml.global_output_type`.
See :ref:`output-data-type-configuration` for more info.
"""
def __init__(self, client=None, streams_per_handle=0,
verbose=False, **kwargs):
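A hedged usage sketch for the estimator documented above, assuming a machine with one or more GPUs and `dask_cuda` installed; the tiny inline dataset and `npartitions=2` are illustrative choices, not part of this diff.

```python
# Hedged sketch: requires a GPU and dask_cuda; the data is illustrative.
import cudf
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.neighbors import KNeighborsClassifier

if __name__ == "__main__":
    client = Client(LocalCUDACluster())

    # Distributed training data with two obvious classes.
    X = cudf.DataFrame({"x0": [0.0, 0.1, 1.0, 1.1],
                        "x1": [0.0, 0.1, 1.0, 1.1]})
    y = cudf.Series([0, 0, 1, 1])
    X_d = dask_cudf.from_cudf(X, npartitions=2)
    y_d = dask_cudf.from_cudf(y, npartitions=2)

    # Only algorithm='brute' and weights='uniform' are supported, per
    # the docstring above, so the defaults are left in place.
    knn = KNeighborsClassifier(n_neighbors=2, client=client)
    knn.fit(X_d, y_d)
    print(knn.predict(X_d).compute())  # majority vote of the 2 neighbors
```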
27 changes: 27 additions & 0 deletions python/cuml/dask/neighbors/kneighbors_regressor.py
@@ -39,6 +39,33 @@ class KNeighborsRegressor(NearestNeighbors):
K-Nearest Neighbors Regressor is an instance-based learning technique
that keeps training samples around for prediction, rather than trying
to learn a generalizable set of model parameters.

Parameters
----------
n_neighbors : int (default=5)
Default number of neighbors to query
algorithm : string (default='brute')
The query algorithm to use. Currently, only 'brute' is supported.
metric : string (default='euclidean')
Distance metric to use.
weights : string (default='uniform')
Sample weights to use. Currently, only the 'uniform' strategy is
supported.
handle : cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for
computations in this model. Most importantly, this specifies the CUDA
stream that will be used for the model's computations, so users can
run different models concurrently in different streams by creating
handles in several streams.
If it is None, a new one is created.
verbose : int or boolean, default=False
Sets logging level. It must be one of `cuml.common.logger.level_*`.
See :ref:`verbosity-levels` for more info.
output_type : {'input', 'cudf', 'cupy', 'numpy', 'numba'}, default=None
Variable to control output type of the results and attributes of
the estimator. If None, it'll inherit the output type set at the
module level, `cuml.global_output_type`.
See :ref:`output-data-type-configuration` for more info.
"""
def __init__(self, client=None, streams_per_handle=0,
verbose=False, **kwargs):
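The regressor mirrors the classifier: with the only supported `weights='uniform'` strategy, each prediction is the plain mean of the targets of the `n_neighbors` nearest training samples. A hedged sketch under the same assumptions as the classifier example above:

```python
# Hedged sketch, same assumptions as the classifier example above.
import cudf
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.neighbors import KNeighborsRegressor

if __name__ == "__main__":
    client = Client(LocalCUDACluster())

    X = cudf.DataFrame({"x0": [0.0, 1.0, 2.0, 3.0]})
    y = cudf.Series([0.0, 1.1, 1.9, 3.1])
    X_d = dask_cudf.from_cudf(X, npartitions=2)
    y_d = dask_cudf.from_cudf(y, npartitions=2)

    # Each prediction averages the targets of the 2 nearest neighbors.
    knr = KNeighborsRegressor(n_neighbors=2, client=client)
    knr.fit(X_d, y_d)
    print(knr.predict(X_d).compute())
```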
35 changes: 18 additions & 17 deletions python/cuml/experimental/explainer/kernel_shap.pyx
@@ -70,26 +70,28 @@ cdef extern from "cuml/explainer/kernel_shap.hpp" namespace "ML":

class KernelExplainer(SHAPBase):
"""
GPU accelerated version of SHAP's kernel explainer (experimental).

Based on the SHAP package:
https://github.com/slundberg/shap/blob/master/shap/explainers/_kernel.py
Main differences of the GPU version:
- Data generation and Kernel SHAP calculations are significantly faster,
  at the cost of more model evaluations when both the observation being
  explained and the background data have many 0-valued columns.
- Support for SHAP's new Explanation and API will be available in the
  next version.
- There is a small initialization cost of a few seconds (similar to the
  training time of regular Scikit/cuML models), which is traded for
  faster explanations after that.
- Only tabular data is supported for now, via passing the background
  dataset explicitly. Since the new API of SHAP is still evolving, the
  main supported API right now is the old one
  (i.e. ``explainer.shap_values()``).
- Sparse data support is planned for the near future.
- Further optimizations are in progress.

Parameters
----------
@@ -109,7 +111,7 @@
nsamples : int (default = 2 * data.shape[1] + 2048)
Number of times to re-evaluate the model when explaining each
prediction. More samples lead to lower variance estimates of the SHAP
values. The "auto" setting uses ``nsamples = 2 * X.shape[1] + 2048``.
link : function or str (default = 'identity')
The link function used to map between the output units of the
model and the SHAP value units. From the SHAP package: The link
@@ -159,7 +161,6 @@ class KernelExplainer(SHAPBase):
... n_features=10,
... noise=0.1,
... random_state=42)
>>> X_train, X_test, y_train, y_test = train_test_split(
... X,
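The Examples block above is cut off by the diff view. Below is a hedged, self-contained reconstruction of the same flow; the `SVR` model, the manual 100/2 split, and the `is_gpu_model=True` flag are assumptions consistent with the visible fragment, not the verbatim docstring.

```python
# Hedged reconstruction of the truncated Examples block above. The SVR
# model, the manual 100/2 split, and is_gpu_model=True are assumptions.
from cuml.datasets import make_regression
from cuml.svm import SVR
from cuml.experimental.explainer import KernelExplainer

X, y = make_regression(n_samples=102, n_features=10,
                       noise=0.1, random_state=42)
X_train, X_test = X[:100], X[100:]
y_train = y[:100]

model = SVR().fit(X_train, y_train)

# The background dataset is passed explicitly (tabular data only for
# now), and the classic shap_values() API is used, as noted above.
explainer = KernelExplainer(model=model.predict,
                            data=X_train,
                            is_gpu_model=True)
shap_values = explainer.shap_values(X_test)  # shape (2, 10)
```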
24 changes: 12 additions & 12 deletions python/cuml/experimental/explainer/permutation_shap.pyx
@@ -87,6 +87,7 @@ cdef extern from "cuml/explainer/permutation_shap.hpp" namespace "ML":

class PermutationExplainer(SHAPBase):
"""
GPU accelerated version of SHAP's permutation explainer (experimental).

Initial experimental version of a GPU accelerated SHAP permutation
explainer:
@@ -99,16 +100,16 @@ class PermutationExplainer(SHAPBase):
Current limitations of the GPU version (support in progress):
- Batched execution, both to support larger datasets and to accelerate
  smaller ones, is not implemented yet.
- Only the tabular masker is supported, via passing the background
  dataset explicitly. Since the new API of SHAP is still evolving, the
  supported API for this version is the old one
  (i.e. ``explainer.shap_values()``). The new one, and the new SHAP
  Explanation object, will be supported in the next version.
- Hierarchical clustering for Owen values is planned for the near
  future.
- Sparse data support is not yet implemented.

Parameters
----------
@@ -119,7 +120,7 @@
cuML's permutation SHAP supports tabular data for now, so it expects
a background dataset, as opposed to a shap.masker object. To respect
a hierarchical structure of the data, use the (temporary) parameter
`masker_type`.
Acceptable formats: CUDA array interface compliant objects like
CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas
DataFrame/Series.
@@ -174,7 +175,6 @@
... n_features=10,
... noise=0.1,
... random_state=42)
>>> X_train, X_test, y_train, y_test = train_test_split(
... X,
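Likewise for the permutation explainer, a hedged sketch mirroring the kernel example above; `npermutations=10` is an assumed illustrative value (more permutations lower the variance of the SHAP estimates), and the temporary `masker_type` parameter is left at its default.

```python
# Hedged sketch mirroring the kernel explainer example; npermutations=10
# is an assumed illustrative value, and masker_type keeps its default.
from cuml.datasets import make_regression
from cuml.svm import SVR
from cuml.experimental.explainer import PermutationExplainer

X, y = make_regression(n_samples=102, n_features=10,
                       noise=0.1, random_state=42)
X_train, X_test = X[:100], X[100:]
y_train = y[:100]

model = SVR().fit(X_train, y_train)

# The explainer evaluates the model along forward and backward passes
# over random feature permutations against the background data.
explainer = PermutationExplainer(model=model.predict, data=X_train)
shap_values = explainer.shap_values(X_test, npermutations=10)
```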
