Sync with upstream #20

Merged
merged 86 commits into from
Dec 28, 2020

Conversation

daxiongshu
Owner

No description provided.

zbjornson and others added 30 commits November 18, 2020 06:56
`#include <cuml/manifold/umap.hpp>` works now.

Co-authored-by: Corey J. Nolet <cjnolet@users.noreply.github.com>
* Moving conftest.py files around and adding quick_run plugin

* Adding PR to CHANGELOG

* Incorporating feedback from code review
* Initial cython test commit

* Update changelog

* Style fixes

Co-authored-by: Nanthini Balasubramanian <nathanb@nvidia.com>
Co-authored-by: Dante Gama Dessavre <dante.gamadessavre@gmail.com>
…precation warnings (#3155)

* Get rid of warnings in random projections test

* Update changelog

* Fix style

* Update other deprecated make_blob imports
* FIX Force local install by specifying exact build string

* DOC Update changelog

* Update ci/gpu/build.sh

Co-authored-by: AJ Schmidt <ajschmidt8@users.noreply.github.com>
* Update flake8 config to join python/cython configuration and improve setup to check __init__.py files

* Fixing linting issues in previously ignored __init__.py files

* Adding PR to CHANGELOG

* Incorporating feedback from code review

* Fixing style issues after merge with branch-0.17

Co-authored-by: Corey J. Nolet <cjnolet@users.noreply.github.com>
Co-authored-by: Dante Gama Dessavre <dante.gamadessavre@gmail.com>
…kip-ci] (#3144)

* Adding ability to set arbitrary cmake flags in ./build.sh via the $CUML_ADDL_CMAKE_ARGS variable

* Adding PR to CHANGELOG

* Adding more help info requested from code review.

Co-authored-by: John Zedlewski <904524+JohnZed@users.noreply.github.com>
* Adding brute force knn shell to sparse

* Stubbing out algorithm flow

* Adding initial headers to wrapper

* Performing idx batching

* Starting to fill in cusparse calls

* Checking in

* Beginning to add selection kernel

* Finished header

* Updates. Need to finish populating merge buffer

* Using block select for selecting k and using 3-partition merge buffer

* Logic is just about done.

* Checking in changes. Need to swap out cuda 11 cusparse calls for cuda 10.2 version

* Everything is building. Need end-to-end test

* Running clang format

* Updating changelog

* Using raft's cusparse_wrappers.h instead of cuml

* Removing cuda11-required GEMM calls (commenting them out for now, will swap them out shortly)

* Fixing clang style

* Separating distance computation from selection from general brute force algorithm to make pieces more reusable

* Updating clang style

* Adding batcher to help ease batch state management

* Fixing clang style

* More clang fixes

* IP distance is computed using search * index.T.

* Making type template for value_t all the way through knn_merge_parts

* Adding simple googletest for sparse pairwise dists. The transpose conversion seems super expensive, but maybe it's necessary.

* Completing test for basic inner product distances

* Removing prints from test

* Cleaning up batching for knn. Ready to gtest

* KNN w/ max inner product is working

* Adding guts of expanded l2 computation.

* Cleaning up some debug prints

* Fixing clang format

* More cleanup and clang style fix

* Fixing style for sparse knn prim test

* Hoping i've captured all the clang updates

* Updating per include_checker

* I feel like I'm bouncing back and forth between clang and include checker

* Refactoring sparse pairwise dists to return dense outputs

* Beginning python layer

* Adding python layer for sparse inputs to nearest neighbors

* End to end sparse knn works. Need to finish norms for expanded euclidean and expose it.

* Removing unused file

* Adding gtest for expanded l2.

* Sparse l2 matches sklearn

* Fixing clang format style

* Fixing style in gtests

* Lots of changes and cleanup. Still need to flip the batching

* Progress on tiling. Still a failure when tile sizes don't match up.

* Tiling w/ uneven batch sizes works! Now just need to figure out what to do when the leftover values are <k

* Some further optimizations are necessary, but this works for now.

* Ready for cleanup

* Parametrizing sparse knn tests

* More cleanup.

* Fixing clang format

* Fixing clang format style

* Fixing flake8 for sparse nn tests

* Fixing googletests

* More cleanup of sparse knn

* Adding sparse support to UMAP by abstracting the inputs

* Everything's building. Have one template issue to fix in the sparse knn

* Updates to API

* Using a struct to manage the knn graph output state

* C++ side is largely done. Still need to figure out what to do w/ the separate int64_t type in the sparse knn

* Removing examples/comms, which seems to have gotten re-checked in by mistake

* Fixing c++ style

* Fixing include checks

* This darn style checker is going to kill me.....

* Adding template type params for output

* UMAP is officially accepting sparse inputs

* More cleanup

* Cleaning up gtests and making them easier to write

* Fixing up and parametrizing tests

* Fixing style

* Fixing python style

* More clang format style fixes

* Pulled umap inputs classes to more shared location so tsne can use them.

Added kselection gtest

* Updating clang format

* Fixing bad ide refactor

* Updating changelog

* Fixing more clang format

* Fixing flake8 style. Not sure why these didn't show up locally

* Decomposing sparse knn into a class.

* Review feedback

* Better umap sparse test

* More testing updates

* Adding docs to some of the remaining prims in csr.cuh

* Adding gtests for transpose and row slice. Need to add one for todense

* GTest for csr to dense

* Fixing style

* Removing debug logging from new gtests

* Fixing flake8 style

* Getting build to pass

* Running clang-tidy

* Fixing format for sparse gtests

* Adding 'algo_params' to get_param_names()

* Removing cumlarray output in kneighbors

* Finishing review feedback

* Fixing style

* Fixing format

* clang-format

* Style changes

* More review updates

* Style updates

* Running clang format on distance.cuh

* Running clang format on tests

* Fixing cython style

* Updating RAFT commit

* Updating neighbors from bad merge
…mples_leaf (#3132)

* Enforce min_rows_per_node in experimental RF backend

* Add min_samples_split hyperparameter

* Use correct definition of min_samples_split

* Rename range_len -> n_samples

* Add min_samples_split to Dask docstring

* Rename min_rows_per_node -> min_samples_leaf

* Update docstring for min_samples_leaf

* Correctly apply min_samples_split in new RF backend

* Address reviewer's comment

* Fix broken tests in BatchedLevelAlgo/DtRegTestF.Test

* Adjust accuracy requirement in test RFBatchedRegTests/RFBatchedRegTestF.Fit/5

* Add unit tests for min_samples_split, min_samples_leaf

* Add descriptive comments for compound literals

* Fix formatting

* Add changelog

* Organize unit tests under prefix BatchedLevelAlgoUnitTest

* Change default value for min_samples_leaf to 1

* Deprecate min_rows_per_node; guide users to use min_samples_leaf

* Fix style error
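
The two hyperparameters named in the commits above can be illustrated with a minimal sketch. It assumes the scikit-learn definitions (`min_samples_split` is the minimum number of samples a node needs before it may be split; `min_samples_leaf` is the minimum number of samples each child must keep). The function name `split_allowed` is hypothetical and is not taken from the cuML code base.

```python
def split_allowed(n_samples, n_left, min_samples_split, min_samples_leaf=1):
    """Return True if a node with n_samples rows may be split so that
    n_left rows go to the left child (hypothetical helper, sklearn semantics).

    min_samples_leaf defaults to 1, matching the default chosen in the commits.
    """
    n_right = n_samples - n_left
    # A node that is too small is never split, whatever the candidate split is.
    if n_samples < min_samples_split:
        return False
    # Both children must retain at least min_samples_leaf rows.
    return n_left >= min_samples_leaf and n_right >= min_samples_leaf
```

For instance, `split_allowed(10, 2, 2, min_samples_leaf=3)` rejects the split because the left child would hold only two rows.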
…ors (#3113)

* FEA Add preferred_order class parameter to linear models

* ENH adopt tags from scikit-learn API to support preferred order attribute

* DOC remove attribute docstrings

* FIX Change straggling classes

* FIX Change straggling classes

* FIX Add missing self

* FIX straggling attribute

* ENH Add device data tag for proposal

* FEA Add all scikit-learn API tags to base and improve gpu input types tag

* FEA Add preferred_order tag to cluster models

* FEA Add preferred_order tag to most models

* ENH Improvements and PR review feedback

* DOC add tag documentation to estimator guide

* DOC add scikit link

* Update wiki/python/ESTIMATOR_GUIDE.md

Co-authored-by: Corey J. Nolet <cjnolet@users.noreply.github.com>

* Update wiki/python/ESTIMATOR_GUIDE.md

Co-authored-by: Corey J. Nolet <cjnolet@users.noreply.github.com>

* Update wiki/python/ESTIMATOR_GUIDE.md

Co-authored-by: Corey J. Nolet <cjnolet@users.noreply.github.com>

* Update wiki/python/ESTIMATOR_GUIDE.md

Co-authored-by: Corey J. Nolet <cjnolet@users.noreply.github.com>

* Update wiki/python/ESTIMATOR_GUIDE.md

Co-authored-by: Corey J. Nolet <cjnolet@users.noreply.github.com>

* ENH Rename test_fit to test_api and add tags tests

* FIX fixes from PR review

* DOC Added entry to changelog

* FIX PEP8 fixes

Co-authored-by: Corey J. Nolet <cjnolet@users.noreply.github.com>
* Removing extra unneeded file

* Updating changelog
…#3152)

* FIX Access to attributes of individual NB objects in dask NB

* DOC Added entry to changelog

* ENH Add pytest

* FIX PEP8 fixes

Co-authored-by: John Zedlewski <904524+JohnZed@users.noreply.github.com>
…the tiniest models (#3032)

* just control block count

* blocks_per_sm can now be passed through treelite_params_t or forest_params_t

* changelog

* made blocks_per_sm mandatory; added tests; fixed a bug

* changelog

* added tests, moved __syncthreads() to common for all acc's, removed most blockIdx.x uses

* removed blocks_per_sm from python API, to avoid a longer discussion on best set

* simplified output loops

* addressed other review comments

* fixed bad merge conflict resolution

* comment for blocks_per_sm in fil.pyx

* style
* binary reduction: half way there

* quaternary reduction

* changelog

* remove accidental files

* generalize the multireduction

* adding dedicated tests for multireduction; style

* change trap; into setting an atomic.

* split into n tests, one per size

* ?

* tried thrust + rmm, no rmm dependency in tests it seems

* no rmm, sync allocations

* style

* fixed some testing bugs; expanded test to all block sizes; better documentation

* fixed wrong test

* simplify comparison

* member -> non-member function pointer as test template argument

* style

* replaced reduction with simpler code; tuned radix towards fewer classes

* fixed compile dependency and runtime discrepancy

* long comment line

* fix build issues

* Apply suggestions from code review

Co-authored-by: Andy Adinets <adinetz@gmail.com>

* addressed review comments

Co-authored-by: Andy Adinets <adinetz@gmail.com>
* add dask-glm demo link

* add to changelog

Co-authored-by: Corey J. Nolet <cjnolet@users.noreply.github.com>
Co-authored-by: Dante Gama Dessavre <dante.gamadessavre@gmail.com>
Updated with 0.15 and 0.16 release dates.

Co-authored-by: Corey J. Nolet <cjnolet@users.noreply.github.com>
* Remove outdated, extraneous file

* Update changelog
* Expose silhouette score in Python

* Style fix

* Correct dtypes used in silhouette_score

* Update changelog

* Fix style

* Update linebreaks

* Add copyright headers

* Collapse Python silhouette_score to single file

* Restructure silhouette_score for consistency

* Fix style

* Loosen silhouette score test tolerance
…[skip-ci] (#3175)

* FIX Fix gtest pinned cmake version for build from source option

* DOC Added entry to changelog
…3176)

* Add probabilistic SVM tests with various input array types

* DOC update changelog
* Fix a bug in MSE metric calculation

* Style fix

* Add changelog

* Try smaller grid dimensions
* blocks_per_sm FIL parameter in Python.

* Updated CHANGELOG.md.

* Fixed style errors.

* Reduced the number of parameter combinations in the Python test.
)

* Enable pipeline usage for OneHotEncoder and LabelEncoder

* Changelog update
* Adding simple dask estimator notebook to demonstrate saving/loading

* Renaming and updating cells

* Updating source.rst

* Updating changelog

* Updating pickling notebook

* Review updates

* More review feedback

Co-authored-by: John Zedlewski <904524+JohnZed@users.noreply.github.com>
* Fix + multiple improvements

* Update changelog

* Update model output and testing

* Check style update

* Update comments

* Test one query partition

* Check style
…dically-failing FIL test [skip ci] (#3196)

* Disable ascending=false path for sortColumnsPerRow

* DOC Update changelog

* Disable flaky FIL test

Co-authored-by: John Zedlewski <jzedlewski@nvidia.com>
Co-authored-by: John Zedlewski <904524+JohnZed@users.noreply.github.com>
* FIX Fix EXITCODE override in test_notebooks script

* DOC Changelog update

* FIX Move bash trap to after the GTests so they fail immediately

* FIX Move codecov block to gpu build
* Fix cuDF to cuPy conversion (missing value)

* Changelog update

* Introducing fail_on_nan parameter

* Adding test with fail_on_nan=True

* Updating conversion

* Rename fail_on_nan into fail_on_null
This PR fixes the attribute error reported in #3183, along with two additional PCA bugs: one on the input type (the `sparse_scipy_to_cp()` function call was missing an argument) and one on the shape of `self.singular_values_`.

I am also adding tests for the bugs fixed here.

Authors:
  - Mickael Ide <ide.mickael@gmail.com>
  - John Zedlewski <904524+JohnZed@users.noreply.github.com>

Approvers:
  - Divye Gala
  - John Zedlewski

URL: #3190
GPUtester and others added 29 commits December 3, 2020 02:56
[gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
Authors:
  - divyegala <divyegala@gmail.com>
  - John Zedlewski <904524+JohnZed@users.noreply.github.com>

Approvers:
  - Dante Gama Dessavre
  - Dante Gama Dessavre

URL: #3241
[gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
…'s RF(#3245)

Rename rows_sample -> max_samples to be consistent with sklearn's RF.

From https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html:

> **max_samples**: int or float, default=None
> If bootstrap is True, the number of samples to draw from X to train each base estimator.
> If None (default), then draw X.shape[0] samples.
> If int, then draw max_samples samples.
> If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0, 1).
> New in version 0.22.
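
The quoted scikit-learn rule can be read as a small function. This is a sketch of the documented semantics only, not cuML's implementation; the name `n_samples_to_draw` is hypothetical, and truncating the float case with `int()` is an assumption, since the quoted text does not say how fractional results are rounded.

```python
def n_samples_to_draw(n_rows, max_samples=None):
    """Rows drawn per base estimator under the quoted max_samples rule.

    Hypothetical helper: follows the scikit-learn docstring quoted above.
    """
    if max_samples is None:
        # Default: draw X.shape[0] samples.
        return n_rows
    if isinstance(max_samples, bool):
        raise TypeError("max_samples must be None, an int, or a float")
    if isinstance(max_samples, int):
        # An int is an absolute number of samples.
        return max_samples
    if isinstance(max_samples, float):
        if not 0.0 < max_samples < 1.0:
            raise ValueError("a float max_samples must lie in (0, 1)")
        # A float is a fraction of the rows (rounding mode is an assumption).
        return int(max_samples * n_rows)
    raise TypeError("max_samples must be None, an int, or a float")
```

With 1000 rows, `None` draws 1000 samples, `200` draws 200, and `0.25` draws 250.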

Authors:
  - Hyunsu Cho <chohyu01@cs.washington.edu>

Approvers:
  - John Zedlewski

URL: #3245
[gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
…tical(#3243)

Closes #3231 
Closes #3128
Partially addresses #3188 

The degenerate case (labels all identical in a node) is now robustly handled, by computing the MSE metric separately for each of the three nodes (the parent node, the left child node, and the right child node). Doing so ensures that the gain is 0 for the degenerate case.

The degenerate case may occur in some real-world regression problems, e.g. house price data where the price label is rounded up to the nearest 100k.

As a result, the MSE gain is now computed in much the same way as the MAE gain.

Disadvantage: now we always make two passes over data to compute the gain.
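
A minimal pure-Python sketch of the idea, not the CUDA kernel from this PR: the MSE is computed independently for the parent, left, and right nodes, each in two passes (mean first, then squared deviations). Weighting each child by its share of the samples is an assumption based on the usual definition of split gain; the names `mse` and `mse_gain` are hypothetical.

```python
def mse(labels):
    """Mean squared error of labels around their own mean (two passes)."""
    mean = sum(labels) / len(labels)                           # pass 1: mean
    return sum((y - mean) ** 2 for y in labels) / len(labels)  # pass 2: deviations


def mse_gain(left, right):
    """Gain of splitting a node into left and right label lists."""
    parent = left + right
    n = len(parent)
    # Each of the three nodes gets its own MSE; children are weighted by size.
    return (mse(parent)
            - (len(left) / n) * mse(left)
            - (len(right) / n) * mse(right))
```

When every label in the node is identical, all three MSE values are exactly zero, so the gain is exactly zero and no rounding residue can make a useless split look attractive.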

cc @teju85 @vinaydes @JohnZed

Authors:
  - Hyunsu Cho <chohyu01@cs.washington.edu>
  - Philip Hyunsu Cho <chohyu01@cs.washington.edu>

Approvers:
  - Thejaswi Rao
  - John Zedlewski

URL: #3243
[gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
Authors:
  - Corey J. Nolet <cjnolet@gmail.com>
  - Corey J. Nolet <cjnolet@users.noreply.github.com>

Approvers:
  - John Zedlewski

URL: #3250
[gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
)

* Hide silhouette_score Python binding

Remove this feature due to memory issues in C++ implementation for
anything but modest numbers of samples

* Remove silhouette_score tests

* Update changelog

* Remove unused import

* Remove silhouette_score from new features list

* Add note on reason for hiding silhouette_score

* Update docstrings with silhouette_score warning

Also remove silhouette_score from api.rst docs

* Update CHANGELOG to restore reference to reverted PR
[gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
Answers #3232.
Explicitly specify `batch_size` as a parameter to MNMG KNN models in order to make it visible in the documentation.

Authors:
  - viclafargue <viclafargue@nvidia.com>
  - Corey J. Nolet <cjnolet@gmail.com>

Approvers:
  - Corey J. Nolet
  - John Zedlewski

URL: #3246
[gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
…#3282)

* FIX Add secondary test to kernel explainer pytests for stability in Volta

* DOC Added entry to changelog

* FIX PR review feedback
[gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
* Correct pure virtual declaration in manifold_inputs_t

* Update changelog
Remove keyword "stops" from call to cudf.core.column.string.slice, which no longer accepts arbitrary keywords.

cuDF change introduced in rapidsai/cudf#6750.

Authors:
  - William Hicks <whicks@nvidia.com>

Approvers:
  - John Zedlewski
  - Micka

URL: #3289
Linear SVR has the coef_ attribute in the python layer. In the C++ unit test the same vector is denoted by _w_, and it is defined as a linear combination of the support vectors

![image](https://user-images.githubusercontent.com/3671106/101908077-ce3d9e80-3bbb-11eb-98ff-e7be90828dde.png)

The number of elements in _w_ is n_cols. One of the SVR tests defined only 1 expected value for _w_, instead of the expected n_cols=2 values, which led to accessing an uninitialized value. This would fail the test unless the value was accidentally zero-initialized. Surprisingly, this happened extremely rarely.

This PR fixes the expected value _w_exp_.
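
To see why _w_ must have n_cols entries, here is a small sketch of a weight vector formed as a linear combination of support vectors. It assumes the standard form w = Σ αᵢ xᵢ; the function name `primal_weights` and the sample numbers are illustrative and are not taken from the cuML test.

```python
def primal_weights(dual_coefs, support_vectors):
    """Combine support vectors into w, one entry per feature column.

    Hypothetical helper: dual_coefs[i] multiplies support_vectors[i].
    """
    n_cols = len(support_vectors[0])
    w = [0.0] * n_cols
    for alpha, x in zip(dual_coefs, support_vectors):
        for j in range(n_cols):
            w[j] += alpha * x[j]
    return w


# Two support vectors with two features each give a two-element w.
w = primal_weights([0.5, -0.5], [[1.0, 2.0], [3.0, 0.0]])
```

Here `w` is `[-1.0, 1.0]`, so an expected-value array with a single element would leave the second comparison reading memory that was never set.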

Authors:
  - Tamas Bela Feher <tfeher@nvidia.com>

Approvers:
  - Dante Gama Dessavre

URL: #3294
Closes #1780

Adding kNN graph input functionality to t-SNE, a request broken off from issue #1733. t-SNE gathers kNN indices and distances in the first stage of its computation; users who supply their own kNN graph can skip this step. This should follow #1815 as closely as possible.

**Benefits of this**:
- allow user custom run of kNN algorithm
- can use different distance function instead of t-SNE euclidean default
- allows for speedup if performing grid search by storing and reusing kNN graph

**Includes:**
- [x] Abstracted `extract_knn_graph` so it can be used for both UMAP and t-SNE
- [x] Implemented kNN graph input to Python/Cython layer and C++/CUDA layer
- [x] C++/CUDA Barnes Hut and Exact t-SNE tests
- [x] Python t-SNE tests
- [x] General code cleanup wherever needed
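
As an illustration of the first two benefits, the sketch below builds a brute-force kNN graph with a caller-chosen metric in pure Python. It only shows what a precomputed graph (per-row neighbor indices and distances) could look like; how the graph is handed to t-SNE is not shown, and the names `knn_graph` and `manhattan`, as well as the choice to exclude each point from its own neighbor list, are assumptions.

```python
def manhattan(a, b):
    """L1 distance, used here as an example of a non-Euclidean metric."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_graph(points, k, metric):
    """Return (indices, distances): the k nearest other points for each row."""
    indices, distances = [], []
    for i, p in enumerate(points):
        # Distance from row i to every other row, nearest first.
        ranked = sorted((metric(p, q), j) for j, q in enumerate(points) if j != i)
        nearest = ranked[:k]
        indices.append([j for _, j in nearest])
        distances.append([d for d, _ in nearest])
    return indices, distances


indices, distances = knn_graph([(0, 0), (1, 0), (5, 0)], k=1, metric=manhattan)
```

Because the graph depends only on the data and the metric, it can be computed once and reused across every parameter combination in a grid search.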

Authors:
  - Aleksander Ficek <alex.ficek99@gmail.com>
  - Corey J. Nolet <cjnolet@gmail.com>
  - Ray Douglass <3107146+raydouglass@users.noreply.github.com>
  - Corey J. Nolet <cjnolet@users.noreply.github.com>

Approvers:
  - Corey J. Nolet

URL: #2592
* FEA Consolidate linear model gemm based predicts on one function on C++

* FEA Consolidate linear model gemm based predicts on one function on Python

* DOC Added entry to changelog

* FIX PEP8 fixes

* FIX Forgot clang-format

* FIX Remove C++ sync calls and unnecessary delete on Python based on PR feedback

* DOC Remove changelog entry
…3292)

* Refactoring: move internal FIL interface to a separate file.

- move the functions not related to treelite import, prediction
  or freeing the model to a separate file

* Fixed style errors.
This PR will enable the usage of multiple KNN strategies as alternatives to the current default bruteforce method. See #574

Authors:
  - wxbn <wxbn@live.fr>
  - viclafargue <viclafargue@nvidia.com>
  - Corey J. Nolet <cjnolet@gmail.com>

Approvers:
  - Corey J. Nolet

URL: #2780
This PR fixes CI failures that happen in `test_naive_bayes` when the machine can't download the 20 newsgroups dataset.

It closes #3260

Authors:
  - Mickael Ide <ide.mickael@gmail.com>

Approvers:
  - John Zedlewski

URL: #3291
* Adding NotFittedError to PCA

* Fixed typo in PCA import

* Fixed check_is_fitted call

* Fixed missing parenthesis

* Added test on svd_flip

* fix style ipca

* Fixed whitespace style

* Removed useless test
- only the node types without the `_t` suffix are now used
- removed the functions necessary to handle node types with the `_t` suffix
Ensure that the 100th quantile value returned by cupy.percentile is the maximum of the input array rather than (possibly) NaN due to cupy/cupy#4451. This eliminates an intermittent failure observed in tests of KBinsDiscretizer, which makes use of cupy.percentile. Note that this includes an alteration of the included sklearn code and should be reverted once the upstream cupy issue is resolved.
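
The workaround can be sketched without a GPU. In this hedged example, `flaky_percentile` is a stand-in that merely mimics the reported symptom (NaN at the 100th percentile); it is not cupy's code, and `safe_percentile` is a hypothetical wrapper rather than the patch applied in this PR.

```python
import math


def linear_percentile(values, q):
    """Percentile with linear interpolation between sorted neighbors."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def flaky_percentile(values, q):
    """Stand-in for the upstream issue: returns NaN when q is 100."""
    return float("nan") if q == 100 else linear_percentile(values, q)


def safe_percentile(values, q, percentile_fn):
    """Force the 100th percentile to be the maximum of the input."""
    if q == 100:
        return max(values)
    return percentile_fn(values, q)
```

Calling `safe_percentile(data, 100, flaky_percentile)` returns the maximum of `data` while every other quantile still goes through the wrapped function, which keeps the workaround easy to remove once the upstream fix lands.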

Resolve failure due to ValueError described in #2933.

Authors:
  - William Hicks <whicks@nvidia.com>

Approvers:
  - Dante Gama Dessavre
  - Victor Lafargue

URL: #3315
This PR converts the confusion matrix to int when possible, to avoid scientific notation in the output.

See this example:

![image](https://user-images.githubusercontent.com/9810050/101400035-9808d200-38d0-11eb-9f81-4d217a5ff202.png)
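
A rough sketch of the conversion, using nested lists instead of device arrays. Treating "when possible" as "every entry is integer-valued" is an assumption, and `to_int_if_possible` is a hypothetical name; the actual condition used in the PR may differ.

```python
def to_int_if_possible(matrix):
    """Return an int copy of matrix if all entries are whole numbers.

    Otherwise the matrix is returned unchanged (e.g. fractional weights).
    """
    if all(float(v).is_integer() for row in matrix for v in row):
        return [[int(v) for v in row] for row in matrix]
    return matrix


counts = to_int_if_possible([[1200000.0, 3.0], [0.0, 5.0]])
```

Printing `counts` shows `1200000` rather than a float rendering such as `1.2e+06`, while a matrix containing fractional values is passed through untouched.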

Authors:
  - Mickael Ide <ide.mickael@gmail.com>
  - Mickael Ide <mide@nvidia.com>

Approvers:
  - John Zedlewski

URL: #3275
…#3281)

Replace "constexpr static" member variables in the DecisionTree unit test fixture with "const" member variables for compliance with C++14, which otherwise requires that const static data members be separately defined at namespace scope if they are ODR-used (see sections 3.2 and 9.4.2 of the C++11 standard, which remain relevant until C++17).

Authors:
  - William Hicks <whicks@nvidia.com>

Approvers:
  - Dante Gama Dessavre

URL: #3281
@daxiongshu daxiongshu merged commit 8b1b7c3 into daxiongshu:branch-0.18 Dec 28, 2020