
[python-package] [dask] Add DaskLGBMRanker #3708

Merged: 20 commits merged into microsoft:master on Jan 22, 2021

Conversation

@ffineis (Contributor) commented Jan 2, 2021

Hello,

This is a PR to include support for a DaskLGBMRanker. Previously only DaskLGBMClassifier and DaskLGBMRegressor were supported. This PR was originally for the dask-lightgbm repo, but was migrated here after the incorporation of the recent lightgbm.dask module. Tests were added and passed when run from an image built from a modification of dockerfile-python.

Note: it's easier to use Dask DataFrames with the DaskLGBMRanker than Dask arrays, thanks to df.set_index, although it is possible to use Dask arrays if the training set is built in group-wise chunks. The constraints of the ranking task mean we cannot break up contiguous groups, which is why chunks/partitions need to be constructed from whole groups somehow. There are no tests for sparse arrays with DaskLGBMRanker at the moment.
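For concreteness, here is a minimal sketch of the DataFrame route (illustrative only, not code from this PR; the column names and data are made-up assumptions): set_index on a group id column places all documents of a group in the same partition.

import dask.dataframe as dd
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
n_groups, docs_per_group = 10, 20
n = n_groups * docs_per_group

df = pd.DataFrame({
    "group_id": np.repeat(np.arange(n_groups), docs_per_group),
    "feature_1": rng.normal(size=n),
    "feature_2": rng.normal(size=n),
    "relevance": rng.randint(0, 4, size=n),  # graded relevance labels
})

# set_index shuffles/sorts by group_id, so every row of a given group lands in
# one partition; a partition may hold several whole groups, but never part of one.
ddf = dd.from_pandas(df, npartitions=4).set_index("group_id")

The group sizes for each partition can then be derived from that index before calling fit.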

Thanks! Excited for this new module.

  • Frank

@ffineis requested a review from jameslamb as a code owner January 2, 2021 23:15
@ghost commented Jan 2, 2021

CLA assistant check
All CLA requirements met.

@jameslamb changed the title from Dask/ranker to [python-package] [dask] Add DaskLGBMRanker on Jan 3, 2021
@jameslamb (Collaborator) left a comment

sorry it's taking me so long to review and test this! I'm planning to review it tomorrow.

@ffineis (Contributor, author) commented Jan 11, 2021

Hey, yeah, no problem. Unfortunately, I can't get the test_dask.py tests passing 100% reliably. The errors are random and hard to reproduce - sometimes everything passes. Failures seem limited to GPU builds and the ranker tests. But the [LightGBM] [Fatal] Socket recv error, code: 104 issue, I believe, predates these new ranker tests and affects the regressor/classifier as well.

The strange part is that when I execute the container (i.e. instead of through a bash shell), the tests have passed every time. It's only upon running pytest test_dask.py within a REPL in interactive mode that exceptions pop up.

Dockerfile:

FROM ubuntu:16.04

ARG CONDA_DIR=/opt/conda
ENV PATH $CONDA_DIR/bin:$PATH

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        ca-certificates \
        cmake \
        build-essential \
        gcc \
        g++ \
        git \
        vim \
        wget && \
    # python environment
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    /bin/bash Miniconda3-latest-Linux-x86_64.sh -f -b -p $CONDA_DIR && \
    export PATH="$CONDA_DIR/bin:$PATH" && \
    conda config --set always_yes yes --set changeps1 no && \
    # lightgbm
    conda install -q -y numpy scipy scikit-learn pandas pytest "dask[array]>=2.0.0" "dask[dataframe]>=2.0.0" "dask[distributed]>=2.0.0" && \
    git clone --recursive --branch dask/ranker --depth 1 https://github.com/ffineis/LightGBM && \
    cd LightGBM/python-package && python setup.py install && \
    # clean
    apt-get autoremove -y && apt-get clean && \
    conda clean -a -y && \
    rm -rf /usr/local/src/*

CMD pytest -s /LightGBM/tests/python_package_test/test_dask.py -k "ranker" --log-cli-level DEBUG
  • Instability appears to be networking-related. When I run tests from within an interactive shell on the container (in -it mode), I often run into LightGBMError: "Binding port 13xxx failed" locally. On Travis the errors include [LightGBM] [Fatal] Socket recv error, code: 104, which I have yet to encounter locally. Both of these issues have been hard to get answers on.
  • I've tried using my own client fixture built from a distributed.LocalCluster via distributed.Client (below), as well as upping the default Client timeout; no dice on either - I'll still occasionally get the "Binding port" exception on a random regressor/ranker/classifier test:
import pytest
from distributed import Client

@pytest.fixture()
def client(request, loop):
    print('setting up client')
    cli = Client(loop=loop, timeout='120s')

    def teardown():
        cli.close()
        print('closing client')

    request.addfinalizer(teardown)

    return cli
  • The test Exceptions on Travis are being triggered on client teardown. I think there's a worker that hangs, causing the disconnect coroutine to wait for more than the default of 3 seconds, and the Exception seems secondary to the test Failure - i.e. I think these would go away if the underlying test failure went away.

Hey @StrikerRUS, sorry to drag you in here! Wondering if you might have any thoughts on networking port/socket issues?

@StrikerRUS (Collaborator)

@ffineis

Wondering if you might have any thoughts on networking port/socket issues?

I'm sorry, I have no ideas...

@jameslamb (Collaborator) left a comment

This is looking awesome, thank you! Sorry it took so long to get you a first review. Now that I have a testing framework set up for reviewing LightGBM Dask stuff (https://github.com/jameslamb/lightgbm-dask-testing), I'll be able to give you feedback a lot more quickly.

I tested this PR on that setup and everything looks good! I like the way you set up the unit tests, thank you so much for all the effort you put into that!

I left some recommended changes. I also have one more question below.

although it's possible to use Dask arrays if the trainset is built in group-wise chunks. The constraints of the ranking task force us to not break up contiguous groups, which is why chunks/partitions need to be constructed from groups somehow.

Does this mean "one chunk / partition per group" or do you mean "all documents for a group need to be in the same chunk / partition"? Like, if I have 10 groups could I split them into 2 partitions?

(Collapsed review threads on tests/python_package_test/test_dask.py and python-package/lightgbm/dask.py)
"""Helper function taken from dask_ml.metrics: computes coefficient of determination."""
numerator = ((dy_true - dy_pred) ** 2).sum(axis=0)
denominator = ((dy_true - dy_pred.mean(axis=0)) ** 2).sum(axis=0)
return (1 - numerator / denominator).compute()
Collaborator:

why was it necessary to replace the dask-ml one with this function?

If that's just to cut out our testing dependency on dask-ml (which would be much appreciated!), could you make that a separate PR with this change plus removing dask-ml from here:

  • conda install -q -y -n $CONDA_ENV dask dask-ml distributed joblib matplotlib numpy pandas psutil pytest scikit-learn scipy

I think that PR would be small and non-controversial, so we could review and merge it quickly.

@ffineis (Contributor, author) commented Jan 16, 2021

Cool, yeah, I'll follow up #3708 with a PR to remove dask-ml as a test dependency. See reply to comment re: accuracy_score (above).

Contributor (author):

Addressed in d6e5209

Collaborator:

hey @ffineis is it cool if I pick up the "remove testing dependency on dask-ml" thing?

If you want to make the contribution I'll totally respect that, but if it's just extra work you'd rather not do, I'm happy to make a PR right now for it. Wanna be respectful of your time.

(Collapsed review threads on tests/python_package_test/test_dask.py and .ci/test.sh)
@ffineis (Contributor, author) commented Jan 16, 2021

Does this mean "one chunk / partition per group" or do you mean "all documents for a group need to be in the same chunk / partition"? Like, if I have 10 groups could I split them into 2 partitions?

Yep, the latter - you can place whole separate groups into different partitions as long as all documents in a group are in the same partition. The partitioning needs to work like a GROUP BY statement. Basically, if documents in the same group were broken up across separate partitions, then the value of the group vector passed to the LightGBMRanker constructor on each local worker would not be accurate, and this would degrade the quality of the gradients calculated under the lambdarank objective.

Dask.DataFrame is amenable to preserving group IDs by using set_index. Dask.array does not seem to have a similar "group partitions according to this column" functionality, hence the use of da.concatenate.
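To sketch the Dask array side of that (again, illustrative only, not code from the PR), chunks can be made group-aligned by concatenating one block per query group, so that chunk boundaries coincide with group boundaries:

import dask.array as da
import numpy as np

rng = np.random.RandomState(0)
group_sizes = [30, 50, 20, 40]  # number of documents per query group

# one block per group; after concatenation every chunk contains exactly one
# whole group (several whole groups could also be packed into a single block)
blocks = [da.from_array(rng.normal(size=(g, 5)), chunks=(g, 5)) for g in group_sizes]
dX = da.concatenate(blocks, axis=0)

print(dX.chunks[0])  # (30, 50, 20, 40): chunk sizes mirror the group sizes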

@ffineis (Contributor, author) commented Jan 16, 2021

Hmmm ok, so the latest test issues (only on the Azure pipeline):

  • Getting random teardown Exceptions (note: Exceptions, not test failures) caused by timeouts in the client fixture. These are random across regressor/classifier/ranker. I can search around for related issues. If anything, maybe I can use partial and re-parameterize the disconnect method in distributed.utils_test.py.
  • On GPU builds, getting generic "Invalid value" exceptions intermixed with "Socket recv error, code: 104" exceptions, only for ranking task tests. I'm going to try out lightgbm-dask-testing a little bit to see if I can reproduce, but that's probably unlikely as this seems limited to GPU builds.

@jameslamb (Collaborator)

ok I can also try to look into those errors

@ffineis (Contributor, author) commented Jan 18, 2021

Hey, so, updates on this. What I've been able to find is this - dask/dask-ml#611 - it looks like the same exception, triggered by a timeout during do_disconnect.

I'm thinking that we just install a wrapper that uses pytest.skip if the TimeoutError is triggered from do_disconnect, basically like this:

import os

import functools
import pytest

import dask.array as da
import numpy as np
from distributed.utils_test import client, cluster_fixture, gen_cluster, loop

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

num_tests = 5
alpha_vec = np.linspace(1e-5, stop=2, num=num_tests)

def handle_async_timeout_exceptions(f):
    @functools.wraps(f)
    def wrapper(client, **kwargs):
        try:
            return f(client, **kwargs)
        except ValueError as e:
            print(str(e))
            if 'my fake exception' in str(e):
                msg = 'Ignoring do_disconnect error in fixture teardown: ' + str(e) + ' kwargs: ' + str(kwargs)
                print(msg)
                pytest.skip(msg)
            else:
                raise(e)
    return wrapper

def bootstrap_lasso(X, y, alpha):
    n = X.shape[0]
    boot_idx = np.random.choice(range(n), size=n, replace=True)
    oob_idx = np.array(np.setdiff1d(range(n), boot_idx))

    reg = Lasso(alpha=alpha)
    reg.fit(X[boot_idx, :], y[boot_idx])

    return reg.score(X[oob_idx, :], y[oob_idx])


@pytest.mark.parametrize('alpha', alpha_vec)
@handle_async_timeout_exceptions
def test_lasso(client, alpha):
    if alpha >= 1:
        raise ValueError('my fake exception')

    X, y = make_regression(n_samples=int(1e5), n_features=10, random_state=123)
    r2_futures = client.map(lambda i: bootstrap_lasso(X=X, y=y, alpha=alpha), range(10))
    r2_values = client.gather(r2_futures)

    print('alpha = {:.6f} -- avg(r2) = {:.2f}'.format(alpha, np.mean(r2_values)))

    client.close()  # maybe this will help avoid errors in teardown...?

FWIW I ran this with num_tests = 100 and couldn't reproduce the teardown exception.

@jameslamb (Collaborator)

Hey, so, updates on this. What I've been able to find is this - dask/dask-ml#611 - it looks like the same exception, triggered by a timeout during do_disconnect.

I'm thinking that we just install a wrapper that uses pytest.skip if the TimeoutError is triggered from do_disconnect, basically like this:
FWIW I ran this with num_tests = 100 and couldn't reproduce the teardown exception.

Thanks for the investigation @ffineis . Since this is only an error in fixture teardown, I'm good with your proposal. I want to go as fast as possible on Dask changes between now and LightGBM 3.2.0, so if this helps get this PR through I think it's ok and we can try to fix the upstream issue later as a maintenance task. I'm convinced by what you wrote and my own tests that this wouldn't cover up a legitimate problem with lightgbm itself.

Can you just add a comment or docstring (whichever you want) on that fixture, linking to the dask-ml issue?

@jameslamb
Copy link
Collaborator

ayyy those changes seem to have worked! Nice idea closing down the cluster.

Seeing that that works, maybe in the future I'll just cut every test over to @gen_cluster. I just learned about https://distributed.dask.org/en/latest/develop.html#writing-tests tonight, and in there they describe @gen_cluster like this

easier to debug and cause the fewest problems with shutdowns.

But now that this seems to be working ok, don't touch it in this PR 😆

I'll give this another review right now. Thanks for working through it!

@jameslamb (Collaborator) left a comment

This is looking great, thank you! I really really like the functions you've added for creating a ranking dataset in tests.

I added a few minor comments. After you address those, I'd like to request another review from @StrikerRUS for general Python things.

For the other TODOs that got pulled out of this PR, I've created the following issues:

(Collapsed review threads on python-package/lightgbm/dask.py and tests/python_package_test/test_dask.py)
@ffineis (Contributor, author) commented Jan 20, 2021

ayyy those changes seem to have worked! Nice idea closing down the cluster.

Seeing that that works, maybe in the future I'll just cut every test over to @gen_cluster. I just learned about https://distributed.dask.org/en/latest/develop.html#writing-tests tonight, and in there they describe @gen_cluster like this

easier to debug and cause the fewest problems with shutdowns.

But now that this seems to be working ok, don't touch it in this PR 😆

I'll give this another review right now. Thanks for working through it!

Haha thx, man I was ready to just stop using the distributed fixtures and use a localcluster setup/teardown. This was a guess but glad it worked. Yep absolutely will address the outstanding issues. Lol @ gen_cluster, yeah I'll wait to see it.

@StrikerRUS (Collaborator)

After you address those, I'd like to request another review from @StrikerRUS for general Python things.

Sure! I'd be happy to do this.

@ffineis (Contributor, author) commented Jan 20, 2021

Thanks for such a thorough investigation @ffineis ! That all sounds totally fine to me.

The smaller n_iterations and num_leaves was a tiny optimization to save CI time. If it's causing issues with checking correctness, it's totally not worth it. I don't want you to do a ton of work just to make it work with those smaller settings just to save a few seconds of test time.

For the GPU builds...I think it's ok to skip them for LGBMRanker. You can rely on os.environ["TASK"] == "gpu" to detect it (https://github.com/microsoft/LightGBM/blob/master/.vsts-ci.yml#L47). Those tests aren't even checking that lightgbm.dask works with GPU-accelerated training. It's just doing regular CPU training using a version of LightGBM that happens to have been compiled with GPU support. So we don't lose anything by skipping the ranker tests there, and can figure out the issues in a later PR.

By the way, I have a feature request up to legit support GPU-based training in the Dask interface, like allowing you to train on data in dask-cudf dataframes (#3776 ). But until that gets picked up, we're not making any guarantees about the Dask interface working with GPU training.

Hey, sorry I never got back to you on this - yeah, totally agree that special dask-GPU support is the way to go! Just a heads up - I got the teardown error on a DaskLGBMRegressor task during a GPU test (logs here), so I added the pytest.skip for all GPU tasks in 56bb2c9.
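For reference, the kind of guard described above might look roughly like this (a hedged sketch based on the TASK environment variable jameslamb mentions; the actual change in 56bb2c9 may differ):

import os

import pytest

# LightGBM's CI sets TASK=gpu for GPU build jobs (see .vsts-ci.yml), so the
# Dask tests can be skipped wholesale on those jobs for now.
if os.getenv("TASK", "") == "gpu":
    pytest.skip("lightgbm.dask tests are currently skipped on GPU builds",
                allow_module_level=True)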

@ffineis (Contributor, author) commented Jan 21, 2021

Hey, can we merge? Just want to unblock the several PRs I feel are waiting in the wings, haha.

@jameslamb (Collaborator) left a comment

Thanks for all the hard work! From my perspective, this is ready to be merged.

However, I'd still like @StrikerRUS to do one more review for general Python things or similarity of patterns between here and the non-Dask sklearn interface.

@StrikerRUS (Collaborator) left a comment

@ffineis Thank you very much for the awesome contribution! Below are just my minor comments; please check.

(Collapsed review threads on python-package/lightgbm/dask.py)
class DaskLGBMRanker(_LGBMModel, LGBMRanker):
"""Docstring is inherited from the lightgbm.LGBMRanker."""

def fit(self, X, y=None, sample_weight=None, group=None, client=None, **kwargs):
Collaborator:

@jameslamb What about init_score? Is it supported or we should add feature request for it?

init_score : array-like of shape = [n_samples] or None, optional (default=None)
Init score of training data.

Collaborator:

we should have a feature request. I'll write it up and add a link here.

@ffineis, could you add init_score=None here between sample_weight and group, so the order matches the sklearn interface for LGBMRanker (sample_weight=None, init_score=None, group=None)? That way, if people have existing sklearn code with positional arguments to fit(), they won't accidentally have their init_score interpreted as group.

And can you just then add a check like this?

if init_score is not None:
    raise RuntimeError("init_score is not currently supported in lightgbm.dask")

@StrikerRUS speaking of positional arguments, I'll open another issue where we can discuss how to handle the client argument. But let's leave that out of this PR.
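Putting those two requests together, the updated signature might look roughly like the following (a sketch of the shape of the change only, inside the DaskLGBMRanker class quoted above; not the exact committed code):

def fit(self, X, y=None, sample_weight=None, init_score=None, group=None, client=None, **kwargs):
    """Docstring is inherited from the lightgbm.LGBMRanker."""
    if init_score is not None:
        raise RuntimeError("init_score is not currently supported in lightgbm.dask")
    # ... hand off to the shared distributed-training helper as before ...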

Collaborator:

@jameslamb
Yes, sure! Agree with all your intents.

Collaborator:

init_score: #3807

client placement: #3808

Contributor (author):

Yep, no prob

(Collapsed review threads on tests/python_package_test/test_dask.py)
@@ -96,6 +234,8 @@ def test_classifier(output, centers, client, listen_port):
assert_eq(y, p2)
assert_eq(p1_proba, p2_proba, atol=0.3)

client.close()
Collaborator:

@jameslamb Do you agree with this change, which is unrelated to the Ranker itself?

Collaborator:

Yes, I'm ok with this. I'm not 100% sure that it is unrelated, even though it's touching non-ranker tests. Since this PR is close to being merged and since there aren't other PRs waiting on this fix (the timeout / shutdown issues in Dask tests have not been as severe elsewhere), I'm ok leaving it.

(Collapsed review thread on tests/python_package_test/test_dask.py)
Comment on lines +418 to +421
rnkvec_local = dask_ranker.to_local().predict(X)

# distributed and to-local scores should be the same.
assert_eq(rnkvec_dask, rnkvec_local)
Collaborator:

@jameslamb I like this test very much! Can we add the same for the other estimators? It seems the Classifier tests do not assert that the real Dask predict() equals .to_local().predict().

Collaborator:

the classifier tests do check that.

I'd be open to a PR that makes it more explicit and cuts down the total number of tests by combining test_classifier_local_predict (def test_classifier_local_predict(client, listen_port):) into test_classifier (def test_classifier(output, centers, client, listen_port):), though!

Not an action we need on this PR; I'll write it up as a "good first issue".

Collaborator:

@jameslamb

comparison between Dask predict() and to_local().predict():

Actually, this checks to_local().predict(dX) against local predict(). It is not the same as comparing to_local().predict(dX) against Dask predict(), given that the asserts you are pointing to are placed in two different (in theory unrelated) tests, I think.

Collaborator:

it's confusing because it's two tests, so I think we should merge the tests in the future like I said above (not anything you need to do on here, @ffineis).

I'm saying that taken together, they show transitively that dask_model.predict() == dask_model.to_local().predict(), even though neither one does that direct comparison.

  • (line 92) dask_model.predict() == sklearn_model.predict()
  • (line 138) dask_model.to_local().predict() == sklearn_model.predict()

But that's only partially true, since test_local_predict only does binary classification, and only on array data. So combining them would give more coverage and would show whether that result holds up for multi-class classification or for other input data types.

Collaborator:

it's confusing because it's two tests, so I think we should merge the tests in the future like I said above

Totally agree!

I'm saying that taken together, they show transitively that dask_model.predict() == dask_model.to_local().predict()

Also agree. But two different tests can take different inputs (params, data), can be modified separately in the future, etc. So it may not be a fully fair comparison. Merging them into one test will solve all issues. 🙂

@StrikerRUS (Collaborator) left a comment

I lost one of my suggestions during the previous review, about issubclass, and the docstring for the group param still doesn't match the same docstring in other places.

Everything else looks good to me. Thanks!

I'm approving this PR to not delay it anymore and let @jameslamb merge it finally.
You can commit my current suggestions right from the web interface. Just click the Commit suggestion button if you agree with a suggestion. No need to waste time re-typing them locally and then pushing with git.

(Collapsed review threads on python-package/lightgbm/dask.py and tests/python_package_test/test_dask.py)
ffineis and others added 3 commits January 21, 2021 22:40 (Co-authored-by: Nikita Titov <nekit94-08@mail.ru>)
@jameslamb (Collaborator)

All the tests are passing! 🎉

I just took one more look at the diff, and all looks good. So I'm going to merge this. Thanks so much for this fantastic contribution @ffineis , and to @StrikerRUS for the help with thorough reviews.

@jameslamb merged commit 3c7e7e0 into microsoft:master Jan 22, 2021
@ffineis (Contributor, author) commented Jan 22, 2021

Thanks @jameslamb and @StrikerRUS for bearing with me!! Glad we could make it work.

@github-actions (bot)

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023