
[tests][dask] Increase number of partitions in data #4149

Closed · wants to merge 7 commits

Conversation

@jmoralez (Collaborator) commented Apr 1, 2021

This increases the default number of partitions of the collections returned by _create_data to 20. The purpose is to make it less likely that a single worker gets all the partitions, so we can be more confident that distributed training is actually being performed across all tests.
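For illustration, here is a minimal sketch of the kind of helper being changed, assuming _create_data generates a blob dataset and chunks it into dask collections. The helper body, sample count, and the second blob center are illustrative, not the test's exact code:

```python
import dask.array as da
from sklearn.datasets import make_blobs

def _create_data(n_samples=1_000, n_partitions=20):
    # Two well-separated blobs; one center at [-4, -4] as discussed later in this thread.
    X, y = make_blobs(n_samples=n_samples, centers=[[-4, -4], [4, 4]], random_state=42)
    # Splitting into ~20 chunks makes it far less likely that the scheduler
    # places every chunk on a single worker, so training is actually distributed.
    chunk_size = n_samples // n_partitions
    dX = da.from_array(X, chunks=(chunk_size, X.shape[1]))
    dy = da.from_array(y, chunks=chunk_size)
    return dX, dy
```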

@jameslamb (Collaborator)

Linking the discussion this came from: #3829 (comment)

@@ -255,7 +255,7 @@ def test_classifier(output, task, boosting_type, tree_learner, client):
             'bagging_fraction': 0.9,
         })
     elif boosting_type == 'goss':
-        params['top_rate'] = 0.5
+        params['top_rate'] = 0.7
Collaborator:

it looks like this was added since I last reviewed (981084f). Can you please explain why it's necessary?

Collaborator Author:

test_classifier became flaky in this PR. I assume it's because previously we weren't performing distributed training, or at least not every time, so adding this generated some failures in multiclass classification for data_parallel-dart, voting_parallel-rf (this one is very surprising, given that the atol is 0.8), voting_parallel-gbdt, voting_parallel-dart, and voting_parallel-goss. Most of them are for dataframe with categoricals, but there are a couple with sparse matrices. I have to debug them to see what's actually happening; this is a very simple classification problem and I'd expect to get a perfect score with little effort. I'll ping you here once I'm done, but it could take a bit haha.

Collaborator:

got it, thanks! Let me know if you need any help

Collaborator:

I'd expect to get a perfect score with little effort

Given the small dataset sizes we use in tests, I think it would be useful to set min_data_in_leaf: 0 everywhere. That might improve the predictability of the results.
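As a rough illustration of that suggestion, the test's parameter dict would gain one entry (the surrounding values here are hypothetical; only min_data_in_leaf is the point):

```python
# Hypothetical test params (values illustrative); min_data_in_leaf=0 relaxes the
# per-leaf minimum so tiny test datasets are less sensitive to how rows land on each worker.
params = {
    'objective': 'multiclass',
    'num_class': 3,
    'tree_learner': 'data',   # distributed data-parallel mode under test
    'min_data_in_leaf': 0,
}
```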

Collaborator Author:

Sorry this is taking so long, I haven't had much time and I'm really confused by this. The same data point makes the test fail even for data_parallel and gbdt. I'm trying to figure out what exactly is going on here; I have the test in a while loop and it eventually fails because of that data point, and I'm not sure what's wrong with it haha.

Collaborator Author:

Btw, setting min_data_in_leaf=0 gives this error: LightGBMError: Check failed: (best_split_info.right_count) > (0) at /hdd/github/LightGBM/src/treelearner/serial_tree_learner.cpp, line 663. Do you think this could be related to #4026? This data is shuffled, but I think forcing few samples per leaf gives a higher chance of getting an empty split in one of the workers.

Collaborator Author:

Here's abs(local_probas - dask_probas) per iteration for data_parallel gbdt for just that one sample (index 377):
[plot: per-iteration absolute difference between the local and Dask predicted probabilities for sample 377]
From the 7th iteration onwards the probabilities start to differ increasingly; I think there's definitely something strange going on here.
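A sketch of how such a per-iteration comparison might be produced, assuming local_model is a fitted lightgbm.LGBMClassifier, dask_model a fitted lightgbm.DaskLGBMClassifier trained on the same data, and X the local feature matrix (the sample index 377 comes from the comment above; the variable names are illustrative):

```python
import numpy as np

# Bring the distributed model back to a plain sklearn-style estimator so both
# models are evaluated locally on exactly the same row.
dask_as_local = dask_model.to_local()

diffs = []
for i in range(1, local_model.n_estimators + 1):
    local_probas = local_model.predict_proba(X[[377]], num_iteration=i)
    dask_probas = dask_as_local.predict_proba(X[[377]], num_iteration=i)
    # Largest absolute difference across classes at this iteration.
    diffs.append(np.abs(local_probas - dask_probas).max())
```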

@jameslamb (Collaborator)

@jmoralez we fixed a few more CI issues today (#4168, #4158, #4167). Whenever you return to this, please update to the latest master. Sorry for the inconvenience.

@jmoralez (Collaborator Author) commented Apr 10, 2021

Will do. I'm actually looking into this right now; it seems to be related to the amount of data each worker gets. With more partitions both workers get data, but I believe it may not always be balanced, and in those cases the tests fail: adding a client.rebalance() and bumping up the threshold for the probas from 0.03 to 0.05 makes the tests pass. What are your thoughts about using rebalance here?

Edit
Adding a rebalance kind of beats the purpose of this PR haha. Let me just keep moving the chunksize a bit, 100 looks promising.
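(For reference, the rebalance experiment described above would look roughly like this; client is the test's dask client, and dX / dy stand in for the dask collections used in the test, so all names are assumptions.)

```python
from distributed import wait

# Materialize the chunks on the cluster, wait for them to finish, then ask the
# scheduler to spread the in-memory data evenly across workers before training.
dX, dy = client.persist([dX, dy])
wait([dX, dy])
client.rebalance()
```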

@jameslamb (Collaborator)

Adding a rebalance kind of beats the purpose of this PR haha. Let me just keep moving the chunksize a bit, 100 looks promising.

Yeah, exactly haha. You could also try increasing n_samples from 1e3 to 1e4 or something! It's possible that you're running into some problems that are more severe with tiny amounts of data. pytest will report the total timing so you should be able to see (and share with reviewers) the impact of using more data on runtime.

@jameslamb (Collaborator)

And the same goes for increasing n_estimators or num_leaves. If training a slightly larger model improves the stability of the tests, that's totally fine.

@jmoralez (Collaborator Author)

It's very strange because it's always the same point that gets a lower score, and it doesn't seem logical. The local model gives 99.8% and the distributed one 93.4%.
[scatter plot of the generated classification data, zoomed in on one cluster, with a single point highlighted in red]
The red dot is the one that causes the test to fail.

@jameslamb (Collaborator)

Sorry, I don't understand the axes in that plot or what you mean by "the local model gives 99.8%".

@jmoralez (Collaborator Author)

Haha, sorry. This is a zoom of the lower-left section of the data that gets generated for classification. These points all correspond to the same class (centered at [-4, -4]); the axes are the two continuous features. The percentages are the probabilities that each model assigns to the class (class 1 in this case). It seems strange that the red dot gets a lower probability, given that it's not that far from the center and there are others further away.
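A sketch of how that zoomed view could be reproduced, assuming X is the local numpy feature matrix of the generated data and 377 is the index of the highlighted sample (both taken from the earlier comments; the masking threshold is illustrative):

```python
import matplotlib.pyplot as plt

# Keep only the cluster centered at [-4, -4] (lower-left quadrant) and
# highlight the single sample whose probability disagrees between models.
mask = (X[:, 0] < 0) & (X[:, 1] < 0)
plt.scatter(X[mask, 0], X[mask, 1], s=10, label='cluster at [-4, -4]')
plt.scatter(X[377, 0], X[377, 1], color='red', s=60, label='failing sample')
plt.legend()
plt.show()
```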

@jmoralez (Collaborator Author)

@jameslamb I just tried this again today and that single point still makes the test fail. Should I close this? Or I can try to make a notebook for you to debug and maybe you can find something else, I'm not sure if #4220 is the reason or if it's something else.

@jameslamb (Collaborator)

@jameslamb I just tried this again today and that single point still makes the test fail. Should I close this? Or I can try to make a notebook for you to debug and maybe you can find something else, I'm not sure if #4220 is the reason or if it's something else.

so weird! Thanks for all your investigation so far.

Could you merge the latest master into this branch? I can pick it up from here and see if I find anything else. Sometimes when you've been staring at the same problem for this long, it just requires a second set of eyes. I want to see if we can figure this out, because I think there's a chance we'll uncover a bug in distributed training similar to #4026.

@jmoralez (Collaborator Author)

I have a notebook that I've been using for this; maybe I can upload it here. Do you think that'd help you?

@jmoralez (Collaborator Author)

I uploaded my notebook here. I forgot to specify the number of CPUs, so it only has two, but changing threads_per_worker to 1 in the Client allows you to replicate the issue there.
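For context, the client setup being described is roughly the following (the worker count is an assumption based on the comment above; the rest of the configuration is illustrative):

```python
from distributed import Client, LocalCluster

# Two single-threaded workers: with the data split into multiple partitions,
# each worker should receive some chunks, so training is genuinely distributed.
cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)
```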

@jameslamb (Collaborator)

Perfect, thanks!

@jmoralez closed this Jun 1, 2022
@github-actions (bot)

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023