[tests][dask] Increase number of partitions in data #4149
Conversation
Linking the discussion this came from: #3829 (comment)
@@ -255,7 +255,7 @@ def test_classifier(output, task, boosting_type, tree_learner, client):
         'bagging_fraction': 0.9,
     })
 elif boosting_type == 'goss':
-    params['top_rate'] = 0.5
+    params['top_rate'] = 0.7
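For context, top_rate is the GOSS parameter controlling how many of the largest-gradient rows are always retained. A minimal sketch of how it might sit in a parameter dict; the surrounding keys are illustrative, not the actual test code:

```python
# Illustrative sketch only; the test's real parameter dict is not reproduced here.
params = {
    'boosting_type': 'goss',
    'top_rate': 0.7,    # retain the 70% of rows with the largest gradients
    'other_rate': 0.1,  # randomly sample 10% of the remaining small-gradient rows
}
```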
it looks like this was added since I last reviewed (981084f). Can you please explain why it's necessary?
test_classifier became flaky in this PR. I assume it's because previously we weren't performing distributed training, or at least not every time, so adding this generated some failures in multiclass classification for data_parallel-dart, voting_parallel-rf (this one is very surprising, given that the atol is 0.8), voting_parallel-gbdt, voting_parallel-dart and voting_parallel-goss. Most of them are for dataframes with categoricals, but there are a couple with sparse matrices. I have to debug them to see what's actually happening; this is a very simple classification problem and I'd expect to get a perfect score with little effort. I'll ping you here once I'm done but it could take a bit haha.
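For readers following along: the failures above come from comparing distributed predictions against a single-process model with a loose tolerance. A minimal, self-contained sketch of that kind of comparison (not the actual test code; the data, cluster setup, and parameters are assumptions):

```python
import dask.array as da
import lightgbm as lgb
import numpy as np
from distributed import Client, LocalCluster
from sklearn.datasets import make_blobs

# Generate a simple blob classification problem and split it into many partitions.
X, y = make_blobs(n_samples=1_000, centers=3, random_state=42)
dX = da.from_array(X, chunks=(50, X.shape[1]))
dy = da.from_array(y, chunks=50)

with LocalCluster(n_workers=2, threads_per_worker=1) as cluster, Client(cluster):
    # Distributed training through the Dask interface.
    dask_clf = lgb.DaskLGBMClassifier(n_estimators=10).fit(dX, dy)
    p_dask = dask_clf.predict_proba(dX).compute()

# Equivalent single-process model for comparison.
local_clf = lgb.LGBMClassifier(n_estimators=10).fit(X, y)
p_local = local_clf.predict_proba(X)

# Test-style assertion: probabilities should agree within a loose absolute tolerance.
np.testing.assert_allclose(p_dask, p_local, atol=0.8)
```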
got it, thanks! Let me know if you need any help
Regarding "I'd expect to get a perfect score with little effort": given the small dataset sizes we use in tests, I think it would be useful to set min_data_in_leaf: 0 everywhere. That might improve the predictability of the results.
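A minimal sketch of that suggestion, assuming it would just be folded into the shared parameter dict used by the tests; the surrounding keys are illustrative:

```python
# Illustrative only: allow arbitrarily small leaves on the tiny test datasets.
params = {
    'objective': 'multiclass',
    'num_class': 3,
    'min_data_in_leaf': 0,
}
```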
Sorry this is taking so long, I haven't had much time and I'm really confused by this. The same data point makes the test fail even for data_parallel and gbdt. I'm trying to figure out what exactly is going on here; I have the test in a while loop and it eventually fails because of that data point. I'm not sure what's wrong with it haha.
Btw, setting min_data_in_leaf=0 gives this error: LightGBMError: Check failed: (best_split_info.right_count) > (0) at /hdd/github/LightGBM/src/treelearner/serial_tree_learner.cpp, line 663. Do you think this could be related to #4026? This data is shuffled, but I think forcing few samples in a leaf makes it more likely to get an empty split in one of the workers.
Will do. I'm actually looking into this right now; it seems to be related to the amount of data each worker gets. With more partitions both workers get data, but I believe it may sometimes not be balanced, and in those cases the tests fail.
Yeah, exactly haha. You could also try increasing
And the same goes for increasing
Sorry, I don't understand the axes in that plot or what you mean by "the local model gives 99.8%".
Haha, sorry. This is a zoom of the lower left section of the data that gets generated for classification. These points all correspond to the same class (centered at [-4, -4]); the axes are the continuous features. The percentages are the probabilities that each model gives to the class (class 1 in this case). It seems strange that the red dot gets a lower probability given that it's not that far from the center and there are others further away.
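To make the description concrete, here is a rough sketch of data like the one being discussed; the exact centers and generator used by the test helper are assumptions, not the real code:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Blobs with one class centered at [-4, -4]; the other centers are made up.
centers = np.array([[-4, -4], [4, 4], [-4, 4]])
X, y = make_blobs(n_samples=1_000, centers=centers, random_state=42)

# Points of the class centered at [-4, -4], i.e. the region shown in the zoomed plot.
lower_left = X[y == 0]
print(lower_left.min(axis=0), lower_left.max(axis=0))
```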
@jameslamb I just tried this again today and that single point still makes the test fail. Should I close this? Or I can try to make a notebook for you to debug; maybe you can find something else. I'm not sure if #4220 is the reason or if it's something else.
so weird! Thanks for all your investigation so far. Could you merge the latest
I have a notebook that I've been using for this, I can maybe upload it here. Do you think that'd help you?
I uploaded my notebook here. I forgot to specify the CPUs, so it only has two, but changing the
Perfect, thanks!
This increases the default number of partitions of the collections returned by _create_data to 20. The purpose is to make it less likely that a single worker gets all the partitions, and to give more confidence that distributed training is actually being performed across all tests.
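A minimal sketch of the idea: the helper name and the partition count mirror the description above, but the body of the function is illustrative, not the actual implementation of _create_data:

```python
import dask.array as da
from sklearn.datasets import make_blobs

def _create_data(n_samples=1_000, n_partitions=20):
    # Split the generated data into many partitions so that, with two workers,
    # it is very unlikely that a single worker ends up with every partition.
    X, y = make_blobs(n_samples=n_samples, centers=3, random_state=42)
    chunk_size = n_samples // n_partitions
    dX = da.from_array(X, chunks=(chunk_size, X.shape[1]))
    dy = da.from_array(y, chunks=chunk_size)
    return dX, dy

dX, dy = _create_data()
print(dX.numblocks)  # (20, 1): 20 row-wise partitions
```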