Check sampling aliases are set correctly in H2O XGBoost #8458

Closed
exalate-issue-sync bot opened this issue May 11, 2023 · 17 comments

Comments

@exalate-issue-sync

[~accountid:5d9dc9eb87dd6f0dcb4d4d98] reported a bug in which col_sample_rate has no effect when tree_method="approx" is used in XGBoost core. That issue will be addressed in a new Jira ticket: https://h2oai.atlassian.net/browse/PUBDEV-8368

Within this ticket, we improved the H2O XGBoost API to make sure that both col_sample_rate and colsample_bylevel (and the other XGBoost parameter aliases) are set correctly.

@exalate-issue-sync

Veronika Maurerová commented: Hi [~accountid:5d9dc9eb87dd6f0dcb4d4d98]. Thank you for reporting this bug.

Could you please provide more information on how you set up the model parameters? Are you using only Flow?

I found that the problem occurs when both col_sample_rate and colsample_bylevel are set to a value other than 1. For example, if you have colsample_bylevel=0.5 and col_sample_rate=0.3, the colsample_bylevel value overwrites the col_sample_rate value. In this case, you can see this message in the log: {{Using user-provided parameter colsample_bylevel instead of col_sample_rate.}}

If I train two models with different col_sample_rate values, for example in Python, everything works as expected. But when you reuse the native parameters from a model where colsample_bylevel was already set, the models will always be the same if you change only col_sample_rate.

It is definitely a bug on our side and we are working on fixing it. In the meantime, you can work around the issue by setting colsample_bylevel manually. Let me know if you have any other questions about this issue. 🙂

Veronika

@exalate-issue-sync

Mathijs de Jong commented: Hi [~accountid:5bd237b8dd3cc64b77e71676] , thanks for the detailed explanation.

I define the model and its parameters through pysparkling python code, and I set only {{col_sample_rate}} or {{colsample_bylevel}}. After that, I checked the parameters in Flow and in the MOJO file, where they look as expected. For either parameter, when I choose it to be < 1.0, I didn’t see any sampling happening.

I will check again whether setting only {{colsample_bylevel}} does the trick, and if it doesn’t, I will share an MWE that reproduces the issue. I am out of office for the next two weeks, so unfortunately I can only check when I am back. I hope that’s ok. :)

Thanks!

@exalate-issue-sync

Veronika Maurerová commented: Hi [~accountid:5d9dc9eb87dd6f0dcb4d4d98], thank you for your response.

I am sure that the problem is caused by setting {{colsample_bylevel}} together with {{col_sample_rate}}. I can see your settings in the first image: you have {{colsample_bylevel}} set to 0.3, so changing {{col_sample_rate}} in this configuration has no effect. But let me know when you try it. 🙂

!param.png|width=846,height=1089!

I hope to fix this as soon as possible.

@exalate-issue-sync

Veronika Maurerová commented: [~accountid:5d9dc9eb87dd6f0dcb4d4d98], I finished the fix, which checks that the paired parameters are not set to different values at the same time. The rules are now:

  • If you set col_sample_rate to a non-default value and leave colsample_bylevel at its default, the col_sample_rate value is used.
  • If you set colsample_bylevel to a non-default value and leave col_sample_rate at its default, the colsample_bylevel value is used.
  • If you set both col_sample_rate and colsample_bylevel to the same value, colsample_bylevel is used.
  • If you set col_sample_rate and colsample_bylevel to different values, an error is thrown.

This change will be available in the next major release (version 3.34.0.1), which will be out soon. Let me know if it helps. 🙂
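
The four rules listed above can be modelled in a few lines of plain Python. This is only an illustrative sketch, not H2O's actual implementation: the function name `resolve_alias`, its signature, and the error message are hypothetical, and the default of 1.0 is assumed from the discussion of these two parameters earlier in the thread.

```python
def resolve_alias(h2o_value, native_value, default=1.0,
                  h2o_name="col_sample_rate", native_name="colsample_bylevel"):
    """Return the effective rate for an H2O parameter and its XGBoost alias.

    Hypothetical helper that mirrors the rules described in the comment;
    it is not taken from the H2O code base.
    """
    h2o_set = h2o_value != default
    native_set = native_value != default

    # Both changed, but to different values: refuse to guess.
    if h2o_set and native_set and h2o_value != native_value:
        raise ValueError(
            f"{h2o_name}={h2o_value} conflicts with {native_name}={native_value}"
        )

    # Only the native alias changed, or both changed to the same value:
    # the native alias is used.
    if native_set:
        return native_value

    # Only the H2O parameter changed, or neither did (default is returned).
    return h2o_value
```

For example, `resolve_alias(0.3, 1.0)` yields the col_sample_rate value, while `resolve_alias(0.3, 0.5)` raises, which corresponds to the "error is thrown" case.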

@exalate-issue-sync

Mathijs de Jong commented: [~accountid:5bd237b8dd3cc64b77e71676] Thanks for the detailed debugging, and apologies for the late response.

Unfortunately, when I try out the different parameters, I get results that differ from your debugging. When I set only {{col_sample_rate}} or only {{colsample_bylevel}} to < 1.0, the result is the same: there is no stochasticity, which I assume means that no sampling takes place. (I do not fix the seed, of course; when I set another stochastic parameter like {{sample_rate}} to < 1.0, this does result in stochasticity.)

The screenshot that you describe in your comment was the result of a model that had only {{colsample_bylevel}} set (to 0.3). I see indeed that {{col_sample_rate}} also has an input value (of 1.0), but this was not set when I defined the model in Python. I just double checked that. Maybe that indicates that somehow H2O thinks that the parameter is given as input, even though it’s not.

I also tried setting both {{colsample_bylevel}} and {{col_sample_rate}} to the same value, which results in the screenshot that I attach. However, this does not give any stochasticity either, so I think that still no sampling takes place in that case.

This gives me the idea that we might be chasing two different bugs here?

!Screenshot 2021-09-06 at 16.33.16.png|width=557,height=722!

@exalate-issue-sync

Veronika Maurerová commented: Hi [~accountid:5d9dc9eb87dd6f0dcb4d4d98], thank you for your description. Could you send me the logs from the H2O server, please? There we can see exactly which parameters were sent to the machine.

Or if you can provide an example that reproduces this issue, let me know! I am currently trying to reproduce it, but I have not succeeded yet.

@exalate-issue-sync

Mathijs de Jong commented: Hi [~accountid:5bd237b8dd3cc64b77e71676], I have attached an MWE notebook to this comment. I added a few comments to explain what I am testing. I hope this helps to reproduce the issue.

I can of course also provide H2O logs. Can you give more detail about exactly which logs are useful, and how to obtain them?

[^MWE_PUBDEV_8266_1.ipynb]

@exalate-issue-sync

Veronika Maurerová commented: [~accountid:5d9dc9eb87dd6f0dcb4d4d98], thank you for the MWE. I tried to reproduce the issue, but your MWE does not reproduce it for me. I tried both 3.32.0.1 and the current master branch, and I always get the results we expect… 🤔

[^MWE_PUBDEV_8266_maurever.ipynb]

@exalate-issue-sync

Mathijs de Jong commented: [~accountid:5bd237b8dd3cc64b77e71676] The plot thickens. When I created the MWE, I ran it on our company H2O cluster, where column sampling does not work (as I said). When I try it on my local machine (macOS), I get the same results as you. Both runs used H2O version {{3.32.0.2}}.

Is there a way to debug this further, based on logs for example?

@exalate-issue-sync

Veronika Maurerová commented: [~accountid:5d9dc9eb87dd6f0dcb4d4d98], yes, logs could be helpful. Here is a detailed description of how to download them: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/logs.html . If you have any questions, let me know!

@exalate-issue-sync

Mathijs de Jong commented: [~accountid:5bd237b8dd3cc64b77e71676] I did some additional debugging with the help of the logs, and I found the following details:

  • When I run my MWE on a 1-node cluster, the problem of the column sampling does not show up, but when I run the same MWE on the same machine on a multi-node cluster, the problem always shows up. This explains why our results from the MWE were different.
  • The native XGBoost parameters in the logs are the same for a 1-node and a multi-node cluster, with the exception of {{tree_method}}: this parameter is {{exact}} on a 1-node cluster, and {{approx}} on a multi-node cluster.
  • When I manually set {{tree_method='approx'}} on the 1-node cluster, the problem of non-stochasticity for column sampling is also reproduced on the 1-node cluster. This setting would allow you to reproduce the issue on your side.
  • I tried running native XGBoost directly (not the H2O implementation) with our dataset, for the different parameters. I can exactly reproduce this behaviour in native XGBoost: when {{tree_method='approx'}} and I use column sampling (and set different seeds), there is no stochasticity for {{colsample_bynode}} and {{colsample_bylevel}} (but there is for {{colsample_bytree}}).

This leaves two options:

  • There is a bug in XGBoost itself (not in the H2O implementation).
  • I have an incorrect understanding of approximate tree building: when the approximate tree building method is used, should we not expect column sampling to take place for {{colsample_bynode}} and {{colsample_bylevel}}?
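
The stochasticity check described in this comment (train with different seeds and see whether predictions change) can be sketched as follows. The helper names `varies_across_seeds` and `make_xgboost_fit_predict`, the tolerance, and the number of boosting rounds are assumptions made for illustration; they are not taken from the attached MWE notebook. Only `xgboost.DMatrix`, `xgboost.train`, and `Booster.predict` are real XGBoost calls, and XGBoost is imported lazily so the generic helper works without it.

```python
def varies_across_seeds(fit_predict, seeds, tol=1e-9):
    """Return True if predictions differ between any two of the given seeds.

    `fit_predict(seed)` must train a model with that seed and return
    a sequence of predictions on a fixed dataset.
    """
    seeds = list(seeds)
    if len(seeds) < 2:
        raise ValueError("need at least two seeds to compare")
    baseline = list(fit_predict(seeds[0]))
    for seed in seeds[1:]:
        preds = list(fit_predict(seed))
        if any(abs(a - b) > tol for a, b in zip(baseline, preds)):
            return True
    return False


def make_xgboost_fit_predict(X, y, num_boost_round=20, **params):
    """Build a `fit_predict` callable backed by native XGBoost (assumed installed)."""
    import xgboost as xgb  # imported lazily; only needed for this helper

    dtrain = xgb.DMatrix(X, label=y)

    def fit_predict(seed):
        # The seed is the only thing that changes between runs.
        booster = xgb.train({**params, "seed": seed}, dtrain,
                            num_boost_round=num_boost_round)
        return booster.predict(dtrain)

    return fit_predict
```

A call such as `varies_across_seeds(make_xgboost_fit_predict(X, y, tree_method="approx", colsample_bylevel=0.3), seeds=[1, 2, 3])` would then express the test above: going by the observations in this thread, it came back without variation for {{colsample_bylevel}} and {{colsample_bynode}} under {{approx}}, but the outcome on other XGBoost versions is not something this sketch can promise.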

@exalate-issue-sync

Veronika Maurerová commented: [~accountid:5d9dc9eb87dd6f0dcb4d4d98], thank you for debugging! It looks like an XGBoost bug. I took a quick look at their code and found this issue: https://github.com/dmlc/xgboost/issues/7002

In XGBoost, approximate histogram creation and sampling are mixed together. The colsample_bytree parameter should not affect histograms, so it works. However, I think colsample_bynode and colsample_bylevel are affected by histogram creation. I will look for more details in the XGBoost code to be sure.
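
For context on what sampling is expected from these three parameters: the XGBoost documentation describes the colsample_by* family as cumulative, with columns drawn once per tree, then per depth level from the tree's columns, then per split from the level's columns. The toy simulation below only illustrates that documented nesting. It is not XGBoost's internal code, the function names are hypothetical, and the rounding rule (truncate, keep at least one column) is an assumption.

```python
import random


def sample_columns(columns, rate, rng):
    """Pick a random subset of `columns`; keeping at least one is an assumption."""
    k = max(1, int(rate * len(columns)))
    return sorted(rng.sample(columns, k))


def simulate_tree(n_columns, depth, colsample_bytree, colsample_bylevel,
                  colsample_bynode, seed, nodes_per_level=2):
    """Simulate which columns are visible at each level and node of one tree."""
    rng = random.Random(seed)
    # Drawn once for the whole tree.
    tree_cols = sample_columns(list(range(n_columns)), colsample_bytree, rng)
    levels = []
    for _ in range(depth):
        # Drawn again at every depth level, from the tree's columns.
        level_cols = sample_columns(tree_cols, colsample_bylevel, rng)
        # Drawn again for every split, from the level's columns.
        node_cols = [sample_columns(level_cols, colsample_bynode, rng)
                     for _ in range(nodes_per_level)]
        levels.append({"level": level_cols, "nodes": node_cols})
    return {"tree": tree_cols, "levels": levels}
```

With 64 columns and all three rates at 0.5, each split in this simulation chooses among 8 columns, and changing the seed changes which ones. That per-level and per-node variation is what appeared to be missing under {{tree_method='approx'}} in the tests reported here.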

@exalate-issue-sync

Mathijs de Jong commented: [~accountid:5bd237b8dd3cc64b77e71676] I can create an issue for the xgboost team, but I am also happy if you would want to do it (given that you have debugged it in more detail). Do you have a preference?

@exalate-issue-sync

Veronika Maurerová commented: [~accountid:5d9dc9eb87dd6f0dcb4d4d98], please do create an issue. I reproduced the problem with xgboost directly, so I am sure it is their bug. I tried to find what is wrong, but I was not successful. I also closed this Jira as resolved. We improved our API a little bit based on this bug, and that is perfect. 🙂 Thank you so much for being so cooperative!

@exalate-issue-sync

Veronika Maurerová commented: We can't fix this issue because it is a bug on the XGBoost side. However, we improved our XGBoost API to make sure alias parameters are used correctly.

@exalate-issue-sync

Mathijs de Jong commented: [~accountid:5bd237b8dd3cc64b77e71676] Thanks again for your help. As an FYI, here is the issue I reported to XGBoost:
https://github.com/dmlc/xgboost/issues/7244

@h2o-ops

h2o-ops commented May 14, 2023

JIRA Issue Migration Info

Jira Issue: PUBDEV-8266
Assignee: Veronika Maurerová
Reporter: Mathijs de Jong
State: Resolved
Fix Version: 3.34.0.3
Attachments: Available (Count: 8)
Development PRs: Available

Linked PRs from JIRA

#5641

Attachments From Jira

Attachment Name: MWE_PUBDEV_8266_1.ipynb
Attached By: Mathijs de Jong
File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-8266/MWE_PUBDEV_8266_1.ipynb

Attachment Name: MWE_PUBDEV_8266_maurever.ipynb
Attached By: Veronika Maurerová
File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-8266/MWE_PUBDEV_8266_maurever.ipynb

Attachment Name: param.png
Attached By: Veronika Maurerová
File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-8266/param.png

Attachment Name: Screenshot 2021-08-10 at 11.21.10.png
Attached By: Mathijs de Jong
File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-8266/Screenshot 2021-08-10 at 11.21.10.png

Attachment Name: Screenshot 2021-08-10 at 11.21.17.png
Attached By: Mathijs de Jong
File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-8266/Screenshot 2021-08-10 at 11.21.17.png

Attachment Name: Screenshot 2021-08-10 at 11.32.03.png
Attached By: Mathijs de Jong
File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-8266/Screenshot 2021-08-10 at 11.32.03.png

Attachment Name: Screenshot 2021-08-10 at 11.32.18.png
Attached By: Mathijs de Jong
File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-8266/Screenshot 2021-08-10 at 11.32.18.png

Attachment Name: Screenshot 2021-09-06 at 16.33.16.png
Attached By: Mathijs de Jong
File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-8266/Screenshot 2021-09-06 at 16.33.16.png
