-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R] Parse splits on inf in xgb.model.dt.tree (#3900) #6740
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #6740 +/- ##
=======================================
Coverage 81.83% 81.83%
=======================================
Files 13 13
Lines 3809 3809
=======================================
Hits 3117 3117
Misses 692 692 Continue to review full report at Codecov.
|
Thanks for the PR and detailed description, could you please share a reproducible example that produces inf split? |
This seems to (not) work
|
Em, so data contains |
Yes, although the handling of I've also added |
Hi, I opened a different PR for checking invalid data: #6742 . |
Hi, could you please try latest master branch and see if the |
Sorry, I'm unable to build the package from source so can't test this, however I do disagree with the change made #6742. |
I see your point. Yes, it's possible for a decision tree to split on |
That makes sense, thanks! |
We @RAMitchell @hcho3 talked about the issue with data containing For the first one, we believe xgboost doesn't need to rush into supporting it right now since we have a For the second issue, in the future, it's possible that xgboost can generate |
As per the discussion on (#3900) splits on
inf
were previously being incorrectly parsed withNA
s due to a failed regex match.Since the change (#6109 Sep 2020) to remove
stringi
dependency, the handling of failed regex matches has changed and can now cause a number of different errors (detailed below).I have added
inf
to the regex and added a simple test case that fails for previous versions.Problem
If you have
inf
splits you now get one of three undesirable behaviours detailed below using dummy tree dumps.Note: This code was run on Windows 10 using xgboost 1.3.2.1 and R 4.0.3.
1. If you have multiple non-inf splits you get a data.table error
This seems to be the most common case, and was the error I stumbled upon that led me here.
The following cases are both improbable in practice but included for reference, 3 being particularly misleading
2. If you only have only inf splits, you get a subscript out of bounds error
3. If you only have only 1 non-inf split, then it's details get copied onto all rows
This is the special case of recycling that data.table allows (avoiding the error in 1)
Cause
As discussed above, the nodes are not parsed correctly as
anynumber_regex
does not matchinf
.The code change on lines
121-125
ofR-package/R/xgb.model.dt.tree.R
usesto replace the old line
This code change has altered the behaviour when the regex fails to find a match, for example
Solution
As suggested by @dshopin and seconded by @hcho3 in #3900 I have changed the
anynumber_regex
to includeinf
: