[R] Parse splits on inf in xgb.model.dt.tree (#3900) #6740

Open
wants to merge 2 commits into
base: master

thatchersj

As per the discussion in #3900, splits on inf were previously parsed incorrectly as NAs because of a failed regex match.

Since the change that removed the stringi dependency (#6109, Sep 2020), the handling of failed regex matches has changed, and it can now cause several different errors (detailed below).

I have added inf to the regex and added a simple test case that fails for previous versions.

Problem

If you have inf splits, you now get one of three undesirable behaviours. Each is detailed below using dummy tree dumps.

Note: This code was run on Windows 10 using xgboost 1.3.2.1 and R 4.0.3.

1. If you have multiple non-inf splits you get a data.table error

This seems to be the most common case, and was the error I stumbled upon that led me here.

xgb.model.dt.tree(
  text = c(
    "booster[0]",
    "0:[f1<inf] yes=1,no=3,missing=3,gain=0.1,cover=3",
    "1:[f2<3] yes=2,no=3,missing=3,gain=0.1,cover=1",
    "2:[f1<2] yes=4,no=3,missing=3,gain=0.3,cover=4",
    "3:leaf=0.2,cover=1",
    "4:leaf=0.5,cover=1"
  )
)

# Error in `[.data.table`(td, isLeaf == FALSE, `:=`((branch_cols), { : 
#  Supplied 2 items to be assigned to 3 items of column 'Feature'. If you wish to 'recycle'
#  the RHS please use rep() to make this intent clear to readers of your code.

The following two cases are improbable in practice but are included for reference. Case 3 is particularly misleading.

2. If you have only inf splits, you get a subscript out of bounds error

xgb.model.dt.tree(
  text = c(
    "booster[0]",
    "0:[f1<inf] yes=1,no=2,missing=2,gain=0.5,cover=4",
    "2:leaf=0.2,cover=1"
  )
)

# Error in do.call(rbind, matches)[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE] : 
#   subscript out of bounds

3. If you have only one non-inf split, then its details get copied onto all rows

This is the special case of recycling that data.table allows (a single item is recycled silently, which avoids the error in case 1).

xgb.model.dt.tree(
  text = c(
    "booster[0]",
    "0:[f1<inf] yes=1,no=3,missing=3,gain=0.1,cover=3",
    "1:[f2<3] yes=2,no=3,missing=3,gain=0.5,cover=1",
    "2:leaf=0.2,cover=1",
    "3:leaf=0.5,cover=1"
  )
)

#    Tree Node  ID Feature Split  Yes   No Missing Quality Cover
# 1:    0    0 0-0       2     3  0-2  0-3     0-3     0.5     1
# 2:    0    1 0-1       2     3  0-2  0-3     0-3     0.5     1
# 3:    0    2 0-2    Leaf    NA <NA> <NA>    <NA>     0.2     1
# 4:    0    3 0-3    Leaf    NA <NA> <NA>    <NA>     0.5     1
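
All three behaviours above come down to how many branch lines the pre-PR regex manages to parse. A minimal base-R sketch, assuming the pattern quoted in the Cause section of this description and a hypothetical helper `count_rows` (not part of xgboost) that picks out branch lines by looking for `yes=`:

```r
# Pre-PR patterns, as quoted in the Cause section of this description.
anynumber_regex <- "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"
branch_rx <- paste0(
  "f(\\d+)<(", anynumber_regex, ")\\] yes=(\\d+),no=(\\d+),missing=(\\d+),",
  "gain=(", anynumber_regex, "),cover=(", anynumber_regex, ")"
)

# Hypothetical helper (not part of xgboost): compare the number of branch
# lines in a dump with the number of lines the regex actually parses.
count_rows <- function(dump) {
  branch <- dump[grepl("yes=", dump, fixed = TRUE)]
  parsed <- lengths(regmatches(branch, regexec(branch_rx, branch))) > 0
  c(branch_rows = length(branch), parsed_rows = sum(parsed))
}

case1 <- c(
  "booster[0]",
  "0:[f1<inf] yes=1,no=3,missing=3,gain=0.1,cover=3",
  "1:[f2<3] yes=2,no=3,missing=3,gain=0.1,cover=1",
  "2:[f1<2] yes=4,no=3,missing=3,gain=0.3,cover=4",
  "3:leaf=0.2,cover=1",
  "4:leaf=0.5,cover=1"
)
case2 <- c(
  "booster[0]",
  "0:[f1<inf] yes=1,no=2,missing=2,gain=0.5,cover=4",
  "2:leaf=0.2,cover=1"
)
case3 <- c(
  "booster[0]",
  "0:[f1<inf] yes=1,no=3,missing=3,gain=0.1,cover=3",
  "1:[f2<3] yes=2,no=3,missing=3,gain=0.5,cover=1",
  "2:leaf=0.2,cover=1",
  "3:leaf=0.5,cover=1"
)

print(rbind(case1 = count_rows(case1),
            case2 = count_rows(case2),
            case3 = count_rows(case3)))
```

Case 1 parses 2 of 3 branch rows (the length mismatch data.table rejects), case 2 parses 0 of 1 (nothing to index into), and case 3 parses 1 of 2 (a single row that data.table recycles silently).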

Cause

As discussed above, the nodes are not parsed correctly because anynumber_regex does not match inf.

The code change on lines 121-125 of R-package/R/xgb.model.dt.tree.R uses

# (A)
matches <- regmatches(t, regexec(branch_rx, t))
# skip some indices with spurious capture groups from anynumber_regex
xtr <- do.call(rbind, matches)[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE]

to replace the old line

# (B)
xtr <- stri_match_first_regex(t, branch_rx)[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE]

This code change has altered the behaviour when the regex fails to find a match. With (A) the unmatched line is dropped from the result, whereas (B) kept it as a row of NAs. For example:

txt <- c(
  "booster[0]",
  "0:[f1<inf] yes=1,no=3,missing=3,gain=0.1,cover=3",
  "1:[f2<3] yes=2,no=3,missing=3,gain=0.5,cover=1",
  "2:leaf=0.2,cover=1",
  "3:leaf=0.5,cover=1"
)

anynumber_regex <- "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"
branch_rx <- paste0(
  "f(\\d+)<(", anynumber_regex, ")\\] yes=(\\d+),no=(\\d+),missing=(\\d+),",
  "gain=(", anynumber_regex, "),cover=(", anynumber_regex, ")"
)

# (A)
do.call(rbind, regmatches(txt[2:3], regexec(branch_rx, txt[2:3])))[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE]

#      [,1] [,2] [,3] [,4] [,5] [,6]  [,7]
# [1,] "2"  "3"  "2"  "3"  "3"  "0.5" "1" 

# (B)
stringi::stri_match_first_regex(txt[2:3], branch_rx)[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE]

#      [,1] [,2] [,3] [,4] [,5] [,6]  [,7]
# [1,] NA   NA   NA   NA   NA   NA    NA  
# [2,] "2"  "3"  "2"  "3"  "3"  "0.5" "1" 
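
For comparison, the row-preserving behaviour of (B) can be reproduced in base R by padding failed matches with NAs before binding. This is only a sketch of the difference, not what this PR does (the PR changes the regex instead); `match_rows` and its `n_groups` argument are hypothetical names introduced for illustration.

```r
# Pre-PR patterns, as quoted in this description.
anynumber_regex <- "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"
branch_rx <- paste0(
  "f(\\d+)<(", anynumber_regex, ")\\] yes=(\\d+),no=(\\d+),missing=(\\d+),",
  "gain=(", anynumber_regex, "),cover=(", anynumber_regex, ")"
)

# Hypothetical helper: one output row per input line, NA where the match failed.
# n_groups is the number of capture groups in the pattern (10 for branch_rx).
match_rows <- function(lines, pattern, n_groups) {
  m <- regmatches(lines, regexec(pattern, lines))
  # A failed match comes back as character(0), which rbind would silently drop.
  m <- lapply(m, function(x) {
    if (length(x) == 0) rep(NA_character_, n_groups + 1) else x
  })
  do.call(rbind, m)
}

lines <- c(
  "0:[f1<inf] yes=1,no=3,missing=3,gain=0.1,cover=3",
  "1:[f2<3] yes=2,no=3,missing=3,gain=0.5,cover=1"
)
xtr <- match_rows(lines, branch_rx, 10)[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE]
print(xtr)
```

Padding keeps the row count aligned with the input, so an unparsed split shows up as NAs rather than as an error or a recycled row.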

Solution

As suggested by @dshopin and seconded by @hcho3 in #3900 I have changed the anynumber_regex to include inf:

anynumber_regex <- "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?|[-+]?[Ii]nf"
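
A quick base-R check of the updated pattern against the two branch lines from the example above. This is a sketch that rebuilds `branch_rx` the same way as in the Cause section; it does not call the package itself.

```r
# Updated pattern from this PR: the extra alternative accepts inf / Inf.
anynumber_regex <- "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?|[-+]?[Ii]nf"
branch_rx <- paste0(
  "f(\\d+)<(", anynumber_regex, ")\\] yes=(\\d+),no=(\\d+),missing=(\\d+),",
  "gain=(", anynumber_regex, "),cover=(", anynumber_regex, ")"
)

txt <- c(
  "0:[f1<inf] yes=1,no=3,missing=3,gain=0.1,cover=3",
  "1:[f2<3] yes=2,no=3,missing=3,gain=0.5,cover=1"
)

# Same extraction as (A); both lines now match, so both rows survive.
xtr <- do.call(rbind, regmatches(txt, regexec(branch_rx, txt)))[
  , c(2, 3, 5, 6, 7, 8, 10), drop = FALSE
]
print(xtr)
```

Because the alternation sits inside the existing capture group, the group numbering is unchanged and the column indices `c(2, 3, 5, 6, 7, 8, 10)` still select the same fields.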

@thatchersj thatchersj marked this pull request as ready for review March 2, 2021 21:42
@thatchersj thatchersj changed the title Parse splits on inf in xgb.model.dt.tree (#3900) [R] Parse splits on inf in xgb.model.dt.tree (#3900) Mar 2, 2021
@codecov-io

codecov-io commented Mar 2, 2021

Codecov Report

Merging #6740 (a2b61f9) into master (a9b4a95) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #6740   +/-   ##
=======================================
  Coverage   81.83%   81.83%           
=======================================
  Files          13       13           
  Lines        3809     3809           
=======================================
  Hits         3117     3117           
  Misses        692      692           


@trivialfis
Member

Thanks for the PR and the detailed description. Could you please share a reproducible example that produces an inf split?

@thatchersj
Author

thatchersj commented Mar 3, 2021

This seems to (not) work

set.seed(115)
xg <- xgboost::xgb.train(
  data = xgboost::xgb.DMatrix(matrix(c(-Inf, Inf, 0), 3, 2), label = c(1, 0, 1)), 
  objective = "reg:squarederror", 
  booster = "gbtree",
  nrounds = 1
)

xgboost::xgb.dump(xg)
# [1] "booster[0]"                      "0:[f0<inf] yes=1,no=2,missing=1" "1:leaf=0.100000009"             
# [4] "2:leaf=-0.075000003" 

xgboost::xgb.model.dt.tree(model=xg)
# Error in do.call(rbind, matches)[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE] : 
#   subscript out of bounds

@trivialfis
Member

Em, so data contains inf but xgboost doesn't throw an error.

@thatchersj
Author

thatchersj commented Mar 3, 2021

Em, so data contains inf but xgboost doesn't throw an error.

Yes, although the handling of inf is weirdly inconsistent. I'll raise a new issue for that.

I've also added NaN to the regex in this PR as it is currently a possible value for the node split, rightly or wrongly.

@trivialfis
Member

Hi, I opened a different PR for checking invalid data: #6742 .

@trivialfis
Member

Hi, could you please try the latest master branch and see if the inf split is still reproducible? Right now the DMatrix should throw an error when the data contains inf but missing is set to another value.

@thatchersj
Author

Sorry, I'm unable to build the package from source, so I can't test this. However, I do disagree with the change made in #6742.
I don't see why Inf should be an invalid value for decision tree regression. There seems to be a perfectly reasonable notion of splitting at Inf, and there is a well-defined comparison between Inf and any real number. Moreover, this behaviour was previously available in xgboost (see the last example in my ticket #6741, which you have closed), and #6742 breaks/backs out this functionality.
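
The comparison claim can be illustrated in base R. This is a sketch of the arithmetic only; it assumes nothing about how xgboost evaluates splits internally.

```r
# Candidate feature values: large finite values, zero, Inf itself, and a missing value.
x <- c(-1e308, 0, 1e308, Inf, NA)

# A split of the form f < Inf sends every finite value down the "yes" branch.
goes_yes <- x < Inf
print(goes_yes)
```

Every finite value compares as less than Inf, Inf itself does not, and NA stays NA (which is what the `missing=` direction in a dump line is for).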

@trivialfis
Member

I see your point. Yes, it's possible for a decision tree to split on inf as a trivial case, but right now we don't have uniform handling of inf across the various tree building algorithms. I will see if it makes sense to revert that commit.

@thatchersj
Author

I see your point. Yes, it's possible for a decision tree to split on inf as a trivial case, but right now we don't have uniform handling of inf across the various tree building algorithms. I will see if it makes sense to revert that commit.

That makes sense, thanks!

@trivialfis
Member

@RAMitchell, @hcho3 and I talked offline about data containing inf and about using inf as a split value. These are 2 separate issues.

For the first one, we believe xgboost doesn't need to rush into supporting it right now, since we have a missing parameter in DMatrix for specifying this kind of data. Users can also handle it by preprocessing. Providing full support for inf requires careful inspection of the various language bindings and internal algorithms.

For the second issue, it's possible that xgboost will in the future generate inf as the split value for a trivial split, but that would be a different topic.
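
The preprocessing route mentioned for the first issue could look roughly like this in base R. This is a sketch that assumes treating infinite values as missing is acceptable for the data at hand; the `xgb.DMatrix` call is shown only as a comment and is not executed here.

```r
# Same toy feature matrix as in the reproducible example above.
x <- matrix(c(-Inf, Inf, 0), 3, 2)

# Replace +/-Inf with NA so they are treated as missing values.
x_clean <- x
x_clean[is.infinite(x_clean)] <- NA_real_
print(x_clean)

# The cleaned matrix could then be passed on with NA as the missing marker:
# dtrain <- xgboost::xgb.DMatrix(x_clean, label = c(1, 0, 1), missing = NA)
```

Note that this discards the ordering information Inf carries, which is the trade-off raised earlier in the thread.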
