Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] Fix parsing decision stump. #7689

Merged
merged 4 commits into from
Mar 16, 2022
Merged

Conversation

trivialfis
Copy link
Member

  • Fix in dataframe construction.
  • Fix in plotting.

Still need tests for more sophisticated cases like mixed tree types.

Close #7669

@trivialfis trivialfis changed the title [WIP] [R] Fix parsing decision stump. [R] Fix parsing decision stump. Feb 28, 2022
@trivialfis trivialfis marked this pull request as ready for review February 28, 2022 08:41
@trivialfis
Copy link
Member Author

@hetong007 Could you please take a look into this fix? I'm a little bit confused by the regex parsing.

@hetong007
Copy link
Member

Sure. What exactly is the confusing part?

@trivialfis
Copy link
Member Author

trivialfis commented Mar 15, 2022

@hetong007 For instance this comment in the code:

skip some indices with spurious capture groups from anynumber_regex

I'm not entirely sure what's it skipping. In general, I would have more confidence about the correctness of the code if we use the JSON dump along with jsonlite for parsing. But right now in the R API, the user can directly pass a text string input so I can't make the change otherwise it would be breaking.

@trivialfis
Copy link
Member Author

Would be great if you can review the PR and see if the fix makes sense.

@hetong007
Copy link
Member

I looked into the code with an example.

The to-parse text is "f0<0.5] yes=1,no=2,missing=1,gain=0.0366666317,cover=2", with branch_rx="f(\\d+)<([-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?)\\] yes=(\\d+),no=(\\d+),missing=(\\d+),gain=([-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?),cover=([-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?)".

After the regmatches call, we have the results as

[[1]]
 [1] "f0<0.5] yes=1,no=2,missing=1,gain=0.0366666317,cover=2"
 [2] "0"
 [3] "0.5"
 [4] ""
 [5] "1"
 [6] "2"
 [7] "1"
 [8] "0.0366666317"
 [9] ""
[10] "2"
[11] ""

So basically the skipped indices contains either the input or an empty string, thus the list c(2, 3, 5, 6, 7, 8, 10) just slices those informative ones out.

Overall I think your fix here makes sense, with one inline comment.

@trivialfis
Copy link
Member Author

trivialfis commented Mar 16, 2022

@hetong007 Thank you for digging into this! Your explanation of the code is helpful.

On the issue of using constant instead of string value NA, I reverted the change as by the end of the function, there are procedures to convert each column to the correct type using as.numeric and as.integer. We need to use string value NA first, as the parsed values are strings and need to be consistent with the columns initialized by strings NA.

@hetong007
Copy link
Member

@hetong007 Thank you for digging into this! Your explanation of the code is helpful.

On the issue of using constant instead of string value NA, I reverted the change as by the end of the function, there are procedures to convert each column to the correct type using as.numeric and as.integer. We need to use string value NA first, as the parsed values are strings and need to be consistent with the columns initialized by strings NA.

Cool! This makes sense.

@trivialfis trivialfis merged commit da35162 into dmlc:master Mar 16, 2022
@trivialfis trivialfis deleted the fix-R-parse-dump branch March 16, 2022 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

xgb.plot.tree stops with "Non-tree model detected"
2 participants