-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add customized parser support #4782
Add customized parser support #4782
Conversation
353041d
to
af70998
Compare
@jameslamb @StrikerRUS @guolinke Could you please have a look at this PR if you have time. Thanks. |
@chjinche From the first glance this PR looks really good and doesn't break backward compatibility. Thanks a lot for finding a way to support feature transformations in a non-invasive for core LightGBM API way! I'll provide more thoughtful review when have time. |
@StrikerRUS Much appreciated for reviewing and glad that you like the implementation 😍 |
/gha run r-solaris Workflow Solaris CRAN check has been triggered! 🚀 solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.3.1.99.tar.gz-917bd78bb1fe442a8c8d245586e29b13 |
/gha run r-valgrind Workflow R valgrind tests has been triggered! 🚀 Status: success ✔️. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like merging this PR as-is would break the ability to use the LightGBM R package on Solaris, which would result in that package being rejected and removed from CRAN.
Here is the relevant error from https://builder.r-hub.io/status/lightgbm_3.3.1.99.tar.gz-917bd78bb1fe442a8c8d245586e29b13
#> In file included from ./include/LightGBM/config.h:16:0,
1782#> from ./include/LightGBM/boosting.h:8,
1783#> from boosting/boosting.cpp:5:
1784#> ./include/LightGBM/utils/common.h:207:82: error: ‘INT_MAX’ was not declared in this scope
1785#> inline static std::string LoadStringFromFile(const char* filename, int row_num = INT_MAX) {
1786#> ^
1787#> gmake: *** [/opt/csw/lib/R/etc/Makeconf:177: boosting/boosting.o] Error 1
1788#> ERROR: compilation failed for package ‘lightgbm’
1789#> * removing ‘/export/home/XjiOipx/R/lightgbm’
1790#> Warning message:
1791#> In i.p(...) :
1792#> installation of package ‘/export/home/XjiOipx/Rtemp/RtmpoRqZQc/file6c8076a9dbf/lightgbm_3.3.1.99.tar.gz’ had non-zero exit status
I think this is might be a compiler-specific problem, and maybe explicitly including <climits>
would fix it? Based on this Stack Overflow answer which seems related.
Could you try that and re-run the Solaris CI job? (link to instructions)
Sorry for the inconvenience. I hope the fix is that simple 😬
Other than the issue with the Solaris job, I agree with @StrikerRUS ... this looks like a good approach for the integration! Thanks very very much for your efforts there. I'm definitely interested in trying this out once the relevant changes have been merged in LightGBM and the |
We can also use
so it should be safe to use. Inclusion of |
Oh nice! I like that even better, and since it's used in other places in the code I think it's likely to work on Solaris. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for this PR!
Please check some my suggestions and questions below:
f622583
to
bc1099f
Compare
/gha run r-solaris Workflow Solaris CRAN check has been triggered! 🚀 solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.3.1.99.tar.gz-af4c61e8f58a4461865e362f5f8289d7 |
/gha run r-valgrind Workflow R valgrind tests has been triggered! 🚀 Status: success ✔️. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks ok to me now that valgrind and solaris tests are passing, thanks! Please wait for a review from @shiyu1994 before merging, my review shouldn't be enough for a merge on this one.
By the way, we use squash merging in this project, so it's not necessary to rebase, squash commits, and force-push. In fact, as a reviewer I personally would prefer that you didn't do squashing within a pull request, so it's possible to see the changes in each subsequent commit pushed in response to reviewers' comments.
changes have been addressed, but wanted to only leave a "comment" follow-up review so my review doesn't count towards a merge
I would kindly ask next time address review comments in a separate commits and don't perform force pushes in a such kind of big PRs. Separate commits speed up next round of reviews by allowing to check that only requested changes were done and nothing else were touched by a bad merge commit for example. Without separate commits one should review entire PR from scratch again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✔️
LGTM, except two very minor suggestions below! Thank you very much for for this enhancement!
I'm leaving only "comment" because I believe that only review from cpp maintainer should count towards a merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes LGTM!
@chjinche Thanks for this great work! And thank @StrikerRUS @jameslamb for your review. BTW, @StrikerRUS, could you please mark the request changes as resolved? It seems that the requested changes have been addressed, but they still block the merging. Thanks for your detailed review. |
/gha run r-solaris Workflow Solaris CRAN check has been triggered! 🚀 solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.3.1.99.tar.gz-6d170faaa30041cdbbe614bea360064b |
/gha run r-valgrind Workflow R valgrind tests has been triggered! 🚀 Status: success ✔️. |
I'm going to merge this given that all the checks have been passed. The latest suggestions of @StrikerRUS have been addressed by @chjinche. Thank you. |
Sorry about that! |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
Motivation
Currently, lightGBM can only support three text data formats (tsv, csv, libsvm), and data type check process is kind of coarse. For example, a tsv may be judged as libsvm if there is string feature containing colon. Besides, lightGBM does not support feature transformation, users need to do separate transformation themselves, which may be a bit inconvenient.
Proposal
In this PR, we'd like to add customized parser to unlock free data preprocessing scenario. Users could use lightgbm as dependency, and follow
parser
interface to design their own parsers or could simply try freeform parser.Compared to separate transformation. this way has several pros:
Changes
The main code changes are:
parser_config_file
to config.