Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Add FreeForm2 Into LightGBM #4742

Closed
shiyu1994 opened this issue Oct 29, 2021 · 19 comments · Fixed by #4782
Closed

[RFC] Add FreeForm2 Into LightGBM #4742

shiyu1994 opened this issue Oct 29, 2021 · 19 comments · Fixed by #4782

Comments

@shiyu1994
Copy link
Collaborator

Summary

FreeForm2 is a module that transforms original feature values according to the arithmetic and logic expressions specified by users. The transformed features will be used as inputs to the model for training. We plan to integrate FreeForm2 into LightGBM.

Motivation

Quote the description from @chjinche in #4733:
FreeForm2 is a flexible type of feature transforms, created by Microsoft Core Ranking and used widely over Microsoft production model training. As the name indicates, FreeForm2 empowers users to compose a free combination of features as they like. It is expressed by formulas to be applied in the model inputs. The surface syntax is s-expression, with parentheses in a LISP-like fashion to delimit. FreeForm2 has implicit type systems and evaluate a single, nested expression that returns a floating-point number. Please see more info about FreeForm in doc PR #4735.

Integration of FreeForm2 allows LightGBM to be more widely used in our products.

Description

We plan to divide the integration into 3 parts:

  1. Add the source code of FreeForm2 into LightGBM code base (PR Add transform support for LightGBM by open source FreeForm2 library #4733). The source code was originally a part of our product code, and we've got the approval to open source it. Though this is a large PR, it will have no interference with our current code base.
  2. Adding interface in LightGBM allows use of FreeForm2 end-to-end with the model training. This PR adds two parameters for data construction from file, transform_file and header_file. transform_file describes the transform expressions, and header_file is a separate file for the original feature names. This PR also adds some tests and enables checks for transform in both Github action workflow and Azure Pipeline. (Integrate transform implementation with lightGBM, add separate header file support. #4734)
  3. Update the documents for how to use transform in LightGBM (Add docs for transform installation, tutorial, freeform2 language spec. #4735)
@shiyu1994
Copy link
Collaborator Author

@StrikerRUS @jameslamb @guolinke @btrotta Let's have discussion about adding the FreeForm2 feature here. Feel free to express any concerns and suggestions. Thanks!

@jameslamb
Copy link
Collaborator

jameslamb commented Oct 29, 2021

Thanks for this additional information. This seems like a very interesting and powerful project!

But I am concerned about the proposal to add it directly to LightGBM.

Could you please explain why adding this code to the LightGBM repo is preferable to releasing the project under its own repo?

I don't agree with the statement that this change "will have no interference with our current code base".

@shiyu1994
Copy link
Collaborator Author

shiyu1994 commented Oct 29, 2021

Hi @jameslamb, after some discussion with @chjinche, we agree that we can make the large PR #4733 a separate repo, and use it in LightGBM as an external library, just like compute, eigen and fmt. Do you think that's acceptable?

@shiyu1994
Copy link
Collaborator Author

By the way, I updated your post above to change the wrongly referred PR number #4374 to #4734.

@chjinche
Copy link
Contributor

chjinche commented Oct 29, 2021

hi @jameslamb Thanks for your insights! And thanks @shiyu1994 for proposing the discussion. I'd like to share more comments from my perspective. Feel free to share your views, and hope we could solve the issue asap.

  • As @shiyu1994 mentioned, we agree to move PR Add transform support for LightGBM by open source FreeForm2 library #4733 to a separate repo. One question to discuss: could we name the repo as lightgbm-transform? Cause the transform interfaces are designed specially for lightgbm usage, not general.

  • For integration PR, looks still necessary. Shared more context here. We want to provide a built-in transform way to lightGBM, because it is a very common practice in tree based model training in Microsoft products, and benefits us a lot, now we are glad to use lightGBM as model in products and need to fill this gap of data transformation. @jameslamb Could you please kindly review PR Integrate transform implementation with lightGBM, add separate header file support. #4734 in advance? So we could find more problems in advance.
    To clarify, this PR has no interference with current code base. The compliation is controlled by a new option USE_TRANSFORM with default value OFF. No impact to original codes if not turn on. And ci for transform task is also necessary, or we cannot run the integration test. To save ci cost, we could prepare a docker image packed with transform dependencies, and I've tested the additional python tests only cost 3 mins. Sure we will modify the cmake list part and include header part, as transform code base will be a separate external lib.

@StrikerRUS
Copy link
Collaborator

Thanks for inviting to this discussion!

I'm absolutely agree with each point @jameslamb has already written above. I just want to highlight some moments about what I'm personally concerned.

As it was fairly noted, the main purpose of adding FreeForm2 to the LightGBM repo is to help developers use it inside different Microsoft departments.

The purpose of this integration is to enable LightGBM to be used in some of our products.

Integration of FreeForm2 allows LightGBM to be more widely used in our products.

We want to provide a built-in transform way to lightGBM, because it is a very common practice in tree based model training in Microsoft products, and benefits us a lot, now we are glad to use lightGBM as model in products and need to fill this gap of data transformation.

I agree that this project is quite interesting and it's great that Microsoft decided to open-source it. For sure, this project can gain some popularity among ML community. But it's debatable statement that it should be a part of LightGBM itself as a general-purpose tree-based gradient boosting framework. Adding FreeForm2 can shift development direction to an autoML field.

I strongly believe that a separate project with LightGBM as a dependency is the best option. Thanks to git and GitHub it can be done by various methods. Going this way maintainers attention won't be diffused by issues and PRs to FreeForm2.

I'm a kind of OK with a lightweight PR adding some core interface enhancements without which FreeForm2 can't work. But it'll be much better if they done inside FreeForm2 own repository as they are not used anywhere else. But I'm strongly against adding new CI jobs into LightGBM repo. We already have a lot, really a lot, of CI jobs to cover as much possible combinations of different environments as we can. Unfortunately, the number of CI jobs are not unlimited and we made a lot of efforts to overcome this without significantly losing the coverage. One another frustrating thing is that CI environments are not stable and sometimes (to be honest, quite often) we spend several days to simply repair something in a CI environment. During that time PRs are blocked for the merge and even are not being reviewed because of the limit of human resources here. Summing up all the above, any problems with FreeForm2 CI will hurt the development speed of LightGBM. Sorry, but I really can't get it why this all should be a part of this repo.

@jameslamb
Copy link
Collaborator

I updated your post above to change the wrongly referred PR number

ha thank you! sorry about that.

could we name the repo as lightgbm-transform? Cause the transform interfaces are designed specially for lightgbm usage, not general.

Oh wow! Until this comment I did not know that this "FreeForm2" project is intended to be LightGBM-specific.

If that's true then yes, I think it makes sense to include lightgbm- in the name.

we can make the large PR #4733 a separate repo, and use it in LightGBM as an external library, just like compute, eigen and fmt

Thank you very much. I definitely think that is preferable. That makes it clearer that that other repo is the place where issues and pull requests related to this large library should go.

the compliation is controlled by a new option USE_TRANSFORM with default value OFF. No impact to original codes if not turn on

I don't agree with this reasoning. As maintainers of this project, we're responsible for supporting all configurations of LightGBM. And users should expect that if we provide an alternative configuration, it has the full support of the projects' maintainers unless it is explicitly documented as not being well-supported.

USE_GPU and USE_CUDA are also OFF by default in this project, and we still receive many bug reports, feature requests, and pull requests related to those types of builds. Those requests do not take up any less maintainer time and energy just because those options are OFF by default.

Could you please kindly review PR Integrate transform implementation with lightGBM, add separate header file support. #4734 in advance?

I took a quick look, and see some immediate concerns:

  1. It looks like FreeForm2 is only supported using gcc on unix-like systems (linux + mac). That makes it far less portable than LightGBM itself, which is supported on linux, mac, and windows using MSVC, gcc, clang, and others. When users ask for support to be extended to other systems where they're already using LightGBM...will you work on such requests? Or is the vision for this feature that it will be gcc-only, unix-like-only until someone from the community volunteers to try to add support for other OSes or compilers?

  2. How do you expect users of the Python package to take advantage of this? Will you propose in a future PR that we build and publish wheels to PyPI that include the transform code? As of Integrate transform implementation with lightGBM, add separate header file support. #4734, I think the only way someone would be able to use this feature in the Python package is to clone the repo and build from source.

  3. Similar question for R....the main package manager for R, CRAN, has very strict requirements for code that requires compilation, including a requirement that it work on all of the platforms CRAN considers important (several linux flavors, Windows, macOS, Solaris...and with both gcc and clang). Do you anticipate some people from Microsoft adding support for this in the R package in the future?

No need to answer each of those specific questions here (to avoid this thread getting too large), but I just raise them as some tangible examples of the maintenance costs that would be introduced by moving forward with this proposal.

@shiyu1994
Copy link
Collaborator Author

@StrikerRUS @jameslamb, thanks for your suggestions. I think these PRs are only initial steps towards the full support of FreeForm2. And we can spend efforts in the future to make the support of FreeForm2 more portable. Actually, we have some ideas to create a separate library for data handling in LightGBM, which can contain current preprocessing logics in LightGBM (such as discretization), and the target encoding in PR #3234, and FreeForm2. This allows LightGBM to focus more on the GBDT algorithm. And the data library can handle feature engineering related issues.

@jameslamb jameslamb changed the title Add FreeForm2 Into LightGBM [RFC] Add FreeForm2 Into LightGBM Oct 29, 2021
@chjinche
Copy link
Contributor

chjinche commented Nov 3, 2021

@StrikerRUS @jameslamb, thanks for your comments. With some discussions, we figured out a way that may solved concerns you raised 😉. Could we add a reflection mechanism for base class Parser in lightGBM?
In this way, users could use lightgbm as dependency and customize the parser they like to do data preprocessing in a built-in way. And no extra ci and limited compiler/os problems are introduced.
We could prepare examples and tutorials in lightgbm-transform repo, user could go there and learn how to play with parser.

@StrikerRUS
Copy link
Collaborator

@chjinche Thank you very much for your response!

In this way, users could use lightgbm as dependency and customize the parser they like to do data preprocessing in a built-in way.

I believe this is the best way to go! And I think it's OK to add some API changes that are required by lightgbm-transform. We are doing almost the same with the SWIG wrapper: API is added in this repo but all tests are done in consumer's repository and we sometimes receive bug fixes (e.g. #2958).

user could go there and learn how to play with parser.

Please don't forget to add some links in LightGBM docs to lightgbm-transform with the aim to help users discover it (e.g. like here #4584).

@chjinche
Copy link
Contributor

chjinche commented Nov 4, 2021

Thanks @StrikerRUS for the plan approval! We will prepare a new pr soon as we discussed.

@jameslamb
Copy link
Collaborator

I don't understand what "a reflection mechanism for base class Parser in LightGBM" means, but as long as LightGBM isn't responsible for compiling FreeForm2 and pull requests / tests / bug reports for FreeForm2 live in a separate repo, I'm supportive of that proposal. Thanks for finding an alternative integration plan.

@chjinche chjinche linked a pull request Nov 8, 2021 that will close this issue
@chjinche
Copy link
Contributor

chjinche commented Nov 8, 2021

hi @shiyu1994 @jameslamb @StrikerRUS hope you are doing well! 😄
I've sent out the new integration PR #4782, could you please kindly review it? And looks most ci jobs passed except three R ones, but I did not change R package nor c api, could you please help take a look or share some simple debug way? Thanks a lot.

@chjinche
Copy link
Contributor

Thanks @shiyu1994 for review the PR! @jameslamb @StrikerRUS May I know if you have time to take a look? 😄 all ci jobs passed now.

@jameslamb
Copy link
Collaborator

jameslamb commented Nov 12, 2021

May I know if you have time to take a look?

Sorry for the delay, I've been traveling this week. I just reviewed and I'm comfortable with those changes, just looks like there is a bit of extra work to do to ensure the library continues to meet the strict portability requirements for CRAN. (#4782 (review))

@StrikerRUS
Copy link
Collaborator

Hey @chjinche ! Can the branch feat/enable_transform_to_lgbm be removed now from the LightGBM repo?

@StrikerRUS
Copy link
Collaborator

Gently ping @chjinche

@chjinche
Copy link
Contributor

Hey @chjinche ! Can the branch feat/enable_transform_to_lgbm be removed now from the LightGBM repo?

Sorry for the late response, deleted the unused branch.

@github-actions
Copy link

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 16, 2023
@microsoft microsoft unlocked this conversation Aug 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants