Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-package] Possibly-incorrect handling of duplicate parameters #4521

Closed
jameslamb opened this issue Aug 15, 2021 · 2 comments · Fixed by #4914
Closed

[R-package] Possibly-incorrect handling of duplicate parameters #4521

jameslamb opened this issue Aug 15, 2021 · 2 comments · Fixed by #4914

Comments

@jameslamb
Copy link
Collaborator

Description

In the CLI, if you provide a training config with bagging_fraction=0.8 and bagging_fraction=0.5, you will see a warning like this during training.

bagging_fraction is set=0.8, bagging_fraction=0.5 will be ignored. Current value: bagging_fraction=0.8

If you provide a config with those duplicate values to lgb.train() in the R package, you'll see this instead:

bagging_fraction is set=0.8, bagging_fraction=0.8 will be ignored. Current value: bagging_fraction=0.8

I need to do more investigation, but I think this means that the R package is not handling duplicates in parameters correctly.

Reproducible example

This behavior cannot be reproduced in the Python package because Python code in this project uses dictionaries for parameters at all user-facing entrypoints and Python dictionaries enforce the uniqueness of keys.

However, the R package uses R lists, which can have duplicate keys (e.g. list(a = 5, a = 6)).

library(lightgbm)

data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)

model <- lgb.train(
  params = list(
      objective = "regression",
      bagging_fraction = 0.8
  )
  , data = dtrain
  , nrounds = 5L
  , bagging_fraction = 0.5
)

[LightGBM] [Warning] bagging_fraction is set=0.8, bagging_fraction=0.8 will be ignored. Current value: bagging_fraction=0.8

With the CLI, the same duplicated parameters results in the expected warning message.

cd ./examples/regression
echo "bagging_fraction = 0.5" >> train.conf
echo "" >> train.conf

"../../lightgbm" config=train.conf

[LightGBM] [Warning] bagging_fraction is set=0.8, bagging_fraction=0.5 will be ignored. Current value: bagging_fraction=0.8

Environment info

LightGBM version or commit hash: latest master (86ead20)

Command(s) you used to install LightGBM

# R package
sh build-cran-package.sh
R CMD INSTALL lightgbm_3.2.1.99.tar.gz

# CLI
mkdir build
cd build
cmake ..
make -j4

Additional Comments

This is probably related to the R package's use of append() to handle parameters passed through ... (R's equivalent of **kwargs in Python). Combining two lists in R with append() doesn't merge keys.

append(list(a = 6), list(b = 7, a = 9))

$a
[1] 6

$b
[1] 7

$a
[1] 9

It's possible that changing these uses to modifyList() would help with this issue.

In Python package, in the scikit-learn interface, parameters passed through **kwargs take precedence over those passed in the params dictionary.

self.set_params(**kwargs)

for key, value in params.items():
setattr(self, key, value)
if hasattr(self, f"_{key}"):
setattr(self, f"_{key}", value)
self._other_params[key] = value

However, since the current behavior of the R package is for parameters in params to take precedence over those from ..., I think that that behavior should be preserved in the 3.3.0 release (#4310). All uses of ... are going to be removed in LightGBM 4.0.0 (#4226).

@jameslamb
Copy link
Collaborator Author

Now that the work for #4226, the provided example code is no longer possible in the R package.

library(lightgbm)

data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)

model <- lgb.train(
    params = list(
        objective = "regression",
        bagging_fraction = 0.8
    )
    , data = dtrain
    , nrounds = 5L
    , bagging_fraction = 0.5
)

Error in lgb.train(params = list(objective = "regression", bagging_fraction = 0.8), :
unused argument (bagging_fraction = 0.5)

However, adding the same parameter to params twice still results in a different log message in the R package than in the CLI.

model <- lgb.train(
    params = list(
        objective = "regression"
        , bagging_fraction = 0.8
        , bagging_fraction = 0.5
    )
    , data = dtrain
    , nrounds = 5L
)
[LightGBM] [Warning] bagging_fraction is set=0.8, bagging_fraction=0.8 will be ignored. Current value: bagging_fraction=0.8

Looking into this more tonight, I think this is because of how the internal function used to to convert params (an R list) to a string to be passed through the C API.

params <- list(
    objective = "regression"
    , bagging_fraction = 0.8
    , bagging_fraction = 0.5
)
lightgbm:::lgb.params2str(params)

[1] "objective=regression bagging_fraction=0.8 bagging_fraction=0.8"

I'll propose a fix to lgb.params2str() to handle situations like this.

StrikerRUS pushed a commit that referenced this issue Dec 29, 2021
* [R-package] fix handling of duplicate parameters (fixes #4521)

* add gsub stuff back in for now
@github-actions
Copy link

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
1 participant