
Inconsistent Model Results with Identical Dataset and Parameters #6604

Closed
dameiqinghai opened this issue Aug 13, 2024 · 7 comments

@dameiqinghai

Description

I'm encountering an issue where I obtain different model results each time I train an LGBM model, even though I'm using the exact same dataset and identical parameters. This behavior is unexpected, as I would assume that using a fixed seed, dataset, and hyperparameters should produce consistent results, though the differences are minor. I wonder if this is normal?

Environment info

LightGBM version: v4.5.0
Command(s) you used to install LightGBM:

pip install --upgrade lightgbm

Additional Comments

I am using device_type = "gpu" along with common parameters.

@jameslamb
Collaborator

Thanks for using LightGBM.

You've omitted some significant details that we'd need in order to investigate whether what you're seeing is a bug or expected behavior.

For example:

  • what operating system?
  • what Python version?
  • what type of GPU?
  • same physical machine?
  • same software environment? (e.g., using identical docker image, virtual machine image, etc.)
  • same Python version?
  • how are you performing training? (scikit-learn interface? lgb.train()? Booster()?)
  • does "same Dataset" mean a LightGBM binary file? or same raw data? or something else?
  • exact reproducible example showing all the code you're using?

Investigating these sorts of reports of the form "I expected these to be identical and they're different" requires a lot of work eliminating possible sources of difference. There is a lot of prior discussion on that topic in this issue tracker.

We need to add some documentation explaining all of these topics (#6094), but until then... unless you're willing to work with us and provide a reproducible example with these types of details, it's unlikely you'll find the answers you're looking for here.

@dameiqinghai
Author

Hi, thanks for your reply. I'm running the exact same fitting process sequentially on the same machine, using identical raw data. My operating system is RHEL 8.x, and I'm using lgb.LGBMRegressor. Unfortunately, I can't share my source code or raw data due to confidentiality. However, I found that my problem is closely related to #559. I tried device_type="cpu", which is perfectly fine, and I am now training with double precision, as mentioned in the above issue, to see how things go.

@jameslamb
Collaborator

Thanks. Can you at least share the full set of parameters being passed into LGBMRegressor() and LGBMRegressor.fit()?

@dameiqinghai
Author

Of course. Here are all the non-default params:

{"n_estimators": 3000, "max_depth": 12, "num_leaves": 189, "learning_rate": 0.08, "subsample": 0.98, "colsample_bytree": 0.74, "min_child_samples": 140000, "n_jobs": 60, "bagging_freq": 62, "lambda_l2": 85, "max_bin": 40, "device_type": "gpu"}
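For context, a minimal sketch of how these parameters might be passed through the scikit-learn interface; the X_train / y_train names are placeholders for the confidential data, not actual code from the report:

```python
import lightgbm as lgb

# Non-default parameters reported above, passed via the scikit-learn interface.
params = {
    "n_estimators": 3000,
    "max_depth": 12,
    "num_leaves": 189,
    "learning_rate": 0.08,
    "subsample": 0.98,
    "colsample_bytree": 0.74,
    "min_child_samples": 140000,
    "n_jobs": 60,
    "bagging_freq": 62,
    "lambda_l2": 85,
    "max_bin": 40,
    "device_type": "gpu",   # requires a GPU-enabled LightGBM build
}

model = lgb.LGBMRegressor(**params)
# model.fit(X_train, y_train)   # X_train / y_train stand in for the real data
```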

@jameslamb
Collaborator

Thanks for that.

Please try the following changes to the parameters:

  • set seed to something other than 0 (like 708) (or random_state via keyword arguments)
    • this will control the randomness in bagging and column sub-selection, which is important because you're passing "subsample": 0.98, "colsample_bytree": 0.74
  • set "num_thread": 1 (or n_jobs=1 via keyword arguments)
    • this will make training slower, but in exchange it'll remove some sources of non-determinism. The way that results from multiple threads are combined can produce non-deterministic results because of numerical precision issues... e.g., imagine multiplying 10 very small floating-point numbers together in random order. The resulting value will vary slightly based on the order, even though symbolically the order is irrelevant in multiplication (see the small illustration just after this list).
  • set "deterministic": True
    • this controls the approach used for computing some loss values and for comparing candidate tree splits
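As a toy illustration of that ordering effect (plain Python floating-point arithmetic, not LightGBM internals):

```python
# Floating-point addition is not associative, so combining per-thread partial
# results in a different order can change the final value slightly.
a = (1e16 + 1.0) + 1.0   # each 1.0 is lost to rounding -> 1e16
b = 1e16 + (1.0 + 1.0)   # the 1.0s are combined first  -> 1.0000000000000002e16
print(a == b)            # False
```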

There are some forms of non-deterministic results using the GPU version that are unavoidable. But hopefully those changes will reduce most of the difference, and hopefully you'll see very similar models and performance metrics from multiple repeated training runs.
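Putting those suggestions together, a rough sketch of what the adjusted setup might look like (names are illustrative; `params` refers to the dict of non-default parameters shared above):

```python
import lightgbm as lgb

# Reproducibility-oriented overrides suggested above, layered on top of the
# reporter's parameters. Values like 708 are arbitrary examples.
reproducibility_overrides = {
    "random_state": 708,    # fixed, non-zero seed for bagging / feature sub-selection
    "n_jobs": 1,            # single-threaded training avoids thread-ordering effects
    "deterministic": True,  # deterministic split finding
}

model = lgb.LGBMRegressor(**{**params, **reproducibility_overrides})
```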

@dameiqinghai
Author

Hi, thanks for your advice. It took some time to run all the experiments. My observation, in my case, is that to make the results consistent I need to set the seed to a specific number as you mentioned above and also set gpu_use_dp to true, though training takes nearly 2x longer compared with single precision.
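For reference, a minimal sketch of the kind of repeated-run check described here; the data is synthetic stand-in data and the parameter values are illustrative, not the actual confidential setup:

```python
import numpy as np
import lightgbm as lgb

# Synthetic stand-in data; the real dataset is confidential.
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 10))
y = rng.normal(size=5_000)

settings = {
    "n_estimators": 50,     # kept small so the check runs quickly
    "device_type": "gpu",   # requires a GPU-enabled LightGBM build
    "gpu_use_dp": True,     # double precision on the GPU
    "random_state": 708,    # fixed seed
}

preds = []
for _ in range(2):
    model = lgb.LGBMRegressor(**settings)
    model.fit(X, y)
    preds.append(model.predict(X))

print(np.array_equal(preds[0], preds[1]))  # ideally True with these settings
```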

@jameslamb
Collaborator

Great, thanks for checking!

Yes, this all sounds right to me. I'd forgotten about setting gpu_use_dp=true; we do mention that in the FAQ as well:

LightGBM/docs/FAQ.rst, lines 54 to 59 (at d67ecf9):

5. When using LightGBM GPU, I cannot reproduce results over several runs.
-------------------------------------------------------------------------
This is normal and expected behaviour, but you may try to use ``gpu_use_dp = true`` for reproducibility
(see `Microsoft/LightGBM#560 <https://github.com/microsoft/LightGBM/pull/560#issuecomment-304561654>`__).
You may also use the CPU version.

And it sounds right that enabling that might lead to slower training... that is why it's false by default.

I'm going to close this as resolved. As you mentioned above, "the differences are minor" ... that is expected. You will have to choose what matters more for your use case... training speed or perfect determinism across multiple runs. Hopefully you're able to use the faster, not-quite-deterministic settings and trust your evaluation setup to help choose between models for tasks like hyperparameter tuning.
