
Error training model: Negative size passed to PyBytes_FromStringAndSize #1598

Closed · asindus opened this issue Aug 21, 2018 · 25 comments · Fixed by #1964

@asindus commented Aug 21, 2018

Operating System: Ubuntu 16.04

AWS EC2 Instance : r5.24xlarge(vcpu: 96 , Memory: 768 GB)

Train Shape: 89M rows, 224 features

Error message

Traceback (most recent call last):
File "Model_run.py", line 336, in
lgbmodel = lgb.train(params, train_lgb)
File "/usr/local/lib/python3.5/dist-packages/lightgbm/engine.py", line 228, in train
booster._load_model_from_string(booster._save_model_to_string(), False)
File "/usr/local/lib/python3.5/dist-packages/lightgbm/basic.py", line 1719, in _save_model_to_string
return string_buffer.value.decode()
SystemError: Negative size passed to PyBytes_FromStringAndSize

Parameters

params = {
'objective': 'mape',
'metric': 'mape',
'boosting': 'gbdt',
'learning_rate': 0.012,
'verbose': 0,
'num_leaves': 150000,
'bagging_freq': 0,
'min_data_in_leaf': 300,
'max_bin': 255,
'max_depth': 28,
'num_rounds': 200,
'min_gain_to_split': 0.0,
'save_binary': True
}

Python 3: the model is generated successfully with nrounds=100.

PS: The same parameters (with nrounds=200) successfully yield a model object in R.

@guolinke (Collaborator) commented Aug 21, 2018

It seems the string length exceeds the range of a 32-bit integer. Please consider using a smaller num_leaves.

@asindus (Author) commented Aug 21, 2018

@guolinke Unfortunately, the model's performance degrades when num_leaves is reduced. The best performance was achieved with the same params and nrounds=800 in R.

Do you think there is a workaround for this issue by tweaking the datatype of the string?

@guolinke (Collaborator)

@asindus You can try to set `keep_training_booster=True` in https://lightgbm.readthedocs.io/en/latest/Python-API.html#lightgbm.train.
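
A minimal sketch of this workaround with stand-in data (the real dataset was 89M rows × 224 features); judging by the traceback above, keep_training_booster=True keeps the trained booster in memory and skips the save-to-string/load-from-string round trip inside lgb.train that overflows the 32-bit length:

import numpy as np
import lightgbm as lgb

# stand-in data; the original report used 89M rows and 224 features
X = np.random.rand(10000, 224)
y = np.random.rand(10000)
train_lgb = lgb.Dataset(X, label=y)

params = {'objective': 'mape', 'metric': 'mape', 'learning_rate': 0.012,
          'num_leaves': 150000, 'min_data_in_leaf': 300}

# keep_training_booster=True avoids serializing the huge model to a string
lgbmodel = lgb.train(params, train_lgb, num_boost_round=200,
                     keep_training_booster=True)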

@asindus (Author) commented Aug 22, 2018

@guolinke Thanks, it seems to work with `keep_training_booster=True`.

However, comparing the output of the R and Python models with the same parameters, I'm getting considerably different performance on test data: MAPE 16% in R vs. 22% in Python. Any reason why this should happen? The features are all numerical and do not contain any null/Inf values.

@StrikerRUS (Collaborator)

@guolinke Should we fix this, or add it to the FAQ?

@guolinke (Collaborator)

Sorry for missing this issue.
@asindus Could you provide a reproducible example with randomly generated data (or a small fraction of your data)?
I think their performance should be the same.

@StrikerRUS (Collaborator) commented Nov 24, 2018

@guolinke Oh, I didn't notice that this issue actually contains two problems. I was asking about the model string length limit.

@guolinke (Collaborator)

@StrikerRUS From the code, I think the string length constraint was fixed: https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py#L2062

@StrikerRUS (Collaborator) commented Nov 25, 2018

@guolinke That's true. Git blame shows that the last modification of this line was performed a year ago in #1080. However, this issue was raised 3 months ago.

@asindus On which version of LightGBM did you observe the problem?

@StrikerRUS (Collaborator)

@guolinke While working on reproducing this issue I've encountered a bad allocation error:

[screenshot: LightGBM reporting "bad allocation" as a Warning]

As you can see, it's a Warning. Shouldn't it be Fatal? Unfortunately, searching the sources for "bad allocation" turned up nothing.

@guolinke (Collaborator) commented Dec 1, 2018

@StrikerRUS
I think these warnings are caused by the OpenMP exception handling:
https://github.com/Microsoft/LightGBM/blob/6488f319f243f7ff679a8e388a33e758c5802303/include/LightGBM/utils/openmp_wrapper.h#L43

Throwing exceptions inside an OpenMP loop will crash the program.
Therefore, the error message is printed as a warning, and the exception is thrown after the OMP loop.

Maybe we need a new keyword for errors inside an OMP loop, like "OMP Loop Fatal" or something similar.

@StrikerRUS (Collaborator)

@guolinke Ah, got it!

Unfortunately, there was no error after these warnings:

"Therefore, the error message is printed as a warning, and the exception is thrown after the OMP loop."

In the case of a Jupyter Notebook, a lot of warnings are printed and then the kernel dies without an exception:

[screenshot: Jupyter Notebook kernel dies after many warnings]

An ordinary Python script doesn't produce any exception at the end either:

[screenshot: Python script output ends without any exception]

@guolinke (Collaborator) commented Dec 1, 2018

I guess some errors crash the process directly, before control returns to the main thread.
You can try using one thread or disabling OpenMP.
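
A minimal sketch of that suggestion with stand-in data (num_threads is the standard LightGBM parameter for limiting the thread count; running single-threaded lets the underlying C++ exception propagate instead of being reported only as warnings):

import numpy as np
import lightgbm as lgb

X = np.random.rand(1000, 10)
y = np.random.rand(1000)

# num_threads=1 runs training single-threaded, so an exception surfaces
# directly instead of being swallowed inside the OpenMP loop
params = {'objective': 'regression', 'num_threads': 1, 'verbose': -1}
bst = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=10)

The other option mentioned above is to rebuild the library without OpenMP support.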

@StrikerRUS (Collaborator)

ping @asindus

@asindus (Author) commented Dec 10, 2018

@guolinke @StrikerRUS Unfortunately, I no longer have access to the environment (and data) where this error was encountered, so replicating the error would be quite tricky (as would finding out the LightGBM version). If it helps, the package was installed using "pip3 install lightgbm" on the Ubuntu environment listed above.

@StrikerRUS (Collaborator)

@asindus Ah, it's a pity!

@guolinke pip3 install lightgbm makes me think that it was version 2.1.2, in which the model string length bug had already been fixed. Maybe this was something else?..

@guolinke (Collaborator)

@StrikerRUS
It is hard to say. Maybe we can add some tests for large model strings?

@StrikerRUS (Collaborator)

@guolinke Fair and good point!

@StrikerRUS (Collaborator)

@guolinke I think we can train on one of the example datasets, save some huge LightGBM model and upload it to the repo. Then at the CI side we only load it and try to predict, to save CI time.

@guolinke (Collaborator)

@StrikerRUS Won't the size of this model be too large?

@StrikerRUS (Collaborator)

@guolinke Hmm... Very likely. What size do we need? len(model_str) > MAX_INT32, right? Maybe then we can store some "template" (let's say, with a few trees) and then, at the CI side, duplicate these trees to hit the needed model string length?

@guolinke (Collaborator)

Yeah, I think it is a good idea.
We can also do this on the Python side directly, by calling model_from_string, without using a file.
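
A rough sketch of that idea with a tiny model (the actual size test would additionally need to inflate the model string past 2^31 characters, e.g. by duplicating the trees of a small template model and fixing up their indices):

import numpy as np
import lightgbm as lgb

X = np.random.rand(1000, 10)
y = np.random.rand(1000)
bst = lgb.train({'objective': 'regression', 'verbose': -1},
                lgb.Dataset(X, label=y), num_boost_round=5)

model_str = bst.model_to_string()        # serialize the model in memory
bst2 = lgb.Booster(model_str=model_str)  # load it back without touching a file
np.testing.assert_allclose(bst.predict(X), bst2.predict(X))

Booster(model_str=...) calls model_from_string under the hood, so no temporary file is involved.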

@StrikerRUS (Collaborator)

@guolinke I'm afraid we will hit the RAM limit with such a big string on the CI side. For example, Travis' macOS has 4GB: https://docs.travis-ci.com/user/reference/overview/#virtualisation-environment-vs-operating-system.

@guolinke (Collaborator)

@StrikerRUS Maybe we can run the test only on Linux?

@PeterPann23

Hi,
I still regularly get the error; sometimes simply restarting the job allows it to end normally.

The log 201904.28 093411_log.txt (https://github.com/Microsoft/LightGBM/files/3127233/201904.28.093411_log.txt) shows the activity up to the crash.

My Console output:

Input file H:\MLData\Training\H0023.csv will be used
Training started at 4/28/2019 9:34:11 AM
 Memory usage       : current 17.43 MB, max. 30.59 GB
 1 Light GBM        : Start 28.04.2019 09:34:12
Status update       10:48:35: Start training LightGBM on 1,957,913 rows of data
Exception(s): Non critical error: LightGBM Error, code is -1, error message is 'bad allocation'.
Log: 12:12:57:[Source=LightGBMMulticlass; Loading data for LightGBM, Kind=Trace] Channel disposed. Elapsed 1.00:46:54.4866596.26:31.0423158.9218
..

// Note: Label, KeyColumn, Features, PredictedResult and trainingFile are defined elsewhere in the class.
var loader = mlContext.Data.CreateTextLoader(options: new TextLoader.Options()
{
    Columns = new[] {
        new TextLoader.Column(name: "Label", dataKind: DataKind.String, index: 0),
        new TextLoader.Column(name: "Features", dataKind: DataKind.Single, minIndex: 1, maxIndex: 40731)
    },
    HasHeader = false,
    Separators = new[] { '|' },
    UseThreads = true,
});
var dv = loader.Load(trainingFile.FullName);

dataset = mlContext.Data.TrainTestSplit(dv, testFraction: 0.1);
var start = DateTime.Now;

// Count the rows in the train/test splits.
(long? training, long? validating) montecarlo =
    (dataset.TrainSet.GetColumn<string>(Label).LongCount(),
     dataset.TestSet.GetColumn<string>(Label).LongCount());

var options = new LightGbmMulticlassTrainer.Options {
    LabelColumnName   = KeyColumn,
    FeatureColumnName = Features,
    Silent            = false,
    Verbose           = true,
    NumberOfThreads   = 8
};

var pipeline = mlContext.Transforms.Conversion.MapValueToKey(outputColumnName: KeyColumn, inputColumnName: Label)
    .Append(mlContext.MulticlassClassification.Trainers.LightGbm(options))
    .Append(mlContext.Transforms.CopyColumns(inputColumnName: KeyColumn, outputColumnName: nameof(PredictedResult.PredictedLabelIndex)));

this.OnUpdate?.Invoke($"Start training LightGBM on {montecarlo.training.Value:N0} rows of data");
// Train the model.
var model = pipeline.Fit(dataset.TrainSet);
