
Error training model: Negative size passed to PyBytes_FromStringAndSize #1598

Closed · asindus opened this issue Aug 21, 2018 · 25 comments · Fixed by #1964

@asindus commented Aug 21, 2018

Operating System: Ubuntu 16.04

AWS EC2 Instance : r5.24xlarge(vcpu: 96 , Memory: 768 GB)

Train Shape: 89M rows, 224 features

Error message

Traceback (most recent call last):
File "Model_run.py", line 336, in
lgbmodel = lgb.train(params, train_lgb)
File "/usr/local/lib/python3.5/dist-packages/lightgbm/engine.py", line 228, in train
booster._load_model_from_string(booster._save_model_to_string(), False)
File "/usr/local/lib/python3.5/dist-packages/lightgbm/basic.py", line 1719, in _save_model_to_string
return string_buffer.value.decode()
SystemError: Negative size passed to PyBytes_FromStringAndSize

Parameters

params = {
'objective': 'mape',
'metric': 'mape',
'boosting': 'gbdt',
'learning_rate': 0.012,
'verbose': 0,
'num_leaves': 150000,
'bagging_freq': 0,
'min_data_in_leaf': 300,
'max_bin': 255,
'max_depth': 28,
'num_rounds': 200,
'min_gain_to_split': 0.0,
'save_binary': True
}

Python 3: the model is generated successfully with nrounds=100.

PS: The same parameters (with nrounds=200) successfully yield a model object in R.

@guolinke (Collaborator) commented Aug 21, 2018

It seems the string length exceeds the range of a 32-bit integer. Please consider using a smaller num_leaves.

@asindus (Author) commented Aug 21, 2018

@guolinke Unfortunately, the model's performance degrades when num_leaves is reduced. The best performance was achieved with the same params and nrounds=800 in R.

Do you think there is a workaround for this issue by tweaking the datatype of the string?

@guolinke (Collaborator)

@asindus You can try to set `keep_training_booster=True` in https://lightgbm.readthedocs.io/en/latest/Python-API.html#lightgbm.train.
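
A minimal sketch of this workaround with stand-in data (the real dataset was 89M rows × 224 features); judging by the traceback above, keep_training_booster=True keeps the trained booster in memory and skips the save-to-string/load-from-string round trip inside lgb.train that overflows the 32-bit length:

import numpy as np
import lightgbm as lgb

# stand-in data; the original report used 89M rows and 224 features
X = np.random.rand(10000, 224)
y = np.random.rand(10000)
train_lgb = lgb.Dataset(X, label=y)

params = {'objective': 'mape', 'metric': 'mape', 'learning_rate': 0.012,
          'num_leaves': 150000, 'min_data_in_leaf': 300}

# keep_training_booster=True avoids serializing the huge model to a string
lgbmodel = lgb.train(params, train_lgb, num_boost_round=200,
                     keep_training_booster=True)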

@asindus (Author) commented Aug 22, 2018

@guolinke Thanks, it seems to work with `keep_training_booster=True`.

However, comparing the output of the R and Python models with the same parameters, I'm getting considerably different performance on test data: MAPE 16% in R vs. 22% in Python. Any reason why this should happen? The features are all numerical and do not contain any null/Inf values.

@StrikerRUS (Collaborator)

@guolinke Should we fix this, or add it to the FAQ?

@guolinke (Collaborator)

Sorry for missing this issue.
@asindus Could you provide a reproducible example with randomly generated data (or a small fraction of your data)?
I think their performance should be the same.

@StrikerRUS (Collaborator) commented Nov 24, 2018

@guolinke Oh, I didn't notice that this issue actually contains two problems. I was asking about the model string length limit.

@guolinke (Collaborator)

@StrikerRUS From the code, I think the string length constraint was fixed: https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py#L2062

@StrikerRUS (Collaborator) commented Nov 25, 2018

@guolinke That's true. Git blame shows that the last modification of this line was performed a year ago in #1080. However, this issue was raised 3 months ago.

@asindus On which version of LightGBM did you observe the problem?

@StrikerRUS (Collaborator)

@guolinke While working on reproducing this issue I've encountered a bad allocation error:

[screenshot: LightGBM reporting "bad allocation" as a Warning]

As you can see, it's a Warning. Shouldn't it be Fatal? Unfortunately, searching the sources for "bad allocation" turned up nothing.

@guolinke (Collaborator) commented Dec 1, 2018

@StrikerRUS
I think these warnings are caused by the OpenMP exception handling:
https://github.com/Microsoft/LightGBM/blob/6488f319f243f7ff679a8e388a33e758c5802303/include/LightGBM/utils/openmp_wrapper.h#L43

Throwing exceptions inside an OpenMP loop will crash the program.
Therefore, the error message is printed as a warning, and the exception is thrown after the OMP loop.

Maybe we need a new keyword for errors inside an OMP loop, like "OMP Loop Fatal" or something similar.

@StrikerRUS (Collaborator)

@guolinke Ah, got it!

Unfortunately, there was no error after these warnings:

"Therefore, the error message is printed as a warning, and the exception is thrown after the OMP loop."

In the case of a Jupyter Notebook, a lot of warnings are printed and then the kernel dies without an exception:

[screenshot: Jupyter Notebook kernel dies after many warnings]

An ordinary Python script doesn't produce any exception at the end either:

[screenshot: Python script output ends without any exception]

@guolinke (Collaborator) commented Dec 1, 2018

I guess some errors crash the process directly, before control returns to the main thread.
You can try using one thread or disabling OpenMP.
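
A minimal sketch of that suggestion with stand-in data (num_threads is the standard LightGBM parameter for limiting the thread count; running single-threaded lets the underlying C++ exception propagate instead of being reported only as warnings):

import numpy as np
import lightgbm as lgb

X = np.random.rand(1000, 10)
y = np.random.rand(1000)

# num_threads=1 runs training single-threaded, so an exception surfaces
# directly instead of being swallowed inside the OpenMP loop
params = {'objective': 'regression', 'num_threads': 1, 'verbose': -1}
bst = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=10)

The other option mentioned above is to rebuild the library without OpenMP support.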

@StrikerRUS (Collaborator)

ping @asindus

@asindus (Author) commented Dec 10, 2018

@guolinke @StrikerRUS Unfortunately, I no longer have access to the environment (and data) where this error was encountered, so replicating the error would be quite tricky (as would finding out the LightGBM version). If it helps, the package was installed using "pip3 install lightgbm" on the Ubuntu environment listed above.

@StrikerRUS (Collaborator)

@asindus Ah, it's a pity!

@guolinke pip3 install lightgbm makes me think that it was version 2.1.2, in which the model string length bug had already been fixed. Maybe this was something else?..

@guolinke (Collaborator)

@StrikerRUS
It is hard to say. Maybe we can add some tests for large model strings?

@StrikerRUS (Collaborator)

@guolinke Fair and good point!

@StrikerRUS (Collaborator)

@guolinke I think we can train on one of the example datasets, save some huge LightGBM model and upload it to the repo. Then at the CI side we only load it and try to predict, to save CI time.

@guolinke (Collaborator)

@StrikerRUS Won't the size of this model be too large?

@StrikerRUS (Collaborator)

@guolinke Hmm... Very likely. What size do we need? len(model_str) > MAX_INT32, right? Maybe then we can store some "template" (let's say, with a few trees) and then, at the CI side, duplicate these trees to hit the needed model string length?

@guolinke (Collaborator)

Yeah, I think it is a good idea.
We can also do this on the Python side directly, by calling model_from_string, without using a file.
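
A rough sketch of that idea with a tiny model (the actual size test would additionally need to inflate the model string past 2^31 characters, e.g. by duplicating the trees of a small template model and fixing up their indices):

import numpy as np
import lightgbm as lgb

X = np.random.rand(1000, 10)
y = np.random.rand(1000)
bst = lgb.train({'objective': 'regression', 'verbose': -1},
                lgb.Dataset(X, label=y), num_boost_round=5)

model_str = bst.model_to_string()        # serialize the model in memory
bst2 = lgb.Booster(model_str=model_str)  # load it back without touching a file
np.testing.assert_allclose(bst.predict(X), bst2.predict(X))

Booster(model_str=...) calls model_from_string under the hood, so no temporary file is involved.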

@StrikerRUS (Collaborator)

@guolinke I'm afraid we will hit the RAM limit with such a big string on the CI side. For example, Travis' macOS has 4GB: https://docs.travis-ci.com/user/reference/overview/#virtualisation-environment-vs-operating-system.

@guolinke (Collaborator)

@StrikerRUS Maybe we can run the test only on Linux?

@PeterPann23

Hi,
I still regularly get the error; sometimes simply restarting the job allows it to end normally.

The log 201904.28 093411_log.txt (https://github.com/Microsoft/LightGBM/files/3127233/201904.28.093411_log.txt) shows the activity up to the crash.

My Console output:

Input file H:\MLData\Training\H0023.csv will be used
Training started at 4/28/2019 9:34:11 AM
 Memory usage       : current 17.43 MB, max. 30.59 GB
 1 Light GBM        : Start 28.04.2019 09:34:12
Status update       10:48:35: Start training LightGBM on 1,957,913 rows of data
Exception(s): Non critical error: LightGBM Error, code is -1, error message is 'bad allocation'.
Log: 12:12:57:[Source=LightGBMMulticlass; Loading data for LightGBM, Kind=Trace] Channel disposed. Elapsed 1.00:46:54.4866596.26:31.0423158.9218
..

// Note: Label, KeyColumn, Features, PredictedResult and trainingFile are defined elsewhere in the class.
var loader = mlContext.Data.CreateTextLoader(options: new TextLoader.Options()
{
    Columns = new[] {
        new TextLoader.Column(name: "Label", dataKind: DataKind.String, index: 0),
        new TextLoader.Column(name: "Features", dataKind: DataKind.Single, minIndex: 1, maxIndex: 40731)
    },
    HasHeader = false,
    Separators = new[] { '|' },
    UseThreads = true,
});
var dv = loader.Load(trainingFile.FullName);

dataset = mlContext.Data.TrainTestSplit(dv, testFraction: 0.1);
var start = DateTime.Now;

// Count the rows in the train/test splits.
(long? training, long? validating) montecarlo =
    (dataset.TrainSet.GetColumn<string>(Label).LongCount(),
     dataset.TestSet.GetColumn<string>(Label).LongCount());

var options = new LightGbmMulticlassTrainer.Options {
    LabelColumnName   = KeyColumn,
    FeatureColumnName = Features,
    Silent            = false,
    Verbose           = true,
    NumberOfThreads   = 8
};

var pipeline = mlContext.Transforms.Conversion.MapValueToKey(outputColumnName: KeyColumn, inputColumnName: Label)
    .Append(mlContext.MulticlassClassification.Trainers.LightGbm(options))
    .Append(mlContext.Transforms.CopyColumns(inputColumnName: KeyColumn, outputColumnName: nameof(PredictedResult.PredictedLabelIndex)));

this.OnUpdate?.Invoke($"Start training LightGBM on {montecarlo.training.Value:N0} rows of data");
// Train the model.
var model = pipeline.Fit(dataset.TrainSet);
