Error while loading model #5826

Closed
nathanhack opened this issue Jun 25, 2020 · 18 comments · Fixed by #5831

@nathanhack

After training a model for a long time, I saved it using

bst.save_model("xgboostModel.trail04.json")

I later tried to load it with the same params to continue training

filemodel = 'xgboostModel.trail04.json'
bst = xgb.train(param, xgTrain, num_boost_round=numOfRounds, evals=watchList, early_stopping_rounds=2000, xgb_model=filemodel)

but I get this error:

  File "/home/user/programs/pycharm/plugins/python-ce/helpers/pydev/pydevd.py", line 1438, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/user/programs/pycharm/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/user/projects/temp/pythontmp/lastPointClassifiers.py", line 270, in <module>
    trial04()
  File "/home/user/projects/temp/pythontmp/lastPointClassifiers.py", line 256, in trial04
    xgb_model=filemodel)
  File "/home/user/.local/lib/python3.7/site-packages/xgboost/training.py", line 212, in train
    xgb_model=xgb_model, callbacks=callbacks)
  File "/home/user/.local/lib/python3.7/site-packages/xgboost/training.py", line 37, in _train_internal
    model_file=xgb_model)
  File "/home/user/.local/lib/python3.7/site-packages/xgboost/core.py", line 1175, in __init__
    self.load_model(model_file)
  File "/home/user/.local/lib/python3.7/site-packages/xgboost/core.py", line 1804, in load_model
    self.handle, c_str(os_fspath(fname))))
  File "/home/user/.local/lib/python3.7/site-packages/xgboost/core.py", line 190, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [19:57:42] /workspace/src/common/json.cc:413: Expecting: ",", got: "

When I try the same exact thing but run it over a smaller number of rounds it loads fine and trains like a champ.

from os import path

import numpy as np
import xgboost as xgb


def Example():
    X, Y = getData(randomize=True)
    X = np.array(X)
    Y = np.array(Y)
    l = int(len(X) * .5)
    xgTrain = xgb.DMatrix(X[:l], label=Y[:l])
    xgTest = xgb.DMatrix(X[l:], label=Y[l:])
    param = {
        'objective': 'multi:softmax',
        'eta': 0.0001,
        'max_depth': 6,
        'nthread': 16,
        'num_class': 2,
        'tree_method': 'gpu_hist',
    }
    watchList = [(xgTrain, 'train'), (xgTest, 'test')]
    numOfRounds = 1000000
    filemodel = 'xgboostModel.trail04.json'

    print("fit()")
    if path.exists(filemodel):
        bst = xgb.train(param, xgTrain, num_boost_round=numOfRounds, evals=watchList, early_stopping_rounds=2000,
                        xgb_model=filemodel)
    else:
        bst = xgb.train(param, xgTrain, num_boost_round=numOfRounds, evals=watchList, early_stopping_rounds=2000)

    bst.save_model(filemodel)

@hcho3
Collaborator

hcho3 commented Jun 25, 2020

Can you share the content of getData() in your example?

@nathanhack
Author

@hcho3 I think it would be "easier" to share the model. Without any context to the data, I'm sure I can share the model. The only possible issue is it's 2.9GB. Would the model be helpful?
Something that may be of help is that this error still occurs even when using this method to load it.

    param = {
        'objective': 'multi:softmax',
        'eta': 0.0001,
        'max_depth': 6,
        'nthread': 16,
        'num_class': 2,
        'tree_method': 'gpu_hist',
    }
    bst = xgb.Booster(param)
    bst.load_model('xgboostModel.trail04.json')

@hcho3
Collaborator

hcho3 commented Jun 25, 2020

@nathanhack I'd like to see how XGBoost produces a potentially invalid JSON representation, given some data. Do you see the same error with other data?
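
One quick way to tell whether the saved file itself is malformed, or whether only the loader is at fault, is to parse it with an independent JSON parser. The sketch below uses Python's standard `json` module; the helper name `check_json_file` is hypothetical and not part of XGBoost. Note that `json.load` reads the whole file into memory, so a multi-gigabyte model needs a correspondingly large amount of RAM, and that Python's parser is slightly more lenient than strict JSON (it accepts `NaN` and `Infinity`).

```python
import json


def check_json_file(path):
    """Return (True, None) if the file parses as JSON, else (False, message)."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            json.load(f)
    except (json.JSONDecodeError, UnicodeDecodeError) as err:
        # The message includes the line, column, and character position.
        return False, str(err)
    return True, None
```

If this reports the file as valid while `load_model` still fails, the problem is more likely in how the file is read than in how it was written.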

@nathanhack
Author

@hcho3 I think I see what you're getting at. The weird thing is that the def Example() above is literally what was run to generate the problematic JSON. I have re-run it since with no problems (on fewer rounds). I use PyCharm, and I've looked through my local history to verify I didn't change anything between runs. I've tried recreating the issue, but so far I've been unable to reproduce it. Since I couldn't reproduce it, maybe it's a fluke, but I have switched to saving .bin files since then. I lost so much time on that one model (I should have trained in smaller rounds and made checkpoints), so I figured I should post it so someone with more experience might help figure out the bug.

That said, I'm sure I can give a few samples of input data if you think that will help figure it out.

@hcho3
Collaborator

hcho3 commented Jun 25, 2020

@nathanhack In that case, you should upload the model to Dropbox or a similar service and post a link.

@nathanhack
Author

@hcho3 https://file.io/utRYr3SY here's a link to the JSON file. Note that this link will only work one time. If you need it again, I can post a new link; just let me know.

@trivialfis
Member

Will look into it.

@hcho3
Collaborator

hcho3 commented Jun 26, 2020

@nathanhack I'm getting a 404.

@nathanhack
Author

@hcho3 This link is less ephemeral: https://drive.google.com/file/d/1bRcVE7wnbBhzYliH_9FSDjJMZdtQekBY/view?usp=sharing
Let me know if you can't get it.

@hcho3
Collaborator

hcho3 commented Jun 26, 2020

@nathanhack I tried loading the JSON on my machine and did not see any error. I used

import xgboost as xgb
param = {
    'objective': 'multi:softmax',
    'eta': 0.0001,
    'max_depth': 6,
    'nthread': 16,
    'num_class': 2,
    'tree_method': 'gpu_hist'
}
bst = xgb.Booster(param)
bst.load_model('xgboostModel.trail04.json')
print(bst)   # prints <xgboost.core.Booster object at 0x7f8033025450>

@nathanhack
Author

nathanhack commented Jun 27, 2020

@hcho3 I definitely do not get that result. Check out my screenshot of PyCharm running the code you gave.
Screenshot from 2020-06-26 21-01-22

Here are some of my system details:
Fedora 31
Python 3.7.7
Cuda 10.2
RTX 2080 TI

@hcho3
Collaborator

hcho3 commented Jun 27, 2020

How did you install XGBoost? Did you use pip or did you build from the source?

@nathanhack
Author

I used pip.
pip install --user xgboost

@hcho3
Collaborator

hcho3 commented Jun 27, 2020

@nathanhack I was using my MacBook (2019, macOS Catalina) the first time. When I ran the same program on a Linux machine (Ubuntu 18.04 LTS), it actually produced the same error:

xgboost.core.XGBoostError: [18:40:58] /workspace/src/common/json.cc:413: Expecting: ",", got: "

This is strange, since the Mac and Linux binaries have been built using the same source. I will investigate further.

@hcho3
Collaborator

hcho3 commented Jun 27, 2020

The error exists with the latest source too. Stacktrace:

Traceback (most recent call last):
  File "test.py", line 12, in <module>
    bst.load_model('xgboostModel.trail04.json')
  File "/home/phcho/tmp2/xgboost/python-package/xgboost/core.py", line 1501, in load_model
    self.handle, c_str(os_fspath(fname))))
  File "/home/phcho/tmp2/xgboost/python-package/xgboost/core.py", line 187, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [19:34:32] ../src/common/json.cc:449: Stack trace:
  [bt] (0) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::JsonReader::Error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) const+0x4ed) [0x7f70c2dbd177]
  [bt] (1) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::JsonReader::Expect(char, char)+0x135) [0x7f70c2dbfea9]
  [bt] (2) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::JsonReader::ParseArray()+0x105) [0x7f70c2dbdb1b]
  [bt] (3) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::JsonReader::Parse()+0x92) [0x7f70c2dbcacc]
  [bt] (4) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::JsonReader::ParseObject()+0x2e1) [0x7f70c2dbdecf]
  [bt] (5) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::JsonReader::Parse()+0x69) [0x7f70c2dbcaa3]
  [bt] (6) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::JsonReader::ParseArray()+0xbb) [0x7f70c2dbdad1]
  [bt] (7) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::JsonReader::Parse()+0x92) [0x7f70c2dbcacc]
  [bt] (8) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::JsonReader::ParseObject()+0x2e1) [0x7f70c2dbdecf]
  [bt] (9) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::JsonReader::Parse()+0x69) [0x7f70c2dbcaa3]
  [bt] (10) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::JsonReader::ParseObject()+0x2e1) [0x7f70c2dbdecf]
  [bt] (11) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::JsonReader::Parse()+0x69) [0x7f70c2dbcaa3]
  [bt] (12) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::JsonReader::ParseObject()+0x2e1) [0x7f70c2dbdecf]
  [bt] (13) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::JsonReader::Parse()+0x69) [0x7f70c2dbcaa3]
  [bt] (14) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::JsonReader::ParseObject()+0x2e1) [0x7f70c2dbdecf]
  [bt] (15) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::JsonReader::Parse()+0x69) [0x7f70c2dbcaa3]
  [bt] (16) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::JsonReader::Load()+0x32) [0x7f70c2dbcc6e]
  [bt] (17) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::Json::Load(xgboost::StringView)+0x57) [0x7f70c2dbe9f7]
  [bt] (18) /home/phcho/tmp2/xgboost/python-package/xgboost/../../lib/libxgboost.so(XGBoosterLoadModel+0x37f) [0x7f70c2d673ab]
  [bt] (19) /home/phcho/miniconda3/envs/foobar/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f70e7f529dd]
  [bt] (20) /home/phcho/miniconda3/envs/foobar/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f70e7f52067]
  [bt] (21) /home/phcho/miniconda3/envs/foobar/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f70e7d8b27e]
  [bt] (22) /home/phcho/miniconda3/envs/foobar/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f70e7d8bcb4]
  [bt] (23) python(_PyObject_FastCallKeywords+0x48b) [0x56467da9c00b]
  [bt] (24) python(_PyEval_EvalFrameDefault+0x51d1) [0x56467db009a1]
  [bt] (25) python(_PyFunction_FastCallKeywords+0xfb) [0x56467da9420b]
  [bt] (26) python(_PyEval_EvalFrameDefault+0x6a0) [0x56467dafbe70]
  [bt] (27) python(_PyEval_EvalCodeWithName+0x2f9) [0x56467da442b9]
  [bt] (28) python(PyEval_EvalCodeEx+0x44) [0x56467da451d4]
  [bt] (29) python(PyEval_EvalCode+0x1c) [0x56467da451fc]
  [bt] (30) python(+0x22bf44) [0x56467db5af44]
  [bt] (31) python(PyRun_FileExFlags+0xa1) [0x56467db652b1]
  [bt] (32) python(PyRun_SimpleFileExFlags+0x1c3) [0x56467db654a3]
  [bt] (33) python(+0x2375d5) [0x56467db665d5]
  [bt] (34) python(_Py_UnixMain+0x3c) [0x56467db666fc]
  [bt] (35) /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f70e7752b97]
  [bt] (36) python(+0x1dc3c0) [0x56467db0b3c0]

Expecting: ",", got: "\0", around character position: 2147479553
    6.56988\0\0\0\0\0\0\0\0\0
    ~~~~~~~^~~~~~~~~
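
Since the parser reports NUL bytes far into the input, it helps to check whether those bytes exist in the file on disk or only in the in-memory copy. The following is a minimal sketch using only the Python standard library; the function name `first_nul_offset` and the chunk size are illustrative choices, not anything from XGBoost.

```python
def first_nul_offset(path, chunk_size=1 << 20):
    """Return the 0-based byte offset of the first NUL byte, or -1 if none."""
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # Reached end of file without seeing a NUL byte.
                return -1
            idx = chunk.find(b"\x00")
            if idx != -1:
                return offset + idx
            offset += len(chunk)
```

A result of -1 for the model file would indicate that the file is clean and that the NUL bytes are introduced while loading it.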

@hcho3
Collaborator

hcho3 commented Jun 27, 2020

I found the root cause:

xgboost/src/common/io.cc

Lines 111 to 128 in e4f5b6c

#if defined(__unix__)
  struct stat fs;
  if (stat(fname.c_str(), &fs) != 0) {
    OpenErr();
  }
  size_t f_size_bytes = fs.st_size;
  buffer.resize(f_size_bytes + 1);
  int32_t fd = open(fname.c_str(), O_RDONLY);
#if defined(__linux__)
  posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
#endif  // defined(__linux__)
  ssize_t bytes_read = read(fd, &buffer[0], f_size_bytes);
  if (bytes_read < 0) {
    close(fd);
    ReadErr();
  }
  close(fd);

We use POSIX functions to load the JSON file into memory, and there is a bug that introduces NUL characters (value 0) into the string. I'm trying to diagnose the bug.

This explains the different behavior between Linux and macOS: on macOS, std::fread() is used instead; it works as expected and loads the JSON file correctly, without NUL characters.
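
A possible reading of the snippet above: it calls read() once and only checks for a negative return value. POSIX allows read() to return fewer bytes than requested, and on Linux a single call transfers at most 0x7ffff000 (2,147,479,552) bytes. The reported error position, 2147479553, is one past that limit, which would be consistent with the 2.9GB file being cut short and the remainder of the resized buffer being parsed as NUL bytes. This is an interpretation of the thread, not a description of the change made in #5831. The sketch below shows the usual remedy, a loop that keeps reading until the requested size has arrived; `read_exact` and its `max_chunk` parameter are hypothetical names used for illustration.

```python
import os


def read_exact(fd, size, max_chunk=0x7FFFF000):
    """Read up to `size` bytes from `fd`, looping over short reads."""
    parts = []
    remaining = size
    while remaining > 0:
        # os.read may return fewer bytes than requested, so keep going.
        chunk = os.read(fd, min(remaining, max_chunk))
        if not chunk:
            # End of file reached before `size` bytes arrived.
            break
        parts.append(chunk)
        remaining -= len(chunk)
    return b"".join(parts)
```

Passing a small `max_chunk` forces many short reads, which is a convenient way to exercise the loop without a multi-gigabyte file.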

@hcho3
Collaborator

hcho3 commented Jun 27, 2020

@nathanhack I submitted a pull request to fix the bug: #5831. It will be part of the next upcoming release (1.2.0). Thanks!

@nathanhack
Author

Thanks for the thorough investigation of my issue, and for the quick fix!
