RFC: JSON as Next-Generation Model Serialization Format #3980

Closed
hcho3 opened this issue Dec 8, 2018 · 48 comments

@hcho3
Collaborator

hcho3 commented Dec 8, 2018

RFC: JSON as Next-Generation Model Serialization Format

  1. Motivation: Why a new serialization format?
  2. Proposal: Use JSON serialization to ensure backward and forward compatibility
  3. Semantic versioning
  4. Other benefits of JSON
  5. Implementation of the proposal with DMLC JSON parser
  6. Addressing potential objections
  7. Appendix A: Full 1.0 Schema
  8. Appendix B: Representation of missing values for various data types

In this RFC document, I propose that the XGBoost project eventually migrate to JSON for serializing decision tree models.

Special thanks to @trivialfis for initially coming up with the idea of using JSON.

Motivation: Why a new serialization format?

The current method of serialization used in XGBoost is a binary dump of model parameters. For instance, the decision tree class (RegTree) is serialized as follows:

/* Adopted from include/xgboost/tree_model.h; simplified for presentation */
void Save(dmlc::Stream* fo) {
  // Save tree-wide parameters
  fo->Write(&param, sizeof(TreeParam));
  // Save tree nodes
  fo->Write(nodes_.data(), sizeof(Node) * nodes_.size());
  // Save per-node statistics
  fo->Write(stats_.data(), sizeof(NodeStat) * stats_.size());
  // Save optional leaf vector
  if (param.size_leaf_vector != 0) {
    fo->Write(leaf_vector_);
  }
}
void Load(dmlc::Stream* fi) {
  // Load tree-wide parameters
  fi->Read(&param, sizeof(TreeParam));
  nodes_.resize(param.num_nodes);
  stats_.resize(param.num_nodes);
  // Load tree nodes
  fi->Read(nodes_.data(), sizeof(Node) * nodes_.size());
  // Load per-node statistics
  fi->Read(stats_.data(), sizeof(NodeStat) * stats_.size());
  // Load optional leaf vector
  if (param.size_leaf_vector != 0) {
    fi->Read(&leaf_vector_);
  }
}

The binary serialization method has a few benefits: it is straightforward to implement, has low memory overhead (since we can read from and write to file streams directly), is fast to run (no pre-processing required), and produces compact model binaries.

Unfortunately, the method has one significant shortcoming: it is not possible to add new fields without breaking backward compatibility. Backward compatibility refers to the ability of a newer version of XGBoost to read model files produced by older versions of XGBoost. To see why we can't add new fields, let's refer to the decision tree class example shown above. The current NodeStat class has four fields:

/* NodeStat: Class to store statistics for a node in a decision tree
 * Version 1 */
struct NodeStat {
  // loss change caused by current split
  float loss_chg;
  // sum of hessian values, used to measure coverage of data
  float sum_hess;
  // weight of current node
  float base_weight;
  // number of leaf nodes among the children
  int leaf_child_cnt;
};

The total byte size of NodeStat is 16 bytes. Now imagine a fictitious scenario where we want to add a new field called instance_cnt. This field would store the number of instances (data points) that are associated with the node:

/* NodeStat: Class to store statistics for a node in a decision tree
 * Version 2 */
struct NodeStat {
  // loss change caused by current split
  float loss_chg;
  // sum of hessian values, used to measure coverage of data
  float sum_hess;
  // weight of current node
  float base_weight;
  // number of leaf nodes among the children
  int leaf_child_cnt;
  // NEW: number of instances associated with the node
  int64_t instance_cnt;
};

Note that the new version of NodeStat is now 24 bytes. Now we have already broken backward compatibility: when the latest version of XGBoost runs the snippet

// Load per-node statistics
fi->Read(stats_.data(), sizeof(NodeStat) * stats_.size());

it will attempt to read 24 * M bytes, where M is the number of nodes in the decision tree. However, if the saved model was serialized in an old version of XGBoost, there would be only 16 * M bytes to read in the serialized file! The program will either crash or show some kind of undefined behavior.

What would be the work-around? We can add extra logic in the snippet above to check which version of XGBoost produced the serialized file:

if ( /* if the model was generated with old version */ ) {
  for (size_t i = 0; i < stats_.size(); ++i) {
    NodeStatOld tmp;  // Version 1 of NodeStat
    fi->Read(&tmp, sizeof(NodeStatOld));
    stats_[i].loss_chg = tmp.loss_chg;
    stats_[i].sum_hess = tmp.sum_hess;
    stats_[i].base_weight = tmp.base_weight;
    stats_[i].leaf_child_cnt = tmp.leaf_child_cnt;
    // instance count is missing
    stats_[i].instance_cnt = -1;
  }
} else {  /* the model was generated with new version */
  fi->Read(stats_.data(), sizeof(NodeStat) * stats_.size());
}

That's a lot of lines for reading a single C++ vector. In general, the extra logic for backward compatibility accumulates over time and inflicts a growing maintenance burden on contributors. To make matters worse, the current XGBoost codebase does not save the version number, so there is no way to tell which version produced a serialized file. The proposed work-around is therefore not only messy but also infeasible.

Proposal: Use JSON serialization to ensure backward and forward compatibility

JSON (JavaScript Object Notation) is a light-weight serialization format built on two fundamental structures: JSON objects (a set of key-value pairs) and JSON arrays (an ordered list of values). In this section, I will argue that it is possible to use JSON as a future-proof method for ensuring backward and forward compatibility, defined as follows:

  • Backward compatibility: Later versions of XGBoost can read models produced by previous versions of XGBoost.
  • Forward compatibility: Previous versions of XGBoost can read models produced by later versions of XGBoost.

We will make use of two useful characteristics of JSON objects:

  • The presence of extra keys in a JSON object does not cause fatal errors.
  • If a JSON object is missing keys we expect, we can reliably detect this fact.

Let's return to the NodeStat example from the previous section. Version 1 of the NodeStat can be expressed as a JSON object as follows:

{
  "loss_chg": -1.5,
  "sum_hess": 12.4,
  "base_weight": 1.0,
  "leaf_child_cnt": 2
}

(The values are made up as an example.) Version 2 would be something like

{
  "loss_chg": -1.5,
  "sum_hess": 12.4,
  "base_weight": 1.0,
  "leaf_child_cnt": 2,
  "instance_cnt": 3230
}

Let's first check if backward compatibility holds. The latest version of XGBoost will attempt to read the keys loss_chg, sum_hess, base_weight, leaf_child_cnt, and instance_cnt from the given JSON file. If the JSON file was produced by an old version, it will only have loss_chg, sum_hess, base_weight, and leaf_child_cnt. The JSON parser is able to reliably detect that instance_cnt is missing. This is much better than having the program crash upon encountering missing bytes, as the binary parser would do. Upon detecting a missing key, we can mark the corresponding value as missing. For the NodeStat example, we'd store -1, since -1 can never be a valid instance count. For optional fields such as instance_cnt, this response is sufficient.
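
To make this concrete, here is a minimal sketch (not the actual XGBoost implementation) of how a reader could fill in the -1 marker when instance_cnt is absent, assuming the JSON object has already been parsed into an ordinary key-value map:

#include <cstdint>
#include <map>
#include <string>

struct NodeStat {
  float loss_chg;
  float sum_hess;
  float base_weight;
  int leaf_child_cnt;
  int64_t instance_cnt;
};

// obj holds the key-value pairs parsed from one JSON object.
NodeStat NodeStatFromJSON(const std::map<std::string, double>& obj) {
  NodeStat s;
  // Required keys: std::map::at throws if any of them is absent.
  s.loss_chg = static_cast<float>(obj.at("loss_chg"));
  s.sum_hess = static_cast<float>(obj.at("sum_hess"));
  s.base_weight = static_cast<float>(obj.at("base_weight"));
  s.leaf_child_cnt = static_cast<int>(obj.at("leaf_child_cnt"));
  // Optional key: use -1 as the "missing" marker described above.
  auto it = obj.find("instance_cnt");
  s.instance_cnt = (it == obj.end()) ? -1 : static_cast<int64_t>(it->second);
  return s;
}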

On the other hand, some fields are not so optional. For instance, a brand new feature could be added that makes use of new key-value pairs. In this case, we will need to throw an error whenever the user attempts to make use of the particular feature. We are still doing better than before, however, since the JSON parser does not have to perform this kind of error handling. The JSON parser would simply mark missing key-value pairs as missing, and later parts of the codebase that use the missing pairs may choose to report errors. We are providing what is known as graceful degradation, where users can bring old model files and still use basic functionalities.

How about forward compatibility? The JSON file given by the later version of XGBoost will contain the extra key instance_cnt, which can be simply ignored by the older version. So forward compatibility holds as well.

Freed from future compatibility issues, developers will be able to focus on more interesting development work.

Semantic versioning

Every JSON file produced by XGBoost should have major and minor version fields at the root level:

{
  "major_version": 1,
  "minor_version": 0,
  
  ... other key-value pairs ...
}

All versions that share the same major version are compatible with each other. For example, a program using format version 1.1 can read a model with any version of the form 1.x. The minor version should be increased whenever a new field is added to a JSON object.

The major version should be kept at 1, unless a major revision occurs that breaks backward and/or forward compatibility. So version 2.x will not be compatible with 1.x.
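
As an illustration only (not part of the proposed schema or draft implementation), the compatibility rule could be enforced with a check like the following, where kMajorVersion and kMinorVersion denote the format version understood by the running build:

#include <stdexcept>
#include <string>

constexpr int kMajorVersion = 1;  // format version understood by this build
constexpr int kMinorVersion = 0;

void CheckFormatVersion(int file_major, int file_minor) {
  if (file_major != kMajorVersion) {
    // e.g. a 2.x file is not guaranteed to be readable by a 1.x reader
    throw std::runtime_error("Incompatible model format version "
                             + std::to_string(file_major) + "."
                             + std::to_string(file_minor));
  }
  // Same major version: accept any minor version. A newer minor version only
  // means the file may contain keys unknown to this build, which are ignored.
}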

See Appendix A for the full schema of Version 1.0.

Other benefits of JSON

Compatibility is the largest motivator for me to propose JSON as the next-generation serialization format. However, there are other benefits too:

  • JSON parsers are available in virtually every language and platform, so we get a parser and serializer for free (in terms of programmer effort) in many languages. It will be easy to parse XGBoost models in Python, Java, JavaScript, and many others. This fact was pointed out by @KOLANICH.
  • It is human-readable text, good for legibility.
  • DMLC parameter objects (dmlc::Parameter) are represented neatly as JSON objects, since parameter objects are just collections of key-value pairs. We are now free to add more fields to DMLC parameter objects.
  • Being text, it is endian-independent, provided we use UTF-8 for text encoding. I have received a report that a big-endian machine failed to load a binary serialized file that was copied from a little-endian machine.
  • JSON is relatively compact compared to other text-based formats (e.g. XML).

Implementation of the proposal with DMLC JSON parser

There are many excellent 3rd-party JSON libraries for C++, e.g. Tencent/RapidJSON. However, we will eschew 3rd-party libraries. As @tqchen pointed out, extra dependencies make it more difficult to port XGBoost to various platforms and targets. In addition, most JSON libraries assume free-form, schema-less objects, where JSON objects and arrays may nest each other arbitrarily (thanks @KOLANICH and @tqchen for pointing this out). This assumption adds unnecessary overhead, since our proposed JSON XGBoost format has a regular schema.

Fortunately, the DMLC-Core repository comes with a built-in JSON parser and serializer. What's more, the DMLC JSON parser (dmlc::JSONReader) lets us make assumptions about how JSON objects and arrays are to be nested. For instance, if we expect a JSON object that contains a single key-value pair with the key foo whose value is an array of integers, we'd write:

dmlc::JSONReader reader(&input_stream);
std::string key;
reader.BeginObject();  // Assume: the top-level object is JSON object, not array
assert(reader.NextObjectItem(&key));  // Assume: there is one key
assert(key == "foo");  // Assume: the key is "foo"
reader.BeginArray();  // Assume: the value for "foo" is an array

std::vector<int> foo_values;
while (reader.NextArrayItem()) {
  int val;
  reader.Read(&val);  // Assume: each element in the array is an integer
  foo_values.push_back(val);
}
assert(!reader.NextObjectItem(&key));  // Assume: there is only one key

One tricky part is that the built-in helper class for reading structures will throw errors upon encountering missing or extra key-value pairs. I will write a custom helper class as part of the draft implementation of the RFC. The custom helper class will handle missing key-value pairs in a consistent way. Appendix B describes how the missing values are to be represented for each data type.
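
For illustration, here is a rough sketch of the kind of tolerant reading such a helper would enable, using only the dmlc::JSONReader calls shown above (the function itself is hypothetical, not existing DMLC code):

#include <map>
#include <string>

#include <dmlc/json.h>

// Read every key-value pair of one JSON object whose values are floats.
// Extra (unknown) keys are simply kept; the caller decides afterwards which
// keys are required and which missing keys get the "missing" marker.
std::map<std::string, float> ReadFloatObject(dmlc::JSONReader* reader) {
  std::map<std::string, float> fields;
  std::string key;
  reader->BeginObject();
  while (reader->NextObjectItem(&key)) {
    float value;
    reader->Read(&value);  // Assume: every value in this object is a float
    fields[key] = value;
  }
  return fields;
}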

Addressing potential objections

Q. How about space efficiency? JSON will result in bigger model files than the current binary serialization method.

A. There are a few 3rd-party libraries (e.g. MessagePack) that let us encode JSON into a more compact binary form. We will consider adding a plug-in for generating a binary encoding, so that 3rd-party dependencies remain optional. There will still be some overhead for storing integers and floating-point values as text, but in my opinion the cost is worthwhile; see the previous sections for the compatibility benefits of JSON.

Q. How about read/write performance? All the text operations will slow things down.

A. I admit that it is hard to beat the performance of binary serialization; how can you do better than simply dumping bit-patterns from memory? We are really making a trade-off here, trading read/write speed for future extensibility. However, we are able to soften the blow by serializing decision trees in parallel; see microsoft/LightGBM#1083 for an example.
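
For illustration, here is a sketch of that mitigation under the assumption that each tree can be written to its JSON text independently; SerializeTree is a hypothetical stand-in for the per-tree writer:

#include <string>
#include <vector>

std::string SerializeTree(int tree_index);  // hypothetical per-tree JSON writer

std::string SerializeForest(int num_trees) {
  std::vector<std::string> pieces(num_trees);
  // Trees are independent, so their text can be produced in parallel.
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < num_trees; ++i) {
    pieces[i] = SerializeTree(i);
  }
  // Concatenation stays sequential to preserve tree order.
  std::string out = "[";
  for (int i = 0; i < num_trees; ++i) {
    out += pieces[i];
    if (i + 1 != num_trees) out += ",";
  }
  out += "]";
  return out;
}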

Q. Binary formats are not necessarily closed to future extension. With some care, it is possible to design a new binary format with future extensibility in mind.

A. It is certainly possible to design a binary format to enable future extensions. In fact, had the current version of XGBoost stored the version number in model artifacts, we could have added extra logic for handling multiple formats. However, I still think JSON is superior, for two reasons: 1) designing an extensible binary format takes great precision and care to get right; and 2) extra logic for managing multiple versions in a custom parser can be messy, as we saw in the first section.

Appendix A: Full 1.0 Schema

The full schema can be accessed at https://xgboost-json-schema.readthedocs.io/en/latest/. The source is hosted at https://github.com/hcho3/xgboost-json-schema.

Appendix B: Representation of missing values for various data types

  • String: ""
  • Floating-point: nan
  • Integer: std::numeric_limits<int_type>::max()
  • JSON Object/Array: null (null object)
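
For illustration, these conventions could be written down in C++ as follows (the constant and function names are hypothetical; only the sentinel values come from the list above):

#include <cmath>
#include <cstdint>
#include <limits>
#include <string>

const float kMissingFloat = std::numeric_limits<float>::quiet_NaN();
const int32_t kMissingInt32 = std::numeric_limits<int32_t>::max();
const std::string kMissingString = "";

bool IsMissing(float v) { return std::isnan(v); }
bool IsMissing(int32_t v) { return v == kMissingInt32; }
bool IsMissing(const std::string& v) { return v.empty(); }
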
@trivialfis
Member

I have a few more problems that need to be addressed, mostly related to parameters. Introducing JSON is kind of part of the plan. It won't have any conflict with the current proposal (nice job). Will add proper comments as soon as possible. :) And it's really nice to have an organized RFC, it's much better than my original attempt.

@KOLANICH
Contributor

KOLANICH commented Dec 8, 2018

(e.g. MessagePack)

... and CBOR

@tqchen
Member

tqchen commented Dec 8, 2018

@hcho3 can you please elaborate a bit more on how we would store the tree structure, and give a minimal example?

@hcho3
Collaborator Author

hcho3 commented Dec 8, 2018

@tqchen I’ll put up the full schema soon, and as part of that will describe how tree structures will be stored.

@trivialfis
Member

trivialfis commented Dec 9, 2018

Other models

migrate to JSON for serializing decision tree models.

Linear models are also supported; they are actually much easier to serialize than tree models. And I have a passion for adding new interesting algorithms.

Deprecating (changing) parameters

The RFC explicitly addresses adding new parameters. But in reality we might also deprecate parameters in the future, due to deprecated algorithm features, duplicated parameters across different components (n_gpus and gpu_id come to mind), or broken features disabled temporarily. Luckily, handling removed parameters is the reverse of adding new parameters, which is addressed in this RFC.

Backward compatibility

Reading an old model from a newer version of XGBoost. Parameters that were removed in the newer version but are present in the old model file can simply be ignored. This corresponds to the forward-compatibility situation when adding new parameters.

Forward compatibility

Reading a new model from an older version of XGBoost. This corresponds to the backward-compatibility situation when adding new parameters. Here the removed parameters, which are not present in the model file, appear as missing values. Since most parameters have default values, parameters not present in the model file (missing) can also simply be ignored.

Summary

  • Add: forward — the extra key can simply be ignored by the older version; backward — use default values to handle missing keys, or degrade gracefully.
  • Remove: forward — use default values to handle missing keys, or degrade gracefully; backward — the extra key can be ignored by the newer version.

Saving complete snapshot of XGBoost

Most of the C++ classes in XGBoost have their own parameters, for example objectives and split-evaluators. Currently these parameters are not saved in a unified way and, as a matter of fact, are rarely saved at all. My hope is that we add an IO interface to these classes and recurse over all of them when saving/loading XGBoost. For example, the learner class can call the Save/Load methods of the updater, then the updater calls Save/Load of the split-evaluators, and so on. Once finished and the call stack returns, we will have a complete snapshot of XGBoost's state.

Problems

This way we might actually need a schema-less representation. For example, MonotonicConstraint is one of the split-evaluators and contains array (std::vector) parameters, and we have a list of split-evaluators. It's not yet clear to me how to achieve this with dmlc::JSON* without a lot of boilerplate code. Please keep in mind that every added helper class is a fragment of code that's hard to integrate.

Parameter validation

This is a future plan. If there's a typo or an unused parameter in a user-specified model, XGBoost simply ignores it, which might lead to degraded performance or unexpected behavior. We (@RAMitchell) want to add extra checks for wrong/unused parameters. It's not yet clear to me how to achieve this, but the basic idea is to make each component responsible for checking/marking its own unique parameters, and to let a manager class (maybe the learner) check for unused parameters (those not marked by other components) after all components are done with their configuration. Maybe this logic can be incorporated into the handling of missing/extra parameters during model IO?

@hcho3 Feel free to let me know if I can help in any way.

@hcho3
Collaborator Author

hcho3 commented Dec 10, 2018

@trivialfis I like the idea of a complete snapshot, having performed a messy hack to preserve GPU predictor parameters (#3856). That said, we want to do it in a way that minimizes boilerplate code. I will re-read your comment when I complete the draft of the 1.0 schema. Mainly, I'm documenting what's currently being saved.

@hcho3
Collaborator Author

hcho3 commented Dec 10, 2018

@trivialfis Also, I used to be afraid of saving more things into the model, because of the future maintenance liability caused by binary serialization. I may be more comfortable with saving everything now that JSON serialization could forestall compatibility headaches.

@KOLANICH
Contributor

KOLANICH commented Dec 10, 2018

IMHO we should make XGBoost itself save the hyperparams used to create and evaluate the model (with the possibility to disable that), as well as the column names and types, keeping their order (with the possibility to disable that and/or to discard the stored names and use integers starting from 0, without reloading the model). Column names are needed in order to reshape a pandas DataFrame so that the columns have the same order the model was trained on (currently I store this info in separate files). But I still insist on keeping the data strictly schemed; no arbitrary stuff should be saved there, since it would result in bloated models.

@hcho3
Collaborator Author

hcho3 commented Dec 10, 2018

@KOLANICH I'm not so sure about that, since column names and types are external to the core C++ codebase. So far, we've had the Python / R / Java wrappers save this information.

@trivialfis @RAMitchell What do you think?

@trivialfis
Member

trivialfis commented Dec 10, 2018

@KOLANICH @hcho3

and columns names and types keeping their order

It's a viable suggestion that we manage all of this in C++; it may lead to cleaner data management. But that's a topic for another day, since we would have to first implement this management code and then implement the IO. Adding this feature later won't bring too many hassles, since we will have very good backward compatibility, as explained by @hcho3. A new issue might be desirable after this gets sorted out.

no arbitrary stuff should be saved there, since it would cause bloat.

You can make suggestions about which specific items should not be saved into the model file once there is an initial draft (I imagine there will be a long review process; you are welcome to join). Otherwise it's hard to argue what goes into the category of "arbitrary stuff".

@hcho3
Collaborator Author

hcho3 commented Dec 10, 2018

@trivialfis @tqchen I've uploaded the full schema as a PDF attachment.

@hcho3
Collaborator Author

hcho3 commented Dec 10, 2018

I may want to put up the schema doc in a GitHub repo, so that we can update the LaTeX source upon every schema update and have it automatically compiled into PDF.

@hcho3
Collaborator Author

hcho3 commented Dec 10, 2018

@trivialfis The PDF doc didn’t address your suggestion. Can you take a look and see how the complete snapshot idea can be implemented?

@trivialfis
Member

@hcho3 If you put the doc in a GitHub repo I can make PRs. Then you can comment on my changes. :)

@tqchen
Member

tqchen commented Dec 10, 2018

Let us just use markdown for documenting the schema; it should be in the docs eventually.

@KOLANICH
Contributor

otherwise it's hard to argue what goes into the category "arbitrary stuff".

I really meant arbitrary stuff there. For example, the weather on Venus or a full multi-GiB dataset. If we allowed devs to do such things effortlessly, they would do it, and this would result in heavily bloated model files.

Let us just use markdown for documenting schema, it should be in docs eventually

Why not use JSONSchema? It is a machine-readable spec, it can be automatically validated, and it can be rendered into documentation, including a nice interactive one.

@hcho3
Collaborator Author

hcho3 commented Dec 10, 2018

@KOLANICH

I really meant arbitrary stuff there. For example, the weather on Venus or a full multi-GiB dataset.

You can't really prevent users from saving arbitrary things in their JSON files. What we can do is enumerate which fields are recognized by XGBoost. My schema document does this. Stuff that's not recognized by the schema will be ignored.

Why not use JSONSchema?

I'm not sure how we could support the StringKeyValuePairCollection class, which allows for arbitrary key-value pairs.

For now, I'll be typing up the schema doc in RST.

@KOLANICH
Contributor

You can't really prevent users from saving arbitrary things in their JSON files.

Yes, we cannot. But we can discourage users from storing arbitrary data by not providing an API for adding and editing it. Then if someone wants to do it, they have to parse the resulting blob, add the data manually, and serialize it back, so it may be easier for them to store it in a separate file.

@hcho3
Collaborator Author

hcho3 commented Dec 11, 2018

@tqchen I've typeset the schema in RST: https://xgboost-json-schema.readthedocs.io/en/latest/

@trivialfis You can submit a pull request at https://github.com/hcho3/xgboost-json-schema

@hcho3
Collaborator Author

hcho3 commented Dec 12, 2018

Update: See hcho3/xgboost-json-schema#3 for a discussion on serializing a more complete snapshot of XGBoost.

@trivialfis
Member

Let me try to keep the threads together. Copied from hcho3/xgboost-json-schema#3.

Before I start working on it, please consider deprecating the current way we save the model. As specified, the JSON schema in the draft is essentially a UTF-8 version of the current binary format. Can we open up the possibility of introducing a schema that matches the complete XGBoost itself, rather than the old binary format?

For example:

The // in the JSON snippets below are comments for demonstration purposes and should not show up in an actual model file.

Let's take the Learner class in the draft as an example:

{
  "learner_model_param" : LearnerModelParam,
  "predictor_param" : PredictorParam,
  "name_obj" : string,
  "name_gbm" : string,
  "gbm" : GBM,
  "attributes" : StringKeyValuePairCollection,
  "eval_metrics" : [ array of string ],
  "count_poisson_max_delta_step" : floating-point
}

Here the draft specifies that we save predictor_param and count_poisson_max_delta_step, which don't belong to Learner itself. Instead I propose we save something like this for Learner:

{
  // This belongs to learner, hence handled by learner
  "LearnerTrainParam": { LearnerTrainParam },   
  // No `LearnerModelParameter`, we don't need it since JSON can save complete model.

  "predictor" : "gpu_predictor",
  "gpu_predictor" : {
    "GPUPredictionParam": { ... }
  },

  "gbm" : "gbtree",
  "gbtree" : { GBM },

  // This can also be an array, I won't argue which one is better.
  "eval_metrics" : {
    "merror": {...},
    "mae": {...}
  }
}

Update: Actually the predictor should be handled in gbm, but let's keep it here for the sake of demonstration.

For the actual IO of GPUPredictionParam, we leave it to gpu_predictor. The same goes for GBM.
To do that, we can implement something like this in the Learner class:

void Learner::Load(KVStore& kv_store) {
  std::string predictor_name = kv_store["predictor"].ToString();  // say "gpu_predictor" or "cpu_predictor"
  auto p_predictor = Predictor::Create(predictor_name);
  p_predictor->Load(kv_store[predictor_name]);

  // (.. other io ...)

  KVStore& eval_metrics = kv_store["eval_metrics"];
  std::vector<Metric> metrics (eval_metrics.ToMap().size());
  for (auto& m : metrics) {
    m.Load(eval_metrics);
  }
}

Inside Metric, let's say mae:

void Load(KVStore& kv_store) {
  KVStore self = kv_store["mae"];  // Look up itself by name.
  // load parameters from `self`
  // load other stuff if needed.
}

Motivation

The reasons I want to do it this way are:

  1. No extra_attributes. That was a remedy for not being able to add new fields; now we are able to do so.
  2. Modular. Each C++ class handles what it owns; once there is a need for special code to handle changes to the dumped model, Learner doesn't get bloated.
  3. Organized. This way it is much easier for a human to interpret the model, since it follows a hierarchy that matches XGBoost itself; what you see in the JSON file is what the internals of XGBoost look like.
  4. Complete. Described in (3).
  5. Other models. Related to (2). We can save the linear model easily, because it handles its own IO.
  6. Can incorporate existing configuration. Inside XGBoost, functions like Configure and Init are just another way of loading the class itself from parameters.

The most important one is (2).

What should be saved

As pointed out by @hcho3, we should draft a list of what goes into the final dump file. I will start working on it once these ideas are approved by the participants.

Possible objections

  1. Size of the model file.
    Most of the added fields are parameters. They are an important part of the model. A clean and complete representation should be worth the space.
  2. Schema-less
    Previously I used split_evaluator as an example in RFC: JSON as Next-Generation Model Serialization Format #3980 (comment). It's possible that we (@RAMitchell) remove or replace it with something simpler due to it not being compatible with GPU. So we should still have a schema, just slightly more complicated than the current one.

@trivialfis
Member

@hcho3 @KOLANICH @tqchen @thvasilo @RAMitchell Can you take a look if time allows?

@hcho3
Collaborator Author

hcho3 commented Dec 14, 2018

@trivialfis I cast my vote +1 for your proposal. I think it is possible to use dmlc::JSONReader directly, without the intermediate KVStore structure, as long as we use a schema in a form similar to my current proposal. I will upload a trial implementation once I'm done getting ready for the holidays.

hcho3 pinned this issue Dec 14, 2018
@trivialfis
Member

@hcho3 Let me take a shot at drafting the schema. :)

@hcho3
Collaborator Author

hcho3 commented Apr 5, 2019

@LeZhengThu No, this thread is just a proposal. It will be a while until this feature is implemented.

@LeZhengThu

@hcho3 Thanks. Hope this feature can go live soon. I like the idea very much.

@trivialfis
Member

@hcho3 The last time we talked about JSON, it was suggested that dmlc::Json* should be used instead of a third-party or self-designed JSON implementation. And I remember you had some thoughts about how to use dmlc::Json*?

@ras44
Contributor

ras44 commented May 2, 2019

One issue that might arise with JSON serialization is the treatment of serialized float values. Presumably xgboost will be aware that all decimal values in the JSON should be cast to floats; however, I believe most typical JSON parsers will treat the values as doubles, leading to discrepancies for downstream users who don't implement a post-parsing step to cast doubles to floats in the resulting object.

As @hcho3 shows, it is possible to do the round trip using max_digits10 digits, which guarantees that the float will be reproducible from the decimal text, but all parsed decimal values must be cast to float for the resulting object to represent the state of the model correctly. That is the step that is not guaranteed by a simple JSON parse.
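
To make the round trip concrete, here is a minimal sketch (0.1f is just an example value): serialize the float with max_digits10 significant digits, parse the text back as a double the way a generic JSON parser would, then cast the double to float:

#include <cassert>
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

int main() {
  float original = 0.1f;

  // Serialize with enough digits to make the float uniquely recoverable.
  std::ostringstream out;
  out << std::setprecision(std::numeric_limits<float>::max_digits10) << original;
  std::string text = out.str();

  // A typical JSON parser hands back a double ...
  double parsed = std::stod(text);
  // ... so the consumer must cast it to float to match xgboost's state.
  float recovered = static_cast<float>(parsed);

  assert(recovered == original);
  return 0;
}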

As @thvasilo points out, any downstream users like SHAP by @slundberg and Treelite will need to be aware of this treatment and implement a post-parsing step to cast any doubles to floats (and then treat them as floats in any further calculations, i.e. expf instead of exp, etc.).

As this is my first comment in this issue, I hope it is helpful!

ras44 added a commit to ras44/xgboost that referenced this issue May 2, 2019
2 additional digits are not needed to guarantee that casting the decimal representation will result in the same float, see dmlc#3980 (comment)
@hcho3
Collaborator Author

hcho3 commented May 2, 2019

@ras44 The JSON schema clearly says that all floating-point values are 32-bit: https://xgboost-json-schema.readthedocs.io/en/latest/#notations

hcho3 pushed a commit that referenced this issue May 3, 2019
2 additional digits are not needed to guarantee that casting the decimal representation will result in the same float, see #3980 (comment)
@ras44
Contributor

ras44 commented May 4, 2019

@hcho3 Thanks for your note. My main point is that if the JSON dump is to be used by anything other than xgboost (which will presumably implement a custom JSON parser that encodes/decodes floats according to the schema), the user will have to consider multiple aspects in order to produce the same results as xgboost. It appears this is already happening in issues such as #3960, #4060, #4097, and #4429. I am providing PR #4439 explaining some of the issues and considerations, in case it is of use for people who aim to work with a JSON dump.

@trivialfis
Member

@hcho3 I will go ahead and create an initial draft implementation based on the schema.

@trivialfis
Member

I will try to use dmlc::JSON, but add abstractions whenever needed.

@hcho3
Collaborator Author

hcho3 commented May 9, 2019

Thanks for the update. Sorry I haven't had a chance to tackle the implementation yet. Let me know if you need any help or feedback.

@RAMitchell
Member

I want to revisit some of this discussion around using external dependencies as our JSON engine. After some discussion with @trivialfis and looking at #5046, it seems clear to me that a simple, partial JSON implementation is possible in dmlc-core or xgboost, but it would be neither feature-complete nor fast. We are looking at thousands of lines of code for a custom implementation with good enough performance.

Achieving better performance than the naive implementation may turn out to be critical as our distributed implementations serialize during training.

I am normally strongly against dependencies, but in this case using a header-only JSON dependency may solve more problems than it creates.

@trivialfis
Member

trivialfis commented Nov 19, 2019

@chenqin
Contributor

chenqin commented Nov 19, 2019

I am inclined to use a customized JSON writer/reader, as long as it is well tested; introducing a dependency outside of our control could be a liability, with many unused features. Having JSON could also help simplify the rabit recovery protocol: a failed worker could recover from a flexibly sized key/value JSON payload from an adjacent host.

@KOLANICH
Contributor

using a header only json dependency

nlohmann/JSON is not especially performant.

@trivialfis
Member

@KOLANICH Indeed. It has a nice interface, but neither its memory usage nor its computation is efficient.

@trivialfis
Member

Closing, as experimental support with a schema is now mainlined. Further improvements will come as separate PRs.
