Is it necessary to stick to schema of current binary format? #3
Comments
With #4, the tree nodes will be more compact, so the issue is less severe.
Can we keep `split_evaluator`?
@hcho3 Sorry for the ambiguity. The correct word should be "replace": replace it with something simpler (anything without a virtual function pointer).
But otherwise I concur with saving the complete snapshot of XGBoost. We do want to be careful to provide a way for the user to override the saved configurations, so that, say, the user can load a model trained with GPU and use it on a machine without GPUs.
@trivialfis Ah I see. Let's come back to it later, to find a good way to simplify `split_evaluator`.
@trivialfis Back to the discussion: should every class be serialized? If not, which should be? Can we make a list?
I edited the comment just before your first reply :-)
No problem. At least the user needs to change `gpu_hist` to `hist`; other things can be configured by …
I will start working on this once my proposal gets approved by the other participants.
A little tidbit:
I think we still need …
@hcho3 Aha, totally forgot. I will review all parameters/objects when drafting the list.
@trivialfis Can you look at #4 as well, when you get a chance?
@hcho3 Sorry for the long delay. Should be able to make a PR in a few days.
Closing, as the draft is in #5.
Before I start working on it, please consider deprecating the current way we save the model. With the specified schema, the JSON file is just a `utf-8` version of the current binary format. Can we open up the possibility of introducing a schema that matches the complete XGBoost itself, rather than the old binary format?

For example, let's take the `Learner` class in the draft. (The `//` in the JSON code snippets below is a comment for demonstration purposes; it should not show up in an actual model file.)
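Roughly, the draft prescribes a shape like this (a sketch: the two field names below come from this discussion, all values and other names are made up for illustration):

```json
{
  "learner": {
    // in the draft, sub-component parameters sit directly under Learner,
    // mirroring the binary format (values are illustrative):
    "predictor_param": { "predictor": "gpu_predictor" },
    "count_posson_max_delta_step": "0.7"
    // ... plus many more fields that belong to other components ...
  }
}
```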
Here the draft specifies that we save `predictor_param` and `count_posson_max_delta_step`, which don't belong to `Learner` itself. Instead I propose we save something like this for `Learner`:
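A sketch of the proposal (names like `learner_param`, `num_feature`, and `n_gpus` are illustrative, not the draft's):

```json
{
  "learner": {
    // only what belongs to Learner itself:
    "learner_param": { "num_feature": "127" },
    // each sub-component serializes itself into its own section:
    "predictor": {
      "name": "gpu_predictor",
      // GPUPredictionParam is written and read by gpu_predictor itself
      "gpu_prediction_param": { "n_gpus": "1" }
    },
    "gbm": {
      "name": "gbtree"
      // gbtree saves its own parameters (and trees) here
    },
    "metrics": [ { "name": "mae" } ]
  }
}
```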
Update: Actually `predictor` should be handled in `gbm`, but let's keep it here for the sake of demonstration.

For the actual IO of `GPUPredictionParam`, we leave it to `gpu_predictor`. The same goes for `GBM`. As for how to do that, we can implement it in the `Learner` class:
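A minimal sketch of how that delegation could look, using a stand-in `Json` type rather than any actual XGBoost API:

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Stand-in JSON tree, just enough for the sketch; a real implementation
// would use an actual JSON library.
struct Json {
  std::string value;                     // leaf payload, if any
  std::map<std::string, Json> children;  // object members
};

// Every serializable component writes and reads only its own fields.
class Serializable {
 public:
  virtual ~Serializable() = default;
  virtual void Save(Json* out) const = 0;
  virtual void Load(Json const& in) = 0;
};

class GradientBooster : public Serializable {};  // gbtree, gblinear, ...
class Metric : public Serializable {};           // mae, rmse, ...

class Learner : public Serializable {
 public:
  void Save(Json* out) const override {
    // Fields that belong to Learner itself.
    out->children["learner_param"].value = learner_param_;
    // Everything else is delegated to the component that owns it.
    gbm_->Save(&out->children["gbm"]);
    for (auto const& m : metrics_) {
      Json entry;
      m->Save(&entry);
      // (the stand-in Json has no arrays, so metrics are keyed by name here)
      out->children["metrics"].children[entry.children["name"].value] = entry;
    }
  }

  void Load(Json const& in) override {
    auto it = in.children.find("learner_param");
    if (it != in.children.end()) { learner_param_ = it->second.value; }
    auto gbm = in.children.find("gbm");
    if (gbm != in.children.end()) { gbm_->Load(gbm->second); }
    // Each metric similarly restores its own section (omitted for brevity).
  }

 private:
  std::string learner_param_;  // placeholder for Learner's real parameters
  std::unique_ptr<GradientBooster> gbm_;
  std::vector<std::unique_ptr<Metric>> metrics_;
};
```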
Inside `Metric`, let's say `mae`:
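Continuing the same sketch, reusing the stand-in `Json` and `Metric` types from above:

```cpp
// Reuses the stand-in Json and Metric types from the previous sketch.
class MAEMetric : public Metric {
 public:
  void Save(Json* out) const override {
    // The metric identifies itself; any metric-specific state goes here.
    out->children["name"].value = "mae";
  }
  void Load(Json const& /*in*/) override {
    // Plain mae carries no extra state, so there is nothing to restore.
  }
};
```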
Motivation

The reasons I want to do it this way are:

1. `extra_attributes` goes away. That was a remedy for not being able to add new fields; now we are able to do so.
2. `Learner` doesn't get bloated.
3. Configuration. Inside XGBoost, functions like `Configure` and `Init` are just another way of loading the class itself from parameters (see the sketch after this list).

The most important one is (2).
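A conceptual sketch of reason (3), again with stand-in types rather than XGBoost's real classes:

```cpp
#include <map>
#include <string>

// Stand-in for user-supplied key/value parameters.
struct Params {
  std::map<std::string, std::string> kv;
};

class Component {
 public:
  virtual ~Component() = default;
  // Restoring from a saved model and configuring from parameters can be
  // the same operation: both populate the object's fields.
  virtual void Load(Params const& p) = 0;
  void Configure(Params const& p) { Load(p); }  // Configure == Load
};
```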
Possible objections
Most of the added fields are parameters. They are an important part of the model; a clean and complete representation should be worth the space.
Previously I used `split_evaluator` as an example in RFC: JSON as Next-Generation Model Serialization Format (dmlc/xgboost#3980 (comment)). It's possible that we (@RAMitchell) ~~remove~~ replace it with something simpler, due to it not being compatible with GPU. So we should still have a schema, just slightly more complicated than the current one.