[RFC] Callback interface for logging internal information, to aid debugging #4837
Related: TensorFlow lets users "poke" into internal quantities and visualize them in TensorBoard. https://www.tensorflow.org/guide/summaries_and_tensorboard
We might just start with something simple, like storing a structured CSV file alongside the model. That would make it easier to compare apples to apples. We can leave visualization empty and see how people actually want to interact with this information. On the debugging side, it could be interesting to see how folks set breakpoints on certain conditions (number of iterations, depth of tree, etc.). We may need to put more thought into where the best place is for Python users to access this internal info, and whether we allow users to manually overwrite those values.
@chenqin Yes, visualization can be left as future work. The main point of the proposal is to create a callback interface where a user-defined function is called to log internal information. (We can build CSV serialization logic on top of the callback interface.) For now, this proposal does not consider an interactive debugger. The internal info will be strictly read-only, so the user will run a full training job and then analyze the log file after the fact.
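To make the "CSV serialization on top of the callback interface" idea concrete, here is a minimal sketch of a serializer that could be plugged in as the callback body. All names here are illustrative assumptions; nothing like this exists in XGBoost today:

```cpp
#include <cstdint>
#include <ostream>

// Hypothetical CSV serializer for the proposed callback convention:
// each invocation appends one row per array element, recording the
// quantity's keyword, the element's flat index, and its value.
void write_csv_rows(std::ostream& out, const char* keyword,
                    const float* array, const uint64_t* shape, int dim) {
  uint64_t n = 1;
  for (int i = 0; i < dim; ++i) n *= shape[i];  // total element count
  for (uint64_t i = 0; i < n; ++i) {
    out << keyword << ',' << i << ',' << array[i] << '\n';
  }
}
```

A real implementation would also want to record the boosting iteration in each row, so that runs of different XGBoost versions can be diffed row-by-row.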
@hcho3 I can add a "vomit" option to verbosity if you want. Then it can log stuff into a file or to stdout, and we can redirect that to a file. I'm not sure how a Python callback can extract arbitrary information from C++.
The Python callback will invoke a C API function.
This might be easier to manage than callbacks. Let me think this through.
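One way the "Python callback invokes a C API function" path could look on the C++ side is a registration function plus an internal emit helper, so the booster only pays for logging when a callback is installed. These function names (XGBoosterSetInfoCallback, EmitInfo) are invented for this sketch and are not part of XGBoost's actual C API:

```cpp
#include <cstdint>

extern "C" {

// Hypothetical callback type matching the proposed signature.
typedef void (*XGBInfoCallback)(const char* keyword, const float* array,
                                const uint64_t* shape, int dim);

static XGBInfoCallback registered_cb = nullptr;

// Hypothetical C API entry point: register (or clear, with nullptr)
// the user's logging callback. Returns 0 on success, following the
// convention of XGBoost's existing C API.
int XGBoosterSetInfoCallback(XGBInfoCallback cb) {
  registered_cb = cb;
  return 0;
}

}  // extern "C"

// Inside the training loop, internal code would emit quantities like so;
// with no callback registered, this is a cheap no-op.
inline void EmitInfo(const char* keyword, const float* array,
                     const uint64_t* shape, int dim) {
  if (registered_cb != nullptr) {
    registered_cb(keyword, array, shape, dim);
  }
}
```

The Python wrapper would pass a ctypes-wrapped function pointer into the registration call, the same mechanism the existing Python callbacks use to cross the language boundary.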
Recently, I've seen many users report regressions in model performance (e.g. accuracy, AUC) when they upgraded XGBoost to the latest version. For example, see https://discuss.xgboost.ai/t/learning-with-xgboost-0-90-vs-1-0-0/1068/5.
I'd like to work with @chenqin to set up regression tests. As part of that, we need more than just traces of evaluation metrics (accuracy, AUC, etc.), since the metrics capture only aggregate information. Much useful information, such as gradient values, split scores, and quantile sketches, remains hidden deep within the C++ codebase.
I propose creating a callback interface to expose useful information for verbose logging. Once we can log these pieces of information, we can use them to detect potential regressions and troubleshoot anomalies. Logging traces of non-aggregate quantities will let us take a closer look at the internal workings of XGBoost. We could even build a "debugger" to help users find out why their XGBoost models are performing poorly.
For each of the quantities listed below, XGBoost will call the callback function with the quantity as an argument. Here are the possible callback interfaces (there is more than one, since the quantities have different types):
void callback(const char* keyword, const float* array, const uint64_t* shape, int dim)
void callback(const char* keyword, const double* array, const uint64_t* shape, int dim)
void callback(const char* keyword, const GradientPair* array, const uint64_t* shape, int dim)
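A consumer of these signatures treats the flat array as a tensor with dim axes described by shape. As a sketch of how a regression test might use this (the function names here are illustrative assumptions), one could compute a stable aggregate per quantity and compare it across XGBoost versions:

```cpp
#include <cstdint>
#include <cstring>

// Interpret the proposed (array, shape, dim) convention: shape holds
// dim extents, and the array stores their product's worth of elements
// in row-major order. Compute a simple aggregate over all of them.
double mean_of(const float* array, const uint64_t* shape, int dim) {
  uint64_t n = 1;
  for (int i = 0; i < dim; ++i) n *= shape[i];
  double sum = 0.0;
  for (uint64_t i = 0; i < n; ++i) sum += array[i];
  return n > 0 ? sum / static_cast<double>(n) : 0.0;
}

// Example callback: a regression test could assert that this aggregate
// stays stable across versions for a given keyword, e.g. "gradient".
void callback(const char* keyword, const float* array,
              const uint64_t* shape, int dim) {
  if (std::strcmp(keyword, "gradient") == 0) {
    double mean_grad = mean_of(array, shape, dim);
    (void)mean_grad;  // record or compare against a golden value here
  }
}
```

Per-element traces (rather than aggregates) would be handled the same way, just without the reduction, which is exactly the point of logging non-aggregate quantities.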
Here is what I envision to be the user experience:
Each quantity will have a unique keyword associated with it, e.g. "gradient", "hessian", "histogram", "quantile_sketch", etc.
Requirements
List of quantities we should log:
- yhat_i: Predicted labels (model outputs) for data points (per data point, per boosting iteration)
- x_i: Input
- l(y_i, yhat_i): Loss value ("residual") computed from true and predicted labels (per data point, per boosting iteration)