Ability to "refine" (categorical-) valid value spaces along the transformer pipeline #411
TLDR: The crux of the matter is that your sub-pipeline for categorical features contains two (meta-)transformers that collect and present categorical valid value space (VVS) information - first `CategoricalDomain`, then `PMMLLabelEncoder`. In principle, it should be OK to assume that it's safe to "narrow" the VVS down the line. As things stand, however, the VVS that is defined by passing `data_values` to `CategoricalDomain` breaks the export as soon as a downstream transformer reports a different VVS.
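For reference, a minimal sketch of such a sub-pipeline (the `color` column and its values are hypothetical, and the exact shape of `data_values` depends on the SkLearn2PMML version):

```python
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import CategoricalDomain
from sklearn2pmml.preprocessing import PMMLLabelEncoder

# Both steps carry VVS information: CategoricalDomain via its data_values(_)
# attribute, PMMLLabelEncoder via its classes_ attribute (collected during fitting)
mapper = DataFrameMapper([
    (["color"], [CategoricalDomain(data_values = ["red", "green", "blue"]), PMMLLabelEncoder()])
])
```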
This issue is raised in relation to LightGBM models (as opposed to built-in SkLearn models). LightGBM is able to auto-detect and handle categorical features without any kind of external binarization or one-hot encoding. Therefore, please don't use the `PMMLLabelEncoder` transformer here. You should replace it with a `CastTransformer("category")` step:

```python
mapper = DataFrameMapper(
    [([cat_col], [CategoricalDomain(...), CastTransformer("category")]) for cat_col in cat_cols]
)
```

This is the way today! However, looking into JPMML-SkLearn library code, I have a suspicion that the transformer sequence `[CategoricalDomain, CastTransformer]` would trip the same sanity check.

Looking more into JPMML-SkLearn code, it should be possible to suppress this "VVS sanity check" by inserting some kind of "dummy" transformation between the current two transformers - the idea is to temporarily obfuscate the nature of the feature. Perhaps a dummy no-op transformer would suffice.

A long-term solution would be to add proper support for "refining" VVSes along the transformer pipeline.
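One possible (untested) reading of the "dummy transformation" idea, sketched with `ExpressionTransformer` used as an identity step - whether this actually suppresses the sanity check is an assumption:

```python
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import CategoricalDomain
from sklearn2pmml.preprocessing import CastTransformer, ExpressionTransformer

cat_cols = ["color"]  # hypothetical column name

# ExpressionTransformer("X[0]") passes each value through unchanged; the hope
# is that it hides the upstream CategoricalDomain from the VVS sanity check
mapper = DataFrameMapper(
    [([cat_col], [CategoricalDomain(), ExpressionTransformer("X[0]"), CastTransformer("category")])
        for cat_col in cat_cols]
)
```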
EDIT: This comment was written before I saw your second comment!

@vruusmann From what I can see, I can do one of the following:

On a tangential note: in my existing PMML exports, I see no explicit trace of `PMMLLabelEncoder`. With this in mind, I have three follow-up questions:
The JPMML-SkLearn library follows the PMML data model here. The PMML representation always deals with the real-life data schema (ie. string category values), not any transformations thereof (ie. value indexes after some encoding scheme).

Pay attention to the so-called "operational type" (optype) of the field. In your case - categorical features - the ordering of values is not important, meaning that you can sort this list in any way you wish, and insert new list elements in any place.
@vruusmann I will try adding a dummy transformation between the two transformers and report back.
While you're still using `PMMLLabelEncoder`, there is a hackish workaround available.

Workflow:

This hackish workflow should solve the above IllegalArgumentException, because the JPMML-SkLearn library will see identical VVSes on both transformers.
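In code, such a post-fit patch could look roughly like this sketch (assuming the fitted `categorical_domain` and `pmml_label_encoder` objects are still reachable; all variable names are hypothetical):

```python
from sklearn2pmml import sklearn2pmml

# 1. Fit the pipeline as usual
pipeline.fit(X, y)

# 2. Make the label encoder advertise the same (full) VVS as the domain
#    decorator, so that both transformers agree at export time
pmml_label_encoder.classes_ = categorical_domain.data_values_.copy()

# 3. Export to PMML
sklearn2pmml(pipeline, "Pipeline.pmml")
```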
I tried the hackish workflow before trying the `CastTransformer` approach:

```python
pmml_label_encoder.classes_ = categorical_domain.data_values_.copy()
```

I will try incorporating the model into a Java application in the next few days and update you with the results - hopefully there won't be any issues. Meanwhile, I will also try the `CastTransformer` approach.
That's probably a bit too naive... Fundamentally, you want to keep the original sub-list of seen values intact. Perhaps it's better to update `CategoricalDomain.data_values_` as well, so that the two transformers stay in sync. In your workflow, the wholesale copy risks reordering the values that the model was actually fitted on.
I understand the concern you raised, and I was afraid that this approach would be too naive, but wanted to try it out nevertheless.

Regarding your suggestion - how could I meaningfully edit `data_values_`? I will try a less naive approach, though: appending the extra values after the already-seen ones, on both transformers.

Will update you with the results as soon as I'm done :)
That should be the correct way of doing things.

A value list has two logical parts - first the sub-list of values that are actually seen by Scikit-Learn when fitting the model, and second the sub-list of valid values that were declared but never seen. The first sub-list is equal to `PMMLLabelEncoder.classes_`. Use whatever approach you wish, but you must not mess up the first sub-list!

I think I'll go forward and implement proper VVS refinement support in the JPMML-SkLearn library.
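For illustration, an append-only merge that respects this invariant could look like this sketch (the helper name is hypothetical):

```python
import numpy as np

def extend_vvs(label_encoder, full_value_space):
    # Part 1: the values actually seen during fitting - must stay first, in order
    seen = list(label_encoder.classes_)
    # Part 2: declared-but-unseen valid values - appended after the seen ones
    unseen = [v for v in full_value_space if v not in seen]
    label_encoder.classes_ = np.array(seen + unseen)
```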
Update

I have tried the `CastTransformer("category")` approach, but the pipeline now fails during fitting:

```
Traceback (most recent call last):
  File "/path/to/venv/site-packages/sklearn_pandas/pipeline.py", line 24, in _call_fit
    return fit_method(X, y, **kwargs)
  File "/path/to/venv/site-packages/sklearn/utils/_set_output.py", line 157, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/path/to/venv/site-packages/sklearn/base.py", line 919, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/path/to/venv/site-packages/sklearn/utils/_set_output.py", line 157, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/path/to/venv/site-packages/sklearn2pmml/preprocessing/__init__.py", line 110, in transform
    return cast(X, self.dtype_)
  File "/path/to/venv/site-packages/sklearn2pmml/util/__init__.py", line 26, in cast
    return X.astype(dtype)
TypeError: Cannot interpret 'CategoricalDtype(categories=[LIST OF CATEGORY VALUES IN TRAIN SET], ordered=False)' as a data type
```
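For what it's worth, this `TypeError` is characteristic of calling `.astype()` with a pandas extension dtype on a plain NumPy array - only pandas containers understand extension dtypes. A minimal reproduction (values hypothetical):

```python
import numpy as np
import pandas as pd

dtype = pd.CategoricalDtype(categories = ["a", "b"], ordered = False)

pd.Series(["a", "b"]).astype(dtype)  # works: pandas understands extension dtypes
np.array(["a", "b"]).astype(dtype)   # TypeError: Cannot interpret 'CategoricalDtype(...)' as a data type
```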
The following two hacky-patch-after-training approaches seem to work (training, verification, exporting), but I haven't had the time to properly test their integration into Java.

Approach 1:

```python
pmml_label_encoder.classes_ = categorical_domain.data_values_
```

Approach 2:

```python
import numpy as np

# see code snippet in comments above for a more detailed breakdown
dfmapper_cat_cols = [(col_name, [CategoricalDomain(...), PMMLLabelEncoder()]) for col_name in CAT_COLS]
# ...create DataFrameMapper...
# ...create PMMLPipeline as (mapper, classifier) sequence...
# ...fit PMMLPipeline...

# Hacky-Patch-After-Training:
for col_name, transformers in dfmapper_cat_cols:
    categorical_domain, label_encoder = transformers
    existing_classes = label_encoder.classes_.tolist()
    new_classes = existing_classes.copy()
    for value in categorical_domain.data_values_:
        if value not in existing_classes:
            new_classes.append(value)
    label_encoder.classes_ = np.array(new_classes)
    categorical_domain.data_values_ = np.array(new_classes)
# ...export PMMLPipeline...
```

Once again, @vruusmann thank you very much for your support! 😄
I've got this issue effectively solved in my local codebase. Will probably push to GitHub later today or tomorrow. Much to my surprise, the fix turned out to be a rather small one.

TLDR: The recommended workflow for training LightGBM models in early 2024:

```python
from lightgbm import LGBMClassifier
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline

mapper = DataFrameMapper(
    [([cat_col], CategoricalDomain(dtype = "category")) for cat_col in cat_cols] +
    [([cont_col], ContinuousDomain()) for cont_col in cont_cols]
, input_df = True, df_out = True)

classifier = LGBMClassifier(n_estimators = 11, max_depth = 3)

pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("classifier", classifier)
])
pipeline.fit(X, y)

sklearn2pmml(pipeline, "Pipeline.pmml")
```

There is no need for any `PMMLLabelEncoder` or `CastTransformer` steps anymore - the `dtype = "category"` argument of the domain decorator takes care of the cast.
You cannot call `CastTransformer` with a pandas categorical data type on a plain NumPy array. Refactor your pipeline, so that the cast to the "category" data type happens inside the domain decorator. See my above Python code example - I'm specifying `dtype = "category"` right in the `CategoricalDomain` constructor.
I've just released SkLearn2PMML 0.103.1 to PyPI, which supports the above "recommended workflow" (see #411 (comment)).

As for using custom valid value spaces, set the `data_values` parameter of the `CategoricalDomain` constructor. Please note that since the decorator maps a list of columns, the `data_values` value must be specified as a list of lists - note the double square brackets below.

For example:

```python
from sklearn2pmml.decoration import CategoricalDomain

cat_domain = CategoricalDomain(data_values = [[...]], dtype = "category")
```

@mbicanic YOU OWE ME BIG TIME NOW!
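For instance, a filled-in version of the same call for a single mapped column (the `color` column and its category values are hypothetical):

```python
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import CategoricalDomain

# One inner list per mapped column - hence the double square brackets
cat_domain = CategoricalDomain(data_values = [["red", "green", "blue"]], dtype = "category")

mapper = DataFrameMapper([
    (["color"], cat_domain)
])
```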
This makes sense, thanks for the help! However, with the hacky solution in hand, and now the official fix, I won't be trying it out. Besides, as you said, it's likely the pipeline would bail out again due to a mismatch between the VVSes of the two transformers.
That's a very fast reaction time, kudos!

Is this true even if I'm defining a scalar (single-column) mapping, like below?

```python
# notice that the first tuple element is `cat_col`, not `[cat_col]`
mapper = DataFrameMapper(
    [(cat_col, CategoricalDomain(dtype = "category", data_values = [val1, val2, ...])) for cat_col in cat_cols]
)
```
Indeed I do - much appreciated! 👏 👏
I added a small clarification & code example into the v0.103.1 release notes: you may need to adjust the "dimensionality" of the `data_values` value depending on whether the mapping targets a scalar column (`cat_col`) or a list of columns (`[cat_col]`).
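Side by side, the two "dimensionalities" would look like this sketch (hypothetical `color` column, mirroring the release-note clarification as I understand it):

```python
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import CategoricalDomain

# Scalar column mapping ("color"): data_values is a flat, 1-D list
mapper_scalar = DataFrameMapper([
    ("color", CategoricalDomain(dtype = "category", data_values = ["red", "green", "blue"]))
])

# List-of-columns mapping (["color"]): data_values is a list of lists (2-D),
# with one inner list per mapped column
mapper_list = DataFrameMapper([
    (["color"], CategoricalDomain(dtype = "category", data_values = [["red", "green", "blue"]]))
])
```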
(Removed off-topic comment)
Hello!
I am facing problems with training and exporting a LightGBM model due to the categorical features.
The dataset is unfortunately proprietary, so I cannot show what it looks like, but the most important dataset properties are:
Problem Description
After reading the discussion in #300, it is my understanding that the `sklearn2pmml.decoration.CategoricalDomain` Python class exposes a `data_values` parameter in the constructor, through which we can specify the full value space of categorical features.

However, when I specify the full value space via the `data_values` parameter, I get an error during the export to PMML. The stack trace of the error is provided as a screenshot below:
Note: the export works normally when I omit the `data_values` parameter, and everything runs smoothly in the inference phase on the Java side, since I am specifying how to handle invalid and missing values. However, in my situation it is not acceptable to interpret valid but unseen values as invalid values. In other words, I use the `invalid_value_replacement` and `invalid_value_treatment` parameters only as an ad-hoc fix to avoid things breaking down in the inference phase. I know all possible categorical values in advance, and thus don't expect any invalid values.
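For illustration, the ad-hoc configuration described above looks roughly like this sketch (the treatment and replacement values here are made-up placeholders):

```python
from sklearn2pmml.decoration import CategoricalDomain

# Unseen-but-valid values get flagged as invalid and replaced with a known
# category value, instead of being declared valid via data_values
cat_domain = CategoricalDomain(
    invalid_value_treatment = "as_value",
    invalid_value_replacement = "some_frequent_category"
)
```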
Environment (debug=True)

I ran the training pipeline with `debug=True`, and this is the output:

Note: I am aware Python 3.8 is very old at this point, however I cannot upgrade to a higher version due to company policy.
My Code (Training)
The code used to wrap and train the model is the following (rewritten manually from the server - I apologize for possible typos and errors):
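In outline, it follows this structure - a sketch with hypothetical column names, file names, and placeholder values, not the verbatim screenshot contents:

```python
import pandas as pd

from lightgbm import LGBMClassifier
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import PMMLLabelEncoder

CAT_COLS = ["color", "size"]  # hypothetical column names

df = pd.read_csv("train.csv")
X, y = df[CAT_COLS], df["label"]

# Declare the full value space up front, then label-encode for LightGBM
mapper = DataFrameMapper(
    [([col], [CategoricalDomain(data_values = [...]), PMMLLabelEncoder()]) for col in CAT_COLS]
)

pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("classifier", LGBMClassifier())
])
pipeline.fit(X, y)

sklearn2pmml(pipeline, "Pipeline.pmml", debug = True)
```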
Understanding PMMLEncoder

I also tried looking at the source code of `PMMLEncoder` in `jpmml-converter`, and noticed that the exception comes from the following lines of code:

At first glance, this doesn't make a lot of sense to me - it seems as if the code is assuming that the full value space (contained in `existingValues`) has to be the same as the train-set value space (contained in `values`). However, I am aware that I'm most likely doing something wrong, and/or that my interpretation of this code snippet is wrong.
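In other words, my reading of that check, paraphrased in Python (this is not the actual jpmml-converter Java source):

```python
# Paraphrase: the VVS already registered for a field must exactly match the
# VVS reported by a downstream transformer, otherwise the export fails
def ensure_same_vvs(existing_values, values):
    if existing_values != values:
        raise ValueError(f"Expected {existing_values} as valid values, got {values}")
```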
Conclusion

Why is this error happening, and how can I fix it?
If I can provide any more relevant information, please ask for it and I'll update the post as soon as I can!
Note: Terminal outputs are provided as screenshots because I am developing on a remote server with copy-pasting disabled. I apologize for the inconvenience.