Ability to "refine" (categorical-) valid value spaces along the transformer pipeline #411

Closed
mbicanic opened this issue Feb 12, 2024 · 17 comments


@mbicanic

mbicanic commented Feb 12, 2024

Hello!

I am facing problems with training and exporting a LightGBM model due to the categorical features.
The dataset is unfortunately proprietary, so I cannot show what it looks like, but the most important dataset properties are:

  • there are 15 categorical features
  • all possible values of all 15 features are known in advance
  • not all values appear in the training set

Problem Description

After reading the discussion in #300, it is my understanding that the sklearn2pmml.decoration.CategoricalDomain Python class exposes a data_values parameter in the constructor, through which we can specify the full value space of categorical features.
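
For illustration, a minimal domain declaration of this kind looks as follows (the feature values here are placeholders, not my real data):

from sklearn2pmml.decoration import CategoricalDomain

# placeholder example: declare the full value space up front,
# even if only a subset of these values occurs in the training set
color_domain = CategoricalDomain(data_values=["red", "green", "blue"])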

However, when I specify the full value space via the data_values parameter, I get an error during the export to PMML:

Exception in thread "main" java.lang.IllegalArgumentException: 
Expected [VALUES I PROVIDED VIA THE data_values PARAMETER] as valid values, got [VALUES PRESENT IN THE TRAINING SET]

The stack trace of the error is provided as a screenshot below:
[screenshot: stack trace of the IllegalArgumentException]
Note: the export works normally when I omit the data_values parameter, and everything runs smoothly in the inference phase on the Java side, since I specify how to handle invalid and missing values. However, in my situation it is not acceptable to interpret valid but unseen values as invalid. In other words, I use the invalid_value_replacement and invalid_value_treatment parameters only as an ad-hoc fix to keep the inference phase from breaking. I know all possible categorical values in advance, and thus don't expect any invalid values.

Environment (debug=True)

I ran the training pipeline with debug=True, and this is the output:
[screenshot: debug output of the training pipeline]
Note: I am aware Python 3.8 is very old at this point; however, I cannot upgrade to a newer version due to company policy.

My Code (Training)

The code used to wrap and train the model is the following (rewritten manually from the server - I apologize for possible typos and errors):

from typing import List, Tuple  # typing generics, since list[...] fails at runtime on Python 3.8

import pandas as pd
from lightgbm import LGBMClassifier
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import PMMLLabelEncoder
from sklearn2pmml import sklearn2pmml

CAT_COL_NAMES = [...]  # list of all categorical feature names
# each tuple: (feature name, full category value space, default category value)
CAT_COLS: List[Tuple[str, List[str], str]] = [
  (category_name, all_category_values, default_category_value),
  ...
]

df = pd.read_csv("/path/to/dataset.csv")
X = df.drop(['Y'], axis=1)
Y = df['Y']

numeric_cols = [col for col in X.columns if col not in CAT_COL_NAMES]
X[numeric_cols] = X[numeric_cols].fillna(0.0)

hparams = {...} # LightGBM hyperparameters dictionary
lgbm = LGBMClassifier(**hparams)

mapper = DataFrameMapper(
  [(col, [CategoricalDomain(invalid_value_treatment='as_value',
                            invalid_value_replacement=default,
                            missing_value_treatment='as_value',
                            missing_value_replacement=default,
                            data_values=vals),
          PMMLLabelEncoder()]) for col, vals, default in CAT_COLS] +
  [(numeric_cols, ContinuousDomain(with_data=False, missing_value_replacement=0))]
)
# the categorical columns come first in the mapper output
cat_indices = list(range(len(CAT_COL_NAMES)))

pipeline = PMMLPipeline([('mapper', mapper), ('classifier', lgbm)])
pipeline.fit(X, Y, classifier__categorical_feature=cat_indices, classifier__eval_metric='logloss')

X_sample = X.sample(n=100)
pipeline.verify(X_sample)

sklearn2pmml(pipeline, "model.pmml", with_repr=True, debug=True)

Understanding PMMLEncoder

I also tried looking at the source code of PMMLEncoder in jpmml-converter and noticed that the exception comes from the following lines of code:

if(existingValues != null && !existingValues.isEmpty()){
	if((existingValues).equals(values)){
		break values;
	}
	throw new IllegalArgumentException("Expected " + existingValues + " as valid values, got " + values);
}

At first glance, this doesn't make a lot of sense to me - it seems as if the code is assuming that the full value space (contained in existingValues) has to be the same as the train-set value space (contained in values). However, I am aware that I'm most likely doing something wrong, and/or that my interpretation of this code snippet is wrong.

Conclusion

Why is this error happening, and how can I fix it?

If I can provide any more relevant information, please ask for it and I'll update the post as soon as I can!

Note: Terminal outputs are provided as screenshots because I am developing on a remote server with copy-pasting disabled. I apologize for the inconvenience.

@vruusmann
Member

TLDR: The crux of the matter is that your sub-pipeline for categorical features contains two (meta-)transformers that collect and present categorical valid value space (VVS) information - first CategoricalDomain and then PMMLLabelEncoder. Everything works OK if they devise identical VVSes. However, in your case the first transformer (CategoricalDomain) specifies a much "wider" VVS than the second transformer (PMMLLabelEncoder), and the converter decides to bail out at this point because it cannot resolve this conflict automatically.
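
To illustrate the conflict with made-up values:

# CategoricalDomain declares the full, "wide" VVS:
domain = CategoricalDomain(data_values=["a", "b", "c", "d"])
# ...but after fitting, PMMLLabelEncoder has only seen a subset,
# ie. encoder.classes_ == ["a", "b"].
# The converter compares ["a", "b", "c", "d"] against ["a", "b"],
# finds them unequal, and throws the IllegalArgumentException above.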

Well, in principle, it should be OK to assume that it's safe to "narrow" the VVS down the line. So, org.jpmml.converter.PMMLEncoder could use a lax List#containsAll(List) check instead of the strict List#equals(List) check. Back in the day there was a very good reason for being so strict, but I can't recall it now (it should be available in some GitHub issue).

Also, the VVS that is defined by CategoricalDomain.data_values attribute should be considered to be a "suggestion", not a "hard requirement"... Meaning that List#containsAll(List) would be allowed.
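
In Python terms, the difference between the two checks would be roughly the following (a conceptual sketch, not the actual Java code):

existing_values = ["a", "b", "c", "d"]  # declared by CategoricalDomain
values = ["a", "b"]                     # collected by PMMLLabelEncoder

strict_ok = (existing_values == values)       # List#equals - False, throws today
lax_ok = set(values) <= set(existing_values)  # List#containsAll - True, would be allowed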

@vruusmann changed the title from "Passing data_values to CategoricalDomain breaks export" to "Ability to "refine" (categorical-) valid value spaces along the transformer pipeline" on Feb 12, 2024
@vruusmann
Member

vruusmann commented Feb 12, 2024

This issue is raised in relation to LightGBM models (as opposed to built-in SkLearn models).

LightGBM is able to auto-detect and handle categorical features without any kind of external binarization or one-hot encoding.

Therefore, please don't use PMMLLabelEncoder there! It was appropriate 5+ years ago, when LightGBM didn't have proper categorical feature support, but not today.

You should replace PMMLLabelEncoder with sklearn2pmml.preprocessing.CastTransformer, configured with the pandas.CategoricalDtype data type:

from sklearn2pmml.preprocessing import CastTransformer

mapper = DataFrameMapper(
  [([cat_col], [CategoricalDomain(...), CastTransformer("category")]) for cat_col in cat_cols]
)

This is the way today!

However, looking into the JPMML-SkLearn library code, I have a suspicion that the transformer sequence [CategoricalDomain(...), CastTransformer("category")] will trigger the same IllegalArgumentException in org.jpmml.converter.PMMLEncoder, because we still have two transformers devising mismatching VVSes.

Looking further into the JPMML-SkLearn code, it should be possible to suppress this "VVS sanity check" by inserting some kind of "dummy" transformation between the current two transformers - the idea is to temporarily obfuscate the nature of the feature. Perhaps a dummy ExpressionTransformer("X[0]") will do the trick...
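
A sketch of that workaround (untested; the ExpressionTransformer("X[0]") step is a plain pass-through whose only purpose would be to hide the CategoricalDomain VVS from the downstream cast):

from sklearn2pmml.preprocessing import CastTransformer, ExpressionTransformer

mapper = DataFrameMapper(
  [([cat_col], [CategoricalDomain(...), ExpressionTransformer("X[0]"), CastTransformer("category")]) for cat_col in cat_cols]
)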

A long-term solution would be to add proper pandas.CategoricalDtype support to sklearn2pmml.decoration.DiscreteDomain and its subclasses (ie. CategoricalDomain and OrdinalDomain). The domain decorator is exactly the right place to make the data type information known to everybody; right now this is postponed to the dedicated CastTransformer step.

@mbicanic
Author

mbicanic commented Feb 12, 2024

EDIT: This comment was written before I saw your second comment!

@vruusmann
Thanks for the extremely quick response!

From what I can see, PMMLLabelEncoder doesn't provide a mechanism to specify the VVS of categorical features. If this is true, what would be your recommended course of action in this scenario?

The way I see it, I can do one of the following:

  1. Remove PMMLLabelEncoder from the PMMLPipeline, thus compromising on the quality of the LightGBM model (as per this article)
  2. Keep everything as it is, thus necessarily treating all unseen values as invalid, and replacing them with the invalid_value_replacement
  3. Keep everything as it is, but manually edit the PMML artifact after training to add the unseen values into the <DataDictionary> segment
  4. [insert your suggestion here] 😄

On a tangential note: in my existing PMML exports, I see no explicit trace of PMMLLabelEncoder's transformation of categorical levels into integer values. When looking into the <Node> blocks which split the dataset by a categorical feature, the split values are still written as strings, not as integers.

With this in mind, I have three follow-up questions:

  1. Is the string->integer conversion completely implicit, or am I missing an explicit trace of the PMMLLabelEncoder in the PMML artifact?
  2. What are the consequences of manually adding the unseen categories into the <DataDictionary>?
    • My hunch is that it has no effect on the model itself (the model has never seen these values, so they aren't present in any of the <Node> splits), but that it does affect the overall pipeline: it will neither crash nor replace the unseen value with the invalid_value_replacement in the inference phase
  3. If I add the unseen values into the <DataDictionary> block, is it necessary to preserve the order of existing values (and append unseen ones at the end), or can I insert them wherever?
    • I assume the order is important exactly because of the PMMLLabelEncoder - I would expect that it encodes the first value in the <DataDictionary> as 0, the second as 1, etc.
    • Then again, since no explicit trace of PMMLLabelEncoder is visible in the PMML artifact, I may be wrong with my assumption

@vruusmann
Member

The JPMML-SkLearn library uses the PMMLLabelEncoder.classes_ attribute to perform the int_index-to-string_value reverse mapping.

The PMML representation always deals with the real-life data schema (ie. string category values), not any transformations thereof (ie. integer value indexes after some encoding scheme).

If I add the unseen values into the <DataDictionary> block, is it necessary to preserve the order of existing values (and append unseen ones at the end), or can I insert them wherever?

Pay attention to the so-called "operational type" of the DataField element! If it's categorical, then the order of DataField/Value@property="valid" IS NOT important. However, if it's ordinal, then it IS important.

In your case - categorical features - it is not important, meaning that you can sort this list in any way you wish, and insert new list elements in any place.

@mbicanic
Author

mbicanic commented Feb 12, 2024

@vruusmann
Thanks - that's actually exactly what I'd expect from the semantics of ordinal vs categorical!

I will try adding a CastTransformer into the pipeline instead of the PMMLLabelEncoder and update you with the results.
To be clear, the whole idea is that the CastTransformer will do the casting for me, and so I don't have to do df[cat_col] = df[cat_col].astype('category'), right?

@vruusmann
Member

While you're still using the PMMLLabelEncoder transformer, you may try updating its classes_ attribute manually with the "omitted category values", making it identical to the CategoricalDomain.data_values_ attribute.

Workflow:

  1. Compose the pipeline. Extract [CategoricalDomain(), PMMLLabelEncoder()] pairs into some external list so you can conveniently "synchronize" them afterwards.
  2. Fit the pipeline.
  3. Update PMMLLabelEncoder.classes_ attribute values. You do this after Scikit-Learn fitting! We will manually fix some "broken" relationships to make the SkLearn2PMML converter happy...
  4. Convert to PMML.

This hackish workflow should solve the above IllegalArgumentException, because the JPMML-SkLearn library will see identical VVSes on both CategoricalDomain and PMMLLabelEncoder.
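
In code, the workflow would look roughly like this (an illustrative sketch - the CategoricalDomain(...) arguments are elided, and see the follow-up comments below about preserving the ordering of classes_):

import numpy as np

# 1. Compose; keep the (domain, encoder) pairs around for later synchronization
cat_pairs = [(CategoricalDomain(...), PMMLLabelEncoder()) for _ in CAT_COLS]
mapper = DataFrameMapper([(col, list(pair)) for (col, _, _), pair in zip(CAT_COLS, cat_pairs)])
pipeline = PMMLPipeline([("mapper", mapper), ("classifier", lgbm)])

# 2. Fit
pipeline.fit(X, Y)

# 3. Synchronize the two VVSes after fitting
for domain, encoder in cat_pairs:
  encoder.classes_ = np.asarray(domain.data_values_)

# 4. Convert to PMML
sklearn2pmml(pipeline, "model.pmml")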

@mbicanic
Author

I tried the hackish workflow before trying the CastTransformer method, and wow - it worked! Thank you so much for the support 😄! I synchronized the CategoricalDomain and the PMMLLabelEncoder simply by:

pmml_label_encoder.classes_ = categorical_domain.data_values_.copy()

I will try incorporating the model into a Java application in the next few days and update you with the results - hopefully there won't be any issues.

Meanwhile, I will also try the CastTransformer approach, just to be sure 😄

@vruusmann
Member

pmml_label_encoder.classes_ = categorical_domain.data_values_.copy()

That's probably a bit too naive...

Fundamentally, you want to keep the original sub-list of PMMLLabelEncoder.classes_ unchanged, and then append any extra values to it. The JPMML-SkLearn library uses the ordering of these list elements for performing the integer-to-string value translation, so if you mess up this mapping, you'll get wrong string values.

Perhaps it's better to update CategoricalDomain.data_values_ instead - copy over PMMLLabelEncoder.classes_ values, and then append your own.

In your workflow, the CategoricalDomain.data_values_ attribute is not involved in any integer-to-string reverse translations, but the PMMLLabelEncoder.classes_ attribute is. If you need to mess with something, then mess with the less critical one.
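
In code, something along these lines (an illustrative sketch):

import numpy as np

# keep the fitted sub-list intact (its ordering drives the
# integer-to-string translation), and append only the unseen values
seen = list(pmml_label_encoder.classes_)
extra = [v for v in categorical_domain.data_values_ if v not in seen]
pmml_label_encoder.classes_ = np.array(seen + extra)
categorical_domain.data_values_ = np.array(seen + extra)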

@mbicanic
Author

I understand the concern you raised, and I was afraid that this approach would be too naive, but wanted to try it out nevertheless.

Regarding your suggestion - how could I meaningfully edit CategoricalDomain.data_values_? That list is correct to begin with (it contains the full VVS), whereas the PMMLLabelEncoder.classes_ list is "incorrect" (it doesn't contain the full VVS) and thus needs to be edited so that the two are in sync and the full VVS is supported.
Unless you meant that I could copy the elements of PMMLLabelEncoder.classes_ into CategoricalDomain.data_values_, placing them before all the unseen classes, so that the ordering is equal between them.

I will try a less naive approach, though:

  • first, I'll append the missing values to the PMMLLabelEncoder.classes_ attribute
  • then, I'll copy PMMLLabelEncoder.classes_ into CategoricalDomain.data_values_ to ensure the ordering is the same

Will update you with the results as soon as I'm done :)

@vruusmann
Member

I will try a less naive approach, though:

That should be the correct way of doing things.

A value list has two logical parts - first the sub-list of values that are actually seen by Scikit-Learn when fitting the model. This sub-list is equal to PMMLLabelEncoder.classes_. And then you have this second sub-list, which contains extra values that you want to "enable".

Use whatever approach you wish, but you must not mess up the first sub-list!

I think I'll go ahead and implement CategoricalDomain(dtype = "category") in the very near future, which should make all these workarounds redundant.

@mbicanic
Author

Update

I have tried the CastTransformer approach, but with it I can't even fit the pipeline, let alone export it. There is probably an error in my code again, but since the other approaches work, I won't spend any more time debugging it. For future reference, this is the error I'm getting:

Traceback (most recent call last):
  File "/path/to/venv/site-packages/sklearn_pandas/pipeline.py", line 24, in _call_fit
    return fit_method(X, y, **kwargs)
  File "/path/to/venv/site-packages/sklearn/utils/_set_output.py", line 157, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/path/to/venv/site-packages/sklearn/base.py", line 919, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/path/to/venv/site-packages/sklearn/utils/_set_output.py", line 157, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/path/to/venv/site-packages/sklearn2pmml/preprocessing/__init__.py", line 110, in transform
    return cast(X, self.dtype_)
  File "/path/to/venv/site-packages/sklearn2pmml/util/__init__.py", line 26, in cast
    return X.astype(dtype)
TypeError: Cannot interpret 'CategoricalDtype(categories=[LIST OF CATEGORY VALUES IN TRAIN SET], ordered=False)' as a data type

The following two hacky-patch-after-training approaches seem to work (training, verification, exporting), but I haven't had the time to properly test their integration into Java:

  1. Naively overwriting PMMLLabelEncoder.classes_ with CategoricalDomain.data_values_:
pmml_label_encoder.classes_ = categorical_domain.data_values_
  2. Adding missing values to PMMLLabelEncoder.classes_ while preserving the order and position of existing values:
import numpy as np

# see code snippet in comments above for a more detailed breakdown
dfmapper_cat_cols = [(col_name, [CategoricalDomain(...), PMMLLabelEncoder()]) for col_name, vals, default in CAT_COLS]

# ...create DataFrameMapper...
# ...create PMMLPipeline as (mapper, classifier) sequence...
# ...fit PMMLPipeline...

# Hacky-Patch-After-Training:
for col_name, transformers in dfmapper_cat_cols:
  categorical_domain, label_encoder = transformers
  existing_classes = label_encoder.classes_.tolist()
  new_classes = existing_classes.copy()
  for value in categorical_domain.data_values_:
    if value not in existing_classes:
      new_classes.append(value)
  label_encoder.classes_ = np.array(new_classes)
  categorical_domain.data_values_ = np.array(new_classes)
  
# ...export PMMLPipeline...

Once again, @vruusmann thank you very much for your support! 😄

@vruusmann
Member

I've got this issue effectively solved in my local codebase. Will probably push to GitHub later today or tomorrow.

Much to my surprise, CategoricalDomain already supports dtype = "category". The existing implementation works fine with the default data_values (ie. automatically collected from the training data), but not with a user-specified one (ie. a custom list with unused/unseen category levels). So that was all that needed fixing.

TLDR: The recommended workflow for training LightGBM models in early 2024:

from lightgbm import LGBMClassifier
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline

mapper = DataFrameMapper(
	[([cat_col], CategoricalDomain(dtype = "category")) for cat_col in cat_cols] +
	[([cont_col], ContinuousDomain()) for cont_col in cont_cols]
, input_df = True, df_out = True)

classifier = LGBMClassifier(n_estimators = 11, max_depth = 3)

pipeline = PMMLPipeline([
	("mapper", mapper),
	("classifier", classifier)
])
pipeline.fit(X, y)

sklearn2pmml(pipeline, "Pipeline.pmml")

There is no need for PMMLLabelEncoder, or other similar stuff.

@vruusmann
Member

TypeError: Cannot interpret 'CategoricalDtype(categories=[LIST OF CATEGORY VALUES IN TRAIN SET], ordered=False)' as a data type

You cannot call X.astype(dtype) on Numpy arrays with the pandas.CategoricalDtype argument (ie. Numpy does not support Pandas data types).

Refactor your pipeline, so that the CastTransformer step sees a Pandas' data container in its input.

See my above Python code example - I'm specifying DataFrameMapper.input_df = True and DataFrameMapper.df_out = True exactly for this purpose.
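
If you insist on keeping an explicit CastTransformer step, the sub-pipeline would need to look roughly like this (a sketch; the two flags are the important part):

mapper = DataFrameMapper(
	[([cat_col], [CategoricalDomain(...), CastTransformer("category")]) for cat_col in cat_cols]
, input_df = True, df_out = True)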

@vruusmann
Member

I've just released SkLearn2PMML 0.103.1 to PyPI, which supports the above "recommended workflow" (see #411 (comment)).

As for using custom valid value spaces: set the data_values parameter to the required list of values, and set the dtype parameter to "category" (a string literal, not an actual pandas.CategoricalDtype object!).

Please note that since DiscreteDomain subclasses have supported multi-column mode for some time now, the data_values must be formatted as a list of lists ([[...]]).

For example:

from sklearn2pmml.decoration import CategoricalDomain

cat_domain = CategoricalDomain(data_values = [[...]], dtype = "category")

@mbicanic YOU OWE ME BIG TIME NOW!

@mbicanic
Author

You cannot call X.astype(dtype) on Numpy arrays with the pandas.CategoricalDtype argument (ie. Numpy does not support Pandas data types).

This makes sense, thanks for the help! However, with the hacky solution working, and now the official fix, I won't be trying it out. Besides, as you said, it's likely that the pipeline would bail out again due to a mismatch between the CategoricalDomain VVS and the CastTransformer VVS.

I've just released SkLearn2PMML 0.103.1 to PyPI, which supports the above "recommended workflow" (see #411 (comment)).

That's a very fast reaction time, kudos!

Please note that since DiscreteDomain subclasses support multi-column mode for some time now, then the data_values must be formatted as a list of lists ([[...]]).

Is this true even if I'm defining a CategoricalDomain for a single feature, or is the following snippet okay:

# notice that the first tuple element is `cat_col`, not `[cat_col]`
mapper = DataFrameMapper(
  [(cat_col, CategoricalDomain(dtype = "category", data_values=[val1, val2, ...])) for cat_col in cat_cols]
)

YOU OWE ME BIG TIME NOW!

Indeed I do - much appreciated! 👏 👏

@vruusmann
Member

Is this true even if I'm defining a CategoricalDomain for a single feature, or is the following snippet okay:

I added a small clarification & code example into v0.103.1 release notes:
https://github.com/jpmml/sklearn2pmml/blob/master/NEWS.md#01031

You may need to adjust the "dimensionality" of DiscreteDomain.data_values to comply with the "a list-like of list-likes" requirement. Right now your code looks like a "list-like of scalars", so if you continue like that, expect some kind of Python array indexing or hashing error.
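
Concretely, the adjusted snippet would presumably look like this (note the extra brackets around both the column name and the value list):

mapper = DataFrameMapper(
  [([cat_col], CategoricalDomain(dtype = "category", data_values = [[val1, val2, ...]])) for cat_col in cat_cols]
)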

@vruusmann
Member

vruusmann commented Feb 26, 2024

(Removed off-topic comment)
