NullPointerException when using OneHotEncoder with Integer Categorical Variables in sklearn2pmml #412

Closed
SimonRbk95 opened this issue Feb 16, 2024 · 10 comments


@SimonRbk95

SimonRbk95 commented Feb 16, 2024

Hi,

all I am trying to do is use the OneHotEncoder on all my categorical variables. However, I noticed that if the categorical feature has integer values, I get the error shown below. To fix it, I already tried to use the CategoricalDomain(dtype=str) decorator to override the data type, without success.

from sklearn.preprocessing import OneHotEncoder
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain
from sklearn2pmml.pipeline import PMMLPipeline
from xgboost import XGBClassifier

one_hot = OneHotEncoder(
    sparse=True,
    handle_unknown='infrequent_if_exist',
    max_categories=10
)

mapper = DataFrameMapper(
    [
        (['cat_col'], [CategoricalDomain(dtype=str), one_hot]),
        (['num_col1', 'num_col2'], None),
    ], 
    df_out=True, 
    drop_cols=cols_to_drop
)

pipeline = PMMLPipeline([   
    ('cat_one_hot', mapper),
    ('classifier', XGBClassifier(**params))
])

pipeline.fit(X_train, y_train)
pipeline.verify(X_train.sample(n=15))

sklearn2pmml(pipeline, 'pipeline.pmml', with_repr=True)
Standard output is empty
Standard error:
Exception in thread "main" java.lang.NullPointerException
	at sklearn.preprocessing.EncoderUtil$3.apply(EncoderUtil.java:171)
	at sklearn.preprocessing.EncoderUtil$3.apply(EncoderUtil.java:167)
	at com.google.common.collect.Lists$TransformingRandomAccessList.get(Lists.java:618)
	at sklearn.preprocessing.MultiOneHotEncoder.encodeFeatures(MultiOneHotEncoder.java:70)
	at sklearn.Transformer.encode(Transformer.java:76)
	at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:67)
	at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:45)
	at sklearn.Initializer.encode(Initializer.java:59)
	at sklearn.Composite.encodeFeatures(Composite.java:111)
	at sklearn.Composite.initFeatures(Composite.java:254)
	at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:104)
	at com.sklearn2pmml.Main.run(Main.java:80)
	at com.sklearn2pmml.Main.main(Main.java:65)

As a workaround, I tried to map the categories with a LookupTransformer(), which results in the same error.

from sklearn2pmml.preprocessing import LookupTransformer

mapping = {
    0 : '0',
    2 : '2',
    7 : '7',
}

('cat_col', [LookupTransformer(mapping, default_value = "0"), one_hot])

On a different note, when looking at a PMML file that I exported encoding some of the categoricals for which the code above worked, I am doubting that the max_categories and handle_unknown arguments actually carry over to PMML. However, I don't have any experience with PMML, so maybe you have some insight here as well?

Any help is much appreciated. Thank you!

@vruusmann
Member

This particular NPE relates to meddling with the OneHotEncoder.handle_unknown attribute. It looks as if OneHotEncoder is missing some piece of Python state that it is supposed to have (according to the SkLearn documentation).

Anyway, when developing or debugging, try to get the simplest workflow running first, and only thereafter start adding complexity. In your case, does the workflow succeed if you don't mess with the OneHotEncoder.handle_unknown attribute?

I'll try to reproduce this issue with some toy dataset on my computer. It looks to be dataset-dependent.

@vruusmann
Member

My second comment is more fundamental - why are you performing one-hot encoding at all when feeding categorical features into the XGBoost algorithm?

It is the year 2024, and XGBoost (all versions 1.7.X and up) supports categorical features natively. All you have to do is cast the appropriate Pandas data frame column to the category data type, and you're all set.

For example, see this categorical features-enabled workflow:
#411 (comment)

Simply replace LGBMClassifier with XGBClassifier, and you'll be all set for success!

Please try this workflow first. When you have it, please LMK here, and I'll advise how to "integrate" the extra "infrequent value handling" capabilities into it (that you're currently using the OneHotEncoder step for).
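
Roughly along these lines (a minimal sketch with a made-up toy dataset; the enable_categorical and tree_method settings are my assumptions about what your XGBoost version requires, and may vary between versions):

import pandas as pd

from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
from xgboost import XGBClassifier

# Hypothetical dataset; "cat_col" stands in for any categorical column
df = pd.DataFrame({
    "cat_col" : ["a", "b", "a", "c", "b", "a"],
    "num_col" : [1.0, 2.5, 3.1, 0.7, 2.2, 1.9],
    "y" : [0, 1, 0, 1, 1, 0]
})

# Cast the categorical column to Pandas' category data type
df["cat_col"] = df["cat_col"].astype("category")

X = df[["cat_col", "num_col"]]
y = df["y"]

pipeline = PMMLPipeline([
    # XGBoost consumes category-typed columns natively
    ("classifier", XGBClassifier(enable_categorical = True, tree_method = "hist"))
])
pipeline.fit(X, y)

sklearn2pmml(pipeline, "xgb.pmml")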

@vruusmann
Member

when looking at a PMML file that I exported encoding
some of the categoricals for which the code above worked,
I am doubting that the max_categories and handle_unknown
arguments actually carry over to PMML.

The short answer here is that the PMML document is absolutely correct. You can verify it independently by making predictions on the validation dataset, and asserting that SkLearn and (J)PMML predictions are exactly the same (.. even for infrequent category values, and whatever other edge and corner case values).

The long answer is that these parameters are only relevant during the model training phase in Python. They do not carry over to the model application (aka prediction) phase.

For example, if you have limited the max number of category levels, then counting the category levels (for a particular categorical feature) in the PMML document will show exactly n values in use, satisfying the requirement n <= max_categories.

The SkLearn OneHotEncoder.handle_unknown attribute translates to a PMML MiningSchema/MiningField@invalidValueTreatment attribute:
https://dmg.org/pmml/v4-4-1/MiningSchema.html

When you're interested in seeing the effect of some SkLearn attribute on PMML, then:

  1. Train and export the model using attribute value A.
  2. Train and export the model using attribute value B.
  3. Use a diffing tool to see a line-by-line difference between the two PMML documents: diff A.pmml B.pmml

@SimonRbk95
Author

SimonRbk95 commented Feb 21, 2024

Hi @vruusmann,

thanks for your reply. I've had some time to test. I've updated the sklearn2pmml, scikit-learn, and xgboost libraries. Then, I used the Iris dataset as per your README and compared different configurations.

  • The library updates allowed me to use `dtype = 'category'`, making the PMML file usable with XGBoost's internal categorical handling, which is fine for my use case.

  • I tested different configurations and saw major differences between a simple Iris test dataset and the production data. For instance, max_categories does indeed not carry over, leaving all the categories as is, while for the Iris data it does work as you described, displaying the right number of categories as per n <= max_categories.

That being said, fortunately, I can continue with my use case without the OneHotEncoder, so your suggestions were helpful.

@vruusmann
Member

The library updates allowed me to use `dtype = 'category'`

I'm about to publish an updated SkLearn2PMML package version later this week, which will improve a lot of things around Pandas' categorical data type support.

There is a functional difference between specifying dtype = "category" and dtype = pandas.CategoricalDtype(categories = [...]). The current SkLearn2PMML package version is somewhat lacking in this area.
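
For context, a small sketch of the two spellings (the column and level names are made up; the exact functional difference is version-dependent, as noted above):

import pandas as pd

df = pd.DataFrame({"cat_col" : ["value1", "value2", "value1"]})

# Open-ended: the set of valid levels is inferred from the training data
df["cat_col"] = df["cat_col"].astype("category")

# Closed: the space of valid levels is declared up front
dtype = pd.CategoricalDtype(categories = ["value1", "value2", "value3"])
df["cat_col"] = df["cat_col"].astype(dtype)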

For instance, max_categories does indeed not carry over, leaving all the categories as is, while for the iris data it does work as you described

Can you demonstrate this issue using a small code example? The Iris dataset uses only continuous features; your real-life dataset must be using some categorical features as well, no?

.. displaying the right number of categories as per n <= max_categories

What you're trying to accomplish can be summarized as "how to handle rare (aka infrequent) categories", no?

This can be handled perfectly using the CategoricalDomain decorator step. The solution is formed around its data_values attribute (which defines the space of valid input values), plus its invalid_value_treatment and invalid_value_replacement attributes (which define what to do in case of encountering a non-valid input value). See the sketch below.

The rare value handling can be done very explicitly. Right now, I get the impression that you're kind of forced to bring in the OneHotEncoder transformer for this purpose, because Scikit-Learn does not have any proper tooling for this specific job.
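
For illustration, a sketch of what this could look like (the level names are placeholders; I'm assuming a plain list of valid values for a single-column mapping, and "as_value" treatment with a designated fallback):

from sklearn2pmml.decoration import CategoricalDomain

# Only the frequent levels are valid inputs; any other level is
# treated as invalid and replaced with the designated fallback value
domain = CategoricalDomain(
    data_values = ["value1", "value2", "value5"],
    invalid_value_treatment = "as_value",
    invalid_value_replacement = "value1"
)

# Plugged into a DataFrameMapper row, e.g.:
# (["cat_col"], [domain])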

I can continue with my use case without the OneHotEncoder

Using OneHotEncoder gives you "one category versus all other categories" type splits (in PMML represented using SimplePredicate elements), which unnecessarily increases the depth of decision trees.

Using XGBoost native encoding gives you "subset of categories versus all other categories" type splits (in PMML represented using SimpleSetPredicate elements). They are vastly more informative and performant.

For example, if you use OneHotEncoder and limit the max depth of individual decision tree models to 3 then, by definition, you can't build a good model if some categorical feature has more than 3 meaningful category levels to it.

@vruusmann
Member

TL;DR: Show me your issues in Python source code form (based on some toy dataset), and I'll advise.

I'm working on very closely related matters at the moment, and I can make the next SkLearn2PMML package version much better in this area.

@SimonRbk95
Author

SimonRbk95 commented Feb 21, 2024

Thanks for your response!

So, I used the Iris dataset as shown below, where I added one random categorical feature to check whether max_categories carries over. It did translate to the PMML, judging by the snippet from the PMML file below. Using the same OneHotEncoder configuration, including the CategoricalDomain decorator, did not facilitate the same behavior with the production data. In fact, the total number of categories was represented in the equivalent field from below. However, I switched to XGBoost's categorical handling because (as also indicated by you) the subset-of-categories type split yields better results.

import numpy as np
import pandas as pd

from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline
from xgboost import XGBRegressor

iris = load_iris()

df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

df['Species'] = iris.target
iris_X = df[df.columns.difference(["Species"])]
iris_y = df["Species"]
cols_X = iris_X.columns.tolist()

unique_values = ['value1', 'value2', 'value3', 'value4', 'value5']
iris_X['Test'] = np.random.choice(unique_values, size=len(df))

maps = [([col], [ContinuousDomain(dtype=iris_X[col].dtype)]) for col in cols_X] + \
       [(['Test'], [CategoricalDomain(dtype='str'), OneHotEncoder(max_categories=3, handle_unknown='infrequent_if_exist')])]

mapper = DataFrameMapper(maps, input_df=True, df_out=True)

pipeline = PMMLPipeline([
	("mapper", mapper),
	("classifier", XGBRegressor())
])

pipeline.fit(iris_X, iris_y)
pipeline.verify(iris_X.sample(n = 15))

from sklearn2pmml import sklearn2pmml

sklearn2pmml(pipeline, "xgb.pmml", with_repr = True)
<DerivedField name="regroup(Test)" optype="categorical" dataType="string">
	<Apply function="if">
		<Apply function="isIn">
			<FieldRef field="Test"/>
			<Constant dataType="string">value1</Constant>
			<Constant dataType="string">value2</Constant>
			<Constant dataType="string">value5</Constant>
		</Apply>
		<Constant dataType="string">infrequent</Constant>
		<FieldRef field="Test"/>
	</Apply>
</DerivedField>

@SimonRbk95
Author

SimonRbk95 commented Feb 21, 2024

As mentioned, I've moved away from the OneHotEncoder implementation, but I am having a different, yet somewhat related issue to what you just mentioned.

The code below shows my current implementation of the PMMLPipeline. I noticed that the pipeline's performance is worse than that of an XGBClassifier trained and fitted outside the pipeline. Upon investigation, I boiled it down to the CategoricalDomain decorator. I realised that if I cast the categorical data to the categorical type via .astype('category') for all columns of the entire dataframe before I fit the pipeline, I get the same performance, even if I keep the pipeline as shown below. This is regardless of how I handle missing values and their replacement, seemingly because I do not experience any more invalid value errors after using .astype('category') as a preprocessing step prior to training the pipeline.

maps = [([col], ContinuousDomain(dtype=X_train[col].dtype, 
                                 invalid_value_treatment = "as_missing", 
                                 missing_value_replacement = X_train[col].median())) for col in cfg.NUMERICALS] + \
       [([col], CategoricalDomain(dtype='category', 
                                  invalid_value_treatment = "as_missing", 
                                  missing_value_replacement = str(X_train[col].mode()[0]))) for col in cfg.CATEGORICALS] 

mapper = DataFrameMapper(maps, input_df=True, df_out=True, drop_cols=cols_to_drop)

pipeline = PMMLPipeline([   
    ('preprocessor', mapper),
    ('classifier', XGBClassifier(**cfg.XGB_PARAMS))
])

pipeline.fit(X_train, y_train)

Please let me know if anything is unclear from my side. Thank you!

@vruusmann
Member

The original NullPointerException happened when the OneHotEncoder transformer had its unknown value handling policy set to "infrequent_if_exist", and the value of the OneHotEncoder.max_categories attribute was set to a number greater than the actual number of category levels.

For example, if a categorical feature had three category levels, but the OneHotEncoder.max_categories attribute was set to 5 (i.e. anything greater than three), then the OneHotEncoder.infrequent_categories_ attribute was initialized, but contained None values.

The quick user-side fix would have been to adjust the OneHotEncoder.max_categories attribute accordingly.
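
A minimal sketch that reproduces this Python-side state (toy data):

import numpy as np

from sklearn.preprocessing import OneHotEncoder

# Three category levels, but max_categories permits up to five
X = np.asarray([["a"], ["b"], ["c"], ["a"], ["b"]])

encoder = OneHotEncoder(handle_unknown = "infrequent_if_exist", max_categories = 5)
encoder.fit(X)

# No category qualifies as infrequent, so the per-feature entry is None
print(encoder.infrequent_categories_)  # [None]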

@vruusmann
Member

@SimonRbk95 Also pay attention that the OneHotEncoder.sparse_output = True configuration does not play well with missing-value aware modeling algorithms such as XGBoost and LightGBM.

If you re-run your experiments twice, first with sparse_output = True and the second time with sparse_output = False, you should obtain two different models. If you can rationalize their difference, then you may keep using sparse_output = True. Otherwise, please consider switching to sparse_output = False.
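
A sketch of such a comparison, reusing the X_train and y_train objects from your code above (the "cat_col" column name is illustrative; sparse_output requires Scikit-Learn 1.2 or newer):

from sklearn.preprocessing import OneHotEncoder
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.pipeline import PMMLPipeline
from xgboost import XGBClassifier

def train_pipeline(sparse_output):
    mapper = DataFrameMapper([
        (["cat_col"], OneHotEncoder(sparse_output = sparse_output))
    ])
    pipeline = PMMLPipeline([
        ("mapper", mapper),
        ("classifier", XGBClassifier())
    ])
    pipeline.fit(X_train, y_train)
    return pipeline

dense_model = train_pipeline(sparse_output = False)
sparse_model = train_pipeline(sparse_output = True)

# Compare e.g. predict_proba outputs on a held-out sample;
# differences stem from XGBoost treating sparse zeros as missing values
sample = X_train.sample(n = 15)
print(dense_model.predict_proba(sample))
print(sparse_model.predict_proba(sample))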
