NullPointerException when using OneHotEncoder with Integer Categorical Variables in sklearn2pmml #412
This particular NPE relates to meddling with the column's data type. Anyway, when developing or debugging, try to get the simplest workflow running first, and only thereafter start adding complexity. In your case, does the workflow succeed if you don't mess with the data type? I'll try to reproduce this issue with some toy dataset on my computer. It looks to be dataset-dependent.
My second comment is more fundamental: why are you performing one-hot encoding when feeding categorical features into the XGBoost algorithm at all? It is year 2024, and XGBoost (all versions 1.7.X and up) supports categorical features natively. All you have to do is cast the appropriate Pandas data frame column to the `category` data type.

Please try the native categorical workflow first. When you have it, please LMK here, and I'll advise how to "integrate" the extra "infrequent value handling" capabilities into it (which you're currently getting from the OneHotEncoder).
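A minimal sketch of the cast in question, using a toy DataFrame (the `color` column name and values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

# Cast the column to pandas' categorical data type; XGBoost's native
# categorical support keys off this dtype.
df["color"] = df["color"].astype("category")

print(df["color"].dtype)                 # category
print(list(df["color"].cat.categories))  # ['blue', 'green', 'red']
```

With the columns cast this way, an estimator such as `XGBClassifier(enable_categorical=True, tree_method="hist")` can consume them directly, with no OneHotEncoder step in between.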
The short answer here is that the PMML document is absolutely correct. You can verify it independently by making predictions on the validation dataset, and asserting that SkLearn and (J)PMML predictions are exactly the same (even for infrequent category values, and whatever other edge and corner case values).

The long answer is that these parameters are only relevant during the model training phase in Python. They do not carry over to the model application (aka prediction) phase. For example, if you have limited the max number of category levels, then if you count the number of category levels (for a particular categorical feature) in the PMML document, you'll see that there are exactly that many of them. The grouping that SkLearn learned at fit time is simply baked into the PMML document as-is.

When you're interested in seeing the effect of some SkLearn attribute on the PMML output, then generate two PMML documents, one with the attribute set and one without, and diff them.
Hi @vruusmann, thanks for your reply. I've had some time to test. I've updated the sklearn2pmml, scikit-learn, and xgboost libraries. Then, I used the Iris dataset as per your README and compared different configurations.

That being said, fortunately, I can continue with my use case without the OneHotEncoder, so your suggestions were helpful.
I'm about to publish an updated SkLearn2PMML package version later this week, which will improve a lot around Pandas' categorical data type support. There is a functional difference depending on how exactly you specify the categorical data type.
Can you demonstrate this issue using a small code example? The Iris dataset uses all continuous features; your real-life dataset must be using some categorical features as well, no?
What you're trying to accomplish can be summarized as "how to handle rare (aka infrequent) categories", no? The rare value handling can be done very explicitly. Right now, I get the impression that you're kind of forced to bring in the OneHotEncoder solely for its infrequent-category handling capability.
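For instance, explicit rare value handling can be sketched in plain pandas, with no encoder at all (the `pool_rare` helper, its threshold, and the replacement label are all made up for illustration):

```python
import pandas as pd

def pool_rare(series, min_count=2, other="__other__"):
    """Replace category levels seen fewer than `min_count` times
    with a single explicit `other` level."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other)

s = pd.Series(["a", "a", "a", "b", "b", "c"])
print(pool_rare(s).tolist())  # ['a', 'a', 'a', 'b', 'b', '__other__']
```

Done this way, the pooling rule is visible in your own code, rather than buried inside an encoder's fitted state.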
Using one-hot encoding gives you binary "one category versus the rest" type splits. Using XGBoost native encoding gives you "subset of categories versus all other categories" type splits (in PMML represented using the SimpleSetPredicate element).
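As a rough illustration (the field name, category values, and score below are invented, not taken from any actual export), such a subset split appears in a PMML tree node like this:

```xml
<Node score="0.5">
  <SimpleSetPredicate field="color" booleanOperator="isIn">
    <Array type="string">red green</Array>
  </SimpleSetPredicate>
</Node>
```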
TL;DR: Show me your issues in Python source code form (based on some toy dataset), and I'll advise. I'm working on very closely related matters at the moment, and I can make the next SkLearn2PMML package version much better in this area.
Thanks for your response! So, I used the Iris dataset as shown below, where I added one random categorical feature to check whether the OneHotEncoder settings carry over:

```python
import numpy as np
import pandas as pd

from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline
from xgboost import XGBRegressor

iris = load_iris()

df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['Species'] = iris.target
iris_X = df[df.columns.difference(["Species"])]
iris_y = df["Species"]
cols_X = iris_X.columns.tolist()

# Add one random categorical feature
unique_values = ['value1', 'value2', 'value3', 'value4', 'value5']
iris_X['Test'] = np.random.choice(unique_values, size=len(df))

maps = [([col], [ContinuousDomain(dtype=iris_X[col].dtype)]) for col in cols_X] + \
    [(['Test'], [CategoricalDomain(dtype='str'), OneHotEncoder(max_categories=3, handle_unknown='infrequent_if_exist')])]
mapper = DataFrameMapper(maps, input_df=True, df_out=True)

pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("classifier", XGBRegressor())
])
pipeline.fit(iris_X, iris_y)
pipeline.verify(iris_X.sample(n=15))

sklearn2pmml(pipeline, "xgb.pmml", with_repr=True)
```
As mentioned, I've moved away from the OneHotEncoder for now. The code below shows my current implementation of the PMMLPipeline. I noticed that the pipeline's performance is worse than that of the previous OneHotEncoder-based one.

```python
maps = [([col], ContinuousDomain(dtype=X_train[col].dtype,
                                 invalid_value_treatment="as_missing",
                                 missing_value_replacement=X_train[col].median())) for col in cfg.NUMERICALS] + \
    [([col], CategoricalDomain(dtype='category',
                               invalid_value_treatment="as_missing",
                               missing_value_replacement=str(X_train[col].mode()[0]))) for col in cfg.CATEGORICALS]

mapper = DataFrameMapper(maps, input_df=True, df_out=True, drop_cols=cols_to_drop)

pipeline = PMMLPipeline([
    ('preprocessor', mapper),
    ('classifier', XGBClassifier(**cfg.XGB_PARAMS))
])
pipeline.fit(X_train, y_train)
```

Please let me know if anything is unclear from my side. Thank you!
The original behaviour was a limitation on the converter side. For example, if a categorical feature had three category levels, but the OneHotEncoder's `max_categories` limit forced some of them into the pooled "infrequent" category, then the converter could trip over the mismatch. The quick user-side fix would have been to adjust the `max_categories` value so that it covers all actual category levels.
@SimonRbk95 Also pay attention that the two pipelines are not functionally identical. If you re-run your experiments twice, first with the OneHotEncoder and then without it, and compare the results, you should see where the difference comes from.
Hi,
All I am trying to do is use the OneHotEncoder on all my categorical variables. However, I noticed that if the categorical feature has integer values, I get the error as shown below. To fix it, I already tried to use the CategoricalDomain(dtype=str) decoration to overwrite the data type, without success.
As a workaround, I tried to map the categories with a LookupTransformer(), which results in the same error.
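A toy sketch of the kind of dtype forcing involved — casting the integer codes to strings up front in pandas, before any encoding (the `code` column name and values are made up):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"code": [1, 2, 3, 2]})

# Force the integer category codes to strings, so that downstream
# one-hot encoding sees homogeneous string categories.
X["code"] = X["code"].astype(str)

enc = OneHotEncoder()
out = enc.fit_transform(X[["code"]])
print(enc.categories_)  # [array(['1', '2', '3'], dtype=object)]
print(out.shape)        # (4, 3)
```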
On a different note, when looking at a PMML file that I exported (encoding some of the categoricals for which the code above worked), I am doubting that the max_categories and handle_unknown arguments actually carry over to PMML. However, I don't have any experience with PMML, so maybe you have some insight here as well?
Any help is much appreciated. Thank you!