
Bad scoping of target field(s) in stacking estimators #192

Closed
git20190108 opened this issue Jan 31, 2024 · 19 comments

Comments

@git20190108

git20190108 commented Jan 31, 2024

Hi @vruusmann,
I have found a problem with scope value passing. The XML content `<OutputField name="predict_proba(0, 1)" optype="continuous" dataType="double" feature="probability" value="1" isFinalResult="false"/>` in my PMML file is invalid; after I rewrote it, it works.
Only the LGBM `predict_proba` always returns 0; the other models are OK. It seems the package can't recognize the value of `predict_proba(0, 1)`.

before:

<OutputField name="predict_proba(0, 1)" optype="continuous" dataType="double" feature="probability" value="1" isFinalResult="false"/>

after:

<OutputField name="predict_proba(0, 1)" optype="continuous" dataType="double" feature="transformedValue">
	<FieldRef field="probability(1)"/>
</OutputField>

file detail:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.4">
	<Header>
		<Application name="SkLearn2PMML package" version="0.100.0"/>
		<MiningBuildTask>
		<Extension name="repr">PMMLPipeline(steps=[('Stacking_model', StackingClassifier(estimators=[('lgbm',
                                LGBMClassifier(***)),
                               ('rf',
                                RandomForestClassifier***)),
                               ('MLP',
                                MLPClassifier(***)),
                               ('GNB', GaussianNB(***))],
                   final_estimator=LogisticRegression(***)))])</Extension>
		</MiningBuildTask>

	<RegressionTable intercept="-2.507414415" targetCategory="1">
	    <NumericPredictor name="predict_proba(0, 1)" coefficient="3"/>
	    <NumericPredictor name="predict_proba(1, 1)" coefficient="1"/>
	    <NumericPredictor name="predict_proba(2, 1)" coefficient="2"/>
	    <NumericPredictor name="predict_proba(3, 1)" coefficient="5"/>
	</RegressionTable>
</PMML>

packages:
jpmml-evaluator-python: 0.10.1
java: "1.8.0_211"
Python: 3.9.17

script:

from jpmml_evaluator import make_evaluator

evaluator = make_evaluator("***.pmml", reporting = True, backend = "py4j").verify()
evaluator.evaluate(input1)

detail.txt

@git20190108
Author

(screenshot attached)

@vruusmann
Member

Fixed the formatting for you. According to GitHub Markdown conventions, you should surround code blocks with three backtick symbols, and inline code fragments with a single backtick symbol.

@vruusmann
Member

If there is a problem, and you solve it by manually editing the PMML document, then this typically indicates a converter-side bug, not an evaluator-side bug.

Therefore, I'm moving this issue over to the JPMML-SkLearn project, because this is the component that is actually responsible for generating OutputField element names and making sure that they are properly scoped.

@vruusmann vruusmann transferred this issue from jpmml/jpmml-evaluator-python Jan 31, 2024
@vruusmann vruusmann changed the title problem with pmmlpipeline scope value Bad scoping of LGBMClassifier probability output fields within StackingClassifier? Jan 31, 2024
@git20190108
Author

Bad scoping of LGBMClassifier probability output fields within StackingClassifier?

Yes. The first LGBMClassifier always returns 0 probability, and the x-report does not work.

@vruusmann
Member

The first LGBMClassifier always return 0 probability

If you move the LGBMClassifier to the second position, does it work then?

Anyway, I will be creating a small test script to reproduce this issue on my own computer. Perhaps it affects all third-party classifiers, such as H2O, LightGBM and XGBoost.

One thing that intrigues me is that the converter is unable to detect the output field scoping issue. This PMML document should fail already in the conversion phase.

@git20190108
Author

The first LGBMClassifier always return 0 probability

If you move the LGBMClassifier to the second position, does it work then?

No, it still does not work. The position is not the reason.

@vruusmann
Member

Here's my test script - train a stacking classifier for a binary classification problem using SkLearn, LightGBM and XGBoost classifiers, then convert it to a PMML document, and then load and evaluate this PMML document using the JPMML-Evaluator-Python package:

from pandas import DataFrame
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

X, y = load_iris(return_X_y = True, as_frame = True)
# Convert to binary classification problem
y = (y == 1)

classifier = StackingClassifier(
	estimators = [
		("sklearn", LogisticRegression()),
		("lightgbm", LGBMClassifier(n_estimators = 3)),
		("xgboost", XGBClassifier(n_estimators = 3))
	],
	final_estimator = LogisticRegression()
)
classifier.fit(X, y)

from sklearn2pmml import sklearn2pmml

sklearn2pmml(classifier, "StackingClassifier.pmml")

from jpmml_evaluator import make_evaluator

evaluator = make_evaluator("StackingClassifier.pmml", reporting = True, backend = "py4j") \
	.verify()

X_pmml = DataFrame(X.values, columns = X.columns.values.tolist())

yt = evaluator.evaluateAll(X_pmml)
print(yt)

Works absolutely flawlessly. The LightGBM classifier can be moved to any position within the stacking classifier, and everything keeps working just like before.

@vruusmann
Member

@git20190108 The burden of proof is now on you - please take my test script, and "break it" so that it would start giving the same error that you were seeing in your own script before.

@git20190108
Author

@git20190108 The burden of proof is now on you - please take my test script, and "break it" so that it would start giving the same error that you were seeing in your own script before.

Is this normal?
(screenshot attached)
test.pmml.txt

@git20190108
Author

git20190108 commented Jan 31, 2024

Adding the PyPMML result for comparison:
(screenshot attached)

@vruusmann
Member

Is this normal?

Think I got your question now - "when I try to export the intermediate results of the stacking ensemble classifier, then why do they show up as 0 values in JPMML-Evaluator-Python results"?

Very interesting indeed. Am exploring.

@git20190108
Author

Is this normal?

Think I got your question now - "when I try to export the intermediate results of the stacking ensemble classifier, then why do they show up as 0 values in JPMML-Evaluator-Python results"?

Yes. Due to the wrong intermediate results, the final result is also mistaken.
Apparently, PyPMML gets the normal result with the same file.

@vruusmann
Member

vruusmann commented Jan 31, 2024

Due to the wrong intermediate results, the final result is also mistaken.

Seems like a data transfer error somewhere in the Python wrapper.

Because when I evaluate the same PMML document with JPMML-Evaluator command-line application, I get predictions that match SkLearn native predictions, plus all the intermediate LightGBM, XGBoost etc. values.

@git20190108
Author

Due to the wrong intermediate results, the final result is also mistaken.

Seems like a data transfer error somewhere in the Python wrapper.

Because when I evaluate the same PMML document with JPMML-Evaluator command-line application, I get predictions that match SkLearn native predictions, plus all the intermediate LightGBM, XGBoost etc. values.

Yes. Only this part is wrong; it seems this part can't get the correct value.

<Output>
	<OutputField name="predict_proba(1, true)" optype="continuous" dataType="double" feature="probability" value="true" isFinalResult="false"/>
</Output>

@vruusmann vruusmann changed the title Bad scoping of LGBMClassifier probability output fields within StackingClassifier? Bad scoping of target field(s) in stacking estimators Feb 1, 2024
@vruusmann
Member

This issue is about two things.

First, the JPMML-SkLearn converter library is generating incorrect PMML documents for both StackingClassifier and StackingRegressor estimator types. The problem is that the name of the target field is being passed by the top-level stacking estimator to its member estimators. Instead, it should be "anonymizing" the schema, so that member estimators get to see an "anonymized" target field (ie. the name is null).

The fix is straightforward: simply replace schema with schema.toSegmentSchema() on this line:
https://github.com/jpmml/jpmml-sklearn/blob/1.7.47/pmml-sklearn/src/main/java/sklearn/ensemble/stacking/StackingUtil.java#L56

Existing PMML documents can be fixed by simply deleting the <MiningField name="y" usageType="target"/> fragment from member model schemas. This declaration is only permitted with the top-level model element (ie. /PMML/MiningModel).

Second, JPMML-Evaluator-Python gets confused when it is requested to re-define the target field over and over again (first with the member models "sklearn", "lightgbm" and "xgboost", and then finally at the top level). Right now, it simply retains and returns the first (partial) definition.

According to the PMML specification, it should be an error to re-define the value of some field when moving from one model chain element to another.

Therefore, the correct behaviour for any PMML engine would be to fail with an error here. The JPMML-Evaluator Java library is not doing it, which needs fixing. Its Python wrapper is currently even worse, because it returns a partial result.

@vruusmann
Member

TLDR: There are fixes needed in two locations:

  1. The JPMML-SkLearn library should "anonymize" the schema before passing it from the parent/top-level model to child/member models.
  2. The JPMML-Evaluator library should error out when it is presented with a model chain, where sibling models attempt to re-define the value of a target field (IIRC, right now it only checks for the re-definition of output fields).

The fact that PyPMML "works" is no argument, because PyPMML does not perform any PMML document sanity/validity checks on its own. It's too stupid for that.

@vruusmann
Member

Existing PMML documents can be fixed by simply deleting the fragment from member model schemas

The above test script produces a "StackingClassifier.pmml" file. When this file is opened in a text editor and the offending MiningField declarations are deleted manually (I see five of them), then JPMML-Evaluator-Python already makes correct predictions (including the export of intermediate probabilities).

@git20190108
Author

Existing PMML documents can be fixed by simply deleting the fragment from member model schemas

The above test script produces a "StackingClassifier.pmml" file. When this file is opened in a text editor and the offending MiningField declarations are deleted manually (I see five of them), then JPMML-Evaluator-Python already makes correct predictions (including the export of intermediate probabilities).

I will use your method to fix my previous script.
Thank you for your patient explanation; I look forward to your fixing these issues.

@vruusmann
Member

vruusmann commented Feb 1, 2024

I will use your method to fix my previous script

Using my "StackingClassifier.pmml" file as an example:

You should keep:

  • /PMML/MiningModel/MiningSchema/MiningField@name="y" ie. the very first occurrence
  • /PMML/MiningModel/Segmentation/Segment@id="4"/RegressionModel/MiningSchema/MiningField@name="y" ie. the very last occurrence

You should delete:

  • One occurrence under /PMML/MiningModel/Segmentation/Segment@id="1"
  • Two occurrences under /PMML/MiningModel/Segmentation/Segment@id="2"
  • Two occurrences under /PMML/MiningModel/Segmentation/Segment@id="3"

This keep/delete transformation can probably be automated using an XSLT stylesheet. But I'm too lazy to work on it now.
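Until an XSLT stylesheet exists, the keep/delete transformation can also be sketched with the Python standard library's `xml.etree.ElementTree`. This is a hypothetical stopgap helper (not part of any JPMML package): assuming the target field is named "y", it keeps the first occurrence (the top-level model's MiningSchema) and the last occurrence (the final estimator's MiningSchema), and removes the member-model declarations in between.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"
# Preserve the default PMML namespace on write-out
ET.register_namespace("", PMML_NS)

def strip_member_target_fields(path_in, path_out, target="y"):
    tree = ET.parse(path_in)
    root = tree.getroot()
    # ElementTree has no parent pointers, so build a child -> parent map first
    parents = {child: parent for parent in root.iter() for child in parent}
    # All MiningField declarations for the target field, in document order
    hits = [el for el in root.iter(f"{{{PMML_NS}}}MiningField")
            if el.get("name") == target and el.get("usageType") == "target"]
    # Keep the first (top-level model) and last (final estimator) occurrence;
    # the ones in between belong to member models and are removed
    for el in hits[1:-1]:
        parents[el].remove(el)
    tree.write(path_out, xml_declaration=True, encoding="UTF-8")
```

For a document with the five offending declarations counted above, this leaves exactly the two occurrences that the keep list prescribes.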

I will fix the conversion part of this issue in the next SkLearn2PMML package release. Probably sometime next week.

@git20190108 You shall receive a GitHub notification when this issue gets closed. After that, update your SkLearn2PMML package version, and everything should work fine.

Also, thanks for spotting and reporting this issue to me! Much appreciated.
