
Bad scoping of target field(s) in stacking estimators #192

Closed
git20190108 opened this issue Jan 31, 2024 · 19 comments

Comments

@git20190108

git20190108 commented Jan 31, 2024

Hi @vruusmann,
I have found a problem with scope value passing. The XML content `<OutputField name="predict_proba(0, 1)" optype="continuous" dataType="double" feature="probability" value="1" isFinalResult="false"/>` in my PMML file is invalid; after I rewrote it, it works.
Only the LGBM `predict_proba` always returns 0; the other models are OK. It seems the package can't recognize the value of `predict_proba(0, 1)`.

before:

<OutputField name="predict_proba(0, 1)" optype="continuous" dataType="double" feature="probability" value="1" isFinalResult="false"/>

after:

<OutputField name="predict_proba(0, 1)" optype="continuous" dataType="double" feature="transformedValue">
	<FieldRef field="probability(1)"/>
</OutputField>

file detail:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.4">
	<Header>
		<Application name="SkLearn2PMML package" version="0.100.0"/>
		<MiningBuildTask>
		<Extension name="repr">PMMLPipeline(steps=[('Stacking_model', StackingClassifier(estimators=[('lgbm',
                                LGBMClassifier(***)),
                               ('rf',
                                RandomForestClassifier***)),
                               ('MLP',
                                MLPClassifier(***)),
                               ('GNB', GaussianNB(***))],
                   final_estimator=LogisticRegression(***)))])</Extension>
		</MiningBuildTask>

	<RegressionTable intercept="-2.507414415" targetCategory="1">
	    <NumericPredictor name="predict_proba(0, 1)" coefficient="3"/>
	    <NumericPredictor name="predict_proba(1, 1)" coefficient="1"/>
	    <NumericPredictor name="predict_proba(2, 1)" coefficient="2"/>
	    <NumericPredictor name="predict_proba(3, 1)" coefficient="5"/>
	</RegressionTable>
</PMML>

packages:
jpmml-evaluator-python: 0.10.1
java: "1.8.0_211"
Python: 3.9.17

script:

from jpmml_evaluator import make_evaluator

evaluator = make_evaluator("***.pmml", reporting = True, backend = "py4j").verify()
evaluator.evaluate(input1)

detail.txt

@git20190108
Author

(screenshot attached)

@vruusmann
Member

Fixed the formatting for you. According to GitHub Markdown conventions, you should surround code blocks with three backtick symbols, and inline code fragments with a single backtick symbol.

@vruusmann
Member

If there is a problem, and you solve it by manually editing the PMML document, then this typically indicates a converter-side bug, not an evaluator-side bug.

Therefore, I'm moving this issue over to the JPMML-SkLearn project, because this is the component that is actually responsible for generating OutputField element names and making sure that they are properly scoped.

@vruusmann vruusmann transferred this issue from jpmml/jpmml-evaluator-python Jan 31, 2024
@vruusmann vruusmann changed the title problem with pmmlpipeline scope value Bad scoping of LGBMClassifier probability output fields within StackingClassifier? Jan 31, 2024
@git20190108
Author

Bad scoping of LGBMClassifier probability output fields within StackingClassifier?

Yes. The first LGBMClassifier always returns 0 probability, and the x-report does not work.

@vruusmann
Member

The first LGBMClassifier always return 0 probability

If you move the LGBMClassifier to the second position, does it work then?

Anyway, I will be creating a small test script to reproduce this issue on my own computer. Perhaps it affects all third-party classifiers, such as H2O, LightGBM and XGBoost.

One thing that intrigues me is that the converter is unable to detect the output field scoping issue. This PMML document should fail already in the conversion phase.

@git20190108
Author

The first LGBMClassifier always return 0 probability

If you move the LGBMClassifier to the second position, does it work then?

No, it still does not work. The position is not the reason.

@vruusmann
Member

Here's my test script - train a stacking classifier for a binary classification problem using SkLearn, LightGBM and XGBoost classifiers, then convert it to a PMML document, and then load and evaluate this PMML document using the JPMML-Evaluator-Python package:

from pandas import DataFrame
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

X, y = load_iris(return_X_y = True, as_frame = True)
# Convert to binary classification problem
y = (y == 1)

classifier = StackingClassifier(
	estimators = [
		("sklearn", LogisticRegression()),
		("lightgbm", LGBMClassifier(n_estimators = 3)),
		("xgboost", XGBClassifier(n_estimators = 3))
	],
	final_estimator = LogisticRegression()
)
classifier.fit(X, y)

from sklearn2pmml import sklearn2pmml

sklearn2pmml(classifier, "StackingClassifier.pmml")

from jpmml_evaluator import make_evaluator

evaluator = make_evaluator("StackingClassifier.pmml", reporting = True, backend = "py4j") \
	.verify()

X_pmml = DataFrame(X.values, columns = X.columns.values.tolist())

yt = evaluator.evaluateAll(X_pmml)
print(yt)

Works absolutely flawlessly. The LightGBM classifier can be moved to any position within the stacking classifier, and everything keeps working just like before.

@vruusmann
Member

@git20190108 The burden of proof is now on you - please take my test script, and "break it" so that it would start giving the same error that you were seeing in your own script before.

@git20190108
Author

@git20190108 The burden of proof is now on you - please take my test script, and "break it" so that it would start giving the same error that you were seeing in your own script before.

Is this normal?
(screenshot attached)
test.pmml.txt

@git20190108
Author

git20190108 commented Jan 31, 2024

Adding the PyPMML result for comparison:
(screenshot attached)

@vruusmann
Member

Is this normal?

Think I got your question now - "when I try to export the intermediate results of the stacking ensemble classifier, then why do they show up as 0 values in JPMML-Evaluator-Python results"?

Very interesting indeed. Am exploring.

@git20190108
Author

Is this normal?

Think I got your question now - "when I try to export the intermediate results of the stacking ensemble classifier, then why do they show up as 0 values in JPMML-Evaluator-Python results"?

Yes. Due to the wrong intermediate results, the final result is also mistaken.
Apparently, PyPMML gets the normal result with the same file.

@vruusmann
Member

vruusmann commented Jan 31, 2024

Due to the wrong intermediate results, the final result is also mistaken.

Seems like a data transfer error somewhere in the Python wrapper.

Because when I evaluate the same PMML document with JPMML-Evaluator command-line application, I get predictions that match SkLearn native predictions, plus all the intermediate LightGBM, XGBoost etc. values.

@git20190108
Author

Due to the wrong intermediate results, the final result is also mistaken.

Seems like a data transfer error somewhere in the Python wrapper.

Because when I evaluate the same PMML document with JPMML-Evaluator command-line application, I get predictions that match SkLearn native predictions, plus all the intermediate LightGBM, XGBoost etc. values.

Yes. Only this part is wrong; it seems this part can't get the correct value.

<Output>
	<OutputField name="predict_proba(1, true)" optype="continuous" dataType="double" feature="probability" value="true" isFinalResult="false"/>
</Output>

@vruusmann vruusmann changed the title Bad scoping of LGBMClassifier probability output fields within StackingClassifier? Bad scoping of target field(s) in stacking estimators Feb 1, 2024
@vruusmann
Member

This issue is about two things.

First, the JPMML-SkLearn converter library is generating incorrect PMML documents for both StackingClassifier and StackingRegressor estimator types. The problem is that the name of the target field is being passed by the top-level stacking estimator to its member estimators. Instead, it should be "anonymizing" the schema, so that member estimators get to see an "anonymized" target field (ie. the name is null).

The fix is straightforward: simply replace schema with schema.toSegmentSchema() on this line:
https://github.com/jpmml/jpmml-sklearn/blob/1.7.47/pmml-sklearn/src/main/java/sklearn/ensemble/stacking/StackingUtil.java#L56

Existing PMML documents can be fixed by simply deleting the <MiningField name="y" usageType="target"/> fragment from member model schemas. This declaration is only permitted with the top-level model element (ie. /PMML/MiningModel).

Second, JPMML-Evaluator-Python gets confused when it is requested to re-define the target field over and over again (first with the member models "sklearn", "lightgbm" and "xgboost", and then finally at the top level). Right now, it simply retains and returns the first (partial) definition.

According to the PMML specification, it should be an error to re-define the value of some field when moving from one model chain element to another.

Therefore, the correct behaviour for any PMML engine would be to fail with an error here. The JPMML-Evaluator Java library is not doing it, which needs fixing. Its Python wrapper is currently even worse, because it returns a partial result.

@vruusmann
Member

TLDR: There are fixes needed in two locations:

  1. The JPMML-SkLearn library should "anonymize" the schema before passing it from the parent/top-level model to child/member models.
  2. The JPMML-Evaluator library should error out when it is presented with a model chain, where sibling models attempt to re-define the value of a target field (IIRC, right now it only checks for the re-definition of output fields).

The fact that PyPMML "works" is no argument, because PyPMML does not perform any PMML document sanity/validity checks on its own. It's too stupid for that.

@vruusmann
Member

Existing PMML documents can be fixed by simply deleting the fragment from member model schemas

The above test script produces a "StackingClassifier.pmml" file. When this file is opened in a text editor and the offending MiningField declarations are deleted manually (I see five of them), then JPMML-Evaluator-Python already makes correct predictions (including the export of intermediate probabilities).

@git20190108
Author

Existing PMML documents can be fixed by simply deleting the fragment from member model schemas

The above test script produces a "StackingClassifier.pmml" file. When this file is opened in a text editor and the offending MiningField declarations are deleted manually (I see five of them), then JPMML-Evaluator-Python already makes correct predictions (including the export of intermediate probabilities).

I will use your method to fix my previous script.
Thank you for your patient explanation; I look forward to your fixing these issues.

@vruusmann
Member

vruusmann commented Feb 1, 2024

I will use your method to fix my previous script

Using my "StackingClassifier.pmml" file as an example:

You should keep:

  • /PMML/MiningModel/MiningSchema/MiningField@name="y" ie. the very first occurrence
  • /PMML/MiningModel/Segmentation/Segment@id="4"/RegressionModel/MiningSchema/MiningField@name="y" ie. the very last occurrence

You should delete:

  • One occurrence under /PMML/MiningModel/Segmentation/Segment@id="1"
  • Two occurrences under /PMML/MiningModel/Segmentation/Segment@id="2"
  • Two occurrences under /PMML/MiningModel/Segmentation/Segment@id="3"

This keep/delete transformation can probably be automated using an XSLT stylesheet. But I'm too lazy to work on it now.
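Until an XSLT stylesheet exists, the keep/delete transformation can also be sketched with the Python standard library's `xml.etree.ElementTree`. This is a hypothetical stopgap helper (not part of any JPMML package): assuming the target field is named "y", it keeps the first occurrence (the top-level model's MiningSchema) and the last occurrence (the final estimator's MiningSchema), and removes the member-model declarations in between.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"
# Preserve the default PMML namespace on write-out
ET.register_namespace("", PMML_NS)

def strip_member_target_fields(path_in, path_out, target="y"):
    tree = ET.parse(path_in)
    root = tree.getroot()
    # ElementTree has no parent pointers, so build a child -> parent map first
    parents = {child: parent for parent in root.iter() for child in parent}
    # All MiningField declarations for the target field, in document order
    hits = [el for el in root.iter(f"{{{PMML_NS}}}MiningField")
            if el.get("name") == target and el.get("usageType") == "target"]
    # Keep the first (top-level model) and last (final estimator) occurrence;
    # the ones in between belong to member models and are removed
    for el in hits[1:-1]:
        parents[el].remove(el)
    tree.write(path_out, xml_declaration=True, encoding="UTF-8")
```

For a document with the five offending declarations counted above, this leaves exactly the two occurrences that the keep list prescribes.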

I will fix the conversion part of this issue in the next SkLearn2PMML package release. Probably sometime next week.

@git20190108 You shall receive a GitHub notification when this issue gets closed. After that, update your SkLearn2PMML package version, and everything should work fine.

Also, thanks for spotting and reporting this issue to me! Much appreciated.
