Bad scoping of target field(s) in stacking estimators #192
Fixed the formatting for you. According to GitHub Markdown conventions, you should surround code blocks with three backtick symbols, and inline code fragments with a single backtick symbol.
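For instance (generic GitHub Markdown syntax, not anything project-specific):

````
```python
print("a fenced code block")
```
... and `inline_code` for fragments.
````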
If there is a problem, and you solve it by manually editing the PMML document, then this typically indicates a converter-side bug, not an evaluator-side bug. Therefore, I'm moving this issue over to the JPMML-SkLearn project, because that is the component which is actually responsible for generating `LGBMClassifier` probability output fields within `StackingClassifier`.

(Issue retitled: "Bad scoping of LGBMClassifier probability output fields within StackingClassifier?")
Anyway, I will be creating a small test script to experience this issue on my own computer. Perhaps it affects all third-party classifiers, such as H2O, LightGBM and XGBoost. One thing that intrigues me is that the converter is unable to detect the output field scoping issue; this PMML document should already fail in the conversion phase.
No, it still does not work. The position is not the reason.
Here's my test script: train a stacking classifier for a binary classification problem using SkLearn, LightGBM and XGBoost classifiers, then convert it to a PMML document, and then load and evaluate this PMML document using the JPMML-Evaluator-Python package:

```python
from pandas import DataFrame
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

X, y = load_iris(return_X_y = True, as_frame = True)

# Convert to binary classification problem
y = (y == 1)

# Train a stacking ensemble of SkLearn, LightGBM and XGBoost member models
classifier = StackingClassifier(
    estimators = [
        ("sklearn", LogisticRegression()),
        ("lightgbm", LGBMClassifier(n_estimators = 3)),
        ("xgboost", XGBClassifier(n_estimators = 3))
    ],
    final_estimator = LogisticRegression()
)
classifier.fit(X, y)

# Export the fitted ensemble to a PMML document
from sklearn2pmml import sklearn2pmml

sklearn2pmml(classifier, "StackingClassifier.pmml")

# Load and verify the PMML document
from jpmml_evaluator import make_evaluator

evaluator = make_evaluator("StackingClassifier.pmml", reporting = True, backend = "py4j") \
    .verify()

X_pmml = DataFrame(X.values, columns = X.columns.values.tolist())

yt = evaluator.evaluateAll(X_pmml)
print(yt)
```

Works absolutely flawlessly. The LightGBM classifier can be moved to any position within the stacking classifier, and everything keeps working just like before.
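(As an illustration of the "any position" claim, using only the names already defined in the script above:)

```python
# Same ensemble, with the LightGBM member moved to the first position -
# converts and evaluates just like the original ordering
classifier = StackingClassifier(
    estimators = [
        ("lightgbm", LGBMClassifier(n_estimators = 3)),
        ("sklearn", LogisticRegression()),
        ("xgboost", XGBClassifier(n_estimators = 3))
    ],
    final_estimator = LogisticRegression()
)
```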
@git20190108 The burden of proof is now on you - please take my test script, and "break it" so that it would start giving the same error that you were seeing in your own script before.
Is this normal?
I think I got your question now - "when I try to export the intermediate results of the stacking ensemble classifier, then why do they show up as zero values?" Very interesting indeed. Am exploring.
yes
Seems like a data transfer error somewhere in the Python wrapper, because when I evaluate the same PMML document with the JPMML-Evaluator command-line application, I get predictions that match SkLearn native predictions, plus all the intermediate LightGBM, XGBoost etc. values.
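For reference, evaluating the same document from the command line looks roughly like this (a sketch: the example JAR name and the class/flag names follow the JPMML-Evaluator example module as I remember it, so treat them as assumptions, and substitute the actual version for the placeholder):

```
java -cp pmml-evaluator-example-executable-${version}.jar \
    org.jpmml.evaluator.EvaluationExample \
    --model StackingClassifier.pmml \
    --input input.csv \
    --output output.csv
```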
yes
This issue is about two things.

First, the JPMML-SkLearn converter library is generating incorrect PMML documents for both stacking estimator types. The fix is straightforward (a simple replace in the converter code). Existing PMML documents can be fixed by simply deleting the offending elements; see the keep/delete example below.

Second, the JPMML-Evaluator-Python package gets confused when it is requested to re-define the target field over and over again (first by the member models "sklearn", "lightgbm" and "xgboost", and then finally at the top level). Right now, it simply retains and returns the first (partial) definition. According to the PMML specification, it should be an error to re-define the value of some field when moving from one model chain element to another. Therefore, the correct behaviour for any PMML engine would be to fail with an error here. The JPMML-Evaluator Java library is not doing it, which needs fixing. Its Python wrapper is currently even worse, because it returns a partial result.
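To make the "model chain" scoping concrete, here is a heavily abbreviated sketch of the structure in question (the element layout follows the PMML schema, but the segment ids and omissions are illustrative, not copied from the actual generated document):

```xml
<MiningModel functionName="classification">
  <Segmentation multiModelMethod="modelChain">
    <Segment id="1">
      <!-- member model "lightgbm" -->
      <MiningModel functionName="classification">
        <!-- ... -->
        <Output>
          <!-- defines a probability output for the target field -->
          <OutputField name="predict_proba(0, 1)" optype="continuous" dataType="double" feature="probability" value="1" isFinalResult="false"/>
        </Output>
      </MiningModel>
    </Segment>
    <!-- segments for the "sklearn" and "xgboost" member models look analogous,
         each re-defining output fields for the same target -->
    <Segment id="4">
      <!-- final_estimator: the target field is defined yet again at the top level -->
      <RegressionModel functionName="classification">
        <!-- ... -->
      </RegressionModel>
    </Segment>
  </Segmentation>
</MiningModel>
```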
TLDR: There are fixes needed in two locations - in the JPMML-SkLearn converter (stop generating the bad scoping), and in the JPMML-Evaluator library plus its Python wrapper (fail with an error when encountering it).
The fact that PyPMML "works" is no argument, because PyPMML does not perform any PMML document sanity/validity checks on its own. It's too stupid for that.
The above test script produces a "StackingClassifier.pmml" file. When this file is opened in a text editor, the offending elements can be located and deleted by hand.
I will use your method to fix the previous script.
Using my "StackingClassifier.pmml" file as an example: You should keep:
You should delete:
This keep/delete transformation can probably be automated using an XSLT stylesheet. But I'm too lazy to work on it now. I will fix the conversion part of this issue in the next SkLearn2PMML package release. Probably sometimes next week. @git20190108 You shall receive a GitHub notification when this issue gets closed. After that, update your SkLearn2PMML package version, and everything should work fine. Also, thanks for spotting and reporting this issue to me! Much appreciated. |
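For what it's worth, such an XSLT stylesheet would essentially be an identity transform plus empty templates for the elements to be deleted. A minimal skeleton (the PMML namespace URI depends on the schema version of the generated file, and the match expression is a placeholder - substitute the actual offending elements from the keep/delete lists above):

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:pmml="http://www.dmg.org/PMML-4_4">
  <!-- Identity template: copies every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- An empty template drops whatever it matches. The match expression
       here is a PLACEHOLDER, not the actual fix - fill it in with the
       elements from the "You should delete" list -->
  <xsl:template match="pmml:OutputField[@name = 'placeholder']"/>
</xsl:stylesheet>
```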
Hi @vruusmann,

I found a problem with passing the scoped value. The XML content

```xml
<OutputField name="predict_proba(0, 1)" optype="continuous" dataType="double" feature="probability" value="1" isFinalResult="false"/>
```

in my PMML file is invalid; after I rewrite the value, it works. Only the LightGBM predict_proba always returns 0, the other models are OK. It seems the package can't recognize the value of predict_proba(0, 1).
before:
after:
file detail:
packages:
jpmml-evaluator-python: 0.10.1
java: "1.8.0_211"
Python: 3.9.17
script:
detail.txt