XGBoost4j-spark multiclass with objective "multi:softmax" returns incorrect prediction column value #7643

Closed
BenWilson2 opened this issue Feb 10, 2022 · 6 comments · Fixed by #7694

Comments

@BenWilson2

Running the following script in Apache Spark with XGBoost4j-spark 1.5.1 produces an incorrect prediction column on transform: the predicted value does not match the dominant class in the generated rawPrediction column.

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassificationModel,XGBoostClassifier}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val schema = new StructType(Array(
      StructField("item", StringType, true),
      StructField("sepal length", DoubleType, true),
      StructField("sepal width", DoubleType, true),
      StructField("petal length", DoubleType, true),
      StructField("petal width", DoubleType, true),
      StructField("class", StringType, true)))

val rawInput = spark.read.schema(schema).format("csv").option("header", "true").load("dbfs:/databricks-datasets/Rdatasets/data-001/csv/datasets/iris.csv")

rawInput.createOrReplaceTempView("iristable")

val rawData = spark.table("iristable")

val stringIndexer = new StringIndexer().setInputCol("class").setOutputCol("classIndex").fit(rawData)
val stringIndexed = stringIndexer.transform(rawData).drop("class")

val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("sepal length", "sepal width", "petal length", "petal width"))
  .setOutputCol("features")

val xgbInput = vectorAssembler.transform(stringIndexed).select("features","classIndex")

val xgbParam = Map(
  "objective" -> "multi:softmax",
  "num_class" -> 3,
  "num_round" -> 10
)

val xgbClassifier = new XGBoostClassifier(xgbParam).setFeaturesCol("features").setLabelCol("classIndex")
val xgbClassificationModel = xgbClassifier.fit(xgbInput)

display(xgbClassificationModel.transform(xgbInput))

An example of the output is shown below:
[Screenshot of the transform output, taken 2022-02-09]

The objective "multi:softprob" does not behave this way: it returns prediction column values that match the source rawPrediction column.
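Based on that observation, one possible stopgap on 1.5.1 is to train with "multi:softprob" instead. The sketch below only changes the objective in the parameter map from the script above; the name xgbParamSoftprob is hypothetical, and whether the resulting model is an acceptable substitute depends on the use case.

```scala
// Same parameters as the script above, with only the objective swapped.
// Assumption: the prediction column is derived correctly under multi:softprob,
// as reported in this issue.
val xgbParamSoftprob: Map[String, Any] = Map(
  "objective" -> "multi:softprob",
  "num_class" -> 3,
  "num_round" -> 10
)
```

The map can be passed to new XGBoostClassifier(...) in place of xgbParam, with the rest of the script unchanged.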

Environment:
Apache Spark 3.2.1
Scala 2.12
XGBoost4j-spark 1.5.1

@trivialfis
Member

softmax is a bit problematic in general as it removes the probability by reducing it to the output label. We can add a check to remove the probability column if softmax is used.

@BenWilson2
Author

The main issue is that the softmax implementation seems to be capturing the wrong prediction label. In the case above it assigns label 0 to every prediction, whereas the rawPrediction values in the screenshot show that class label "1" should have been in the prediction column.
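The expectation described here, that the prediction equals the index of the largest rawPrediction entry, can be written down as a small check. This is a minimal sketch in plain Scala, not code from XGBoost4j-spark; the helper names and the sample scores are hypothetical and are not taken from the screenshot.

```scala
// Index of the largest score in a raw prediction vector.
def argmaxIndex(scores: Seq[Double]): Int = {
  require(scores.nonEmpty, "scores must not be empty")
  scores.indices.maxBy(i => scores(i))
}

// True when the reported prediction matches the dominant class of the raw scores.
def isConsistent(rawPrediction: Seq[Double], prediction: Double): Boolean =
  argmaxIndex(rawPrediction).toDouble == prediction

// Hypothetical row for illustration: class 1 has the largest score.
val hypotheticalRaw = Seq(0.1, 2.3, -0.4)

println(isConsistent(hypotheticalRaw, 0.0)) // false: the kind of mismatch reported here
println(isConsistent(hypotheticalRaw, 1.0)) // true: the expected behavior
```

In a Spark job, the same comparison could be applied per row by wrapping the argmax method of org.apache.spark.ml.linalg.Vector in a UDF over the rawPrediction column and filtering for rows where it differs from prediction.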

@trivialfis
Member

Thank you for raising the issue. @wbo4958 Could you please help take a look when you are available?

@wbo4958
Contributor

wbo4958 commented Feb 16, 2022

Will check this issue.

@wbo4958
Contributor

wbo4958 commented Feb 21, 2022

Yeah, just reproduced this locally; same issue. We need to fix this.

@trivialfis
Member

Maybe related: #3506 .
