XGBoost4j-spark multiclass with objective "multi:softmax" returns incorrect prediction column value #7643

BenWilson2 · 2022-02-10T00:59:24Z

Running the following script in Apache Spark with XGBoost4j-spark version 1.5.1 produces an incorrect prediction column on transform: the value does not match the dominant class in the generated rawPrediction column.

import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
import org.apache.spark.sql.functions._
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassificationModel,XGBoostClassifier}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val schema = new StructType(Array(
  StructField("item", StringType, true),
  StructField("sepal length", DoubleType, true),
  StructField("sepal width", DoubleType, true),
  StructField("petal length", DoubleType, true),
  StructField("petal width", DoubleType, true),
  StructField("class", StringType, true)))

val rawInput = spark.read.schema(schema).format("csv").option("header", "true").load("dbfs:/databricks-datasets/Rdatasets/data-001/csv/datasets/iris.csv")

rawInput.createOrReplaceTempView("iristable")

val rawData = spark.table("iristable")

val stringIndexer = new StringIndexer().setInputCol("class").setOutputCol("classIndex").fit(rawData)
val stringIndexed = stringIndexer.transform(rawData).drop("class")

val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("sepal length", "sepal width", "petal length", "petal width"))
  .setOutputCol("features")

val xgbInput = vectorAssembler.transform(stringIndexed).select("features","classIndex")

val xgbParam = Map(
  "objective" -> "multi:softmax",
  "num_class" -> 3,
  "num_round" -> 10
)

val xgbClassifier = new XGBoostClassifier(xgbParam).setFeaturesCol("features").setLabelCol("classIndex")
val xgbClassificationModel = xgbClassifier.fit(xgbInput)

display(xgbClassificationModel.transform(xgbInput))

The output example for this is below:

[Screenshot of the transform output]

The objective "multi:softprob" does not show this behavior: it returns the correct prediction column values from the source rawPrediction column.
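
Since "multi:softprob" is reported to give a correct prediction column, one possible stopgap on an affected version is to change only the objective. The snippet below is a sketch under that assumption; `xgbParamSoftprob` is a hypothetical name, and the other values are copied from the script above.

```scala
// Same parameters as the reproduction script; only the objective differs.
// The name xgbParamSoftprob is illustrative and not part of the original report.
val xgbParamSoftprob: Map[String, Any] = Map(
  "objective" -> "multi:softprob",
  "num_class" -> 3,
  "num_round" -> 10
)
```

This map would be passed to `new XGBoostClassifier(...)` in place of `xgbParam`, with the rest of the script unchanged.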

Environment:
Apache Spark 3.2.1
Scala 2.12
XGBoost4j-spark 1.5.1

trivialfis · 2022-02-15T21:05:53Z

softmax is a bit problematic in general as it removes the probability by reducing it to the output label. We can add a check to remove the probability column if softmax is used.

BenWilson2 · 2022-02-15T21:09:59Z

The main issue is that the softmax implementation seems to be capturing the wrong prediction label. In the case above it assigns label==0 to all predictions, while the rawPrediction values in the screenshot show that class label "1" should have been in the prediction column.
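
The mismatch described here can be checked mechanically: for each row, the expected label is the index of the largest rawPrediction entry. Below is a minimal sketch in plain Scala (script or notebook cell, no Spark dependency). The helpers `expectedLabel` and `mismatches` and the sample values are hypothetical, made up for illustration, and are not taken from the output in this issue.

```scala
// Hypothetical helper: index of the largest raw score, i.e. the label
// the prediction column is expected to carry. Ties resolve to the first index.
def expectedLabel(rawPrediction: Seq[Double]): Int = {
  require(rawPrediction.nonEmpty, "rawPrediction must not be empty")
  rawPrediction.indices.maxBy(i => rawPrediction(i))
}

// Returns the positions of rows whose reported prediction differs
// from the argmax of their rawPrediction values.
def mismatches(rows: Seq[(Seq[Double], Double)]): Seq[Int] =
  rows.zipWithIndex.collect {
    case ((raw, predicted), idx) if expectedLabel(raw) != predicted.toInt => idx
  }

// Made-up rows in the form (rawPrediction, prediction).
val sampleRows = Seq(
  (Seq(0.1, 2.3, -0.5), 0.0), // argmax is 1, but prediction says 0
  (Seq(1.9, 0.2, -0.4), 0.0)  // argmax is 0, prediction agrees
)
val bad = mismatches(sampleRows)
```

On a real DataFrame the same comparison could be done row by row with the `argmax` method of `org.apache.spark.ml.linalg.Vector`; the plain-Scala form is used here only to keep the example independent of a Spark session.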

trivialfis · 2022-02-15T21:12:55Z

Thank you for raising the issue. @wbo4958 Could you please help take a look when you are available?

wbo4958 · 2022-02-16T02:59:23Z

Will check this issue.

wbo4958 · 2022-02-21T07:07:27Z

Yeah, just reproduced this locally, same issue. We need to fix this.

trivialfis · 2022-02-22T00:25:41Z

Maybe related: #3506.

trivialfis added the type: bug label Feb 22, 2022

wbo4958 mentioned this issue Feb 23, 2022

[jvm-packages] fix the prediction issue for multi:softmax #7694

Merged

trivialfis closed this as completed in #7694 Feb 23, 2022
