[BUG][initCap function]There is an issue converting the uppercase character to lowercase on GPU. #2786

johnnyzhon · 2021-06-23T03:58:12Z

Describe the bug
When the initCap function runs on the GPU and the input string contains a digit, for example 'spar2Rk', the output string is not what is expected.
The current result is "Spar2Rk": the uppercase character following the digit is not converted to lowercase.

Steps/Code to reproduce bug

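import datetime
from decimal import Decimal

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, ByteType, ShortType,
                               IntegerType, LongType, FloatType, DoubleType,
                               BooleanType, StringType, DecimalType,
                               TimestampType, DateType)

import testlib  # assumed: the reporter's own comparison helper module, not shown in this issue

def createDf1():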
   print("### CREATE DATAFRAME 1 ####")
   schema = StructType([
                        StructField("byteF", ByteType()),
                        StructField("shortF", ShortType()),
                        StructField("intF", IntegerType()),
                        StructField("longF", LongType()),
                        StructField("floatF", FloatType()),
                        StructField("doubleF", DoubleType()),
                        StructField("booleanF", BooleanType()),
                        StructField("strF", StringType()),
                        StructField("decimalF", DecimalType(5,2)),
                        StructField("timestampF", TimestampType()),
                        StructField("dateF", DateType())
                      ])
   dt = datetime.date(1990, 1, 1)
   tm = datetime.datetime(2020,2,1,12,1,1)
   dcm1 = Decimal('111.11')
   dcm2 = Decimal('222.11')
   data = [
           (10, 700, 1000, 30, None, 3.000013, True, "nVIDIA inc", dcm2, None, dt),
           (10, 500, 2000, 30, None, 3.000013, True, "n2VIDIA inc", dcm2, None, dt),
          ]
   df = spark.createDataFrame(data, schema)
   df.createOrReplaceTempView("test_table1")
   df.show()
if __name__ == "__main__":
   spark = SparkSession.builder.appName("sparktest").getOrCreate()
   createDf1()
   sql_query_line_list=[
                       "SELECT initcap(strF) FROM test_table1",
                       ]
   for sql_query_line in sql_query_line_list:
       # run on CPU (RAPIDS SQL plugin disabled)
       spark.conf.set("spark.rapids.sql.enabled", "false")
       print("CPU Physical Plan")
       spark.sql(sql_query_line).explain()
       cpu_result = spark.sql(sql_query_line).collect()
       # run on GPU (RAPIDS SQL plugin enabled)
       spark.conf.set("spark.rapids.sql.enabled", "true")
       print("GPU Physical Plan")
       spark.sql(sql_query_line).explain()
       gpu_result = spark.sql(sql_query_line).collect()
       #sort result
       cpu_result.sort(key=testlib._RowCmp)
       gpu_result.sort(key=testlib._RowCmp)
       # compare cpu & gpu SQL results
       print ("\n")
       print("### CPU RESULT ###")
       for obj in cpu_result:
           print (str(obj).encode("utf-8"))
       print("\n### GPU RESULT ###")
       for obj in gpu_result:
           print (str(obj).encode("utf-8"))
       print("\n### COMPARING GPU AND CPU RESULT")
       if (testlib.compare(cpu_result, gpu_result)):
           print("[TEST RESULT] PASS")
       else:
           print("[TEST RESULT] FAIL")

Expected behavior
expected result: Spar2rk

Environment details (please complete the following information)

Environment location: [Cloud(NGC)]
Spark configuration settings related to the issue

        --driver-memory 10G,
        --num-executors 4,
        --executor-memory 32G,
        --conf spark.driver.host=127.0.0.1,
        --conf spark.cores.max=${SPARK_CORES_MAX},
        --conf spark.local.dir=${SPARK_LOCAL_DIR},
        --conf spark.executor.cores=4,
        --conf spark.task.cpus=4,
        --conf spark.driver.memoryOverhead=5G,
        --conf spark.eventLog.enabled=true,
        --conf spark.shuffle.service.enabled=false,
        --conf spark.plugins=com.nvidia.spark.SQLPlugin,
        --conf spark.rapids.sql.concurrentGpuTasks=2,
        --conf spark.locality.wait=0s,
        --conf spark.sql.files.maxPartitionBytes=512m,
        --conf spark.executor.memoryOverhead=10G,
        --conf spark.rapids.memory.pinnedPool.size=8G,
        --conf spark.executor.extraJavaOptions='-Dai.rapids.cudf.prefer-pinned=true',
        --conf spark.executor.resource.gpu.amount=1,
        --conf spark.task.resource.gpu.amount=0.5,
        --conf spark.rapids.sql.expression.InitCap=true,
        --conf spark.rapids.sql.explain=ALL,
        --conf spark.rapids.sql.castStringToFloat.enabled=true,
        --conf spark.rapids.sql.castFloatToString.enabled=true,
        --conf spark.rapids.sql.expression.Lower=true,
        --conf spark.rapids.sql.expression.Upper=true,
        --conf spark.rapids.sql.variableFloatAgg.enabled=true,
        --conf spark.rapids.sql.hasNans=false,
        --conf spark.executor.extraClassPath=/root/jars/rapids-4-spark_2.12-21.06.0.jar:/root/jars/cudf-21.06.1-cuda11.jar,
        --conf spark.driver.extraClassPath=/root/jars/rapids-4-spark_2.12-21.06.0.jar:/root/jars/cudf-21.06.1-cuda11.jar,


jlowe · 2021-06-23T16:53:31Z

The problem is a semantics disconnect between Spark's initcap function and the cudf title() function. Spark expects the first character after whitespace to be capitalized. The cudf title() documentation in the header file states it capitalizes after spaces, but the title implementation documentation states it capitalizes after non-alphabetic characters instead.
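To make the difference concrete, here is a minimal plain-Python sketch of the two rules described above. It is only an illustration, not the Spark or cudf implementation: both function names are hypothetical, and treating every `str.isspace()` character as a word separator is a simplifying assumption.

```python
def initcap_after_whitespace(s: str) -> str:
    """Hypothetical model of the Spark-style rule: only whitespace starts a new word."""
    out = []
    start_of_word = True
    for ch in s:
        if ch.isspace():
            out.append(ch)
            start_of_word = True
        elif start_of_word:
            out.append(ch.upper())
            start_of_word = False
        else:
            out.append(ch.lower())
    return "".join(out)


def title_after_non_alpha(s: str) -> str:
    """Hypothetical model of the title-style rule: any non-alphabetic character starts a new word."""
    out = []
    start_of_word = True
    for ch in s:
        if ch.isalpha():
            out.append(ch.upper() if start_of_word else ch.lower())
            start_of_word = False
        else:
            out.append(ch)
            start_of_word = True
    return "".join(out)


if __name__ == "__main__":
    for text in ("spar2Rk", "n2VIDIA inc"):
        print(text, "->", initcap_after_whitespace(text), "vs", title_after_non_alpha(text))
```

Under these assumptions, the first function turns 'spar2Rk' into 'Spar2rk' (the expected result in this report), while the second leaves the 'R' after the digit capitalized and yields 'Spar2Rk' (the result observed on the GPU).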

jlowe · 2021-06-23T17:02:08Z

I filed rapidsai/cudf#8596 to track the disconnect between the cudf documentation and implementation.

sameerz · 2021-06-23T22:37:31Z

Related to issue #120

johnnyzhon added the "? - Needs Triage" and "bug" labels Jun 23, 2021

jlowe added the "cudf_dependency" label Jun 23, 2021

sameerz removed the "? - Needs Triage" label Jun 23, 2021

sameerz mentioned this issue Jun 23, 2021

Update documentation for InitCap incompatibility #2797

Merged

GaryShen2008 assigned firestarman Jun 28, 2021

firestarman mentioned this issue Jun 29, 2021

Replace toTitle with capitalize for GpuInitCap #2838

Merged

jlowe closed this as completed in #2838 Jul 7, 2021
