[BUG][initCap function]There is an issue converting the uppercase character to lowercase on GPU. #2786

Closed
johnnyzhon opened this issue Jun 23, 2021 · 3 comments · Fixed by #2838
Assignees
Labels
bug Something isn't working cudf_dependency An issue or PR with this label depends on a new feature in cudf

Comments

@johnnyzhon
Collaborator

johnnyzhon commented Jun 23, 2021

Describe the bug
When the initCap function runs on the GPU and the input string contains a digit, for example 'spar2Rk', the output is not the expected result.
The current result is "Spar2Rk": the uppercase character that follows the digit is not converted to lowercase.

Steps/Code to reproduce bug

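import datetime
from decimal import Decimal

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, ByteType, ShortType,
                               IntegerType, LongType, FloatType, DoubleType,
                               BooleanType, StringType, DecimalType,
                               TimestampType, DateType)

import testlib  # reporter's local comparison helper; not included in the issue

def createDf1():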
   print("### CREATE DATAFRAME 1 ####")
   schema = StructType([
                        StructField("byteF", ByteType()),
                        StructField("shortF", ShortType()),
                        StructField("intF", IntegerType()),
                        StructField("longF", LongType()),
                        StructField("floatF", FloatType()),
                        StructField("doubleF", DoubleType()),
                        StructField("booleanF", BooleanType()),
                        StructField("strF", StringType()),
                        StructField("decimalF", DecimalType(5,2)),
                        StructField("timestampF", TimestampType()),
                        StructField("dateF", DateType())
                      ])
   dt = datetime.date(1990, 1, 1)
   tm = datetime.datetime(2020,2,1,12,1,1)
   dcm1 = Decimal('111.11')
   dcm2 = Decimal('222.11')
   data = [
           (10, 700, 1000, 30, None, 3.000013, True, "nVIDIA inc", dcm2, None, dt),
           (10, 500, 2000, 30, None, 3.000013, True, "n2VIDIA inc", dcm2, None, dt),
          ]
   df = spark.createDataFrame(data, schema)
   df.createOrReplaceTempView("test_table1")
   df.show()
if __name__ == "__main__":
   spark = SparkSession.builder.appName("sparktest").getOrCreate()
   createDf1()
   sql_query_line_list=[
                       "SELECT initcap(strF) FROM test_table1",
                       ]
   for sql_query_line in sql_query_line_list:
       # run on CPU (RAPIDS SQL plugin disabled)
       spark.conf.set("spark.rapids.sql.enabled", "false")
       print("CPU Physical Plan")
       spark.sql(sql_query_line).explain()
       cpu_result = spark.sql(sql_query_line).collect()
       # run on GPU (RAPIDS SQL plugin enabled)
       spark.conf.set("spark.rapids.sql.enabled", "true")
       print("GPU Physical Plan")
       spark.sql(sql_query_line).explain()
       gpu_result = spark.sql(sql_query_line).collect()
       #sort result
       cpu_result.sort(key=testlib._RowCmp)
       gpu_result.sort(key=testlib._RowCmp)
       # compare cpu & gpu SQL results
       print ("\n")
       print("### CPU RESULT ###")
       for obj in cpu_result:
           print (str(obj).encode("utf-8"))
       print("\n### GPU RESULT ###")
       for obj in gpu_result:
           print (str(obj).encode("utf-8"))
       print("\n### COMPARING GPU AND CPU RESULT")
       if (testlib.compare(cpu_result, gpu_result)):
           print("[TEST RESULT] PASS")
       else:
           print("[TEST RESULT] FAIL")

Expected behavior
expected result: Spar2rk

Environment details (please complete the following information)

  • Environment location: [Cloud(NGC)]
  • Spark configuration settings related to the issue
        --driver-memory 10G,
        --num-executors 4,
        --executor-memory 32G,
        --conf spark.driver.host=127.0.0.1,
        --conf spark.cores.max=${SPARK_CORES_MAX},
        --conf spark.local.dir=${SPARK_LOCAL_DIR},
        --conf spark.executor.cores=4,
        --conf spark.task.cpus=4,
        --conf spark.driver.memoryOverhead=5G,
        --conf spark.eventLog.enabled=true,
        --conf spark.shuffle.service.enabled=false,
        --conf spark.plugins=com.nvidia.spark.SQLPlugin,
        --conf spark.rapids.sql.concurrentGpuTasks=2,
        --conf spark.locality.wait=0s,
        --conf spark.sql.files.maxPartitionBytes=512m,
        --conf spark.executor.memoryOverhead=10G,
        --conf spark.rapids.memory.pinnedPool.size=8G,
        --conf spark.executor.extraJavaOptions='-Dai.rapids.cudf.prefer-pinned=true',
        --conf spark.executor.resource.gpu.amount=1,
        --conf spark.task.resource.gpu.amount=0.5,
        --conf spark.rapids.sql.expression.InitCap=true,
        --conf spark.rapids.sql.explain=ALL,
        --conf spark.rapids.sql.castStringToFloat.enabled=true,
        --conf spark.rapids.sql.castFloatToString.enabled=true,
        --conf spark.rapids.sql.expression.Lower=true,
        --conf spark.rapids.sql.expression.Upper=true,
        --conf spark.rapids.sql.variableFloatAgg.enabled=true,
        --conf spark.rapids.sql.hasNans=false,
        --conf spark.executor.extraClassPath=/root/jars/rapids-4-spark_2.12-21.06.0.jar:/root/jars/cudf-21.06.1-cuda11.jar,
        --conf spark.driver.extraClassPath=/root/jars/rapids-4-spark_2.12-21.06.0.jar:/root/jars/cudf-21.06.1-cuda11.jar,


@johnnyzhon johnnyzhon added ? - Needs Triage Need team to review and classify bug Something isn't working labels Jun 23, 2021
@jlowe
Member

jlowe commented Jun 23, 2021

The problem is a semantics disconnect between Spark's initcap function and the cudf title() function. Spark expects the first character after whitespace to be capitalized. The cudf title() documentation in the header file states it capitalizes after spaces, but the title implementation documentation states it capitalizes after non-alphabetic characters instead.
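
To make the two behaviors concrete, here is a minimal plain-Python sketch. It does not call Spark or cudf; `initcap_after_space` and `title_after_non_alpha` are hypothetical helper names that only emulate the two rules described above, assuming "whitespace" means a single space character and "alphabetic" means whatever Python's `str.isalpha` accepts.

```python
def initcap_after_space(s: str) -> str:
    """Emulates the rule Spark is described as expecting:
    lowercase everything, then uppercase the first character
    and any character that directly follows a space."""
    out = []
    capitalize_next = True
    for ch in s.lower():
        out.append(ch.upper() if capitalize_next else ch)
        capitalize_next = (ch == " ")
    return "".join(out)


def title_after_non_alpha(s: str) -> str:
    """Emulates the rule the title implementation is described as using:
    uppercase any letter that follows a non-alphabetic character
    (or starts the string), lowercase every other letter."""
    out = []
    capitalize_next = True
    for ch in s:
        if ch.isalpha():
            out.append(ch.upper() if capitalize_next else ch.lower())
            capitalize_next = False
        else:
            out.append(ch)
            capitalize_next = True
    return "".join(out)


if __name__ == "__main__":
    for text in ["spar2Rk", "nVIDIA inc", "n2VIDIA inc"]:
        print(repr(text),
              "| after-space rule:", repr(initcap_after_space(text)),
              "| after-non-alpha rule:", repr(title_after_non_alpha(text)))
```

Under these assumptions the two rules agree on "nVIDIA inc" but differ on any input where a letter follows a digit, which is the mismatch reported for 'spar2Rk'.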

@jlowe
Member

jlowe commented Jun 23, 2021

I filed rapidsai/cudf#8596 to track the disconnect between the cudf documentation and implementation.

@jlowe jlowe added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Jun 23, 2021
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Jun 23, 2021
@sameerz
Collaborator

sameerz commented Jun 23, 2021

Related to issue #120

4 participants