[BUG] parquet_test.py pytests FAILED on Databricks-9.1-ML-spark-3.1.2 #4069

NvTimLiu · 2021-11-10T07:28:34Z

Describe the bug

[2021-11-10T06:53:08.173Z] 
[2021-11-10T06:53:08.173Z] =================================== FAILURES ===================================
[2021-11-10T06:53:08.173Z]  test_nested_pruning_and_case_insensitive[true--reader_confs0-[['struct', Struct(['c_1', String],['case_insensitive', Long],['c_3', Short])]]-[['STRUCT', Struct(['case_INSENsitive', Long])]]] _
[2021-11-10T06:53:08.173Z] [gw0] linux -- Python 3.8.12 /databricks/conda/envs/cudf-udf/bin/python
[2021-11-10T06:53:08.173Z] 
[2021-11-10T06:53:08.173Z] spark_tmp_path = '/tmp/pyspark_tests//754491/'
[2021-11-10T06:53:08.173Z] data_gen = [['struct', Struct(['c_1', String],['case_insensitive', Long],['c_3', Short])]]
[2021-11-10T06:53:08.173Z] read_schema = [['STRUCT', Struct(['case_INSENsitive', Long])]]
[2021-11-10T06:53:08.173Z] reader_confs = {'spark.rapids.sql.format.parquet.reader.type': 'PERFILE'}
[2021-11-10T06:53:08.173Z] v1_enabled_list = '', nested_enabled = 'true'
[2021-11-10T06:53:08.173Z] 
[2021-11-10T06:53:08.173Z]     @pytest.mark.parametrize('data_gen,read_schema', _nested_pruning_schemas, ids=idfn)
[2021-11-10T06:53:08.173Z]     @pytest.mark.parametrize('reader_confs', reader_opt_confs)
[2021-11-10T06:53:08.173Z]     @pytest.mark.parametrize('v1_enabled_list', ["", "parquet"])
[2021-11-10T06:53:08.173Z]     @pytest.mark.parametrize('nested_enabled', ["true", "false"])
[2021-11-10T06:53:08.173Z]     def test_nested_pruning_and_case_insensitive(spark_tmp_path, data_gen, read_schema, reader_confs, v1_enabled_list, nested_enabled):
[2021-11-10T06:53:08.173Z]         data_path = spark_tmp_path + '/PARQUET_DATA'
[2021-11-10T06:53:08.173Z]         with_cpu_session(
[2021-11-10T06:53:08.173Z]                 lambda spark : gen_df(spark, data_gen).write.parquet(data_path),
[2021-11-10T06:53:08.173Z]                 conf=rebase_write_corrected_conf)
[2021-11-10T06:53:08.173Z]         all_confs = copy_and_update(reader_confs, {
[2021-11-10T06:53:08.173Z]             'spark.sql.sources.useV1SourceList': v1_enabled_list,
[2021-11-10T06:53:08.173Z]             'spark.sql.optimizer.nestedSchemaPruning.enabled': nested_enabled,
[2021-11-10T06:53:08.173Z]             'spark.sql.legacy.parquet.datetimeRebaseModeInRead': 'CORRECTED'})
[2021-11-10T06:53:08.173Z]         # This is a hack to get the type in a slightly less verbose way
[2021-11-10T06:53:08.173Z]         rs = StructGen(read_schema, nullable=False).data_type
[2021-11-10T06:53:08.173Z] >       assert_gpu_and_cpu_are_equal_collect(lambda spark : spark.read.schema(rs).parquet(data_path),
[2021-11-10T06:53:08.173Z]                 conf=all_confs)
[2021-11-10T06:53:08.174Z] 
[2021-11-10T06:53:08.174Z] ../../src/main/python/parquet_test.py:504: 
[2021-11-10T06:53:08.174Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2021-11-10T06:53:08.174Z] ../../src/main/python/asserts.py:505: in assert_gpu_and_cpu_are_equal_collect
[2021-11-10T06:53:08.174Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
[2021-11-10T06:53:08.174Z] ../../src/main/python/asserts.py:425: in _assert_gpu_and_cpu_are_equal
[2021-11-10T06:53:08.174Z]     run_on_gpu()
[2021-11-10T06:53:08.174Z] ../../src/main/python/asserts.py:419: in run_on_gpu
[2021-11-10T06:53:08.174Z]     from_gpu = with_gpu_session(bring_back, conf=conf)
[2021-11-10T06:53:08.174Z] ../../src/main/python/spark_session.py:105: in with_gpu_session
[2021-11-10T06:53:08.174Z]     return with_spark_session(func, conf=copy)
[2021-11-10T06:53:08.174Z] ../../src/main/python/spark_session.py:70: in with_spark_session
[2021-11-10T06:53:08.174Z]     ret = func(_spark)
[2021-11-10T06:53:08.174Z] ../../src/main/python/asserts.py:198: in <lambda>
[2021-11-10T06:53:08.174Z]     bring_back = lambda spark: limit_func(spark).collect()
[2021-11-10T06:53:08.174Z] /databricks/spark/python/pyspark/sql/dataframe.py:697: in collect
[2021-11-10T06:53:08.174Z]     sock_info = self._jdf.collectToPython()
[2021-11-10T06:53:08.174Z] /databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py:1304: in __call__
[2021-11-10T06:53:08.174Z]     return_value = get_return_value(
[2021-11-10T06:53:08.174Z] /databricks/spark/python/pyspark/sql/utils.py:117: in deco
[2021-11-10T06:53:08.174Z]     return f(*a, **kw)
[2021-11-10T06:53:08.174Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2021-11-10T06:53:08.174Z] 
[2021-11-10T06:53:08.174Z] answer = 'xro499862'
[2021-11-10T06:53:08.174Z] gateway_client = <py4j.java_gateway.GatewayClient object at 0x7f7ea9f70a00>
[2021-11-10T06:53:08.174Z] target_id = 'o499859', name = 'collectToPython'
[2021-11-10T06:53:08.174Z] 
[2021-11-10T06:53:08.174Z]     def get_return_value(answer, gateway_client, target_id=None, name=None):
[2021-11-10T06:53:08.174Z]         """Converts an answer received from the Java gateway into a Python object.
[2021-11-10T06:53:08.174Z]     
[2021-11-10T06:53:08.174Z]         For example, string representation of integers are converted to Python
[2021-11-10T06:53:08.174Z]         integer, string representation of objects are converted to JavaObject
[2021-11-10T06:53:08.174Z]         instances, etc.
[2021-11-10T06:53:08.174Z]     
[2021-11-10T06:53:08.174Z]         :param answer: the string returned by the Java gateway
[2021-11-10T06:53:08.174Z]         :param gateway_client: the gateway client used to communicate with the Java
[2021-11-10T06:53:08.174Z]             Gateway. Only necessary if the answer is a reference (e.g., object,
[2021-11-10T06:53:08.174Z]             list, map)
[2021-11-10T06:53:08.174Z]         :param target_id: the name of the object from which the answer comes from
[2021-11-10T06:53:08.174Z]             (e.g., *object1* in `object1.hello()`). Optional.
[2021-11-10T06:53:08.174Z]         :param name: the name of the member from which the answer comes from
[2021-11-10T06:53:08.174Z]             (e.g., *hello* in `object1.hello()`). Optional.
[2021-11-10T06:53:08.174Z]         """
[2021-11-10T06:53:08.174Z]         if is_error(answer)[0]:
[2021-11-10T06:53:08.174Z]             if len(answer) > 1:
[2021-11-10T06:53:08.174Z]                 type = answer[1]
[2021-11-10T06:53:08.174Z]                 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
[2021-11-10T06:53:08.174Z]                 if answer[1] == REFERENCE_TYPE:
[2021-11-10T06:53:08.174Z] >                   raise Py4JJavaError(
[2021-11-10T06:53:08.174Z]                         "An error occurred while calling {0}{1}{2}.\n".
[2021-11-10T06:53:08.174Z]                         format(target_id, ".", name), value)
[2021-11-10T06:53:08.174Z]                    py4j.protocol.Py4JJavaError: An error occurred while calling o499859.collectToPython.
[2021-11-10T06:53:08.174Z]                    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 14363.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14363.0 (TID 56503) (ip-10-59-180-78.us-west-2.compute.internal executor driver): ai.rapids.cudf.CudfException: cuDF failure at: /home/jenkins/agent/workspace/jenkins-cudf_nightly-dev-github-518-cuda11/cpp/src/io/parquet/reader_impl.cu:386: Found no metadata for schema index
[2021-11-10T06:53:08.174Z]                    	at ai.rapids.cudf.Table.readParquet(Native Method)
[2021-11-10T06:53:08.174Z]                    	at ai.rapids.cudf.Table.readParquet(Table.java:862)
[2021-11-10T06:53:08.174Z]                    	at com.nvidia.spark.rapids.ParquetPartitionReader.$anonfun$readToTable$1(GpuParquetScanBase.scala:1491)
[2021-11-10T06:53:08.174Z]                    	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
[2021-11-10T06:53:08.174Z]                    	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
[2021-11-10T06:53:08.174Z]                    	at com.nvidia.spark.rapids.FilePartitionReaderBase.withResource(GpuMultiFileReader.scala:236)
[2021-11-10T06:53:08.174Z]                    	at com.nvidia.spark.rapids.ParquetPartitionReader.readToTable(GpuParquetScanBase.scala:1490)
[2021-11-10T06:53:08.174Z]                    	at com.nvidia.spark.rapids.ParquetPartitionReader.$anonfun$readBatch$1(GpuParquetScanBase.scala:1451)
[2021-11-10T06:53:08.175Z]                    	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
[2021-11-10T06:53:08.175Z]                    	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
[2021-11-10T06:53:08.175Z]                    	at com.nvidia.spark.rapids.FilePartitionReaderBase.withResource(GpuMultiFileReader.scala:236)
[2021-11-10T06:53:08.175Z]                    	at com.nvidia.spark.rapids.ParquetPartitionReader.readBatch(GpuParquetScanBase.scala:1439)
[2021-11-10T06:53:08.175Z]                    	at com.nvidia.spark.rapids.ParquetPartitionReader.next(GpuParquetScanBase.scala:1424)
[2021-11-10T06:53:08.175Z]                    	at com.nvidia.spark.rapids.PartitionReaderWithBytesRead.next(GpuDataSourceRDD.scala:94)
[2021-11-10T06:53:08.175Z]                    	at com.nvidia.spark.rapids.ColumnarPartitionReaderWithPartitionValues.next(ColumnarPartitionReaderWithPartitionValues.scala:36)
[2021-11-10T06:53:08.175Z]                    	at org.apache.spark.sql.execution.datasources.v2.PartitionedFileReader.next(FilePartitionReaderFactory.scala:54)
[2021-11-10T06:53:08.175Z]                    	at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:67)
[2021-11-10T06:53:08.175Z]                    	at com.nvidia.spark.rapids.PartitionIterator.hasNext(GpuDataSourceRDD.scala:61)
[2021-11-10T06:53:08.175Z]                    	at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(GpuDataSourceRDD.scala:78)
[2021-11-10T06:53:08.175Z]                    	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
[2021-11-10T06:53:08.175Z]                    	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2021-11-10T06:53:08.175Z]                    	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2021-11-10T06:53:08.175Z]                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$2(GpuColumnarToRowExec.scala:223)
[2021-11-10T06:53:08.175Z]                    	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
[2021-11-10T06:53:08.175Z]                    	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
[2021-11-10T06:53:08.175Z]                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.withResource(GpuColumnarToRowExec.scala:178)
[2021-11-10T06:53:08.175Z]                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:222)
[2021-11-10T06:53:08.175Z]                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:199)
[2021-11-10T06:53:08.175Z]                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:239)
[2021-11-10T06:53:08.175Z]                    	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2021-11-10T06:53:08.175Z]                    	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
[2021-11-10T06:53:08.175Z]                    	at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:178)
[2021-11-10T06:53:08.175Z]                    	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
[2021-11-10T06:53:08.175Z]                    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2021-11-10T06:53:08.175Z]                    	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
[2021-11-10T06:53:08.175Z]                    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2021-11-10T06:53:08.175Z]                    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
[2021-11-10T06:53:08.175Z]                    	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)
[2021-11-10T06:53:08.175Z]                    	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)
[2021-11-10T06:53:08.175Z]                    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2021-11-10T06:53:08.175Z]                    	at org.apache.spark.scheduler.Task.run(Task.scala:91)
[2021-11-10T06:53:08.175Z]                    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:813)
[2021-11-10T06:53:08.175Z]                    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1605)
[2021-11-10T06:53:08.175Z]                    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:816)
[2021-11-10T06:53:08.175Z]                    	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[2021-11-10T06:53:08.175Z]                    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2021-11-10T06:53:08.175Z]                    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:672)
[2021-11-10T06:53:08.175Z]                    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2021-11-10T06:53:08.175Z]                    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2021-11-10T06:53:08.176Z]                    	at java.lang.Thread.run(Thread.java:748)
[2021-11-10T06:53:08.176Z]                    
[2021-11-10T06:53:08.176Z]                    Driver stacktrace:
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2828)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2775)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2769)
[2021-11-10T06:53:08.176Z]                    	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[2021-11-10T06:53:08.176Z]                    	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[2021-11-10T06:53:08.176Z]                    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2769)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1305)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1305)
[2021-11-10T06:53:08.176Z]                    	at scala.Option.foreach(Option.scala:407)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1305)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3036)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2977)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2965)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1067)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2476)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:264)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:299)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:82)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:75)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:62)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.sql.execution.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:512)
[2021-11-10T06:53:08.176Z]                    	at scala.Option.getOrElse(Option.scala:189)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:511)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:399)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:374)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:406)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3613)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3825)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:130)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:273)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:104)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:854)
[2021-11-10T06:53:08.176Z]                    	at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
[2021-11-10T06:53:08.177Z]                    	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:223)
[2021-11-10T06:53:08.177Z]                    	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3823)
[2021-11-10T06:53:08.177Z]                    	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3611)
[2021-11-10T06:53:08.177Z]                    	at sun.reflect.GeneratedMethodAccessor116.invoke(Unknown Source)
[2021-11-10T06:53:08.177Z]                    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2021-11-10T06:53:08.177Z]                    	at java.lang.reflect.Method.invoke(Method.java:498)
[2021-11-10T06:53:08.177Z]                    	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
[2021-11-10T06:53:08.177Z]                    	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
[2021-11-10T06:53:08.177Z]                    	at py4j.Gateway.invoke(Gateway.java:295)
[2021-11-10T06:53:08.177Z]                    	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
[2021-11-10T06:53:08.177Z]                    	at py4j.commands.CallCommand.execute(CallCommand.java:79)
[2021-11-10T06:53:08.177Z]                    	at py4j.GatewayConnection.run(GatewayConnection.java:251)
[2021-11-10T06:53:08.177Z]                    	at java.lang.Thread.run(Thread.java:748)
[2021-11-10T06:53:08.177Z]                    Caused by: ai.rapids.cudf.CudfException: cuDF failure at: /home/jenkins/agent/workspace/jenkins-cudf_nightly-dev-github-518-cuda11/cpp/src/io/parquet/reader_impl.cu:386: Found no metadata for schema index
[2021-11-10T06:53:08.177Z]                    	at ai.rapids.cudf.Table.readParquet(Native Method)
[2021-11-10T06:53:08.177Z]                    	at ai.rapids.cudf.Table.readParquet(Table.java:862)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.ParquetPartitionReader.$anonfun$readToTable$1(GpuParquetScanBase.scala:1491)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.FilePartitionReaderBase.withResource(GpuMultiFileReader.scala:236)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.ParquetPartitionReader.readToTable(GpuParquetScanBase.scala:1490)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.ParquetPartitionReader.$anonfun$readBatch$1(GpuParquetScanBase.scala:1451)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.FilePartitionReaderBase.withResource(GpuMultiFileReader.scala:236)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.ParquetPartitionReader.readBatch(GpuParquetScanBase.scala:1439)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.ParquetPartitionReader.next(GpuParquetScanBase.scala:1424)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.PartitionReaderWithBytesRead.next(GpuDataSourceRDD.scala:94)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.ColumnarPartitionReaderWithPartitionValues.next(ColumnarPartitionReaderWithPartitionValues.scala:36)
[2021-11-10T06:53:08.177Z]                    	at org.apache.spark.sql.execution.datasources.v2.PartitionedFileReader.next(FilePartitionReaderFactory.scala:54)
[2021-11-10T06:53:08.177Z]                    	at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:67)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.PartitionIterator.hasNext(GpuDataSourceRDD.scala:61)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(GpuDataSourceRDD.scala:78)
[2021-11-10T06:53:08.177Z]                    	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
[2021-11-10T06:53:08.177Z]                    	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2021-11-10T06:53:08.177Z]                    	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$2(GpuColumnarToRowExec.scala:223)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
[2021-11-10T06:53:08.177Z]                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.withResource(GpuColumnarToRowExec.scala:178)
[2021-11-10T06:53:08.178Z]                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:222)
[2021-11-10T06:53:08.178Z]                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:199)
[2021-11-10T06:53:08.178Z]                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:239)
[2021-11-10T06:53:08.178Z]                    	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2021-11-10T06:53:08.178Z]                    	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
[2021-11-10T06:53:08.178Z]                    	at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:178)
[2021-11-10T06:53:08.178Z]                    	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
[2021-11-10T06:53:08.178Z]                    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2021-11-10T06:53:08.178Z]                    	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
[2021-11-10T06:53:08.178Z]                    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2021-11-10T06:53:08.178Z]                    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
[2021-11-10T06:53:08.178Z]                    	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)
[2021-11-10T06:53:08.178Z]                    	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)
[2021-11-10T06:53:08.178Z]                    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2021-11-10T06:53:08.178Z]                    	at org.apache.spark.scheduler.Task.run(Task.scala:91)
[2021-11-10T06:53:08.178Z]                    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:813)
[2021-11-10T06:53:08.178Z]                    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1605)
[2021-11-10T06:53:08.178Z]                    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:816)
[2021-11-10T06:53:08.178Z]                    	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[2021-11-10T06:53:08.178Z]                    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2021-11-10T06:53:08.178Z]                    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:672)
[2021-11-10T06:53:08.178Z]                    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2021-11-10T06:53:08.178Z]                    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2021-11-10T06:53:08.178Z]                    	... 1 more
[2021-11-10T06:53:08.178Z] 
[2021-11-10T06:53:08.178Z] /databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py:326: Py4JJavaError
[2021-11-10T06:53:08.178Z] ----------------------------- Captured stdout call -----------------------------
..... 
[2021-11-10T06:53:08.875Z] =========================== short test summary info ============================
[2021-11-10T06:53:08.875Z] FAILED ../../src/main/python/parquet_test.py::test_nested_pruning_and_case_insensitive[true--reader_confs0-[['struct', Struct(['c_1', String],['case_insensitive', Long],['c_3', Short])]]-[['STRUCT', Struct(['case_INSENsitive', Long])]]]
[2021-11-10T06:53:08.875Z] FAILED ../../src/main/python/parquet_test.py::test_nested_pruning_and_case_insensitive[true--reader_confs0-[['struct', Struct(['c_1', String],['case_insensitive', Long],['c_3', Short])]]-[['struct', Struct(['CASE_INSENSITIVE', Long])]]]
[2021-11-10T06:53:08.875Z] FAILED ../../src/main/python/parquet_test.py::test_nested_pruning_and_case_insensitive[true--reader_confs0-[['struct', Struct(['c_1', String],['case_insensitive', Long],['c_3', Short])]]-[['stRUct', Struct(['CASE_INSENSITIVE', Long])]]]
[2021-11-10T06:53:08.875Z] FAILED ../../src/main/python/parquet_test.py::test_nested_pruning_and_case_insensitive[true--reader_confs1-[['struct', Struct(['c_1', String],['case_insensitive', Long],['c_3', Short])]]-[['STRUCT', Struct(['case_INSENsitive', Long])]]]
[2021-11-10T06:53:08.875Z] FAILED ../../src/main/python/parquet_test.py::test_nested_pruning_and_case_insensitive[true--reader_confs1-[['struct', Struct(['c_1', String],['case_insensitive', Long],['c_3', Short])]]-[['struct', Struct(['CASE_INSENSITIVE', Long])]]]
[2021-11-10T06:53:08.875Z] FAILED ../../src/main/python/parquet_test.py::test_nested_pruning_and_case_insensitive[true--reader_confs1-[['struct', Struct(['c_1', String],['case_insensitive', Long],['c_3', Short])]]-[['stRUct', Struct(['CASE_INSENSITIVE', Long])]]]
[2021-11-10T06:53:08.876Z] FAILED ../../src/main/python/parquet_test.py::test_nested_pruning_and_case_insensitive[true--reader_confs2-[['struct', Struct(['c_1', String],['case_insensitive', Long],['c_3', Short])]]-[['STRUCT', Struct(['case_INSENsitive', Long])]]]
[2021-11-10T06:53:08.876Z] FAILED ../../src/main/python/parquet_test.py::test_nested_pruning_and_case_insensitive[true--reader_confs2-[['struct', Struct(['c_1', String],['case_insensitive', Long],['c_3', 
14:53:09  = 36 failed, 10781 passed, 136 skipped, 404 xfailed, 156 xpassed, 76 warnings in 6261.85s (1:44:21) =

Steps/Code to reproduce bug
Build rapids-4-spark and run the integration tests (IT) on Databricks 9.1 ML spark-3.1.2.

Environment details (please complete the following information)

Environment location: [Local spark, Databricks 9.1ML spark-3.1.2]


NvTimLiu · 2021-11-10T07:30:15Z

@wbo4958 Is #3982 related to this issue?

NvTimLiu · 2021-11-10T08:07:32Z

@wbo4958 The parquet tests failed only on DB 9.1.

The tests pass on DB 7.3, DB 8.2, and other non-DB environments.

wbo4958 · 2021-11-10T12:31:38Z

It looks like the 9.1 runtime has changed the API "ParquetReadSupport.clipParquetSchema", which now returns a different result.

On the DB 9.1 (Spark 3.1.2) runtime:

 clippedSchemaTmp:message spark_schema {
  optional group STRUCT {
    optional int64 case_INSENsitive;
  }
}

On Spark 3.1.2:

 clippedSchemaTmp:message spark_schema {
  optional group struct {
    optional int64 case_insensitive;
  }
}

I will continue checking tomorrow.
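The difference between the two clipped schemas can be modeled in a few lines of Python. This is only a sketch under stated assumptions: `clip_schema` and its `use_file_names` flag are hypothetical names for illustration, not part of Spark or the plugin, and schemas are modeled as plain nested dicts using the type names from the test's `data_gen`.

```python
# Hypothetical model, not Spark's real API: a schema is a dict mapping a
# field name to either a nested dict (a struct) or a type name string.

def clip_schema(file_schema, read_schema, use_file_names):
    """Keep only the fields of file_schema requested by read_schema,
    matching field names case-insensitively.

    use_file_names=True  -> the result uses the names stored in the file
    use_file_names=False -> the result uses the names from the read schema
    """
    lowered = {name.lower(): name for name in file_schema}
    clipped = {}
    for read_name, read_type in read_schema.items():
        file_name = lowered.get(read_name.lower())
        if file_name is None:
            continue  # the requested field is absent from the file
        file_type = file_schema[file_name]
        if isinstance(file_type, dict) and isinstance(read_type, dict):
            value = clip_schema(file_type, read_type, use_file_names)
        else:
            value = file_type
        clipped[file_name if use_file_names else read_name] = value
    return clipped


# Schemas taken from the first failing test case in the log.
file_schema = {"struct": {"c_1": "String", "case_insensitive": "Long", "c_3": "Short"}}
read_schema = {"STRUCT": {"case_INSENsitive": "Long"}}

like_spark_312 = clip_schema(file_schema, read_schema, use_file_names=True)
like_db_91 = clip_schema(file_schema, read_schema, use_file_names=False)
```

In this model, `like_spark_312` keeps the file's spelling (`struct` / `case_insensitive`), while `like_db_91` keeps the read schema's spelling (`STRUCT` / `case_INSENsitive`), mirroring the two dumps above.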

wbo4958 · 2021-11-11T04:28:04Z

"ParquetReadSupport.clipParquetSchema" in DB 9.1 returns a schema whose field names match readDataSchema rather than the Parquet file schema. As a result, clipBlocks returns empty ColumnChunkMetaData, which is what causes this failure.
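The failure mode described above can be sketched as follows. This is a simplified model, not the plugin's code: `clip_columns` is a hypothetical stand-in for the column filtering that clipBlocks performs, and column chunks are represented only by their dotted paths.

```python
def clip_columns(chunk_paths, clipped_schema_paths, case_sensitive):
    """Return the column chunk paths that appear in the clipped schema."""
    if case_sensitive:
        wanted = set(clipped_schema_paths)
        return [p for p in chunk_paths if p in wanted]
    # Case-insensitive: compare lowercased paths, keep the file's spelling.
    wanted = {p.lower() for p in clipped_schema_paths}
    return [p for p in chunk_paths if p.lower() in wanted]


# Column paths as written to the Parquet file by the test's data_gen.
chunk_paths = ["struct.c_1", "struct.case_insensitive", "struct.c_3"]

# Clipped schema paths: file spelling (Spark 3.1.2) vs read-schema
# spelling (DB 9.1), as reported in this issue.
spark_312_paths = ["struct.case_insensitive"]
db_91_paths = ["STRUCT.case_INSENsitive"]

exact_on_spark = clip_columns(chunk_paths, spark_312_paths, case_sensitive=True)
exact_on_db = clip_columns(chunk_paths, db_91_paths, case_sensitive=True)
relaxed_on_db = clip_columns(chunk_paths, db_91_paths, case_sensitive=False)
```

With exact matching, the DB 9.1 spelling selects no column chunks at all, which is consistent with the cuDF "Found no metadata for schema index" error in the log. Comparing names case-insensitively restores the match in this model.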

…turns a schema with the same names as the readSchema when case insensitive, which causes clipBlocks to return incorrect results, since clipBlocks only handles case-sensitive matching. Signed-off-by: Bobby Wang wbo4958@gmail.com To fix NVIDIA#4069

NvTimLiu added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels Nov 10, 2021

NvTimLiu assigned wbo4958 Nov 10, 2021

NvTimLiu linked a pull request Nov 10, 2021 that will close this issue

Fix the issue of parquet reading with case insensitive schema #3982

Merged

pxLi changed the title ~~[BUG] parquet_test.py pytests FAILED on Databricks-9.1-ML-spark-3.0.2~~ [BUG] parquet_test.py pytests FAILED on Databricks-9.1-ML-spark-3.1.2 Nov 10, 2021

NvTimLiu mentioned this issue Nov 10, 2021

Change Databricks image from 8.2 to 9.1 [skip ci] #4049

Merged

tgravescs added the P0 (Must have for release) label Nov 10, 2021

wbo4958 mentioned this issue Nov 11, 2021

Add case insensitive when clipping parquet blocks [databricks] #4080

Merged

wbo4958 closed this as completed in #4080 Nov 11, 2021

pxLi removed the ? - Needs Triage (Need team to review and classify) label Nov 12, 2021
