[BUG] test_pandas_map_udf_nested_type failed in Yarn integration #2605

abellina · 2021-06-05T16:24:49Z

test_pandas_map_udf_nested_type failed in the nightly EGX/Yarn job with:

09:38:58  =================================== FAILURES ===================================
09:38:58  _ test_pandas_map_udf_nested_type[Array(Struct(['child0', Byte],['child1', String],['child2', Float]))] _
09:38:58  
09:38:58  data_gen = Array(Struct(['child0', Byte],['child1', String],['child2', Float]))
09:38:58  
09:38:58      @pytest.mark.parametrize('data_gen', data_gens_nested_for_udf, ids=idfn)
09:38:58      def test_pandas_map_udf_nested_type(data_gen):
09:38:58          # Supported UDF output types by plugin: (commonCudfTypes + ARRAY).nested() + STRUCT
09:38:58          # STRUCT represents the whole dataframe in Map Pandas UDF, so no struct column in UDF output.
09:38:58          # More details is here
09:38:58          #   https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L119
09:38:58          udf_out_schema = 'c_integral long,' \
09:38:58                           'c_string string,' \
09:38:58                           'c_fp double,' \
09:38:58                           'c_bool boolean,' \
09:38:58                           'c_date date,' \
09:38:58                           'c_time timestamp,' \
09:38:58                           'c_array_array array<array<long>>,' \
09:38:58                           'c_array_string array<string>'
09:38:58      
09:38:58          def col_types_udf(pdf_itr):
09:38:58              for pdf in pdf_itr:
09:38:58                  # Return a data frame with columns of supported type, and there is only one row.
09:38:58                  # The values can not be generated randomly because it should return the same data
09:38:58                  # for both CPU and GPU runs.
09:38:58                  yield pd.DataFrame({
09:38:58                      "c_integral": [len(pdf)],
09:38:58                      "c_string": ["size" + str(len(pdf))],
09:38:58                      "c_fp": [float(len(pdf))],
09:38:58                      "c_bool": [False],
09:38:58                      "c_date": [date(2021, 4, 2)],
09:38:58                      "c_time": [datetime(2021, 4, 2, tzinfo=timezone.utc)],
09:38:58                      "c_array_array": [[[len(pdf)]]],
09:38:58                      "c_array_string": [["size" + str(len(pdf))]]
09:38:58                  })
09:38:58      
09:38:58          assert_gpu_and_cpu_are_equal_collect(
09:38:58              lambda spark: unary_op_df(spark, data_gen)\
09:38:58                  .mapInPandas(col_types_udf, schema=udf_out_schema),
09:38:58  >           conf=arrow_udf_conf)
09:38:58  
09:38:58  integration_tests/src/main/python/udf_test.py:290: 
09:38:58  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
09:38:58  integration_tests/src/main/python/asserts.py:381: in assert_gpu_and_cpu_are_equal_collect
09:38:58      _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
09:38:58  integration_tests/src/main/python/asserts.py:373: in _assert_gpu_and_cpu_are_equal
09:38:58      assert_equal(from_cpu, from_gpu)
09:38:58  integration_tests/src/main/python/asserts.py:93: in assert_equal
09:38:58      _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
09:38:58  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
09:38:58  
09:38:58  cpu = [Row(c_integral=85, c_string='size85', c_fp=85.0, c_bool=False, c_date=datetime.date(2021, 4, 2), c_time=datetime.date...me.date(2021, 4, 2), c_time=datetime.datetime(2021, 4, 2, 0, 0), c_array_array=[[85]], c_array_string=['size85']), ...]
09:38:58  gpu = [Row(c_integral=170, c_string='size170', c_fp=170.0, c_bool=False, c_date=datetime.date(2021, 4, 2), c_time=datetime.d....date(2021, 4, 2), c_time=datetime.datetime(2021, 4, 2, 0, 0), c_array_array=[[170]], c_array_string=['size170']), ...]
09:38:58  float_check = <function get_float_check.<locals>.<lambda> at 0x7fb846e526a8>
09:38:58  path = []
09:38:58  
09:38:58      def _assert_equal(cpu, gpu, float_check, path):
09:38:58          t = type(cpu)
09:38:58          if (t is Row):
09:38:58              assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
09:38:58              if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
09:38:58                  for field in cpu.__fields__:
09:38:58                      _assert_equal(cpu[field], gpu[field], float_check, path + [field])
09:38:58              else:
09:38:58                  for index in range(len(cpu)):
09:38:58                      _assert_equal(cpu[index], gpu[index], float_check, path + [index])
09:38:58          elif (t is list):
09:38:58  >           assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
09:38:58  E           AssertionError: CPU and GPU list have different lengths at [] CPU: 24 GPU: 12
09:38:58  
09:38:58  integration_tests/src/main/python/asserts.py:39: AssertionError
09:38:58  ----------------------------- Captured stdout call -----------------------------
09:38:58  ### CPU RUN ###
09:38:58  ### GPU RUN ###
09:38:58  ### COLLECT: GPU TOOK 1.1823148727416992 CPU TOOK 1.1514267921447754 ###
09:38:58  =============================== warnings summary ===============================


abellina · 2021-06-06T16:45:01Z

We had a second instance of this test failing, this time with Array(Long):

11:22:06  =================================== FAILURES ===================================
11:22:06  _________________ test_pandas_map_udf_nested_type[Array(Long)] _________________
11:22:06  
11:22:06  data_gen = Array(Long)
11:22:06  
11:22:06      @pytest.mark.parametrize('data_gen', data_gens_nested_for_udf, ids=idfn)
11:22:06      def test_pandas_map_udf_nested_type(data_gen):
11:22:06          # Supported UDF output types by plugin: (commonCudfTypes + ARRAY).nested() + STRUCT
11:22:06          # STRUCT represents the whole dataframe in Map Pandas UDF, so no struct column in UDF output.
11:22:06          # More details is here
11:22:06          #   https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L119
11:22:06          udf_out_schema = 'c_integral long,' \
11:22:06                           'c_string string,' \
11:22:06                           'c_fp double,' \
11:22:06                           'c_bool boolean,' \
11:22:06                           'c_date date,' \
11:22:06                           'c_time timestamp,' \
11:22:06                           'c_array_array array<array<long>>,' \
11:22:06                           'c_array_string array<string>'
11:22:06      
11:22:06          def col_types_udf(pdf_itr):
11:22:06              for pdf in pdf_itr:
11:22:06                  # Return a data frame with columns of supported type, and there is only one row.
11:22:06                  # The values can not be generated randomly because it should return the same data
11:22:06                  # for both CPU and GPU runs.
11:22:06                  yield pd.DataFrame({
11:22:06                      "c_integral": [len(pdf)],
11:22:06                      "c_string": ["size" + str(len(pdf))],
11:22:06                      "c_fp": [float(len(pdf))],
11:22:06                      "c_bool": [False],
11:22:06                      "c_date": [date(2021, 4, 2)],
11:22:06                      "c_time": [datetime(2021, 4, 2, tzinfo=timezone.utc)],
11:22:06                      "c_array_array": [[[len(pdf)]]],
11:22:06                      "c_array_string": [["size" + str(len(pdf))]]
11:22:06                  })
11:22:06      
11:22:06  >       assert_gpu_and_cpu_are_equal_collect(
11:22:06              lambda spark: unary_op_df(spark, data_gen)\
11:22:06                  .mapInPandas(col_types_udf, schema=udf_out_schema),
11:22:06              conf=arrow_udf_conf)
11:22:06  
11:22:06  ../../src/main/python/udf_test.py:287: 
11:22:06  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
11:22:06  ../../src/main/python/asserts.py:381: in assert_gpu_and_cpu_are_equal_collect
11:22:06      _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
11:22:06  ../../src/main/python/asserts.py:373: in _assert_gpu_and_cpu_are_equal
11:22:06      assert_equal(from_cpu, from_gpu)
11:22:06  ../../src/main/python/asserts.py:93: in assert_equal
11:22:06      _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
11:22:06  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
11:22:06  
11:22:06  cpu = [Row(c_integral=56, c_string='size56', c_fp=56.0, c_bool=False, c_date=datetime.date(2021, 4, 2), c_time=datetime.date...me.date(2021, 4, 2), c_time=datetime.datetime(2021, 4, 2, 0, 0), c_array_array=[[56]], c_array_string=['size56']), ...]
11:22:06  gpu = [Row(c_integral=42, c_string='size42', c_fp=42.0, c_bool=False, c_date=datetime.date(2021, 4, 2), c_time=datetime.date...me.date(2021, 4, 2), c_time=datetime.datetime(2021, 4, 2, 0, 0), c_array_array=[[42]], c_array_string=['size42']), ...]
11:22:06  float_check = <function get_float_check.<locals>.<lambda> at 0x7f40049eb040>
11:22:06  path = []
11:22:06  
11:22:06      def _assert_equal(cpu, gpu, float_check, path):
11:22:06          t = type(cpu)
11:22:06          if (t is Row):
11:22:06              assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
11:22:06              if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
11:22:06                  for field in cpu.__fields__:
11:22:06                      _assert_equal(cpu[field], gpu[field], float_check, path + [field])
11:22:06              else:
11:22:06                  for index in range(len(cpu)):
11:22:06                      _assert_equal(cpu[index], gpu[index], float_check, path + [index])
11:22:06          elif (t is list):
11:22:06  >           assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
11:22:06  E           AssertionError: CPU and GPU list have different lengths at [] CPU: 36 GPU: 48
11:22:06  
11:22:06  ../../src/main/python/asserts.py:39: AssertionError
11:22:06  ----------------------------- Captured stdout call -----------------------------
11:22:06  ### CPU RUN ###
11:22:06  ### GPU RUN ###
11:22:06  ### COLLECT: GPU TOOK 0.3105759620666504 CPU TOOK 0.22146821022033691 ###
11:22:06  =========================== short test summary info ============================
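
Both logs show the CPU and GPU runs returning a different number of rows (24 vs 12, then 36 vs 48). Since col_types_udf yields exactly one row per input batch, one plausible reading is that the two runs simply received the input split into a different number of batches. The following is a minimal sketch of that effect in plain Python, without Spark. The helper names, the total of 2040 input rows, and the batch sizes 85 and 170 are assumptions chosen only so the numbers line up with the first log; the real batching on either side is not shown in the issue.

```python
from typing import Dict, Iterable, Iterator, List


def one_row_per_batch_udf(batches: Iterable[List[int]]) -> Iterator[Dict[str, object]]:
    """Hypothetical stand-in for col_types_udf: one output row per input batch."""
    for batch in batches:
        yield {"c_integral": len(batch), "c_string": "size" + str(len(batch))}


def split_into_batches(rows: List[int], batch_size: int) -> List[List[int]]:
    """Hypothetical helper: cut the input into consecutive fixed-size batches."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


# Assumed input size, picked so that 2040 / 85 = 24 and 2040 / 170 = 12.
rows = list(range(2040))

cpu_like = list(one_row_per_batch_udf(split_into_batches(rows, 85)))
gpu_like = list(one_row_per_batch_udf(split_into_batches(rows, 170)))

# Same input rows, but the number of output rows follows the number of batches.
print(len(cpu_like), len(gpu_like))
print(cpu_like[0], gpu_like[0])
```

Under these assumptions the output row count, and the values inside each row, depend on how the input is batched rather than on the data itself, so a row-by-row comparison of the two runs can fail even when both are processing the same input.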

abellina added the bug and ? - Needs Triage labels Jun 5, 2021

GaryShen2008 assigned firestarman Jun 7, 2021

firestarman mentioned this issue Jun 8, 2021

Ignore order for map udf test #2627

Merged
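
The title of the linked PR suggests comparing the CPU and GPU results without regard to row order. As a rough illustration only, an order-insensitive comparison could look like the sketch below. The function assert_equal_ignoring_order and the sample rows are hypothetical; this is not the code in asserts.py and not the actual change made in #2627.

```python
from typing import List, Tuple

Row = Tuple[int, str]


def assert_equal_ignoring_order(cpu_rows: List[Row], gpu_rows: List[Row]) -> None:
    """Hypothetical helper: compare two result lists as multisets of rows."""
    assert len(cpu_rows) == len(gpu_rows), (
        "CPU and GPU list have different lengths CPU: {} GPU: {}".format(
            len(cpu_rows), len(gpu_rows)))
    # Sort both sides with the same key so that row order no longer matters.
    assert sorted(cpu_rows, key=repr) == sorted(gpu_rows, key=repr), (
        "CPU and GPU rows differ after sorting")


# Same rows, different order: passes.
cpu_rows = [(56, "size56"), (42, "size42")]
gpu_rows = [(42, "size42"), (56, "size56")]
assert_equal_ignoring_order(cpu_rows, gpu_rows)
```

Sorting only helps when both runs return the same set of rows; it does not reconcile runs that return a different number of rows.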

sameerz added the P0 label and removed the ? - Needs Triage label Jun 8, 2021

sameerz added this to the June 7 - June 18 milestone Jun 8, 2021

sameerz closed this as completed Jun 8, 2021
