[BUG] test_pandas_map_udf_nested_type failed in Yarn integration #2605

Closed
abellina opened this issue Jun 5, 2021 · 1 comment · Fixed by #2627
Labels: bug (Something isn't working), P0 (Must have for release)

Comments


abellina commented Jun 5, 2021

test_pandas_map_udf_nested_type failed in the nightly EGX/Yarn job with:

09:38:58  =================================== FAILURES ===================================
09:38:58  _ test_pandas_map_udf_nested_type[Array(Struct(['child0', Byte],['child1', String],['child2', Float]))] _
09:38:58  
09:38:58  data_gen = Array(Struct(['child0', Byte],['child1', String],['child2', Float]))
09:38:58  
09:38:58      @pytest.mark.parametrize('data_gen', data_gens_nested_for_udf, ids=idfn)
09:38:58      def test_pandas_map_udf_nested_type(data_gen):
09:38:58          # Supported UDF output types by plugin: (commonCudfTypes + ARRAY).nested() + STRUCT
09:38:58          # STRUCT represents the whole dataframe in Map Pandas UDF, so no struct column in UDF output.
09:38:58          # More details is here
09:38:58          #   https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L119
09:38:58          udf_out_schema = 'c_integral long,' \
09:38:58                           'c_string string,' \
09:38:58                           'c_fp double,' \
09:38:58                           'c_bool boolean,' \
09:38:58                           'c_date date,' \
09:38:58                           'c_time timestamp,' \
09:38:58                           'c_array_array array<array<long>>,' \
09:38:58                           'c_array_string array<string>'
09:38:58      
09:38:58          def col_types_udf(pdf_itr):
09:38:58              for pdf in pdf_itr:
09:38:58                  # Return a data frame with columns of supported type, and there is only one row.
09:38:58                  # The values can not be generated randomly because it should return the same data
09:38:58                  # for both CPU and GPU runs.
09:38:58                  yield pd.DataFrame({
09:38:58                      "c_integral": [len(pdf)],
09:38:58                      "c_string": ["size" + str(len(pdf))],
09:38:58                      "c_fp": [float(len(pdf))],
09:38:58                      "c_bool": [False],
09:38:58                      "c_date": [date(2021, 4, 2)],
09:38:58                      "c_time": [datetime(2021, 4, 2, tzinfo=timezone.utc)],
09:38:58                      "c_array_array": [[[len(pdf)]]],
09:38:58                      "c_array_string": [["size" + str(len(pdf))]]
09:38:58                  })
09:38:58      
09:38:58          assert_gpu_and_cpu_are_equal_collect(
09:38:58              lambda spark: unary_op_df(spark, data_gen)\
09:38:58                  .mapInPandas(col_types_udf, schema=udf_out_schema),
09:38:58  >           conf=arrow_udf_conf)
09:38:58  
09:38:58  integration_tests/src/main/python/udf_test.py:290:
09:38:58  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
09:38:58  integration_tests/src/main/python/asserts.py:381: in assert_gpu_and_cpu_are_equal_collect
09:38:58      _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
09:38:58  integration_tests/src/main/python/asserts.py:373: in _assert_gpu_and_cpu_are_equal
09:38:58      assert_equal(from_cpu, from_gpu)
09:38:58  integration_tests/src/main/python/asserts.py:93: in assert_equal
09:38:58      _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
09:38:58  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
09:38:58  
09:38:58  cpu = [Row(c_integral=85, c_string='size85', c_fp=85.0, c_bool=False, c_date=datetime.date(2021, 4, 2), c_time=datetime.date...me.date(2021, 4, 2), c_time=datetime.datetime(2021, 4, 2, 0, 0), c_array_array=[[85]], c_array_string=['size85']), ...]
09:38:58  gpu = [Row(c_integral=170, c_string='size170', c_fp=170.0, c_bool=False, c_date=datetime.date(2021, 4, 2), c_time=datetime.d....date(2021, 4, 2), c_time=datetime.datetime(2021, 4, 2, 0, 0), c_array_array=[[170]], c_array_string=['size170']), ...]
09:38:58  float_check = <function get_float_check.<locals>.<lambda> at 0x7fb846e526a8>
09:38:58  path = []
09:38:58  
09:38:58      def _assert_equal(cpu, gpu, float_check, path):
09:38:58          t = type(cpu)
09:38:58          if (t is Row):
09:38:58              assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
09:38:58              if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
09:38:58                  for field in cpu.__fields__:
09:38:58                      _assert_equal(cpu[field], gpu[field], float_check, path + [field])
09:38:58              else:
09:38:58                  for index in range(len(cpu)):
09:38:58                      _assert_equal(cpu[index], gpu[index], float_check, path + [index])
09:38:58          elif (t is list):
09:38:58  >           assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
09:38:58  E           AssertionError: CPU and GPU list have different lengths at [] CPU: 24 GPU: 12
09:38:58  
09:38:58  integration_tests/src/main/python/asserts.py:39: AssertionError
09:38:58  ----------------------------- Captured stdout call -----------------------------
09:38:58  ### CPU RUN ###
09:38:58  ### GPU RUN ###
09:38:58  ### COLLECT: GPU TOOK 1.1823148727416992 CPU TOOK 1.1514267921447754 ###
09:38:58  =============================== warnings summary ===============================
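
The issue does not state a root cause, but the two results above are consistent with the same input reaching the UDF in differently sized batches: `col_types_udf` yields one row per batch, so the number of output rows follows the number of batches. A minimal sketch of that effect in plain Python, with no Spark involved. The total of 2040 rows, the batch sizes, and the helper names are illustrative assumptions chosen to match the first row and list length shown in the log, not values confirmed by the test:

```python
from typing import Iterator, List


def split_into_batches(rows: List[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most batch_size rows (hypothetical batching)."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def per_batch_sizes(batches: Iterator[List[int]]) -> Iterator[int]:
    """Mirror the shape of col_types_udf: emit exactly one value per input batch."""
    for batch in batches:
        yield len(batch)


# Illustrative row count only; the real test input size is not shown in the log.
rows = list(range(2040))

small = list(per_batch_sizes(split_into_batches(rows, 85)))
large = list(per_batch_sizes(split_into_batches(rows, 170)))

print(len(small), small[0])  # 24 85
print(len(large), large[0])  # 12 170
```

Under these assumptions the same data produces 24 outputs in one run and 12 in the other, which is the shape of the mismatch reported by `_assert_equal`.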
abellina added the "bug" and "? - Needs Triage" labels Jun 5, 2021

abellina commented Jun 6, 2021

We had a second instance of this test failing, this time with the Array(Long) data generator:

11:22:06  =================================== FAILURES ===================================
11:22:06  _________________ test_pandas_map_udf_nested_type[Array(Long)] _________________
11:22:06  
11:22:06  data_gen = Array(Long)
11:22:06  
11:22:06      @pytest.mark.parametrize('data_gen', data_gens_nested_for_udf, ids=idfn)
11:22:06      def test_pandas_map_udf_nested_type(data_gen):
11:22:06          # Supported UDF output types by plugin: (commonCudfTypes + ARRAY).nested() + STRUCT
11:22:06          # STRUCT represents the whole dataframe in Map Pandas UDF, so no struct column in UDF output.
11:22:06          # More details is here
11:22:06          #   https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L119
11:22:06          udf_out_schema = 'c_integral long,' \
11:22:06                           'c_string string,' \
11:22:06                           'c_fp double,' \
11:22:06                           'c_bool boolean,' \
11:22:06                           'c_date date,' \
11:22:06                           'c_time timestamp,' \
11:22:06                           'c_array_array array<array<long>>,' \
11:22:06                           'c_array_string array<string>'
11:22:06      
11:22:06          def col_types_udf(pdf_itr):
11:22:06              for pdf in pdf_itr:
11:22:06                  # Return a data frame with columns of supported type, and there is only one row.
11:22:06                  # The values can not be generated randomly because it should return the same data
11:22:06                  # for both CPU and GPU runs.
11:22:06                  yield pd.DataFrame({
11:22:06                      "c_integral": [len(pdf)],
11:22:06                      "c_string": ["size" + str(len(pdf))],
11:22:06                      "c_fp": [float(len(pdf))],
11:22:06                      "c_bool": [False],
11:22:06                      "c_date": [date(2021, 4, 2)],
11:22:06                      "c_time": [datetime(2021, 4, 2, tzinfo=timezone.utc)],
11:22:06                      "c_array_array": [[[len(pdf)]]],
11:22:06                      "c_array_string": [["size" + str(len(pdf))]]
11:22:06                  })
11:22:06      
11:22:06  >       assert_gpu_and_cpu_are_equal_collect(
11:22:06              lambda spark: unary_op_df(spark, data_gen)\
11:22:06                  .mapInPandas(col_types_udf, schema=udf_out_schema),
11:22:06              conf=arrow_udf_conf)
11:22:06  
11:22:06  ../../src/main/python/udf_test.py:287:
11:22:06  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
11:22:06  ../../src/main/python/asserts.py:381: in assert_gpu_and_cpu_are_equal_collect
11:22:06      _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
11:22:06  ../../src/main/python/asserts.py:373: in _assert_gpu_and_cpu_are_equal
11:22:06      assert_equal(from_cpu, from_gpu)
11:22:06  ../../src/main/python/asserts.py:93: in assert_equal
11:22:06      _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
11:22:06  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
11:22:06  
11:22:06  cpu = [Row(c_integral=56, c_string='size56', c_fp=56.0, c_bool=False, c_date=datetime.date(2021, 4, 2), c_time=datetime.date...me.date(2021, 4, 2), c_time=datetime.datetime(2021, 4, 2, 0, 0), c_array_array=[[56]], c_array_string=['size56']), ...]
11:22:06  gpu = [Row(c_integral=42, c_string='size42', c_fp=42.0, c_bool=False, c_date=datetime.date(2021, 4, 2), c_time=datetime.date...me.date(2021, 4, 2), c_time=datetime.datetime(2021, 4, 2, 0, 0), c_array_array=[[42]], c_array_string=['size42']), ...]
11:22:06  float_check = <function get_float_check.<locals>.<lambda> at 0x7f40049eb040>
11:22:06  path = []
11:22:06  
11:22:06      def _assert_equal(cpu, gpu, float_check, path):
11:22:06          t = type(cpu)
11:22:06          if (t is Row):
11:22:06              assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
11:22:06              if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
11:22:06                  for field in cpu.__fields__:
11:22:06                      _assert_equal(cpu[field], gpu[field], float_check, path + [field])
11:22:06              else:
11:22:06                  for index in range(len(cpu)):
11:22:06                      _assert_equal(cpu[index], gpu[index], float_check, path + [index])
11:22:06          elif (t is list):
11:22:06  >           assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
11:22:06  E           AssertionError: CPU and GPU list have different lengths at [] CPU: 36 GPU: 48
11:22:06  
11:22:06  ../../src/main/python/asserts.py:39: AssertionError
11:22:06  ----------------------------- Captured stdout call -----------------------------
11:22:06  ### CPU RUN ###
11:22:06  ### GPU RUN ###
11:22:06  ### COLLECT: GPU TOOK 0.3105759620666504 CPU TOOK 0.22146821022033691 ###
11:22:06  =========================== short test summary info ============================
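
In this second failure the first rows and list lengths also agree on the total if every batch had the size of the first one shown (36 × 56 and 48 × 42 are both 2016), which the truncated output does not confirm. Under that assumption, a comparison that aggregates over batches would not depend on how the rows were split. The following is a hypothetical sketch with made-up helper names, not the change made in #2627:

```python
from typing import List


def batch_independent_summary(batch_sizes: List[int]) -> int:
    """Collapse per-batch row counts into a total that ignores batch boundaries."""
    return sum(batch_sizes)


# Assumed per-batch sizes: every batch equals the first row shown in the log.
cpu_batch_sizes = [56] * 36
gpu_batch_sizes = [42] * 48

# An element-by-element comparison differs, as in the failing assertion.
print(cpu_batch_sizes == gpu_batch_sizes)  # False

# The totals agree, so both runs would have seen the same number of rows.
cpu_total = batch_independent_summary(cpu_batch_sizes)
gpu_total = batch_independent_summary(gpu_batch_sizes)
print(cpu_total == gpu_total)  # True
```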

sameerz added the "P0" label and removed the "? - Needs Triage" label Jun 8, 2021
sameerz added this to the June 7 - June 18 milestone Jun 8, 2021
sameerz closed this as completed Jun 8, 2021