[BUG] Executors shut down intermittently during parallel integration test runs #5979

pxLi · 2022-07-11T01:16:16Z

Describe the bug

[2022-07-09T16:56:04.988Z] FAILED ../../src/main/python/conditionals_test.py::test_if_else_map[Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))]
[2022-07-09T16:56:04.988Z] FAILED ../../src/main/python/conditionals_test.py::test_case_when[Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))]
[2022-07-10T17:27:40.474Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_groupby_first_last[Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))][IGNORE_ORDER({'local': True})]

The above cases have been failing intermittently since last Friday in multiple pipelines.
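All three failures listed above share the same parameter id, `Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))`. To check whether that also holds in the other pipelines, the `FAILED` lines can be grouped by parameter id. This is only a minimal sketch: `parse_failed_line` and `group_failures_by_param` are hypothetical helper names (not part of the integration test framework), and it assumes the `FAILED <path>::<test>[<param>]` line shape shown above.

```python
from collections import defaultdict


def parse_failed_line(line):
    """Return (path, test_name, first_param_id) for a pytest FAILED line, else None.

    Assumes the line shape seen in this issue:
    "[timestamp] FAILED <path>::<test_name>[<param_id>]..."
    """
    marker = "FAILED "
    start = line.find(marker)
    if start < 0:
        return None
    node_id = line[start + len(marker):].strip()
    path, sep, rest = node_id.partition("::")
    if not sep:
        return None
    name, bracket, tail = rest.partition("[")
    if not bracket:
        return (path, name, None)
    # The param id itself contains brackets, so track nesting depth
    # instead of stopping at the first "]".
    depth = 1
    for i, ch in enumerate(tail):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth == 0:
                return (path, name, tail[:i])
    return (path, name, None)


def group_failures_by_param(log_lines):
    """Map each first param id to the list of test names that failed with it."""
    groups = defaultdict(list)
    for line in log_lines:
        parsed = parse_failed_line(line)
        if parsed is not None:
            _path, name, param = parsed
            groups[param].append(name)
    return dict(groups)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python group_failures.py console.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for param, names in group_failures_by_param(fh).items():
            print(len(names), param, names)
```

If one parameter id dominates across pipelines, that would point at the data generator or the operations on that type rather than at the individual tests.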

Executors got SIGABRT from pytest. Detailed pytest logs:

[2022-07-10T17:27:40.211Z] =================================== FAILURES ===================================
[2022-07-10T17:27:40.211Z] _ test_if_else_map[Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))] _
[2022-07-10T17:27:40.211Z] 
[2022-07-10T17:27:40.211Z] data_gen = Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))
[2022-07-10T17:27:40.211Z] 
[2022-07-10T17:27:40.211Z]     @pytest.mark.parametrize('data_gen', map_gens_sample, ids=idfn)
[2022-07-10T17:27:40.211Z]     def test_if_else_map(data_gen):
[2022-07-10T17:27:40.211Z] >       assert_gpu_and_cpu_are_equal_collect(
[2022-07-10T17:27:40.211Z]                 lambda spark : three_col_df(spark, boolean_gen, data_gen, data_gen).selectExpr(
[2022-07-10T17:27:40.211Z]                     'IF(TRUE, b, c)',
[2022-07-10T17:27:40.211Z]                     'IF(a, b, c)'))
[2022-07-10T17:27:40.211Z] 
[2022-07-10T17:27:40.211Z] ../../src/main/python/conditionals_test.py:65: 
[2022-07-10T17:27:40.211Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2022-07-10T17:27:40.211Z] ../../src/main/python/asserts.py:508: in assert_gpu_and_cpu_are_equal_collect
[2022-07-10T17:27:40.211Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
[2022-07-10T17:27:40.211Z] ../../src/main/python/asserts.py:428: in _assert_gpu_and_cpu_are_equal
[2022-07-10T17:27:40.211Z]     run_on_gpu()
[2022-07-10T17:27:40.211Z] ../../src/main/python/asserts.py:422: in run_on_gpu
[2022-07-10T17:27:40.211Z]     from_gpu = with_gpu_session(bring_back, conf=conf)
[2022-07-10T17:27:40.211Z] ../../src/main/python/spark_session.py:131: in with_gpu_session
[2022-07-10T17:27:40.211Z]     return with_spark_session(func, conf=copy)
[2022-07-10T17:27:40.211Z] ../../src/main/python/spark_session.py:98: in with_spark_session
[2022-07-10T17:27:40.211Z]     ret = func(_spark)
[2022-07-10T17:27:40.211Z] ../../src/main/python/asserts.py:201: in <lambda>
[2022-07-10T17:27:40.211Z]     bring_back = lambda spark: limit_func(spark).collect()
[2022-07-10T17:27:40.211Z] /var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/dataframe.py:677: in collect
[2022-07-10T17:27:40.211Z]     sock_info = self._jdf.collectToPython()
[2022-07-10T17:27:40.211Z] /var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py:1304: in __call__
[2022-07-10T17:27:40.211Z]     return_value = get_return_value(
[2022-07-10T17:27:40.211Z] /var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py:111: in deco
[2022-07-10T17:27:40.211Z]     return f(*a, **kw)
[2022-07-10T17:27:40.211Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2022-07-10T17:27:40.211Z] 
[2022-07-10T17:27:40.211Z] answer = 'xro940240'
[2022-07-10T17:27:40.211Z] gateway_client = <py4j.java_gateway.GatewayClient object at 0x7f98f8a6d850>
[2022-07-10T17:27:40.211Z] target_id = 'o940239', name = 'collectToPython'
[2022-07-10T17:27:40.211Z] 
[2022-07-10T17:27:40.211Z]     def get_return_value(answer, gateway_client, target_id=None, name=None):
[2022-07-10T17:27:40.211Z]         """Converts an answer received from the Java gateway into a Python object.
[2022-07-10T17:27:40.211Z]     
[2022-07-10T17:27:40.211Z]         For example, string representation of integers are converted to Python
[2022-07-10T17:27:40.211Z]         integer, string representation of objects are converted to JavaObject
[2022-07-10T17:27:40.211Z]         instances, etc.
[2022-07-10T17:27:40.211Z]     
[2022-07-10T17:27:40.211Z]         :param answer: the string returned by the Java gateway
[2022-07-10T17:27:40.211Z]         :param gateway_client: the gateway client used to communicate with the Java
[2022-07-10T17:27:40.211Z]             Gateway. Only necessary if the answer is a reference (e.g., object,
[2022-07-10T17:27:40.211Z]             list, map)
[2022-07-10T17:27:40.211Z]         :param target_id: the name of the object from which the answer comes from
[2022-07-10T17:27:40.211Z]             (e.g., *object1* in `object1.hello()`). Optional.
[2022-07-10T17:27:40.211Z]         :param name: the name of the member from which the answer comes from
[2022-07-10T17:27:40.211Z]             (e.g., *hello* in `object1.hello()`). Optional.
[2022-07-10T17:27:40.211Z]         """
[2022-07-10T17:27:40.211Z]         if is_error(answer)[0]:
[2022-07-10T17:27:40.211Z]             if len(answer) > 1:
[2022-07-10T17:27:40.211Z]                 type = answer[1]
[2022-07-10T17:27:40.211Z]                 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
[2022-07-10T17:27:40.211Z]                 if answer[1] == REFERENCE_TYPE:
[2022-07-10T17:27:40.211Z] >                   raise Py4JJavaError(
[2022-07-10T17:27:40.211Z]                         "An error occurred while calling {0}{1}{2}.\n".
[2022-07-10T17:27:40.211Z]                         format(target_id, ".", name), value)
[2022-07-10T17:27:40.211Z] E                   py4j.protocol.Py4JJavaError: An error occurred while calling o940239.collectToPython.
[2022-07-10T17:27:40.211Z] E                   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 9597.0 failed 1 times, most recent failure: Lost task 5.0 in stage 9597.0 (TID 318739) (10.136.6.4 executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[2022-07-10T17:27:40.211Z] E                   Driver stacktrace:
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
[2022-07-10T17:27:40.211Z] E                   	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[2022-07-10T17:27:40.211Z] E                   	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[2022-07-10T17:27:40.211Z] E                   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
[2022-07-10T17:27:40.211Z] E                   	at scala.Option.foreach(Option.scala:407)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3519)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
[2022-07-10T17:27:40.211Z] E                   	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3516)
[2022-07-10T17:27:40.211Z] E                   	at sun.reflect.GeneratedMethodAccessor86.invoke(Unknown Source)
[2022-07-10T17:27:40.211Z] E                   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2022-07-10T17:27:40.211Z] E                   	at java.lang.reflect.Method.invoke(Method.java:498)
[2022-07-10T17:27:40.211Z] E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
[2022-07-10T17:27:40.211Z] E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
[2022-07-10T17:27:40.211Z] E                   	at py4j.Gateway.invoke(Gateway.java:282)
[2022-07-10T17:27:40.211Z] E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
[2022-07-10T17:27:40.211Z] E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
[2022-07-10T17:27:40.211Z] E                   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
[2022-07-10T17:27:40.211Z] E                   	at java.lang.Thread.run(Thread.java:748)
[2022-07-10T17:27:40.211Z] 
[2022-07-10T17:27:40.211Z] /var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py:326: Py4JJavaError
[2022-07-10T17:27:40.211Z] ----------------------------- Captured stdout call -----------------------------
[2022-07-10T17:27:40.211Z] ### CPU RUN ###
[2022-07-10T17:27:40.211Z] ### GPU RUN ###
[2022-07-10T17:27:40.211Z] _ test_case_when[Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))] _
[2022-07-10T17:27:40.211Z] 
[2022-07-10T17:27:40.211Z] data_gen = Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))
[2022-07-10T17:27:40.211Z] 
[2022-07-10T17:27:40.211Z]     @pytest.mark.order(1) # at the head of xdist worker queue if pytest-order is installed
[2022-07-10T17:27:40.211Z]     @pytest.mark.parametrize('data_gen', all_gens + all_nested_gens, ids=idfn)
[2022-07-10T17:27:40.211Z]     def test_case_when(data_gen):
[2022-07-10T17:27:40.211Z]         num_cmps = 20
[2022-07-10T17:27:40.211Z]         s1 = gen_scalar(data_gen, force_no_nulls=not isinstance(data_gen, NullGen))
[2022-07-10T17:27:40.211Z]         # we want lots of false
[2022-07-10T17:27:40.211Z]         bool_gen = BooleanGen().with_special_case(False, weight=1000.0)
[2022-07-10T17:27:40.211Z]         gen_cols = [('_b' + str(x), bool_gen) for x in range(0, num_cmps)]
[2022-07-10T17:27:40.211Z]         gen_cols = gen_cols + [('_c' + str(x), data_gen) for x in range(0, num_cmps)]
[2022-07-10T17:27:40.211Z]         gen = StructGen(gen_cols, nullable=False)
[2022-07-10T17:27:40.211Z]         command = f.when(f.col('_b0'), f.col('_c0'))
[2022-07-10T17:27:40.211Z]         for x in range(1, num_cmps):
[2022-07-10T17:27:40.211Z]             command = command.when(f.col('_b'+ str(x)), f.col('_c' + str(x)))
[2022-07-10T17:27:40.211Z]         command = command.otherwise(s1)
[2022-07-10T17:27:40.211Z]         data_type = data_gen.data_type
[2022-07-10T17:27:40.211Z]         # `command` covers the case of (column, scalar) for values, so the following 3 ones
[2022-07-10T17:27:40.211Z]         # are for
[2022-07-10T17:27:40.211Z]         #    (scalar, scalar)  -> the default `otherwise` is a scalar.
[2022-07-10T17:27:40.211Z]         #    (column, column)
[2022-07-10T17:27:40.211Z]         #    (scalar, column)
[2022-07-10T17:27:40.211Z]         # in sequence.
[2022-07-10T17:27:40.211Z] >       assert_gpu_and_cpu_are_equal_collect(
[2022-07-10T17:27:40.211Z]                 lambda spark : gen_df(spark, gen).select(command,
[2022-07-10T17:27:40.211Z]                     f.when(f.col('_b0'), s1),
[2022-07-10T17:27:40.211Z]                     f.when(f.col('_b0'), f.col('_c0')).otherwise(f.col('_c1')),
[2022-07-10T17:27:40.211Z]                     f.when(f.col('_b0'), s1).otherwise(f.col('_c0')),
[2022-07-10T17:27:40.211Z]                     f.when(f.col('_b0'), s1).when(f.lit(False), f.col('_c0')),
[2022-07-10T17:27:40.211Z]                     f.when(f.col('_b0'), s1).when(f.lit(True), f.col('_c0')),
[2022-07-10T17:27:40.211Z]                     f.when(f.col('_b0'), f.lit(None).cast(data_type)).otherwise(f.col('_c0')),
[2022-07-10T17:27:40.211Z]                     f.when(f.lit(False), f.col('_c0'))))
[2022-07-10T17:27:40.211Z] 
[2022-07-10T17:27:40.211Z] ../../src/main/python/conditionals_test.py:91: 
[2022-07-10T17:27:40.211Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2022-07-10T17:27:40.211Z] ../../src/main/python/asserts.py:508: in assert_gpu_and_cpu_are_equal_collect
[2022-07-10T17:27:40.211Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
[2022-07-10T17:27:40.211Z] ../../src/main/python/asserts.py:428: in _assert_gpu_and_cpu_are_equal
[2022-07-10T17:27:40.211Z]     run_on_gpu()
[2022-07-10T17:27:40.211Z] ../../src/main/python/asserts.py:422: in run_on_gpu
[2022-07-10T17:27:40.211Z]     from_gpu = with_gpu_session(bring_back, conf=conf)
[2022-07-10T17:27:40.211Z] ../../src/main/python/spark_session.py:131: in with_gpu_session
[2022-07-10T17:27:40.211Z]     return with_spark_session(func, conf=copy)
[2022-07-10T17:27:40.211Z] ../../src/main/python/spark_session.py:98: in with_spark_session
[2022-07-10T17:27:40.211Z]     ret = func(_spark)
[2022-07-10T17:27:40.211Z] ../../src/main/python/asserts.py:201: in <lambda>
[2022-07-10T17:27:40.211Z]     bring_back = lambda spark: limit_func(spark).collect()
[2022-07-10T17:27:40.211Z] /var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/dataframe.py:677: in collect
[2022-07-10T17:27:40.211Z]     sock_info = self._jdf.collectToPython()
[2022-07-10T17:27:40.211Z] /var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py:1304: in __call__
[2022-07-10T17:27:40.211Z]     return_value = get_return_value(
[2022-07-10T17:27:40.211Z] /var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py:111: in deco
[2022-07-10T17:27:40.211Z]     return f(*a, **kw)
[2022-07-10T17:27:40.211Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2022-07-10T17:27:40.211Z] 
[2022-07-10T17:27:40.211Z] answer = 'xro993903'
[2022-07-10T17:27:40.211Z] gateway_client = <py4j.java_gateway.GatewayClient object at 0x7f98f8a6d850>
[2022-07-10T17:27:40.211Z] target_id = 'o993902', name = 'collectToPython'
[2022-07-10T17:27:40.211Z] 
[2022-07-10T17:27:40.211Z]     def get_return_value(answer, gateway_client, target_id=None, name=None):
[2022-07-10T17:27:40.211Z]         """Converts an answer received from the Java gateway into a Python object.
[2022-07-10T17:27:40.211Z]     
[2022-07-10T17:27:40.212Z]         For example, string representation of integers are converted to Python
[2022-07-10T17:27:40.212Z]         integer, string representation of objects are converted to JavaObject
[2022-07-10T17:27:40.212Z]         instances, etc.
[2022-07-10T17:27:40.212Z]     
[2022-07-10T17:27:40.212Z]         :param answer: the string returned by the Java gateway
[2022-07-10T17:27:40.212Z]         :param gateway_client: the gateway client used to communicate with the Java
[2022-07-10T17:27:40.212Z]             Gateway. Only necessary if the answer is a reference (e.g., object,
[2022-07-10T17:27:40.212Z]             list, map)
[2022-07-10T17:27:40.212Z]         :param target_id: the name of the object from which the answer comes from
[2022-07-10T17:27:40.212Z]             (e.g., *object1* in `object1.hello()`). Optional.
[2022-07-10T17:27:40.212Z]         :param name: the name of the member from which the answer comes from
[2022-07-10T17:27:40.212Z]             (e.g., *hello* in `object1.hello()`). Optional.
[2022-07-10T17:27:40.212Z]         """
[2022-07-10T17:27:40.212Z]         if is_error(answer)[0]:
[2022-07-10T17:27:40.212Z]             if len(answer) > 1:
[2022-07-10T17:27:40.212Z]                 type = answer[1]
[2022-07-10T17:27:40.212Z]                 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
[2022-07-10T17:27:40.212Z]                 if answer[1] == REFERENCE_TYPE:
[2022-07-10T17:27:40.212Z] >                   raise Py4JJavaError(
[2022-07-10T17:27:40.212Z]                         "An error occurred while calling {0}{1}{2}.\n".
[2022-07-10T17:27:40.212Z]                         format(target_id, ".", name), value)
[2022-07-10T17:27:40.212Z] E                   py4j.protocol.Py4JJavaError: An error occurred while calling o993902.collectToPython.
[2022-07-10T17:27:40.212Z] E                   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in stage 9699.0 failed 1 times, most recent failure: Lost task 38.0 in stage 9699.0 (TID 323622) (10.136.6.4 executor 7): ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[2022-07-10T17:27:40.212Z] E                   Driver stacktrace:
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
[2022-07-10T17:27:40.212Z] E                   	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[2022-07-10T17:27:40.212Z] E                   	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[2022-07-10T17:27:40.212Z] E                   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
[2022-07-10T17:27:40.212Z] E                   	at scala.Option.foreach(Option.scala:407)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3519)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3516)
[2022-07-10T17:27:40.212Z] E                   	at sun.reflect.GeneratedMethodAccessor86.invoke(Unknown Source)
[2022-07-10T17:27:40.212Z] E                   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2022-07-10T17:27:40.212Z] E                   	at java.lang.reflect.Method.invoke(Method.java:498)
[2022-07-10T17:27:40.212Z] E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
[2022-07-10T17:27:40.212Z] E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
[2022-07-10T17:27:40.212Z] E                   	at py4j.Gateway.invoke(Gateway.java:282)
[2022-07-10T17:27:40.212Z] E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
[2022-07-10T17:27:40.212Z] E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
[2022-07-10T17:27:40.212Z] E                   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
[2022-07-10T17:27:40.212Z] E                   	at java.lang.Thread.run(Thread.java:748)
[2022-07-10T17:27:40.212Z] 
[2022-07-10T17:27:40.212Z] /var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py:326: Py4JJavaError
[2022-07-10T17:27:40.212Z] ----------------------------- Captured stdout call -----------------------------
[2022-07-10T17:27:40.212Z] ### CPU RUN ###
[2022-07-10T17:27:40.212Z] ### GPU RUN ###
[2022-07-10T17:27:40.212Z] _ test_groupby_first_last[Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))] _
[2022-07-10T17:27:40.212Z] 
[2022-07-10T17:27:40.212Z] data_gen = Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))
[2022-07-10T17:27:40.212Z] 
[2022-07-10T17:27:40.212Z]     @ignore_order(local=True)
[2022-07-10T17:27:40.212Z]     @pytest.mark.parametrize('data_gen', all_gen + _nested_gens, ids=idfn)
[2022-07-10T17:27:40.212Z]     def test_groupby_first_last(data_gen):
[2022-07-10T17:27:40.212Z]         gen_fn = [('a', RepeatSeqGen(LongGen(), length=20)), ('b', data_gen)]
[2022-07-10T17:27:40.212Z]         agg_fn = lambda df: df.groupBy('a').agg(
[2022-07-10T17:27:40.212Z]             f.first('b'), f.last('b'), f.first('b', True), f.last('b', True))
[2022-07-10T17:27:40.212Z] >       assert_gpu_and_cpu_are_equal_collect(
[2022-07-10T17:27:40.212Z]             # First and last are not deterministic when they are run in a real distributed setup.
[2022-07-10T17:27:40.212Z]             # We set parallelism 1 to prevent nondeterministic results because of distributed setup.
[2022-07-10T17:27:40.212Z]             lambda spark: agg_fn(gen_df(spark, gen_fn, num_slices=1)))
[2022-07-10T17:27:40.212Z] 
[2022-07-10T17:27:40.212Z] ../../src/main/python/hash_aggregate_test.py:1102: 
[2022-07-10T17:27:40.212Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2022-07-10T17:27:40.212Z] ../../src/main/python/asserts.py:508: in assert_gpu_and_cpu_are_equal_collect
[2022-07-10T17:27:40.212Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
[2022-07-10T17:27:40.212Z] ../../src/main/python/asserts.py:428: in _assert_gpu_and_cpu_are_equal
[2022-07-10T17:27:40.212Z]     run_on_gpu()
[2022-07-10T17:27:40.212Z] ../../src/main/python/asserts.py:422: in run_on_gpu
[2022-07-10T17:27:40.212Z]     from_gpu = with_gpu_session(bring_back, conf=conf)
[2022-07-10T17:27:40.212Z] ../../src/main/python/spark_session.py:131: in with_gpu_session
[2022-07-10T17:27:40.212Z]     return with_spark_session(func, conf=copy)
[2022-07-10T17:27:40.212Z] ../../src/main/python/spark_session.py:98: in with_spark_session
[2022-07-10T17:27:40.212Z]     ret = func(_spark)
[2022-07-10T17:27:40.212Z] ../../src/main/python/asserts.py:201: in <lambda>
[2022-07-10T17:27:40.212Z]     bring_back = lambda spark: limit_func(spark).collect()
[2022-07-10T17:27:40.212Z] /var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/dataframe.py:677: in collect
[2022-07-10T17:27:40.212Z]     sock_info = self._jdf.collectToPython()
[2022-07-10T17:27:40.212Z] /var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py:1304: in __call__
[2022-07-10T17:27:40.212Z]     return_value = get_return_value(
[2022-07-10T17:27:40.212Z] /var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py:111: in deco
[2022-07-10T17:27:40.212Z]     return f(*a, **kw)
[2022-07-10T17:27:40.212Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2022-07-10T17:27:40.212Z] 
[2022-07-10T17:27:40.212Z] answer = 'xro1797093'
[2022-07-10T17:27:40.212Z] gateway_client = <py4j.java_gateway.GatewayClient object at 0x7f98f8a6d850>
[2022-07-10T17:27:40.212Z] target_id = 'o1797092', name = 'collectToPython'
[2022-07-10T17:27:40.212Z] 
[2022-07-10T17:27:40.212Z]     def get_return_value(answer, gateway_client, target_id=None, name=None):
[2022-07-10T17:27:40.212Z]         """Converts an answer received from the Java gateway into a Python object.
[2022-07-10T17:27:40.212Z]     
[2022-07-10T17:27:40.212Z]         For example, string representation of integers are converted to Python
[2022-07-10T17:27:40.212Z]         integer, string representation of objects are converted to JavaObject
[2022-07-10T17:27:40.212Z]         instances, etc.
[2022-07-10T17:27:40.212Z]     
[2022-07-10T17:27:40.212Z]         :param answer: the string returned by the Java gateway
[2022-07-10T17:27:40.212Z]         :param gateway_client: the gateway client used to communicate with the Java
[2022-07-10T17:27:40.212Z]             Gateway. Only necessary if the answer is a reference (e.g., object,
[2022-07-10T17:27:40.212Z]             list, map)
[2022-07-10T17:27:40.212Z]         :param target_id: the name of the object from which the answer comes from
[2022-07-10T17:27:40.212Z]             (e.g., *object1* in `object1.hello()`). Optional.
[2022-07-10T17:27:40.212Z]         :param name: the name of the member from which the answer comes from
[2022-07-10T17:27:40.212Z]             (e.g., *hello* in `object1.hello()`). Optional.
[2022-07-10T17:27:40.212Z]         """
[2022-07-10T17:27:40.212Z]         if is_error(answer)[0]:
[2022-07-10T17:27:40.212Z]             if len(answer) > 1:
[2022-07-10T17:27:40.212Z]                 type = answer[1]
[2022-07-10T17:27:40.212Z]                 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
[2022-07-10T17:27:40.212Z]                 if answer[1] == REFERENCE_TYPE:
[2022-07-10T17:27:40.212Z] >                   raise Py4JJavaError(
[2022-07-10T17:27:40.212Z]                         "An error occurred while calling {0}{1}{2}.\n".
[2022-07-10T17:27:40.212Z]                         format(target_id, ".", name), value)
[2022-07-10T17:27:40.212Z] E                   py4j.protocol.Py4JJavaError: An error occurred while calling o1797092.collectToPython.
[2022-07-10T17:27:40.212Z] E                   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23149.0 failed 1 times, most recent failure: Lost task 0.0 in stage 23149.0 (TID 591372) (10.136.6.4 executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[2022-07-10T17:27:40.212Z] E                   Driver stacktrace:
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
[2022-07-10T17:27:40.212Z] E                   	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[2022-07-10T17:27:40.212Z] E                   	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[2022-07-10T17:27:40.212Z] E                   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
[2022-07-10T17:27:40.212Z] E                   	at scala.Option.foreach(Option.scala:407)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3519)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
[2022-07-10T17:27:40.212Z] E                   	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3516)
[2022-07-10T17:27:40.212Z] E                   	at sun.reflect.GeneratedMethodAccessor86.invoke(Unknown Source)
[2022-07-10T17:27:40.212Z] E                   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2022-07-10T17:27:40.212Z] E                   	at java.lang.reflect.Method.invoke(Method.java:498)
[2022-07-10T17:27:40.212Z] E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
[2022-07-10T17:27:40.212Z] E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
[2022-07-10T17:27:40.212Z] E                   	at py4j.Gateway.invoke(Gateway.java:282)
[2022-07-10T17:27:40.212Z] E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
[2022-07-10T17:27:40.212Z] E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
[2022-07-10T17:27:40.212Z] E                   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
[2022-07-10T17:27:40.212Z] E                   	at java.lang.Thread.run(Thread.java:748)
[2022-07-10T17:27:40.212Z] 
[2022-07-10T17:27:40.212Z] /var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py:326: Py4JJavaError

pxLi · 2022-07-11T01:23:26Z

There were no useful logs on the executor side. From the worker log:

22/07/10 12:22:28 INFO TransportClientFactory: Successfully created connection to /127.0.0.1:7077 after 34 ms (0 ms spent in bootstraps)
22/07/10 12:22:28 INFO Worker: Successfully registered with master spark://127.0.0.1:7077
22/07/10 12:22:31 INFO Worker: Asked to launch executor app-20220710122231-0000/0 for rapids spark plugin integration tests (python)
22/07/10 12:22:31 INFO Worker: Asked to launch executor app-20220710122231-0000/1 for rapids spark plugin integration tests (python)
22/07/10 12:22:31 INFO Worker: Asked to launch executor app-20220710122231-0000/2 for rapids spark plugin integration tests (python)
22/07/10 12:22:31 INFO Worker: Asked to launch executor app-20220710122231-0000/3 for rapids spark plugin integration tests (python)
22/07/10 12:22:31 INFO SecurityManager: Changing view acls to: jenkins
22/07/10 12:22:31 INFO SecurityManager: Changing view acls to: jenkins
22/07/10 12:22:31 INFO SecurityManager: Changing modify acls to: jenkins
22/07/10 12:22:31 INFO SecurityManager: Changing view acls to: jenkins
22/07/10 12:22:31 INFO SecurityManager: Changing view acls to: jenkins
22/07/10 12:22:31 INFO SecurityManager: Changing modify acls to: jenkins
22/07/10 12:22:31 INFO SecurityManager: Changing modify acls to: jenkins
22/07/10 12:22:31 INFO SecurityManager: Changing view acls groups to:
22/07/10 12:22:31 INFO SecurityManager: Changing view acls groups to:
22/07/10 12:22:31 INFO SecurityManager: Changing modify acls to: jenkins
22/07/10 12:22:31 INFO SecurityManager: Changing modify acls groups to:
22/07/10 12:22:31 INFO SecurityManager: Changing modify acls groups to:
22/07/10 12:22:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(jenkins); groups with view permissions: Set(); users  with modify permissions: Set(jenkins); groups with modify permissions: Set()
22/07/10 12:22:31 INFO SecurityManager: Changing view acls groups to:
22/07/10 12:22:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(jenkins); groups with view permissions: Set(); users  with modify permissions: Set(jenkins); groups with modify permissions: Set()
22/07/10 12:22:31 INFO SecurityManager: Changing view acls groups to:
22/07/10 12:22:31 INFO SecurityManager: Changing modify acls groups to:
22/07/10 12:22:31 INFO SecurityManager: Changing modify acls groups to:
22/07/10 12:22:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(jenkins); groups with view permissions: Set(); users  with modify permissions: Set(jenkins); groups with modify permissions: Set()
22/07/10 12:22:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(jenkins); groups with view permissions: Set(); users  with modify permissions: Set(jenkins); groups with modify permissions: Set()
22/07/10 12:22:31 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java" "-cp" "/var/lib/jenkins/workspace/rapids_it-UCX-egx06-standalone/jars/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar:/var/lib/jenkins/workspace/rapids_it-UCX-egx06-standalone/jars/rapids-4-spark-integration-tests_2.12-22.08.0-SNAPSHOT-spark312.jar:/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/conf/:/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/jars/*" "-Xmx71680M" "-Dspark.driver.port=46397" "-ea" "-Duser.timezone=UTC" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@10.136.6.4:46397" "--executor-id" "0" "--hostname" "10.136.6.4" "--cores" "12" "--app-id" "app-20220710122231-0000" "--worker-url" "spark://Worker@10.136.6.4:36085"
22/07/10 12:22:31 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java" "-cp" "/var/lib/jenkins/workspace/rapids_it-UCX-egx06-standalone/jars/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar:/var/lib/jenkins/workspace/rapids_it-UCX-egx06-standalone/jars/rapids-4-spark-integration-tests_2.12-22.08.0-SNAPSHOT-spark312.jar:/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/conf/:/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/jars/*" "-Xmx71680M" "-Dspark.driver.port=46397" "-ea" "-Duser.timezone=UTC" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@10.136.6.4:46397" "--executor-id" "1" "--hostname" "10.136.6.4" "--cores" "12" "--app-id" "app-20220710122231-0000" "--worker-url" "spark://Worker@10.136.6.4:36085"
22/07/10 12:22:31 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java" "-cp" "/var/lib/jenkins/workspace/rapids_it-UCX-egx06-standalone/jars/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar:/var/lib/jenkins/workspace/rapids_it-UCX-egx06-standalone/jars/rapids-4-spark-integration-tests_2.12-22.08.0-SNAPSHOT-spark312.jar:/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/conf/:/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/jars/*" "-Xmx71680M" "-Dspark.driver.port=46397" "-ea" "-Duser.timezone=UTC" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@10.136.6.4:46397" "--executor-id" "3" "--hostname" "10.136.6.4" "--cores" "12" "--app-id" "app-20220710122231-0000" "--worker-url" "spark://Worker@10.136.6.4:36085"
22/07/10 12:22:31 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java" "-cp" "/var/lib/jenkins/workspace/rapids_it-UCX-egx06-standalone/jars/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar:/var/lib/jenkins/workspace/rapids_it-UCX-egx06-standalone/jars/rapids-4-spark-integration-tests_2.12-22.08.0-SNAPSHOT-spark312.jar:/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/conf/:/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/jars/*" "-Xmx71680M" "-Dspark.driver.port=46397" "-ea" "-Duser.timezone=UTC" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@10.136.6.4:46397" "--executor-id" "2" "--hostname" "10.136.6.4" "--cores" "12" "--app-id" "app-20220710122231-0000" "--worker-url" "spark://Worker@10.136.6.4:36085"
22/07/10 13:03:47 INFO Worker: Executor app-20220710122231-0000/3 finished with state EXITED message Command exited with code 134 exitStatus 134
22/07/10 13:03:47 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 3
22/07/10 13:03:47 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20220710122231-0000, execId=3)
22/07/10 13:03:47 INFO Worker: Asked to launch executor app-20220710122231-0000/4 for rapids spark plugin integration tests (python)

Executor app-20220710122231-0000/3 finished with state EXITED message Command exited with code 134 exitStatus 134
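
For readers decoding the exit status above: by the usual shell convention, a status greater than 128 means the process was killed by signal number (status - 128), so 134 corresponds to signal 6, SIGABRT. This matches a JVM that calls abort() after a fatal native error. A minimal Python sketch of that arithmetic follows; the helper name describe_exit_status is made up for illustration and is not part of Spark or the test framework.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper).

    Assumes the common convention that status > 128 means the process
    was terminated by signal number (status - 128).
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with code {status}"


if __name__ == "__main__":
    # 134 is the status reported by the worker log for executor 3.
    print(describe_exit_status(134))
```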

pxLi · 2022-07-11T01:29:04Z

As it fails intermittently, I guess there could be some memory leaks. We also saw this kind of failure in the ub16 test pipeline and the jdk11 test pipelines; I am not sure whether all these failures are related.

pxLi · 2022-07-11T06:10:26Z

Some core dump logs from the UCX nightly test (jdk8):
hs_err_pid31643.log
hs_err_pid31646.log
hs_err_pid31645.log
hs_err_pid31644.log

A core dump log from the ubuntu16 nightly test (jdk8):
hs_err_pid1340.log

Some core dump logs from the jdk11 nightly test:
hs_err_pid12194.log
hs_err_pid12000.log

pxLi · 2022-07-11T08:01:05Z

Since last Friday, we have found more pipelines failing for the same reason when pytest runs in parallel mode (xdist).

res-life · 2022-07-11T10:19:11Z

It seems this is not related to commit #5955.
MemoryCleaner.configuredDefaultShutdownHook is always false in IT because ai.rapids.refcount.debug is not configured there.

ai.rapids.refcount.debug is set to true in UT, see:
https://github.com/NVIDIA/spark-rapids/blob/v22.06.0/pom.xml#L1107

<ai.rapids.refcount.debug>true</ai.rapids.refcount.debug>

Anyway, I triggered a build with this commit reverted on the JDK11-nightly-dev Jenkins pipeline (sequence number 25); will check the result later.

abellina · 2022-07-11T14:06:02Z

@rwlee could this be related in any way to rapidsai/cudf#11153?

jlowe · 2022-07-11T15:50:45Z

The hs_err_pid files are quite consistent, always showing a segfault in libcuda.so.1 after ColumnView.ifElse is called, e.g.:

Stack: [0x00007fe4dd33f000,0x00007fe4dd440000],  sp=0x00007fe4dd43aae8,  free space=1006k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libcuda.so.1+0x1cfd40]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 20169  ai.rapids.cudf.ColumnView.ifElseVV(JJJ)J (0 bytes) @ 0x00007ff7163c86c7 [0x00007ff7163c8680+0x47]
J 20168 C1 ai.rapids.cudf.ColumnView.ifElse(Lai/rapids/cudf/ColumnView;Lai/rapids/cudf/ColumnView;)Lai/rapids/cudf/ColumnVector; (68 bytes) @ 0x00007ff715ca18ec [0x00007ff715ca10a0+0x84c]
J 45670 C1 com.nvidia.spark.rapids.GpuIf.$anonfun$columnarEval$3(Lcom/nvidia/spark/rapids/GpuIf;Ljava/lang/Object;Lcom/nvidia/spark/rapids/GpuColumnVector;Ljava/lang/Object;)Lcom/nvidia/spark/rapids/GpuColumnVector; (460 bytes) @ 0x00007ff71fc03a1c [0x00007ff71fc030c0+0x95c]

Most of the time it's ifElseVV but there was at least one crash with ifElseSV.
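
When comparing several hs_err_pid files like this, a small script can pull out the crashing native library and the top Java frame from each one. Below is a rough Python sketch, not an official tool: the regular expressions only assume the frame layout seen in the excerpt above, and top_frames and SAMPLE are names invented for this illustration.

```python
import re

# Native frame lines look like: "C  [libcuda.so.1+0x1cfd40]"
NATIVE_RE = re.compile(r"^C\s+\[([^\]+]+)\+0x[0-9a-fA-F]+\]")
# Java frame lines look like: "J 20169  pkg.Class.method(JJJ)J ..."
# with an optional compile id and an optional compiler tier such as C1.
JAVA_RE = re.compile(r"^[Jj]\s+(?:\d+\s+)?(?:C\d\s+)?([\w.$]+)\(")


def top_frames(text):
    """Return (first native library, first Java method) found in the text."""
    library = None
    method = None
    for raw in text.splitlines():
        line = raw.strip()
        if library is None:
            m = NATIVE_RE.match(line)
            if m:
                library = m.group(1)
        if method is None:
            m = JAVA_RE.match(line)
            if m:
                method = m.group(1)
    return library, method


# Excerpt copied from the hs_err output quoted in this comment.
SAMPLE = """\
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libcuda.so.1+0x1cfd40]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 20169  ai.rapids.cudf.ColumnView.ifElseVV(JJJ)J (0 bytes) @ 0x00007ff7163c86c7 [0x00007ff7163c8680+0x47]
J 20168 C1 ai.rapids.cudf.ColumnView.ifElse(Lai/rapids/cudf/ColumnView;Lai/rapids/cudf/ColumnView;)Lai/rapids/cudf/ColumnVector; (68 bytes) @ 0x00007ff715ca18ec [0x00007ff715ca10a0+0x84c]
"""

if __name__ == "__main__":
    print(top_frames(SAMPLE))
```

Running this over each hs_err file and grouping by the returned pair makes it quick to check whether every crash shares the same entry point.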

mythrocks · 2022-07-11T18:56:41Z

Hmm. All of these failures are on map lookup. I wonder if there's a problem in lists::index_of(), or maps_column_view.
I believe there were changes in the former, recently.

Edit: These might be unrelated to the crash. The code under test isn't actually looking up the contents of the map column.

jlowe · 2022-07-12T18:39:59Z

None of the integration tests fail on my machine, even after multiple runs. They do however fail for @revans2, and he was able to localize the failure to a single integration test, test_case_when[Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))]. However the test does not fail with gdb attached to the process, and it also does not fail when running with fewer than 3 threads per executor.

We were able to generate a core file from one of the crashes. This appears to be a bug in libcudf that has been there a long time, but I cannot readily explain why it has only started failing recently. See rapidsai/cudf#11248.

pxLi · 2022-07-14T04:20:41Z

Deployed new spark-rapids-jni w/ the fix rapidsai/cudf#11254

Most CI tests should pass as expected now; I will keep monitoring the other pipelines for a few days.

pxLi added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jul 11, 2022

pxLi changed the title ~~[BUG] conditionals_test and hash_aggregate_test failed intermittently in UCX runtime~~ [BUG] conditionals_test and hash_aggregate_test failed intermittently Jul 11, 2022

pxLi changed the title ~~[BUG] conditionals_test and hash_aggregate_test failed intermittently~~ [BUG] executors shutdown intermittently during integrations test parallel run Jul 11, 2022

jlowe mentioned this issue Jul 11, 2022

GPU accelerate Apache Iceberg reads #5941

Merged

jlowe self-assigned this Jul 12, 2022

sameerz added P0 Must have for release and removed ? - Needs Triage Need team to review and classify labels Jul 12, 2022

jlowe mentioned this issue Jul 12, 2022

Enable zstd integration tests for parquet and orc [databricks] #5991

Merged

pxLi closed this as completed Jul 15, 2022
