[BUG] executors shutdown intermittently during integrations test parallel run #5979

Closed
pxLi opened this issue Jul 11, 2022 · 10 comments
Labels: bug (Something isn't working), P0 (Must have for release)

@pxLi (Collaborator) commented Jul 11, 2022

Describe the bug

[2022-07-09T16:56:04.988Z] FAILED ../../src/main/python/conditionals_test.py::test_if_else_map[Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))]
[2022-07-09T16:56:04.988Z] FAILED ../../src/main/python/conditionals_test.py::test_case_when[Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))]
[2022-07-10T17:27:40.474Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_groupby_first_last[Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))][IGNORE_ORDER({'local': True})]

The cases above started failing intermittently in multiple pipelines since last Friday.

Executors got SIGABRT during the pytest run. Detailed pytest logs:

=================================== FAILURES ===================================
_ test_if_else_map[Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))] _

data_gen = Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))

    @pytest.mark.parametrize('data_gen', map_gens_sample, ids=idfn)
    def test_if_else_map(data_gen):
>       assert_gpu_and_cpu_are_equal_collect(
            lambda spark : three_col_df(spark, boolean_gen, data_gen, data_gen).selectExpr(
                'IF(TRUE, b, c)',
                'IF(a, b, c)'))

../../src/main/python/conditionals_test.py:65:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../src/main/python/asserts.py:508: in assert_gpu_and_cpu_are_equal_collect
    _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
../../src/main/python/asserts.py:428: in _assert_gpu_and_cpu_are_equal
    run_on_gpu()
../../src/main/python/asserts.py:422: in run_on_gpu
    from_gpu = with_gpu_session(bring_back, conf=conf)
../../src/main/python/spark_session.py:131: in with_gpu_session
    return with_spark_session(func, conf=copy)
../../src/main/python/spark_session.py:98: in with_spark_session
    ret = func(_spark)
../../src/main/python/asserts.py:201: in <lambda>
    bring_back = lambda spark: limit_func(spark).collect()
/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/dataframe.py:677: in collect
    sock_info = self._jdf.collectToPython()
/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py:1304: in __call__
    return_value = get_return_value(
/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py:111: in deco
    return f(*a, **kw)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

answer = 'xro940240'
gateway_client = <py4j.java_gateway.GatewayClient object at 0x7f98f8a6d850>
target_id = 'o940239', name = 'collectToPython'

    def get_return_value(answer, gateway_client, target_id=None, name=None):
        """Converts an answer received from the Java gateway into a Python object.

        For example, string representation of integers are converted to Python
        integer, string representation of objects are converted to JavaObject
        instances, etc.

        :param answer: the string returned by the Java gateway
        :param gateway_client: the gateway client used to communicate with the Java
            Gateway. Only necessary if the answer is a reference (e.g., object,
            list, map)
        :param target_id: the name of the object from which the answer comes from
            (e.g., *object1* in `object1.hello()`). Optional.
        :param name: the name of the member from which the answer comes from
            (e.g., *hello* in `object1.hello()`). Optional.
        """
        if is_error(answer)[0]:
            if len(answer) > 1:
                type = answer[1]
                value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
                if answer[1] == REFERENCE_TYPE:
>                   raise Py4JJavaError(
                        "An error occurred while calling {0}{1}{2}.\n".
                        format(target_id, ".", name), value)
E                   py4j.protocol.Py4JJavaError: An error occurred while calling o940239.collectToPython.
E                   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 9597.0 failed 1 times, most recent failure: Lost task 5.0 in stage 9597.0 (TID 318739) (10.136.6.4 executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
E                   Driver stacktrace:
E                   	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
E                   	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
E                   	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
E                   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
E                   	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
E                   	at scala.Option.foreach(Option.scala:407)
E                   	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
E                   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
E                   	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
E                   	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
E                   	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
E                   	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
E                   	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390)
E                   	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3519)
E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
E                   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
E                   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
E                   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
E                   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
E                   	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3516)
E                   	at sun.reflect.GeneratedMethodAccessor86.invoke(Unknown Source)
E                   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E                   	at java.lang.reflect.Method.invoke(Method.java:498)
E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
E                   	at py4j.Gateway.invoke(Gateway.java:282)
E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
E                   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
E                   	at java.lang.Thread.run(Thread.java:748)

/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py:326: Py4JJavaError
----------------------------- Captured stdout call -----------------------------
### CPU RUN ###
### GPU RUN ###
_ test_case_when[Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))] _

data_gen = Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))

    @pytest.mark.order(1) # at the head of xdist worker queue if pytest-order is installed
    @pytest.mark.parametrize('data_gen', all_gens + all_nested_gens, ids=idfn)
    def test_case_when(data_gen):
        num_cmps = 20
        s1 = gen_scalar(data_gen, force_no_nulls=not isinstance(data_gen, NullGen))
        # we want lots of false
        bool_gen = BooleanGen().with_special_case(False, weight=1000.0)
        gen_cols = [('_b' + str(x), bool_gen) for x in range(0, num_cmps)]
        gen_cols = gen_cols + [('_c' + str(x), data_gen) for x in range(0, num_cmps)]
        gen = StructGen(gen_cols, nullable=False)
        command = f.when(f.col('_b0'), f.col('_c0'))
        for x in range(1, num_cmps):
            command = command.when(f.col('_b'+ str(x)), f.col('_c' + str(x)))
        command = command.otherwise(s1)
        data_type = data_gen.data_type
        # `command` covers the case of (column, scalar) for values, so the following 3 ones
        # are for
        #    (scalar, scalar)  -> the default `otherwise` is a scalar.
        #    (column, column)
        #    (scalar, column)
        # in sequence.
>       assert_gpu_and_cpu_are_equal_collect(
            lambda spark : gen_df(spark, gen).select(command,
                f.when(f.col('_b0'), s1),
                f.when(f.col('_b0'), f.col('_c0')).otherwise(f.col('_c1')),
                f.when(f.col('_b0'), s1).otherwise(f.col('_c0')),
                f.when(f.col('_b0'), s1).when(f.lit(False), f.col('_c0')),
                f.when(f.col('_b0'), s1).when(f.lit(True), f.col('_c0')),
                f.when(f.col('_b0'), f.lit(None).cast(data_type)).otherwise(f.col('_c0')),
                f.when(f.lit(False), f.col('_c0'))))

../../src/main/python/conditionals_test.py:91:
[... same asserts.py/pyspark/py4j frames as above, with answer = 'xro993903' and target_id = 'o993902' ...]
E                   py4j.protocol.Py4JJavaError: An error occurred while calling o993902.collectToPython.
E                   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in stage 9699.0 failed 1 times, most recent failure: Lost task 38.0 in stage 9699.0 (TID 323622) (10.136.6.4 executor 7): ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
E                   Driver stacktrace:
[... identical driver stack trace as above ...]

/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py:326: Py4JJavaError
----------------------------- Captured stdout call -----------------------------
### CPU RUN ###
### GPU RUN ###
_ test_groupby_first_last[Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))] _

data_gen = Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))

    @ignore_order(local=True)
    @pytest.mark.parametrize('data_gen', all_gen + _nested_gens, ids=idfn)
    def test_groupby_first_last(data_gen):
        gen_fn = [('a', RepeatSeqGen(LongGen(), length=20)), ('b', data_gen)]
        agg_fn = lambda df: df.groupBy('a').agg(
            f.first('b'), f.last('b'), f.first('b', True), f.last('b', True))
>       assert_gpu_and_cpu_are_equal_collect(
            # First and last are not deterministic when they are run in a real distributed setup.
            # We set parallelism 1 to prevent nondeterministic results because of distributed setup.
            lambda spark: agg_fn(gen_df(spark, gen_fn, num_slices=1)))

../../src/main/python/hash_aggregate_test.py:1102:
[... same asserts.py/pyspark/py4j frames as above, with answer = 'xro1797093' and target_id = 'o1797092' ...]
E                   py4j.protocol.Py4JJavaError: An error occurred while calling o1797092.collectToPython.
E                   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23149.0 failed 1 times, most recent failure: Lost task 0.0 in stage 23149.0 (TID 591372) (10.136.6.4 executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
E                   Driver stacktrace:
[... identical driver stack trace as above ...]

/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py:326: Py4JJavaError
pxLi added the labels bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) on Jul 11, 2022
@pxLi (Collaborator, Author) commented Jul 11, 2022

There were no useful logs on the executor side. From the worker log:

22/07/10 12:22:28 INFO TransportClientFactory: Successfully created connection to /127.0.0.1:7077 after 34 ms (0 ms spent in bootstraps)
22/07/10 12:22:28 INFO Worker: Successfully registered with master spark://127.0.0.1:7077
22/07/10 12:22:31 INFO Worker: Asked to launch executor app-20220710122231-0000/0 for rapids spark plugin integration tests (python)
22/07/10 12:22:31 INFO Worker: Asked to launch executor app-20220710122231-0000/1 for rapids spark plugin integration tests (python)
22/07/10 12:22:31 INFO Worker: Asked to launch executor app-20220710122231-0000/2 for rapids spark plugin integration tests (python)
22/07/10 12:22:31 INFO Worker: Asked to launch executor app-20220710122231-0000/3 for rapids spark plugin integration tests (python)
22/07/10 12:22:31 INFO SecurityManager: Changing view acls to: jenkins
22/07/10 12:22:31 INFO SecurityManager: Changing modify acls to: jenkins
22/07/10 12:22:31 INFO SecurityManager: Changing view acls groups to:
22/07/10 12:22:31 INFO SecurityManager: Changing modify acls groups to:
22/07/10 12:22:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(jenkins); groups with view permissions: Set(); users  with modify permissions: Set(jenkins); groups with modify permissions: Set()
[... the same SecurityManager lines repeat, interleaved, for each of the four executors ...]
22/07/10 12:22:31 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java" "-cp" "/var/lib/jenkins/workspace/rapids_it-UCX-egx06-standalone/jars/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar:/var/lib/jenkins/workspace/rapids_it-UCX-egx06-standalone/jars/rapids-4-spark-integration-tests_2.12-22.08.0-SNAPSHOT-spark312.jar:/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/conf/:/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/jars/*" "-Xmx71680M" "-Dspark.driver.port=46397" "-ea" "-Duser.timezone=UTC" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@10.136.6.4:46397" "--executor-id" "0" "--hostname" "10.136.6.4" "--cores" "12" "--app-id" "app-20220710122231-0000" "--worker-url" "spark://Worker@10.136.6.4:36085"
[... identical ExecutorRunner launch commands for executor ids 1, 2, and 3 elided ...]
22/07/10 13:03:47 INFO Worker: Executor app-20220710122231-0000/3 finished with state EXITED message Command exited with code 134 exitStatus 134
22/07/10 13:03:47 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 3
22/07/10 13:03:47 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20220710122231-0000, execId=3)
22/07/10 13:03:47 INFO Worker: Asked to launch executor app-20220710122231-0000/4 for rapids spark plugin integration tests (python)

"Executor app-20220710122231-0000/3 finished with state EXITED message Command exited with code 134 exitStatus 134" (exit status 134 is 128 + 6, i.e. the executor JVM died from SIGABRT).
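
A quick way to decode such an exit status (an illustrative snippet, not from the issue; shells and process managers report death-by-signal as 128 plus the signal number):

import signal

exit_status = 134
if exit_status > 128:
    # 134 - 128 = 6, and signal 6 is SIGABRT
    print(signal.Signals(exit_status - 128).name)  # prints SIGABRT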

@pxLi (Collaborator, Author) commented Jul 11, 2022

Since it fails intermittently, I suspect there could be a memory leak. We also saw this kind of failure in the ubuntu16 and jdk11 test pipelines; I am not sure whether all of these failures are related.

@pxLi (Collaborator, Author) commented Jul 11, 2022

Some core dump logs from the UCX nightly test (jdk8):
hs_err_pid31643.log
hs_err_pid31646.log
hs_err_pid31645.log
hs_err_pid31644.log

A core dump log from the ubuntu16 nightly test (jdk8):
hs_err_pid1340.log

Some core dump logs from the jdk11 nightly test:
hs_err_pid12194.log
hs_err_pid12000.log

pxLi changed the title from "[BUG] conditionals_test and hash_aggregate_test failed intermittently in UCX runtime" to "[BUG] conditionals_test and hash_aggregate_test failed intermittently" on Jul 11, 2022
pxLi changed the title from "[BUG] conditionals_test and hash_aggregate_test failed intermittently" to "[BUG] executors shutdown intermittently during integrations test parallel run" on Jul 11, 2022
@pxLi (Collaborator, Author) commented Jul 11, 2022

We have found more pipelines failing for the same reason since last Friday whenever pytest runs in parallel mode (xdist).

@res-life (Collaborator) commented:

It seems unrelated to commit #5955.
MemoryCleaner.configuredDefaultShutdownHook is always false in the integration tests because ai.rapids.refcount.debug is not configured there.

ai.rapids.refcount.debug is set to true in the unit tests, see:
https://github.com/NVIDIA/spark-rapids/blob/v22.06.0/pom.xml#L1107

<ai.rapids.refcount.debug>true</ai.rapids.refcount.debug>

Anyway, I triggered a build with this commit reverted on the JDK11-nightly-dev Jenkins pipeline (seq num 25) and will check the result later.
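
For context, a minimal sketch of how that JVM property could be turned on for executors (the property name comes from the comment above; spark.executor.extraJavaOptions is a standard Spark config key, but this exact wiring is an assumption, not how the CI pipelines are configured):

from pyspark.sql import SparkSession

# Hypothetical: pass the leak-debug flag to executor JVMs so that
# MemoryCleaner.configuredDefaultShutdownHook would also be true in
# the integration tests.
spark = (SparkSession.builder
         .config('spark.executor.extraJavaOptions',
                 '-Dai.rapids.refcount.debug=true')
         .getOrCreate())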

@abellina (Collaborator) commented:

@rwlee could this be related in any way to rapidsai/cudf#11153?

@jlowe (Member) commented Jul 11, 2022

The hs_err_pid files are quite consistent, always showing a segfault in libcuda.so.1 after ColumnView.ifElse is called, e.g.:

Stack: [0x00007fe4dd33f000,0x00007fe4dd440000],  sp=0x00007fe4dd43aae8,  free space=1006k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libcuda.so.1+0x1cfd40]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 20169  ai.rapids.cudf.ColumnView.ifElseVV(JJJ)J (0 bytes) @ 0x00007ff7163c86c7 [0x00007ff7163c8680+0x47]
J 20168 C1 ai.rapids.cudf.ColumnView.ifElse(Lai/rapids/cudf/ColumnView;Lai/rapids/cudf/ColumnView;)Lai/rapids/cudf/ColumnVector; (68 bytes) @ 0x00007ff715ca18ec [0x00007ff715ca10a0+0x84c]
J 45670 C1 com.nvidia.spark.rapids.GpuIf.$anonfun$columnarEval$3(Lcom/nvidia/spark/rapids/GpuIf;Ljava/lang/Object;Lcom/nvidia/spark/rapids/GpuColumnVector;Ljava/lang/Object;)Lcom/nvidia/spark/rapids/GpuColumnVector; (460 bytes) @ 0x00007ff71fc03a1c [0x00007ff71fc030c0+0x95c]

Most of the time it's ifElseVV but there was at least one crash with ifElseSV.
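
For reference, the plugin reaches this native code through Spark's IF/CASE WHEN expressions. A minimal sketch of the two shapes (df, a, b, c, and x are illustrative names; the mapping of scalar operands to ifElseSV is an inference from the method names, not confirmed in this issue):

from pyspark.sql import functions as f

# Column operands for both branches land in the vector/vector call (ifElseVV):
df.selectExpr('IF(a, b, c)')

# A scalar 'then' value, as in test_case_when's f.when(f.col('_b0'), s1),
# presumably lands in a scalar variant such as ifElseSV:
df.select(f.when(f.col('a'), f.lit(42)).otherwise(f.col('x')))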

@mythrocks (Collaborator) commented Jul 11, 2022

Hmm. All of these failures are on map lookups. I wonder if there's a problem in lists::index_of() or maps_column_view; I believe there were recent changes in the former.

Edit: these might be unrelated to the crash. The code under test isn't actually looking up the contents of the map column.

jlowe self-assigned this on Jul 12, 2022
@jlowe (Member) commented Jul 12, 2022

None of the integration tests fail on my machine, even after multiple runs. They do, however, fail for @revans2, and he was able to localize the failure to a single integration test: test_case_when[Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))]. However, the test does not fail with gdb attached to the process, and it also does not fail when running with fewer than 3 threads per executor.

We were able to generate a core file from one of the crashes. This appears to be a bug in libcudf that has been there a long time, but I cannot readily explain why it has only started failing recently. See rapidsai/cudf#11248.
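
A rough sketch of the kind of stress loop that can chase an intermittent failure like this (the pytest-xdist worker count and repeat count are arbitrary choices; the CI pipelines drive these tests through their own wrapper scripts):

import subprocess

TEST = ("src/main/python/conditionals_test.py::test_case_when"
        "[Map(Short(not_null),Struct(['child0', Byte],['child1', Double]))]")

# Re-run the single failing test repeatedly under pytest-xdist; an
# intermittent executor abort eventually surfaces as a non-zero exit code.
for attempt in range(50):
    result = subprocess.run(['pytest', '-n', '4', TEST])
    if result.returncode != 0:
        print(f'failed on attempt {attempt}')
        break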

sameerz added the P0 (Must have for release) label and removed the ? - Needs Triage (Need team to review and classify) label on Jul 12, 2022
@pxLi (Collaborator, Author) commented Jul 14, 2022

Deployed a new spark-rapids-jni with the fix rapidsai/cudf#11254.

Most CI tests should pass as expected now; I will keep monitoring the other pipelines for a few days.

pxLi closed this as completed on Jul 15, 2022