On task failure catch some CUDA exceptions and kill executor [databricks] #5118
Related to #5029. This is a shorter-term solution that just parses the exception message to catch certain types of unrecoverable CUDA errors. It may not be bulletproof, as the messages could change.
Here, if we find an exception that we think is unrecoverable, we call System.exit to kill the executor.
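A minimal sketch of the idea, assuming a Spark 3.1+ `ExecutorPlugin` with an `onTaskFailed` hook; the class name `FatalCudaWatchdog`, the error list, and the exit code are illustrative, not the PR's actual code:

```scala
import org.apache.spark.{ExceptionFailure, TaskFailedReason}
import org.apache.spark.api.plugin.ExecutorPlugin

// Hypothetical executor plugin: scans task-failure exceptions for fatal
// CUDA error names and kills the executor JVM if one is found.
class FatalCudaWatchdog extends ExecutorPlugin {
  // Illustrative set of CUDA error names treated as unrecoverable;
  // the real plugin may check a different set.
  private val fatalCudaErrors = Seq(
    "cudaErrorHardwareStackError",
    "cudaErrorIllegalInstruction",
    "cudaErrorECCUncorrectable")

  // Walk the cause chain looking for a message naming a fatal CUDA error.
  private def isFatalCudaError(t: Throwable): Boolean =
    t != null && (
      Option(t.getMessage).exists(m => fatalCudaErrors.exists(e => m.contains(e))) ||
        isFatalCudaError(t.getCause))

  // Spark 3.1+ invokes this on the executor for every task failure.
  override def onTaskFailed(failureReason: TaskFailedReason): Unit = failureReason match {
    case ef: ExceptionFailure if ef.exception.exists(isFatalCudaError) =>
      System.err.println(s"Stopping the Executor based on fatal CUDA error: $ef")
      System.exit(1) // exit code is illustrative; this kills the executor JVM
    case _ => // recoverable failure: leave it to Spark's normal task retry
  }
}
```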
Generally you would want to use this with Spark's excludeOnFailure functionality so the executor doesn't get started back up using the same GPU.
I've manually tested this by faking the exception, since we can't reproduce the real error. The plugin properly kills the executor when it sees the exception.
Sample code used to cause failures:
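A minimal sketch of such a trigger, assuming we simply fake the fatal error message inside a task (the object name `FakeCudaFailure` is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical driver app: fakes the fatal error message inside a task,
// since the real hardware failure can't be reproduced on demand.
object FakeCudaFailure {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fake-cuda-failure").getOrCreate()
    spark.sparkContext.parallelize(1 to 10).foreach { _ =>
      // Message matches the pattern the plugin treats as unrecoverable.
      throw new RuntimeException("CUDA error encountered: cudaErrorHardwareStackError")
    }
  }
}
```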
This generates exceptions like `java.lang.RuntimeException: CUDA error encountered: cudaErrorHardwareStackError`. The new executor plugin code catches that exception, logs the following, and then exits:
```
22/03/31 22:46:57 ERROR RapidsExecutorPlugin: Stopping the Executor based on exception being a fatal CUDA error: java.lang.RuntimeException: CUDA error encountered: cudaErrorHardwareStackError
```
In standalone mode with the excludeOnFailure Spark configs set to 1 for node exclusion, when the task fails and this kills the executor, the node will be excluded and the worker will not be able to restart an executor on that node. Also keep in mind the Spark config spark.excludeOnFailure.timeout, which makes Spark retry that node after the timeout expires.
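For example, an illustrative spark-defaults.conf snippet for a standalone-mode test (the values are assumptions for testing, not recommendations from this PR):

```
spark.excludeOnFailure.enabled                                true
spark.excludeOnFailure.stage.maxFailedExecutorsPerNode        1
spark.excludeOnFailure.application.maxFailedExecutorsPerNode  1
# After this timeout Spark will attempt to schedule on the node again.
spark.excludeOnFailure.timeout                                1h
```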
Without excludeOnFailure, the executors just get restarted on the same nodes in standalone mode. I tested on YARN as well; there it will also restart executors, but they could land on different nodes depending on the size of the cluster.