Qualification tool: Error handling while processing large event logs #3714
Conversation
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
So this should not just be for OOMs; I've seen us swallow other exceptions as well. We should at least print when an exception happens, to let the user know.
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
val event = JsonProtocol.sparkEventFromJson(parse(line))
processEvent(event)
try {
  val lines = Source.fromInputStream(in)(Codec.UTF8).getLines().toList
The root cause of the OOM is likely the fact that we materialize the whole file on the heap. Remove `toList` to keep it a line iterator.
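A minimal sketch of the suggested change, assuming the `in` input stream from the snippet above; the `processLines` wrapper and its `processLine` callback are illustrative names, not the tool's actual API:

```scala
import scala.io.{Codec, Source}

// Iterate the event log lazily instead of materializing it on the heap.
// getLines() returns an Iterator[String], so only one line is buffered at a
// time; appending .toList would force the entire file into memory at once.
def processLines(in: java.io.InputStream)(processLine: String => Unit): Unit = {
  val lines = Source.fromInputStream(in)(Codec.UTF8).getLines() // no .toList
  lines.foreach(processLine)
}
```

With this shape, peak memory is bounded by the longest single line rather than the whole file.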
Thanks @gerashegalov for taking a look. I was still seeing the OOM error even after removing `toList`, since we read one event log per thread. I am wrapping the checks at the thread level now.
There are multiple things that can cause an OOM; Gera is just saying this is potentially one of them, depending on the file sizes. We should file a separate follow-up if we want to optimize it further.
Can you post the stack trace? It might give us a hint what to look for.
And I'd get a heap dump with `-XX:+HeapDumpOnOutOfMemoryError`.
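For reference, a hypothetical invocation with the heap-dump flag enabled; the jar name, main class, and paths below are placeholders, not the tool's actual CLI:

```shell
# -XX:+HeapDumpOnOutOfMemoryError makes the JVM write an .hprof file when it
# throws OutOfMemoryError; -XX:HeapDumpPath controls where the file lands.
# Jar, class, and event-log path are illustrative placeholders.
java -Xmx1g \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/tmp/profiler.hprof \
  -cp my-tools.jar \
  com.example.ProfileMain /path/to/eventlog
```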
Below is the stack trace. It looks like it happens while reading the file from the input stream.
Dumping heap to java_pid31797.hprof ...
Heap dump file created [427465673 bytes in 1.749 secs]
21/09/30 15:15:32 ERROR Profiler: OOM error while processing large file file:/home/nartal/CPU_runs/application_1630450374626_0001_1.Increase heap size.
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
at java.lang.StringBuilder.append(StringBuilder.java:190)
at java.io.BufferedReader.readLine(BufferedReader.java:358)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:74)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:189)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:313)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:311)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toList(TraversableOnce.scala:297)
at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:297)
at scala.collection.AbstractIterator.toList(Iterator.scala:1429)
at org.apache.spark.sql.rapids.tool.AppBase.$anonfun$processEvents$4(AppBase.scala:87)
at org.apache.spark.sql.rapids.tool.AppBase$$Lambda$203/903990770.apply(Unknown Source)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2611)
at org.apache.spark.sql.rapids.tool.AppBase.$anonfun$processEvents$2(AppBase.scala:86)
at org.apache.spark.sql.rapids.tool.AppBase$$Lambda$199/136377256.apply(Unknown Source)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.rapids.tool.AppBase.processEvents(AppBase.scala:85)
at org.apache.spark.sql.rapids.tool.profiling.ApplicationInfo.<init>(ApplicationInfo.scala:240)
at com.nvidia.spark.rapids.tool.profiling.Profiler.com$nvidia$spark$rapids$tool$profiling$Profiler$$createApp(Profiler.scala:248)
at com.nvidia.spark.rapids.tool.profiling.Profiler$ProfileProcessThread$1.run(Profiler.scala:205)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
I got the heap dump by including the argument you specified. I opened it with jhat, but there's too much info in there. Could you please give me some pointers on where I should be looking?
It looks like you still have `toList` in there.
VisualVM (part of the JDK) and Eclipse MAT are more user-friendly for analyzing the heap dump. You want to look for objects with a large "retained" heap size.
Thanks Gera. Filed a follow-on issue to improve memory consumption: https://github.com/NVIDIA/spark-rapids/issues/3727
This PR is mostly to identify the OOM and throw a meaningful error so that users can increase the heap size.
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
build
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
build
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
build
LGTM
    sys.exit(1)
  case NonFatal(e) =>
    logWarning(s"Exception occurred processing file: ${path.eventLog.getName}", e)
  case o =>
nit: there is no warning in the function as it stands, but if we ever move this back inside a catch block we will get a warning like #3743.
Suggested change:
-  case o =>
+  case o: Throwable =>
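Putting the discussion together, the handler ends up distinguishing OOM from other failures. A self-contained sketch under the assumption that this runs at the thread level; `logError`/`logWarning` below are stand-ins for Spark's Logging trait methods, and returning a flag replaces the `sys.exit(1)` call so the behavior is testable:

```scala
import scala.util.control.NonFatal

// Stand-ins for Spark's Logging trait methods (assumed, for a runnable sketch).
def logError(msg: String): Unit = System.err.println(s"ERROR: $msg")
def logWarning(msg: String): Unit = System.err.println(s"WARN: $msg")

// Returns true when the failure is fatal and the caller should exit(1).
def handleFailure(eventLogName: String, t: Throwable): Boolean = t match {
  case _: OutOfMemoryError =>
    // Give the user an actionable hint instead of failing silently.
    logError(s"OOM error while processing large file $eventLogName. Increase heap size.")
    true
  case NonFatal(e) =>
    logWarning(s"Exception occurred processing file: $eventLogName: ${e.getMessage}")
    false
  case o: Throwable =>
    // Typing the catch-all as Throwable avoids the warning noted in #3743
    // if this code is ever moved back inside a catch block.
    logError(s"Error occurred processing file: $eventLogName: ${o.getMessage}")
    true
}
```

The `OutOfMemoryError` case must come before `NonFatal`, since `NonFatal` deliberately does not match fatal errors like OOM.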
Thanks! Updated it.
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
build
build
build
build
It's failing due to a timeout error.
The tests are failing with known issue #3742; need to wait for a fix there.
build
build
This fixes #3430.
In this PR, we catch OutOfMemoryError and log it with a hint to the user to increase the heap size. Earlier, the tool would complete without any log.
Not sure how to add tests for this; I ran it locally with a small heap size and saw the logError output on the console.