[FEA] Generate Status Report for Profiling Tool #1012

cindyyuanjiang · 2024-05-13T23:03:36Z

Fixes #998

Changes

Generate a status report for event logs processed by the Profiling Tool
Refactor common data structures and functions used by the Q/P tools

Example Output

spark_rapids profiling --eventlogs <my_event_logs> --tools_jar <my_tools_jar> --verbose

In file rapids_4_spark_profile/profiling_status.csv:

Event Log,Status,Description
file:/xxxxxxxx/photon_log,SUCCESS,app-xxxxxxxx, Took 29084ms to process
file:/xxxxxxxx/eventlog_no_appinfo,, IncorrectAppStatusException: Application status is incorrect. Missing AppInfo
file:/xxxxxxxx/gpu_log,SUCCESS,local-xxxxxxxx, Took 1089ms to process
file:/xxxxxxxx/cpu_log,SUCCESS,application_xxxxxxxx, Took 1850ms to process
file:/xxxxxxxx/structured_streaming_log,SKIPPED,StreamingEventLogException: Encountered Spark Structured Streaming Job: skipping this file!

Testing
Created a directory sample_eventlogs, which contains 1 CPU event log, 1 GPU event log, 1 Photon event log, 1 event log with Structured Streaming, and 1 event log without ApplicationInfo. We will use these event logs for all following tests.

Profiling jar tool: test with combined, collection, and compare modes and verify profiling_status.csv is the same
Profiling jar tool: verify GPU event log is successful (not skipped)
Profiling jar tool: verify Photon and Structured Streaming event logs
Qualification jar tool: verify profiling_status.csv is the same before and after this PR

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

amahussein

Thanks @cindyyuanjiang
I have some quick notes regarding testing because Profiling had some special cases to handle:

Generally speaking, the CLI does not produce enough coverage for the changes in this PR. This change needs to be tested using the jar command because we need to verify the different types of Profiling modes. See the different modes (combined, collection, and compare) in the Profiling jar documentation. For all the Profiling modes, the profiling_status.csv should be the same (also use multiple eventlogs).
Test the Profiling jar cmd with CPU/GPU eventlogs to verify that the app is not skipped, unlike in Qualification.
Test the Profiling jar cmd with Photon and Streamed Applications to confirm the behavior.
Run the Qualification jar cmd on the same eventlogs (GPU/CPU/Photon/etc.) to verify that the Qualification output has not changed by moving the code into common methods.
Finally, add a unit test to verify the generated results are correct.

amahussein · 2024-05-14T15:12:58Z

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/Profiler.scala

+    // Write status reports for all event logs
+    val profileOutputWriter = new ProfileOutputWriter(outputDir, Profiler.PROFILE_LOG_NAME,
+      numOutputRows, outputCSV = outputCSV)
+    val reportResults = generateStatusProfResults(appStatusReporter.asScala.values.toSeq)
+    profileOutputWriter.write("Profiling Status", reportResults)


There is a tricky difference between Qualification and Profiler.
For instance, Profiler generates a profile.log that contains text-formatted data for each table.
Therefore, the above code will create two files: profile.log and profiling_status.csv. That will cause confusion about the purpose of those files, because their names overlap with files in the apps' subdirectories. We need only the CSV file.

Removed the profile.log generation by separating writeCSVTable into its own function.

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

amahussein · 2024-05-15T21:00:15Z

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala

+case class StatusProfileResult(
+    path: String,
+    status: String,
+    message: String = "") extends ProfileResult {
+    override val outputHeaders: Seq[String] = Seq("Event Log", "Status", "Description")
+
+    override def convertToSeq: Seq[String] = {
+      Seq(path, status, message)
+    }
+
+    override def convertToCSVSeq: Seq[String] = {
+      Seq(path, status, message)
+    }
+  }
+


I like that you tried to do some refactor in this PR to make the code reusable.

If we define a new StatusProfileResult, then we have two classes

QualificationAppInfo.StatusSummaryInfo: which has extra fields such as appID

ProfileClassWarehouse.StatusProfileResult

One of the above should go away because they do the same thing.
So, we should add the extra field appID to StatusProfileResult, then replace the usage of StatusSummaryInfo with StatusProfileResult.

Thanks! Added appID in StatusProfileResult and removed StatusSummaryInfo

amahussein · 2024-05-15T21:02:45Z

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileOutputWriter.scala

+   * Write a CSV file give the input header and data.
+   */
+  def writeCSVTable(header: String, outRows: Seq[ProfileResult],
+      outputCSV: Boolean = true, outputDir: String): Unit = {


The outputCSV: Boolean = true should be removed as an argument from this method.
The method's name is writeCSVTable, which makes the outputCSV argument redundant.
The method should simply write the CSV without checks. The caller is responsible for doing that check.

removed outputCSV from writeCSVTable

amahussein · 2024-05-15T21:04:50Z

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileOutputWriter.scala

+  def write(headerText: String, outRows: Seq[ProfileResult],
+      emptyTableText: Option[String] = None, tableDesc: Option[String] = None): Unit = {
+    writeTextTable(headerText, outRows, emptyTableText, tableDesc)
+    ProfileOutputWriter.writeCSVTable(headerText, outRows, outputCSV, outputDir)


This should check outputCSV before calling writeCSVTable, and writeCSVTable should not take an outputCSV argument:

if (outputCSV) { ProfileOutputWriter.writeCSVTable(headerText, outRows, outputDir) }

added this check before calling ProfileOutputWriter.writeCSVTable
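
The division of responsibility discussed in this thread can be sketched as follows. This is a minimal, self-contained illustration and not the code from the PR: SimpleResult, StatusRow, SimpleCsvWriter, and SimpleOutputWriter are hypothetical stand-ins, and deriving the file name from the header is an assumption. The point it shows is only that the CSV function writes unconditionally while the caller decides whether to call it.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

// Hypothetical, simplified stand-in for ProfileResult.
trait SimpleResult {
  def outputHeaders: Seq[String]
  def convertToCSVSeq: Seq[String]
}

// Hypothetical row type used only for this illustration.
case class StatusRow(path: String, status: String, message: String) extends SimpleResult {
  override val outputHeaders: Seq[String] = Seq("Event Log", "Status", "Description")
  override def convertToCSVSeq: Seq[String] = Seq(path, status, message)
}

object SimpleCsvWriter {
  // Writes unconditionally: there is no outputCSV flag here.
  // Deriving the file name from the header is an assumption of this sketch.
  def writeCSVTable(header: String, rows: Seq[SimpleResult], outputDir: String): Path = {
    val fileName = header.toLowerCase.replace(' ', '_') + ".csv"
    val headerLine = rows.headOption.map(_.outputHeaders.mkString(",")).toList
    val lines = headerLine ++ rows.map(_.convertToCSVSeq.mkString(","))
    Files.write(Paths.get(outputDir, fileName),
      lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
  }
}

class SimpleOutputWriter(outputDir: String, outputCSV: Boolean) {
  // The caller owns the decision; returns the written path, if any.
  def write(header: String, rows: Seq[SimpleResult]): Option[Path] = {
    if (outputCSV) Some(SimpleCsvWriter.writeCSVTable(header, rows, outputDir)) else None
  }
}
```

With this split, a caller that always wants the CSV (such as a status report) can call the writer function directly without constructing a flag.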

amahussein · 2024-05-15T21:07:28Z

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/Profiler.scala

+  /**
+   * For each app status report, generate a StatusProfileResult.
+   * @return Seq[StatusProfileResult] - Seq[(path, status, description)]
+   */
+  private def generateStatusProfResults(appStatuses: Seq[AppResult]): Seq[StatusProfileResult] = {
+    appStatuses.map {
+      case FailureAppResult(path, message) => StatusProfileResult(path, "FAILURE", message)
+      case SkippedAppResult(path, message) => StatusProfileResult(path, "SKIPPED", message)
+      case SuccessAppResult(path, _, message) => StatusProfileResult(path, "SUCCESS", message)
+      case UnknownAppResult(path, _, message) => StatusProfileResult(path, "UNKNOWN", message)
+      case profAppResult: AppResult =>
+        throw new UnsupportedOperationException(s"Invalid status for $profAppResult")
+    }
+  }
+


Once you use the same StatusProfileResult class for both Q/P tools, this code can be shared by the Qualification and Profiling tools.

Reused generateStatusProfResults for both Q/P tools; moved the definition to Profiler.generateStatusProfResults.
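
As a rough illustration of sharing the mapping between both tools, the sketch below places it in a trait that two objects mix in. All names here (StatusLine, StatusReporting, ProfilerSketch, QualificationSketch) are hypothetical, and the result types are simplified versions of the ones shown in the diff. It also assumes the result hierarchy is sealed, which lets the compiler check exhaustiveness so the catch-all case is not needed; whether the real AppResult is sealed is not stated in this thread.

```scala
// Simplified, hypothetical versions of the result types in this PR.
sealed trait AppResult {
  def path: String
  def message: String
}
case class SuccessAppResult(path: String, appId: String, message: String) extends AppResult
case class FailureAppResult(path: String, message: String) extends AppResult
case class SkippedAppResult(path: String, message: String) extends AppResult
case class UnknownAppResult(path: String, appId: String, message: String) extends AppResult

// Hypothetical output row: (path, status, description).
case class StatusLine(path: String, status: String, message: String)

// Hypothetical shared trait; both tools would mix it in.
trait StatusReporting {
  def generateStatusResults(appStatuses: Seq[AppResult]): Seq[StatusLine] = {
    appStatuses.map {
      case FailureAppResult(path, message) => StatusLine(path, "FAILURE", message)
      case SkippedAppResult(path, message) => StatusLine(path, "SKIPPED", message)
      case SuccessAppResult(path, _, message) => StatusLine(path, "SUCCESS", message)
      case UnknownAppResult(path, _, message) => StatusLine(path, "UNKNOWN", message)
    }
  }
}

object ProfilerSketch extends StatusReporting
object QualificationSketch extends StatusReporting
```

Both objects then produce identical rows for the same inputs, which is the property the testing notes in the description ask to verify.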

amahussein · 2024-05-15T21:10:08Z

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/Profiler.scala

      case oom: OutOfMemoryError =>
-        logError(s"OOM error while processing large file: ${path.eventLog.toString}." +
-            s" Increase heap size. Exiting ...", oom)
+        logError(s"OOM error while processing large file: $pathStr." +
+            s"Increase heap size.", oom)
        sys.exit(1)
-      case NonFatal(e) =>
-        logWarning(s"Exception occurred processing file: ${path.eventLog.getName}", e)
-      case o: Throwable =>
-        logError(s"Error occurred while processing file: ${path.eventLog.toString}. Exiting ...", o)
+      case o: Error =>
+        logError(s"Error occurred while processing file: $pathStr", o)
        sys.exit(1)
+      case e: Exception =>
+        progressBar.foreach(_.reportFailedProcess())
+        val failureAppResult = FailureAppResult(pathStr,
+          s"Unexpected exception processing log, skipping!")
+        failureAppResult.logMessage(Some(e))
+        appStatusReporter.put(pathStr, failureAppResult)


Can we fix some things in this block?
I would prefer that it matches on NonFatal to create the new FailureAppResult object. Anything else should cause an exit. I am not sure which cases can produce an Error.
Also, this code is pretty much the same as what we have in Qualification.

Also, in the future we need to be careful about the exit part when we merge the profiling tool into the qual tool. It should not exit the qual tool process.

thanks! I updated this block. It is similar to Qualification. We can try to refactor something common out. I also want to get this correct as a starting point.
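
For reference, the NonFatal-based handling suggested above can be sketched like this. It is an illustration under stated assumptions rather than the PR's final code: ErrorHandlingSketch, processOne, and the string-valued status map are hypothetical, and fatal errors are simply allowed to propagate here instead of calling sys.exit. scala.util.control.NonFatal does not match VirtualMachineError (which includes OutOfMemoryError), InterruptedException, LinkageError, or ControlThrowable, so those fall through to the caller.

```scala
import scala.collection.mutable
import scala.util.control.NonFatal

object ErrorHandlingSketch {
  // Hypothetical status store keyed by event log path.
  val statuses: mutable.Map[String, String] = mutable.LinkedHashMap.empty[String, String]

  // Runs `process` for one event log. Recoverable failures are recorded and
  // swallowed so the remaining logs still run; fatal errors propagate.
  def processOne(pathStr: String)(process: => Unit): Unit = {
    try {
      process
      statuses.put(pathStr, "SUCCESS")
    } catch {
      case NonFatal(e) =>
        statuses.put(pathStr, s"FAILURE: ${e.getClass.getSimpleName}: ${e.getMessage}")
    }
  }
}
```

Letting fatal errors propagate, rather than exiting inside the handler, leaves the exit decision to the outermost caller, which is one way to address the concern about not terminating the qual tool process.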

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

parthosa · 2024-05-16T22:38:09Z

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/Profiler.scala

+      case SkippedAppResult(path, message) => StatusProfileResult(path, "SKIPPED", message)
+      case SuccessAppResult(path, _, message) => StatusProfileResult(path, "SUCCESS", message)
+      case UnknownAppResult(path, _, message) => StatusProfileResult(path, "UNKNOWN", message)


We are storing appId in SuccessAppResult and UnknownAppResult. Can we include the appIds in the CSV as well? This helps in tracing an appId back to its eventlog and vice versa.

thanks @parthosa! I will add the appId into message. This is consistent with the qualification output.

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

cindyyuanjiang · 2024-05-17T01:01:35Z

Thanks @amahussein and @parthosa! I addressed all review feedback. I will work on further testing which is outlined in the description.

amahussein · 2024-05-17T14:36:52Z

core/src/test/scala/com/nvidia/spark/rapids/tool/profiling/ApplicationInfoSuite.scala

@@ -751,6 +751,27 @@ class ApplicationInfoSuite extends FunSuite with Logging {
    assert(execInfo.head.maxMem === 5538054144L)
  }

+  test("test malformed json eventlog") {


I suggest a different name that reflects the real purpose of this unit test. The current string may cause someone to think that the unit test only tests against malformed eventlogs.
Also, add a comment explaining the unit-test logic and what it tests.

Thanks @amahussein! Updated this.

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

parthosa

Thanks @cindyyuanjiang. It would be nice to have a generic name like StatusAppResult or AppStatusResult instead of StatusProfileResult, since we use this case class in both tools.

I think we can leave the writer classes separate as they require other parameters as well.

parthosa · 2024-05-20T17:13:26Z

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/Profiler.scala

@@ -595,4 +574,23 @@ object Profiler {
      propStr + s"\nComments:\n$commentsToStr\n"
    }
  }
+
+  /**


Should this function live in a common trait, since it is now used by both Profiler and Qualification?

Thanks @parthosa! Moving this function to common trait RuntimeReporter

parthosa · 2024-05-20T17:16:41Z

core/src/main/scala/com/nvidia/spark/rapids/tool/qualification/QualOutputWriter.scala

@@ -338,7 +339,7 @@ class QualOutputWriter(outputDir: String, reportReadSchema: Boolean,
    }
  }

-  def writeStatusReport(statusReports: Seq[StatusSummaryInfo], order: String): Unit = {
+  def writeStatusReport(statusReports: Seq[StatusProfileResult], order: String): Unit = {


Should we make StatusProfileResult generic, since it is used by both tools?

That sounds reasonable.
If moving it does not require too many changes, then it would be good to do it in this PR.
Otherwise, we can keep it the way it is.

I renamed StatusProfileResult to the more generic AppStatusResult. AppStatusResult is defined in ProfileClassWarehouse.scala. I think it is okay to keep the definition there because the file documentation says "This is a warehouse to store all Classes used for profiling and qualification."

amahussein

Thanks @cindyyuanjiang
I was playing around with the changes and I noticed that Skipped rows will have an empty AppID.
It looks like this PR uses FailureApp to create a skipped app, which does not support an AppID.
IMHO, it is cleaner to have a defined value for a skipped app, but I am fine with that provided that the code docs explain somewhere what to expect from those fields.

amahussein · 2024-05-20T18:27:55Z

core/src/main/scala/com/nvidia/spark/rapids/tool/qualification/QualOutputWriter.scala

@@ -338,7 +339,7 @@ class QualOutputWriter(outputDir: String, reportReadSchema: Boolean,
    }
  }

-  def writeStatusReport(statusReports: Seq[StatusSummaryInfo], order: String): Unit = {
+  def writeStatusReport(statusReports: Seq[StatusProfileResult], order: String): Unit = {


That sounds reasonable.
If moving it does not require too many changes, then it would be good to do it in this PR.
Otherwise, we can keep it the way it is.

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

cindyyuanjiang · 2024-05-20T21:56:13Z

It looks like this PR uses FailureApp to create a skipped app, which does not support an AppID.
IMHO, it is cleaner to have a defined value for a skipped app, but I am fine with that provided that the code docs explain somewhere what to expect from those fields.

Thanks @amahussein! Yes, both the P and Q tools use FailureApp, which does not support an AppId, to create Skipped/Unknown/Failed apps. I updated empty AppIds to "N/A" in the output CSV files for clarity.
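
A small sketch of the "N/A" substitution described above, assuming a simplified row type. AppStatusRow and the "App ID" header are hypothetical names chosen for illustration; the actual AppStatusResult fields and column headers are not shown in this thread.

```scala
// Hypothetical, simplified shape of the shared status row.
case class AppStatusRow(
    path: String,
    status: String,
    appId: String = "",
    message: String = "") {
  // The "App ID" column name is an assumption of this sketch.
  val outputHeaders: Seq[String] = Seq("Event Log", "Status", "App ID", "Description")

  // Empty app IDs are rendered as "N/A" so every column has a defined value.
  def convertToCSVSeq: Seq[String] = {
    val shownAppId = if (appId.trim.isEmpty) "N/A" else appId
    Seq(path, status, shownAppId, message)
  }
}
```

Doing the substitution at output time keeps the stored appId empty for skipped, unknown, and failed apps while still giving readers of the CSV an explicit placeholder.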

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

amahussein · 2024-05-21T14:37:52Z

Thanks @cindyyuanjiang !
There is a conflict in the branch.

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

cindyyuanjiang · 2024-05-21T20:23:06Z

Thanks @amahussein! Resolved conflict.

parthosa

LGTM. Thanks @cindyyuanjiang for merging the status report generation for both tools.

amahussein

Thanks @cindyyuanjiang
This feature was needed for quite some time.

generate status report for profiling tool

fb291d4

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

cindyyuanjiang requested review from parthosa and amahussein May 13, 2024 23:03

cindyyuanjiang self-assigned this May 13, 2024

cindyyuanjiang requested a review from nartal1 May 13, 2024 23:04

cindyyuanjiang added the feature request and core_tools labels May 13, 2024

amahussein reviewed May 14, 2024

move write csv function to profiler writer object

23d00c3

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

amahussein requested changes May 15, 2024

addressed review feedback

c60fe79

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

parthosa reviewed May 16, 2024

cindyyuanjiang added 4 commits May 16, 2024 17:24

added unit tests

9c85b9c

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

added app id into output

44866a7

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

merge conflict

670349b

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

fixed scala style

20d847e

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

amahussein requested changes May 17, 2024

updated unit test name and comment for clarity

fcab963

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

cindyyuanjiang requested review from parthosa and amahussein May 17, 2024 20:27

refactored Q/P tool status report

daf50ee

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

parthosa reviewed May 20, 2024

amahussein reviewed May 20, 2024

addressed review feedback

5df6a7a

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

cindyyuanjiang requested review from amahussein and parthosa May 20, 2024 21:56

fixed import order

eaf3d16

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

cindyyuanjiang added 2 commits May 21, 2024 11:27

merge conflict in qualificationappinfo

a7a007a

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

removed unused import

6144f5f

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>

parthosa approved these changes May 21, 2024

amahussein approved these changes May 22, 2024

cindyyuanjiang merged commit 4801781 into NVIDIA:dev May 22, 2024
15 checks passed

cindyyuanjiang deleted the spark-rapids-tools-998 branch May 22, 2024 18:24
