[FEA] Generate Status Report for Profiling Tool #1012

Merged
cindyyuanjiang merged 13 commits into NVIDIA:dev from spark-rapids-tools-998 on May 22, 2024

Conversation

Collaborator

@cindyyuanjiang cindyyuanjiang commented May 13, 2024

Fixes #998

Changes

  1. Generate a status report for event logs processed by the Profiling tool
  2. Refactor common data structures and functions shared by the Qualification and Profiling (Q/P) tools

Example Output

spark_rapids profiling --eventlogs <my_event_logs> --tools_jar <my_tools_jar> --verbose

In file rapids_4_spark_profile/profiling_status.csv:

Event Log,Status,Description
file:/xxxxxxxx/photon_log,SUCCESS,app-xxxxxxxx, Took 29084ms to process
file:/xxxxxxxx/eventlog_no_appinfo,, IncorrectAppStatusException: Application status is incorrect. Missing AppInfo
file:/xxxxxxxx/gpu_log,SUCCESS,local-xxxxxxxx, Took 1089ms to process
file:/xxxxxxxx/cpu_log,SUCCESS,application_xxxxxxxx, Took 1850ms to process
file:/xxxxxxxx/structured_streaming_log,SKIPPED,StreamingEventLogException: Encountered Spark Structured Streaming Job: skipping this file!

Testing
Created a directory sample_eventlogs, which contains 1 CPU event log, 1 GPU event log, 1 Photon event log, 1 event log with Structured Streaming, and 1 event log without ApplicationInfo. We will use these event logs for all following tests.

  • Profiling jar tool: test with combined, collection, and compare modes and verify profiling_status.csv is the same
  • Profiling jar tool: verify GPU event log is successful (not skipped)
  • Profiling jar tool: verify Photon and Structured Streaming event logs
  • Qualification jar tool: verify profiling_status.csv is the same before and after this PR

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
@cindyyuanjiang cindyyuanjiang self-assigned this May 13, 2024
@cindyyuanjiang cindyyuanjiang added feature request New feature or request core_tools Scope the core module (scala) labels May 13, 2024
Collaborator

@amahussein amahussein left a comment

Thanks @cindyyuanjiang
I have some quick notes regarding testing because Profiling had some special cases to handle:

  • Generally speaking, the CLI does not provide enough coverage for the changes in this PR. This change needs to be tested using the jar command because we need to verify the different Profiling modes. See the modes (combined, collection, and compare) in the Profiling jar documentation. For all Profiling modes, profiling_status.csv should be the same (also use multiple eventlogs).
  • Test the Profiling jar cmd with CPU/GPU eventlogs to verify that the app is not skipped the way it is in Qualification
  • Test the Profiling jar cmd with Photon and Streaming applications to confirm the behavior
  • Run the Qualification jar cmd on the same eventlogs (GPU, CPU, Photon, etc.) to verify that Qualification behavior has not changed by moving the code into common methods
  • Finally, add a unit test to verify the generated results are correct.

Comment on lines 146 to 150
// Write status reports for all event logs
val profileOutputWriter = new ProfileOutputWriter(outputDir, Profiler.PROFILE_LOG_NAME,
numOutputRows, outputCSV = outputCSV)
val reportResults = generateStatusProfResults(appStatusReporter.asScala.values.toSeq)
profileOutputWriter.write("Profiling Status", reportResults)
Collaborator

There is a tricky difference between Qualification and Profiler.
For instance, Profiler generates a profile.log that contains text formatted data for each table.
Therefore, the above code will create two files, profile.log and profiling_status.csv. That is going to cause confusion about the purpose of those files, since their names overlap with the subdirectories of the apps. We need only the CSV file.

Collaborator Author

Removed profile.log generation by separating writeCSVTable into an individual function.

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
Comment on lines 161 to 175
case class StatusProfileResult(
path: String,
status: String,
message: String = "") extends ProfileResult {
override val outputHeaders: Seq[String] = Seq("Event Log", "Status", "Description")

override def convertToSeq: Seq[String] = {
Seq(path, status, message)
}

override def convertToCSVSeq: Seq[String] = {
Seq(path, status, message)
}
}

Collaborator

I like that you tried to do some refactor in this PR to make the code reusable.

If we define a new StatusProfileResult, then we have two classes

  • QualificationAppInfo.StatusSummaryInfo: which has extra fields such as appID
  • ProfileClassWarehouse.StatusProfileResult

One of the above should go away because they do the same thing.
So, we should add the extra field appID to StatusProfileResult, then replace the usage of StatusSummaryInfo with StatusProfileResult.

Collaborator Author

Thanks! Added appID in StatusProfileResult and removed StatusSummaryInfo
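
A minimal sketch of what a single shared result type might look like once appID is folded in. The `ProfileResult` trait below is a simplified stand-in, and the position of `appId` and the "AppID" header name are assumptions for illustration, not the code merged in this PR.

```scala
// Simplified stand-in for the tool's ProfileResult trait (assumption).
trait ProfileResult {
  val outputHeaders: Seq[String]
  def convertToSeq: Seq[String]
  def convertToCSVSeq: Seq[String]
}

// One status row type shared by the Qualification and Profiling tools.
// The appId field, its default, and its column position are hypothetical.
case class StatusProfileResult(
    path: String,
    status: String,
    appId: String = "",
    message: String = "") extends ProfileResult {
  override val outputHeaders: Seq[String] =
    Seq("Event Log", "Status", "AppID", "Description")

  // Text and CSV output use the same column order in this sketch.
  override def convertToSeq: Seq[String] = Seq(path, status, appId, message)

  override def convertToCSVSeq: Seq[String] = convertToSeq
}
```

With one type carrying the optional appId, a second class such as StatusSummaryInfo has nothing left to add, which is the point of the review comment above.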

* Write a CSV file given the input header and data.
*/
def writeCSVTable(header: String, outRows: Seq[ProfileResult],
outputCSV: Boolean = true, outputDir: String): Unit = {
Collaborator

The outputCSV: Boolean = true argument should be removed from this method.
The method's name is writeCSVTable, which makes an outputCSV argument redundant.
The method should simply print to CSV without checks; the caller is responsible for that check.

Collaborator Author

removed outputCSV from writeCSVTable

def write(headerText: String, outRows: Seq[ProfileResult],
emptyTableText: Option[String] = None, tableDesc: Option[String] = None): Unit = {
writeTextTable(headerText, outRows, emptyTableText, tableDesc)
ProfileOutputWriter.writeCSVTable(headerText, outRows, outputCSV, outputDir)
Collaborator

We should check outputCSV before calling this, and writeCSVTable should not take an outputCSV argument:

if (outputCSV) {
  ProfileOutputWriter.writeCSVTable(headerText, outRows, outputDir)
}

Collaborator Author

added this check before calling ProfileOutputWriter.writeCSVTable
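
The split of responsibilities discussed in this thread can be sketched as follows. This is an illustration under assumptions, not the tool's writer: `CsvWriterSketch`, its signatures, and the use of plain `Seq[String]` rows are hypothetical, and no CSV escaping is done.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object CsvWriterSketch {
  // Always writes: the method name already says CSV, so it takes no outputCSV flag.
  // Fields are joined with commas and are not escaped (sketch only).
  def writeCSVTable(header: Seq[String], rows: Seq[Seq[String]], outFile: Path): Unit = {
    val lines = (header +: rows).map(_.mkString(","))
    Files.write(outFile, lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
  }

  // The caller owns the decision, mirroring the suggested `if (outputCSV)` guard.
  def write(header: Seq[String], rows: Seq[Seq[String]],
      outFile: Path, outputCSV: Boolean): Unit = {
    if (outputCSV) {
      writeCSVTable(header, rows, outFile)
    }
  }
}
```

Keeping the flag out of `writeCSVTable` also lets a caller such as the status report invoke it directly when a CSV is always wanted.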

Comment on lines 60 to 74
/**
* For each app status report, generate a StatusProfileResult.
* @return Seq[StatusProfileResult] - Seq[(path, status, description)]
*/
private def generateStatusProfResults(appStatuses: Seq[AppResult]): Seq[StatusProfileResult] = {
appStatuses.map {
case FailureAppResult(path, message) => StatusProfileResult(path, "FAILURE", message)
case SkippedAppResult(path, message) => StatusProfileResult(path, "SKIPPED", message)
case SuccessAppResult(path, _, message) => StatusProfileResult(path, "SUCCESS", message)
case UnknownAppResult(path, _, message) => StatusProfileResult(path, "UNKNOWN", message)
case profAppResult: AppResult =>
throw new UnsupportedOperationException(s"Invalid status for $profAppResult")
}
}

Collaborator

Once you use the same class StatusProfileResult for both Q/P tools, this code can be used by both the Qualification and Profiling tools.

Collaborator Author

Reused generateStatusProfResults for both Q/P tools and moved the definition to Profiler.generateStatusProfResults.
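
A minimal sketch of how the mapping could be shared by both tools through a common trait. The `AppResult` hierarchy here is a simplified, sealed stand-in whose constructor arities follow the patterns in the snippet above; `StatusRow` and `StatusReportSketch` are hypothetical names, and whether the real hierarchy is sealed is an assumption.

```scala
// Simplified stand-ins for the tool's app result classes (assumption: sealed).
sealed trait AppResult {
  def path: String
  def message: String
}
case class SuccessAppResult(path: String, appId: String, message: String) extends AppResult
case class FailureAppResult(path: String, message: String) extends AppResult
case class SkippedAppResult(path: String, message: String) extends AppResult
case class UnknownAppResult(path: String, appId: String, message: String) extends AppResult

// Hypothetical row type standing in for StatusProfileResult.
case class StatusRow(path: String, status: String, message: String)

// Hypothetical common trait that both tools could mix in.
trait StatusReportSketch {
  def generateStatusRows(appStatuses: Seq[AppResult]): Seq[StatusRow] =
    appStatuses.map {
      case FailureAppResult(path, message) => StatusRow(path, "FAILURE", message)
      case SkippedAppResult(path, message) => StatusRow(path, "SKIPPED", message)
      case SuccessAppResult(path, _, message) => StatusRow(path, "SUCCESS", message)
      case UnknownAppResult(path, _, message) => StatusRow(path, "UNKNOWN", message)
    }
}

object StatusReportSketch extends StatusReportSketch
```

Because the stand-in hierarchy is sealed, the compiler checks the match for exhaustiveness, so the sketch needs no fallback case that throws UnsupportedOperationException.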

Comment on lines 203 to 215
  case oom: OutOfMemoryError =>
- logError(s"OOM error while processing large file: ${path.eventLog.toString}." +
- s" Increase heap size. Exiting ...", oom)
+ logError(s"OOM error while processing large file: $pathStr." +
+ s"Increase heap size.", oom)
  sys.exit(1)
- case NonFatal(e) =>
- logWarning(s"Exception occurred processing file: ${path.eventLog.getName}", e)
- case o: Throwable =>
- logError(s"Error occurred while processing file: ${path.eventLog.toString}. Exiting ...", o)
+ case o: Error =>
+ logError(s"Error occurred while processing file: $pathStr", o)
  sys.exit(1)
+ case e: Exception =>
+ progressBar.foreach(_.reportFailedProcess())
+ val failureAppResult = FailureAppResult(pathStr,
+ s"Unexpected exception processing log, skipping!")
+ failureAppResult.logMessage(Some(e))
+ appStatusReporter.put(pathStr, failureAppResult)
Collaborator

Can we fix some things in this block?
I prefer that it checks for NonFatal to create a new failureApp object. Anything else should cause an exit... I am not sure which cases can produce Error.
Also, this code is pretty much the same as what we have in Qualification.

Collaborator

Also, in the future we need to be careful about the exit part when we merge the Profiling tool into the Qual tool. It should not exit the Qual tool process.

Collaborator Author

thanks! I updated this block. It is similar to Qualification. We can try to refactor something common out. I also want to get this correct as a starting point.
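
The NonFatal-based handling suggested in this thread can be sketched as below. It is an illustration only: `FailureHandlingSketch`, `Outcome`, and `processOne` are hypothetical names, and the real tool records `AppResult` objects in `appStatusReporter` rather than returning a value.

```scala
import scala.util.control.NonFatal

object FailureHandlingSketch {
  // Hypothetical outcome type used only for this sketch.
  sealed trait Outcome
  case class Processed(path: String) extends Outcome
  case class Failed(path: String, reason: String) extends Outcome

  def processOne(path: String)(body: String => Unit): Outcome =
    try {
      body(path)
      Processed(path)
    } catch {
      // Recoverable problems become a failure record and processing continues.
      case NonFatal(e) =>
        Failed(path, s"Unexpected exception processing log, skipping! ${e.getMessage}")
      // Fatal throwables (VirtualMachineError such as OutOfMemoryError, LinkageError,
      // InterruptedException, ThreadDeath, ControlThrowable) do not match NonFatal,
      // so they propagate to the caller, which decides whether to exit.
    }
}
```

Leaving the exit decision to the caller also speaks to the concern about a future merge of the two tools: the per-file handler never terminates the process itself.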

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
Comment on lines 67 to 69
case SkippedAppResult(path, message) => StatusProfileResult(path, "SKIPPED", message)
case SuccessAppResult(path, _, message) => StatusProfileResult(path, "SUCCESS", message)
case UnknownAppResult(path, _, message) => StatusProfileResult(path, "UNKNOWN", message)
Collaborator

We are storing appId in SuccessAppResult and UnknownAppResult. Can we include the appIds in the CSV as well? This helps in tracing appId --> eventlog and vice versa.

Collaborator Author

thanks @parthosa! I will add the appId into message. This is consistent with the qualification output.

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
@cindyyuanjiang
Collaborator Author

Thanks @amahussein and @parthosa! I addressed all review feedback. I will work on further testing which is outlined in the description.

@@ -751,6 +751,27 @@ class ApplicationInfoSuite extends FunSuite with Logging {
assert(execInfo.head.maxMem === 5538054144L)
}

test("test malformed json eventlog") {
Collaborator

Suggest a different name that reflects the real purpose of this unit test. The current string may lead someone to think that the unit test only checks malformed eventlogs.
Also, add a comment explaining the unit-test logic and what it tests.

Collaborator Author

Thanks @amahussein! Updated this.

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
Collaborator

@parthosa parthosa left a comment

Thanks @cindyyuanjiang. It would be nice to have a generic name like StatusAppResult or AppStatusResult instead of StatusProfileResult, since we use this case class in both tools.

I think we can leave the writer classes separate as they require other parameters as well.

@@ -595,4 +574,23 @@ object Profiler {
propStr + s"\nComments:\n$commentsToStr\n"
}
}

/**
Collaborator

Should this function live in a common trait, since it's now used by both Profiler and Qualification?

Collaborator Author

Thanks @parthosa! Moving this function to the common trait RuntimeReporter.

@@ -338,7 +339,7 @@ class QualOutputWriter(outputDir: String, reportReadSchema: Boolean,
}
}

- def writeStatusReport(statusReports: Seq[StatusSummaryInfo], order: String): Unit = {
+ def writeStatusReport(statusReports: Seq[StatusProfileResult], order: String): Unit = {
Collaborator

Should we make StatusProfileResult generic since it's used by both tools?

Collaborator

That sounds reasonable.
If moving that does not require too many changes, then it would be good to do it in this PR.
Otherwise, we can keep it the way it is.

Collaborator Author

I renamed StatusProfileResult to the more generic AppStatusResult. AppStatusResult is defined in ProfileClassWarehouse.scala. I think it is fine to keep the definition there because the file documentation says: "This is a warehouse to store all Classes used for profiling and qualification."

Collaborator

@amahussein amahussein left a comment

Thanks @cindyyuanjiang
I was playing around with the changes and noticed that Skipped rows have an empty AppID.
It looks like this PR uses FailureApp to create a skipped app, which does not support an AppID.
IMHO, it is cleaner to have a defined value for a skipped app, but I am fine with this as long as the code docs explain somewhere what to expect from those fields.

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
@cindyyuanjiang
Collaborator Author

It looks like this PR uses FailureApp to create a skipped app, which does not support an AppID.
IMHO, it is cleaner to have a defined value for a skipped app, but I am fine with this as long as the code docs explain somewhere what to expect from those fields.

Thanks @amahussein! Yes, both the P and Q tools use FailureApp to create Skipped/Unknown/Failed apps, which does not support an AppId. I updated empty AppIds to "N/A" in the output CSV files for clarity.
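
A small sketch of how an empty AppId might be normalized to a defined value before a row is written. `AppIdSketch` and `normalizeAppId` are hypothetical names, and representing the missing id as an `Option` is an assumption; only the "N/A" placeholder comes from the comment above.

```scala
object AppIdSketch {
  // Placeholder written to the CSV when no application id is available.
  val NotAvailable: String = "N/A"

  // Treat a missing, empty, or whitespace-only id as not available.
  def normalizeAppId(appId: Option[String]): String =
    appId.map(_.trim).filter(_.nonEmpty).getOrElse(NotAvailable)
}
```

A fixed placeholder keeps every row at the same column count and makes it explicit that the id is missing rather than lost.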

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
@amahussein
Collaborator

Thanks @cindyyuanjiang !
There is a conflict in the branch.

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
@cindyyuanjiang
Collaborator Author

Thanks @amahussein! Resolved conflict.

Collaborator

@parthosa parthosa left a comment

LGTM. Thanks @cindyyuanjiang for merging the status report generation for both tools.

Collaborator

@amahussein amahussein left a comment

Thanks @cindyyuanjiang
This feature was needed for quite some time.

@cindyyuanjiang cindyyuanjiang merged commit 4801781 into NVIDIA:dev May 22, 2024
15 checks passed
@cindyyuanjiang cindyyuanjiang deleted the spark-rapids-tools-998 branch May 22, 2024 18:24
Successfully merging this pull request may close these issues.

[FEA] Generate a status report for event logs processed in Profiling tool
3 participants