Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Profiling tool: Add support for health check. #2632

Merged
merged 3 commits into from
Jun 8, 2021

Conversation

nartal1
Copy link
Collaborator

@nartal1 nartal1 commented Jun 8, 2021

This PR adds these features:
list Failed - Jobs, Stages and Tasks
list - RemovedBlockManager()
list - RemovedExecutors()
list - PossibleUnsupportedSQLPlan()

This PR is mainly to include the health related metrics/information of the cluster.
Added new eventlogs which have failed stages, tasks and exectuors removed.
Added tests to validate the API's.

nartal1 added 3 commits June 7, 2021 17:32
…supported SQL

Signed-off-by: Niranjan Artal <nartal@nvidia.com>
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
@nartal1 nartal1 added the task Work required that improves the product but is not user facing label Jun 8, 2021
@nartal1 nartal1 self-assigned this Jun 8, 2021
@nartal1
Copy link
Collaborator Author

nartal1 commented Jun 8, 2021

build

Copy link
Collaborator

@tgravescs tgravescs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can ignore the nits if you want

```
Application application_1616746343401_0025 (index=1) failed tasks:
+-------+--------------+------+-------+------------------------------------+
|stageId|stageAttemptId|taskId|attempt|endReason_first36char |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason to use 36 char here and 100 below? seems like we could use 100 in both. Maybe put ... at the end of hte string if its longer.

/**
* HealthCheck defined health check rules
*/
class HealthCheck(apps: ArrayBuffer[ApplicationInfo], textFileWriter:ToolTextFileWriter){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit space after :

def listFailedJobsStagesTasks(): Unit = {
for (app <- apps) {
// Look for failed tasks.
val tasksMessageHeader = s"Application ${app.appId} (index=${app.index}) failed tasks:\n"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the rest of the headers leave off the appid and app index, we should be consistent about that.

@@ -140,6 +140,9 @@ class ApplicationInfo(
var taskGettingResult: ArrayBuffer[SparkListenerTaskGettingResult] =
ArrayBuffer[SparkListenerTaskGettingResult]()

//Unsupported SQL plan
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: space after //

// For unsupportedSQLPlanDF
allDataFrames += (s"unsupportedSQLplan_$index" -> unsupportedSQLplan.toDF)
if (unsupportedSQLplan.nonEmpty) {
logInfo(s"Total ${unsupportedSQLplan.size} Plan node accums for appID=$appId")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Plan node accums" is copy and paste from above, change it to something like unsupported ops


import com.nvidia.spark.rapids.tool.ToolTestUtils
import org.scalatest.FunSuite
import scala.collection.mutable.ArrayBuffer
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: scala should go after java and before this section

@tgravescs tgravescs added this to the June 7 - June 18 milestone Jun 8, 2021
Copy link
Collaborator

@tgravescs tgravescs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually nothing major here so I'm going to merge and lets follow up with the first 36 char comment and the fixing the log statement in separate one.

@tgravescs tgravescs merged commit 1229894 into NVIDIA:branch-21.06 Jun 8, 2021
nartal1 added a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Profiling tool: Add support to list failed tasks, jobs, stages and unsupported SQL

Signed-off-by: Niranjan Artal <nartal@nvidia.com>

* add tests for health check features

Signed-off-by: Niranjan Artal <nartal@nvidia.com>
nartal1 added a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Profiling tool: Add support to list failed tasks, jobs, stages and unsupported SQL

Signed-off-by: Niranjan Artal <nartal@nvidia.com>

* add tests for health check features

Signed-off-by: Niranjan Artal <nartal@nvidia.com>
@tgravescs tgravescs linked an issue Jun 10, 2021 that may be closed by this pull request
17 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
task Work required that improves the product but is not user facing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Profiling and qualification tool
2 participants