-
Notifications
You must be signed in to change notification settings - Fork 237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Profiling tool: Add support for health check. #2632
Conversation
…supported SQL Signed-off-by: Niranjan Artal <nartal@nvidia.com>
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can ignore the nits if you want
``` | ||
Application application_1616746343401_0025 (index=1) failed tasks: | ||
+-------+--------------+------+-------+------------------------------------+ | ||
|stageId|stageAttemptId|taskId|attempt|endReason_first36char | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason to use 36 char here and 100 below? seems like we could use 100 in both. Maybe put ... at the end of hte string if its longer.
/** | ||
* HealthCheck defined health check rules | ||
*/ | ||
class HealthCheck(apps: ArrayBuffer[ApplicationInfo], textFileWriter:ToolTextFileWriter){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit space after :
def listFailedJobsStagesTasks(): Unit = { | ||
for (app <- apps) { | ||
// Look for failed tasks. | ||
val tasksMessageHeader = s"Application ${app.appId} (index=${app.index}) failed tasks:\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the rest of the headers leave off the appid and app index, we should be consistent about that.
@@ -140,6 +140,9 @@ class ApplicationInfo( | |||
var taskGettingResult: ArrayBuffer[SparkListenerTaskGettingResult] = | |||
ArrayBuffer[SparkListenerTaskGettingResult]() | |||
|
|||
//Unsupported SQL plan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: space after //
// For unsupportedSQLPlanDF | ||
allDataFrames += (s"unsupportedSQLplan_$index" -> unsupportedSQLplan.toDF) | ||
if (unsupportedSQLplan.nonEmpty) { | ||
logInfo(s"Total ${unsupportedSQLplan.size} Plan node accums for appID=$appId") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Plan node accums" is copy and paste from above, change it to something like unsupported ops
|
||
import com.nvidia.spark.rapids.tool.ToolTestUtils | ||
import org.scalatest.FunSuite | ||
import scala.collection.mutable.ArrayBuffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: scala should go after java and before this section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually nothing major here so I'm going to merge and lets follow up with the first 36 char comment and the fixing the log statement in separate one.
* Profiling tool: Add support to list failed tasks, jobs, stages and unsupported SQL Signed-off-by: Niranjan Artal <nartal@nvidia.com> * add tests for health check features Signed-off-by: Niranjan Artal <nartal@nvidia.com>
* Profiling tool: Add support to list failed tasks, jobs, stages and unsupported SQL Signed-off-by: Niranjan Artal <nartal@nvidia.com> * add tests for health check features Signed-off-by: Niranjan Artal <nartal@nvidia.com>
This PR adds these features:
list Failed - Jobs, Stages and Tasks
list - RemovedBlockManager()
list - RemovedExecutors()
list - PossibleUnsupportedSQLPlan()
This PR is mainly to include the health related metrics/information of the cluster.
Added new eventlogs which have failed stages, tasks and exectuors removed.
Added tests to validate the API's.