dist/ddl: add subtask metrics #47175

Merged
merged 22 commits into from
Sep 25, 2023

Conversation

okJiang
Member

@okJiang okJiang commented Sep 21, 2023

What problem does this PR solve?

Issue Number: close #47017

Problem Summary:

What is changed and how it works?

Runtime - Scheduler SubTask

  • A line chart showing the change in the number of waiting subTasks over time
  • A line chart showing the waiting time of waiting subTasks
  • A line chart showing the run time of running subTasks

TiDB perspective

  • A pie chart showing the distribution of all current subTasks on various TiDB nodes

Task perspective

  • A line chart showing the change in the number of each Task's (uncompleted/completed) subTasks over time
  • A line chart showing the average rate of each Task (subTask count/hour, which can later be improved to rows/s or bytes/s)

SubTask perspective

  • A line chart showing the average running speed of subTasks on different TiDB nodes (subTask count/hour)

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

more test results are coming soon

[3 images]
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot

ti-chi-bot bot commented Sep 21, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-tests-checked release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 21, 2023
@tiprow

tiprow bot commented Sep 21, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@okJiang okJiang changed the title [wip]dist/ddl: add subtask metrics dist/ddl: add subtask metrics Sep 22, 2023
@okJiang okJiang marked this pull request as ready for review September 22, 2023 01:38
@ti-chi-bot ti-chi-bot bot removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-tests-checked labels Sep 22, 2023
@ywqzzy
Contributor

ywqzzy commented Sep 22, 2023

/cc @ywqzzy

@ti-chi-bot

ti-chi-bot bot commented Sep 22, 2023

@ywqzzy: GitHub didn't allow me to request PR reviews from the following users: ywqzzy.

Note that only pingcap members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @ywqzzy

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@codecov

codecov bot commented Sep 22, 2023

Codecov Report

Merging #47175 (4171ebc) into master (34438f8) will decrease coverage by 0.2724%.
Report is 9 commits behind head on master.
The diff coverage is 80.8988%.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #47175        +/-   ##
================================================
- Coverage   72.9862%   72.7138%   -0.2724%     
================================================
  Files          1340       1361        +21     
  Lines        400251     406749      +6498     
================================================
+ Hits         292128     295763      +3635     
- Misses        89191      92200      +3009     
+ Partials      18932      18786       -146     
Flag Coverage Δ
integration 33.3929% <0.0000%> (?)
unit 73.0026% <84.7058%> (+0.0164%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 53.9913% <ø> (ø)
parser 84.9376% <ø> (-0.0108%) ⬇️
br 48.8526% <ø> (-4.2222%) ⬇️

  for {
      // check if any error occurs.
      if err := s.getError(); err != nil {
          break
      }

-     subtask, err := s.taskTable.GetSubtaskInStates(s.id, task.ID, task.Step, proto.TaskStatePending)
+     subtask, err := s.taskTable.GetFirstSubtaskInStates(s.id, task.ID, task.Step, proto.TaskStatePending)
      if err != nil {
          logutil.Logger(s.logCtx).Warn("GetSubtaskInStates meets error", zap.Error(err))
Contributor

update the log

Comment on lines 174 to 177
    for _, subtask := range subtasks {
        metrics.IncDistDDLSubTaskCnt(subtask)
        metrics.StartDistDDLSubTask(subtask)
    }
Contributor

Wrap function for all metric related code?

Member Author

How about wrapping them in func (s *BaseScheduler) initMetrics(task)?

- func (s *BaseScheduler) startSubtask(id int64) {
-     err := s.taskTable.StartSubtask(id)
+ func (s *BaseScheduler) startSubtask(subtask *proto.Subtask) {
+     metrics.DecDistDDLSubTaskCnt(subtask)
Contributor

Suggested change
- metrics.DecDistDDLSubTaskCnt(subtask)
+ metrics.DecDistTaskSubTaskCnt(subtask)

Contributor

Why decrease here?

Member Author

Because we initialize it at L175.

@@ -26,7 +26,8 @@ type TaskTable interface {
GetGlobalTasksInStates(states ...interface{}) (task []*proto.Task, err error)
GetGlobalTaskByID(taskID int64) (task *proto.Task, err error)

GetSubtaskInStates(instanceID string, taskID int64, step int64, states ...interface{}) (*proto.Subtask, error)
GetSubtasksInStates(tidbID string, taskID int64, step int64, states ...interface{}) ([]*proto.Subtask, error)
Contributor

We can remove it now

Comment on lines 464 to 474
func (s *BaseScheduler) updateSubtaskStateAndError(subtask *proto.Subtask, state string, subTaskErr error) {
    metrics.DecDistDDLSubTaskCnt(subtask)
    metrics.EndDistDDLSubTask(subtask)
    err := s.taskTable.UpdateSubtaskStateAndError(subtask.ID, state, subTaskErr)
    if err != nil {
        s.onError(err)
    }
    subtask.State = state
    metrics.IncDistDDLSubTaskCnt(subtask)
    metrics.StartDistDDLSubTask(subtask)
}
Contributor

Don't understand the metric update logic.

Member Author

Update the metric as soon as the subtask status is changed.

metrics.StartDistDDLSubTask(subtask)
}

func (s *BaseScheduler) finishSubtask(subtask *proto.Subtask, subtaskMeta []byte) {
Contributor

Where is the finishSubtask method called?

Member Author

fixed in c230f37

@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 22, 2023
@okJiang
Member Author

okJiang commented Sep 22, 2023

/ok-to-test

@ti-chi-bot ti-chi-bot bot added the ok-to-test Indicates a PR is ready to be tested. label Sep 22, 2023
"targets": [
{
"exemplar": true,
"expr": "sum(tidb_disttask_ddl_subtask_cnt{status=~\"pending|running|revert_pending|reverting|paused\"}) by (task_id)",
Contributor

All expressions must have the extra labels; see the other existing metrics.

}
],
"repeat": null,
"title": "Dist DDL",
Contributor

Suggested change
- "title": "Dist DDL",
+ "title": "Dist Execute Framework",

"targets": [
{
"exemplar": true,
"expr": "time()-tidb_disttask_ddl_subtask_start_time{k8s_cluster=\"$k8s_cluster\",tidb_cluster=\"$tidb_cluster\", instance=~\"$instance\", status=\"pending\"}",
Contributor

Why not put this in the previous Dist Execute Framework?

Member Author

We can see this detail for each TiDB node. If we put it in the Dist Execute Framework, the details of different TiDB nodes are mixed together.

Contributor

you need k8s_cluster="$k8s_cluster", tidb_cluster="$tidb_cluster", instance=~"$instance" labels

Member Author

@okJiang okJiang Sep 22, 2023

I didn't find other metrics containing these labels 🤔

Member Author

This way, we can see the details of each TiDB node separately, rather than mixed together in one panel.
[image]

@D3Hunter
Contributor

/label ok-to-test
/remove-label needs-ok-to-test

@ti-chi-bot

ti-chi-bot bot commented Sep 22, 2023

@D3Hunter: These labels are not set on the issue: needs-ok-to-test.

In response to this:

/label ok-to-test
/remove-label needs-ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@okJiang
Member Author

okJiang commented Sep 23, 2023

/retest

"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Distributed DDL SubTask Pending Duration",
Contributor

change all dist ddl to dist task

Member Author

fixed in c4eef6d

- func (s *BaseScheduler) startSubtask(id int64) {
-     err := s.taskTable.StartSubtask(id)
+ func (s *BaseScheduler) startSubtask(subtask *proto.Subtask) {
+     metrics.DecDistTaskSubTaskCnt(subtask)
Contributor

why dec first, then inc?

Member Author

dec pre-state subtask, then inc new-state subtask

Contributor

dec pre-state subtask, then inc new-state subtask

IMHO, the method name is confusing.

Member Author

dec pre-state subtask, then inc new-state subtask

IMHO, the method name is confusing.

Indeed, that is a fair point. I think the confusion comes from this function implicitly changing the state of the subtask halfway through. How about this?

func (s *BaseScheduler) startSubtaskAndUpdateState(subtask *proto.Subtask) {
    ....
}

Comment on lines +169 to +177
    subtasks, err := s.taskTable.GetSubtasksInStates(s.id, task.ID, task.Step, proto.TaskStatePending)
    if err != nil {
        s.onError(err)
        return s.getError()
    }
    for _, subtask := range subtasks {
        metrics.IncDistTaskSubTaskCnt(subtask)
        metrics.StartDistTaskSubTask(subtask)
    }
Contributor

We can move this code into dispatcher.go.
When subtasks are dispatched successfully, update the metric.
Then we don't need to query the taskTable.

Member Author

That was my previous implementation. It caused the metrics to be collected on a different instance, which made the Grafana display confusing.

Co-authored-by: EasonBall <592838129@qq.com>
@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Sep 25, 2023
@ti-chi-bot

ti-chi-bot bot commented Sep 25, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tangenta, ywqzzy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Sep 25, 2023
@ti-chi-bot

ti-chi-bot bot commented Sep 25, 2023

[LGTM Timeline notifier]

Timeline:

  • 2023-09-25 06:59:14.350170718 +0000 UTC m=+258144.068512920: ☑️ agreed by tangenta.
  • 2023-09-25 07:38:52.578940656 +0000 UTC m=+260522.297282857: ☑️ agreed by ywqzzy.

@okJiang
Member Author

okJiang commented Sep 25, 2023

/retest

Contributor

@D3Hunter D3Hunter left a comment

now i prefer to query subtask count by a fixed interval to update metrics, much cleaner, not current inc/dec...

but ok for now

@okJiang
Member Author

okJiang commented Sep 25, 2023

now i prefer to query subtask count by a fixed interval to update metrics, much cleaner, not current inc/dec...

but ok for now

That could lead to certain issues. For instance, we might miss some state changes if a subtask's state is updated twice within one interval.

@ti-chi-bot ti-chi-bot bot merged commit 516542b into pingcap:master Sep 25, 2023
@okJiang okJiang deleted the ddl-dist-metrics-2 branch September 25, 2023 08:51
Labels
approved lgtm ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

disttask: add metrics collection for dispatcher and scheduler
4 participants