dist/ddl: add subtask metrics #47175
Conversation
Skipping CI for Draft Pull Request.
/cc @ywqzzy
@ywqzzy: GitHub didn't allow me to request PR reviews from the following users: ywqzzy. Note that only pingcap members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #47175 +/- ##
================================================
- Coverage 72.9862% 72.7138% -0.2724%
================================================
Files 1340 1361 +21
Lines 400251 406749 +6498
================================================
+ Hits 292128 295763 +3635
- Misses 89191 92200 +3009
+ Partials 18932 18786 -146
Flags with carried forward coverage won't be shown.
  for {
  	// check if any error occurs.
  	if err := s.getError(); err != nil {
  		break
  	}

- 	subtask, err := s.taskTable.GetSubtaskInStates(s.id, task.ID, task.Step, proto.TaskStatePending)
+ 	subtask, err := s.taskTable.GetFirstSubtaskInStates(s.id, task.ID, task.Step, proto.TaskStatePending)
  	if err != nil {
  		logutil.Logger(s.logCtx).Warn("GetSubtaskInStates meets error", zap.Error(err))
Update the log message; it still says GetSubtaskInStates.
for _, subtask := range subtasks {
	metrics.IncDistDDLSubTaskCnt(subtask)
	metrics.StartDistDDLSubTask(subtask)
}
Wrap all the metric-related code in a function?
How about wrapping them in func (s *BaseScheduler) initMetrics(task)?
- func (s *BaseScheduler) startSubtask(id int64) {
- 	err := s.taskTable.StartSubtask(id)
+ func (s *BaseScheduler) startSubtask(subtask *proto.Subtask) {
+ 	metrics.DecDistDDLSubTaskCnt(subtask)
- metrics.DecDistDDLSubTaskCnt(subtask)
+ metrics.DecDistTaskSubTaskCnt(subtask)
Why decrease here?
Because we initialize it at L175.
@@ -26,7 +26,8 @@ type TaskTable interface {
	GetGlobalTasksInStates(states ...interface{}) (task []*proto.Task, err error)
	GetGlobalTaskByID(taskID int64) (task *proto.Task, err error)

	GetSubtaskInStates(instanceID string, taskID int64, step int64, states ...interface{}) (*proto.Subtask, error)
	GetSubtasksInStates(tidbID string, taskID int64, step int64, states ...interface{}) ([]*proto.Subtask, error)
We can remove it now
func (s *BaseScheduler) updateSubtaskStateAndError(subtask *proto.Subtask, state string, subTaskErr error) {
	metrics.DecDistDDLSubTaskCnt(subtask)
	metrics.EndDistDDLSubTask(subtask)
	err := s.taskTable.UpdateSubtaskStateAndError(subtask.ID, state, subTaskErr)
	if err != nil {
		s.onError(err)
	}
	subtask.State = state
	metrics.IncDistDDLSubTaskCnt(subtask)
	metrics.StartDistDDLSubTask(subtask)
}
I don't understand the metric update logic.
Update the metric as soon as the subtask status is changed.
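A minimal sketch of that update pattern, assuming a gauge keyed by subtask state. The `stateGauge` type, the `Subtask` struct, and the `transition` helper are hypothetical stand-ins written for illustration; they are not the PR's `metrics` package or `proto.Subtask`. The point is only the ordering: decrement the series for the old state, change the state, then increment the series for the new state.

```go
package main

import (
	"fmt"
	"sync"
)

// Subtask is a hypothetical stand-in for proto.Subtask with only the fields used here.
type Subtask struct {
	ID    int64
	State string
}

// stateGauge is a hypothetical in-memory gauge with one counter per state label.
type stateGauge struct {
	mu     sync.Mutex
	counts map[string]int
}

func newStateGauge() *stateGauge {
	return &stateGauge{counts: make(map[string]int)}
}

func (g *stateGauge) inc(state string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.counts[state]++
}

func (g *stateGauge) dec(state string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.counts[state]--
}

func (g *stateGauge) get(state string) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.counts[state]
}

// transition moves the subtask to newState and keeps the gauge in step:
// the old state's counter goes down before the new state's counter goes up.
func transition(g *stateGauge, st *Subtask, newState string) {
	g.dec(st.State)
	st.State = newState
	g.inc(st.State)
}

func main() {
	g := newStateGauge()
	st := &Subtask{ID: 1, State: "pending"}
	g.inc(st.State) // counted once when the subtask is first observed

	transition(g, st, "running")
	fmt.Println(g.get("pending"), g.get("running")) // prints "0 1"
}
```

Because every counter change is paired with a state change, the sum across all states stays equal to the number of subtasks being tracked.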
	metrics.StartDistDDLSubTask(subtask)
}

func (s *BaseScheduler) finishSubtask(subtask *proto.Subtask, subtaskMeta []byte) {
Where is the finishSubtask method called?
fixed in c230f37
/ok-to-test
metrics/grafana/tidb.json
Outdated
"targets": [
  {
    "exemplar": true,
    "expr": "sum(tidb_disttask_ddl_subtask_cnt{status=~\"pending|running|revert_pending|reverting|paused\"}) by (task_id)",
All expressions must have the extra labels; see the other existing metrics.
see #47175 (comment)
metrics/grafana/tidb.json
Outdated
    }
  ],
  "repeat": null,
  "title": "Dist DDL",
- "title": "Dist DDL",
+ "title": "Dist Execute Framework",
metrics/grafana/tidb_runtime.json
Outdated
"targets": [
  {
    "exemplar": true,
    "expr": "time()-tidb_disttask_ddl_subtask_start_time{k8s_cluster=\"$k8s_cluster\",tidb_cluster=\"$tidb_cluster\", instance=~\"$instance\", status=\"pending\"}",
Why not put this in the previous Dist Execute Framework?
Here we can see this detail for each TiDB node. If we put it in the Dist Execute Framework, the details of different TiDB nodes get mixed together.
You need the k8s_cluster="$k8s_cluster", tidb_cluster="$tidb_cluster", instance=~"$instance" labels.
I didn't find other metrics containing these labels 🤔
/label ok-to-test
@D3Hunter: These labels are not set on the issue: In response to this:
/retest
metrics/grafana/tidb_runtime.json
Outdated
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Distributed DDL SubTask Pending Duration",
Change all "dist ddl" to "dist task".
fixed in c4eef6d
- func (s *BaseScheduler) startSubtask(id int64) {
- 	err := s.taskTable.StartSubtask(id)
+ func (s *BaseScheduler) startSubtask(subtask *proto.Subtask) {
+ 	metrics.DecDistTaskSubTaskCnt(subtask)
why dec first, then inc?
dec pre-state subtask, then inc new-state subtask
> dec pre-state subtask, then inc new-state subtask
IMHO, the method name is confusing.
> > dec pre-state subtask, then inc new-state subtask
> IMHO, the method name is confusing.
Indeed, that is a fair point. I think the confusion arises because this function implicitly changes the state of the subtask halfway through. How about this?
func (s *BaseScheduler) startSubtaskAndUpdateState(subtask *proto.Subtask) {
....
}
subtasks, err := s.taskTable.GetSubtasksInStates(s.id, task.ID, task.Step, proto.TaskStatePending)
if err != nil {
	s.onError(err)
	return s.getError()
}
for _, subtask := range subtasks {
	metrics.IncDistTaskSubTaskCnt(subtask)
	metrics.StartDistTaskSubTask(subtask)
}
We can move this code into dispatcher.go.
When the subtasks are dispatched successfully, update the metric.
Then we don't need to query the taskTable.
That was my previous implementation. It caused the metrics to be collected on a different instance, which made the Grafana display confusing.
Co-authored-by: EasonBall <592838129@qq.com>
…into ddl-dist-metrics-2
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: tangenta, ywqzzy
/retest
I now prefer querying the subtask count at a fixed interval to update the metrics; it is much cleaner than the current inc/dec approach.
But this is OK for now.
That could lead to certain issues. For instance, we might miss some state changes if the state is updated twice within one interval.
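The trade-off raised here can be illustrated with a small, deterministic sketch. It assumes a subtask whose state is recorded once per tick and a poller that reads it only every `interval` ticks; `sampleEvery` is a hypothetical helper written for this illustration, not code from the PR, and the state names are taken from the Grafana expression earlier in the thread.

```go
package main

import "fmt"

// sampleEvery returns the states a fixed-interval poller would observe,
// given one recorded state per tick. States that begin and end between
// two samples never show up in the result.
func sampleEvery(history []string, interval int) []string {
	if interval < 1 {
		interval = 1 // guard against a zero or negative step
	}
	var seen []string
	for i := 0; i < len(history); i += interval {
		seen = append(seen, history[i])
	}
	return seen
}

func main() {
	// The subtask changes state twice within one polling interval.
	history := []string{"pending", "running", "paused"}

	fmt.Println(sampleEvery(history, 1)) // prints "[pending running paused]"
	fmt.Println(sampleEvery(history, 2)) // prints "[pending paused]"
}
```

With an interval of 2 the poller never sees `running`, which is the kind of missed transition described above; the inc/dec approach records every change at the cost of more bookkeeping at each call site.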
What problem does this PR solve?
Issue Number: close #47017
Problem Summary:
What is changed and how it works?
Runtime - Scheduler SubTask
TiDB perspective
Task perspective
SubTask perspective
Check List
Tests
more test results are coming soon
Side effects
Documentation
Release note