scheduler: do not remove the operator when the step does not finish #1715

Merged
merged 2 commits into from
Sep 5, 2019

@shafreeck shafreeck commented Aug 29, 2019

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

What problem does this PR solve?

There is a bug introduced by #1652. In some cases, such as adding peers or adding learners, a step is left unfinished while the peer is in the pending state, even though the conf version has already changed. In these cases the operator is removed because the controller thinks someone else has changed the conf version (in fact, the operator itself did). We fix that by checking whether the conf version was actually changed by the current step; if it was, the operator is not regarded as stale.
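The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Region`, `Step`, `AddPeer`, `Operator`, `SelfChanges` and `IsStale` are simplified, hypothetical types and names, and the region is reduced to a conf version plus a map of pending peers.

```go
package main

import "fmt"

// Region is a simplified, hypothetical stand-in for a region's metadata.
type Region struct {
	ConfVer uint64
	// Peers maps a store ID to true while the peer on that store is pending.
	Peers map[uint64]bool
}

// Step is a hypothetical operator step.
type Step interface {
	// ConfVerChanged reports whether this step has already bumped the conf version.
	ConfVerChanged(r *Region) bool
	// IsFinish reports whether the step has fully completed.
	IsFinish(r *Region) bool
}

// AddPeer adds a peer on the given store.
type AddPeer struct{ StoreID uint64 }

// The conf version changes as soon as the peer shows up in the region.
func (s AddPeer) ConfVerChanged(r *Region) bool {
	_, exists := r.Peers[s.StoreID]
	return exists
}

// The step only finishes once the peer is no longer pending.
func (s AddPeer) IsFinish(r *Region) bool {
	pending, exists := r.Peers[s.StoreID]
	return exists && !pending
}

// Operator remembers the conf version it was created against.
type Operator struct {
	OriginConfVer uint64
	Steps         []Step
	Current       int // index of the step being executed
}

// SelfChanges counts the conf version changes made by the operator's own
// steps, including the current step even if it has not finished yet.
func (o *Operator) SelfChanges(r *Region) uint64 {
	var n uint64
	for i := 0; i <= o.Current && i < len(o.Steps); i++ {
		if o.Steps[i].ConfVerChanged(r) {
			n++
		}
	}
	return n
}

// IsStale reports whether the conf version moved further than the operator
// itself can explain. The first condition guards the unsigned subtraction
// (an assumption of this sketch).
func (o *Operator) IsStale(r *Region) bool {
	if r.ConfVer < o.OriginConfVer {
		return true
	}
	return r.ConfVer-o.OriginConfVer > o.SelfChanges(r)
}

func main() {
	op := &Operator{OriginConfVer: 5, Steps: []Step{AddPeer{StoreID: 4}}}

	// The peer was added (conf version 5 -> 6) but is still pending.
	pending := &Region{ConfVer: 6, Peers: map[uint64]bool{1: false, 2: false, 3: false, 4: true}}
	fmt.Println(op.Steps[0].IsFinish(pending), op.IsStale(pending))

	// Someone else changed the configuration as well (6 -> 7).
	foreign := &Region{ConfVer: 7, Peers: pending.Peers}
	fmt.Println(op.IsStale(foreign))
}
```

In this sketch the unfinished `AddPeer` step accounts for the single conf version bump, so the operator is kept; only the additional, unexplained bump marks it as stale.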

What is changed and how it works?

Only check conf version changes when a new step is about to be executed.

Check List

Tests

  • Unit test

@shafreeck shafreeck force-pushed the fix/unfinished-step branch from a1636e0 to 8a43885 Compare August 29, 2019 09:48

@Luffbee Luffbee left a comment

A step may be sent multiple times. If we only check ConfVer the first time, the problem in the original issue will appear again (consider the case where ConfVer changes after the first send).
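The concern can be illustrated with a small sketch in which the staleness check runs on every dispatch rather than only on the first one. All names here (`Region`, `Operator`, `Dispatch`, the `"send"`/`"drop"`/`"done"` results) are hypothetical and the operator is reduced to a single add-peer step; this is not the controller code from the PR.

```go
package main

import "fmt"

// Region is a simplified, hypothetical snapshot reported by a heartbeat.
type Region struct {
	ConfVer uint64
	HasPeer bool // the new peer exists in the region
	Pending bool // the new peer has not caught up yet
}

// Operator is a hypothetical operator with one add-peer step.
type Operator struct {
	OriginConfVer uint64
	Sends         int // how many times the step has been sent
}

// stale reports whether the conf version moved further than the operator's
// own step can explain (at most one bump, once the peer exists).
func (o *Operator) stale(r Region) bool {
	var self uint64
	if r.HasPeer {
		self = 1
	}
	return r.ConfVer < o.OriginConfVer || r.ConfVer-o.OriginConfVer > self
}

// Dispatch is called on every heartbeat. The staleness check runs before
// each send, including re-sends of the same unfinished step.
func Dispatch(op *Operator, r Region) string {
	if op.stale(r) {
		return "drop"
	}
	if r.HasPeer && !r.Pending {
		return "done"
	}
	op.Sends++
	return "send"
}

func main() {
	op := &Operator{OriginConfVer: 5}
	heartbeats := []Region{
		{ConfVer: 5},                               // first send
		{ConfVer: 6, HasPeer: true, Pending: true}, // own change, step re-sent
		{ConfVer: 7, HasPeer: true, Pending: true}, // foreign change after the first send
	}
	for _, r := range heartbeats {
		fmt.Println(Dispatch(op, r))
	}
}
```

With a check that ran only before the first send, the third heartbeat in this example would still lead to a re-send against a configuration that someone else has changed.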

codecov-io commented Aug 29, 2019

Codecov Report

Merging #1715 into master will decrease coverage by 0.01%.
The diff coverage is 68.75%.

@@            Coverage Diff             @@
##           master    #1715      +/-   ##
==========================================
- Coverage   76.91%   76.89%   -0.02%     
==========================================
  Files         160      160              
  Lines       15718    15730      +12     
==========================================
+ Hits        12089    12096       +7     
- Misses       2615     2619       +4     
- Partials     1014     1015       +1
Impacted Files Coverage Δ
server/schedule/operator_controller.go 89.37% <100%> (ø) ⬆️
server/schedule/operator/operator.go 85.94% <67.74%> (-0.06%) ⬇️
server/tso/tso.go 77.06% <0%> (-6.43%) ⬇️
server/schedulers/random_merge.go 61.53% <0%> (-5.13%) ⬇️
pkg/etcdutil/etcdutil.go 88.4% <0%> (-2.9%) ⬇️
server/grpc_service.go 57.48% <0%> (-1.31%) ⬇️
server/server.go 82.21% <0%> (-0.58%) ⬇️
server/cluster.go 83.1% <0%> (+0.25%) ⬆️
server/member/leader.go 80.1% <0%> (+0.51%) ⬆️
server/handler.go 50.13% <0%> (+0.52%) ⬆️
... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9af07b0...b5b2a3f.

@shafreeck shafreeck force-pushed the fix/unfinished-step branch from 8a43885 to d67a2a9 Compare August 29, 2019 15:56
@shafreeck shafreeck changed the title from "scheduler: do not check the conf ver changes before a new step" to "scheduler: do not remove the operator when the step does not finish" Aug 29, 2019
@shafreeck shafreeck force-pushed the fix/unfinished-step branch from d67a2a9 to d7e2f64 Compare August 29, 2019 16:09
@shafreeck shafreeck force-pushed the fix/unfinished-step branch from d7e2f64 to 01e5d8e Compare August 29, 2019 16:22

@Luffbee Luffbee left a comment

What about merging ConfVerChanged and IsFinish? These two functions are tightly coupled, so separating them is not a good idea.

Then Check can return the step and the total change, but it would need to scan the steps from the start. The question here is: can a finished step become unfinished again before the whole operator has finished? (@nolouch) Actually, this problem also affects the current implementation.
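A minimal sketch of what such a merged check might look like, under stated assumptions: `Progress`, the per-step `Check` method, and the top-level `Check` function are hypothetical names invented for this illustration, and the region is reduced to a map of peers. It is not the API that was merged.

```go
package main

import "fmt"

// Region is a simplified, hypothetical stand-in for a region's metadata.
type Region struct {
	// Peers maps a store ID to true while the peer on that store is pending.
	Peers map[uint64]bool
}

// Progress is a hypothetical merged result of ConfVerChanged and IsFinish.
type Progress struct {
	ConfVerChanged bool
	Finished       bool
}

// Step is a hypothetical operator step with a single merged check.
type Step interface {
	Check(r *Region) Progress
}

// AddPeer changes the conf version once the peer exists and finishes once
// the peer is no longer pending.
type AddPeer struct{ StoreID uint64 }

func (s AddPeer) Check(r *Region) Progress {
	pending, exists := r.Peers[s.StoreID]
	return Progress{ConfVerChanged: exists, Finished: exists && !pending}
}

// RemovePeer changes the conf version and finishes once the peer is gone.
type RemovePeer struct{ StoreID uint64 }

func (s RemovePeer) Check(r *Region) Progress {
	_, exists := r.Peers[s.StoreID]
	return Progress{ConfVerChanged: !exists, Finished: !exists}
}

// Check scans the steps from the start and returns the first unfinished
// step together with the number of conf version changes made so far.
// It returns a nil step when every step has finished.
func Check(steps []Step, r *Region) (next Step, changes uint64) {
	for _, s := range steps {
		p := s.Check(r)
		if p.ConfVerChanged {
			changes++
		}
		if !p.Finished {
			return s, changes
		}
	}
	return nil, changes
}

func main() {
	steps := []Step{AddPeer{StoreID: 4}, RemovePeer{StoreID: 1}}
	r := &Region{Peers: map[uint64]bool{1: false, 2: false, 3: false, 4: true}}

	next, changes := Check(steps, r) // peer 4 is pending
	fmt.Printf("%T %d\n", next, changes)

	r.Peers[4] = false // peer 4 has caught up
	next, changes = Check(steps, r)
	fmt.Printf("%T %d\n", next, changes)

	delete(r.Peers, 1) // peer 1 has been removed
	next, changes = Check(steps, r)
	fmt.Println(next == nil, changes)
}
```

Because the scan restarts from the first step on every call, a step that was finished and later becomes unfinished again would be returned as the next step, which is exactly the situation the question above asks about.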

nolouch commented Aug 30, 2019

@Luffbee Only if two operators work on one region at the same time. Taking the network into account, it is possible, but the probability is very small. However, the epoch of the region increases monotonically, so your suggestion may be better than the current approach.

Luffbee previously approved these changes Aug 30, 2019
@Luffbee Luffbee dismissed their stale review August 30, 2019 08:03

Approved by mistake.

@nolouch nolouch added the type/bug The issue is confirmed as a bug. label Sep 2, 2019
shafreeck commented Sep 5, 2019

@Luffbee I considered your suggestion before and decided against it. From a code perspective the two really can be merged, but the result is semantically odd. The conf version is only one part of an epoch, not the whole of it. It would be acceptable to return whether the epoch changed, like this:

func IsFinish(region) (epochChanged bool, finished bool)

But it is odd to return this:

func IsFinish(region) (confVerChanged bool, finished bool)

If another requirement comes up to check whether the version changed, the latter signature might have to look like this:

func IsFinish(region) (confVerChanged bool, versionChanged bool, finished bool)

I don't think this is a good trend.

@nolouch nolouch left a comment

LGTM. Let's merge this first. This change does not affect the old code style.

nolouch commented Sep 5, 2019

PTAL @Luffbee @rleungx

@Luffbee Luffbee left a comment

LGTM

@nolouch nolouch added the status/can-merge Indicates a PR has been approved by a committer. label Sep 5, 2019
sre-bot commented Sep 5, 2019

/run-all-tests

@sre-bot sre-bot merged commit 144031c into tikv:master Sep 5, 2019
Luffbee added a commit that referenced this pull request Sep 9, 2019
* *: unify get store function everywhere (#1671)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* remove unnecessary parentheses

*  server: use leader lease to determine tso service validity (#1676)

Signed-off-by: disksing <i@disksing.com>

* change internal stat values to float64

* add pending operator influence

* add metrics of pending influence

* fix metrics

* fix panic

* adjust pending influence of balanceHotWrite

* change weight of pending influence

* test: fix tests (#1696)

* test: fix region syncer test

Signed-off-by: disksing <i@disksing.com>

* decrease region rolling window; store pending influence in scheduler

* add config-check flag for pd-server (#1695)

Signed-off-by: cwen0 <cwenyin0@gmail.com>

* decrease possibility transfer hot write leader

* change pending influence weight

* add unstarted op metrics

* add logs for debug

* add log for debug

* add logs for debug

* add logs for debug

* add logs for debug

* add logs for debug

* add logs for debug

* add logs for debug

* Revert "add logs for debug"

This reverts commit e74c7a9.

* add metrics for hotspot operators

* operator: rewrite move region related functions (#1667)

* add metrics for pending operators

* *: support setting endKey for ScanRange (#1700)

Signed-off-by: disksing <i@disksing.com>

* fix bug

* fix bug

* fix bug

* fix metrics thread-safe bug

* fix logic bug

* *: reduce some unnecessary parameters (#1698)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* schedule: Do not send an operator of a region with a stale epoch (#1659)

* schedule: Do not send an operator of a region with a stale epoch

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: check the version changed by the operator self

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: fix unit test

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: fix to avoid dispatching a stale opstep

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: refactor "ConsumeConfVer() int" to "ExpectConfVerChange() bool"

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: fix typo in comment

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* fix typo

Co-Authored-By: Ryan Leung <rleungx@gmail.com>

* dispatch: fix unittest

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: refine format

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* server: fix the dead lock in scatter region (#1706)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* add drop time for operator

* use IsDropped to recognize canceled ops

* try to fix trans leader burst

* try to fix trans leader burst

* add zombie influence

* change select src dst strategy; improve op_controller

* change select src strategy

* fix bug

* tools: fix set namespace in pd-ctl (#1701)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* tools: fix parse url without http prefix (#1703)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* tests: support deadlock detection in make test (#1704)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* Makefile: fix failpoint enable (#1722)

Signed-off-by: nolouch <nolouch@gmail.com>

* checker: fix the issue that a region does not merge to the sibling with smaller size (#1723)

Signed-off-by: disksing <i@disksing.com>

* tools: balance region simulator (#1708)

* scheduler: do not remove the operator when the step does not finish (#1715)

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* operator: fix the AddLearner config version judgment (#1732)

Signed-off-by: nolouch <nolouch@gmail.com>

* tools: fix TLS in pd control (#1729)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* syncer: support TLS for region syncer (#1728)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* schedule: fix a thread-safe bug and improve code (#1719)
Luffbee added a commit that referenced this pull request Sep 11, 2019
* *: unify get store function everywhere (#1671)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

*  server: use leader lease to determine tso service validity (#1676)

Signed-off-by: disksing <i@disksing.com>

* test: fix tests (#1696)

* test: fix region syncer test

Signed-off-by: disksing <i@disksing.com>

* add config-check flag for pd-server (#1695)

Signed-off-by: cwen0 <cwenyin0@gmail.com>

* operator: rewrite move region related functions (#1667)

* *: support setting endKey for ScanRange (#1700)

Signed-off-by: disksing <i@disksing.com>

* *: reduce some unnecessary parameters (#1698)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* schedule: Do not send an operator of a region with a stale epoch (#1659)

* schedule: Do not send an operator of a region with a stale epoch

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: check the version changed by the operator self

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: fix unit test

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: fix to avoid dispatching a stale opstep

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: refactor "ConsumeConfVer() int" to "ExpectConfVerChange() bool"

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: fix typo in comment

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* fix typo

Co-Authored-By: Ryan Leung <rleungx@gmail.com>

* dispatch: fix unittest

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: refine format

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* server: fix the dead lock in scatter region (#1706)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* tools: fix set namespace in pd-ctl (#1701)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* tools: fix parse url without http prefix (#1703)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* tests: support deadlock detection in make test (#1704)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* Makefile: fix failpoint enable (#1722)

Signed-off-by: nolouch <nolouch@gmail.com>

* checker: fix the issue that a region does not merge to the sibling with smaller size (#1723)

Signed-off-by: disksing <i@disksing.com>

* tools: balance region simulator (#1708)

* scheduler: do not remove the operator when the step does not finish (#1715)

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* operator: fix the AddLearner config version judgment (#1732)

Signed-off-by: nolouch <nolouch@gmail.com>

* tools: fix TLS in pd control (#1729)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* syncer: support TLS for region syncer (#1728)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* schedule: fix a thread-safe bug and improve code (#1719)

* statistics: fix region flow calculation (#1688)

Signed-off-by: jiyingtk <jiyingtk@mail.ustc.edu.cn>

* makefile: improve deadlock-enable/disable (#1736)

* api: fix missing keys statistic in region information (#1741)

Signed-off-by: nolouch <nolouch@gmail.com>

* *: update go version to 1.13 (#1742)

Signed-off-by: disksing <i@disksing.com>

* coordinator: add the operator cost time in log field (#1748)

Signed-off-by: nolouch <nolouch@gmail.com>
Labels
status/can-merge Indicates a PR has been approved by a committer. type/bug The issue is confirmed as a bug.