This repository has been archived by the owner on Nov 24, 2023. It is now read-only.

bug-fix: use table-info-before always and fix bug for recover lock in optimistic #1518

Merged
merged 22 commits into from
Mar 26, 2021

Conversation

GMHDBJD
Collaborator

@GMHDBJD GMHDBJD commented Mar 18, 2021

What problem does this PR solve?

#1510 case4

What is changed and how it works?

  • if the table-info-before of an info differs from the table saved in the master (e.g. the worker restarted and TrySync ran twice)
    • always use table-info-before
    • join the tables based on table-info-before as oldJoined
  • when recovering a lock
    • join all infos
    • build the lock with the joined info, then update the tables with tableInfoBefore
    • sort the infos and TrySync them one by one
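
The recover-lock steps above can be sketched roughly as follows. This is a minimal Go sketch with illustrative types and names (Info, recoverOrder), not the real DM implementation, which lives in dm/master/shardddl and pkg/shardddl/optimism:

```go
package main

import (
	"fmt"
	"sort"
)

// Info is a simplified stand-in for DM's optimism.Info; the field names
// here are illustrative, not the real struct.
type Info struct {
	Source, Schema, Table string
	Version               int64
}

// recoverOrder sorts infos so they can be replayed (TrySync'd) one by one
// in a deterministic order after the master rebuilds a lock.
func recoverOrder(infos []Info) []Info {
	sorted := make([]Info, len(infos))
	copy(sorted, infos)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].Version != sorted[j].Version {
			return sorted[i].Version < sorted[j].Version
		}
		return sorted[i].Source < sorted[j].Source
	})
	return sorted
}

func main() {
	infos := []Info{
		{Source: "mysql-2", Schema: "db", Table: "tb2", Version: 2},
		{Source: "mysql-1", Schema: "db", Table: "tb1", Version: 1},
	}
	for _, in := range recoverOrder(infos) {
		fmt.Printf("TrySync %s.%s.%s v%d\n", in.Source, in.Schema, in.Table, in.Version)
	}
}
```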

Check List

Tests

  • unit test
  • integration test

@GMHDBJD GMHDBJD added status/WIP This PR is still work in progress type/bug-fix Bug fix needs-update-release-note This PR should be added into release notes. Remove this label once the release notes are updated needs-cherry-pick-release-2.0 This PR should be cherry-picked to release-2.0. Remove this label after cherry-picked to release-2.0 labels Mar 18, 2021
@GMHDBJD GMHDBJD added status/PTAL This PR is ready for review. Add this label back after committing new changes status/WIP This PR is still work in progress and removed status/WIP This PR is still work in progress status/PTAL This PR is ready for review. Add this label back after committing new changes labels Mar 19, 2021
@lichunzhu
Contributor

Is this PR still WIP?

@GMHDBJD GMHDBJD added status/PTAL This PR is ready for review. Add this label back after committing new changes and removed status/WIP This PR is still work in progress labels Mar 19, 2021
@@ -72,7 +72,7 @@ func NewLock(ID, task, downSchema, downTable string, ti *model.TableInfo, tts []
synced: true,
versions: make(map[string]map[string]map[string]int64),
}
l.addTables(tts)
l.addTables(l.joined, tts)
Contributor

What if tts contains other tables whose schemas haven't reached l.joined?

Collaborator Author

Here l.joined equals tableInfoBefore. If it's a new lock (first table), tableInfoBefore == l.joined. Other newer tables will be added at L163.

Contributor

The tts here is from https://github.com/pingcap/dm/blob/master/dm/master/shardddl/optimist.go#L267. optimist gets all source tables from etcd, so this tts may contain more than just the table in this info, I think.

Collaborator Author

Em... that's right, I will fix it.

@GMHDBJD GMHDBJD added status/WIP This PR is still work in progress and removed status/PTAL This PR is ready for review. Add this label back after committing new changes labels Mar 19, 2021
pkg/terror/error_list.go Outdated Show resolved Hide resolved
pkg/shardddl/optimism/info.go Outdated Show resolved Hide resolved
dm/master/shardddl/optimist_test.go Outdated Show resolved Hide resolved
// handle the case where <callerSource, callerSchema, callerTable>
// is not in old source tables and current new source tables.
// duplicate append is not a problem.
tts = append(tts, newTargetTable(l.Task, callerSource, l.DownSchema, l.DownTable,
map[string]map[string]struct{}{callerSchema: {callerTable: struct{}{}}}))
// add any new source tables.
l.addTables(tts)
l.receiveTable(callerSource, callerSchema, callerTable, oldTable)
if val, ok := l.versions[callerSource][callerSchema][callerTable]; !ok || val < infoVersion {
l.versions[callerSource][callerSchema][callerTable] = infoVersion
}
Collaborator

now we change l.tables via l.receiveTable, so at line 179 oldJoined is no longer the result of "joining every l.tables" 🤔

Collaborator Author

Did you mean joining all the tables to build the old joined?

Collaborator

@lance6716 lance6716 Mar 23, 2021

I'm just not sure whether the correctness depends on "old joined should be the joined result of all l.tables", or on "the old joined of the previous TrySync and the new joined of the next TrySync should stay consistent"

Collaborator Author

@GMHDBJD GMHDBJD Mar 23, 2021

If we assume the user ran start-task with the same table info in all upstreams, tableInfoBefore should be the same as oldJoined or the old table. Otherwise, oldJoined is not correct; do we keep it and hope a later join returns an error? cc @lichunzhu

Contributor

If dm-master/dm-worker restarts, the table info in the upstreams is more likely to differ. What's the problem here? I don't understand this clearly.

Collaborator Author

But if tableInfoBefore doesn't equal l.tables[source][schema][table], that means l.tables[source][schema][table] and oldJoined may be wrong. Do we need to join all the tables with tableInfoBefore as oldJoined?

Contributor

But if tableInfoBefore doesn't equal l.tables[source][schema][table], that means l.tables[source][schema][table] and oldJoined may be wrong. Do we need to join all the tables with tableInfoBefore as oldJoined?

I've thought carefully about this situation. Could you help me check if I'm wrong?

When this happens, it means we are adding a table whose schema differs from the current joined table info. When we init all the tables, there are three situations:

  1. This task is new. The pre-check will ensure the correctness of the tables' schemas.
  2. This dm cluster recovered from a failure. The lock is synced now and all the tables have the same schema. If we add a table with a different schema, we should report an error.
  3. This dm cluster recovered from a failure. The lock is unsynced now. Some workers didn't put their lock Info into etcd before the restart. These tables' l.tables will be set to l.joined at first, and if that differs from the tableInfoBefore received later, I think it's acceptable. Am I right?
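
Situations 2 and 3 above can be distinguished by whether the table's schema was really received or only pre-filled from the joined schema. A minimal sketch of that idea — the Lock shape, receiveTable, and the received map here are illustrative, not the real DM code:

```go
package main

import "fmt"

// Lock is a toy stand-in for the optimistic shard DDL lock.
type Lock struct {
	tables   map[string]string // table key -> schema (simplified to a string)
	received map[string]bool   // whether a real table-info-before has arrived
}

// receiveTable applies an incoming table-info-before. A conflict is only an
// error if a real schema was already received (situation 2); a value merely
// pre-filled from the joined schema may be overwritten (situation 3).
func (l *Lock) receiveTable(key, schema string) error {
	old, ok := l.tables[key]
	if ok && l.received[key] && old != schema {
		return fmt.Errorf("conflicting schema for %s: %q vs %q", key, old, schema)
	}
	l.tables[key] = schema
	l.received[key] = true
	return nil
}

func main() {
	l := &Lock{
		tables:   map[string]string{"mysql-1.db.tb": "joined-schema"}, // pre-filled
		received: map[string]bool{},
	}
	fmt.Println(l.receiveTable("mysql-1.db.tb", "(a,b)")) // pre-filled: accepted
	fmt.Println(l.receiveTable("mysql-1.db.tb", "(a,c)")) // already received: error
}
```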

Collaborator Author

@GMHDBJD GMHDBJD Mar 24, 2021

So you mean: if we get an info for a table whose table info we already received, and the schemas differ, we should report an error?

Contributor

@lichunzhu lichunzhu Mar 24, 2021

Yes. But I don't think we should report an error in situation 3.

Collaborator Author

@GMHDBJD GMHDBJD Mar 24, 2021

Some workers didn't put their lock Info into etcd before the restart. These tables' l.tables will be set to l.joined at first

That means the table hasn't been received yet: l.received[x][x][x] == false.

In the unit tests we have some idempotent TrySync calls; how do we deal with that case? 🤔

@lance6716 lance6716 added this to the v2.0.2 milestone Mar 24, 2021
Comment on lines 659 to 669
# start=1
# end=35
# except=(024 025 029)
# for i in $(seq -f "%03g" ${start} ${end}); do
# if [[ ${except[@]} =~ $i ]]; then
# continue
# fi
# DM_${i}
# sleep 1
# done
# DM_RENAME_COLUMN_OPTIMISTIC
Contributor

for debug?

@GMHDBJD GMHDBJD changed the title optimistic: use tableInfoBefore instead of joined table if table not exist bug-fix: use table-info-before always and fix bug for recover lock in optimistic Mar 25, 2021
Collaborator

@lance6716 lance6716 left a comment

will review later

newJoined, err := joined.Join(schemacmp.Encode(info.TableInfoBefore))
// ignore error, will report it in TrySync later
if err != nil {
o.logger.Error(fmt.Sprintf("fail to join table info %s with %s, lockID: %s in recover lock", joined, newJoined, lockID), log.ShortError(err))
Collaborator

we could add an alerting rule for errors of Optimist initialization in the future
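
The pattern in the snippet under review — join every table-info-before during recovery, log join failures, and let a later TrySync surface the real error — might be sketched like this. The join here is a toy column-set merge for illustration only, not schemacmp's lattice join:

```go
package main

import (
	"fmt"
	"log"
)

// Table is a toy schema: column name -> type. The real code uses
// schemacmp.Table, whose Join computes a lattice join of two schemas.
type Table map[string]string

// join merges two toy schemas and fails on a column typed differently in
// each input, loosely mimicking an incompatible-schema join error.
func join(a, b Table) (Table, error) {
	out := Table{}
	for c, t := range a {
		out[c] = t
	}
	for c, t := range b {
		if prev, ok := out[c]; ok && prev != t {
			return nil, fmt.Errorf("column %s: %s vs %s", c, prev, t)
		}
		out[c] = t
	}
	return out, nil
}

func main() {
	joined := Table{"a": "int"}
	befores := []Table{{"a": "int", "b": "int"}, {"b": "varchar"}}
	for _, ti := range befores {
		nj, err := join(joined, ti)
		if err != nil {
			// ignore the error here; TrySync will report it later
			log.Printf("fail to join table info during recover: %v", err)
			continue
		}
		joined = nj
	}
	fmt.Println(len(joined)) // joined keeps the last successful result
}
```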

infos = sortInfos(ifm)
c.Assert(len(infos), Equals, 3)

i11.Version = 2
Collaborator

seems we could replace Info.Version with Revision (could file another PR)

i11 = NewInfo(task, source1, upSchema, upTable, downSchema, downTable, DDLs1, ti0, []*model.TableInfo{ti1})
i21 = NewInfo(task, source2, upSchema, upTable, downSchema, downTable, DDLs2, ti2, []*model.TableInfo{ti3})

tts = []TargetTable{
Collaborator

"target" in TargetTable makes me think of "source/target", where "target" means downstream tables. We could rename it to "TableMap" or something in the future

@@ -1580,3 +1584,81 @@ func newInfoWithVersion(task, source, upSchema, upTable, downSchema, downTable s
info.Version = vers[source][upSchema][upTable]
return info
}

func (t *testLock) TestLockTrySyncReceived(c *C) {
Collaborator

which case does it test?

Collaborator Author

renamed in 814d7a7. Now we use table-info-before for all tables, so creating a different index no longer conflicts.

tests/shardddl1/run.sh Outdated Show resolved Hide resolved
Contributor

@lichunzhu lichunzhu left a comment

Rest LGTM

run_sql_source2 "insert into ${shardddl1}.${tb2} values(7,7);"
check_log_contain_with_retry "putted a shard DDL.*tb2.*ALTER TABLE .* ADD COLUMN" $WORK_DIR/worker1/log/dm-worker.log $WORK_DIR/worker2/log/dm-worker.log

# recover lock, tb1's info: (a,b,c)->(a,c); tb2's info: (a)->(a,b)
Contributor

Do we cover case #4 here?

Collaborator Author

I think yes? The order cannot be determined, but we no longer depend on the order.

@lichunzhu
Contributor

/lgtm

@ti-chi-bot ti-chi-bot added the status/LGT1 One reviewer already commented LGTM label Mar 25, 2021
@lance6716
Collaborator

/lgtm

@ti-chi-bot
Member

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • lance6716
  • lichunzhu

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by writing /lgtm in a comment.
Reviewer can cancel approval by writing /lgtm cancel in a comment.

@ti-chi-bot ti-chi-bot added status/LGT2 Two reviewers already commented LGTM, ready for merge and removed status/LGT1 One reviewer already commented LGTM labels Mar 26, 2021
@lance6716
Collaborator

/run-compatibility-tests

@GMHDBJD GMHDBJD merged commit 69f65ca into pingcap:master Mar 26, 2021
@GMHDBJD GMHDBJD deleted the useTableInfoBefore branch March 26, 2021 02:56
ti-srebot pushed a commit to ti-srebot/dm that referenced this pull request Mar 26, 2021
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-srebot

cherry pick to release-2.0 in PR #1536

@ti-srebot ti-srebot added already-cherry-pick-2.0 The related PR is already cherry-picked to release-2.0. Add this label once the PR is cherry-picked and removed needs-cherry-pick-release-2.0 This PR should be cherry-picked to release-2.0. Remove this label after cherry-picked to release-2.0 labels Mar 26, 2021
ti-chi-bot pushed a commit that referenced this pull request Mar 26, 2021