Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve robustness of Backup/Restore involving external_storage (#7917) #7965

Merged
merged 4 commits into from
May 29, 2020

Conversation

sre-bot
Copy link
Contributor

@sre-bot sre-bot commented May 29, 2020

cherry-pick #7917 to release-3.1


What problem does this PR solve?

Issue Number: close #7375, close #7850, close #7880, close tidb-challenge-program/bug-hunting-issue#72

Problem Summary:

The backup/restore did not have timeout and retry. This causes a single spurious network error to either catastrophically kill the entire BR task or block the BR thread pool forever.

What is changed and how it works?

What's Changed:

  1. ("Restore" supports retrying externally through restore: Make all download error as retryable pingcap/br#298, so for now we're not going to implement retry for "Restore").
  2. The "Backup" gRPC produces a stream, which cannot be retried externally. Therefore, we implemented retry directly in the backup service (ExternalStorage::write in particular) to fix AWS br backup failed when collation is open  #7850.
  3. The produced backup file can have size up to 135 MiB. This is not a good idea to retry the upload entirely. So we implemented Multipart Upload on S3. Unfortunately tame-gcs does not support resumable upload Support resumable upload EmbarkStudios/tame-gcs#24 so this won't work for GCS. (This also fix external_storage: if S3 upload cannot finish within 15 minutes it will fail #7375.)
    • Unfortunately, we need to copy the SST file into memory and clone several times in order to retry.
  4. We have also replaced reqwest by hyper so those convert_request and convert_response are no longer needed.
  5. For "Restore", we added a timeout around ExternalStorage::read, so that a stuck download won't block the thread pool forever, to fix After backup/restore stuck, new backup cannot work until all tikv are deleted #7880.

Related changes

  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Manual test (add detailed scripts or steps below)
    • Backup and restore against a minio S3 storage.

Side effects

  • Performance regression
    • Consumes more MEM: Backup may need to copy the entire SST file several times.
    • Multi-part upload and retrying would produce many more requests per file.

Release note

  • Improved robustness of backup and restore for S3 and GCS storage.

Signed-off-by: sre-bot <sre-bot@pingcap.com>
@sre-bot
Copy link
Contributor Author

sre-bot commented May 29, 2020

/run-all-tests

@sre-bot sre-bot added component/backup-restore Component: backup, import, external_storage type/cherry-pick Type: PR - Cherry pick labels May 29, 2020
@sre-bot sre-bot added this to the v3.1.2 milestone May 29, 2020
@5kbpers
Copy link
Member

5kbpers commented May 29, 2020

Please fix conflicts

@kennytm kennytm force-pushed the release-3.1-3c667df06126 branch 3 times, most recently from 7750c03 to 6559431 Compare May 29, 2020 08:14
Signed-off-by: kennytm <kennytm@gmail.com>
@kennytm
Copy link
Contributor

kennytm commented May 29, 2020

/run-integration-ddl-test

/run-integration-compatibility-test

@5kbpers
Copy link
Member

5kbpers commented May 29, 2020

/merge

@5kbpers 5kbpers closed this May 29, 2020
@sre-bot sre-bot added the status/can-merge Indicates a PR has been approved by a committer. label May 29, 2020
@sre-bot
Copy link
Contributor Author

sre-bot commented May 29, 2020

Your auto merge job has been accepted, waiting for:

  • 7905
  • 7962

@5kbpers 5kbpers reopened this May 29, 2020
@5kbpers
Copy link
Member

5kbpers commented May 29, 2020

/merge

@sre-bot
Copy link
Contributor Author

sre-bot commented May 29, 2020

Your auto merge job has been accepted, waiting for:

  • 7905
  • 7962

@5kbpers
Copy link
Member

5kbpers commented May 29, 2020

/run-integration-ddl-test

@5kbpers
Copy link
Member

5kbpers commented May 29, 2020

/merge

@sre-bot
Copy link
Contributor Author

sre-bot commented May 29, 2020

Your auto merge job has been accepted, waiting for:

  • 7962

@5kbpers
Copy link
Member

5kbpers commented May 29, 2020

/merge

@sre-bot
Copy link
Contributor Author

sre-bot commented May 29, 2020

Your auto merge job has been accepted, waiting for:

  • 7962

@5kbpers
Copy link
Member

5kbpers commented May 29, 2020

/merge

@sre-bot
Copy link
Contributor Author

sre-bot commented May 29, 2020

Your auto merge job has been accepted, waiting for:

  • 7962

@sre-bot
Copy link
Contributor Author

sre-bot commented May 29, 2020

/run-all-tests

@sre-bot
Copy link
Contributor Author

sre-bot commented May 29, 2020

@sre-bot merge failed.

@sre-bot
Copy link
Contributor Author

sre-bot commented May 29, 2020

/run-all-tests

@sre-bot sre-bot merged commit 768711d into tikv:release-3.1 May 29, 2020
5kbpers pushed a commit to 5kbpers/tikv that referenced this pull request Jun 2, 2020
…#7917) (tikv#7965)

Signed-off-by: sre-bot <sre-bot@pingcap.com>
Signed-off-by: kennytm <kennytm@gmail.com>
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
5kbpers pushed a commit to 5kbpers/tikv that referenced this pull request Jun 2, 2020
…#7917) (tikv#7965)

Signed-off-by: sre-bot <sre-bot@pingcap.com>
Signed-off-by: kennytm <kennytm@gmail.com>
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/backup-restore Component: backup, import, external_storage status/can-merge Indicates a PR has been approved by a committer. type/cherry-pick Type: PR - Cherry pick
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants