Bug Report: v18.0.0-rc1 S3 backup failed to upload #14188

Closed
Tracked by #13990
L3o-pold opened this issue Oct 5, 2023 · 31 comments · Fixed by #16806

Comments

@L3o-pold
Collaborator

L3o-pold commented Oct 5, 2023

Overview of the Issue

After upgrading from v17.0.3 to v18.0.0-rc1, our backups fail to upload to S3-compatible storage (not AWS itself, but something more like Minio).

E1005 07:20:00.774018      14 main.go:56] rpc error: code = Unknown desc = TabletManager.Backup on commerce-0000001010 error: MultipartUpload: upload multipart failed
	upload id: ZGQ5OWUxNWYtY2IzNS00MDdjLWI1OGItMWNkZDA4M2ZlYzRk
caused by: RequestCanceled: request context canceled
caused by: context canceled: MultipartUpload: upload multipart failed
	upload id: ZGQ5OWUxNWYtY2IzNS00MDdjLWI1OGItMWNkZDA4M2ZlYzRk
caused by: RequestCanceled: request context canceled
caused by: context canceled

Reproduction Steps

/vt/bin/vtctldclient --server=vtctld:15999 --logtostderr=true BackupShard --concurrency=1 "commerce/-"

Running vttablet and vtctld with the following flags:

--backup_storage_implementation=s3 \
--backup_engine_implementation=xtrabackup \
--s3_backup_aws_region=us-east-1 \
--s3_backup_storage_bucket=default \
--s3_backup_force_path_style=true \
--s3_backup_aws_endpoint=https://* \
--s3_backup_storage_root=*

Binary Version

vtctldclient version Version: 18.0.0-rc1 (Git revision 6ab165ade925b35a00cf447827d874eba13998b6 branch 'heads/v18.0.0-rc1') built on Tue Oct  3 15:00:58 UTC 2023 by vitess@buildkitsandbox using go1.21.1 linux/amd64

Operating System and Environment details

kubernetes / docker official vitess image

Log Fragments

No response

@L3o-pold L3o-pold added the Type: Bug and Needs Triage labels Oct 5, 2023
@deepthi
Member

deepthi commented Oct 5, 2023

@L3o-pold can you please capture the log from vttablet as well? None of us maintainers have access to a minio environment so this will be tough to debug.

@deepthi
Member

deepthi commented Oct 5, 2023

Our minio dependency hasn't changed in years. However, the aws-sdk-go dependency has changed. That would be the first place to check.
https://github.com/vitessio/vitess/blame/main/go.mod#L14
You can see the diff here 9d2029a

If you are able to build an image with the older version of aws-sdk-go, and that works, that will give us a starting point. But in any case we'll need the vttablet logs, not just the vtctld logs.
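
For anyone attempting that experiment, a rough sketch of how the SDK could be pinned back before rebuilding the image (the exact version tag has to be taken from the v17 branch's go.mod; these steps are illustrative, not the project's official build procedure):

# in the vitess source tree
go get github.com/aws/aws-sdk-go@<v17-version-tag>
go mod tidy
# then rebuild and deploy the vttablet/vtctld image the same way you normally do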

@L3o-pold
Collaborator Author

L3o-pold commented Oct 6, 2023

@deepthi we are not using Minio, but an AWS-compatible storage service much like it.
vttablet logs are the same:

vttablet W1006 00:03:11.814578       1 rpc_server.go:72] TabletManager.Backup(concurrency:1)(on commerce-0000001012 from ) error: MultipartUpload: upload multipart failed
vttablet 	upload id: M2MzYjJkY2YtODI4Yi00NGE5LWFkMjgtYTIwYzU2NmM4Yjdh
vttablet caused by: RequestCanceled: request context canceled
vttablet caused by: context canceled

I'll try to see if I can reproduce with a simple Minio Docker instance, and check with the team in charge of our storage solution to see if they have any logs on their side.

@L3o-pold
Collaborator Author

L3o-pold commented Oct 6, 2023

I can reproduce it with a Minio Docker instance:

    minio:
        image: minio/minio:RELEASE.2023-09-30T07-02-29Z
        command: ["server", "/data", "--console-address", ":9001"]
        environment:
            MINIO_ROOT_USER: "test"
            MINIO_ROOT_PASSWORD: "test"
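
To reproduce against an instance like this, the bucket named by --s3_backup_storage_bucket has to exist first. A minimal sketch using the MinIO client (assuming the S3 API is reachable at http://minio:9000 with the credentials above; the alias name is arbitrary):

mc alias set local http://minio:9000 test test
mc mb local/default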

@L3o-pold
Collaborator Author

L3o-pold commented Oct 6, 2023

From my testing, downgrading aws-sdk-go has no impact, but reverting #12500 works.

@GuptaManan100 GuptaManan100 added the Component: Backup and Restore label and removed the Needs Triage label Oct 9, 2023
@deepthi
Member

deepthi commented Oct 9, 2023

I'm going to test with vanilla S3 to see if that exhibits the same problem. If yes, we will have to revert at least a portion of #12500 and then figure out a way forward.

@deepthi
Member

deepthi commented Oct 9, 2023

I have tested PR #12500 (or actually its exact counterpart on the planetscale fork) and backups with vanilla S3 work just fine. However, my test db is very small, and the whole vtbackup process (download previous backup from s3, restore and take a new backup and upload to s3) takes 10 minutes.
How big is your backup? Context canceled is telling us that a timeout is being hit.
Default timeout (--action_timeout) for all vtctl commands is 1 hour. Does your command take longer than that? Or does it error out exactly after 1 hour?
If yes, you need to extend that. If not, let me know and we'll see if we can think of something else.
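
If the one-hour limit turns out to be the problem, a hedged example of extending it, reusing the BackupShard invocation from the issue description (the 4h value is arbitrary):

/vt/bin/vtctldclient --server=vtctld:15999 --logtostderr=true --action_timeout=4h BackupShard --concurrency=1 "commerce/-"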

@L3o-pold
Collaborator Author

L3o-pold commented Oct 9, 2023

@deepthi 2GB, and it fails less than 2ms after xtrabackup finishes.

edit: tried with an empty database and same issue.

@harshit-gangal
Member

@deepthi Is PlanetScale using the new vtctld gRPC API to do the backup?

@deepthi
Member

deepthi commented Oct 10, 2023

@L3o-pold can you check whether builtinbackup works? You did specify the flags in your issue description, but I missed the fact that you are in fact using xtrabackup. My test was with vtbackup (which the operator uses) + builtin backup, not xtrabackup.

Also:

  • were you using the same command with v17? Or were you by any chance using vtctlclient instead of vtctldclient?
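
For the builtinbackup test, the only change needed should be the engine flag on vttablet; a hedged example, with everything else kept as in the issue description:

--backup_engine_implementation=builtin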

@L3o-pold
Collaborator Author

L3o-pold commented Oct 10, 2023

@deepthi --backup_engine_implementation=xtrabackup was specified in the issue, yeah :) I'll try with the builtin backup engine.

@L3o-pold
Collaborator Author

unsharded/- (local-0000000301): time:{seconds:1696962252 nanoseconds:352680626} file:"builtinbackupengine.go" line:609 value:"failed to backup files due to error."
unsharded/- (local-0000000301): time:{seconds:1696962252 nanoseconds:352697998} file:"builtinbackupengine.go" line:609 value:"failed to backup files due to error."
unsharded/- (local-0000000301): time:{seconds:1696962252 nanoseconds:352710499} file:"builtinbackupengine.go" line:609 value:"failed to backup files due to error."
unsharded/- (local-0000000301): time:{seconds:1696962254 nanoseconds:357451136} file:"builtinbackupengine.go" line:487 value:"resetting mysqld super_read_only to false"
unsharded/- (local-0000000301): time:{seconds:1696962254 nanoseconds:359553344} file:"builtinbackupengine.go" line:422 value:"resetting mysqld super_read_only to false"
unsharded/- (local-0000000301): time:{seconds:1696962254 nanoseconds:361549882} level:ERROR file:"backup.go" line:180 value:"backup is not usable, aborting it: [RequestCanceled: request context canceled\ncaused by: context canceled;RequestCanceled: request context canceled\ncaused by: context canceled]"
unsharded/- (local-0000000301): time:{seconds:1696962254 nanoseconds:376869259} level:ERROR file:"backup.go" line:188 value:"failed to finish backup: [MalformedXML: The XML you provided was not well-formed or did not validate against our published schema.\n\tstatus code: 400, request id: 178CD22DF0FC0D2A, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8]: %!v(MISSING)"
E1010 18:24:14.456267      40 main.go:56] rpc error: code = Unknown desc = TabletManager.Backup on local-0000000301 error: RequestCanceled: request context canceled
caused by: context canceled;RequestCanceled: request context canceled
caused by: context canceled: RequestCanceled: request context canceled
caused by: context canceled;RequestCanceled: request context canceled
caused by: context canceled

@L3o-pold
Collaborator Author

L3o-pold commented Oct 11, 2023

this change works:

[screenshot of a local code change to the S3 upload call; not reproduced here]
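
For readers without the screenshot, a hedged sketch of the kind of change being tested here, i.e. dropping the request context from the S3 upload call; the names below are placeholders and this is not the actual Vitess diff:

package s3sketch

import (
    "context"
    "io"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// upload shows the two call shapes side by side. withCtx=true corresponds to the
// v18 behaviour, where cancelling ctx aborts the multipart upload; withCtx=false
// corresponds to the pre-#12500 behaviour being tested above.
func upload(ctx context.Context, up *s3manager.Uploader, bucket, key string, body io.Reader, withCtx bool) error {
    in := &s3manager.UploadInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(key),
        Body:   body,
    }
    if withCtx {
        _, err := up.UploadWithContext(ctx, in) // tied to the request context
        return err
    }
    _, err := up.Upload(in) // runs to completion regardless of ctx
    return err
}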

@deepthi
Member

deepthi commented Oct 11, 2023

@L3o-pold there are 3 variables here

  • vanilla s3 vs your storage
  • builtinbackup vs xtrabackup -> you have shown that this doesn't matter
  • vtctlclient vs vtctldclient - can you try running with vtctlclient and tell us what happens?

The reason I don't want to just revert the context change is that the new code is "correct".

@L3o-pold
Collaborator Author

@deepthi

/vt/bin/vtctlclient --server=vtctld:15999 BackupShard -- --concurrency=1 --allow_primary "unsharded/-"
W1011 15:53:05.200409      68 log.go:39] Failed to read in config : Config File "vtconfig" Not Found in "[/]". This is optional, and can be ignored if you are not using config files. For a detailed explanation, see https://github.com/vitessio/vitess/blob/main/doc/viper/viper.md#config-files.
E1011 15:53:11.339150      68 main.go:96] E1011 15:53:11.338962 backup.go:110] E1011 15:53:11.338637 backup.go:180] backup is not usable, aborting it: [RequestCanceled: request context canceled
caused by: context canceled;RequestCanceled: request context canceled
caused by: context canceled]
E1011 15:53:11.342682      68 main.go:96] E1011 15:53:11.342484 backup.go:110] E1011 15:53:11.342019 backup.go:188] failed to finish backup: [MalformedXML: The XML you provided was not well-formed or did not validate against our published schema.
        status code: 400, request id: 178D18845BAAE316, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8]: %!v(MISSING)
BackupShard Error: rpc error: code = Unknown desc = TabletManager.Backup on local-0000000301 error: RequestCanceled: request context canceled
caused by: context canceled;RequestCanceled: request context canceled
caused by: context canceled: RequestCanceled: request context canceled
caused by: context canceled;RequestCanceled: request context canceled
caused by: context canceled
E1011 15:53:11.432335      68 main.go:105] remote error: rpc error: code = Unknown desc = TabletManager.Backup on local-0000000301 error: RequestCanceled: request context canceled
caused by: context canceled;RequestCanceled: request context canceled
caused by: context canceled: RequestCanceled: request context canceled
caused by: context canceled;RequestCanceled: request context canceled
caused by: context canceled

@deepthi
Member

deepthi commented Oct 12, 2023

I just tried the same thing against s3 (without --allow_primary, so it happened on a replica) and the backup succeeds. So at this point I'm going to say that this seems to be specific to your storage provider.
Especially given that you seem to be getting an error returned from the call to UploadWithContext.

@L3o-pold
Collaborator Author

Multiple storage providers, as we can replicate the behavior with both Swift and Minio.

@L3o-pold
Collaborator Author

Similar to #8391

@deepthi
Member

deepthi commented Oct 12, 2023

In order to scope this down a bit further, we can try this:

  • my test was with the PlanetScale fork, I can re-test with vanilla vitess
  • is it possible for you to test with an actual s3 bucket instead of Minio/Swift?

@L3o-pold
Collaborator Author

We are not using Amazon S3 because we are on-premise. Maybe it works with actual Amazon S3 but not with S3-compatible storage... it used to work in v16 and v17, though.

@deepthi
Member

deepthi commented Oct 15, 2023

This is a little bit of a gray area: on the one hand, we don't "officially" support S3-like (non-S3) storage backends. On the other hand, it has worked for a long time and you and others are relying on it.
So for now, I'm willing for us to revert the change that is causing this error, go back to using Upload instead of UploadWithContext, and then try to figure out why the latter is not working as expected.
Will you be able to do a PR to make just that change? We don't want to revert the stats changes from #12500.

cc @maxenglander

@deepthi deepthi added this to the v18.0.0 milestone Oct 15, 2023
@deepthi
Member

deepthi commented Oct 16, 2023

@L3o-pold there's one more thing you should do.
Can you run this again with some logging added? Log the context and its deadline in s3.go just before we call Upload/UploadWithContext, and at the beginning of the rpc call in rpc_backup.go. That will tell us if we are somehow overriding the original context or its deadline.

rpc_backup.go

func (tm *TabletManager) Backup(ctx context.Context, logger logutil.Logger, req *tabletmanagerdatapb.BackupRequest) error {
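
A hedged sketch of the requested logging; the helper name is made up, and log.Infof stands in for whatever logger the surrounding files already use (e.g. vitess.io/vitess/go/vt/log):

import (
    "context"
    "time"

    "vitess.io/vitess/go/vt/log"
)

// logCtx prints a context, its deadline, and the time remaining before it expires.
func logCtx(where string, ctx context.Context) {
    if deadline, ok := ctx.Deadline(); ok {
        log.Infof("%s: ctx=%v deadline=%v (remaining %v)", where, ctx, deadline, time.Until(deadline))
        return
    }
    log.Infof("%s: ctx=%v (no deadline)", where, ctx)
}

// called as logCtx("Backup", ctx) at the top of TabletManager.Backup in rpc_backup.go,
// and as logCtx("AddFile", ctx) in s3.go just before the Upload/UploadWithContext call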

@L3o-pold
Collaborator Author

I'll check for this tomorrow.

@L3o-pold
Collaborator Author

@deepthi

rpc_backup.go:44] Backup: [s3] context.Background.WithValue(type transport.connectionKey, val <not Stringer>).WithValue(type peer.peerKey, val <not Stringer>).WithDeadline(2023-10-18 05:47:58.845009687 +0000 UTC m=+3647.749218339 [59m59.989854174s]).WithValue(type metadata.mdIncomingKey, val <not Stringer>).WithValue(type grpc.streamKey, val <not Stringer>).WithValue(type grpc.rpcInfoContextKey, val <not Stringer>).WithValue(type callinfo.key, val <not Stringer>)

s3.go:176] AddFile: [s3] context.Background.WithValue(type transport.connectionKey, val <not Stringer>).WithValue(type peer.peerKey, val <not Stringer>).WithDeadline(2023-10-18 05:47:58.845009687 +0000 UTC m=+3647.749218339 [59m55.920480222s]).WithValue(type metadata.mdIncomingKey, val <not Stringer>).WithValue(type grpc.streamKey, val <not Stringer>).WithValue(type grpc.rpcInfoContextKey, val <not Stringer>).WithValue(type callinfo.key, val <not Stringer>).WithCancel

@deepthi
Member

deepthi commented Oct 19, 2023

So the logs tell us that we are not messing up the context when passing it in, AND that it does have a 1-hour timeout. This makes it more likely that this is somehow not supported on the server side, i.e. by the specific storage solution you are using. Do you get any logs on the storage server that could help?

@deepthi deepthi mentioned this issue Oct 19, 2023
@L3o-pold
Collaborator Author

L3o-pold commented Oct 19, 2023

@deepthi the only thing I have in the (Minio) server-side logs is a 499 HTTP status (client/vttablet disconnection): #8391 (comment)

@deepthi
Member

deepthi commented Oct 19, 2023

Then the next thing I can think of is some sort of gRPC keepalive timeout: the client (vttablet) is canceling because there's no response within some shorter timeout on the gRPC connection.
There was an issue of that sort some time ago (#9195), but the error message is different.

@dbussink
Contributor

The error message looks like an explicit cancel on the context was somehow triggered. It doesn't look like an expiry. And these contexts are not really something the server even sees; they are a purely client-side concern.

I'd be curious to know whether, in your case, EndBackup ever ends up being called. That is what should wait for the uploads to complete.

@L3o-pold
Collaborator Author

@dbussink EndBackup does not seem to be called

@dbussink
Contributor

@dbussink EndBackup does not seem to be called

That would indicate that we also don't wait on the background goroutine that uploads, which means the request completes and the context is cancelled. Which explains why this happens.

To be honest, I don't think this is related to S3 / Minio or whatever the implementation is. This is a bug that happens independently of that. cc @deepthi
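
A stripped-down, hypothetical illustration of the failure mode described above; none of these names come from the Vitess code, and this is not the fix that eventually landed in #16806:

package main

import (
    "context"
    "fmt"
    "sync"
    "time"
)

// upload stands in for the S3 multipart upload; it fails as soon as ctx is cancelled.
func upload(ctx context.Context) error {
    select {
    case <-time.After(2 * time.Second): // pretend the upload takes a while
        return nil
    case <-ctx.Done():
        return ctx.Err() // "context canceled", as in the errors above
    }
}

// buggy mirrors the behaviour described above: the upload runs in a goroutine tied
// to the request context, and nothing waits for it before the handler returns.
func buggy(reqCtx context.Context) {
    go func() { fmt.Println("buggy upload:", upload(reqCtx)) }()
    // the handler returns here; the gRPC layer then cancels reqCtx
}

// waited blocks until the upload finishes, so the request context is still live
// while the upload runs.
func waited(reqCtx context.Context) {
    var wg sync.WaitGroup
    wg.Add(1)
    go func() { defer wg.Done(); fmt.Println("waited upload:", upload(reqCtx)) }()
    wg.Wait()
}

func main() {
    // simulate a gRPC request context that is cancelled once the handler returns
    ctx, cancel := context.WithCancel(context.Background())
    buggy(ctx)
    cancel()                           // handler returned -> context cancelled
    time.Sleep(100 * time.Millisecond) // give the goroutine time to report the error

    ctx2, cancel2 := context.WithCancel(context.Background())
    waited(ctx2) // blocks ~2s; the upload completes before any cancellation
    cancel2()
}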

@frouioui
Member

I opened #16806 to fix this issue. The PR contains my interpretation of the issue too.

@frouioui frouioui modified the milestones: v20.0.0, v21.0.0 Sep 18, 2024