Bug Report: v18.0.0-rc1 S3 backup failed to upload #14188
Comments
@L3o-pold can you please capture the log from vttablet as well? None of us maintainers have access to a minio environment so this will be tough to debug.
Our minio dependency hasn't changed in years. However, the aws-sdk-go dependency has. If you are able to build an image with the older version of aws-sdk-go, and that works, that will give us a starting point. But in any case we'll need the vttablet logs, not just the vtctld logs.
@deepthi we are not using Minio itself, but an AWS S3-compatible storage similar to it.
I'll try to see if I can reproduce it with a simple Minio Docker instance, and check with the team in charge of our storage solution to see if they have any logs on their side.
I can reproduce it with a Minio Docker instance:

```yaml
minio:
  image: minio/minio:RELEASE.2023-09-30T07-02-29Z
  command: ["server", "/data", "--console-address", ":9001"]
  environment:
    MINIO_ROOT_USER: "test"
    MINIO_ROOT_PASSWORD: "test"
```
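For anyone trying to reproduce this, a quick way to check that such a Minio instance accepts uploads independently of Vitess is a minimal uploader built on aws-sdk-go v1 (the SDK family referenced earlier in this thread). This is just an illustrative sketch: the endpoint, credentials, and bucket name below are placeholders matching the compose snippet above, and the bucket is assumed to already exist.

```go
package main

import (
	"log"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	// Point the SDK at the local Minio container (placeholder endpoint/credentials).
	sess, err := session.NewSession(&aws.Config{
		Endpoint:         aws.String("http://localhost:9000"),
		Region:           aws.String("us-east-1"),
		Credentials:      credentials.NewStaticCredentials("test", "test", ""),
		S3ForcePathStyle: aws.Bool(true), // Minio needs path-style addressing
	})
	if err != nil {
		log.Fatal(err)
	}

	// s3manager handles multipart uploads, similar in spirit to a backup upload.
	uploader := s3manager.NewUploader(sess)
	out, err := uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("vitess-backups"), // placeholder bucket, must already exist
		Key:    aws.String("smoke-test.txt"),
		Body:   strings.NewReader("hello from the upload smoke test"),
	})
	if err != nil {
		log.Fatalf("upload failed: %v", err)
	}
	log.Printf("uploaded to %s", out.Location)
}
```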
From my testing, downgrading the
I'm going to test with vanilla S3 to see if that exhibits the same problem. If yes, we will have to revert at least a portion of #12500 and then figure out a way forward.
I have tested PR #12500 (or actually its exact counterpart on the planetscale fork) and backups with vanilla S3 work just fine. However, my test db is very small, and the whole vtbackup process (download previous backup from s3, restore, take a new backup, and upload to s3) takes 10 minutes.
@deepthi 2GB, and it fails in less than 2ms once xtrabackup has finished. Edit: tried with an empty database, same issue.
@deepthi Is PlanetScale using the new vtctld gRPC API to do the backup?
@L3o-pold can you try whether builtinbackup works? You did specify the flags in your issue description, but I missed the fact that you are in fact using xtrabackup. My test was with vtbackup (which the operator uses) + the builtin backup engine, not xtrabackup. Also:
@deepthi |
@L3o-pold there are 3 variables here
The reason I don't want to just revert the context change is that the new code is "correct". |
I just tried the same thing against s3 (without --allow_primary, so it happened on a replica) and the backup succeeds. So at this point I'm going to say that this seems to be specific to your storage provider.
Multiple storage providers, as we reproduce the behavior with both Swift and Minio.
Similar to #8391 |
In order to scope this down a bit further, we can try this:
We are not using Amazon S3 because we are on-premise. Maybe it works with pure Amazon S3 but not with S3-compatible storage... it used to work in v16 and v17, though.
This is a little bit of a gray area: on the one hand, we don't "officially" support S3-like (non-AWS) storage backends. On the other hand, it has worked for a long time and you and others are relying on it.
@L3o-pold there's one more thing you should do, in rpc_backup.go:
I'll check for this tomorrow. |
So the logs tell us that we are not messing up the context when passing it in, AND that it does have a 1 hour timeout. This makes it more likely that somehow this is not being supported on the server side, i.e. the specific storage solution you are using. Do you get any logs on the storage server that can help?
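For reference, the kind of check discussed above — confirming that the context handed to the backup code really carries the expected one-hour deadline — only takes a couple of lines of Go. This is an illustrative sketch, not the actual Vitess code:

```go
package main

import (
	"context"
	"log"
	"time"
)

// logContextState reports whether a context carries a deadline,
// how much time remains, and whether it is already done.
func logContextState(ctx context.Context, label string) {
	if deadline, ok := ctx.Deadline(); ok {
		log.Printf("%s: deadline in %s (at %s)", label, time.Until(deadline).Round(time.Second), deadline)
	} else {
		log.Printf("%s: no deadline set", label)
	}
	if err := ctx.Err(); err != nil {
		log.Printf("%s: context already done: %v", label, err)
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Hour)
	defer cancel()
	logContextState(ctx, "backup request") // e.g. "backup request: deadline in 1h0m0s (...)"
}
```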
@deepthi the only thing I have in the (Minio) server-side log is a 499 HTTP status (client/vttablet disconnection): #8391 (comment)
Then the next thing I can think of is some sort of gRPC keepalive timeout. The client (vttablet) is canceling because there's no response within some shorter timeout on the gRPC connection.
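To make that hypothesis concrete: gRPC clients in Go can be configured with keepalive parameters, and if pings go unanswered within the timeout the connection is torn down and in-flight RPCs are cancelled. Below is a minimal sketch of what client-side keepalive settings look like in grpc-go; the address and values are illustrative, not Vitess's actual configuration:

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Illustrative client-side keepalive settings: ping after 10s of
	// inactivity and close the connection (cancelling in-flight RPCs)
	// if no ack arrives within another 10s.
	kp := keepalive.ClientParameters{
		Time:                10 * time.Second,
		Timeout:             10 * time.Second,
		PermitWithoutStream: true,
	}

	conn, err := grpc.Dial(
		"vttablet:16002", // placeholder address
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(kp),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	log.Printf("connection state: %v", conn.GetState())
}
```

If the storage side (or a proxy in between) silently drops long-idle connections, this is the kind of setting that would surface as an abrupt client-side cancellation.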
The error message looks like somehow an explicit cancel on the context was triggered. It doesn't look like an expiry. And these contexts are not really something the server even sees; it's a purely client-side concern. I'd be curious to know if in your case
@dbussink |
That would indicate that we also don't wait on the background goroutine that uploads. Which means that the request completes and the context is cancelled, which explains why this happens. To be honest, I don't think this is related to S3 / Minio or whatever the implementation is. This is a bug that happens independently of that. cc @deepthi
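A minimal, self-contained sketch of the failure mode described here (not the actual Vitess code): a request-scoped context is handed to a background upload goroutine, the request finishes and cancels the context, and the still-running upload fails with context.Canceled rather than a deadline expiry.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowUpload stands in for an S3 multipart upload that respects ctx.
func slowUpload(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Second): // pretend the upload takes 2s
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Request-scoped context, cancelled as soon as the "RPC handler" returns.
	ctx, cancel := context.WithCancel(context.Background())

	errCh := make(chan error, 1)
	go func() { errCh <- slowUpload(ctx) }() // background upload goroutine

	// The handler returns without waiting for the upload...
	cancel()

	// ...so the upload observes an explicit cancellation, not a timeout.
	err := <-errCh
	fmt.Println(errors.Is(err, context.Canceled)) // true
}
```

Waiting on errCh before returning (or, on Go 1.21+, detaching the background work with context.WithoutCancel) avoids the premature cancellation.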
I opened #16806 to fix this issue. The PR contains my interpretation of the issue too. |
Overview of the Issue
After upgrading from v17.0.3 to v18.0.0-rc1, our backups fail to upload to S3-like storage (not AWS, but a Minio-like service).
Reproduction Steps
/vt/bin/vtctldclient --server=vtctld:15999 --logtostderr=true BackupShard --concurrency=1 "commerce/-"
Running vttablet and vtctld with the following flags:
Binary Version
vtctldclient version Version: 18.0.0-rc1 (Git revision 6ab165ade925b35a00cf447827d874eba13998b6 branch 'heads/v18.0.0-rc1') built on Tue Oct 3 15:00:58 UTC 2023 by vitess@buildkitsandbox using go1.21.1 linux/amd64
Operating System and Environment details
Log Fragments
No response