spanner is pegging the backend #1687
Comments
Update: this appears to be fixed if the returned transaction has a non-nil ID. But I think this highlights a general scariness around that for-loop. In general, we should not have unprotected looping of RPCs: any bug (now or in the future) that causes that loop to spin will peg the backend.
This is basically the same issue as #1662. If the server starts returning a permanent error for The pause block mentioned above is intended for the situation when the worker does not need to do anything, and the worker is paused to avoid burning CPU unnecessarily. The exponential backoff that we should implement in case of an error should be separate from that, or at least also take that condition into account.
I'm wondering whether we should solve this problem by adding a circuit breaker to the for loop instead of an exponential backoff. My reasoning and suggestion for this is:
There is no error happening here. The problem is that the missing ID in the Transaction response leads to an unhealthy write transaction and google-cloud-go/spanner/session.go Line 1178 in 25d2e81
So it enters an endless loop: there is always an unhealthy If we want to add the circuit breaker, we need to distinguish the cases:
Also, there is a sleep interval when no work needs to be done. But currently, we don't have any sleep interval between two consecutive health checks (when Exponential backoff may not work properly here, because if we have a number of unhealthy sessions, it should run normally instead of waiting exponentially.
In my opinion, we should use the value that is returned here to determine whether we should stop the process of preparing transactions: google-cloud-go/spanner/session.go Line 1181 in 25d2e81
That method will also normally not return an error if an empty transaction is returned by the server (which, by the way, should not be possible), but that check can easily be added to the method. The reason I think we should use the return value of that method to stop the prepare process is that it will also catch any errors that might be introduced by other bugs, server problems, etc.
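To make the suggestion above concrete, here is a minimal sketch of using the prepare step's return value to stop the loop. It is not the code in session.go: the names checkTxID, prepareAll and errEmptyTxID are hypothetical, the prepare callback stands in for the real BeginTransaction call, and it assumes an empty transaction ID should be treated as an error.

```go
package main

import (
	"errors"
	"fmt"
)

// errEmptyTxID is a hypothetical error for a response that carries no transaction ID.
var errEmptyTxID = errors.New("server returned a transaction without an ID")

// checkTxID rejects an empty transaction ID instead of treating it as success.
func checkTxID(id []byte) error {
	if len(id) == 0 {
		return errEmptyTxID
	}
	return nil
}

// prepareAll calls prepare once per session and stops at the first error,
// so a persistent failure cannot turn into an unbounded stream of RPCs.
// It returns the number of sessions prepared and the error that stopped it.
func prepareAll(sessions int, prepare func(i int) ([]byte, error)) (int, error) {
	for i := 0; i < sessions; i++ {
		id, err := prepare(i)
		if err == nil {
			err = checkTxID(id)
		}
		if err != nil {
			return i, err
		}
	}
	return sessions, nil
}

func main() {
	calls := 0
	// Simulated server bug: every response carries an empty transaction ID.
	n, err := prepareAll(5, func(int) ([]byte, error) {
		calls++
		return nil, nil
	})
	fmt.Println("prepared:", n, "calls:", calls, "err:", err)
}
```

Because the loop exits on the first failed prepare, the same check would also catch failures from other bugs or server problems, not only the empty-ID case.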
You're right. In my testing, this method does not return any error at the moment. I agree that it will be good to use this error for the circuit breaker. My only concern is that there is no sleep interval between two consecutive health checks. Should we add a sleep interval? Because we may end up with a risk again that
I would rather explore whether we should split the two things that this for-loop is doing into two different loops/methods, as the error handling that we would want to apply to these are probably different:
I don't think we should add any sleep between two consecutive health checks (pings) if they succeed. The health checker should be allowed to ping sessions when deemed necessary without any artificial delay. We should however ensure that the
Thanks a lot for the detailed explanation. This sounds good to me. BTW: I think the spanner in
https://code-review.googlesource.com/c/gocloud/+/49030 is a fix for this.
1. Start a basic echo server (something like https://gist.github.com/jadekler/b85fdf6a7a71583419e05763bbb1ce7c) that sends an empty transaction &tpb.Transaction{} in BeginTransaction
2. Start the client library
3. See thousands of BeginTransaction requests per second to the echo server

@skuruppu believes it is this block:
google-cloud-go/spanner/session.go
Lines 1171 to 1211 in 25d2e81
Likely this entire for-loop needs to be re-written to have exponential backoff. We should probably use https://godoc.org/github.com/googleapis/gax-go/v2 instead of the "sometimes pause 100ms"
google-cloud-go/spanner/session.go
Lines 1195 to 1209 in 25d2e81
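As a rough illustration of what backoff in that loop could look like, the sketch below uses only the standard library. It is not the gax-go API and not the code in session.go: the backoff type and its fields are hypothetical, jitter is left out to keep the delays predictable, and the loop prints the pauses instead of sleeping.

```go
package main

import (
	"fmt"
	"time"
)

// backoff tracks the delay between retries. The type and field names are illustrative.
type backoff struct {
	initial, max time.Duration
	multiplier   float64
	cur          time.Duration
}

// pause returns the delay to wait before the next attempt and grows
// the stored delay by multiplier, capped at max.
func (b *backoff) pause() time.Duration {
	if b.cur == 0 {
		b.cur = b.initial
	}
	d := b.cur
	b.cur = time.Duration(float64(b.cur) * b.multiplier)
	if b.cur > b.max {
		b.cur = b.max
	}
	return d
}

// reset returns to the initial delay; call it after a successful attempt.
func (b *backoff) reset() { b.cur = 0 }

func main() {
	b := &backoff{initial: 100 * time.Millisecond, max: time.Second, multiplier: 2}
	// Simulated outcomes of consecutive RPCs: three failures, a success, one more failure.
	outcomes := []bool{false, false, false, true, false}
	for _, ok := range outcomes {
		if ok {
			b.reset()
			fmt.Println("success, no pause")
			continue
		}
		fmt.Println("failure, pause", b.pause())
	}
}
```

Resetting on success keeps healthy iterations free of artificial delay, while repeated failures are spaced further and further apart instead of spinning.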
Marking this a p1 bug since it seems like the kind of thing that can cause unintended DOS of GCP.