
spanner is pegging the backend #1687

Closed
jeanbza opened this issue Nov 29, 2019 · 9 comments
Labels: api: spanner, priority: p1, type: bug

jeanbza (Member) commented Nov 29, 2019:

1. Start a basic echo server (something like https://gist.github.com/jadekler/b85fdf6a7a71583419e05763bbb1ce7c) that returns an empty transaction (&tpb.Transaction{}) from BeginTransaction; a minimal sketch follows this list.
2. Start the client library.
3. Observe thousands of BeginTransaction requests per second hitting the echo server.
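
For reference, a minimal sketch of such an echo server (the linked gist is the original; this version assumes the generated spannerpb package exposes UnimplementedSpannerServer and only overrides BeginTransaction; a full repro would also need the session RPCs stubbed):

package main

import (
	"context"
	"log"
	"net"

	spannerpb "google.golang.org/genproto/googleapis/spanner/v1"
	"google.golang.org/grpc"
)

// echoServer answers BeginTransaction with an empty Transaction (no ID),
// which is the response that triggered the tight prepare loop in the client.
// All other RPCs fall through to UnimplementedSpannerServer.
type echoServer struct {
	spannerpb.UnimplementedSpannerServer
}

func (s *echoServer) BeginTransaction(ctx context.Context, req *spannerpb.BeginTransactionRequest) (*spannerpb.Transaction, error) {
	return &spannerpb.Transaction{}, nil // note: Id is deliberately left empty
}

func main() {
	lis, err := net.Listen("tcp", "localhost:9999")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	spannerpb.RegisterSpannerServer(srv, &echoServer{})
	log.Fatal(srv.Serve(lis))
}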

@skuruppu believes it is this block:

for {
	if hc.isClosing() {
		// Exit when the pool has been closed and all sessions have been
		// destroyed or when health checker has been closed.
		hc.waitWorkers.Done()
		return
	}
	ws := getNextForTx()
	if ws != nil {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		err := ws.prepareForWrite(ctx)
		cancel()
		if err != nil {
			// Skip handling prepare error, session can be prepared in next
			// cycle.
			log.Printf("Failed to prepare session, error: %v", toSpannerError(err))
		}
		hc.pool.recycle(ws)
		hc.pool.mu.Lock()
		hc.pool.prepareReqs--
		hc.pool.mu.Unlock()
		hc.markDone(ws)
	}
	rs := getNextForPing()
	if rs == nil {
		if ws == nil {
			// No work to be done so sleep to avoid burning CPU.
			pause := int64(100 * time.Millisecond)
			if pause > int64(hc.interval) {
				pause = int64(hc.interval)
			}
			select {
			case <-time.After(time.Duration(rand.Int63n(pause) + pause/2)):
			case <-hc.done:
			}
		}
		continue
	}
	hc.healthCheck(rs)
}

This entire for-loop likely needs to be rewritten to use exponential backoff. We should probably use https://godoc.org/github.com/googleapis/gax-go/v2 instead of the current "sometimes pause 100ms" logic:

if rs == nil {
	if ws == nil {
		// No work to be done so sleep to avoid burning CPU.
		pause := int64(100 * time.Millisecond)
		if pause > int64(hc.interval) {
			pause = int64(hc.interval)
		}
		select {
		case <-time.After(time.Duration(rand.Int63n(pause) + pause/2)):
		case <-hc.done:
		}
	}
	continue
}
The 100ms pause needs to be changed to exponential backoff, and the conditionality of the pause needs to be fixed: the sleep only happens when both ws and rs are nil, so otherwise the loop can iterate without any delay.
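
For illustration, a minimal, runnable sketch of the kind of gax-based pause that could replace the fixed ~100ms; the interval value, loop structure, and variable names are stand-ins, not the actual pool code:

package main

import (
	"fmt"
	"time"

	gax "github.com/googleapis/gax-go/v2"
)

func main() {
	// Stand-in for hc.interval; the real value comes from the session pool config.
	interval := 30 * time.Second

	// Jittered, exponentially growing pause, capped at the health-check interval.
	bo := gax.Backoff{Initial: 100 * time.Millisecond, Max: interval, Multiplier: 1.3}
	for i := 0; i < 5; i++ {
		// Each Pause() returns a randomized duration up to the current cap,
		// and the cap grows toward Max on every call.
		fmt.Println(bo.Pause())
	}

	// When the worker finds real work again, it would start a fresh backoff
	// so the next idle period begins at Initial instead of the grown value.
	bo = gax.Backoff{Initial: 100 * time.Millisecond, Max: interval, Multiplier: 1.3}
}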

Marking this as a p1 bug since it seems like the kind of thing that can cause an unintended DoS of GCP.

jeanbza (Member, Author) commented Nov 29, 2019:

Update: this appears to be fixed if the transaction returned has a non-nil ID. But I think it highlights a general riskiness in that for-loop: we should, in general, not have unprotected looping of RPCs. Any bug (now or in the future) that causes that loop to spin will peg the backend.

olavloite (Contributor) commented:

This is basically the same issue as #1662. If the server starts returning a permanent error for BeginTransaction (or something the client will see as an invalid read/write transaction), you could enter this infinite loop.

The pause block mentioned above is intended for the situation where the worker has nothing to do and is paused to avoid burning CPU unnecessarily. The exponential backoff that we should implement for the error case should be separate from that, or at least also take that condition into account.

olavloite (Contributor) commented:

I'm wondering whether we should solve this problem by adding a circuit breaker to the for-loop instead of an exponential backoff (a rough sketch follows the list below). My reasoning and suggestion:

  1. Any transient errors on the BeginTransaction call are handled by gax and automatically retried according to the retry settings for this RPC. Any error that bubbles up to the client is therefore not considered a transient error, and retrying it after X time still has a high probability of failing again.
  2. We currently check specifically for PermissionDenied and DatabaseNotFound errors and stop the background process if one of those errors occurs. Instead of adding more specific checks for errors that should stop the process, it would be more robust to turn this logic around: any gRPC error that bubbles up to the client should be considered a permanent error that stops the process.
  3. The takeWriteSession() method will try to start a new transaction if no write-prepared sessions can be found in the pool. If an error still occurs on the BeginTransaction RPC, the error will be returned to the user. If no error occurs, we should re-enable the background preparing of sessions.
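
To make the idea concrete, here is a small, self-contained sketch of such a circuit breaker; the type and method names (prepareBreaker, trip, reset) are invented for illustration and are not part of the session pool:

package main

import (
	"errors"
	"fmt"
	"sync"
)

// prepareBreaker is a hypothetical circuit breaker for the background
// prepare loop: any error that bubbles up from BeginTransaction trips it,
// and a successful "manual" prepare in takeWriteSession resets it.
type prepareBreaker struct {
	mu      sync.Mutex
	tripped bool
}

func (b *prepareBreaker) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return !b.tripped
}

func (b *prepareBreaker) trip()  { b.mu.Lock(); b.tripped = true; b.mu.Unlock() }
func (b *prepareBreaker) reset() { b.mu.Lock(); b.tripped = false; b.mu.Unlock() }

func main() {
	b := &prepareBreaker{}

	// Background loop: stop preparing as soon as a prepare fails.
	prepare := func() error { return errors.New("BeginTransaction failed") } // stand-in
	if b.allow() {
		if err := prepare(); err != nil {
			b.trip() // treat the error as permanent and stop background preparing
		}
	}
	fmt.Println("background prepare allowed:", b.allow()) // false

	// takeWriteSession path: a successful manual prepare re-enables the loop.
	b.reset()
	fmt.Println("background prepare allowed:", b.allow()) // true
}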

hengfengli (Contributor) commented Dec 2, 2019:

There is no error happening here. The problem is that the missing ID in the Transaction response leads to an unhealthy write transaction, so getNextForTx() will always return a non-nil value:

ws := getNextForTx()

So it enters an endless loop: there is always an unhealthy ws and it keeps trying to prepare for write.

If we want to add the circuit breaker, we need to distinguish the cases:

  • ws or rs is non-nil and we should prepare the session for write or ping it to keep it alive.
  • ws or rs is non-nil but stays unhealthy for a while, and we should break out instead of retrying forever. (This is a bit tricky: what if we genuinely have a large number of unhealthy sessions to prepare for write?)

Also, there is a sleep interval when no work needs to be done. But currently, we don't have any sleep interval between two consecutive health checks (when rs or ws is non-nil and there is some work to do).

Exponential backoff may not work properly here, because if we have a number of unhealthy sessions, the loop should run at normal speed instead of waiting exponentially.

olavloite (Contributor) commented:

In my opinion, we should use the value that is returned here to determine whether we should stop the process of preparing transactions:

err := ws.prepareForWrite(ctx)

That method also normally does not return an error if an empty transaction is returned by the server (which, by the way, should not be possible), but that check can easily be added to the method. The reason I think we should use the return value of that method to stop the prepare process is that it will also catch any errors that might be introduced by other bugs, server problems, etc.
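
A sketch of the kind of check that could be added; validateTransaction is a hypothetical helper, not the actual prepareForWrite code:

package main

import (
	"errors"
	"fmt"

	sppb "google.golang.org/genproto/googleapis/spanner/v1"
)

// validateTransaction sketches the check prepareForWrite could perform on the
// BeginTransaction response: treat a nil transaction or an empty ID as an
// error, so the health-check worker sees the failure instead of looping.
func validateTransaction(tx *sppb.Transaction) error {
	if tx == nil || len(tx.Id) == 0 {
		return errors.New("spanner: BeginTransaction returned a transaction without an ID")
	}
	return nil
}

func main() {
	fmt.Println(validateTransaction(&sppb.Transaction{}))                  // error
	fmt.Println(validateTransaction(&sppb.Transaction{Id: []byte("tx1")})) // <nil>
}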

hengfengli (Contributor) commented:

You're right. In my testing, this method does not return any error at the moment.

I agree that it will be good to use this error for the circuit breaker.

My only concern is that there is no sleep interval between two consecutive health checks. Should we add one? Otherwise we may end up with the same risk again: rs or ws is always non-nil and the loop keeps running and burning CPU.

olavloite (Contributor) commented:

> My only concern is that there is no sleep interval between two consecutive health checks. Should we add one? Otherwise we may end up with the same risk again: rs or ws is always non-nil and the loop keeps running and burning CPU.

I would rather explore whether we should split the two things this for-loop is doing into two different loops/methods, as the error handling we would want to apply to each is probably different (see the sketch after this list):

  1. Prepare session: If preparing a session fails with an error other than Session not found, I would say we should break the loop and only start it again once a 'manual' prepare succeeds (i.e. the client application calls takeWriteSession and the prepare executed by that method succeeds). That is a clear sign that any (semi-permanent) error for creating read/write transactions has been cleared.
  2. Ping session (i.e. keep-alive): This is something we should normally not shut down, and it might make sense to sleep exponentially if we start getting errors for these calls (other than Session not found). The reason I think we should not shut down this loop is that if these ping calls fail, the client application will very probably also experience a lot of errors, and the only real solution is to shut down the application or wait until the (temporary) problem has been cleared. The best way to determine whether the problem has been cleared is probably to retry, but with an exponential sleep after each failure. I can think of two reasonable scenarios in which these calls would start failing:
    1. Network failure or backend outage: The calls will start to time out or fail with UNAVAILABLE. These calls are retried automatically by gax until the total timeout / total retries has been reached. Although gax is already doing a sleep-and-retry, it would make sense for the for-loop to sleep as well between retries.
    2. Database was deleted or the user no longer has permission to see the database: The calls will start to fail with PERMISSION_DENIED and/or NOT_FOUND (Database not found) errors. These errors will also affect every other call the client application executes on the database. Retrying could make sense, as the permission problem might be temporary, but that is not very probable. I would say that an exponential sleep between these retries would also make sense.
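
A structural sketch of that split, with hypothetical prepareOne/pingOne stand-ins for the real pool operations:

package main

import (
	"errors"
	"log"
	"time"

	gax "github.com/googleapis/gax-go/v2"
)

// prepareLoop sketches item 1: stop on the first error and let a successful
// "manual" prepare (takeWriteSession) restart the loop elsewhere.
func prepareLoop(prepareOne func() error, done <-chan struct{}) {
	for {
		select {
		case <-done:
			return
		default:
		}
		if err := prepareOne(); err != nil {
			log.Printf("stopping background prepare: %v", err)
			return
		}
	}
}

// pingLoop sketches item 2: never stop, but sleep with exponential backoff
// after each failure and reset the backoff after a success.
func pingLoop(pingOne func() error, done <-chan struct{}) {
	newBackoff := func() gax.Backoff {
		return gax.Backoff{Initial: time.Second, Max: time.Minute, Multiplier: 2}
	}
	bo := newBackoff()
	for {
		select {
		case <-done:
			return
		default:
		}
		if err := pingOne(); err != nil {
			select {
			case <-time.After(bo.Pause()): // grows after every consecutive failure
			case <-done:
				return
			}
			continue
		}
		bo = newBackoff() // success: the next failure starts from Initial again
	}
}

func main() {
	done := make(chan struct{})

	// Stand-ins: the prepare path hits a permanent error and stops;
	// the ping path fails twice, then keeps succeeding.
	go prepareLoop(func() error { return errors.New("permanent BeginTransaction error") }, done)

	pings := 0
	go pingLoop(func() error {
		pings++
		if pings <= 2 {
			return errors.New("transient ping error")
		}
		time.Sleep(50 * time.Millisecond) // pretend a real ping takes some time
		return nil
	}, done)

	time.Sleep(3 * time.Second)
	close(done)
}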

I don't think we should add any sleep between two consecutive health checks (pings) if they succeed. The health checker should be allowed to ping sessions when deemed necessary without any artificial delay. We should however ensure that the ping method always returns an error if the ping in some way did not succeed or returned invalid data (which it currently does, as far as I can tell).

hengfengli (Contributor) commented:

Thanks a lot for the detailed explanation. This sounds good to me.

BTW: I think the Spanner client in google-cloud-java also has this issue. When I first ran my test, I used the Java client library.
