
spanner is pegging the backend #1687

Closed
jeanbza opened this issue Nov 29, 2019 · 9 comments
Labels: api: spanner, priority: p1, type: bug

jeanbza (Member) commented Nov 29, 2019:

1. Start a basic echo server (something like https://gist.github.com/jadekler/b85fdf6a7a71583419e05763bbb1ce7c) that returns an empty transaction (&tpb.Transaction{}) from BeginTransaction; a minimal sketch follows this list.
2. Start the client library.
3. Observe thousands of BeginTransaction requests per second hitting the echo server.
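
For reference, a minimal sketch of such an echo server (the linked gist is the original; this version assumes the generated spannerpb package exposes UnimplementedSpannerServer and only overrides BeginTransaction; a full repro would also need the session RPCs stubbed):

package main

import (
	"context"
	"log"
	"net"

	spannerpb "google.golang.org/genproto/googleapis/spanner/v1"
	"google.golang.org/grpc"
)

// echoServer answers BeginTransaction with an empty Transaction (no ID),
// which is the response that triggered the tight prepare loop in the client.
// All other RPCs fall through to UnimplementedSpannerServer.
type echoServer struct {
	spannerpb.UnimplementedSpannerServer
}

func (s *echoServer) BeginTransaction(ctx context.Context, req *spannerpb.BeginTransactionRequest) (*spannerpb.Transaction, error) {
	return &spannerpb.Transaction{}, nil // note: Id is deliberately left empty
}

func main() {
	lis, err := net.Listen("tcp", "localhost:9999")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	spannerpb.RegisterSpannerServer(srv, &echoServer{})
	log.Fatal(srv.Serve(lis))
}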

@skuruppu believes it is this block:

for {
	if hc.isClosing() {
		// Exit when the pool has been closed and all sessions have been
		// destroyed or when health checker has been closed.
		hc.waitWorkers.Done()
		return
	}
	ws := getNextForTx()
	if ws != nil {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		err := ws.prepareForWrite(ctx)
		cancel()
		if err != nil {
			// Skip handling prepare error, session can be prepared in next
			// cycle.
			log.Printf("Failed to prepare session, error: %v", toSpannerError(err))
		}
		hc.pool.recycle(ws)
		hc.pool.mu.Lock()
		hc.pool.prepareReqs--
		hc.pool.mu.Unlock()
		hc.markDone(ws)
	}
	rs := getNextForPing()
	if rs == nil {
		if ws == nil {
			// No work to be done so sleep to avoid burning CPU.
			pause := int64(100 * time.Millisecond)
			if pause > int64(hc.interval) {
				pause = int64(hc.interval)
			}
			select {
			case <-time.After(time.Duration(rand.Int63n(pause) + pause/2)):
			case <-hc.done:
			}
		}
		continue
	}
	hc.healthCheck(rs)
}

This entire for-loop likely needs to be rewritten to use exponential backoff. We should probably use https://godoc.org/github.com/googleapis/gax-go/v2 instead of the current "sometimes pause 100ms" logic:

if rs == nil {
	if ws == nil {
		// No work to be done so sleep to avoid burning CPU.
		pause := int64(100 * time.Millisecond)
		if pause > int64(hc.interval) {
			pause = int64(hc.interval)
		}
		select {
		case <-time.After(time.Duration(rand.Int63n(pause) + pause/2)):
		case <-hc.done:
		}
	}
	continue
}
The 100ms pause needs to be changed to exponential backoff, and the conditionality of the pause needs to be fixed: the sleep only happens when both ws and rs are nil, so otherwise the loop can iterate without any delay.
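
For illustration, a minimal, runnable sketch of the kind of gax-based pause that could replace the fixed ~100ms; the interval value, loop structure, and variable names are stand-ins, not the actual pool code:

package main

import (
	"fmt"
	"time"

	gax "github.com/googleapis/gax-go/v2"
)

func main() {
	// Stand-in for hc.interval; the real value comes from the session pool config.
	interval := 30 * time.Second

	// Jittered, exponentially growing pause, capped at the health-check interval.
	bo := gax.Backoff{Initial: 100 * time.Millisecond, Max: interval, Multiplier: 1.3}
	for i := 0; i < 5; i++ {
		// Each Pause() returns a randomized duration up to the current cap,
		// and the cap grows toward Max on every call.
		fmt.Println(bo.Pause())
	}

	// When the worker finds real work again, it would start a fresh backoff
	// so the next idle period begins at Initial instead of the grown value.
	bo = gax.Backoff{Initial: 100 * time.Millisecond, Max: interval, Multiplier: 1.3}
}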

Marking this as a p1 bug since it seems like the kind of thing that can cause an unintended DoS of GCP.

jeanbza (Member, Author) commented Nov 29, 2019:

Update: this appears to be fixed if the transaction returned has a non-nil ID. But I think it highlights a general riskiness in that for-loop: we should, in general, not have unprotected looping of RPCs. Any bug (now or in the future) that causes that loop to spin will peg the backend.

olavloite (Contributor) commented:

This is basically the same issue as #1662. If the server starts returning a permanent error for BeginTransaction (or something the client will see as an invalid read/write transaction), you could enter this infinite loop.

The pause block mentioned above is intended for the situation where the worker has nothing to do and is paused to avoid burning CPU unnecessarily. The exponential backoff that we should implement for the error case should be separate from that, or at least also take that condition into account.

olavloite (Contributor) commented:

I'm wondering whether we should solve this problem by adding a circuit breaker to the for-loop instead of an exponential backoff (a rough sketch follows the list below). My reasoning and suggestion:

  1. Any transient errors on the BeginTransaction call are handled by gax and automatically retried according to the retry settings for this RPC. Any error that bubbles up to the client is therefore not considered a transient error, and retrying it after X time still has a high probability of failing again.
  2. We currently check specifically for PermissionDenied and DatabaseNotFound errors and stop the background process if one of those errors occurs. Instead of adding more specific checks for errors that should stop the process, it would be more robust to turn this logic around: any gRPC error that bubbles up to the client should be considered a permanent error that stops the process.
  3. The takeWriteSession() method will try to start a new transaction if no write-prepared sessions can be found in the pool. If an error still occurs on the BeginTransaction RPC, the error will be returned to the user. If no error occurs, we should re-enable the background preparing of sessions.
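
To make the idea concrete, here is a small, self-contained sketch of such a circuit breaker; the type and method names (prepareBreaker, trip, reset) are invented for illustration and are not part of the session pool:

package main

import (
	"errors"
	"fmt"
	"sync"
)

// prepareBreaker is a hypothetical circuit breaker for the background
// prepare loop: any error that bubbles up from BeginTransaction trips it,
// and a successful "manual" prepare in takeWriteSession resets it.
type prepareBreaker struct {
	mu      sync.Mutex
	tripped bool
}

func (b *prepareBreaker) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return !b.tripped
}

func (b *prepareBreaker) trip()  { b.mu.Lock(); b.tripped = true; b.mu.Unlock() }
func (b *prepareBreaker) reset() { b.mu.Lock(); b.tripped = false; b.mu.Unlock() }

func main() {
	b := &prepareBreaker{}

	// Background loop: stop preparing as soon as a prepare fails.
	prepare := func() error { return errors.New("BeginTransaction failed") } // stand-in
	if b.allow() {
		if err := prepare(); err != nil {
			b.trip() // treat the error as permanent and stop background preparing
		}
	}
	fmt.Println("background prepare allowed:", b.allow()) // false

	// takeWriteSession path: a successful manual prepare re-enables the loop.
	b.reset()
	fmt.Println("background prepare allowed:", b.allow()) // true
}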

hengfengli (Contributor) commented Dec 2, 2019:

There is no error happening here. The problem is that the missing ID in the Transaction response leads to an unhealthy write transaction, so getNextForTx() will always return a non-nil value:

ws := getNextForTx()

So it enters an endless loop: there is always an unhealthy ws and it keeps trying to prepare for write.

If we want to add the circuit breaker, we need to distinguish the cases:

  • ws or rs is non-nil and we should prepare the session for write or ping it to keep it alive.
  • ws or rs is non-nil but stays unhealthy for a while, and we should break out instead of retrying forever. (This is a bit tricky: what if we genuinely have a large number of unhealthy sessions to prepare for write?)

Also, there is a sleep interval when no work needs to be done. But currently, we don't have any sleep interval between two consecutive health checks (when rs or ws is non-nil and there is some work to do).

Exponential backoff may not work properly here, because if we have a number of unhealthy sessions, the loop should run at normal speed instead of waiting exponentially.

olavloite (Contributor) commented:

In my opinion, we should use the value that is returned here to determine whether we should stop the process of preparing transactions:

err := ws.prepareForWrite(ctx)

That method also normally does not return an error if an empty transaction is returned by the server (which, by the way, should not be possible), but that check can easily be added to the method. The reason I think we should use the return value of that method to stop the prepare process is that it will also catch any errors that might be introduced by other bugs, server problems, etc.
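
A sketch of the kind of check that could be added; validateTransaction is a hypothetical helper, not the actual prepareForWrite code:

package main

import (
	"errors"
	"fmt"

	sppb "google.golang.org/genproto/googleapis/spanner/v1"
)

// validateTransaction sketches the check prepareForWrite could perform on the
// BeginTransaction response: treat a nil transaction or an empty ID as an
// error, so the health-check worker sees the failure instead of looping.
func validateTransaction(tx *sppb.Transaction) error {
	if tx == nil || len(tx.Id) == 0 {
		return errors.New("spanner: BeginTransaction returned a transaction without an ID")
	}
	return nil
}

func main() {
	fmt.Println(validateTransaction(&sppb.Transaction{}))                  // error
	fmt.Println(validateTransaction(&sppb.Transaction{Id: []byte("tx1")})) // <nil>
}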

hengfengli (Contributor) commented:

You're right. In my testing, this method does not return any error at the moment.

I agree that it will be good to use this error for the circuit breaker.

My only concern is that there is no sleep interval between two consecutive health checks. Should we add one? Otherwise we may end up with the same risk again: rs or ws is always non-nil and the loop keeps running and burning CPU.

olavloite (Contributor) commented:

> My only concern is that there is no sleep interval between two consecutive health checks. Should we add one? Otherwise we may end up with the same risk again: rs or ws is always non-nil and the loop keeps running and burning CPU.

I would rather explore whether we should split the two things this for-loop is doing into two different loops/methods, as the error handling we would want to apply to each is probably different (see the sketch after this list):

  1. Prepare session: If preparing a session fails with an error other than Session not found, I would say we should break the loop and only start it again once a 'manual' prepare succeeds (i.e. the client application calls takeWriteSession and the prepare executed by that method succeeds). That is a clear sign that any (semi-permanent) error for creating read/write transactions has been cleared.
  2. Ping session (i.e. keep-alive): This is something we should normally not shut down, and it might make sense to sleep exponentially if we start getting errors for these calls (other than Session not found). The reason I think we should not shut down this loop is that if these ping calls fail, the client application will very probably also experience a lot of errors, and the only real solution is to shut down the application or wait until the (temporary) problem has been cleared. The best way to determine whether the problem has been cleared is probably to retry, but with an exponential sleep after each failure. I can think of two reasonable scenarios in which these calls would start failing:
    1. Network failure or backend outage: The calls will start to time out or fail with UNAVAILABLE. These calls are retried automatically by gax until the total timeout / total retries has been reached. Although gax is already doing a sleep-and-retry, it would make sense for the for-loop to sleep as well between retries.
    2. Database was deleted or the user no longer has permission to see the database: The calls will start to fail with PERMISSION_DENIED and/or NOT_FOUND (Database not found) errors. These errors will also affect every other call the client application executes on the database. Retrying could make sense, as the permission problem might be temporary, but that is not very probable. I would say that an exponential sleep between these retries would also make sense.
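
A structural sketch of that split, with hypothetical prepareOne/pingOne stand-ins for the real pool operations:

package main

import (
	"errors"
	"log"
	"time"

	gax "github.com/googleapis/gax-go/v2"
)

// prepareLoop sketches item 1: stop on the first error and let a successful
// "manual" prepare (takeWriteSession) restart the loop elsewhere.
func prepareLoop(prepareOne func() error, done <-chan struct{}) {
	for {
		select {
		case <-done:
			return
		default:
		}
		if err := prepareOne(); err != nil {
			log.Printf("stopping background prepare: %v", err)
			return
		}
	}
}

// pingLoop sketches item 2: never stop, but sleep with exponential backoff
// after each failure and reset the backoff after a success.
func pingLoop(pingOne func() error, done <-chan struct{}) {
	newBackoff := func() gax.Backoff {
		return gax.Backoff{Initial: time.Second, Max: time.Minute, Multiplier: 2}
	}
	bo := newBackoff()
	for {
		select {
		case <-done:
			return
		default:
		}
		if err := pingOne(); err != nil {
			select {
			case <-time.After(bo.Pause()): // grows after every consecutive failure
			case <-done:
				return
			}
			continue
		}
		bo = newBackoff() // success: the next failure starts from Initial again
	}
}

func main() {
	done := make(chan struct{})

	// Stand-ins: the prepare path hits a permanent error and stops;
	// the ping path fails twice, then keeps succeeding.
	go prepareLoop(func() error { return errors.New("permanent BeginTransaction error") }, done)

	pings := 0
	go pingLoop(func() error {
		pings++
		if pings <= 2 {
			return errors.New("transient ping error")
		}
		time.Sleep(50 * time.Millisecond) // pretend a real ping takes some time
		return nil
	}, done)

	time.Sleep(3 * time.Second)
	close(done)
}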

I don't think we should add any sleep between two consecutive health checks (pings) if they succeed. The health checker should be allowed to ping sessions when deemed necessary without any artificial delay. We should however ensure that the ping method always returns an error if the ping in some way did not succeed or returned invalid data (which it currently does, as far as I can tell).

hengfengli (Contributor) commented:

Thanks a lot for the detailed explanation. This sounds good to me.

BTW: I think the Spanner client in google-cloud-java also has this issue. When I first ran my test, I used the Java client library.
