roachtest: attempt to handle VM overload under tpccbench
See #62039.

`tpccbench`, by design, pushes CRDB into overload territory. The test
harness handles nodes crashing or tpmC tanking well. However, it was
not prepared for the cloud VMs becoming unresponsive for minutes at a
time, which is a common failure mode.

This commit tweaks the line search to be resilient to failures to
communicate with the cloud VMs in the one place where it matters:
stopping the cluster at the beginning of a new search attempt.

The hope is that this will allow the search to run to completion even
in the face of overload-induced temporary VM outages. It is not
expected to do so reliably, but anecdotally most VMs seem to come back
within a few minutes.

Release note: None
tbg committed Mar 22, 2021
1 parent c568739 commit dad8261
Showing 1 changed file with 30 additions and 1 deletion.
pkg/cmd/roachtest/tpcc.go (30 additions & 1 deletion)
@@ -804,7 +804,36 @@ func runTPCCBench(ctx context.Context, t *test, c *cluster, b tpccBenchSpec) {
 		iteration++
 		t.l.Printf("initializing cluster for %d warehouses (search attempt: %d)", warehouses, iteration)
 		m := newMonitor(ctx, c, roachNodes)
-		c.Stop(ctx, roachNodes)
+
+		// We overload the clusters in tpccbench, which can lead to transient infra
+		// failures. These are a) really annoying to debug and b) hide the actual
+		// passing warehouse count, making the line search sensitive to the choice
+		// of starting warehouses. Do a best-effort at waiting for the cloud VM(s)
+		// to recover without failing the line search.
+		var ok bool
+		for i := 0; i < 10; i++ {
+			if err := ctx.Err(); err != nil {
+				t.Fatal(err)
+			}
+			if err := c.StopE(ctx, roachNodes); err != nil {
+				t.l.Printf("unable to stop cluster; retrying to allow vm to recover: %s", err)
+				// We usually spend a long time blocking in StopE anyway, but just in case
+				// of a fast-failure mode, we still want to spend a little bit of time over
+				// the course of 10 retries to maximize the chances of things going back to
+				// working.
+				select {
+				case <-time.After(30 * time.Second):
+				case <-ctx.Done():
+				}
+				continue
+			}
+			ok = true
+			break
+		}
+		if !ok {
+			t.Fatalf("VM is hosed; giving up")
+		}
+
 		c.Start(ctx, t, append(b.startOpts(), roachNodes)...)
 		time.Sleep(restartWait)
 
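As an aside, here is a minimal standalone sketch of the retry-and-backoff pattern the diff above applies. It assumes a hypothetical helper name (stopWithRetry) and a plain stop callback in place of the test and cluster objects; it illustrates the approach and is not code from the commit.

package sketch

import (
	"context"
	"fmt"
	"time"
)

// stopWithRetry calls stop up to attempts times, waiting backoff between
// failed attempts so a temporarily overloaded VM has a chance to recover.
// It respects context cancellation both between attempts and while waiting.
func stopWithRetry(ctx context.Context, attempts int, backoff time.Duration, stop func(context.Context) error) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if err := ctx.Err(); err != nil {
			return err
		}
		if lastErr = stop(ctx); lastErr == nil {
			return nil // the cluster stopped; the VM answered
		}
		fmt.Printf("unable to stop cluster; retrying to allow vm to recover: %s\n", lastErr)
		// Even if stop fails fast, spend some time per retry to maximize the
		// chances of the VM coming back before giving up.
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
		}
	}
	return fmt.Errorf("VM is hosed; giving up (last error: %w)", lastErr)
}

In roachtest terms this would amount to something like stopWithRetry(ctx, 10, 30*time.Second, func(ctx context.Context) error { return c.StopE(ctx, roachNodes) }); the commit keeps the loop inline in runTPCCBench rather than introducing a helper.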