roachtest: attempt to handle VM overload under tpccbench
See #62039.

`tpccbench`, by design, pushes CRDB into overload territory. The test
harness handles nodes crashing or tpmC tanking well. However, it was
not prepared for the cloud VMs becoming unresponsive for minutes at a
time, which is a common failure mode.

This commit tweaks the line search to be resilient to failures to
communicate with the cloud VMs in the one place where it matters:
stopping the cluster at the beginning of a new search attempt.

The hope is that this will allow the search to run to completion even
in the face of overload-induced temporary VM outages. It is not
expected to do so reliably, but anecdotally most VMs seem to come back
within a few minutes.

Release note: None
tbg committed Mar 22, 2021
1 parent c568739 commit dad8261
Showing 1 changed file with 30 additions and 1 deletion.
pkg/cmd/roachtest/tpcc.go (30 additions & 1 deletion)
@@ -804,7 +804,36 @@ func runTPCCBench(ctx context.Context, t *test, c *cluster, b tpccBenchSpec) {
 		iteration++
 		t.l.Printf("initializing cluster for %d warehouses (search attempt: %d)", warehouses, iteration)
 		m := newMonitor(ctx, c, roachNodes)
-		c.Stop(ctx, roachNodes)
+
+		// We overload the clusters in tpccbench, which can lead to transient infra
+		// failures. These are a) really annoying to debug and b) hide the actual
+		// passing warehouse count, making the line search sensitive to the choice
+		// of starting warehouses. Do a best-effort at waiting for the cloud VM(s)
+		// to recover without failing the line search.
+		var ok bool
+		for i := 0; i < 10; i++ {
+			if err := ctx.Err(); err != nil {
+				t.Fatal(err)
+			}
+			if err := c.StopE(ctx, roachNodes); err != nil {
+				t.l.Printf("unable to stop cluster; retrying to allow vm to recover: %s", err)
+				// We usually spend a long time blocking in StopE anyway, but just in case
+				// of a fast-failure mode, we still want to spend a little bit of time over
+				// the course of 10 retries to maximize the chances of things going back to
+				// working.
+				select {
+				case <-time.After(30 * time.Second):
+				case <-ctx.Done():
+				}
+				continue
+			}
+			ok = true
+			break
+		}
+		if !ok {
+			t.Fatalf("VM is hosed; giving up")
+		}
+
 		c.Start(ctx, t, append(b.startOpts(), roachNodes)...)
 		time.Sleep(restartWait)
 
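As an aside, here is a minimal standalone sketch of the retry-and-backoff pattern the diff above applies. It assumes a hypothetical helper name (stopWithRetry) and a plain stop callback in place of the test and cluster objects; it illustrates the approach and is not code from the commit.

package sketch

import (
	"context"
	"fmt"
	"time"
)

// stopWithRetry calls stop up to attempts times, waiting backoff between
// failed attempts so a temporarily overloaded VM has a chance to recover.
// It respects context cancellation both between attempts and while waiting.
func stopWithRetry(ctx context.Context, attempts int, backoff time.Duration, stop func(context.Context) error) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if err := ctx.Err(); err != nil {
			return err
		}
		if lastErr = stop(ctx); lastErr == nil {
			return nil // the cluster stopped; the VM answered
		}
		fmt.Printf("unable to stop cluster; retrying to allow vm to recover: %s\n", lastErr)
		// Even if stop fails fast, spend some time per retry to maximize the
		// chances of the VM coming back before giving up.
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
		}
	}
	return fmt.Errorf("VM is hosed; giving up (last error: %w)", lastErr)
}

In roachtest terms this would amount to something like stopWithRetry(ctx, 10, 30*time.Second, func(ctx context.Context) error { return c.StopE(ctx, roachNodes) }); the commit keeps the loop inline in runTPCCBench rather than introducing a helper.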