production: Admin UI not displaying any data on indigo #17478

Closed
a-robinson opened this issue Aug 7, 2017 · 34 comments

@a-robinson
Contributor

The cluster seems to be working fine by other metrics - both reads and writes from kv are being served at the same rate as usual, and I can interactively query/create tables. There aren't any unavailable ranges in the cluster according to the metrics in grafana and the info on the debug pages.

I'll keep digging into what's going on.

@a-robinson a-robinson self-assigned this Aug 7, 2017
@a-robinson a-robinson added this to the 1.1 milestone Aug 7, 2017
@a-robinson
Contributor Author

Timeseries writes got blocked nearly 18 hours ago, and apparently don't have a way to get unstuck:

goroutine 186 [chan receive, 1072 minutes]:
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges.func1(0xc4232409c8, 0xc423241108, 0xc423241100, 0xc42324090e, 0xc423241060, 0xc4232409b0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:678 +0xe9
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:843 +0x72e
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).Send(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:609 +0x344
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).Send(0xc42033c0d0, 0x7f9d2c126000, 0xc42bdb6f90, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:442 +0x1f2
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).send(0xc420248780, 0x7f9d2c126000, 0xc4224b7ad0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:555 +0x1ff
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).(github.com/cockroachdb/cockroach/pkg/internal/client.send)-fm(0x7f9d2c126000, 0xc4224b7ad0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:491 +0x83
github.com/cockroachdb/cockroach/pkg/internal/client.sendAndFill(0x7f9d2c126000, 0xc4224b7ad0, 0xc423241ac0, 0xc42b577500, 0x1, 0xc423241bd8)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:463 +0x103
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).Run(0xc420248780, 0x7f9d2c126000, 0xc4224b7ad0, 0xc42b577500, 0x1, 0x14d8673fede70000)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:491 +0x9d
github.com/cockroachdb/cockroach/pkg/ts.(*DB).StoreData(0xc42066c088, 0x7f9d2c126000, 0xc4224b7ad0, 0x1, 0xc4338cc000, 0x197, 0x197, 0x2aa5e00, 0xc420414c60)
	/go/src/github.com/cockroachdb/cockroach/pkg/ts/db.go:154 +0x78f
github.com/cockroachdb/cockroach/pkg/ts.(*poller).poll.func1(0x7f9d2c126000, 0xc4204edb60)
	/go/src/github.com/cockroachdb/cockroach/pkg/ts/db.go:111 +0x13d
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTask(0xc4206fde60, 0x7f9d2c126000, 0xc4204edb60, 0x1d29e14, 0xf, 0xc423241e98, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:226 +0xf7
github.com/cockroachdb/cockroach/pkg/ts.(*poller).poll(0xc4206c9620)
	/go/src/github.com/cockroachdb/cockroach/pkg/ts/db.go:114 +0xdc
github.com/cockroachdb/cockroach/pkg/ts.(*poller).start.func1(0x7f9d2c126000, 0xc420b4c330)
	/go/src/github.com/cockroachdb/cockroach/pkg/ts/db.go:90 +0x11c
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc42028fde0, 0xc4206fde60, 0xc42028fdd0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:193 +0xf7
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:194 +0xad

Since timeseries writes are all issued from a single goroutine, any blocked write prevents all future writes.

@a-robinson
Contributor Author

This is an interesting amount of recursion in the dist_sender, but it did appear to eventually bottom out:

goroutine 13815889 [select]:
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendToReplicas(0xc420712000, 0x7f9d2c16a2a8, 0xc42e1d1d00, 0xc420712048, 0x3a1, 0xc43b1acd20, 0x3, 0x3, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1156 +0x1525
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendRPC(0xc420712000, 0x7f9d2c16a2a8, 0xc42e1d1d00, 0x3a1, 0xc43b1acd20, 0x3, 0x3, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:394 +0x2db
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendSingleRange(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:458 +0x17b
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:942 +0x447
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1002 +0x9b7
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc420712000, 0x7f9d2c126000, 0xc42bdb6ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:807 +0xafe
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunLimitedAsyncTask
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:326 +0x22f

@petermattis
Collaborator

Should we put a timeout on the timeseries writes? Cc @mrtracy

Any clues on why the write got blocked?

@a-robinson
Contributor Author

Yeah, one of the timeseries ranges got borked. The logs from when things got stuck are gone, but if I had to guess I'd guess it's related to #17198.

The current symptom is the current leaseholder repeatedly logging:

I170807 18:23:41.516352 141 storage/replica.go:3638  [n9,s9,r929/1:/System/tsd/cr.store.queue.re…] transferring raft leadership to replica ID 3
I170807 18:23:41.516410 141 vendor/github.com/coreos/etcd/raft/raft.go:993  [n9,s9,r929/1:/System/tsd/cr.store.queue.re…] 1 [term 12] transfer leadership to 3 is in progress, ignores request to same node 3

And less frequently (but still repeatedly) logging:

I170807 18:24:20.516664 158 vendor/github.com/coreos/etcd/raft/raft.go:1004  [n9,s9,r929/1:/System/tsd/cr.store.queue.re…] 1 [term 12] starts to transfer leadership to 3
I170807 18:24:20.516687 158 vendor/github.com/coreos/etcd/raft/raft.go:1010  [n9,s9,r929/1:/System/tsd/cr.store.queue.re…] 1 sends MsgTimeoutNow to 3 immediately as 3 already has up-to-date log

While node 5 (which holds the uninitialized replica) just repeatedly logs:

I170807 18:26:20.361940 132 vendor/github.com/coreos/etcd/raft/raft.go:1100  [n5,s5,r929/3:{-}] 3 received MsgTimeoutNow from 1 but is not promotable

[Image: screenshot taken 2017-08-07 at 2:22:42 pm]

@mrtracy
Contributor

mrtracy commented Aug 7, 2017

This isn't really specific to the time series system; the write got stuck in DB.Run(). If those calls need timeouts to be safe, that's fine, whatever the recommendation is.

However, in this case the ultimate result (no timeseries data available) would not change; if the time series range is not writable, then the requests will just repeatedly time out and nothing will be recorded. Even without a timeout, if the range eventually comes back, the hanging request would presumably succeed and the loop will proceed normally.

A timeout would be useful in two cases:

  • If the hanging request is exacerbating the issue (doesn't seem likely, but I'll admit that I don't understand KV that well anymore).
  • If we think that the timeout, presumably coupled with an error log, would have made debugging this issue more clear.

@a-robinson
Contributor Author

Whoops, I was wrong up above where I said node 9 is the leaseholder. It's the raft leader, but node 5 is somehow the leaseholder despite not having an initialized replica. That makes me think this may be different than #17198, particularly since there couldn't have been any rebalancing in this scenario given that the replica IDs are 1, 2, and 3.

So the question is how did node 5's replica get into this state of not having a valid range descriptor.

@petermattis
Collaborator

How is replica 3 the lease holder when it isn't initialized? We're only supposed to transfer leases to up-to-date replicas.

@a-robinson
Contributor Author

Well, its raft state appears up to date; it's just missing a valid range descriptor.

@petermattis
Collaborator

Is there anything in the logs about failing snapshots for that range?

@a-robinson
Contributor Author

Nope, the only logs are the ones I've included here (and a few from the various queues on nodes 2 and 9 indicating that they can't process the range because they aren't the leaseholder).

@a-robinson
Contributor Author

Its log size is significantly less than the other two replicas', so it's possible that truncation was involved in the issue, although I don't see any other evidence pointing to that.

@a-robinson
Contributor Author

It'd be nice to be able to see what's in the command queue, if anything (cc #9797). It's unclear where the requests would even be blocked given node 5's weird state.

@a-robinson
Contributor Author

I have a couple updates here, but no answer:

  1. It looks like the actual request that's being waited on has been lost. It doesn't look like it's actually blocked on the server side of the Send(). There's no goroutine on indigo-5 with 0x3a1 in the stack trace other than on the client side (i.e. in sendToReplicas). If that isn't conclusive, let me know. While periodically refreshing the goroutines, I did see one pop up of a goroutine committing to rocksdb for the range 929/0x3a1. Given the fact that it disappeared in later goroutine dumps, it makes me think the range isn't actually totally blocked. Also, I'm able to do timeseries reads that I think are hitting the range without them blocking.
  2. indigo-5 has been having massive performance problems, with writes often taking 10-15 seconds. Its disk utilization is completely maxed out according to grafana, while other nodes are working fine. In traces, all the delay is between acquired {raft,replica}mu and applying command. It's not clear whether or how that's related.

I think I'll take node 5 offline and manually inspect its raft state unless anyone has anything specific to check out about the running process. I'm not sure how much more I can derive from it.

@a-robinson
Contributor Author

@petermattis @tschottdorf - anything else you can think of that I should look at before taking node 5 offline?

@a-robinson
Contributor Author

I should also mention that the reason node 5 is unwilling to become raft leader is that it doesn't have an entry for itself in its own progress map (prs map[uint64]*Progress), which makes it un-"promotable". This seems pretty surprising, but I'm not familiar with the expectations for it.
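
To make the check concrete, here is a minimal sketch of the kind of test being described, assuming a node counts as promotable only if its own ID is a key in its progress map. The `node` type and the trimmed-down `Progress` struct are hypothetical stand-ins, not the raft library's actual definitions, and the replica IDs are illustrative.

```go
package main

import "fmt"

// Progress is a simplified stand-in for raft's per-peer replication progress.
type Progress struct {
	Match, Next uint64
}

// node is a hypothetical, stripped-down raft node holding only what the
// promotable check needs.
type node struct {
	id  uint64
	prs map[uint64]*Progress
}

// promotable reports whether the node may campaign for leadership, which in
// this sketch requires an entry for its own ID in the progress map.
func (n *node) promotable() bool {
	_, ok := n.prs[n.id]
	return ok
}

func main() {
	// A replica that knows about all three peers, including itself.
	healthy := &node{id: 3, prs: map[uint64]*Progress{1: {}, 2: {}, 3: {}}}
	// A replica whose progress map lacks its own ID.
	stuck := &node{id: 3, prs: map[uint64]*Progress{1: {}, 2: {}}}
	fmt.Println(healthy.promotable(), stuck.promotable()) // prints "true false"
}
```

Under that assumption, a replica in the second state never starts an election, no matter how long it goes without hearing from a leader.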

@petermattis
Collaborator

I can't think of anything else to look at, but perhaps @bdarnell does. Ben is also the most knowledgeable about the Raft internals and various states.

@bdarnell
Contributor

bdarnell commented Aug 8, 2017

Since we know there was no rebalancing (Replica IDs 1, 2, 3), these replicas were created by a split. One way in which we could have an uninitialized replica on node 5 is if node 5's replica of the pre-split range was stuck and unable to process the split. But that's inconsistent with some of the other things we see here (such as the apparently valid MVCC stats, and a non-empty raft log - these wouldn't exist unless the split had been processed).

The raft log is the most suspicious thing here. The first (truncation) and last log indexes are the same on all replicas, but the size on node 5 is wrong. How did it get to be non-empty without matching the others? I don't think it could have completed a truncation without becoming initialized along the way.

Or could it be related to #17051? We know there's a bug in which the raft hard state can be written incorrectly on splits, and the fix for this is not enabled until 1.1 migrations are done.

@a-robinson
Contributor Author

To add to your comment since I don't think you noticed it from the data above, the range had been operating fine for more than a day before it got stuck. That points even more strongly at log truncation, so I've started investigating #17429/#17448 in case the test failures may be related.

@tbg
Member

tbg commented Aug 8, 2017

I should also mention that the reason node 5 is unwilling to become raft leader is that it doesn't have an entry for itself in its own progress map (prs map[uint64]*Progress), which makes it un-"promotable". This seems pretty surprising, but I'm not familiar with the expectations for it.

I thought the Progress is only populated on the Raft leader. Which piece of code are you talking about?

I'm currently looking at the cluster of a user who also lost their timeseries ranges. Data attached.

There's definitely a problem with the quota pool on that cluster. @irfansharif, any thoughts on the below? It might be an artifact of the ranges being horked, which the attached range pages can hopefully illustrate.
Archive.zip

goroutine 35976949 [chan receive, 6999 minutes]:
github.com/cockroachdb/cockroach/pkg/util/timeutil.(*Timer).Reset(0xc422777740, 0xdf8475800)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/timeutil/timer.go:89 +0x9f
github.com/cockroachdb/cockroach/pkg/storage.(*quotaPool).acquire(0xc420e88c30, 0x7fbd9cbebff8, 0xc420e5c3f0, 0x23b, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/quota_pool.go:198 +0x69a
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).maybeAcquireProposalQuota(0xc420441180, 0x7fbd9cbebff8, 0xc420e5c3f0, 0x23b, 0x8, 0x14d7740974383c7f)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:899 +0xd9
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).propose(0xc420441180, 0x7fbd9cbebff8, 0xc420e5c3f0, 0x14d76b9159239640, 0x0, 0x0, 0x0, 0x700000007, 0xc, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:2817 +0x69c
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).tryExecuteWriteBatch(0xc4204
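
Goroutines like the one above stand out by the wait time in their header line. Below is a minimal sketch of pulling long-blocked goroutines out of a plain-text dump, assuming the standard header format `goroutine N [state, M minutes]:`. The `blocked` type, `longBlocked` function, and threshold are hypothetical and not part of any existing tooling.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// blocked describes one goroutine header that carries a wait duration.
type blocked struct {
	ID      int
	State   string
	Minutes int
}

// headerRE matches headers such as "goroutine 186 [chan receive, 1072 minutes]:".
// Headers without a ", N minutes" suffix are not matched.
var headerRE = regexp.MustCompile(`(?m)^goroutine (\d+) \[([^,\]]+), (\d+) minutes`)

// longBlocked returns the goroutines that have waited at least minMinutes.
func longBlocked(dump string, minMinutes int) []blocked {
	var out []blocked
	for _, m := range headerRE.FindAllStringSubmatch(dump, -1) {
		id, errID := strconv.Atoi(m[1])
		mins, errMin := strconv.Atoi(m[3])
		if errID != nil || errMin != nil {
			continue // skip headers whose numbers do not fit in an int
		}
		if mins >= minMinutes {
			out = append(out, blocked{ID: id, State: m[2], Minutes: mins})
		}
	}
	return out
}

func main() {
	// Header lines taken from this thread; the stack frames are omitted.
	dump := "goroutine 186 [chan receive, 1072 minutes]:\n\n" +
		"goroutine 35976949 [chan receive, 6999 minutes]:\n\n" +
		"goroutine 1 [running]:\n"
	for _, b := range longBlocked(dump, 60) {
		fmt.Printf("goroutine %d: %s for %d minutes\n", b.ID, b.State, b.Minutes)
	}
}
```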

@a-robinson
Contributor Author

I thought the Progress is only populated on the Raft leader. Which piece of code are you talking about?

So it'd be a very weird raft bug if that was only meant to be populated on the leader given how it's used in this follower-specific code.

@tbg
Member

tbg commented Aug 8, 2017

Oh, apologies. I was thinking of something else (we use the progress map in Replica sometimes, but not on such a fundamental level).

@tbg
Member

tbg commented Aug 8, 2017

@a-robinson if you've already cooked up promising vmodule invocations, would you mind sharing them?

@a-robinson
Contributor Author

I wouldn't say I have anything particularly promising given my lack of progress, so it may be better if you come up with your own. Definitely make sure to include replica and raft in it, though.

@tbg
Member

tbg commented Aug 8, 2017

Random observation from the cluster I'm looking at: the clock offset messages talk about "17 nodes are within the configured ....". The cluster has 9 nodes. I think this is probably explained by nodes having been removed from the cluster.

@a-robinson
Contributor Author

Well, taking node 5 offline and investigating its raft state a little more closely didn't turn up anything that I was able to notice. Here are the few relevant debug dumps:

range-descriptors.txt
929-raftlog.txt
929-data.txt

The most notable thing I derived from that is that the on-disk range descriptor is present and correct, so we must have done something bad within the process to mess up the in-memory state without destroying the on-disk state.

After I killed node 5, the cluster very quickly up-replicated and its performance took off from 300 qps to 3000 qps. Also, the number of leader-not-leaseholder ranges dropped from ~50 to 0.

I've cordoned off the data directory into /mnt/data2 and restarted the node with an empty /mnt/data1 to try to determine whether the bad performance was caused by the VM or by something in the store.

After adding node 5 back with an empty data directory, the cluster's performance is slowly dipping as more ranges are added to it, and node 5's latency is crawling back up. That leads me to think that something is wrong with either the disk or the VM. I already checked how the disk is formatted, and it's formatted the same as the other nodes (so we're not hitting that issue again).

Even if the VM's/disk's performance is garbage, though, the replica shouldn't have gotten into the state that it was in, and I don't know what to look into at this point. I could swap node 5 back to its original data directory at some point, but I expect that wouldn't have any great effect given that all of its replicas will just get GC'ed once it rejoins.

@a-robinson
Contributor Author

Heh, the reason the disk's performance is so bad appears to be that it's getting overloaded by more than 1000 OMSConsistencyInvoker processes on the machine:

$ ps xa
...
  657 ?        Ss     0:00 /bin/sh -c /opt/omi/bin/OMSConsistencyInvoker >/dev/null 2>&1
  658 ?        Sl     0:00 /opt/omi/bin/OMSConsistencyInvoker
  732 ?        S      0:00 /usr/sbin/CRON -f
  735 ?        Ss     0:00 /bin/sh -c /opt/omi/bin/OMSConsistencyInvoker >/dev/null 2>&1
  737 ?        Sl     0:00 /opt/omi/bin/OMSConsistencyInvoker
  791 ?        S      0:00 /usr/sbin/CRON -f
...
$ ps xa | grep "/bin/sh -c /opt/omi/bin/OMSConsistencyInvoker" | wc
   1027    9243   91405

Any idea what all these are doing here, @mberhault?

@bdarnell
Contributor

bdarnell commented Aug 9, 2017

@a-robinson
Contributor Author

Yeah, my googling led to the same place. The weird thing, though, is that disabling the OMS agent (sudo service omid stop) hasn't actually improved anything. Throughput and latency on the node are still trash, and its disk is still seeing ~100% utilization.

@a-robinson
Contributor Author

jbd2 is way busier than I'd expect for such a low write throughput. Maybe it's an aftereffect of the OMS agents, but I don't really know:

[screenshot taken 2017-08-09 at 7:21:41 pm]

@a-robinson
Contributor Author

@mberhault, let me know if you care to look into the slow disk issue. If I don't hear from you, I'm going to delete and recreate the VM/disk tomorrow under the assumption that it's just azure being azure.

@mberhault
Contributor

Yeah, I think this can be replaced.

@a-robinson
Contributor Author

I replaced node 5 a few hours ago and things are much happier now. That leaves us with no leads into figuring out the original issue, though, so I won't be looking into it until it strikes again or some sort of inspiration hits me.

@dianasaur323
Contributor

@a-robinson ok, let me shift the milestone to later so that we keep a record of this, but this way we can clean out our 1.1 milestone

@dianasaur323 dianasaur323 modified the milestones: Later, 1.1 Aug 30, 2017
@tbg
Member

tbg commented Apr 19, 2018

Closing as not actionable.

@tbg tbg closed this as completed Apr 19, 2018