Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raft stuck on writing hard state #1573

Closed
tbg opened this issue Jun 29, 2015 · 7 comments
Closed

Raft stuck on writing hard state #1573

tbg opened this issue Jun 29, 2015 · 7 comments
Labels
C-test-failure Broken test (automatically or manually discovered).

Comments

@tbg
Copy link
Member

tbg commented Jun 29, 2015

The following test appears to have failed:

#4308:

I0629 23:38:18.502005     331 range.go:1042] gossiping zones config from store 1, range 1
I0629 23:38:18.502271     331 split_queue.go:90] splitting range=1 ["" - "\xff\xff") at keys ["db1" "db2"]
I0629 23:38:18.503061     331 range_command.go:837] initiating a split of range=1 ["" - "\xff\xff") at key "db1"
I0629 23:38:18.507518     331 range_command.go:774] range 2: new leader lease replica 1:1 00:00:00.000 +1.000s
I0629 23:38:18.507671     331 range_command.go:837] initiating a split of range=2 ["db1" - "\xff\xff") at key "db2"
panic: test timed out after 30s

goroutine 6049 [running]:
testing.func·008()
    /usr/src/go/src/testing/testing.go:681 +0x12f
created by time.goFunc
    /usr/src/go/src/time/sleep.go:129 +0x4b

goroutine 1 [chan receive]:
testing.RunTests(0x11c2ad0, 0x16a49e0, 0xb4, 0xb4, 0xc207fff501)
    /usr/src/go/src/testing/testing.go:556 +0xad6
--
    /go/src/github.com/cockroachdb/cockroach/storage/queue.go:207 +0x2cb
github.com/cockroachdb/cockroach/util.func·001()
    /go/src/github.com/cockroachdb/cockroach/util/stopper.go:75 +0x51
created by github.com/cockroachdb/cockroach/util.(*Stopper).RunWorker
    /go/src/github.com/cockroachdb/cockroach/util/stopper.go:76 +0xe3
FAIL    github.com/cockroachdb/cockroach/storage    30.034s
=== RUN TestBatchBasics-2
I0629 23:39:06.728048     453 rocksdb.go:88] opening in-memory rocksdb instance
I0629 23:39:06.728756     453 rocksdb.go:113] closing in-memory rocksdb instance
--- PASS: TestBatchBasics-2 (0.00s)
=== RUN TestBatchGet-2
I0629 23:39:06.729353     453 rocksdb.go:88] opening in-memory rocksdb instance
I0629 23:39:06.729530     453 rocksdb.go:113] closing in-memory rocksdb instance
--- PASS: TestBatchGet-2 (0.00s)
=== RUN TestBatchMerge-2
I0629 23:39:06.730154     453 rocksdb.go:88] opening in-memory rocksdb instance

Please assign, take a look and update the issue accordingly.

@tbg tbg added the C-test-failure Broken test (automatically or manually discovered). label Jun 29, 2015
@tbg
Copy link
Member Author

tbg commented Jun 30, 2015

the stopper gets stuck on Quiesce because there's an outstanding command in Raft which doesn't proceed. The problem seems to be that Raft is stuck writing a HardState:

goroutine 6074 [runnable, locked to thread]:
github.com/cockroachdb/cockroach/storage/engine._Cfunc_DBPut(0x7f9c400aef10, 0xc2080786c0, 0xf, 0xc2087965a0, 0x21, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/storage/engine/:177 +0x63
github.com/cockroachdb/cockroach/storage/engine.(*RocksDB).Put(0xc2083200c0, 0xc2080786c0, 0xf, 0x400, 0xc2087965a0, 0x21, 0x120, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/storage/engine/rocksdb.go:147 +0x1c7
github.com/cockroachdb/cockroach/storage/engine.PutProto(0x7f9c97eefb18, 0xc208036050, 0xc2080786c0, 0xf, 0x400, 0x7f9c97eeff28, 0xc208078580, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/storage/engine/engine.go:159 +0x1e2
github.com/cockroachdb/cockroach/storage/engine.mvccPutInternal(0x7f9c97eefb18, 0xc208036050, 0x0, 0xc20810a650, 0xa, 0xa, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:719 +0x68b
github.com/cockroachdb/cockroach/storage/engine.MVCCPut(0x7f9c97eefb18, 0xc208036050, 0x0, 0xc20810a650, 0xa, 0xa, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:669 +0x3d2
github.com/cockroachdb/cockroach/storage/engine.MVCCPutProto(0x7f9c97eefb18, 0xc208036050, 0x0, 0xc20810a650, 0xa, 0xa, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:403 +0x1fc
github.com/cockroachdb/cockroach/storage.(*Range).SetHardState(0xc208328480, 0x6, 0x100000001, 0x7e46, 0x0, 0x0, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/storage/range_raftstorage.go:429 +0x1ad
github.com/cockroachdb/cockroach/multiraft.func·007()
    /go/src/github.com/cockroachdb/cockroach/multiraft/storage.go:168 +0x68e
github.com/cockroachdb/cockroach/util.func·001()
    /go/src/github.com/cockroachdb/cockroach/util/stopper.go:75 +0x51
created by github.com/cockroachdb/cockroach/util.(*Stopper).RunWorker
    /go/src/github.com/cockroachdb/cockroach/util/stopper.go:76 +0xe3

goroutine 6077 [chan receive]:
github.com/cockroachdb/cockroach/storage.(*Range).addWriteCmd(0xc208328480, 0x7f9c97ef1968, 0xc20810c900, 0x7f9c97ef5378, 0xc208784000, 0x7f9c97ef53d0, 0xc208353570, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/storage/range.go:713 +0x6e3
github.com/cockroachdb/cockroach/storage.(*Range).AddCmd(0xc208328480, 0x7f9c97ef1968, 0xc20810c900, 0x7f9c97ef5378, 0xc208784000, 0x7f9c97ef53d0, 0xc208353570, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/storage/range.go:508 +0x1bd

The MultiRaft goroutine is correspondingly stuck waiting for the WriteTask.

@tbg
Copy link
Member Author

tbg commented Jun 30, 2015

I am tempted to say this isn't our fault, but I don't know enough about potential reasons for a cgo call to get stuck.

@tbg tbg changed the title Test failure in CI build 4308 Raft stuck on writing hard state Jun 30, 2015
@bdarnell
Copy link
Contributor

Is it blocked, or is it busy-looping? There's no "blocked for 30s" message on any of the stack traces, although I'm not sure what exactly causes those to appear. I think it's more likely that we're updating HardState frequently and that just happened to be where the thread was when we reached the timeout.

@tamird
Copy link
Contributor

tamird commented Jun 30, 2015

I think the threshold for those messages is higher (1 minute?)

@tbg
Copy link
Member Author

tbg commented Jun 30, 2015

I can't tell from the Trace unfortunately, but the annotations on the stack trace I've only seen for >=1minute intervals.

@tbg
Copy link
Member Author

tbg commented Jun 30, 2015

it might well be an infinite loop - with --verbosity=1 we would know but somehow we've removed that (why?).

The command which is running is an AdminSplit, and it's in the store retry loop trying to resolve intents - for that transaction those will be on the meta descriptors. @spencerkimball went to quite some lengths to prevent something like that, but perhaps it's got something to do with it. Or we have a dangling previous split that died, and its txn isn't old enough yet to be pushed (10s, so that seems unlikely). We should have V(1) logging.

tbg added a commit to tbg/cockroach that referenced this issue Jul 1, 2015
this will a) log more, which is relevant for cockroachdb#1573 and
b) log traces for all requests made during tests.
@jess-edwards jess-edwards mentioned this issue Aug 18, 2015
78 tasks
@tbg
Copy link
Member Author

tbg commented Sep 21, 2015

With no useful logs above and many changes to the codebase (including Raft) in the meantime, I'm going to close this. If it happens again, we'll have more information.

@tbg tbg closed this as completed Sep 21, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered).
Projects
None yet
Development

No branches or pull requests

3 participants