Mysterious exit redux #12141
Hmm. Can you check whether it got OOM-killed? The process must have been SIGKILL'ed, or else there would be a bunch of shutdown-related messages at the end of the log.
Sure, where would I find that? I don't see anything in /var/log/kern.log, or anything in syslog or daemon.log that seems relevant (I grepped them for 'cockroach'). (Also checked /var/log/messages, sorry.)
What did the dmesg output show?
Also no smoking gun (to my eyes) in dmesg. Here's the last bit of that: [ 4.652247] random: nonblocking pool is initialized
BTW, there are times I've done 'top' and seen cockroach taking 50% under %MEM, which seems... strange? I thought it throttles itself to 25% (or is that just its internal caches)? In my app I am possibly 'abusing' things, BTW: I do store large blobs (column type BYTES), between say 30k and 1MB. Oh, it just crashed again while I was looking around the machine, and I got a stack dump to stderr (I guess?). I'm attaching it here! It does say OOM, so... good info at least.
It limits its RocksDB cache size to 25% of the machine's physical memory by default, but other parts of the system use memory too (e.g. SQL queries are also allowed to use up to 25% of system memory). There are more details about cockroach's memory usage in https://www.cockroachlabs.com/blog/memory-usage-cockroachdb/ What kind of SQL queries are you throwing at it / how much data are you storing? @knz, does anything stand out to you?
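For reference, a minimal sketch of how those limits can be tuned at startup. Flag availability and accepted value formats depend on the CockroachDB version (this issue is on beta-20161201), so treat the exact invocation as an assumption rather than a verified recipe:
# Cap the RocksDB block cache explicitly instead of relying on the 25%-of-RAM default.
# The store path and HTTP port mirror the [config] lines quoted later in this issue.
cockroach start --store=/home/sonic/cockroach-data/ --http-port 8280 --cache=2GiB
# Later releases also expose --max-sql-memory to bound SQL working memory;
# that flag may not exist in this beta.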
I see a table scan in this panic -- can you tell us which query was running when this failed?
Unfortunately I don't know for certain; there are multiple clients and I don't have fine-grained logging. I think the most likely culprit was one of the following (possibly several of these queries in flight). Here's the table def: it might also have been a query like this: UPDATE stats SET value=value+$1 WHERE key=$2 ($1=1, $2=some string)
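The table definition referenced above isn't reproduced in this thread. Purely as a hypothetical reconstruction of the workload described (a key/value stats table plus rows carrying 30k-1MB BYTES blobs), it may have looked roughly like the following; the table and column names are illustrative assumptions, not the real schema:
# Hypothetical schema and the UPDATE above with its placeholders filled in,
# run against an insecure local node.
cockroach sql --insecure -e "CREATE TABLE stats (key STRING PRIMARY KEY, value INT)"
cockroach sql --insecure -e "CREATE TABLE blobs (id STRING PRIMARY KEY, data BYTES)"
cockroach sql --insecure -e "UPDATE stats SET value = value + 1 WHERE key = 'some string'"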
BTW it did it again, similar stack trace (attaching for completeness).
In both cases the query is a plain SELECT.
"If" there is another crash -- I like your optimism, ha :) Anyhow, I restarted with that flag. The latest crash has a slightly different stack signature. See attached for the last bit of the INFO log & dump.
BTW I just saw this in top (note mem usage by cockroach)
cockroach_dump_highmemusage.txt
Amazingly, it did NOT crash, although I got a dump of the traces, attached. (Also mixed in with the trace is some of the INFO log; I was tail -f'ing it in the same window.)
Just an update: it did crash a few more times, with nothing obvious in the logs, and I didn't catch stack traces. What I ended up doing was (1) no longer storing byte blobs as one of the fields, and (2) reindexing the same table so that rowid wasn't involved (there was a better field to index on; note that the blob field was never in an index). Since then I haven't had any crashes, probably because it's no longer trying to do anything with those blobs, but I don't know for certain, obviously (and if it was simply a scan, it doesn't seem like that would have fixed it). So for now I'm OK with closing this bug, as I won't be able to provide additional details, but feel free to keep it open if you think it's going to help you / has enough info for further debugging. Thanks for the engagement throughout, I really appreciate it!
Huh, now that you say this I actually have an idea -- how big were these blobs exactly? (I know we're not doing size checks in that part of the code yet, so you may have run into an edge case.)
Sorry for the slow response -- the size varied, from say 32k to 1MB, averaging around 150k.
Another question about this: I see there were several requests in flight. How many active connections were there on the crashing node?
FWIW the crash is in our RocksDB interface (that's what MVCCIterate is right?):
@petermattis does this ring a bell?
An out of memory error in RocksDB (or C++) usually indicates there was a memory leak on the Go side. That is, the allocation in C++ was the straw that broke the camel's back, but not the real cause of the leak. If this leak/crash is reproducible, it would be useful to get Go memory profiles. Looking at the earlier messages on this issue, I see the runtime stats posted in the issue body below:
Notice that over the course of 10s, the Go allocated memory jumped from 60 MiB to 3.6 GiB and the RSS from 2.5 GiB to 6.0 GiB.
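For anyone debugging something similar, one generic way to capture a Go heap profile from a running node is the standard net/http/pprof endpoint served on the admin HTTP port. This is a sketch assuming an insecure (plain HTTP) setup and the --http-port 8280 shown in the [config] lines quoted below:
# Fetch and inspect a heap profile interactively (Go toolchain required).
go tool pprof http://localhost:8280/debug/pprof/heap
# Or just save the raw profile for later analysis.
curl -o heap.pprof http://localhost:8280/debug/pprof/heap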
I moved the blobs out of the SQL DB into a blob-specific store, so I haven't seen this happen since. Probably won't be able to reproduce it from my end, sorry :(
Seems like there is nothing more that can be investigated here.
Please follow the steps below to help us help you.
Please supply the header (i.e. the first few lines) of your most recent
log file for each node in your cluster. On most unix-based systems
running with defaults, this boils down to the output of
grep -F '[config]' cockroach-data/logs/cockroach.INFO
When log files are not available, supply the output of
cockroach version
and all flags/environment variables passed to
cockroach start
instead.
I161206 21:45:08.553476 1 util/log/clog.go:1002 [config] file created at: 2016/12/06 21:45:08
I161206 21:45:08.553476 1 util/log/clog.go:1002 [config] running on machine: sonicfe
I161206 21:45:08.553476 1 util/log/clog.go:1002 [config] binary: CockroachDB beta-20161201 (linux amd64, built 2016/12/01 18:32:34, go1.7.3)
I161206 21:45:08.553476 1 util/log/clog.go:1002 [config] arguments: [cockroach start --store=/home/sonic/cockroach-data/ --http-port 8280]
The node stopped responding and the process ended. Nothing special in the logs. Here's the last bit of the INFO and WARNING logs:
I161207 08:54:29.366580 60 server/status/runtime.go:228 runtime stats: 2.5 GiB RSS, 60 goroutines, 28 MiB/65 MiB/245 MiB GO alloc/idle/total, 1.8 GiB/2.4 GiB CGO alloc/total, 1000.21cgo/sec, 0.06/0.00 %(u/s)time, 0.00 %gc (4x)
I161207 08:54:39.366627 60 server/status/runtime.go:228 runtime stats: 2.5 GiB RSS, 60 goroutines, 27 MiB/66 MiB/245 MiB GO alloc/idle/total, 1.8 GiB/2.4 GiB CGO alloc/total, 992.40cgo/sec, 0.06/0.00 %(u/s)time, 0.00 %gc (3x)
I161207 08:54:49.366463 60 server/status/runtime.go:228 runtime stats: 2.5 GiB RSS, 60 goroutines, 25 MiB/68 MiB/245 MiB GO alloc/idle/total, 1.8 GiB/2.4 GiB CGO alloc/total, 929.42cgo/sec, 0.05/0.00 %(u/s)time, 0.00 %gc (3x)
I161207 08:54:59.366518 60 server/status/runtime.go:228 runtime stats: 2.5 GiB RSS, 60 goroutines, 29 MiB/65 MiB/245 MiB GO alloc/idle/total, 1.8 GiB/2.4 GiB CGO alloc/total, 1158.09cgo/sec, 0.06/0.00 %(u/s)time, 0.00 %gc (3x)
I161207 08:55:09.324103 57 gossip/gossip.go:435 [n1] gossip status (ok, 1 node)
gossip client (0/3 cur/max conns)
gossip server (0/3 cur/max conns, infos 0/0 sent/received, bytes 0B/0B sent/received)
I161207 08:55:09.366643 60 server/status/runtime.go:228 runtime stats: 2.5 GiB RSS, 60 goroutines, 60 MiB/38 MiB/245 MiB GO alloc/idle/total, 1.8 GiB/2.4 GiB CGO alloc/total, 898.79cgo/sec, 0.05/0.00 %(u/s)time, 0.00 %gc (2x)
I161207 08:55:19.465146 60 server/status/runtime.go:228 runtime stats: 6.0 GiB RSS, 69 goroutines, 3.6 GiB/20 MiB/3.8 GiB GO alloc/idle/total, 1.8 GiB/2.4 GiB CGO alloc/total, 7333.36cgo/sec, 0.46/0.26 %(u/s)time, 0.00 %gc (12x)
and the entirety of the most recent warning log:
W161206 21:45:08.716350 1 util/log/clog.go:1002 [config] file created at: 2016/12/06 21:45:08
W161206 21:45:08.716350 1 util/log/clog.go:1002 [config] running on machine: sonicfe
W161206 21:45:08.716350 1 util/log/clog.go:1002 [config] binary: CockroachDB beta-20161201 (linux amd64, built 2016/12/01 18:32:34, go1.7.3)
W161206 21:45:08.716350 1 util/log/clog.go:1002 [config] arguments: [cockroach start --store=/home/sonic/cockroach-data/ --http-port 8280]
W161206 21:45:08.716350 1 util/log/clog.go:1002 line format: [IWEF]yymmdd hh:mm:ss.uuuuuu goid file:line msg
W161206 21:45:08.716348 1 server/server.go:154 [n?] running in insecure mode, this is strongly discouraged. See --insecure.
W161206 21:45:08.716844 1 gossip/gossip.go:1124 [n?] no resolvers found; use --join to specify a connected node
W161206 21:45:09.293588 1 storage/store.go:817 [n1] storeMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Store).Start for 302.927783ms (>100ms):
goroutine 1 [running]:
runtime/debug.Stack(0x1e7f6c8, 0x1e7f708, 0x3b)
/usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/cockroachdb/cockroach/pkg/util/syncutil.ThresholdLogger.func1(0x120e4fa7)
/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/timedmutex.go:67 +0xe1
github.com/cockroachdb/cockroach/pkg/util/syncutil.(*TimedMutex).Unlock(0xc4203589f0)
/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/timedmutex.go:97 +0xb6
github.com/cockroachdb/cockroach/pkg/storage.(*Store).Start(0xc420358700, 0x2533aa0, 0xc4203dc7e0, 0xc42016d4d0, 0xc420360c60, 0x1)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:1098 +0x4d1
github.com/cockroachdb/cockroach/pkg/server.(*Node).initStores(0xc420158000, 0x2533aa0, 0xc4203dc7e0, 0xc4203ac710, 0x1, 0x1, 0xc42016d4d0, 0x0, 0x0, 0x2)
/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:417 +0x209
github.com/cockroachdb/cockroach/pkg/server.(*Node).start(0xc420158000, 0x2533aa0, 0xc4203dc7e0, 0x2525e60, 0xc420302380, 0xc4203ac710, 0x1, 0x1, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:339 +0x11d
github.com/cockroachdb/cockroach/pkg/server.(*Server).Start(0xc4201ee400, 0x2533aa0, 0xc4203dc7e0, 0xc4201ceec0, 0x1)
/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:600 +0xd03
github.com/cockroachdb/cockroach/pkg/cli.runStart(0x250f500, 0xc420282cc0, 0x0, 0x3, 0x0, 0xc4202a8dc0)
/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:343 +0x8fa
github.com/cockroachdb/cockroach/pkg/cli.maybeDecorateGRPCError.func1(0x250f500, 0xc420282cc0, 0x0, 0x3, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/cli/error.go:30 +0x5a
github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute(0x250f500, 0xc420282c30, 0x3, 0x3, 0x250f500, 0xc420282c30)
/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:632 +0x23e
github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x250b100, 0xc4200101a0, 0xb9d07b536048cad2, 0xc4200101d8)
/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:722 +0x367
github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute(0x250b100, 0xc4200101c0, 0xb9d07b536048cad2)
/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:681 +0x2b
github.com/cockroachdb/cockroach/pkg/cli.Run(0xc42000c0b0, 0x4, 0x4, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/cli/cli.go:95 +0x6d
main.main()
/go/src/github.com/cockroachdb/cockroach/main.go:35 +0xe1
The only thing I can think of is that I started the process interactively, not from a process manager -- I stopped running it under supervisord because it was crashing so often (on the previous beta from Nov) that I wanted to know whenever it went down.
But I am pretty certain I started it with --background (regardless of what the [config] lines say), and that that terminal is still active.
Even if it died because the shell that started it died, I'd expect to see SOMETHING in the logs, but as I said, I don't think that happened.
No evidence of why the process ended.