0.9.5-nightly-827c513 process write shard error: write shard 4: engine: WAL backed up flushing to index, hit max memory #4552
System load is stable at about 30. The box has 36 cores (HT is off; we found little benefit to having it on). Traffic is arriving in a very bursty way (up to 600 Mbit/sec) - see influxdata/telegraf#298 (comment). Peak IOPS are very high (25k), but most of the time it is much lower. We are running these settings:
Our entire data dir (including the WAL) is on high-performance PCIe flash. There is no wait time to speak of, and I've never seen the devices more than 40% utilized. In the code, it says
...
My guess is that we need to increase wal-max-memory-size-threshold (which is l.MaxMemorySizeThreshold, I believe) even above 2G. The concerning thing is that we have no iowait at the time this is happening, which suggests the l.flush method isn't doing what I would expect (which would be to saturate the WAL device I/O). The fact that we didn't actually see that much I/O makes me wonder if the goroutines that are supposed to be doing the flushing are somehow blocking. Suggestions for how to profile that memory leak are very welcome (is there a way of drawing an SVG of memory by some useful separator to figure out the cause?). Also concerning is that we appear to have a memory leak in the writer node (see the RSS graph I'll shortly attach) - but not in the other cluster nodes that were accepting the same writes. Finally, we had a panic after about an hour of running. I filed #4554 for that.
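On the SVG question, one hedged approach (a sketch that assumes you can run or wrap the process with Go's net/http/pprof handler enabled; influxd may already expose an equivalent debug endpoint, so check its config first) is to serve the built-in heap profile over HTTP and render it with go tool pprof:

```go
// Minimal sketch, not InfluxDB code: expose Go heap profiles over HTTP so
// they can be rendered as an SVG call graph. The listen address and binary
// path below are assumptions for the example.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers, including /debug/pprof/heap
)

func main() {
	// With this running, something like
	//   go tool pprof -svg ./yourbinary http://localhost:6060/debug/pprof/heap > heap.svg
	// renders live heap usage grouped by the allocating function.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```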
RSS graph - all writes went to the blue node; the other two nodes are in the same cluster. Interestingly, its other stats look similar to the other cluster nodes. I guess it's possible there was backpressure in the sends to the other cluster nodes under these relatively high loads? Not sure how best to diagnose that.
We explicitly monitor RSS usage of our systems under test, though we're mostly testing single nodes right now. We have not seen any indication that we are leaking memory. https://influxdb.com/blog/2015/10/20/testing_tsm.html Are you running queries against the same node?
Thanks for the quick response. We're currently writing to testmetricsl7, and reading from testmetricsl5 (Grafana). Is there any useful way to dump memory usage in a similar fashion to pprof? Apologies for my lack of Go profiling experience...
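Not an authoritative answer to the pprof question, but as a rough illustration of what the Go runtime itself can report, here is a standalone sketch (illustrative only, not influxd code) that dumps the MemStats fields most relevant to the RSS discussion later in this thread:

```go
// Illustrative sketch, not part of influxd: dump the Go runtime memory
// statistics that matter most when comparing heap usage to process RSS.
package main

import (
	"encoding/json"
	"fmt"
	"runtime"
)

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	// HeapAlloc = live heap objects, HeapIdle = spans held but unused,
	// HeapReleased = memory returned to the OS, Sys = total obtained from
	// the OS (the number closest to what RSS should reflect).
	out, _ := json.MarshalIndent(map[string]uint64{
		"HeapAlloc":    ms.HeapAlloc,
		"HeapIdle":     ms.HeapIdle,
		"HeapReleased": ms.HeapReleased,
		"Sys":          ms.Sys,
	}, "", "  ")
	fmt.Println(string(out))
}
```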
@otoolep, are you doing any of that at-scale testing with clustering? I think what we are doing differently here is trying to make a simple cluster (number of replicas = number of nodes = 3). We know it's beta, so we don't expect it to be free of edge cases, but if you guys are almost entirely testing single nodes, that probably explains why we keep having problems!
FWIW, when we move to using a single node, we also see a stable RSS for influxd. It therefore appears that the runaway RSS is caused by the attempt to use clustering.
Spoke too soon - running a single node, it suddenly reported the following:
I'm attaching the Grafana dashboard we have for this node. I'm assuming the network traffic is a consequence of the telegraf agents retrying their flushes - we didn't make any agent changes during this time.
How are you sending data into the system? (HTTP, Graphite...) Or is it all via Telegraf? How many series are you generating? What is the sampling rate?
@otoolep
OK @daviesalex, right now I don't have any specific ideas. What you are doing is very close to what we are doing, in the sense that we run multiple single nodes regularly and hammer them with line-protocol data (which is what Telegraf emits) for days at a time. RSS remains steady for us. Of course, we don't have the same number of connections that you do -- more like 10-50 concurrent TCP connections over which line-protocol data travels. If you can, I suggest you add the following line to
Check the process table after the service has started to ensure these flags have been passed to the influxd process. This will write a memory profile dump to that file on process exit. The next time you have a node start consuming an excessive amount of memory, stop the service and send us this file. We can then examine it for hints on where the memory is being used.
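For context, a -memprofile flag in Go programs is commonly wired up roughly like the sketch below. This shows the general pattern only and is not the actual influxd implementation:

```go
// Hedged sketch of how a -memprofile flag is typically implemented in a Go
// program: write a heap profile to the named file when the process exits.
package main

import (
	"flag"
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	memprofile := flag.String("memprofile", "", "write a memory profile to this file on exit")
	flag.Parse()

	if *memprofile != "" {
		defer func() {
			f, err := os.Create(*memprofile)
			if err != nil {
				log.Fatal(err)
			}
			defer f.Close()
			// WriteHeapProfile emits the binary profile that `go tool pprof` reads.
			if err := pprof.WriteHeapProfile(f); err != nil {
				log.Fatal(err)
			}
		}()
	}

	// ... the rest of the program would run here ...
}
```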
@otoolep memprofile with a 93GB heap (after about 1 minute of uptime): https://gist.github.com/wrigtim/a5dac2b299a50fb9ef08 We're going to back off the ... Thanks again for all your support here! If there's anything else you'd like us to run/profile, we're more than happy to.
Just adding stats for the above:
@wrigtim -- that file is not what I was actually expecting. I was looking for a binary memory profile dump, which we feed into go tool pprof.
Can you graph the Go runtime params on your system? I'd like to see whether this is mmap usage or Go heap. See https://influxdb.com/blog/2015/09/22/monitoring_internal_show_stats.html for examples of the queries to run.
Is there a big difference between Go runtime alloc and the RSS of the process? I am wondering if any difference could be accounted for by mmaps.
@otoolep To explain the timings for the above graph: the stable portion was with ~1000 telegraf clients spread across many geographically diverse datacenters. The other 3000 nodes were blocked by iptables rules... At 11:15 I removed one of the iptables rules, which allowed around 500 additional nodes to connect (all from one DC). At this point you see points_written_ok plummet, and the RSS and heap start to increase in what appears to be a fairly proportional way. Does this help? I'm just shutting down the database to get the output from -memprofile, although it looks to be ASCII as before. I'll try loading it into go pprof before sending anything over.
I think I am experiencing the same issue:
It looks like the Go runtime is allocating a lot of memory from the system for "HeapIdle": The spike near the end of this 2-day graph is an increase in points written by around a factor of 10. Some hours later, memory usage reached the RAM+swap limits and points started being dropped. Some minutes later (even after I had reduced the number of writes again), the process OOM'd: As usual, after being killed that way, InfluxDB refuses to come up again:
Great info on this issue. It would be nice if we could create a standard Grafana dashboard template to monitor all the important metrics from the _internal db.
Sounds like a great idea. Please note we just changed the names of the metrics. Execute SHOW MEASUREMENTS against the internal database for the latest names.
Do a SELECT * on each measurement to see the new field names.
@otoolep, @pauldix ...any ideas on the graph we included a couple of days ago? What's interesting with this development is that the app stayed up (no panic), but writes were blocked for a long period while write_requests continued to grow... so something is still listening for connections but unable to commit fast enough to the WAL/index.
tsm1 is in active development right now, @sebito91, as you probably know. The code may yet undergo significant changes, so these issues may be solved soon as it stabilizes. If not, we'll definitely investigate, as all tsm1 issues must be addressed before 0.9.5 ships.
@otoolep, here is another screenshot using the latest nightly... you can see the RSS grow astronomically and writes keep up for a time, until we finally fluctuate out of control and crap out. Is this at all related to garbage collection - are we holding onto metrics in memory after the flush to index?
To clarify my point (and maybe point out my golang inadequacy), the inflated RSS may come not from the writes but from the reads, with files left open after a read either inadvertently or for the GC to collect on our behalf. Maybe we need to be explicit and close the files after reading to prevent leaking resources? For example, in the tsdb/engine/tsm1/tsm1.go readFields() function we open the file for reading given the range we're after, but we never close it before returning. Pretty much every read function in tsm1.go does the same thing (return without close). At the very least, should we add something like the following?
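A minimal sketch of the pattern being suggested; readAllFields is a hypothetical stand-in rather than the real tsm1.go function, and the original snippet from this comment is not reproduced here:

```go
package main

import (
	"io/ioutil"
	"os"
)

// readAllFields is a hypothetical stand-in for a tsm1-style read helper; the
// point is that the deferred Close releases the file on every return path.
func readAllFields(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close() // closed even on early or error returns

	return ioutil.ReadAll(f)
}

func main() {
	if b, err := readAllFields("/etc/hostname"); err == nil {
		os.Stdout.Write(b)
	}
}
```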
@otoolep For info - we've upgraded the nightly from 6ecb62e to 664b20e (Go 1.4.2 + a few other bits). I don't want to get too excited, but so far we're seeing:
...but it's only been up 2 hours so far. Will leave it over the weekend and report back (with hopefully good news).
Should be fixed in |
We upgraded to 0.9.5-nightly-827c513 today to take the fix for #4388. It was running fairly well for around 40 minutes (no errors in logs), until the node we are writing to in the 3-node cluster started reporting a large number of the below write failures. The other two nodes reported the same errors shortly afterwards.
Directory listing for telegraf/default/4/: https://gist.github.com/wrigtim/d0c086d589539dda0ace
Currently running TSM/WAL on PCIe SSDs to hopefully reduce IO performance impact...