
0.9.5-nightly-827c513 process write shard error: write shard 4: engine: WAL backed up flushing to index, hit max memory #4552

Closed
wrigtim opened this issue Oct 23, 2015 · 30 comments

@wrigtim

wrigtim commented Oct 23, 2015

We upgraded to 0.9.5-nightly-827c513 today to take the fix for #4388. It was running fairly well for around 40 minutes (no errors in logs), until the node we are writing to in the 3-node cluster started reporting a large number of the below write failures. The other two nodes reported the same errors shortly afterwards.

Directory listing for telegraf/default/4/: https://gist.github.com/wrigtim/d0c086d589539dda0ace

Currently running TSM/WAL on PCIe SSDs to hopefully reduce IO performance impact...

[write] 2015/10/23 11:07:26 write failed for shard 4 on node 2: error code 1: write shard 4: engine: WAL backed up flushing to index, hit max memory
[write] 2015/10/23 11:07:26 write failed for shard 4 on node 2: error code 1: write shard 4: engine: WAL backed up flushing to index, hit max memory
[write] 2015/10/23 11:07:26 write failed for shard 4 on node 2: error code 1: write shard 4: engine: WAL backed up flushing to index, hit max memory
... (same [write] message repeated many more times) ...
[handoff] 2015/10/23 11:07:26 remote write failed: error code 1: write shard 4: engine: WAL backed up flushing to index, hit max memory
... (further repeats of the [write] message) ...
@daviesalex
Contributor

System load is stable at about 30. Box has 36 cores (HT is off; we found little benefit to it being on).

Traffic arrives in a very bursty way (up to 600 Mbit/sec) - see influxdata/telegraf#298 (comment). Peak IOPS are very high (around 25k), but most of the time the rate is much lower.

We are running these settings:

  wal-flush-memory-size-threshold = 26214400 # Default 5MB; ours 25MB
  wal-max-memory-size-threshold = 2147483648 # Default 100MB; ours 2GB

Our entire data dir (including the WAL) is on high-performance PCIe flash. There is no I/O wait to speak of, and I've never seen the devices more than 40% utilized.

In the code, it says:

// addToCache will add the points, measurements, and fields to the cache and return true if successful. They will be queryable
// immediately after return and will be flushed at the next flush cycle. Before adding to the cache we check if we're over the
// max memory threshold. If we are we request a flush in a new goroutine and return false, indicating we didn't add the values
// to the cache and that writes should return a failure.
func (l *Log) addToCache(points []models.Point, fields map[string]*tsdb.MeasurementFields, series []*tsdb.SeriesCreate, checkMemory bool) bool {

...

                if l.memorySize > l.MaxMemorySizeThreshold {
                        return false
                }

My guess is that we need to increase wal-max-memory-size-threshold (which I believe maps to l.MaxMemorySizeThreshold) to even more than 2 GB.

The concerning thing is that we see no iowait while this is happening, which suggests the l.flush method isn't doing what I would expect (which would be to saturate the WAL device's I/O). The fact that we didn't see much I/O makes me wonder if the goroutines that are supposed to be doing the flushing are somehow blocking. Suggestions for how to profile that memory growth are very welcome (is there a way of drawing an SVG of memory usage, broken down by some useful separator, to figure out the cause?)
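
Something like this is roughly what I had in mind, if we can get a heap profile out of the process (just a sketch; the binary and profile paths here are hypothetical, and graphviz is needed for the SVG rendering):

  # render an in-use-heap call graph, grouped by allocating function
  go tool pprof -inuse_space -svg /usr/bin/influxd /tmp/heap.profile > heap.svg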

Also concerning is that we appear to have a memory leak in the writer node (see the RSS graph I'll shortly attach) - but not in the other cluster nodes that were accepting the same writes.

Finally, we had a panic after about an hour running. I filed #4554 for that.

@daviesalex
Contributor

RSS graph - all writes went to the blue node; the other two nodes are in the same cluster. Interestingly, the writer's other stats look similar to those of the other cluster nodes.

I guess it's possible there was backpressure in the sends to the other cluster nodes under these relatively high loads? Not sure how best to diagnose that.

[image: RSS graph of the three cluster nodes]

@otoolep
Contributor

otoolep commented Oct 24, 2015

We explicitly monitor RSS usage of our systems under test, though we're mostly testing single nodes right now. We have not seen any indication that we are leaking memory.

https://influxdb.com/blog/2015/10/20/testing_tsm.html

Are you running queries against the same node?

@wrigtim
Author

wrigtim commented Oct 24, 2015

Thanks for the quick response. We're currently writing to testmetricsl7, and reading from testmetricsl5 (Grafana). Is there any useful way to dump memory usage in a similar fashion to pprof? Apologies for my lack of Go profiling experience...

@daviesalex
Contributor

@otoolep, are you doing any of that at-scale testing with clustering?

I think what we are doing differently here is trying to run a simple cluster (number of replicas = number of nodes = 3). We know it's beta, so we don't expect it to be free of edge cases, but if you are almost entirely testing single nodes, that probably explains why we keep having problems!

@daviesalex
Contributor

FWIW, when we move to using a single node, we also see a stable RSS for influxd.

It therefore appears that the runaway RSS is caused by our attempt to use clustering.

@wrigtim
Author

wrigtim commented Oct 27, 2015

Spoke too soon - running as a single node, it suddenly reported the following:

[monitor] 2015/10/27 08:27:10 failed to store statistics: timeout
<followed by>
[write] 2015/10/27 08:34:02 write failed for shard 2 on node 1: engine: WAL backed up flushing to index, hit max memory

I'm attaching the Grafana dash we have for this node. I'm assuming the network traffic is a consequence of the telegraf agents retrying their flushes - we didn't make any agent changes during this time.

[screenshot: Grafana - metrics backend - production, 2015-10-27 13:37]

@otoolep
Contributor

otoolep commented Oct 27, 2015

How are you sending data into the system? (HTTP, Graphite...) Or is it all via Telegraf? How many series are you generating? What is the sampling rate?

@daviesalex
Contributor

@otoolep
Entirely via Telegraf. Sample rate is 10 seconds, batching and sending every 60 seconds with a jitter of 60 seconds. Pretty much entirely stock plugins. ~4k nodes.
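
For reference, the relevant agent settings look roughly like this (a sketch of our telegraf.conf [agent] section; exact keys may vary between Telegraf versions):

  [agent]
    interval = "10s"        # sampling interval
    flush_interval = "60s"  # batch and send once a minute
    flush_jitter = "60s"    # spread flushes over a 60-second window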

@otoolep
Contributor

otoolep commented Oct 28, 2015

OK @daviesalex

Right now I don't have any specific ideas. What you are doing is very close to what we are doing, in the sense that we regularly run multiple single nodes and hammer them with line-protocol data (which is what Telegraf emits) for days at a time. RSS remains steady for us. Of course, we don't have the same number of connections that you do -- more like 10-50 concurrent TCP connections over which line-protocol data travels.

If you can, I suggest adding the following line to /etc/default/influxdb (assuming you are running it as a service):

INFLUXD_OPTS="-memprofile /tmp/memprofile"

Check the process table after the service has started to ensure these flags have been passed to influxd.

This will write a memory profile dump to that file on process exit. The next time a node starts consuming an excessive amount of memory, stop the service and send us this file. We can then examine it for hints on where the memory is being used.
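
Once you have the dump, a quick first pass with pprof looks something like this (a sketch; adjust the binary path to wherever influxd lives on your systems):

  go tool pprof /opt/influxdb/influxd /tmp/memprofile
  (pprof) top20    # largest in-use allocations, by function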

@wrigtim
Author

wrigtim commented Oct 28, 2015

@otoolep
So right now the best way I have to reproduce this is by starting influxd with all ~4k telegraf instances already running. The RSS grows rather quickly, but what's odd is that the system doesn't appear to be CPU or IO-bound. Sure, there's a lot of CPU activity, but it's far from pinned, and the SSDs are hardly being touched....

memprofile with 93GB heap (after about 1 minute of uptime): https://gist.github.com/wrigtim/a5dac2b299a50fb9ef08

We're going to back off the flush_interval, but I'll take some packet captures beforehand to look for microbursts. There are around 1,000 nodes about 3 switch-hops from the database, and given our timesync is fairly tight, I'm expecting the traffic is fairly bursty (at least at the top of any given second).

Thanks again for all your support here! If there's anything else you'd like us to run/profile we're more than happy to.

@wrigtim
Author

wrigtim commented Oct 28, 2015

Just adding stats for the above:

https://gist.github.com/wrigtim/c150481d33167b0bddd3

@otoolep
Contributor

otoolep commented Oct 28, 2015

@wrigtim -- that file is not what I was actually expecting. I was looking for a binary memory profile dump, which we feed into go pprof. Am I missing something?

@otoolep
Contributor

otoolep commented Oct 28, 2015

Can you graph the Go runtime params on your system? I'd like to see whether this is mmap usage or Go heap. See

https://influxdb.com/blog/2015/09/22/monitoring_internal_show_stats.html

for examples of the queries to run.
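
For instance, something along these lines from the influx shell should show whether the growth is in the Go heap or outside it (a sketch; the measurement and field names are what recent builds use, so verify them with SHOW MEASUREMENTS if yours differ):

  USE _internal
  SELECT HeapAlloc, HeapSys, HeapIdle, Sys FROM "runtime" WHERE time > now() - 6h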

@sebito91
Contributor

Attached is a graph with HeapAlloc, HeapSys and NumGoRoutines for the past few hours... if you want/need others, let me know.

[screenshot: debug dashboard showing HeapAlloc, HeapSys and NumGoRoutines]

@otoolep
Contributor

otoolep commented Oct 28, 2015

Is there a big difference between the Go runtime alloc and the RSS of the process? I am wondering if any difference could be accounted for by mmaps.

@wrigtim
Author

wrigtim commented Oct 29, 2015

@otoolep
[screenshot: Grafana debug dashboard, 2015-10-29 11:54]

@wrigtim
Author

wrigtim commented Oct 29, 2015

@otoolep So to explain the timings for the above graph - the stable portion was with ~1000 telegraf clients spread across many geographically diverse datacenters. The other 3000 nodes were blocked by iptables rules...

At 11:15 I removed one of the iptables rules, which allowed around 500 additional nodes to connect (all from one DC). At this point you can see points_written_ok plummet, and RSS and heap start to increase in what appears to be a fairly proportional way. Does this help?

I'm just shutting down the database to get the output from -memprofile, although it looks to be ASCII as before. I'll try loading it into go pprof before sending anything over.

@KiNgMaR

KiNgMaR commented Oct 29, 2015

I think I am experiencing the same issue:

  • memory usage is constantly growing
  • more writes accelerate the growth

It looks like the Go runtime is allocating a lot of memory from the system for "HeapIdle":

[screenshot: memory usage graph, 2015-10-29 09:04]

The spike near the end of this 2-day graph is an increase in points written by around a factor of 10. A few hours later, memory usage reached the RAM+swap limit and points started being dropped. A few minutes later (even after I had reduced the number of writes again), the process was OOM-killed: Out of memory: Kill process 29675 (influxd).

As usual, after being killed that way, InfluxDB refuses to come up again: run: open server: open tsdb store: failed to open shard 1: open engine: unable to parse 'cq_xxx_rows_per_minute_total 1446123660000000000': invalid field format

@sebito91
Contributor

Here's an updated version of the dashboard; you can see where the influxdb process blocks incoming writes even though write_req continues unabated... this is precisely where the shard(s) get overrun.

[screenshot: dashboard showing writes blocked while write_req continues]

@syepes

syepes commented Oct 30, 2015

Great info on this issue. It would be nice if we could create a standard Grafana dashboard template to monitor all the important metrics from the _internal db.

@otoolep
Contributor

otoolep commented Oct 31, 2015

Sounds like a great idea.

Please note we just changed the names of the metrics, so they are more in line with Go style. Runtime metrics are not affected, since they are already idiomatic.

Execute SHOW MEASUREMENTS against the internal database for the latest names.


@otoolep
Contributor

otoolep commented Oct 31, 2015

Do a SELECT * on each measurement to see the new field names.
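
For example (a sketch; the measurement name here is just illustrative - use whatever SHOW MEASUREMENTS returns on your build):

  SHOW MEASUREMENTS
  SELECT * FROM "runtime" LIMIT 1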


@sebito91
Contributor

sebito91 commented Nov 2, 2015

@otoolep, @pauldix ...any ideas on the graph we included a couple of days ago?

What's interesting about this development is that the app stayed up (no panic), but writes were blocked for a long period while write_requests continued to grow... so something is still listening for connections but is unable to commit to the WAL/index fast enough.

@otoolep
Contributor

otoolep commented Nov 5, 2015

tsm1 is under active development right now, @sebito91, as you probably know. The code may still undergo significant changes, so these issues may be solved soon.

If not, we'll definitely investigate, as all tsm1 issues must be addressed before 0.9.5 ships.

@sebito91
Contributor

sebito91 commented Nov 6, 2015

@otoolep, here is another screenshot using the latest nightly... you can see RSS grow astronomically while writes keep up for a time, until we finally fluctuate out of control and crap out. Is this at all related to garbage collection - are we holding onto metrics in memory after the flush to the index?

[screenshot: RSS growth on the latest nightly]

@sebito91
Contributor

sebito91 commented Nov 6, 2015

To clarify my point (and maybe expose my golang inadequacy): the inflated RSS may come not from the writes but from the reads, with files left open after reading either inadvertently or for the GC to collect on our behalf. Maybe we need to explicitly close the files after reading to prevent leaking resources?

For example, in the readFields() function in tsdb/engine/tsm1/tsm1.go we open the file for reading, but we never close it before returning. Pretty much every read function in tsm1.go does the same thing (return without closing). At the very least, should we add something like defer f.Close() after the os.OpenFile call, as we do for readSeries()?

func (e *Engine) readFields() (map[string]*tsdb.MeasurementFields, error) {
    fields := make(map[string]*tsdb.MeasurementFields)

    f, err := os.OpenFile(filepath.Join(e.path, FieldsFileExtension), os.O_RDONLY, 0666)
    if os.IsNotExist(err) {
        return fields, nil
    } else if err != nil {
        return nil, err
    }
    b, err := ioutil.ReadAll(f)
    if err != nil {
        return nil, err
    }

    data, err := snappy.Decode(nil, b)
    if err != nil {
        return nil, err
    }

    if err := json.Unmarshal(data, &fields); err != nil {
        return nil, err
    }

    return fields, nil
}
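
Concretely, something like this is what I have in mind (an untested sketch of the change; the same pattern would apply to the other read functions that return without closing):

    f, err := os.OpenFile(filepath.Join(e.path, FieldsFileExtension), os.O_RDONLY, 0666)
    if os.IsNotExist(err) {
        return fields, nil
    } else if err != nil {
        return nil, err
    }
    // release the file descriptor on every return path instead of leaving it to the GC
    defer f.Close()

    b, err := ioutil.ReadAll(f)
    // ... rest of readFields() unchanged ...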

@otoolep
Contributor

otoolep commented Nov 6, 2015

Thanks @sebito91 -- could be something there. We'll need to check.

Any comment @jwilder ?

@wrigtim
Author

wrigtim commented Nov 13, 2015

@otoolep For info - we've upgraded the nightly from 6ecb62e to 664b20e (Go 1.4.2 + a few other bits). I don't want to get too excited, but so far we're seeing:

  • 120M+ points ingested from 800 clients with 0 write failures (around 20K points per sec)
  • RSS still creeping up, but leveling off around 9GB
  • CPU utilization has plummeted
  • 15-min load average has dropped from 10+ to 0.3

...but it's only been up 2 hours so far. Will leave it over the weekend and report back (hopefully with good news).

@jwilder
Contributor

jwilder commented Dec 9, 2015

Should be fixed in 0.9.6 release.

@jwilder jwilder closed this as completed Dec 9, 2015