
0.9.5-nightly-827c513 process write shard error: write shard 4: engine: WAL backed up flushing to index, hit max memory #4552

Closed
wrigtim opened this issue Oct 23, 2015 · 30 comments

@wrigtim

wrigtim commented Oct 23, 2015

We upgraded to 0.9.5-nightly-827c513 today to take the fix for #4388. It was running fairly well for around 40 minutes (no errors in logs), until the node we are writing to in the 3-node cluster started reporting a large number of the below write failures. The other two nodes reported the same errors shortly afterwards.

Directory listing for telegraf/default/4/: https://gist.github.com/wrigtim/d0c086d589539dda0ace

Currently running TSM/WAL on PCIe SSDs to hopefully reduce IO performance impact...

[write] 2015/10/23 11:07:26 write failed for shard 4 on node 2: error code 1: write shard 4: engine: WAL backed up flushing to index, hit max memory
[write] 2015/10/23 11:07:26 write failed for shard 4 on node 2: error code 1: write shard 4: engine: WAL backed up flushing to index, hit max memory
[write] 2015/10/23 11:07:26 write failed for shard 4 on node 2: error code 1: write shard 4: engine: WAL backed up flushing to index, hit max memory
... (same [write] message repeated many more times) ...
[handoff] 2015/10/23 11:07:26 remote write failed: error code 1: write shard 4: engine: WAL backed up flushing to index, hit max memory
... (further repeats of the [write] message) ...
@daviesalex
Contributor

System load is stable at about 30. Box has 36 cores (HT is off; we found little benefit to it being on).

Traffic arrives in a very bursty way (up to 600 Mbit/sec) - see influxdata/telegraf#298 (comment). Peak IOPS are very high (around 25k), but most of the time the rate is much lower.

We are running these settings:

  wal-flush-memory-size-threshold = 26214400 # Default 5MB; ours 25MB
  wal-max-memory-size-threshold = 2147483648 # Default 100MB; ours 2GB

Our entire data dir (including the WAL) is on high-performance PCIe flash. There is no I/O wait to speak of, and I've never seen the devices more than 40% utilized.

In the code, it says:

// addToCache will add the points, measurements, and fields to the cache and return true if successful. They will be queryable
// immediately after return and will be flushed at the next flush cycle. Before adding to the cache we check if we're over the
// max memory threshold. If we are we request a flush in a new goroutine and return false, indicating we didn't add the values
// to the cache and that writes should return a failure.
func (l *Log) addToCache(points []models.Point, fields map[string]*tsdb.MeasurementFields, series []*tsdb.SeriesCreate, checkMemory bool) bool {

...

                if l.memorySize > l.MaxMemorySizeThreshold {
                        return false
                }

My guess is that we need to increase wal-max-memory-size-threshold (which I believe maps to l.MaxMemorySizeThreshold) to even more than 2 GB.

The concerning thing is that we see no iowait while this is happening, which suggests the l.flush method isn't doing what I would expect (which would be to saturate the WAL device's I/O). The fact that we didn't see much I/O makes me wonder if the goroutines that are supposed to be doing the flushing are somehow blocking. Suggestions for how to profile that memory growth are very welcome (is there a way of drawing an SVG of memory usage, broken down by some useful separator, to figure out the cause?)
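
Something like this is roughly what I had in mind, if we can get a heap profile out of the process (just a sketch; the binary and profile paths here are hypothetical, and graphviz is needed for the SVG rendering):

  # render an in-use-heap call graph, grouped by allocating function
  go tool pprof -inuse_space -svg /usr/bin/influxd /tmp/heap.profile > heap.svg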

Also concerning is that we appear to have a memory leak in the writer node (see the RSS graph I'll shortly attach) - but not in the other cluster nodes that were accepting the same writes.

Finally, we had a panic after about an hour running. I filed #4554 for that.

@daviesalex
Contributor

RSS graph - all writes went to the blue node; the other two nodes are in the same cluster. Interestingly, the writer's other stats look similar to those of the other cluster nodes.

I guess it's possible there was backpressure in the sends to the other cluster nodes under these relatively high loads? Not sure how best to diagnose that.

[image: RSS graph of the three cluster nodes]

@otoolep
Contributor

otoolep commented Oct 24, 2015

We explicitly monitor RSS usage of our systems under test, though we're mostly testing single nodes right now. We have not seen any indication that we are leaking memory.

https://influxdb.com/blog/2015/10/20/testing_tsm.html

Are you running queries against the same node?

@wrigtim
Author

wrigtim commented Oct 24, 2015

Thanks for the quick response. We're currently writing to testmetricsl7, and reading from testmetricsl5 (Grafana). Is there any useful way to dump memory usage in a similar fashion to pprof? Apologies for my lack of Go profiling experience...

@daviesalex
Contributor

@otoolep, are you doing any of that at-scale testing with clustering?

I think what we are doing differently here is trying to run a simple cluster (number of replicas = number of nodes = 3). We know it's beta, so we don't expect it to be free of edge cases, but if you are almost entirely testing single nodes, that probably explains why we keep having problems!

@daviesalex
Contributor

FWIW, when we move to using a single node, we also see a stable RSS for influxd.

It therefore appears that the runaway RSS is caused by our attempt to use clustering.

@wrigtim
Author

wrigtim commented Oct 27, 2015

Spoke too soon - running as a single node, it suddenly reported the following:

[monitor] 2015/10/27 08:27:10 failed to store statistics: timeout
<followed by>
[write] 2015/10/27 08:34:02 write failed for shard 2 on node 1: engine: WAL backed up flushing to index, hit max memory

I'm attaching the Grafana dash we have for this node. I'm assuming the network traffic is a consequence of the telegraf agents retrying their flushes - we didn't make any agent changes during this time.

[screenshot: Grafana - metrics backend - production, 2015-10-27 13:37]

@otoolep
Contributor

otoolep commented Oct 27, 2015

How are you sending data into the system? (HTTP, Graphite...) Or is it all via Telegraf? How many series are you generating? What is the sampling rate?

@daviesalex
Contributor

@otoolep
Entirely via Telegraf. Sample rate is 10 seconds, batching and sending every 60 seconds with a jitter of 60 seconds. Pretty much entirely stock plugins. ~4k nodes.
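
For reference, the relevant agent settings look roughly like this (a sketch of our telegraf.conf [agent] section; exact keys may vary between Telegraf versions):

  [agent]
    interval = "10s"        # sampling interval
    flush_interval = "60s"  # batch and send once a minute
    flush_jitter = "60s"    # spread flushes over a 60-second window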

@otoolep
Contributor

otoolep commented Oct 28, 2015

OK @daviesalex

Right now I don't have any specific ideas. What you are doing is very close to what we are doing, in the sense that we regularly run multiple single nodes and hammer them with line-protocol data (which is what Telegraf emits) for days at a time. RSS remains steady for us. Of course, we don't have the same number of connections that you do -- more like 10-50 concurrent TCP connections over which line-protocol data travels.

If you can, I suggest adding the following line to /etc/default/influxdb (assuming you are running it as a service):

INFLUXD_OPTS="-memprofile /tmp/memprofile"

Check the process table after the service has started to ensure these flags have been passed to influxd.

This will write a memory profile dump to that file on process exit. The next time a node starts consuming an excessive amount of memory, stop the service and send us this file. We can then examine it for hints on where the memory is being used.
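
Once you have the dump, a quick first pass with pprof looks something like this (a sketch; adjust the binary path to wherever influxd lives on your systems):

  go tool pprof /opt/influxdb/influxd /tmp/memprofile
  (pprof) top20    # largest in-use allocations, by function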

@wrigtim
Author

wrigtim commented Oct 28, 2015

@otoolep
So right now the best way I have to reproduce this is by starting influxd with all ~4k telegraf instances already running. The RSS grows rather quickly, but what's odd is that the system doesn't appear to be CPU or IO-bound. Sure, there's a lot of CPU activity, but it's far from pinned, and the SSDs are hardly being touched....

memprofile with 93GB heap (after about 1 minute of uptime): https://gist.github.com/wrigtim/a5dac2b299a50fb9ef08

We're going to back off the flush_interval, but I'll take some packet captures beforehand to look for microbursts. There are around 1,000 nodes about 3 switch-hops from the database, and given our timesync is fairly tight, I'm expecting the traffic is fairly bursty (at least at the top of any given second).

Thanks again for all your support here! If there's anything else you'd like us to run/profile we're more than happy to.

@wrigtim
Author

wrigtim commented Oct 28, 2015

Just adding stats for the above:

https://gist.github.com/wrigtim/c150481d33167b0bddd3

@otoolep
Contributor

otoolep commented Oct 28, 2015

@wrigtim -- that file is not what I was actually expecting. I was looking for a binary memory profile dump, which we feed into go pprof. Am I missing something?

@otoolep
Contributor

otoolep commented Oct 28, 2015

Can you graph the Go runtime params on your system? I'd like to see whether this is mmap usage or Go heap. See

https://influxdb.com/blog/2015/09/22/monitoring_internal_show_stats.html

for examples of the queries to run.
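
For instance, something along these lines from the influx shell should show whether the growth is in the Go heap or outside it (a sketch; the measurement and field names are what recent builds use, so verify them with SHOW MEASUREMENTS if yours differ):

  USE _internal
  SELECT HeapAlloc, HeapSys, HeapIdle, Sys FROM "runtime" WHERE time > now() - 6h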

@sebito91
Contributor

Attached is a graph with HeapAlloc, HeapSys and NumGoRoutines for the past few hours... if you want/need others, let me know.

[screenshot: debug dashboard showing HeapAlloc, HeapSys and NumGoRoutines]

@otoolep
Contributor

otoolep commented Oct 28, 2015

Is there a big difference between the Go runtime alloc and the RSS of the process? I am wondering if any difference could be accounted for by mmaps.

@wrigtim
Author

wrigtim commented Oct 29, 2015

@otoolep
[screenshot: Grafana debug dashboard, 2015-10-29 11:54]

@wrigtim
Author

wrigtim commented Oct 29, 2015

@otoolep So to explain the timings for the above graph - the stable portion was with ~1000 telegraf clients spread across many geographically diverse datacenters. The other 3000 nodes were blocked by iptables rules...

At 11:15 I removed one of the iptables rules, which allowed around 500 additional nodes to connect (all from one DC). At this point you can see points_written_ok plummet, and RSS and heap start to increase in what appears to be a fairly proportional way. Does this help?

I'm just shutting down the database to get the output from -memprofile, although it looks to be ASCII as before. I'll try loading it into go pprof before sending anything over.

@KiNgMaR

KiNgMaR commented Oct 29, 2015

I think I am experiencing the same issue:

  • memory usage is constantly growing
  • more writes accelerate the growth

It looks like the Go runtime is allocating a lot of memory from the system for "HeapIdle":

[screenshot: memory usage graph, 2015-10-29 09:04]

The spike near the end of this 2-day graph is an increase in points written by around a factor of 10. A few hours later, memory usage reached the RAM+swap limit and points started being dropped. A few minutes later (even after I had reduced the number of writes again), the process was OOM-killed: Out of memory: Kill process 29675 (influxd).

As usual, after being killed that way, InfluxDB refuses to come up again: run: open server: open tsdb store: failed to open shard 1: open engine: unable to parse 'cq_xxx_rows_per_minute_total 1446123660000000000': invalid field format

@sebito91
Contributor

Here's an updated version of the dashboard; you can see where the influxdb process blocks incoming writes even though write_req continues unabated... this is precisely where the shard(s) get overrun.

[screenshot: dashboard showing writes blocked while write_req continues]

@syepes

syepes commented Oct 30, 2015

Great info on this issue. It would be nice if we could create a standard Grafana dashboard template to monitor all the important metrics from the _internal db.

@otoolep
Contributor

otoolep commented Oct 31, 2015

Sounds like a great idea.

Please note we just changed the names of the metrics, so they are more in line with Go style. Runtime metrics are not affected, since they are already idiomatic.

Execute SHOW MEASUREMENTS against the internal database for the latest names.


@otoolep
Contributor

otoolep commented Oct 31, 2015

Do a SELECT * on each measurement to see the new field names.
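
For example (a sketch; the measurement name here is just illustrative - use whatever SHOW MEASUREMENTS returns on your build):

  SHOW MEASUREMENTS
  SELECT * FROM "runtime" LIMIT 1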


@sebito91
Contributor

sebito91 commented Nov 2, 2015

@otoolep, @pauldix ...any ideas on the graph we included a couple of days ago?

What's interesting about this development is that the app stayed up (no panic), but writes were blocked for a long period while write_requests continued to grow... so something is still listening for connections but is unable to commit to the WAL/index fast enough.

@otoolep
Contributor

otoolep commented Nov 5, 2015

tsm1 is under active development right now, @sebito91, as you probably know. The code may still undergo significant changes, so these issues may be solved soon.

If not, we'll definitely investigate, as all tsm1 issues must be addressed before 0.9.5 ships.

@sebito91
Contributor

sebito91 commented Nov 6, 2015

@otoolep, here is another screenshot using the latest nightly... you can see RSS grow astronomically while writes keep up for a time, until we finally fluctuate out of control and crap out. Is this at all related to garbage collection - are we holding onto metrics in memory after the flush to the index?

[screenshot: RSS growth on the latest nightly]

@sebito91
Contributor

sebito91 commented Nov 6, 2015

To clarify my point (and maybe expose my golang inadequacy): the inflated RSS may come not from the writes but from the reads, with files left open after reading either inadvertently or for the GC to collect on our behalf. Maybe we need to explicitly close the files after reading to prevent leaking resources?

For example, in the readFields() function in tsdb/engine/tsm1/tsm1.go we open the file for reading, but we never close it before returning. Pretty much every read function in tsm1.go does the same thing (return without closing). At the very least, should we add something like defer f.Close() after the os.OpenFile call, as we do for readSeries()?

func (e *Engine) readFields() (map[string]*tsdb.MeasurementFields, error) {
    fields := make(map[string]*tsdb.MeasurementFields)

    f, err := os.OpenFile(filepath.Join(e.path, FieldsFileExtension), os.O_RDONLY, 0666)
    if os.IsNotExist(err) {
        return fields, nil
    } else if err != nil {
        return nil, err
    }
    b, err := ioutil.ReadAll(f)
    if err != nil {
        return nil, err
    }

    data, err := snappy.Decode(nil, b)
    if err != nil {
        return nil, err
    }

    if err := json.Unmarshal(data, &fields); err != nil {
        return nil, err
    }

    return fields, nil
}
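
Concretely, something like this is what I have in mind (an untested sketch of the change; the same pattern would apply to the other read functions that return without closing):

    f, err := os.OpenFile(filepath.Join(e.path, FieldsFileExtension), os.O_RDONLY, 0666)
    if os.IsNotExist(err) {
        return fields, nil
    } else if err != nil {
        return nil, err
    }
    // release the file descriptor on every return path instead of leaving it to the GC
    defer f.Close()

    b, err := ioutil.ReadAll(f)
    // ... rest of readFields() unchanged ...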

@otoolep
Contributor

otoolep commented Nov 6, 2015

Thanks @sebito91 -- could be something there. We'll need to check.

Any comment @jwilder ?

@wrigtim
Author

wrigtim commented Nov 13, 2015

@otoolep For info - we've upgraded the nightly from 6ecb62e to 664b20e (Go 1.4.2 + a few other bits). I don't want to get too excited, but so far we're seeing:

  • 120M+ points ingested from 800 clients with 0 write failures (around 20K points per sec)
  • RSS still creeping up, but leveling off around 9GB
  • CPU utilization has plummeted
  • 15-min load average has dropped from 10+ to 0.3

...but it's only been up 2 hours so far. Will leave it over the weekend and report back (hopefully with good news).

@jwilder
Contributor

jwilder commented Dec 9, 2015

Should be fixed in 0.9.6 release.

@jwilder jwilder closed this as completed Dec 9, 2015