InfluxDB memory leakage #605
This issue is to gather information about the memory leakage in InfluxDB and check if there are some immediate steps that can be taken to reduce its size.

@vishh Can you provide more info on how big the leak is?

Comments
The recommendation is an 8GB memory limit for InfluxDB. I usually see InfluxDB …

I think the underlying problem is not a leak. If we want good performance …

One more thing to consider is the amount of data being written to InfluxDB.

On Thu, Sep 24, 2015 at 9:57 AM, Marcin Wielgus notifications@github.com wrote:
> Personally I would double-check whether we are not writing too much data to InfluxDB, or doing it in a suboptimal way. From a brief & tired look at how the InfluxDB sink is implemented, I would suspect that we are writing samples with the default 5 sec resolution. What about changing it to 30 sec (not perfect, but better than switching off InfluxDB)? Another idea is to play with how batches are constructed and with the sink frequency - if the underlying storage is key-value like, it might be beneficial to write data less often than every 10s, so that more data is written under a single key in one batch and the key lookups/disk seeks happen less often. In InfluxDB 0.8 there seem to be pluggable backends which differ deeply in terms of write performance: https://influxdb.com/blog/2014/06/20/leveldb_vs_rocksdb_vs_hyperleveldb_vs_lmdb_performance.html
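A minimal sketch of that batching idea in Go (all names here are hypothetical, not heapster's actual sink API): buffer incoming samples and flush them on a fixed interval, so each write to the store carries one large batch instead of many individual points.

```go
package main

import (
	"log"
	"sync"
	"time"
)

// Point is a simplified metric sample (hypothetical; a real sink
// would use the InfluxDB client's series types).
type Point struct {
	Series string
	Value  float64
	Time   time.Time
}

// BatchingSink buffers points and flushes them at a fixed interval,
// so many samples land in a single write per series.
type BatchingSink struct {
	mu    sync.Mutex
	buf   []Point
	flush func([]Point) error // e.g. a call into the InfluxDB client
}

func NewBatchingSink(interval time.Duration, flush func([]Point) error) *BatchingSink {
	s := &BatchingSink{flush: flush}
	go func() {
		for range time.Tick(interval) {
			s.Flush()
		}
	}()
	return s
}

// Add records a sample without touching the network.
func (s *BatchingSink) Add(p Point) {
	s.mu.Lock()
	s.buf = append(s.buf, p)
	s.mu.Unlock()
}

// Flush writes everything buffered so far as one batch.
func (s *BatchingSink) Flush() {
	s.mu.Lock()
	batch := s.buf
	s.buf = nil
	s.mu.Unlock()
	if len(batch) == 0 {
		return
	}
	if err := s.flush(batch); err != nil {
		log.Printf("flush of %d points failed: %v", len(batch), err)
	}
}

func main() {
	sink := NewBatchingSink(30*time.Second, func(ps []Point) error {
		log.Printf("writing %d points in one batch", len(ps))
		return nil // a real sink would call the InfluxDB write API here
	})
	sink.Add(Point{Series: "cpu/usage", Value: 0.42, Time: time.Now()})
	time.Sleep(31 * time.Second) // keep the demo alive for one flush
}
```

Writing every 30 seconds rather than every 5 trades some data freshness for roughly 6x fewer write operations against the key-value backend.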
We can increase the sink duration. It is currently set to 10 seconds by default. …

InfluxDB v0.8 is now being deprecated. So I don't see value in …

A meta comment that I want to re-state is that I don't see value in making …

As of now, we do not endorse InfluxDB as the recommended backend. Are you …

On Thu, Sep 24, 2015 at 4:50 PM, Marcin Wielgus notifications@github.com wrote:
> I personally don't care whether we use InfluxDB or some other time series database like GCM. I just want to have some kind of permanent storage (cloud-provider permitting) for metrics. And I'm convinced that we should not write another one, as there are lots of other, more important challenges in K8s. If InfluxDB is not the best choice for big customers - let's change it in 1.2 and, for now, provide an extra flag in kube-up to run heapster without any storage. And if there are performance issues with InfluxDB 0.8 and we cannot upgrade to 0.9 (for K8s 1.1), let's try some simple workarounds first (increasing the resolution to 30 sec will decrease the load 6x), so it is usable for 80-90% of customers with small/moderate clusters before dropping it completely.
After a discussion with @piosz we agreed that 30 sec resolution would be even more handy for him. We also took a look at the metrics that are exported - there are 16, but InitialResources needs only 6 of them (network may also be handy, so let's count 8). So if we added a flag to control which metrics are propagated to InfluxDB, we could reduce the load even more. Assuming the K8s 1.0 scalability target of 100 nodes with 30 pods each, we have: (100 nodes * 30 pods/node * 8 metrics/pod) / 30 sec resolution = 800 data points per second.
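A rough sketch of that whitelist idea (the flag name and the default metric list are made up for illustration; heapster's real flags differ), including the back-of-the-envelope load calculation from the comment above:

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// storedMetrics is a hypothetical flag listing the metrics the sink
// should propagate to InfluxDB; everything else is dropped.
var storedMetrics = flag.String("stored_metrics",
	"cpu/usage,cpu/limit,memory/usage,memory/working_set,"+
		"memory/limit,network/rx,network/tx,uptime",
	"comma-separated metric names to write to InfluxDB")

// allowed turns the flag value into a lookup set.
func allowed() map[string]bool {
	set := map[string]bool{}
	for _, m := range strings.Split(*storedMetrics, ",") {
		set[strings.TrimSpace(m)] = true
	}
	return set
}

func main() {
	flag.Parse()
	keep := allowed()

	// Back-of-the-envelope load from the comment above:
	// 100 nodes * 30 pods/node * 8 metrics/pod, written once per 30 s.
	points := 100 * 30 * len(keep)
	fmt.Printf("%d points per cycle, %.0f points/sec at 30s resolution\n",
		points, float64(points)/30)

	// A sink would consult the set for every incoming sample:
	for _, name := range []string{"cpu/usage", "filesystem/usage"} {
		fmt.Printf("metric %q kept: %v\n", name, keep[name])
	}
}
```

With the 8 metrics above this prints 24000 points per cycle, i.e. 800 points/sec, matching the estimate; trimming the whitelist to the 6 metrics InitialResources needs would cut the load proportionally.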
@mwielgus: We store all 16 of those metrics today in InfluxDB, and possibly more soon (disk io, load stats, tcp stats, etc.). In any case, I think we are on the same page when it comes to the level of support for InfluxDB. Ensuring that the default setup doesn't overwhelm InfluxDB makes total sense 👍

@piosz: Thanks for posting the PR!
closing in favor of kubernetes/kubernetes#27630