Fix InfluxDB using too much memory #1113
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1113 +/- ##
==========================================
+ Coverage 73.39% 73.65% +0.26%
==========================================
Files 142 142
Lines 10323 10353 +30
==========================================
+ Hits 7577 7626 +49
+ Misses 2303 2281 -22
- Partials 443 446 +3
Continue to review full report at Codecov.
LGTM, though please add something in the PR description we can use in the release notes later
LGTM, just a small typo fix missing.
Insecure         null.Bool          `json:"insecure,omitempty" envconfig:"INFLUXDB_INSECURE"`
PayloadSize      null.Int           `json:"payloadSize,omitempty" envconfig:"INFLUXDB_PAYLOAD_SIZE"`
PushInterval     types.NullDuration `json:"pushInterval,omitempty" envconfig:"INFLUXDB_PUSH_INTERVAL"`
ConcurrentWrites null.Int           `json:"concurrentWrites,omitempty" envconfig:"INFLUXDB_CONCURRENT_WRITES"`
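The two options the commit message further down gives defaults for are PushInterval (1 second) and ConcurrentWrites (10). Below is a rough sketch, not the actual k6 code, of how such defaults could be applied when the user leaves the options unset; the helper name and the import paths are assumptions.

package sketch

import (
	"time"

	"github.com/loadimpact/k6/lib/types" // assumed k6 import path of that era
	"gopkg.in/guregu/null.v3"            // assumed guregu/null version
)

// config mirrors the relevant fields from the hunk above; the real struct
// lives in k6's InfluxDB output package.
type config struct {
	PushInterval     types.NullDuration `json:"pushInterval,omitempty" envconfig:"INFLUXDB_PUSH_INTERVAL"`
	ConcurrentWrites null.Int           `json:"concurrentWrites,omitempty" envconfig:"INFLUXDB_CONCURRENT_WRITES"`
}

// applyDefaults is a hypothetical helper: it fills in the defaults named in
// the commit message (push every 1s, at most 10 concurrent writes) only for
// options the user has not set explicitly.
func applyDefaults(c *config) {
	if !c.PushInterval.Valid {
		c.PushInterval = types.NullDurationFrom(1 * time.Second)
	}
	if !c.ConcurrentWrites.Valid {
		c.ConcurrentWrites = null.IntFrom(10)
	}
}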
Somewhat unrelated to this PR, but do we know if omitempty is correctly handled by guregu/null? Their README mentions it as a bug and there's an open PR that apparently fixes it.
🤦‍♂️ Yeah... this is problematic, though we probably have bigger issues than the InfluxDB config 😞. I think this might be the result of the same bug: #883 (comment)
It looks like it...
Maybe after #1070 we should fork guregu/null, merge that PR, and use the fork through Go modules? That library looks pretty dead to me... We can, of course, first ping the author to ask whether he will at least review the PR in question.
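For context on the omitempty question above: encoding/json's omitempty only skips fields whose value is an empty basic value (false, 0, an empty string, a nil pointer/interface, or an empty array, slice, or map), and a struct type such as null.Int never counts as empty, so an unset field is still marshaled (guregu/null emits JSON null for it) instead of being dropped. A minimal demonstration, with the import version being an assumption:

package main

import (
	"encoding/json"
	"fmt"

	"gopkg.in/guregu/null.v3" // assumed guregu/null version
)

// demo uses the same kind of tag as the config fields above.
type demo struct {
	PayloadSize null.Int `json:"payloadSize,omitempty"`
}

func main() {
	out, _ := json.Marshal(demo{}) // PayloadSize deliberately left unset
	fmt.Println(string(out))       // prints {"payloadSize":null} rather than {}
}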
Force-pushed from d5a653a to cf60f03
Previously, k6 would write to InfluxDB every second, but if that write took more than one second it would not start a second write and would instead wait for the first one to finish. This generally led to the write times getting bigger and bigger as more and more data accumulated, until the maximum body size InfluxDB accepts was reached, at which point InfluxDB returned an error and k6 dropped that data.

With this commit there is a configurable number of parallel writes (10 by default), with a new write starting every 1 second (now also configurable). Additionally, if all concurrent writes are already in flight, instead of sending all the data that has accumulated we just keep queueing the samples that are generated. This should considerably help with not hitting InfluxDB's maximum body size.

I tested with a simple script doing batch requests for an empty local file with 40 VUs. Without an output it got 8.1K RPS with 650MB of memory usage. Before this commit, RAM usage was ~5.7GB for 5736 RPS and practically all of the data was lost unless the maximum body size was raised, and even then a lot of data was lost while memory usage kept going up. After this commit, RAM usage was ~2.4GB (or less in some of the tests) with 6273 RPS and no data was lost. Even with this commit, running that simple script for 2 hours dies after 1 hour and 35 minutes, using around 15GB (the test system has 16). I can't be sure about data loss there, as InfluxDB ate 32GB of memory trying to visualize it and I had to kill it ;(.

Some remaining problems with this solution:
1. We use a lot of goroutines if things start slowing down - probably not a big problem, but still a good idea to fix.
2. We could probably batch better if we kept all the unsent samples together and cut them into, say, 50k-sample chunks.
3. By far the biggest: because the writes are slow, if the test is stopped (with Ctrl+C) or finishes naturally, waiting for those writes can take a considerable amount of time - in the example above, the 4-minute tests generally took around 5 minutes :(

All of these can be better handled with some more sophisticated queueing code at a later time.

closes #1081, fixes #1100, fixes #182
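Grounded in the commit message above, here is a rough sketch, not the actual k6 code, of the queueing scheme it describes: a ticker fires every push interval, at most a fixed number of writes run concurrently, and when every write slot is busy the new samples simply stay queued for a later tick. All type and function names here are hypothetical.

package sketch

import (
	"sync"
	"time"
)

// Sample stands in for a k6 metric sample; the real type is in k6's stats package.
type Sample struct{}

// flusher holds queued samples and a semaphore limiting parallel writes.
type flusher struct {
	mu     sync.Mutex
	queue  []Sample
	writes chan struct{} // buffered channel used as a semaphore
	write  func([]Sample) error
}

func newFlusher(concurrentWrites int, write func([]Sample) error) *flusher {
	return &flusher{
		writes: make(chan struct{}, concurrentWrites),
		write:  write,
	}
}

// Add queues samples produced by the running test.
func (f *flusher) Add(samples []Sample) {
	f.mu.Lock()
	f.queue = append(f.queue, samples...)
	f.mu.Unlock()
}

// Run attempts a flush on every tick, but only if a write slot is free;
// otherwise the samples stay queued instead of growing one enormous request.
func (f *flusher) Run(pushInterval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(pushInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			select {
			case f.writes <- struct{}{}: // acquire a write slot
				f.mu.Lock()
				batch := f.queue
				f.queue = nil
				f.mu.Unlock()
				go func() {
					defer func() { <-f.writes }() // release the slot
					if len(batch) > 0 {
						_ = f.write(batch)
					}
				}()
			default:
				// All write slots are busy: leave the samples queued until the next tick.
			}
		case <-stop:
			return
		}
	}
}

Using a buffered channel as the semaphore keeps the bound on in-flight writes explicit, which is what caps both the memory growth and the size of any single request body.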
This also upgrades it to the latest version and adds a benchmark, which shows somewhat fewer allocations:

name                       old time/op    new time/op    delta
Influxdb1Second-4            3.56ms ± 0%    3.56ms ± 1%      ~     (p=0.841 n=5+5)
Influxdb2Second-4            12.1ms ± 0%    12.1ms ± 0%      ~     (p=0.095 n=5+5)
Influxdb100Milliseconds-4     341µs ± 1%     333µs ± 1%    -2.34%  (p=0.008 n=5+5)

name                       old alloc/op   new alloc/op   delta
Influxdb1Second-4            19.1kB ± 0%    16.6kB ± 0%   -13.37%  (p=0.008 n=5+5)
Influxdb2Second-4            18.8kB ± 0%    16.2kB ± 0%   -13.63%  (p=0.016 n=5+4)
Influxdb100Milliseconds-4    18.7kB ± 0%    16.1kB ± 0%   -13.72%  (p=0.008 n=5+5)

name                       old allocs/op  new allocs/op  delta
Influxdb1Second-4               291 ± 0%       231 ± 0%   -20.62%  (p=0.008 n=5+5)
Influxdb2Second-4               291 ± 0%       231 ± 0%   -20.62%  (p=0.008 n=5+5)
Influxdb100Milliseconds-4       291 ± 0%       231 ± 0%   -20.62%  (p=0.008 n=5+5)

The same test as in fce3884 gets 6909 RPS with 2.5GB of memory usage, so a considerable bump in RPS for a small amount of memory.
Force-pushed from cf60f03 to 04e1379
Text for release notes: