influxdb output leaks connections #1058

page31 · 2016-04-19T03:27:44Z

I putted the wrong retention policy in the config file yesterday and every write failed since that. I found there were about 5k tcp connections between telegraf and influxdb(port 8086) this morning.
The issue may be caused by Query function in influxdb/client/v2/client.go which uses JSON decoder to read the response body. The decoder just reads the first value and if there is still some data in the response stream the connection won't be closed or reused.

sparrc · 2016-04-19T03:42:39Z

thanks for the report, I don't think there's anything wrong with the client, it's this that's causing the bug: https://github.com/influxdata/telegraf/blob/master/plugins/outputs/influxdb/influxdb.go#L198

sparrc · 2016-04-19T04:54:45Z

Although it's not 100% clear to me why your OS is keeping alive so many dead TCP connections, can you give me the output of these commands?

cat /proc/sys/net/ipv4/tcp_keepalive_time
cat /proc/sys/net/ipv4/tcp_keepalive_intvl
cat /proc/sys/net/ipv4/tcp_keepalive_probes

page31 · 2016-04-19T06:41:19Z

Of course, you should close the client before clear the array. But Close method of http client is just empty.

server running influxdb

$ cat /proc/sys/net/ipv4/tcp_keepalive_time
600
$ cat /proc/sys/net/ipv4/tcp_keepalive_intvl
15
$ cat /proc/sys/net/ipv4/tcp_keepalive_probes
5

server running telegraf:

$cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
$ cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
$ cat /proc/sys/net/ipv4/tcp_keepalive_probes
9

sparrc · 2016-04-19T15:37:29Z

you're right, I don't think changing the close method will actually do anything to help this issue.

But I'm still not quite sure I follow the logic on reading the response object. When querying influx there is only a single json object that gets returned (it begins with {), and then there is an array internal to that, which is part of the object that it gets decoded to. I don't really see how there would still be data left in the stream (we also close the response body before exiting the function).

sparrc · 2016-04-19T16:20:09Z

From what I can tell, the problem is that Telegraf creates a new http client by calling Connect() every time there is a write failure. The idea here was that if the database goes away, Telegraf could recreate on restart. (see #836)

The problem in this case is that influxdb never goes away, and thus all tcp connections are still valid, and thus creating new http clients just results in more and more tcp connections.

So by "fixing" #836 I actually unintentionally created a situation where this could happen.~~I don't think there is a way that I can support #836 without the possibility of leaking tcp connections, so I'm going to revert that change.~~

EDIT: I am going to change the implementation so that it checks the error message for "database not found", and if that's there then it tries to recreate. It will no longer create new http clients.

cc @jchauncey

closes #1058

page31 · 2016-04-19T17:03:22Z

@sparrc After digging a lot I just realized that too. Implementing HttpClient's Close method and call CloseIdleConnections of Transport should solve this problem. I'm sorry that I just guessed the reason this morning and that may be misleading.

closes #1058

If users properly call client.Close(), then this will make sure that established tcp connections dont continually grow when creating new http clients. This fixes the case where users are creating new http clients on top of existing _valid_ connections. This was encountered in Telegraf when we were recreating our http clients after getting write failures that were unrelated to the actual connection being severed (such as typos in the retention policy, see influxdata/telegraf#1058)

sparrc · 2016-04-19T19:28:55Z

no problem, thanks for the report!, I've submitted a PR to influxdb to fix that, and a separate PR to telegraf. I think it's probably best to leave the http clients open and only recreate the db if that is the specific error encountered.

closes #1058

closes #1058 closes #1059 also see influxdata/influxdb#6425

sparrc added a commit that referenced this issue Apr 19, 2016

influxdb output Close connections before leaking them

e74a1ea

closes #1058

page31 closed this as completed Apr 19, 2016

page31 reopened this Apr 19, 2016

sparrc added a commit that referenced this issue Apr 19, 2016

influxdb output Close connections before leaking them

d3f05c8

closes #1058

sparrc mentioned this issue Apr 19, 2016

Prevent InfluxDB client from leaking TCP connections influxdata/influxdb#6425

Merged

2 tasks

sparrc added a commit that referenced this issue Apr 19, 2016

influxdb output Close connections before leaking them

4d3719e

closes #1058

sparrc added a commit that referenced this issue Apr 19, 2016

influxdb output Close connections before leaking them

4d5dca6

closes #1058

sparrc added a commit that referenced this issue Apr 19, 2016

influxdb output: close connections & dont always overwrite

8cc7236

closes #1058 closes #1059 also see influxdata/influxdb#6425

sparrc mentioned this issue Apr 19, 2016

influxdb output: close connections & dont always overwrite #1061

Merged

sparrc closed this as completed in #1061 Apr 19, 2016

sparrc mentioned this issue May 18, 2016

[0.12.1-1] too many opened files #1188

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

influxdb output leaks connections #1058

influxdb output leaks connections #1058

page31 commented Apr 19, 2016 •

edited

Loading

sparrc commented Apr 19, 2016

sparrc commented Apr 19, 2016

page31 commented Apr 19, 2016

sparrc commented Apr 19, 2016

sparrc commented Apr 19, 2016 •

edited

Loading

page31 commented Apr 19, 2016

sparrc commented Apr 19, 2016

influxdb output leaks connections #1058

influxdb output leaks connections #1058

Comments

page31 commented Apr 19, 2016 • edited Loading

sparrc commented Apr 19, 2016

sparrc commented Apr 19, 2016

page31 commented Apr 19, 2016

sparrc commented Apr 19, 2016

sparrc commented Apr 19, 2016 • edited Loading

page31 commented Apr 19, 2016

sparrc commented Apr 19, 2016

page31 commented Apr 19, 2016 •

edited

Loading

sparrc commented Apr 19, 2016 •

edited

Loading