Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

influxdb output leaks connections #1058

Closed
page31 opened this issue Apr 19, 2016 · 7 comments · Fixed by #1061
Closed

influxdb output leaks connections #1058

page31 opened this issue Apr 19, 2016 · 7 comments · Fixed by #1061

Comments

@page31
Copy link

page31 commented Apr 19, 2016

I putted the wrong retention policy in the config file yesterday and every write failed since that. I found there were about 5k tcp connections between telegraf and influxdb(port 8086) this morning.
The issue may be caused by Query function in influxdb/client/v2/client.go which uses JSON decoder to read the response body. The decoder just reads the first value and if there is still some data in the response stream the connection won't be closed or reused.

@sparrc
Copy link
Contributor

sparrc commented Apr 19, 2016

thanks for the report, I don't think there's anything wrong with the client, it's this that's causing the bug: https://github.com/influxdata/telegraf/blob/master/plugins/outputs/influxdb/influxdb.go#L198

@sparrc
Copy link
Contributor

sparrc commented Apr 19, 2016

Although it's not 100% clear to me why your OS is keeping alive so many dead TCP connections, can you give me the output of these commands?

cat /proc/sys/net/ipv4/tcp_keepalive_time
cat /proc/sys/net/ipv4/tcp_keepalive_intvl
cat /proc/sys/net/ipv4/tcp_keepalive_probes

@page31
Copy link
Author

page31 commented Apr 19, 2016

Of course, you should close the client before clear the array. But Close method of http client is just empty.

server running influxdb

$ cat /proc/sys/net/ipv4/tcp_keepalive_time
600
$ cat /proc/sys/net/ipv4/tcp_keepalive_intvl
15
$ cat /proc/sys/net/ipv4/tcp_keepalive_probes
5

server running telegraf:

$cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
$ cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
$ cat /proc/sys/net/ipv4/tcp_keepalive_probes
9

@sparrc
Copy link
Contributor

sparrc commented Apr 19, 2016

you're right, I don't think changing the close method will actually do anything to help this issue.

But I'm still not quite sure I follow the logic on reading the response object. When querying influx there is only a single json object that gets returned (it begins with {), and then there is an array internal to that, which is part of the object that it gets decoded to. I don't really see how there would still be data left in the stream (we also close the response body before exiting the function).

@sparrc
Copy link
Contributor

sparrc commented Apr 19, 2016

From what I can tell, the problem is that Telegraf creates a new http client by calling Connect() every time there is a write failure. The idea here was that if the database goes away, Telegraf could recreate on restart. (see #836)

The problem in this case is that influxdb never goes away, and thus all tcp connections are still valid, and thus creating new http clients just results in more and more tcp connections.

So by "fixing" #836 I actually unintentionally created a situation where this could happen.I don't think there is a way that I can support #836 without the possibility of leaking tcp connections, so I'm going to revert that change.

EDIT: I am going to change the implementation so that it checks the error message for "database not found", and if that's there then it tries to recreate. It will no longer create new http clients.

cc @jchauncey

@page31
Copy link
Author

page31 commented Apr 19, 2016

@sparrc After digging a lot I just realized that too. Implementing HttpClient's Close method and call CloseIdleConnections of Transport should solve this problem. I'm sorry that I just guessed the reason this morning and that may be misleading.

@page31 page31 closed this as completed Apr 19, 2016
@page31 page31 reopened this Apr 19, 2016
sparrc added a commit to influxdata/influxdb that referenced this issue Apr 19, 2016
If users properly call client.Close(), then this will make sure that
established tcp connections dont continually grow when creating new http
clients.

This fixes the case where users are creating new http clients on top of
existing _valid_ connections.

This was encountered in Telegraf when we were recreating our http
clients after getting write failures that were unrelated to the actual
connection being severed (such as typos in the retention policy, see
influxdata/telegraf#1058)
@sparrc
Copy link
Contributor

sparrc commented Apr 19, 2016

no problem, thanks for the report!, I've submitted a PR to influxdb to fix that, and a separate PR to telegraf. I think it's probably best to leave the http clients open and only recreate the db if that is the specific error encountered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants