Unnecessary debug/error logged during idle connection teardown #40824

Closed · strawgate opened this issue Sep 13, 2024 · 12 comments · Fixed by elastic/elastic-agent-libs#241
Labels: Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team)

Comments

@strawgate
Contributor

strawgate commented Sep 13, 2024

When an idle connection is torn down by the beat, Error reading from connection: read tcp y.y.y.y:54668->x.x.x.x:443: use of closed network connection is logged by github.com/elastic/elastic-agent-libs/transport.(*loggingConn).Read.

Here: https://github.com/elastic/elastic-agent-libs/blob/01275338dc278335b4a8c0f23055014b4c0702dc/transport/logging.go#L48-L54

This appears to be called during the keep alive session maintenance every time data is read from the buffer? The last data in the buffer is the 200 OK from the server so we are receiving the full response.

When the connection is torn down, this receives an errNetClosing instead of an EOF (likely because we are the ones closing the connection and not the server). This code path only checks for EOF, so the use of closed network connection error gets debug-logged during the read.

[image attachment]

This error message “pollutes” our debug logs and leads customers to believe there is a network issue.
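
To make the distinction concrete, here is a small self-contained sketch (illustration only, not part of the beat code or the eventual fix) showing that a read on a connection we closed ourselves returns an error matching net.ErrClosed — the exported form of errNetClosing — rather than io.EOF:

package main

import (
	"errors"
	"io"
	"log"
	"net"
	"time"
)

// classifyReadErr separates the "we closed it ourselves" case from a real EOF.
func classifyReadErr(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, io.EOF):
		return "peer closed the connection (EOF)"
	case errors.Is(err, net.ErrClosed):
		return "we closed the connection ourselves (use of closed network connection)"
	default:
		return "other network error"
	}
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		time.Sleep(time.Second) // keep the server side open
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		log.Fatal(err)
	}
	conn.Close() // simulate the beat tearing down its idle connection

	_, err = conn.Read(make([]byte, 1))
	log.Println(classifyReadErr(err)) // prints the net.ErrClosed branch
}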

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Sep 13, 2024
@cmacknz cmacknz added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Sep 13, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@rdner
Member

rdner commented Oct 23, 2024

> This appears to be called during the keep alive session maintenance every time data is read from the buffer? The last data in the buffer is the 200 OK from the server so we are receiving the full response.

This error does not happen only in this situation; it can also happen when the connection is legitimately interrupted by either side. I don't think it's a good idea to fully conceal this error: it would make it difficult to find some networking issues (a proxy?) on the customer's side.

So, we cannot simply change https://github.com/elastic/elastic-agent-libs/blob/01275338dc278335b4a8c0f23055014b4c0702dc/transport/logging.go#L48-L54 to ignore the error. If we want to address this issue properly, we need to find a way to handle this error in those "keep alive session maintenance" requests instead. The error is simply returned from the Read function, so it can be handled one level higher.

@rdner
Member

rdner commented Oct 23, 2024

I think we should find the places where those "keep alive requests" happen and send them without the logger connection wrapper at all. We should not care about their success generally.

For example, we could add a new function Connection() net.Conn to the logging wrapper that returns the underlying connection, which you could then use without logging. That would need some ugly type casting though, so I'm not sure.
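
A rough sketch of what that accessor could look like (hypothetical code, assuming the wrapper keeps the same shape as the existing loggingConn in transport/logging.go and the logp package from elastic-agent-libs):

// package transport (sketch); assumes "net" and github.com/elastic/elastic-agent-libs/logp
type loggingConn struct {
	net.Conn
	logger *logp.Logger
}

// Connection returns the wrapped connection so callers that don't want
// read errors logged (e.g. keep-alive traffic) can bypass the logging layer.
func (l *loggingConn) Connection() net.Conn {
	return l.Conn
}

// The caller usually only holds a plain net.Conn, so reaching the accessor
// needs the "ugly type casting" mentioned above:
//
//	if lc, ok := conn.(*loggingConn); ok {
//		conn = lc.Connection()
//	}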

@cmacknz
Member

cmacknz commented Oct 23, 2024

> I think we should find the places where those "keep alive requests" happen and send them without the logger connection wrapper at all. We should not care about their success generally.

The keepalives are most probably on the connections to the Elasticsearch _bulk endpoint in the Beats. That is pretty much the only network connection they regularly make.

The idle connection timeout is managed within the net/http implementation: https://cs.opensource.google/go/go/+/master:src/net/http/transport.go;drc=77e42fdeaf98d241dc09f4eb92e09225aea8f4c3;l=1090
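
For reference, a minimal illustration of where that timeout lives (standard net/http configuration, not Beats code; the 90s value is the net/http default, not necessarily what the Beats use):

package main

import (
	"net/http"
	"time"
)

// newClient builds an HTTP client whose transport closes pooled connections
// after they have been idle for IdleConnTimeout. That close is what later
// surfaces as "use of closed network connection" when the wrapped Read runs.
func newClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			IdleConnTimeout: 90 * time.Second,
		},
	}
}

func main() {
	_ = newClient()
}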

@rdner
Member

rdner commented Oct 24, 2024

> This appears to be called during the keep alive session maintenance every time data is read from the buffer? The last data in the buffer is the 200 OK from the server so we are receiving the full response.

@strawgate sounds like you spent some time debugging it, could you give us some pointers? Where did you see this HTTP request and how did you read the buffer?

@strawgate
Contributor Author

If you set a breakpoint at https://github.com/elastic/elastic-agent-libs/blob/4babafd5ed1e5079acf74212ed3da01740b22de7/transport/logging.go#L50, wait for it to hit, and then step out once, you'll land in net/http/transport.go at Read, where you can read the buffer from the VS Code debugger under p: [image attachment]

Example:
"HTTP/1.1 200 OK\r\nContent-Encoding: gzip\r\nContent-Type: application/json\r\nX-Cloud-Request-Id: hx4DEVDVTJettHh_3xDChw\r\nX-Elastic-Product: Elasticsearch\r\nX-Found-Handling-Cluster: 856e0d3694624bf18dffce04dd4505f0\r\nX-Found-Handling-Instance: instance-0000000002\r\nDate: Thu, 24 Oct 2024 14:24:19 GMT\r\nContent-Length: 124\r\n\r\n\x1f\x8b\b\x00\x00\x00\x00\x00\x00\x00\xec\x971\n\x80@\f\x04\xff\x92\xfa\n\xb5\xbc\xaf\x88\xc5!\x11\x04EHbu\xe4\xef\xde'\xec\xa6L҅dw\xa7\x8b\x9a=\xe6R\x8fv\xb9\x169C\xefQ\xad]v\xd3\x16*\xb5\x8bG\x8bw4\x97i\xce,L\xd8\x01w\xc0/\xa0\ah\"\xbe\x807\x92\x0f\xc8H\xe4D\xb22\xbc\x00\x19AF\x90\x11d\xf4\v\x19m\xf9\x01\x00\x00\xff\xff\x03\x00PY\xab8\xac\x11\x00\x00\xcd1f-\x0e\f\x7f\x1e\xd8\xd1\x04\xc5\xfa\xbaTNXoN6\xac\xc1\t\xffvˋ\xb9\xae\xf0\xd7\xc7\x04\xaa\xfe\x12r\xeb(\xf3\x1e\xd8\xf7\xf4\xc7\xe7\x04\xfadt\xd2*\xa1Fǎϩd\xc15\x92k\x94J\xd4?\x8d\x13...+3584 more"

@cmacknz
Member

cmacknz commented Oct 24, 2024

When I first looked at this I wondered if we could catch this by not logging when we get an error but also a non-zero n indicating we actually read data. Yes, the connection closed, but we also read something from it, so the fact that it closed isn't interesting yet.

https://github.com/elastic/elastic-agent-libs/blob/4babafd5ed1e5079acf74212ed3da01740b22de7/transport/logging.go#L50-L52

func (l *loggingConn) Read(b []byte) (int, error) {
	n, err := l.Conn.Read(b)
	if err != nil && !errors.Is(err, io.EOF) {
		l.logger.Debugf("Error reading from connection: %v", err)
	}
	return n, err
}
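
A sketch of that idea (just this comment's proposal, not the merged fix — and as noted below it turned out not to be sufficient): only log when the read returned no data at all.

func (l *loggingConn) Read(b []byte) (int, error) {
	n, err := l.Conn.Read(b)
	// Skip the debug log when we read something: the connection closing
	// right after a complete response is not interesting.
	if err != nil && n == 0 && !errors.Is(err, io.EOF) {
		l.logger.Debugf("Error reading from connection: %v", err)
	}
	return n, err
}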

@khushijain21
Contributor

> When an idle connection is torn down by the beat, Error reading from connection: read tcp y.y.y.y:54668->x.x.x.x:443: use of closed network connection is logged by github.com/elastic/elastic-agent-libs/transport.(*loggingConn).Read.

Hello @strawgate, can you list the steps to reproduce this error? This is needed to test the fix, thanks.

@strawgate
Contributor Author

strawgate commented Nov 4, 2024

I believe all you need to do is start a relatively new metricbeat (I was using main) with default settings (the system module enabled) and debug logging enabled (logging.level: debug), and this will get logged continuously.

@khushijain21
Contributor

khushijain21 commented Nov 5, 2024

> When I first looked at this I wondered if we could catch this by not logging when we get an error but also a non-zero n indicating we actually read data. Yes, the connection closed, but we also read something from it, so the fact that it closed isn't interesting yet.

I tried this locally, but it still throws an error. The buffer may or may not be empty when the keep-alive session attempts to read; it is not guaranteed.

> I think we should find the places where those "keep alive requests" happen and send them without the logger connection wrapper at all. We should not care about their success generally.

The logger wrapper is attached to connections made to ES: https://github.com/khushijain21/beats/blob/main/libbeat/esleg/eslegclient/connection.go#L164. I think we may risk losing "all" connection details if we remove it.

Any suggestion is appreciated.

@rdner
Member

rdner commented Nov 7, 2024

Just to summarize my thoughts on this issue:

  1. We should definitely keep the log wrapper; we should not consider removing it.
  2. The log message must stay because it's generic: if the connection is closed because of a server crash, we still need to log it.
  3. The problem here is not the error message, but that the connection now has an idle timeout that closes the connection and then tries to use a closed connection. See where it was introduced for more details – this is what we need to find a fix for.

@rdner
Member

rdner commented Nov 13, 2024

After looking into the internals, @khushijain21 and I decided to just remove the log line and let the consumer of the Read function handle the error (which would probably be logged where necessary).
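
For illustration, assuming the linked change simply drops the Debugf call and leaves error handling to the caller, the wrapper's Read would reduce to something like:

// Sketch only — see elastic/elastic-agent-libs#241 for the actual change.
func (l *loggingConn) Read(b []byte) (int, error) {
	return l.Conn.Read(b)
}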
