quic: Error from peer should be explicit that it is coming from remote peer rather than a local error #1928

MarcoPolo · 2022-12-02T06:20:04Z

We lost some time debugging this. Basically Kubo folks saw this log line

{12D... Application error 0x0: conn-27571160: system: cannot reserve inbound connection: resource limit exceeded}

Even with the resource manager disabled. And naturally assumed we were still blocked despite using a null resource manager. This error actually came from the peer when it closed the connection here

We should be more explicit locally that this is a remote error from the peer, and not an error we generated locally.

On the flip side, it's cool that the local peer can see the remote peer is blocking connections because of rcmgr.

The text was updated successfully, but these errors were encountered:

marten-seemann · 2022-12-02T06:22:53Z

We should be more explicit locally that this is a remote error from the peer, and not an error we generated locally.

Is that something that quic-go should do? The string representation is generated there. Any suggestion where to sneak that bit of information into the message?

On the flip side, it's cool that the local peer can see the remote peer is blocking connections because of rcmgr.

It's also scary. We should be more deliberate about what we send in the error message. Having an error code for "you exceed a resource limit" would definitely make sense though (see libp2p/specs#479 for a proposal).

MarcoPolo · 2022-12-02T17:28:40Z

We should add extra information to the error on the receiving side, so that out of date clients or malicious/broken clients can't send confusing errors.

We should also sanitize the errors we send back. I think a "I'm overloaded try again later" style error is fine. But we shouldn't be sending errors with this high level of fidelity back.

MarcoPolo · 2022-12-02T17:51:34Z

Is that something that quic-go should do?

I think so.

Any suggestion where to sneak that bit of information into the message?

Not sure. Maybe wherever we receive the application error code and the error string?

marten-seemann · 2022-12-06T06:39:22Z

quic-go/quic-go#3629 would fix this. @MarcoPolo, would you mind doing a quick review over in quic-go?

BigLep · 2022-12-06T16:44:40Z

Thanks @marten-seemann . A couple of things:

Would this make it in for 0.24 (or maybe a followup in 0.24.1). I'm thinking about it from the lens of making resource manager errors clearer for Kubo users.
What will the resulting error message look like in Kubo after this change (assuming Kubo doesn't make any changes)? Is it boing to be:

{12D... Application error (remote) 0x0: conn-27571160: system: cannot reserve inbound connection: resource limit exceeded}

marten-seemann · 2022-12-06T18:56:51Z

Would this make it in for 0.24 (or maybe a followup in 0.24.1). I'm thinking about it from the lens of making resource manager errors clearer for Kubo users.

Definitely won't make it into v0.24.0, because that one has shipped last night :)
We can do a patch release if there's desire for it. Should we target the RC or the final release with that?

2. What will the resulting error message look like in Kubo after this change (assuming Kubo doesn't make any changes)? Is it boing to be:

That's correct.

BigLep · 2022-12-06T22:43:39Z

I think we can do this for for the final release in case other stuff comes up between now and then. Thanks @marten-seemann .

markg85 · 2023-01-30T22:16:52Z

Hi,

Sorry for posting in a closed issue, but this seems to be touching exactly what i "thought" i was looking for.

This is all with Kubo 0.18.1 (yup, brand spanking new for just some hours as i write this).

Application error 0x0 (remote): conn-1940203: system: cannot reserve inbound connection: resource limit exceeded

Is a message that is spammed so much when warning logs are enabled.
But when looking at the first post here by @MarcoPolo i'm lead to believe that this message actually means: On the flip side, it's cool that the local peer can see the remote peer is blocking connections because of rcmgr.

So now i'm like totally confused.
The message i see in ipfs leads me to think that a resource is too constrained. Which one?
But here this issue tells me the remote is blocking. That has nothing todo with resource limit exceeded which the error claims.

Some clarity would be helpful here.
Also, those error messages in the log really should be more informative. If there is an actual resource limit exceeded then i'd very much suggest adding a note on which resource that might be and how to fix it.

To be clear here. The mentioned error message starts popping up within mere seconds of starting ipfs when it certainly doesn't have many connections yet.

marten-seemann · 2023-02-01T02:06:31Z

This is all with Kubo 0.18.1 (yup, brand spanking new for just some hours as i write this).

Application error 0x0 (remote): conn-1940203: system: cannot reserve inbound connection: resource limit exceeded

As the message says, this error happened on the remote node. For some reason a resource limit is being exceeded there. There's nothing you can do about it locally.

BigLep · 2023-02-16T12:37:24Z

@marten-seemann : I'm going to reopen given this is still causing confusion for Kubo users per:
ipfs/kubo#9432 (comment)
ipfs/kubo#9432 (comment)
ipfs/kubo#9432 (comment)

I think the suggestion here and here are good input for driving clarity:

The log entries could be made more clear if (remote) is turned into (from remote), else it's easy to confuse this as being related to remote something, not that it is coming from the remote node.
Wrapping it in quotes to show that it is "verbatim" would probably help as well

markg85 · 2023-02-16T14:02:45Z

A thing that didn't hit me till @MarcoPolo explained it to me is that a message like this:
Application error 0x0 (remote): conn-1940203: system: cannot reserve inbound connection: resource limit exceeded

Has a lot more meaning then it seems. Knowledge you just don't know unless you go around asking people and/or happen to find docs that explain it.

That message means:

A connection has been made with the remote.
The remote can't maintain the connection and dropped it
It has limits that prevent the connection from being maintained.

But the message reads like: "There was an error, my local system cannot reserve a slot to accept that remote connection"
The points in the message that trigger that thought are (highlighting them)

Application error 0x0 (remote): conn-1940203: system: cannot reserve inbound connection: resource limit exceeded

Changing "remote" to "remote peer" is marginally making it better but still gives a message that is very confusing to read without the background information. Do note, anyone reading this very github issue likely already has a ton more background information then the average user trying to debug ipfs.

Imho, these parts of the message should go:

Remove the "Application error" prefix. It leads one to think that there is an error locally. There is not. Just not.
Remove "system", this very vaguely describes something about a system? What system? Where? Local? Remote?
Remove "resource", what resource? CPU? Ram? Network? Volts from the PSU ;) It's just too ambiguous and tells you "something" is wrong but not telling you what.

A reworded message that "makes sense to me" would look something like:

Network error (conn-1940203) remote peer dropped connection attempt because it's limits have been exceeded

Alternatively I wouldn't even mention anything about limits being exceeded. It's an internal detail the remote knows more about and should not be of any value for someone making a connection to that peer. An error that is more appropriate would look something like this:

Network error (conn-1940203) remote peer could not maintain our connection and dropped it

In this latter case you could provide more info on the peer itself that drops a connection. So for example if my node drops an incoming connection then it can tell the verbose details (limits, resources, whatnot) as to why it dropped that connection. The remote that is being dropped should not care about those details.

I hope this insight in my thinking about that error message helps :)

marten-seemann · 2023-02-17T04:05:13Z

A few things here:

This error is emitted by quic-go. In the context of QUIC, "Application Error" has a very specific meaning. It makes sense that a QUIC stack would make use of that terminology.
It's a bug that we're sending the entire error message when using QUIC. We shouldn't reveal that much information to a (potentially malicious) peer. Instead, we should only inform about the broad category, e.g. by using an error code
Once implemented, libp2p should define a string representation that makes sense for all transports. This will be easy, since the error returned by quic-go contains all the information we need.

ajnavarro mentioned this issue Dec 2, 2022

[Tracking issue] go-libp2p resource manager critical post release fixes ipfs/kubo#9442

Closed

9 tasks

marten-seemann mentioned this issue Dec 6, 2022

qerr: include role (remote / local) in error string representations quic-go/quic-go#3629

Merged

marten-seemann added P0 Critical: Tackled by core team ASAP effort/hours Estimated to take one or several hours kind/enhancement A net-new feature or improvement to an existing feature labels Dec 6, 2022

marten-seemann self-assigned this Dec 6, 2022

ajnavarro mentioned this issue Dec 7, 2022

cannot reserve inbound connection: resource limit exceeded ipfs/kubo#9432

Closed

3 tasks

marten-seemann mentioned this issue Dec 8, 2022

go-libp2p v0.24.1 #1940

Closed

3 tasks

marten-seemann closed this as completed in quic-go/quic-go#3629 Dec 8, 2022

marten-seemann reopened this Dec 8, 2022

marten-seemann mentioned this issue Dec 8, 2022

quic: update quic-go to v0.31.1 #1942

Merged

marten-seemann closed this as completed in #1942 Dec 8, 2022

Wondertan mentioned this issue Jan 20, 2023

Unstable connectivity and stream negotiation on test networks celestiaorg/celestia-node#1623

Closed

3 tasks

BigLep reopened this Feb 16, 2023

marten-seemann mentioned this issue Feb 17, 2023

quic: don't send detailed error messages when closing connections #2112

Merged

marten-seemann closed this as completed in #2112 Feb 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

quic: Error from peer should be explicit that it is coming from remote peer rather than a local error #1928

quic: Error from peer should be explicit that it is coming from remote peer rather than a local error #1928

MarcoPolo commented Dec 2, 2022

marten-seemann commented Dec 2, 2022

MarcoPolo commented Dec 2, 2022

MarcoPolo commented Dec 2, 2022

marten-seemann commented Dec 6, 2022

BigLep commented Dec 6, 2022

marten-seemann commented Dec 6, 2022

BigLep commented Dec 6, 2022

markg85 commented Jan 30, 2023

marten-seemann commented Feb 1, 2023

BigLep commented Feb 16, 2023

markg85 commented Feb 16, 2023

marten-seemann commented Feb 17, 2023

quic: Error from peer should be explicit that it is coming from remote peer rather than a local error #1928

quic: Error from peer should be explicit that it is coming from remote peer rather than a local error #1928

Comments

MarcoPolo commented Dec 2, 2022

marten-seemann commented Dec 2, 2022

MarcoPolo commented Dec 2, 2022

MarcoPolo commented Dec 2, 2022

marten-seemann commented Dec 6, 2022

BigLep commented Dec 6, 2022

marten-seemann commented Dec 6, 2022

BigLep commented Dec 6, 2022

markg85 commented Jan 30, 2023

marten-seemann commented Feb 1, 2023

BigLep commented Feb 16, 2023

markg85 commented Feb 16, 2023

marten-seemann commented Feb 17, 2023