Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

quic: Error from peer should be explicit that it is coming from remote peer rather than a local error #1928

Closed
Tracked by #1940
MarcoPolo opened this issue Dec 2, 2022 · 12 comments · Fixed by quic-go/quic-go#3629, #1942 or #2112
Assignees
Labels
effort/hours Estimated to take one or several hours kind/enhancement A net-new feature or improvement to an existing feature P0 Critical: Tackled by core team ASAP

Comments

@MarcoPolo
Copy link
Collaborator

We lost some time debugging this. Basically Kubo folks saw this log line

{12D... Application error 0x0: conn-27571160: system: cannot reserve inbound connection: resource limit exceeded}

Even with the resource manager disabled. And naturally assumed we were still blocked despite using a null resource manager. This error actually came from the peer when it closed the connection here

We should be more explicit locally that this is a remote error from the peer, and not an error we generated locally.

On the flip side, it's cool that the local peer can see the remote peer is blocking connections because of rcmgr.

@marten-seemann
Copy link
Contributor

We should be more explicit locally that this is a remote error from the peer, and not an error we generated locally.

Is that something that quic-go should do? The string representation is generated there. Any suggestion where to sneak that bit of information into the message?

On the flip side, it's cool that the local peer can see the remote peer is blocking connections because of rcmgr.

It's also scary. We should be more deliberate about what we send in the error message. Having an error code for "you exceed a resource limit" would definitely make sense though (see libp2p/specs#479 for a proposal).

@MarcoPolo
Copy link
Collaborator Author

We should add extra information to the error on the receiving side, so that out of date clients or malicious/broken clients can't send confusing errors.

We should also sanitize the errors we send back. I think a "I'm overloaded try again later" style error is fine. But we shouldn't be sending errors with this high level of fidelity back.

@MarcoPolo
Copy link
Collaborator Author

Is that something that quic-go should do?

I think so.

Any suggestion where to sneak that bit of information into the message?

Not sure. Maybe wherever we receive the application error code and the error string?

@marten-seemann
Copy link
Contributor

quic-go/quic-go#3629 would fix this. @MarcoPolo, would you mind doing a quick review over in quic-go?

@BigLep
Copy link
Contributor

BigLep commented Dec 6, 2022

Thanks @marten-seemann . A couple of things:

  1. Would this make it in for 0.24 (or maybe a followup in 0.24.1). I'm thinking about it from the lens of making resource manager errors clearer for Kubo users.
  2. What will the resulting error message look like in Kubo after this change (assuming Kubo doesn't make any changes)? Is it boing to be:

{12D... Application error (remote) 0x0: conn-27571160: system: cannot reserve inbound connection: resource limit exceeded}

@marten-seemann
Copy link
Contributor

  1. Would this make it in for 0.24 (or maybe a followup in 0.24.1). I'm thinking about it from the lens of making resource manager errors clearer for Kubo users.

Definitely won't make it into v0.24.0, because that one has shipped last night :)
We can do a patch release if there's desire for it. Should we target the RC or the final release with that?

2. What will the resulting error message look like in Kubo after this change (assuming Kubo doesn't make any changes)? Is it boing to be:

That's correct.

@BigLep
Copy link
Contributor

BigLep commented Dec 6, 2022

I think we can do this for for the final release in case other stuff comes up between now and then. Thanks @marten-seemann .

@markg85
Copy link

markg85 commented Jan 30, 2023

Hi,

Sorry for posting in a closed issue, but this seems to be touching exactly what i "thought" i was looking for.

This is all with Kubo 0.18.1 (yup, brand spanking new for just some hours as i write this).

Application error 0x0 (remote): conn-1940203: system: cannot reserve inbound connection: resource limit exceeded

Is a message that is spammed so much when warning logs are enabled.
But when looking at the first post here by @MarcoPolo i'm lead to believe that this message actually means: On the flip side, it's cool that the local peer can see the remote peer is blocking connections because of rcmgr.

So now i'm like totally confused.
The message i see in ipfs leads me to think that a resource is too constrained. Which one?
But here this issue tells me the remote is blocking. That has nothing todo with resource limit exceeded which the error claims.

Some clarity would be helpful here.
Also, those error messages in the log really should be more informative. If there is an actual resource limit exceeded then i'd very much suggest adding a note on which resource that might be and how to fix it.

To be clear here. The mentioned error message starts popping up within mere seconds of starting ipfs when it certainly doesn't have many connections yet.

@marten-seemann
Copy link
Contributor

This is all with Kubo 0.18.1 (yup, brand spanking new for just some hours as i write this).

Application error 0x0 (remote): conn-1940203: system: cannot reserve inbound connection: resource limit exceeded

As the message says, this error happened on the remote node. For some reason a resource limit is being exceeded there. There's nothing you can do about it locally.

@BigLep
Copy link
Contributor

BigLep commented Feb 16, 2023

@marten-seemann : I'm going to reopen given this is still causing confusion for Kubo users per:
ipfs/kubo#9432 (comment)
ipfs/kubo#9432 (comment)
ipfs/kubo#9432 (comment)

I think the suggestion here and here are good input for driving clarity:

The log entries could be made more clear if (remote) is turned into (from remote), else it's easy to confuse this as being related to remote something, not that it is coming from the remote node.
Wrapping it in quotes to show that it is "verbatim" would probably help as well

@BigLep BigLep reopened this Feb 16, 2023
@markg85
Copy link

markg85 commented Feb 16, 2023

A thing that didn't hit me till @MarcoPolo explained it to me is that a message like this:
Application error 0x0 (remote): conn-1940203: system: cannot reserve inbound connection: resource limit exceeded

Has a lot more meaning then it seems. Knowledge you just don't know unless you go around asking people and/or happen to find docs that explain it.

That message means:

  • A connection has been made with the remote.
  • The remote can't maintain the connection and dropped it
  • It has limits that prevent the connection from being maintained.

But the message reads like: "There was an error, my local system cannot reserve a slot to accept that remote connection"
The points in the message that trigger that thought are (highlighting them)

Application error 0x0 (remote): conn-1940203: system: cannot reserve inbound connection: resource limit exceeded

Changing "remote" to "remote peer" is marginally making it better but still gives a message that is very confusing to read without the background information. Do note, anyone reading this very github issue likely already has a ton more background information then the average user trying to debug ipfs.

Imho, these parts of the message should go:

  • Remove the "Application error" prefix. It leads one to think that there is an error locally. There is not. Just not.
  • Remove "system", this very vaguely describes something about a system? What system? Where? Local? Remote?
  • Remove "resource", what resource? CPU? Ram? Network? Volts from the PSU ;) It's just too ambiguous and tells you "something" is wrong but not telling you what.

A reworded message that "makes sense to me" would look something like:

Network error (conn-1940203) remote peer dropped connection attempt because it's limits have been exceeded

Alternatively I wouldn't even mention anything about limits being exceeded. It's an internal detail the remote knows more about and should not be of any value for someone making a connection to that peer. An error that is more appropriate would look something like this:

Network error (conn-1940203) remote peer could not maintain our connection and dropped it

In this latter case you could provide more info on the peer itself that drops a connection. So for example if my node drops an incoming connection then it can tell the verbose details (limits, resources, whatnot) as to why it dropped that connection. The remote that is being dropped should not care about those details.

I hope this insight in my thinking about that error message helps :)

@marten-seemann
Copy link
Contributor

A few things here:

  • This error is emitted by quic-go. In the context of QUIC, "Application Error" has a very specific meaning. It makes sense that a QUIC stack would make use of that terminology.
  • It's a bug that we're sending the entire error message when using QUIC. We shouldn't reveal that much information to a (potentially malicious) peer. Instead, we should only inform about the broad category, e.g. by using an error code
  • Once implemented, libp2p should define a string representation that makes sense for all transports. This will be easy, since the error returned by quic-go contains all the information we need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
effort/hours Estimated to take one or several hours kind/enhancement A net-new feature or improvement to an existing feature P0 Critical: Tackled by core team ASAP
Projects
None yet
4 participants