-
Notifications
You must be signed in to change notification settings - Fork 1.2k
ERR_ENCRYPTION_FAILED when reconnecting to previously known peer #3702
Comments
One surprising thing here. When you restart node:a and node:b they do not try to connect each other. The expectation was that peer list survives a node restart. |
Might be related: when node:a connects to node:b (single port and WebSocket provided in multiaddr), there are two TCP connections from node:a or node:b. |
@ukstv Thanks for reporting this and making the repo to reproduce the issue. Unfortunately, it "works on my machine," and I can't get it to reproduce the bug 😞 Here's what I see in the log for "node b":
Do you always see the errors, or is it an intermittent thing? |
Also, if you could give some info about your environment (operating system, architecture, nodejs version, etc), that would be great. I have some other machines lying around I could test on, just in case the issue only shows up on some platforms. |
That's unfortunate @yusefnapora . Actually, I tried to create a more reproducible case in Vagrant, and Docker environments, but that magically works with no errors reported :(( I can still reproduce the issue on my machine. It is Logs I have, which you might find useful: https://gist.github.com/ukstv/ac830536e678a38875b4b21f91f4705c |
The logs from @ukstv's reproduction look similar, but slightly different, than the logs we're seeing on the nodes running in our production environment. Those logs look like:
As far as we can tell these nodes are still behaving fine, but those logs are definitely worrisome. It's also weird because this issue shows up on 2 of the nodes in our infra, but not on any of our other nodes, which are all configured it the same was as far as we can tell. |
@vasco-santos have you got any thoughts on this? |
On a peer restart, it will try to dial previously known peers stored in the PeerStore (having a new setup will cause to have an empty peerStore). |
These logs are happening extremely consistently on an ongoing basis, not just at node startup. Seems to be happening about 2-4 times a minute. Here are some timestamps from the logs filtered to log lines containing the error message "ERR_ENCRYPTION_FAILED" On node 1:
Same time frame on node 2:
|
Hey @stbrody I got some time to look into your test case. Thanks for providing it. I have a couple of questions.
I think that by mistake I got exactly your issue. I got the wrong PeerId. So once I started both nodes, nodeB prints:
I got
The correct peerId is So, going ahead with the incorrect peerId and running
and nodes will still try to connect and fail. NodeB will keep trying and getting:
while the output of nodeA when receives this connect is
Which is expected, as we are trying to connect a peer with a wrong peerId. So, it will be reachable via the multiaddr but then when they do the encryption, it will fail because peerA understands that peerB is not who it thought so. Regarding the continuous attempts, when obviously this peer should be discarded, we still need to make these improvements in js-libp2p as part of libp2p/js-libp2p#744 and libp2p/js-libp2p#593 Let me know if this is the problem, as if I use the correct PeerId everything works as expected. And please, remove the repos before starting as they will be polluted with this peer in the PeerStore. |
@ukstv @v-stickykeys: see above. Could this be what's happening on our ELP nodes? |
IPFS node b reports its peer id on console. So It might be the case with wrong peer ids. I guess, if what we see is straight "invalid peer id" situation, the Maybe somehow IPFS and/or libp2p try to reuse peer id from the past? Maybe it is not a peer id, but some other data structure related to peer that is reused while establishing connection? And, this happens while valid peer id is present and the connection overall is established. Like, I do not know, there is an orphan data structure, maybe? |
That seems a high possibility, and is why I mention removing the repo. When you dial other peers, the addr is stored in the addressBook and on startup, the node tries to dial the peers previously stored until have the minimum number of peers connected. If removing the repo is not a possibility for you at the moment, you can add logs for the peerId that is trying to dial and see if it is one of your nodes. If not, you can programatically remove it with https://github.com/libp2p/js-libp2p/blob/master/doc/API.md#peerstoredelete |
Something is bothering me about the hypothesis. Two of Ceramic hosted IPFS nodes just wait for an external connection. I run an IPFS node on my machine, and command it to connect to one of the hosted node. Then I see the same XX handshake error on my node's log. I guess, the relationship between stale peer and XX handshake error is not that straightforward. Anyway, we'll try removing peers. CC @stbrody |
@v-stickykeys @zachferland - I didn't quite follow the suggestion of removing the peers from the repo - does this make sense to you? Is this something we could try on our ELP nodes to see if it fixes the issue? |
If you have problematic addresses in your node's PeerStore (addresses with wrong peerIds), this error will happen as we do not have multiaddr confidence in place yet that would make libp2p intelligent to get rid of them. This is because when you start a node, it will try to dial over time the peers it previously knew (peerStore) in order to get a bigger connection pool. To be clear, when you try to dial The question I still have in your test case is that I did what is expected #3702 (comment) and you were dialling the incorrect PeerId there. |
We are definitely connecting to the correct peerids but there is a chance that the peerstore is polluted. Will look into it now and report back |
I just looked into this again and my ability to repro supports what you are saying @vasco-santos There's two issues that remain:
|
|
js-ipfs is being deprecated in favor of Helia. You can #4336 and read the migration guide. Please feel to reopen with any comments by 2023-06-02. We will do a final pass on reopened issues afterward (see #4336). Assigning to @achingbrain to answer whether this issue is already resolved in Helia, or if this issue needs to be migrated to that repo! |
This appears to have been fixed - at least, I run the repro example and the nodes reconnect to each other. If you're still seeing problems, please open an issue against https://github.com/ChainSafe/js-libp2p-noise as that is likely where any fix will land. |
With
DEBUG=libp2p:*
one could see an interesting (and alarming) messages when a node connects to another node:Repo to reproduce: https://github.com/ukstv/ipfs-encryption-repro
Sometimes this error on reconnection persists: one could see repeated messages about XX handshake in a log, but this one I can not reproduce on request.
This worries me, as it might be related to instability of IPFS-to-IPFS connection. Sometimes messages get lost, or one could not retrieve a record, if it exists on another seemingly connected node.
The text was updated successfully, but these errors were encountered: