fix: unable to establish connection to seed node #489
Comments
The punch protocol is executed regardless of whether there is NAT or not. So you would want to establish whether your home node is receiving punch packets from the testnet node. You can do this by using Wireshark or by logging out in the code on the punch message/receive message handler. If that is true, then the error is somewhere else, maybe some logic problem or timeout issue. If that is not true, then the punch protocol is failing. This may be related to the #474 NAT testing failures since that was failing locally too. |
Original context found here. #487 (comment) |
Testing locally gives us the following error.
The bug here is really weird. Just before the error in ... Oh, it's called twice; the first time is correct, the 2nd one isn't. |
I think we've seen this before. There's always confusion between |
Here's an example of a confusion I've seen before in the codebase: One source of confusion is that we are using |
I've found the problem. When creating a ... We need to fix this logic. What needs to happen is that the seed node fills in the details of the node making the hole punch request. This can be done in two ways. Our node can request our connection information from the seed node and fill in the details. Or the seed node, when it receives the hole punch relay request, can fill in the connection details of the node making the request. I've been vaguely aware of this problem for a while now. I think I've mentioned it in the #365 issue. |
That hole punch message is only relevant if we are connecting to another (third) node though. I remember that in ICE we do 3 things in parallel: direct connection, hole punch signalling, and relay message. Now since we are hitting the seed nodes, this is different from hitting random nodes. I imagine that upon the initial network entry, it should not bother with trying to "hole punch signal" the seed nodes. It should only attempt a direct connection. We haven't really worked out how #365 or #182 would work. So at the network entry stage, there shouldn't be any ICE logic. Only direct connections should be necessary. |
I don't understand this. The "hole punch message" is for signalling. The connecting node should not be putting in its own IP and port information because like you said, it wouldn't even know the right IP and port info. |
The seed node "fills in the details" when it receives a signalling message. It does this dynamically at the point of handling the service call. |
Overall I think the process needs to be refactored. Currently we ask the seed node to relay a hole punch message to our target; in this case we are providing all of the information here. What we need to do is ask the seed node to coordinate a connection between ourselves and our target node. It would look something like this.
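Roughly, the seed-node side of that coordination might look something like the sketch below. The names here are hypothetical and not the actual Polykey API; the point is that the seed node fills in the requester's observed address itself, instead of trusting whatever the requester put in the message, and then relays that to the target.

```ts
// Hypothetical sketch only; SignalRequest, relayToTarget and the field names
// do not necessarily match the real Polykey codebase.
type NodeId = string;
type Address = { host: string; port: number };

interface SignalRequest {
  sourceNodeId: NodeId; // node asking for the connection
  targetNodeId: NodeId; // node it wants to reach
}

interface SignalMessage extends SignalRequest {
  sourceAddress: Address; // filled in by the seed node, not the requester
}

// Runs on the seed node when it handles the signalling service call.
async function handleSignalRequest(
  request: SignalRequest,
  // The address the seed node actually observed for this caller,
  // taken from the connection that is handling the call.
  observedSourceAddress: Address,
  relayToTarget: (targetNodeId: NodeId, msg: SignalMessage) => Promise<void>,
): Promise<void> {
  const message: SignalMessage = {
    ...request,
    // Key point: the seed node fills in the caller's details dynamically,
    // because the caller cannot know its own NAT-mapped host and port.
    sourceAddress: observedSourceAddress,
  };
  await relayToTarget(request.targetNodeId, message);
}
```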
The key difference here is that |
As you're fixing this, can you change the name of this mechanism to be "signalling"? That is |
It's an old implementation that hasn't been reviewed. |
Atm, the |
Apply the first fix #489 (comment), and then test to see if it's working. Then apply the second fix, because the second fix #489 (comment) would not execute without the first fix. |
Fixes should go straight to staging. |
Just realised that I still need to check the fix against the testnet. Reopening this. |
Connecting to the testnet is still failing so that's a different problem. Debugging and fixing will be a little tricky since the problem is with the seed node. |
Did you test this locally? |
Yes, it's working locally. The local bug is different from the seed node's bug. Locally it was an explicit error due to a bad IP address for hole punching. For the seed node it's timing out while forming the reverse connection. |
Use the scripts to build and push a new image; the CI/CD does this automatically, but the scripts can also be run locally. I think you need to build the image first though. Afterwards, you have to establish whether your home node is actually receiving punch packets from the seed node. |
Using Wireshark, I can see that we're getting response packets.
So it's not a problem with the networking. Now that I think about it, it's not failing to establish a connection back to our node; it's the ... There's a diagram here for reference: #361 (comment). So now the question is, why is it failing to connect to itself locally? Is this a container thing? An EC2 networking thing? |
Here is the seed node's start information.
And when running locally.
|
Check the actual task definition in |
That doesn't show the ... Remember the agent service is not meant to be publicly exposed. It's meant to just be generated for |
If the |
Might be related to host networking mode? "The task uses the host's network which bypasses Docker's built-in virtual network by mapping container ports directly to the ENI of the Amazon EC2 instance that hosts the task. Dynamic port mappings can’t be used in this network mode. A container in a task definition that uses this mode must specify a specific hostPort number. A port number on a host can’t be used by multiple tasks. As a result, you can’t run multiple tasks of the same task definition on a single Amazon EC2 instance." Specifically ... A quick test for this would be to hard-code the agent port and see if that fixes it. |
We don't have a way to set the agent host/port when starting the agent. So to test this I will have to push new container images up. |
I'd be surprised if that was the reason. The host network just means it's using the host's interface. At the end of the day the agent service is supposed to be bound to ... If you are able to SSH into the EC2 instance, you can also bring in some networking tools to observe the port registrations used by the container there. Use things like |
It seems to be bound.
|
I'm installing tcpdump on the EC2 instance so I can see what is going on with the connection. |
Here is a snippet from the tcpdump output.
This is odd. First, the gRPC agent seems to be responding here. Second, there should be more traffic than a single packet of 9 bytes, right? |
That may just be the initial connection to the listening server. There is a 2nd stage to the communication on new ports using HTTP. |
We can see the HTTP handshake working. But then the connecting side initiates the end of the connection. |
3 things to look into.
|
We will need a source of truth to compare against. So record all the logs locally, all the UTP message handling. Then we compare with the remote node, after creating an image and pushing/deploying to the testnet. Make sure you have your ... We should also use tshark to compare against the Wireshark logs since these are likely to be more consistent with each other. |
I should get tshark to work on the EC2 instance so we can compare with the local packet dump. |
The agent is failing randomly, and it crashed by itself without any interaction around midnight last night. So we need to investigate why it would do this without any logs. This would indicate something breaking in the background. It has to log out ALL uncaught exceptions and uncaught rejections. |
Debugging Procedure
Fixing the Timed Cancellable Abort
We know that starting a connection for ...
The ... This means this is the maximum deadline for our abort timeout. We should add a ... Then set the default deadline to ... For this to work, the punch interval should be set to 50ms or 100ms. At this point it should be possible to abort the connection within 1 second and NOT see the ... To ensure that we do not see a ...
Connection Protocol Logging
We need to log out all the relevant parts of the connection protocol from both ... These log messages MUST have timing information as part of the logs. This is so we can compare the ordering.
Local Simulation Testing
Open up Wireshark, and run a seed node and a client node. Both nodes should run and output the log messages above. Observe that a full connection WORKS until being terminated by the node connection TTL, which defaults to 60 seconds. Look at the logs to see that the entire protocol is being followed.
Pushing up the same image to ECR
Authenticate to skopeo using the command in the README.md. Use ... Now we will compare the logs between our office node and the testnet node against the logs we captured locally. This is to identify where it is failing, where there is a discrepancy. The only thing that could be problematic is the fact that we have a double NAT here in the office (specifically carrier-grade NAT). If this is a problem, we can see if it works from another AWS EC2 system to check whether it is a NAT issue.
Fixing the Random Process Termination
Cover all our bases by having unique exit codes for:
Log out the exit codes, and replicate the random failure by fuzz testing a local seed node while simulating the EC2 conditions by running the docker container image locally. Identifying the correct exit code should be able to tell us what is happening. ALSO check that we are not being killed by something else like the OOM killer on the operating system. Otherwise, if we cannot get anything useful, we will run strace on the entire Node.js process there and try to trigger a crash. Alternatively we can use https://rr-project.org/ and see https://fitzgeraldnick.com/2015/11/02/back-to-the-futurre.html |
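As a rough sketch of the unique exit codes idea (the codes, values, and structure here are illustrative, not the actual Polykey implementation), the process-level handlers could look like this:

```ts
import process from 'node:process';

// Hypothetical exit codes; the real codebase may use different values.
const ExitCodes = {
  uncaughtException: 70,
  unhandledRejection: 71,
} as const;

// Log out ALL uncaught exceptions and rejections before exiting, so a
// "random" termination always leaves a trace in the logs.
process.on('uncaughtException', (err) => {
  console.error(`[fatal] uncaught exception: ${err.stack ?? err}`);
  process.exit(ExitCodes.uncaughtException);
});

process.on('unhandledRejection', (reason) => {
  console.error(`[fatal] unhandled rejection: ${String(reason)}`);
  process.exit(ExitCodes.unhandledRejection);
});

// Also log the exit code itself on every exit, so the container logs show
// which path terminated the process.
process.on('exit', (code) => {
  console.error(`[exit] process exiting with code ${code}`);
});
```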
I've checked the timeout for the forward and reverse connection. The timeout was working BUT it was using the wrong defaults. I've fixed this up now. When setting the timeouts to 0 I'm seeing ... I can confirm it is coming from the |
I'm getting a bit sidetracked here. I've made a comment on #473 about the new details I've discovered about this problem. I'll leave this for later; it's only triggered by an extremely short timeout on the reverse connection establishment. |
The timeout stuff has been fixed. I've opted for a 2000ms timeout to match what the ping was set to. General fixes for the abort problem:
|
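For reference, here is a minimal sketch of a deadline-bounded, abortable punch loop along those lines. The function and parameter names are placeholders, not the real Polykey networking API; the idea is that punch packets go out on a short interval such as 50ms, while the caller bounds the whole operation with a deadline such as the 2000ms timeout mentioned above.

```ts
// Illustrative sketch only; sendPunchPacket is a placeholder, not the real
// Polykey networking API.
async function punchUntilAborted(
  sendPunchPacket: (host: string, port: number) => Promise<void>,
  host: string,
  port: number,
  signal: AbortSignal,
  punchIntervalMs: number = 50,
): Promise<void> {
  // Keep sending punch packets on a short interval so that when the caller's
  // deadline aborts the signal, the loop notices within one interval and
  // stops cleanly instead of hanging past the timeout.
  while (!signal.aborted) {
    await sendPunchPacket(host, port);
    await new Promise<void>((resolve) => {
      const timer = setTimeout(resolve, punchIntervalMs);
      signal.addEventListener(
        'abort',
        () => {
          clearTimeout(timer);
          resolve();
        },
        { once: true },
      );
    });
  }
}

// Usage sketch: bound the whole operation with a 2000ms deadline.
// const controller = new AbortController();
// const deadline = setTimeout(() => controller.abort(), 2000);
// await punchUntilAborted(send, '1.2.3.4', 1314, controller.signal, 50);
// clearTimeout(deadline);
```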
Here are the logs for normal operation.
Seed node:
Agent:
Wireshark:
|
Next step is to update the ECR image and test against that. |
After updating the ECR, these are the logs from connecting to the seed node.
Seed:
Node:
|
It seems that after updating the image in the ECR, the seed node is handling connections properly now. We see the connection fully established and the connecting node gets added to the seed node's node graph. |
I'll make a new issue for the 4th part for |
I've added this commit as a reference for logging outputs. This will be reverted because the added messages are too spammy for normal usage. Related #489
Question about the logs. Why does:
Have a |
It was printing out the data, which I assumed was a buffer but ended up just being printed as a string. 0 and 1 are non-printable characters. |
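As a small generic Node.js illustration of why that happens (not the actual Polykey log code): bytes 0 and 1 are control characters, so a buffer containing them looks empty when logged as a string, while a hex encoding makes them visible.

```ts
// Bytes 0x00 and 0x01 are control characters, so when a Buffer containing
// them is converted to a string for logging, nothing visible shows up.
const data = Buffer.from([0, 1]);

console.log(`is buffer:  ${Buffer.isBuffer(data)}`);   // is buffer:  true
console.log(`as string: "${data.toString('utf8')}"`);  // looks empty: "\x00\x01"
console.log(`as hex:     ${data.toString('hex')}`);    // as hex:     0001
```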
Specification
During the network entry procedure, where we attempt to connect to a seed node, the connection is failing. The expectation here is that we specify the seed node using the --seed-node CLI parameter and our node connects to said seed node. What is happening is that we start the connection but the seed node times out when establishing the reverse connection. Specifically, the connection is being started, so the seed node sees this connection. When handling the connection it should create a reverse connection and compose it. We are failing to start the reverse connection during ReverseConnection.start() due to it timing out.
Additional context
Related MatrixAI/Polykey-CLI#71
Related #487
Tasks