Multi-Host DNS and Multi NodeID resolution - for network entry and general usage #491
Conversation
The references are:
Along with the tests:
The above usages need to be changed, but that requires investigating #483. So I suggest @tegefaulkes that you take over this PR, commit a fix for both #483 and #484 in this PR, and merge to staging for integration tests.
Here's an important realisation:

```ts
import type { Hostname } from './src/network/types';
import { Timer } from '@matrixai/timer';
import * as networkUtils from './src/network/utils';

async function main() {
  const timer = new Timer({ delay: 10000 });
  const hosts = await networkUtils.resolveHostname(
    'localhost' as Hostname,
    { timer },
  );
  timer.cancel();
  console.log('HOSTS', hosts);
}

main();
```

Since the timer is passed from the outside, the HOF will not cancel the timer. So it's important to explicitly cancel the timer. This is why I wanted to provide some convenience utility: if there is a default delay it would work automatically, but also if I just pass a number, then the expectation is that the timer can just be cancelled automatically. Like:

```ts
await networkUtils.resolveHostname(..., { timer: 1000 });
// And in this case, rather than inheriting `Timer`, it inherits just a parameter
// and thus it is equivalent to "overriding" the default and passing `timer: undefined`
```

But that can be done later.
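A minimal sketch of what that convenience utility could look like, assuming only the `Timer` API shown above (`new Timer({ delay })` and `timer.cancel()`); `withTimer` is a hypothetical name, not an existing utility:

```ts
import { Timer } from '@matrixai/timer';

// Hypothetical convenience wrapper: timers created internally (from a
// number or the default delay) are cancelled automatically, while a
// caller-provided `Timer` is left for the caller to cancel.
async function withTimer<T>(
  timer: number | Timer | undefined,
  f: (timer: Timer) => Promise<T>,
  defaultDelay: number = 10000,
): Promise<T> {
  // We only own the timer if we created it here
  const owned = typeof timer === 'number' || timer === undefined;
  const timer_ =
    timer instanceof Timer ? timer : new Timer({ delay: timer ?? defaultDelay });
  try {
    return await f(timer_);
  } finally {
    if (owned) timer_.cancel();
  }
}
```

`resolveHostname` could then accept `{ timer: 1000 }` and delegate to something like this internally.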
e8fe24e
to
8dbad30
Compare
8dbad30
to
5d839f9
Compare
Note that using Can always change back to
Do the simplest thing to maintain node connections to the seed nodes at all times. If there's a connection loss to the seed nodes, try to reconnect. Seed node connections must not be subject to the node connection TTL. Seed nodes must not be removed from the node graph. The last manual test is the priority here.
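For illustration, the TTL exemption could look something like this hypothetical sketch (`NodeId`, `seedNodeIds`, and `destroyConnection` are illustrative names, not the actual API):

```ts
type NodeId = string;

// Hypothetical sketch: skip the connection TTL for seed nodes so their
// connections are never expired; reconnection on failure and node graph
// retention are handled separately.
function scheduleConnectionTTL(
  nodeId: NodeId,
  seedNodeIds: Set<NodeId>,
  destroyConnection: (nodeId: NodeId) => Promise<void>,
  ttl: number = 60000,
): NodeJS.Timeout | undefined {
  if (seedNodeIds.has(nodeId)) {
    // Seed node connections are kept alive permanently
    return undefined;
  }
  return setTimeout(() => void destroyConnection(nodeId), ttl);
}
```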
Most of the changes have been applied now and it is mostly working. There are a lot more regressions than I was expecting outside of the nodes domain. One of the main issues: in the case where we have a hostname that resolves to multiple host addresses, how do we handle the connection errors? If we connect to 2 hosts and 1 fails to connect and the other fails the verification, what did it really fail as? Since we're trying to connect to a single node in this case, it could just be the verification error. When trying to find multiple nodes at once, however, the failure becomes more ambiguous. There are about 20 errors still left over. I'm not sure if they will affect manual testing or not. I'm still looking into them.
If you are connecting to
If none of the connections succeed, then you don't need an aggregate error, you can just say that you couldn't connect to
Some tests are expecting a specific reason for failing to connect. I guess I could make them more generic.
You can still provide specific reasons. You can provide a reason for why none of the connections succeeded. Those tests might be too specific in that case; those tests should be made more generic, or completely deleted.
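For illustration, one way to surface a single generic failure when racing connections to multiple resolved hosts (`connectToNode`, `Host`, and `ErrorNodeConnectionFailed` are hypothetical names, not the actual API):

```ts
type Host = string;

class ErrorNodeConnectionFailed extends Error {}

// Hypothetical sketch: race a connection attempt per resolved host and
// throw one generic error if every attempt fails.
async function connectToNode<C>(
  hosts: Array<Host>,
  connect: (host: Host) => Promise<C>,
): Promise<C> {
  try {
    // First successful connection wins
    return await Promise.any(hosts.map((host) => connect(host)));
  } catch (e) {
    // `Promise.any` rejects with an `AggregateError` when all attempts
    // fail; wrap it in a single generic error instead of exposing every
    // individual reason, keeping the original as the cause
    throw new ErrorNodeConnectionFailed(
      `Failed to connect on all ${hosts.length} host(s)`,
      { cause: e },
    );
  }
}
```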
Ok, so after that refactoring it's mostly working again. I resolved most of the test failures. Whatever failures are left are known problems and stuff that won't interfere with manual testing.
We should push up from this branch in case manual testing reveals bugs to be fixed here.
Do any of the tasks above need to be ticked off?
I've just run a test where a node connects to the seed node to see how the changes work. I can see that we're attempting 2 connections, one for each seed node. I'll add logs of the attempt in a moment.
Don't we need to first test if both seed nodes are connecting to each other?
For that I'll need to apply some small changes to the infrastructure and config.
We want to handle the case where we are creating a proxy connection and the client connection drops before composing. Right now I think we only clean up proxy connections when the proxy connection fails or the composed client connection is closed. I need to double check this.

Right now I'm thinking we can prevent a resource leak here by applying a TTL to un-composed connections. This should be pretty simple to implement. Composed connections can be handled by the

There is a concurrency concern with the TTL timing out while composing. Composing sets up some events and the data pipes. Stopping cleans up a bunch of stuff. I'm sure we can end up with some undefined behaviour if we do both of them concurrently. To fix this the

Do we need to apply this to the

Since
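As a rough illustration of the un-composed TTL idea (a minimal sketch, not the actual implementation; `conn.stop()` and `conn.compose()` stand in for the real proxy connection API, and a plain `setTimeout` stands in for the timer system):

```ts
// Hypothetical sketch: un-composed proxy connections get a TTL; composing
// cancels the TTL. A simple flag guards against the TTL firing while a
// compose is in flight (in practice this would need a proper lock).
function applyComposeTTL(
  conn: { stop(): Promise<void>; compose(socket: unknown): void },
  ttl: number = 20000,
) {
  let composed = false;
  const ttlTimeout = setTimeout(() => {
    // Never composed within the TTL: stop the connection to reclaim resources
    if (!composed) void conn.stop();
  }, ttl);
  return {
    compose(socket: unknown): void {
      // Cancel the TTL before composing so the two cannot run concurrently
      clearTimeout(ttlTimeout);
      composed = true;
      conn.compose(socket);
    },
  };
}
```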
I've implemented the 2nd option here. When the client socket closes we cancel the
I think we need to keep this as simple as possible because we are planning to replace this system with a more robust networking solution later.

So the problem right now is that starting a forward connection is being triggered before a

```ts
await conn.start({ timer });
```

While this is occurring, it's possible that

```ts
// HERE, it has already started the connection
const conn = await this.establishConnectionForward(
  nodeId,
  proxyHost,
  proxyPort,
  timer,
);
// What if the `clientSocket` is **already** terminated?
conn.compose(clientSocket);
```

Therefore the simplest solution here would be to attach an event handler to the client socket, which would then trigger a cancellation on the
This sounds similar to what I proposed above, except I was saying that
It's pretty much what you suggested.
Ok, well once composition is enabled, that event handler should be removed. There's no need to keep around that event handler which was only used between
It gets un-registered in the finally block.
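A minimal sketch of that pattern, using the standard `net.Socket` events; `establish` stands in for `establishConnectionForward`, and the `AbortController` wiring is illustrative:

```ts
import type { Socket } from 'net';

// Hypothetical sketch: cancel connection establishment if the client
// socket closes first; the handler is removed in the finally block once
// composition has happened (or failed), as discussed above.
async function composeSafely(
  clientSocket: Socket,
  establish: (signal: AbortSignal) => Promise<{ compose(s: Socket): void }>,
): Promise<void> {
  const abortController = new AbortController();
  const onClose = () => abortController.abort();
  clientSocket.once('close', onClose);
  try {
    const conn = await establish(abortController.signal);
    if (clientSocket.destroyed) {
      throw new Error('Client socket already closed');
    }
    conn.compose(clientSocket);
  } finally {
    // The handler is only needed between establishment and composition
    clientSocket.removeListener('close', onClose);
  }
}
```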
Task 12 can be addressed by a recurring ephemeral task that checks if all of the seed nodes have a connection and re-attempts them. If no seed nodes have an active connection we can re-attempt a network entry. Now that I think about it, if we fail to connect to any seed nodes then we want to end the

There are 4 steps to this.
I think task 12 can just be done by monitoring the lifecycle of the seed connections. You already have a callback for whenever a connection fails, right? Why not expand that so that when a seed connection fails, you just immediately retry? This would require the
Right now the callback isn't called if the connection failed to establish in the first place, and we don't want to retry immediately since it's very unlikely to connect right after it failed. Doing it that way would also spread out the logic within the NCM. Keeping it all in a task keeps it in one place and makes it clear what is going on. I'm leaning towards that unless we need the functionality of reacting the second the connection goes down, or individual restart delays.
Ok, do it the simplest way for now, as long as it works. But I'll mark this to be redesigned when NCM is refactored so that even during the
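A rough sketch of what that recurring task could look like (all names here are illustrative, not the actual NCM API):

```ts
interface SeedChecker {
  getSeedNodeIds(): Array<string>;
  hasConnection(nodeId: string): boolean;
  connect(nodeId: string): Promise<void>;
  syncNodeGraph(): Promise<void>;
}

// Hypothetical recurring task: check every seed node connection,
// re-attempt the ones that are down, and redo network entry if none
// are active.
async function checkSeedConnections(ncm: SeedChecker): Promise<void> {
  const seedNodeIds = ncm.getSeedNodeIds();
  const down = seedNodeIds.filter((id) => !ncm.hasConnection(id));
  // Re-attempt each lost seed connection; individual failures are tolerated
  const results = await Promise.allSettled(down.map((id) => ncm.connect(id)));
  const anyConnected =
    down.length < seedNodeIds.length ||
    results.some((r) => r.status === 'fulfilled');
  if (!anyConnected) {
    // No active seed connections at all: re-attempt network entry
    await ncm.syncNodeGraph();
  }
  // The task would then be rescheduled by the tasks system
}
```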
Going to copy this to the docs issue: #491 (comment)
I just ran a test on the seed nodes; everything seems to be working, including retrying network entry and seed connections. Most Jest tests are passing; there is a small number of tests that need fixing due to changes from this PR. I'm going to resolve the review comments now.
Changed `Signalling` to `signaling`.
That's all the review comments addressed. Last thing to do is to fix any remaining tests that are failing.
Ok, the only tests failing now are
Let me do some final merge prep and then this is done.
Yeah, on staging we should expect that connecting to the testnet should work, but it will require redeployment.
Description

This brings in `resolveHostname`, which will resolve a hostname to multiple IPs, including IPv4 and IPv6, and do a BFS over CNAMEs.

It uses the `dns.resolve*` functions instead of the lookup. This means it doesn't abide by the local OS's `getaddrinfo` name lookup, which also means it cannot be affected by `/etc/hosts`. I didn't think `/etc/hosts` is useful here. Performance is also better this way. See: https://nodejs.org/api/dns.html#implementation-considerations

However it still uses the OS-provided initial DNS server configuration, which means local DNS servers are still used, and local resolvers can still cache the records.

In the future, we can add a flag to switch to using lookup for edge cases.
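As a rough illustration of the BFS over CNAMEs with `dns.resolve*` (a minimal sketch, not the actual implementation in `src/network/utils`; it omits the `Timer`-based cancellation and proper error handling):

```ts
import { Resolver } from 'node:dns/promises';

// Minimal sketch: resolve a hostname to all of its IPv4 and IPv6
// addresses, following CNAME records breadth-first.
async function resolveHostnameSketch(hostname: string): Promise<Array<string>> {
  const resolver = new Resolver();
  const hosts: Array<string> = [];
  const visited = new Set<string>();
  const queue: Array<string> = [hostname];
  while (queue.length > 0) {
    const name = queue.shift()!;
    if (visited.has(name)) continue;
    visited.add(name);
    // A, AAAA, and CNAME queries; any may legitimately fail with
    // ENODATA/ENOTFOUND, which we treat as an empty result
    const [v4, v6, cnames] = await Promise.all([
      resolver.resolve4(name).catch(() => []),
      resolver.resolve6(name).catch(() => []),
      resolver.resolveCname(name).catch(() => []),
    ]);
    hosts.push(...v4, ...v6);
    queue.push(...cnames);
  }
  return hosts;
}
```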
Issues Fixed

- Resolving `hostname` to multiple addresses when connecting to a node #484
- `NodeConnectionManager` #483

Tasks

- Replace `resolveHost` with `resolveHostname`
- Test `resolveHostname`, use common domains like `google.com` and `localhost`
- `NodeConnectionManager` #483

todo:

- Any of the odd methods used within `syncNodeGraph` should be in-lined to avoid noise. (Anything I could in-line is used in multiple places.)
- Also add logging outlining the process of syncing the node graph in `syncNodeGraph`.

Final checklist