
Multi-Host DNS and Multi NodeID resolution - for network entry and general usage #491

Merged: 17 commits into staging on Nov 8, 2022

Conversation

@CMCDragonkai (Member) commented Oct 31, 2022

Description

This brings in resolveHostname, which resolves a hostname to multiple IPs (both IPv4 and IPv6) and does a BFS over CNAMEs.

It uses the dns.resolve* functions instead of dns.lookup. This means it doesn't go through the local OS's getaddrinfo-based name lookup, so it cannot be affected by /etc/hosts. I don't think /etc/hosts is useful here, and performance is also better this way. See: https://nodejs.org/api/dns.html#implementation-considerations

However, it still uses the OS-provided initial DNS server configuration, which means local DNS servers are still used and local resolvers can still cache the records.

In the future, we can add a flag to switch to using dns.lookup for edge cases.
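As a rough illustration of that approach (this is not the actual src/network/utils implementation, just a sketch using Node's dns.promises API):

    import { promises as dnsPromises } from 'dns';

    type Hostname = string;
    type Host = string;

    // Sketch only: resolve A (IPv4) and AAAA (IPv6) records for a hostname and
    // follow CNAME chains breadth-first, collecting every address found.
    async function resolveHostnameSketch(hostname: Hostname): Promise<Array<Host>> {
      const hosts: Array<Host> = [];
      const visited = new Set<Hostname>();
      const queue: Array<Hostname> = [hostname];
      while (queue.length > 0) {
        const target = queue.shift()!;
        if (visited.has(target)) continue;
        visited.add(target);
        // Missing record types just mean no addresses of that family
        const ipv4 = await dnsPromises.resolve4(target).catch(() => []);
        const ipv6 = await dnsPromises.resolve6(target).catch(() => []);
        hosts.push(...ipv4, ...ipv6);
        // Enqueue CNAME targets for the breadth-first traversal
        const cnames = await dnsPromises.resolveCname(target).catch(() => []);
        queue.push(...cnames);
      }
      return hosts;
    }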

Issues Fixed

Tasks

todo:

  1. Improve logging for the life cycle of the NCM and the creation/destruction flow of the node connections.
  2. Defence in depth for using our own nodeId in the node graph. We need to filter out our own nodeId from information when we receive it, resolve it or attempt to use it.
  3. Relaying nodes should reject requests from nodes to themselves. Should check both the nodeId and the address.
  4. Make the node connection creation for the multi-node resolving serial instead of concurrent. Need to double check the destroy callback logic here.
  5. ICE needs to be extracted from the NC and done alongside its creation instead.
  6. Special case seed nodes to avoid TTL clean up on connections.
  7. Any of the odd methods used within syncNodeGraph should be in-lined to avoid noise. Also add logging outlining the process of syncing the node graph. Anything I could in-line is used in multiple places.
  8. Filter out any IPv6 addresses when resolving hostnames.
  9. Add key path to logger messages.
  10. Add detailed logging to syncNodeGraph.
  11. Separate timeouts for each sub connection.
  12. All nodes need to re-establish connections to seed nodes when they disconnect.
  13. Serial connection establishment needs to be updated to work concurrently.
  14. Proxy needs to cancel forward connection establishment if the underlying GRPC client connection is destroyed.
  15. Special case GC of seed nodes in the NodeGraph.

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build


@CMCDragonkai (Member Author)

The resolveHostname is ready to be used.

The references are:

nodes/NodeConnectionManager.ts
297:        const targetHost = await networkUtils.resolveHost(targetAddress.host);
803:    host = await networkUtils.resolveHost(host);

nodes/NodeManager.ts
110:    const host_ = await networkUtils.resolveHost(host);
240:    const targetHost = await networkUtils.resolveHost(targetAddress.host);

Along with the tests:

network/utils.test.ts
28:      networkUtils.resolveHost('www.google.com' as Host),
30:    const host = await networkUtils.resolveHost('www.google.com' as Host);
33:      networkUtils.resolveHost('invalidHostname' as Host),

The above usages need to be changed, but it requires investigating #483.

So I suggest that @tegefaulkes take over this PR, commit a fix for both #483 and #484 here, and merge to staging for integration tests.

@CMCDragonkai changed the title from "feat: multi-host DNS resolution" to "Multi-Host DNS and Multi NodeID resolution - for network entry and general usage" on Oct 31, 2022
@CMCDragonkai (Member Author) commented Oct 31, 2022

Here's an important realisation.

import type { Hostname } from './src/network/types';
import { Timer } from '@matrixai/timer';
import * as networkUtils from './src/network/utils';

async function main(){
  const timer = new Timer({ delay: 10000 });
  const hosts = await networkUtils.resolveHostname(
    'localhost' as Hostname,
    {
      timer,
    }
  );
  timer.cancel();
  console.log('HOSTS', hosts);

}

main();

Since the timer is passed from the outside, the HOF will not cancel the timer. So it's important to explicitly cancel the timer.

This is why I wanted to provide some convenience: if there is a default delay it would work automatically, but also if I just pass a number, then the expectation is that the timer is created and cancelled automatically.

Like:

await networkUtils.resolveHostname(..., { timer: 1000 });
// And in this case, rather than inheriting `Timer`, it inherits just a parameter
// and thus it is equivalent to "overriding" the default and passing `timer: undefined`

But that can be done later.
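For illustration, a hypothetical helper along those lines (withTimer and the 10 second default are made up for this sketch, not an existing utility):

    import { Timer } from '@matrixai/timer';

    // Accept either a Timer (caller owns it and must cancel it) or a plain
    // number of milliseconds (we create the Timer and cancel it ourselves).
    async function withTimer<T>(
      timer: Timer | number | undefined,
      f: (timer: Timer) => Promise<T>,
      defaultDelay: number = 10000,
    ): Promise<T> {
      if (timer instanceof Timer) {
        // Caller-provided timer: the caller is responsible for cancelling it
        return await f(timer);
      }
      const ownTimer = new Timer({ delay: timer ?? defaultDelay });
      try {
        return await f(ownTimer);
      } finally {
        // We created this timer, so we cancel it
        ownTimer.cancel();
      }
    }

A call like resolveHostname(..., { timer: 1000 }) could then route through something like this internally.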

@CMCDragonkai force-pushed the feature-dns-multi branch 2 times, most recently from e8fe24e to 8dbad30 on October 31, 2022 12:58
@CMCDragonkai (Member Author)

Note that using dns.resolve may have issues with mDNS later in MatrixAI/js-mdns#1. Also, this may result in some changes to the docker configuration, because I don't know how the docker container will receive DNS servers.

We can always change back to dns.lookup. If we do so, we won't really be doing anything with CNAMEs at all, as that will all be up to the OS to manage; instead we will just get back an array of IPv4 and IPv6 addresses.
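For reference, the lookup-based variant would be roughly this (illustrative only; lookupHostname is not an existing utility):

    import { promises as dnsPromises } from 'dns';

    // With dns.lookup the OS resolver handles CNAMEs and /etc/hosts for us; we
    // just get back a flat list of IPv4/IPv6 addresses.
    async function lookupHostname(hostname: string): Promise<Array<string>> {
      const results = await dnsPromises.lookup(hostname, { all: true });
      return results.map((result) => result.address);
    }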

@CMCDragonkai (Member Author)

Do the simplest thing to maintain node connections to the seed nodes at all times. If there's a connection loss to the seed nodes, try to reconnect.

Seed node connections must not be subject to node connection TTL.

Seed nodes must not be removed from the node graph.

Last manual test is the priority here.

@tegefaulkes (Contributor)

Most of the changes have been applied now and it is mostly working. There are a lot more regressions than I was expecting outside of the nodes domain.

One of the main issues is the case where we have a hostname that resolves to multiple host addresses: how do we handle the connection errors? If we connect to 2 hosts and 1 fails to connect while the other fails verification, what did it really fail as? Since we're trying to connect to a single node in this case, it could just be the verification error. When trying to find multiple nodes at once, however, the failure becomes more ambiguous.

There are about 20 errors still left over. I'm not sure if they will affect manual testing or not. I'm still looking into them.

@CMCDragonkai (Member Author)

If you are connecting to NodeIdA, and you end up trying 10 different connections, then as long as 1 connection succeeds the connection to NodeIdA is successful; the other failures should be ignored.

@CMCDragonkai (Member Author)

If none of the connections succeed, then you don't need an aggregate error, you can just say that you couldn't connect to NodeIdA.

@tegefaulkes (Contributor)

Some tests are expecting a specific reason for failing to connect. I guess I could make them more generic.

@CMCDragonkai (Member Author)

You can still provide specific reasons. You can provide a reason for why none of the connections succeeded, just like AggregateError, although we wouldn't use that directly; we'd just use a specialised exception inside the network or nodes domain to represent this. You can also use the cause.

Those tests might be too specific in that case; they should be made more generic, or completely deleted.
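Something along these lines, for example (ErrorNodeConnectionFailed is a made-up name here, just to illustrate the shape):

    // Try each candidate connection serially; ignore individual failures if any
    // attempt succeeds, otherwise throw one domain error for the node with the
    // details attached as `cause`.
    class ErrorNodeConnectionFailed extends Error {}

    async function connectToNode<T>(
      nodeId: string,
      attempts: Array<() => Promise<T>>,
    ): Promise<T> {
      const errors: Array<unknown> = [];
      for (const attempt of attempts) {
        try {
          return await attempt(); // first success wins
        } catch (e) {
          errors.push(e);
        }
      }
      throw new ErrorNodeConnectionFailed(`Could not connect to ${nodeId}`, {
        cause: new AggregateError(errors),
      });
    }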

@tegefaulkes (Contributor)

Ok, so after that refactoring it's mostly working again. I resolved most of the test failures. Whatever failures are left are known problems and things that won't interfere with manual testing.

@CMCDragonkai (Member Author)

We should push up from this branch in case manual testing reveals bugs to be fixed here.

@CMCDragonkai (Member Author)

Do any of the tasks above need to be ticked off?

@tegefaulkes (Contributor)

I've just run a test where a node connects to the seed node to see how the changes work. I can see that we're attempting 2 connections, one for each seed node on the testnet.polykey.io domain. 1 connection is getting established, however it's disconnecting far too soon afterwards. I need to look into this.

I'll add logs of the attempt in a moment.

@CMCDragonkai (Member Author)

Don't we need to first test if both seed nodes are connecting to each other?

@tegefaulkes (Contributor)

For that I'll need to apply some small changes to the infrastructure and config.

@tegefaulkes (Contributor) commented Nov 8, 2022

We want to handle the case where we are creating a proxy connection and the client connection drops before composing. Right now I think we only clean up proxy connections when the proxy connection fails or the composed client connection is closed. I need to double check this.

Right now I'm thinking we can prevent a resource leak here by applying a TTL to un-composed connections. This should be pretty simple to implement. Composed connections can be handled by the NodeConnectionManager's TTL logic, since shutting down the client connection will clean up the proxy connection.

There is a concurrency concern with the TTL timing out while composing. Composing sets up some events and the data pipes. Stopping cleans up a bunch of stuff. I'm sure we can end up with some undefined behaviour if we do both of them concurrently. To fix this, the compose and stop methods need to share a lock. If we can share the lifecycle lock here, that would work.

Do we need to apply this to the ReverseConnection as well? I'm leaning towards no. Its life-cycle is mostly driven by the complementary ForwardConnection.

Since connectForward is a PromiseCancellable we can just cancel it if the GRPC connection ends. This can be a first order fix for the problem. We will still need the TTL as a final stop for handling the connection leaks.
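Roughly, the un-composed TTL and the shared compose/stop lock could look like this (all names and the 20 second default are illustrative, not the actual ConnectionForward API):

    // A TTL that stops a connection that is never composed, plus a tiny lock so
    // compose and stop cannot interleave.
    class ComposeTtl {
      protected lock: Promise<void> = Promise.resolve();
      protected ttlTimeout: NodeJS.Timeout;

      constructor(
        protected conn: { compose: () => Promise<void>; stop: () => Promise<void> },
        ttl: number = 20000,
      ) {
        // If compose never happens within the TTL, tear the connection down
        this.ttlTimeout = setTimeout(() => void this.stop(), ttl);
      }

      public async compose(): Promise<void> {
        await this.withLock(async () => {
          clearTimeout(this.ttlTimeout);
          await this.conn.compose();
        });
      }

      public async stop(): Promise<void> {
        await this.withLock(async () => {
          clearTimeout(this.ttlTimeout);
          await this.conn.stop();
        });
      }

      // Minimal mutex: chain operations so compose and stop never run concurrently
      protected async withLock<T>(f: () => Promise<T>): Promise<T> {
        const result = this.lock.then(f);
        this.lock = result.then(
          () => undefined,
          () => undefined,
        );
        return await result;
      }
    }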

@tegefaulkes (Contributor)

I've implemented the 2nd option here: when the client socket closes we cancel the connectForward promise. That should be enough to fix task 14 for now.

@CMCDragonkai (Member Author) commented Nov 8, 2022

I think we need to keep this as simple as possible, because we are planning to replace this system with a more robust networking solution later.

So the problem right now is that starting a forward connection is triggered before compose is called. This is because in Proxy.establishConnectionForward, it's calling:

    await conn.start({ timer });

While this is occurring, it's possible that clientSocket has already terminated before composition.

    // HERE, it is already started the connection
    const conn = await this.establishConnectionForward(
      nodeId,
      proxyHost,
      proxyPort,
      timer,
    );
    // What if the `clientSocket` is **already** terminated?
    conn.compose(clientSocket);

Therefore the simplest solution here would be to attach an event handler to the client socket, which would then trigger a cancellation on the conn.start promise.
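A minimal sketch of that wiring (not the actual Proxy code; names here are illustrative), assuming establishConnectionForward returns a PromiseCancellable from @matrixai/async-cancellable:

    import type { Socket } from 'net';
    import type { PromiseCancellable } from '@matrixai/async-cancellable';

    // If the client socket closes before composition, cancel the in-flight
    // forward connection; once composed (or failed), the handler is removed.
    async function composeWhenEstablished<C extends { compose: (s: Socket) => void }>(
      clientSocket: Socket,
      connProm: PromiseCancellable<C>,
    ): Promise<void> {
      const onClientClose = () => {
        connProm.cancel(new Error('client socket closed before composition'));
      };
      clientSocket.once('close', onClientClose);
      try {
        const conn = await connProm;
        conn.compose(clientSocket);
      } finally {
        // Only relevant between start and compose; remove it afterwards
        clientSocket.removeListener('close', onClientClose);
      }
    }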

@CMCDragonkai (Member Author)

> I've implemented the 2nd option here: when the client socket closes we cancel the connectForward promise. That should be enough to fix task 14 for now.

This sounds similar to what I proposed above, except I was saying that conn.start itself could be PromiseCancellable?

@tegefaulkes (Contributor)

It's pretty much what you suggested. connectForward propagates the ctx down to conn.start ultimately.

@CMCDragonkai (Member Author)

Ok, well once the connection is composed, that event handler should be removed. There's no need to keep around an event handler that was only used between start and compose.

@tegefaulkes (Contributor)

It gets un-registered in the finally block.

@tegefaulkes (Contributor)

Task 12 can be addressed by a recurring ephemeral task that checks whether each of the seed nodes has a connection and re-attempts any that are down. If no seed nodes have an active connection, we can re-attempt a network entry.

Now that I think about it, if we fail to connect to any seed nodes then we want to end the syncNodeGraph early. There's no point in starting refresh buckets when no connections can be made.

There are 4 steps to this (see the sketch after the list):

  1. Use a task to periodically check the state of the seed node connections.
  2. If any seed node connections are down, we attempt to re-establish them.
  3. If all seed nodes are down, we re-attempt the syncNodeGraph.
  4. If no seed node connections are established during syncNodeGraph, then we end it early.
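A sketch of what that recurring check might look like (all the names here are illustrative, not the actual NodeConnectionManager/tasks API):

    // Re-attempt individual downed seed connections; if every seed connection is
    // down, re-attempt network entry entirely.
    async function checkSeedConnections(
      seedNodeIds: Array<string>,
      hasConnection: (nodeId: string) => boolean,
      connectToSeed: (nodeId: string) => Promise<void>,
      syncNodeGraph: () => Promise<void>,
    ): Promise<void> {
      const down = seedNodeIds.filter((nodeId) => !hasConnection(nodeId));
      if (down.length === seedNodeIds.length) {
        // No seed connections at all: redo network entry
        await syncNodeGraph();
        return;
      }
      // Otherwise just re-establish the ones that dropped
      await Promise.allSettled(down.map((nodeId) => connectToSeed(nodeId)));
    }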

@CMCDragonkai (Member Author)

I think task 12 can just be done by monitoring the lifecycle of the seed connections. You already have a callback for whenever a connection fails, right? Why not expand that so that when a seed connection fails, you just immediately retry?

This would require the syncNodeGraph to be using the same "data flow" as the rest of the NCM. But that should be done now anyway.

@tegefaulkes (Contributor)

Right now the callback isn't called if the connection failed to establish in the first place, and we don't want to retry immediately since it's very unlikely to connect right after it failed. Doing it that way would also spread out the logic within the NCM.

Keeping it all in a task keeps everything in one place and makes it clear what is going on. I'm leaning towards that unless we need the functionality of reacting the second the connection goes down, or individual restart delays.

@CMCDragonkai (Member Author)

Ok do it the simplest way for now, as long as it works. But I'll mark this to be redesigned when NCM is refactored so that even during the syncNodeGraph, all connection management (restarts... etc) has a single control flow.

@CMCDragonkai (Member Author)

Going to copy this to the docs issue; #491 (comment)

@tegefaulkes (Contributor)

I just ran a test on the seed nodes; everything seems to be working, including retrying network entry and seed connections.

Most Jest tests are passing; there are a small number of tests that need fixing due to changes from this PR.

I'm going to resolve the review comments now.

Changed `Signalling` to `signaling`.
@tegefaulkes (Contributor)

That's all the review comments addressed.

Last thing to do is to fix any remaining tests that are failing.

@tegefaulkes (Contributor)

Ok, the only tests failing now are

  • nat, expected
  • testnetConnection, expected
  • ping.test.ts, not expected but I can leave it till after the merge.

Let me do some final merge prep and then this is done.

@CMCDragonkai (Member Author)

Yeah, on staging we should expect that connecting to the testnet should work, but it will require redeployment.

@tegefaulkes merged commit 1890801 into staging on Nov 8, 2022
@CMCDragonkai added the labels "r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy" and "r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices" on Jul 12, 2023