
Multi-Host DNS and Multi NodeID resolution - for network entry and general usage #491

Merged: 17 commits into staging on Nov 8, 2022

Conversation

@CMCDragonkai (Member) commented Oct 31, 2022

Description

This brings in resolveHostname, which resolves a hostname to multiple IPs (both IPv4 and IPv6) and does a BFS over CNAMEs.

It uses the dns.resolve* functions instead of dns.lookup. This means it doesn't go through the local OS's getaddrinfo-based name lookup, so it cannot be affected by /etc/hosts. I don't think /etc/hosts is useful here, and performance is also better this way. See: https://nodejs.org/api/dns.html#implementation-considerations

However, it still uses the OS-provided initial DNS server configuration, which means local DNS servers are still used and local resolvers can still cache the records.

In the future, we can add a flag to switch to using dns.lookup for edge cases.
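As a rough illustration of that approach (this is not the actual src/network/utils implementation, just a sketch using Node's dns.promises API):

    import { promises as dnsPromises } from 'dns';

    type Hostname = string;
    type Host = string;

    // Sketch only: resolve A (IPv4) and AAAA (IPv6) records for a hostname and
    // follow CNAME chains breadth-first, collecting every address found.
    async function resolveHostnameSketch(hostname: Hostname): Promise<Array<Host>> {
      const hosts: Array<Host> = [];
      const visited = new Set<Hostname>();
      const queue: Array<Hostname> = [hostname];
      while (queue.length > 0) {
        const target = queue.shift()!;
        if (visited.has(target)) continue;
        visited.add(target);
        // Missing record types just mean no addresses of that family
        const ipv4 = await dnsPromises.resolve4(target).catch(() => []);
        const ipv6 = await dnsPromises.resolve6(target).catch(() => []);
        hosts.push(...ipv4, ...ipv6);
        // Enqueue CNAME targets for the breadth-first traversal
        const cnames = await dnsPromises.resolveCname(target).catch(() => []);
        queue.push(...cnames);
      }
      return hosts;
    }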

Issues Fixed

Tasks

todo:

  1. Improve logging for the life cycle of the NCM and the creation/destruction flow of the node connections.
  2. Defence in depth for using our own nodeId in the node graph. We need to filter out our own nodeId from information when we receive it, resolve it or attempt to use it.
  3. Relaying nodes should reject requests from nodes to themselves. Should check both the nodeId and the address.
  4. Make the node connection creation for the multi-node resolving serial instead of concurrent. Need to double check the destroy callback logic here.
  5. ICE needs to be extracted from the NC and done alongside its creation instead.
  6. Special case seed nodes to avoid TTL clean up on connections.
  7. Any of the odd methods used within syncNodeGraph should be in-lined to avoid noise. Also add logging outlining the process of syncing the node graph. Anything I could in-line is used in multiple places.
  8. Filter out any IPv6 addresses when resolving hostnames.
  9. Add key path to logger messages.
  10. Add detailed logging to syncNodeGraph.
  11. Separate timeouts for each sub connection.
  12. All nodes need to re-establish connections to seed nodes when they disconnect.
  13. Serial connection establishment needs to be updated to work concurrently.
  14. Proxy needs to cancel forward connection establishment if the underlying GRPC client connection is destroyed.
  15. Special case GC of seed nodes in the NodeGraph.

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build


@CMCDragonkai (Member Author)

The resolveHostname is ready to be used.

The references are:

nodes/NodeConnectionManager.ts
297:        const targetHost = await networkUtils.resolveHost(targetAddress.host);
803:    host = await networkUtils.resolveHost(host);

nodes/NodeManager.ts
110:    const host_ = await networkUtils.resolveHost(host);
240:    const targetHost = await networkUtils.resolveHost(targetAddress.host);

Along with the tests:

network/utils.test.ts
28:      networkUtils.resolveHost('www.google.com' as Host),
30:    const host = await networkUtils.resolveHost('www.google.com' as Host);
33:      networkUtils.resolveHost('invalidHostname' as Host),

The above usages need to be changed, but it requires investigating #483.

So I suggest that @tegefaulkes take over this PR, commit a fix for both #483 and #484 here, and merge to staging for integration tests.

@CMCDragonkai changed the title from "feat: multi-host DNS resolution" to "Multi-Host DNS and Multi NodeID resolution - for network entry and general usage" on Oct 31, 2022
@CMCDragonkai (Member Author) commented Oct 31, 2022

Here's an important realisation.

import type { Hostname } from './src/network/types';
import { Timer } from '@matrixai/timer';
import * as networkUtils from './src/network/utils';

async function main(){
  const timer = new Timer({ delay: 10000 });
  const hosts = await networkUtils.resolveHostname(
    'localhost' as Hostname,
    {
      timer,
    }
  );
  timer.cancel();
  console.log('HOSTS', hosts);

}

main();

Since the timer is passed from the outside, the HOF will not cancel the timer. So it's important to explicitly cancel the timer.

This is why I wanted to provide some convenience: if there is a default delay it would work automatically, but also if I just pass a number, then the expectation is that the timer is created and cancelled automatically.

Like:

await networkUtils.resolveHostname(..., { timer: 1000 });
// And in this case, rather than inheriting `Timer`, it inherits just a parameter
// and thus it is equivalent to "overriding" the default and passing `timer: undefined`

But that can be done later.
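For illustration, a hypothetical helper along those lines (withTimer and the 10 second default are made up for this sketch, not an existing utility):

    import { Timer } from '@matrixai/timer';

    // Accept either a Timer (caller owns it and must cancel it) or a plain
    // number of milliseconds (we create the Timer and cancel it ourselves).
    async function withTimer<T>(
      timer: Timer | number | undefined,
      f: (timer: Timer) => Promise<T>,
      defaultDelay: number = 10000,
    ): Promise<T> {
      if (timer instanceof Timer) {
        // Caller-provided timer: the caller is responsible for cancelling it
        return await f(timer);
      }
      const ownTimer = new Timer({ delay: timer ?? defaultDelay });
      try {
        return await f(ownTimer);
      } finally {
        // We created this timer, so we cancel it
        ownTimer.cancel();
      }
    }

A call like resolveHostname(..., { timer: 1000 }) could then route through something like this internally.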

@CMCDragonkai force-pushed the feature-dns-multi branch 2 times, most recently from e8fe24e to 8dbad30 on October 31, 2022 12:58
@CMCDragonkai (Member Author)

Note that using dns.resolve may have issues with mDNS later in MatrixAI/js-mdns#1. Also, this may result in some changes to the docker configuration, because I don't know how the docker container will receive DNS servers.

We can always change back to dns.lookup. If we do so, we won't really be doing anything with CNAMEs at all, as that will all be up to the OS to manage; instead we will just get back an array of IPv4 and IPv6 addresses.
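For reference, the lookup-based variant would be roughly this (illustrative only; lookupHostname is not an existing utility):

    import { promises as dnsPromises } from 'dns';

    // With dns.lookup the OS resolver handles CNAMEs and /etc/hosts for us; we
    // just get back a flat list of IPv4/IPv6 addresses.
    async function lookupHostname(hostname: string): Promise<Array<string>> {
      const results = await dnsPromises.lookup(hostname, { all: true });
      return results.map((result) => result.address);
    }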

@CMCDragonkai (Member Author)

Do the simplest thing to maintain node connections to the seed nodes at all times. If there's a connection loss to the seed nodes, try to reconnect.

Seed node connections must not be subject to node connection TTL.

Seed nodes must not be removed from the node graph.

Last manual test is the priority here.

@tegefaulkes (Contributor)

Most of the changes have been applied now and it is mostly working. There are a lot more regressions than I was expecting outside of the nodes domain.

One of the main issues is the case where we have a hostname that resolves to multiple host addresses: how do we handle the connection errors? If we connect to 2 hosts and 1 fails to connect while the other fails verification, what did it really fail as? Since we're trying to connect to a single node in this case, it could just be the verification error. When trying to find multiple nodes at once, however, the failure becomes more ambiguous.

There are about 20 errors still left over. I'm not sure if they will affect manual testing or not. I'm still looking into them.

@CMCDragonkai (Member Author)

If you are connecting to NodeIdA, and you end up trying 10 different connections, then as long as 1 connection succeeds the connection to NodeIdA is successful; the other failures should be ignored.

@CMCDragonkai (Member Author)

If none of the connections succeed, then you don't need an aggregate error, you can just say that you couldn't connect to NodeIdA.

@tegefaulkes (Contributor)

Some tests are expecting a specific reason for failing to connect. I guess I could make them more generic.

@CMCDragonkai (Member Author)

You can still provide specific reasons. You can provide a reason for why none of the connections succeeded, just like AggregateError, although we wouldn't use that directly; we'd just use a specialised exception inside the network or nodes domain to represent this. You can also use the cause.

Those tests might be too specific in that case; they should be made more generic, or completely deleted.
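Something along these lines, for example (ErrorNodeConnectionFailed is a made-up name here, just to illustrate the shape):

    // Try each candidate connection serially; ignore individual failures if any
    // attempt succeeds, otherwise throw one domain error for the node with the
    // details attached as `cause`.
    class ErrorNodeConnectionFailed extends Error {}

    async function connectToNode<T>(
      nodeId: string,
      attempts: Array<() => Promise<T>>,
    ): Promise<T> {
      const errors: Array<unknown> = [];
      for (const attempt of attempts) {
        try {
          return await attempt(); // first success wins
        } catch (e) {
          errors.push(e);
        }
      }
      throw new ErrorNodeConnectionFailed(`Could not connect to ${nodeId}`, {
        cause: new AggregateError(errors),
      });
    }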

@tegefaulkes (Contributor)

Ok, so after that refactoring it's mostly working again. I resolved most of the test failures. Whatever failures are left are known problems and things that won't interfere with manual testing.

@CMCDragonkai (Member Author)

We should push up from this branch in case manual testing reveals bugs to be fixed here.

@CMCDragonkai (Member Author)

Do any of the tasks above need to be ticked off?

@tegefaulkes (Contributor)

I've just run a test where a node connects to the seed node to see how the changes work. I can see that we're attempting 2 connections, one for each seed node on the testnet.polykey.io domain. 1 connection is getting established, however it's disconnecting far too soon afterwards. I need to look into this.

I'll add logs of the attempt in a moment.

@CMCDragonkai (Member Author)

Don't we need to first test if both seed nodes are connecting to each other?

@tegefaulkes (Contributor)

For that I'll need to apply some small changes to the infrastructure and config.

@tegefaulkes (Contributor) commented Nov 8, 2022

We want to handle the case where we are creating a proxy connection and the client connection drops before composing. Right now I think we only clean up proxy connections when the proxy connection fails or the composed client connection is closed. I need to double check this.

Right now I'm thinking we can prevent a resource leak here by applying a TTL to un-composed connections. This should be pretty simple to implement. Composed connections can be handled by the NodeConnectionManager's TTL logic, since shutting down the client connection will clean up the proxy connection.

There is a concurrency concern with the TTL timing out while composing. Composing sets up some events and the data pipes. Stopping cleans up a bunch of stuff. I'm sure we can end up with some undefined behaviour if we do both of them concurrently. To fix this, the compose and stop methods need to share a lock. If we can share the lifecycle lock here, that would work.

Do we need to apply this to the ReverseConnection as well? I'm leaning towards no. Its life-cycle is mostly driven by the complementary ForwardConnection.

Since connectForward is a PromiseCancellable we can just cancel it if the GRPC connection ends. This can be a first order fix for the problem. We will still need the TTL as a final stop for handling the connection leaks.
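Roughly, the un-composed TTL and the shared compose/stop lock could look like this (all names and the 20 second default are illustrative, not the actual ConnectionForward API):

    // A TTL that stops a connection that is never composed, plus a tiny lock so
    // compose and stop cannot interleave.
    class ComposeTtl {
      protected lock: Promise<void> = Promise.resolve();
      protected ttlTimeout: NodeJS.Timeout;

      constructor(
        protected conn: { compose: () => Promise<void>; stop: () => Promise<void> },
        ttl: number = 20000,
      ) {
        // If compose never happens within the TTL, tear the connection down
        this.ttlTimeout = setTimeout(() => void this.stop(), ttl);
      }

      public async compose(): Promise<void> {
        await this.withLock(async () => {
          clearTimeout(this.ttlTimeout);
          await this.conn.compose();
        });
      }

      public async stop(): Promise<void> {
        await this.withLock(async () => {
          clearTimeout(this.ttlTimeout);
          await this.conn.stop();
        });
      }

      // Minimal mutex: chain operations so compose and stop never run concurrently
      protected async withLock<T>(f: () => Promise<T>): Promise<T> {
        const result = this.lock.then(f);
        this.lock = result.then(
          () => undefined,
          () => undefined,
        );
        return await result;
      }
    }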

@tegefaulkes (Contributor)

I've implemented the 2nd option here: when the client socket closes we cancel the connectForward promise. That should be enough to fix task 14 for now.

@CMCDragonkai (Member Author) commented Nov 8, 2022

I think we need to keep this as simple as possible, because we are planning to replace this system with a more robust networking solution later.

So the problem right now is that starting a forward connection is triggered before compose is called. This is because in Proxy.establishConnectionForward, it's calling:

    await conn.start({ timer });

While this is occurring, it's possible that clientSocket has already terminated before composition.

    // HERE, it is already started the connection
    const conn = await this.establishConnectionForward(
      nodeId,
      proxyHost,
      proxyPort,
      timer,
    );
    // What if the `clientSocket` is **already** terminated?
    conn.compose(clientSocket);

Therefore the simplest solution here would be to attach an event handler to the client socket, which would then trigger a cancellation on the conn.start promise.
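A minimal sketch of that wiring (not the actual Proxy code; names here are illustrative), assuming establishConnectionForward returns a PromiseCancellable from @matrixai/async-cancellable:

    import type { Socket } from 'net';
    import type { PromiseCancellable } from '@matrixai/async-cancellable';

    // If the client socket closes before composition, cancel the in-flight
    // forward connection; once composed (or failed), the handler is removed.
    async function composeWhenEstablished<C extends { compose: (s: Socket) => void }>(
      clientSocket: Socket,
      connProm: PromiseCancellable<C>,
    ): Promise<void> {
      const onClientClose = () => {
        connProm.cancel(new Error('client socket closed before composition'));
      };
      clientSocket.once('close', onClientClose);
      try {
        const conn = await connProm;
        conn.compose(clientSocket);
      } finally {
        // Only relevant between start and compose; remove it afterwards
        clientSocket.removeListener('close', onClientClose);
      }
    }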

@CMCDragonkai (Member Author)

> I've implemented the 2nd option here: when the client socket closes we cancel the connectForward promise. That should be enough to fix task 14 for now.

This sounds similar to what I proposed above, except I was saying that conn.start itself could be PromiseCancellable?

@tegefaulkes (Contributor)

It's pretty much what you suggested. connectForward propagates the ctx down to conn.start ultimately.

@CMCDragonkai (Member Author)

Ok, well once the connection is composed, that event handler should be removed. There's no need to keep around an event handler that was only used between start and compose.

@tegefaulkes (Contributor)

It gets un-registered in the finally block.

@tegefaulkes (Contributor)

Task 12 can be addressed by a recurring ephemeral task that checks whether each of the seed nodes has a connection and re-attempts any that are down. If no seed nodes have an active connection, we can re-attempt a network entry.

Now that I think about it, if we fail to connect to any seed nodes then we want to end the syncNodeGraph early. There's no point in starting refresh buckets when no connections can be made.

There are 4 steps to this (see the sketch after the list):

  1. Use a task to periodically check the state of the seed node connections.
  2. If any seed node connections are down, we attempt to re-establish them.
  3. If all seed nodes are down, we re-attempt the syncNodeGraph.
  4. If no seed node connections are established during syncNodeGraph, then we end it early.
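A sketch of what that recurring check might look like (all the names here are illustrative, not the actual NodeConnectionManager/tasks API):

    // Re-attempt individual downed seed connections; if every seed connection is
    // down, re-attempt network entry entirely.
    async function checkSeedConnections(
      seedNodeIds: Array<string>,
      hasConnection: (nodeId: string) => boolean,
      connectToSeed: (nodeId: string) => Promise<void>,
      syncNodeGraph: () => Promise<void>,
    ): Promise<void> {
      const down = seedNodeIds.filter((nodeId) => !hasConnection(nodeId));
      if (down.length === seedNodeIds.length) {
        // No seed connections at all: redo network entry
        await syncNodeGraph();
        return;
      }
      // Otherwise just re-establish the ones that dropped
      await Promise.allSettled(down.map((nodeId) => connectToSeed(nodeId)));
    }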

@CMCDragonkai (Member Author)

I think task 12 can just be done by monitoring the lifecycle of the seed connections. You already have a callback for whenever a connection fails, right? Why not expand that so that when a seed connection fails, you just immediately retry?

This would require the syncNodeGraph to be using the same "data flow" as the rest of the NCM. But that should be done now anyway.

@tegefaulkes (Contributor)

Right now the callback isn't called if the connection failed to establish in the first place, and we don't want to retry immediately since it's very unlikely to connect right after it failed. Doing it that way would also spread out the logic within the NCM.

Keeping it all in a task keeps everything in one place and makes it clear what is going on. I'm leaning towards that unless we need the functionality of reacting the second the connection goes down, or individual restart delays.

@CMCDragonkai (Member Author)

Ok do it the simplest way for now, as long as it works. But I'll mark this to be redesigned when NCM is refactored so that even during the syncNodeGraph, all connection management (restarts... etc) has a single control flow.

@CMCDragonkai (Member Author)

Going to copy this to the docs issue; #491 (comment)

@tegefaulkes (Contributor)

I just ran a test on the seed nodes; everything seems to be working, including retrying network entry and seed connections.

Most Jest tests are passing; there are a small number of tests that need fixing due to changes from this PR.

I'm going to resolve the review comments now.

Changed `Signalling` to `signaling`.
@tegefaulkes (Contributor)

That's all the review comments addressed.

Last thing to do is to fix any remaining tests that are failing.

@tegefaulkes (Contributor)

Ok, the only tests failing now are

  • nat, expected
  • testnetConnection, expected
  • ping.test.ts, not expected but I can leave it till after the merge.

Let me do some final merge prep and then this is done.

@CMCDragonkai (Member Author)

Yeah, on staging we should expect that connecting to the testnet should work, but it will require redeployment.

@tegefaulkes merged commit 1890801 into staging on Nov 8, 2022
@CMCDragonkai added the labels "r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy" and "r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices" on Jul 12, 2023