Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduction to hybrid net #113

Merged
merged 28 commits into from
Nov 3, 2024
Merged

Introduction to hybrid net #113

merged 28 commits into from
Nov 3, 2024

Conversation

nieznanysprawiciel
Copy link
Contributor

@nieznanysprawiciel nieznanysprawiciel commented Oct 16, 2024

Important

Chapter Broadcasting and neighborhood contains copy of fragments from description of market broadcasting. It should be removed in original place later, when all changes are merged and link directing to this chapter should be left.


This change is Reviewable

@nieznanysprawiciel nieznanysprawiciel marked this pull request as ready for review October 18, 2024 12:10
Copy link
Collaborator

@dopiera dopiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 20 unresolved discussions (waiting on @nieznanysprawiciel)


arch-snapshot/arch.md line 566 at r1 (raw file):

Hybrid Net was developed as an intermediate step towards decentralization, enabling peer-to-peer (P2P) communication 
between Golem Nodes. However, since most of the network operates behind NATs, P2P cannot be the sole communication

between some of the Golem Nodes

Code quote:

between Golem Nodes

arch-snapshot/arch.md line 567 at r1 (raw file):

Hybrid Net was developed as an intermediate step towards decentralization, enabling peer-to-peer (P2P) communication 
between Golem Nodes. However, since most of the network operates behind NATs, P2P cannot be the sole communication
method. To address this, the Net implementation supports communication forwarding through specialized Node known

Is this really a Node? As in Yagna Node?

Code quote:

Node

arch-snapshot/arch.md line 568 at r1 (raw file):

between Golem Nodes. However, since most of the network operates behind NATs, P2P cannot be the sole communication
method. To address this, the Net implementation supports communication forwarding through specialized Node known
as relay.  

stray whitespace

Code quote:

··

arch-snapshot/arch.md line 579 at r1 (raw file):

register with the Relay server, providing the necessary information for discovery and connection. The Relay server:
- Maintains a list of Nodes present in the network.
- Stores each Node's identity, public key, and IP address.

Could you elaborate more on the public key? Is it a part of the identity? Or is it ephemeral for network communication only?

Code quote:

public key

arch-snapshot/arch.md line 584 at r1 (raw file):

- Checks if connecting Nodes have public IP port exposed
- Offers functions for:
  - Discovering Nodes and querying their information.

What is there to discover? All nodes are connected to the relay, no?

Code quote:

Discovering 

arch-snapshot/arch.md line 589 at r1 (raw file):

Communication with the Relay server is handled through a custom protocol built on top of UDP, defined using Protocol 
Buffers (protobuf). UDP was chosen for its lightweight nature, as it does not require maintaining open connections, 
which would consume more system resources compared to TCP. This makes it possible to handle a large number of Nodes 

I love dumb questions, so I can't resist this one: isn't this a premature optimization?

For every Node on the network you need to keep its identity, public keys, IP address and probably some overheads. Is this really that much smaller than the kernel memory footprint for a TCP connection? Besides, even if we assume that an idle TCP connection eats up 1K of kernel memory, you can have 1e6 machines connected per every GB of memory. I'd dare to say that 1M machines is way more than a single relay can handle anyway.

Code quote:

Buffers (protobuf). UDP was chosen for its lightweight nature, as it does not require maintaining open connections,
which would consume more system resources compared to TCP. This makes it possible to handle a large number of Nodes

arch-snapshot/arch.md line 590 at r1 (raw file):

Buffers (protobuf). UDP was chosen for its lightweight nature, as it does not require maintaining open connections, 
which would consume more system resources compared to TCP. This makes it possible to handle a large number of Nodes 
concurrently, ensuring decent scalability.

Does this apply to relaying communication between nodes behind NATs as well? Is yes, then how?

Code quote:

Communication with the Relay server is handled through a custom protocol built on top of UDP, defined using Protocol
Buffers (protobuf). UDP was chosen for its lightweight nature, as it does not require maintaining open connections,
which would consume more system resources compared to TCP. This makes it possible to handle a large number of Nodes
concurrently, ensuring decent scalability.

arch-snapshot/arch.md line 608 at r1 (raw file):

which must compute a hash with a specified number of leading zeros. The difficulty level, set by the Relay server, 
determines the number of leading zeros required. This computationally expensive task protects the Relay server from 
DDoS attacks by forcing the Node to complete a certain amount of work before establishing a Session.  

Stray whitespace

Code quote:

··

arch-snapshot/arch.md line 610 at r1 (raw file):

DDoS attacks by forcing the Node to complete a certain amount of work before establishing a Session.  

The challenge is cryptographically signed using the private key of each identity associated with the Node. The Relay 

response, right?

Code quote:

 challenge

arch-snapshot/arch.md line 611 at r1 (raw file):

The challenge is cryptographically signed using the private key of each identity associated with the Node. The Relay 
server can then recover the Node's identities and public keys from these signatures, verifying that the Node 

extract?

Code quote:

recover

arch-snapshot/arch.md line 612 at r1 (raw file):

The challenge is cryptographically signed using the private key of each identity associated with the Node. The Relay 
server can then recover the Node's identities and public keys from these signatures, verifying that the Node 
possesses the corresponding private keys. The recovered public keys are later made available to other Nodes that 

extracted?

Code quote:

recovered

arch-snapshot/arch.md line 619 at r1 (raw file):

When a Node initiates a Session, an additional mechanism is required to determine whether the IP address from which 
the packets are received is behind a NAT. This is achieved by sending Ping packets (network protocol ping, not to be 
confused with the Linux command) from a different UDP port than the one used for receiving incoming Sessions. This 

ICMP packet

Code quote:

Linux command

arch-snapshot/arch.md line 622 at r1 (raw file):

ensures that any network devices between the Node and the Relay server won't recognize these Ping packets as part of 
the same communication stream. If the ports are not publicly exposed, the Ping packets will be dropped, confirming 
that the Node is behind a NAT.

Is there a designated port for that? You could get unlucky and accidentally reach a port of a different Yagna behind the same NAT. Imagine a server room with 100 servers, where each server has a running Yagna. If all of them are behind the same NAT, the likelihood of this event happening is roughly 1-(1-100/65536)**100 which is 15%.

Code quote:

the same communication stream. If the ports are not publicly exposed, the Ping packets will be dropped, confirming
that the Node is behind a NAT.

arch-snapshot/arch.md line 636 at r1 (raw file):

    RelayServer->>RelayServer: Verify solution
    RelayServer->>RelayServer: Recover identities from signatures
    RelayServer->>GolemNode: Session established response

I must be missing something. What prevents me from grabbing your NodeId and presenting myself to the relay with a public key for my own private key rather than yours?

Code quote:

    GolemNode->>RelayServer: Session request
    RelayServer->>GolemNode: Challenge
    GolemNode->>GolemNode: Solve challenge
    GolemNode->>GolemNode: Sign solution with Node's identities
    GolemNode->>RelayServer: Challenge response
    RelayServer->>RelayServer: Verify solution
    RelayServer->>RelayServer: Recover identities from signatures
    RelayServer->>GolemNode: Session established response

arch-snapshot/arch.md line 667 at r1 (raw file):

##### Establishing connections between Nodes

Currently, a peer-to-peer (p2p) session can be established in two scenarios. In the first, if the target Node has a 

I think I need to learn more about the Session. Can I create multiple Sessions between the same two Nodes? What makes those Sessions different from what you'd typically call a "connection"?

Code quote:

session

arch-snapshot/arch.md line 672 at r1 (raw file):

Node first sends a Reverse Connection message to the Relay server, which forwards it to the target Node. The target 
Node then attempts to establish a direct connection with the initiating Node. Whether the target Node has a public 
IP can be determined based on information returned by the Relay server.    

trailing whitespace

Code quote:

····

arch-snapshot/arch.md line 674 at r1 (raw file):

IP can be determined based on information returned by the Relay server.    

Since the current Net implementation does not support NAT punching, if both Nodes are behind NAT, communication must 

I think it's called either "hole punching", or "NAT traversal", or "NAT punch through" but not "NAT punching".

Code quote:

NAT punching

arch-snapshot/arch.md line 777 at r1 (raw file):

Different channels can be utilized to send messages, as explained in the [reliable, unreliable, and transfers 
channels chapter](#reliable-unreliable-and-transfers-channels). In the Hybrid Net, reliable and transfer channels 
are distinguished by using separate TCP connections. This separation ensures that independent sender buffers are 

Still over the same Session, right?

Code quote:

TCP connections

arch-snapshot/arch.md line 777 at r1 (raw file):

Different channels can be utilized to send messages, as explained in the [reliable, unreliable, and transfers 
channels chapter](#reliable-unreliable-and-transfers-channels). In the Hybrid Net, reliable and transfer channels 
are distinguished by using separate TCP connections. This separation ensures that independent sender buffers are 

Could you please rephrase this sentence so that it is 100% clear that:

  • every reliable channel gets a new connection
  • every data transfer gets a new connection
  • everything else is squeezed into one connection (realiable?)

Code quote:

channels chapter](#reliable-unreliable-and-transfers-channels). In the Hybrid Net, reliable and transfer channels
are distinguished by using separate TCP connections. This separation ensures that independent sender buffers are

arch-snapshot/arch.md line 803 at r1 (raw file):

Optimal guarantees can be achieved when two neighboring Nodes have distinctly different neighborhoods, minimizing their
number of common neighbors. Currently, in the Hybrid Net, neighborhood is determined based on the reversed Hamming
distance between Node IDs.

As mentioned in some other review - I need more context here. I do understand that neighborhood is a concept from some DHT designs. If so, please mention it and introduce the reader to the concept.

Code quote:

Hybrid Net implements local broadcasting to the nearest neighborhood of each Node. To query its neighbors, a Node
can send a `Neighborhood` request to the Relay server. The Relay server then responds with a list of Nodes that are
closest to the querying Node, based on a predefined metric.

After receiving the list of neighbors, the Node attempts to establish connections with them, as described in the
chapter on [communication](#establishing-connections-between-nodes). The neighborhood algorithm does not
differentiate between Nodes capable of establishing peer-to-peer connections and those that require relayed
communication. Unlike IP-level broadcasts, Hybrid Net uses reliable channels for message transmission.

**Neighborhood - distance function**

To prevent clustering of Nodes and accidental splits in the network, where subsets of Nodes become unreachable, a proper
neighborhood function must be utilized. This function is defined by the network module, and the market relies on the
broadcast function, leaving it with no alternative in this regard.

Optimal guarantees can be achieved when two neighboring Nodes have distinctly different neighborhoods, minimizing their
number of common neighbors. Currently, in the Hybrid Net, neighborhood is determined based on the reversed Hamming
distance between Node IDs.

Copy link
Collaborator

@dopiera dopiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 21 unresolved discussions (waiting on @nieznanysprawiciel)


arch-snapshot/arch.md line 563 at r1 (raw file):

specifically for transfers.

#### Hybrid Net

Moving a comment from #111
I'm sorry, but I genuinely do not understand any of this. In particular, I don't understand:

  • what a neighbor is
  • what a neighborhood is
  • what a broadcast function is
  • what a neighborhood function is
  • how that function affect the likelihood of a network partition

Code quote:

#### Hybrid Net

Copy link
Contributor Author

@nieznanysprawiciel nieznanysprawiciel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 21 unresolved discussions (waiting on @dopiera)


arch-snapshot/arch.md line 567 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Is this really a Node? As in Yagna Node?

I probably so much wanted relay to be real Node :(


arch-snapshot/arch.md line 568 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

stray whitespace

Done.


arch-snapshot/arch.md line 584 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

What is there to discover? All nodes are connected to the relay, no?

This is what I would call discovery :D finding out information about Node
Otherwise we don't have discovery at all
I changed to only mention query node information, to make it less controversial


arch-snapshot/arch.md line 608 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Stray whitespace

Done.


arch-snapshot/arch.md line 610 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

response, right?

Done.


arch-snapshot/arch.md line 611 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

extract?

I checked that not only in our code but in general people use term recover in case of ECDSA.


arch-snapshot/arch.md line 612 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

extracted?

People use term recover


arch-snapshot/arch.md line 619 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

ICMP packet

Done.


arch-snapshot/arch.md line 672 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

trailing whitespace

Done.


arch-snapshot/arch.md line 674 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I think it's called either "hole punching", or "NAT traversal", or "NAT punch through" but not "NAT punching".

Done.

Copy link
Contributor Author

@nieznanysprawiciel nieznanysprawiciel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 21 unresolved discussions (waiting on @dopiera)


arch-snapshot/arch.md line 622 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Is there a designated port for that? You could get unlucky and accidentally reach a port of a different Yagna behind the same NAT. Imagine a server room with 100 servers, where each server has a running Yagna. If all of them are behind the same NAT, the likelihood of this event happening is roughly 1-(1-100/65536)**100 which is 15%.

It's relay server that has separate port. Yagna node has only one UDP port for listening to external connections

Copy link
Contributor Author

@nieznanysprawiciel nieznanysprawiciel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 21 unresolved discussions (waiting on @dopiera)


arch-snapshot/arch.md line 636 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I must be missing something. What prevents me from grabbing your NodeId and presenting myself to the relay with a public key for my own private key rather than yours?

It's good point. It wasn't mentioned anywhere that NodeId is derived from public key, so you can't do that.
But I think it should be rather described in Identity chapter. I added to our task on board that it should be described

Copy link
Contributor Author

@nieznanysprawiciel nieznanysprawiciel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 21 unresolved discussions (waiting on @dopiera)


arch-snapshot/arch.md line 589 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I love dumb questions, so I can't resist this one: isn't this a premature optimization?

For every Node on the network you need to keep its identity, public keys, IP address and probably some overheads. Is this really that much smaller than the kernel memory footprint for a TCP connection? Besides, even if we assume that an idle TCP connection eats up 1K of kernel memory, you can have 1e6 machines connected per every GB of memory. I'd dare to say that 1M machines is way more than a single relay can handle anyway.

image.png
It's more than that, because there are tcp buffers. The middle values is default on my ubuntu at least.
It's actually less than I expected. Out embedded stack has much bigger buffers for optimal transfers.

But there are other reasons as well. For example we need to forward traffic on UDP anyway. So we would need separate UDP and TCP ports exposed. This would require changes to p2p sessions as well.
Maybe it could be said more in this chapter..

Copy link
Collaborator

@dopiera dopiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 9 unresolved discussions (waiting on @nieznanysprawiciel)


arch-snapshot/arch.md line 568 at r1 (raw file):

Previously, nieznanysprawiciel wrote…

Done.

I still see it


arch-snapshot/arch.md line 589 at r1 (raw file):

Previously, nieznanysprawiciel wrote…

image.png
It's more than that, because there are tcp buffers. The middle values is default on my ubuntu at least.
It's actually less than I expected. Out embedded stack has much bigger buffers for optimal transfers.

But there are other reasons as well. For example we need to forward traffic on UDP anyway. So we would need separate UDP and TCP ports exposed. This would require changes to p2p sessions as well.
Maybe it could be said more in this chapter..

Makes sense - especially the fact that we need UDP anyway.


arch-snapshot/arch.md line 611 at r1 (raw file):

Previously, nieznanysprawiciel wrote…

I checked that not only in our code but in general people use term recover in case of ECDSA.

ACK


arch-snapshot/arch.md line 612 at r1 (raw file):

Previously, nieznanysprawiciel wrote…

People use term recover

ACK

Copy link
Collaborator

@dopiera dopiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 10 unresolved discussions (waiting on @nieznanysprawiciel)


arch-snapshot/arch.md line 802 at r3 (raw file):

significant application is in [the Offer Propagation algorithm](#offer-propagation). While the implementation of 
specific algorithms is handled by other modules, the network module provides the necessary operations as building 
blocks for these processes.

This is great, thank you for adding it.

I've got a dumb follow-up question, though. Before I ask the question, please verify my assumptions, which are:

  • we're propagating only offer IDs and the offer themselves are then fetched by every provider directly from all providers
  • eventually, every offer is known to every Node on the network

If this is correct and given the fact that the network is already centralized, why bother with the neighborhoods? Couldn't we do this instead?

  • every Node publishes all its Offer IDs to the relay (or even the whole Offers)
  • the relay aggregates all offers
  • the relay periodically updates all Nodes with all the Offer ids (or even whole Offers)

I presume there is a logical answer - I'd be grateful if you added it to the document.

Code quote:

**What is a Neighborhood?**

Before diving into broadcasting, it's essential to understand the concept of a neighborhood. In a network, a
neighborhood is a subset of Nodes that are considered closest to a given Node based on an abstract metric. Each Node
has its own neighborhood. This metric doesn't necessarily reflect the real-world proximity of Nodes. For instance,
two Nodes on opposite sides of the globe could be neighbors, while two Nodes within the same physical network may be
too distant in terms of this metric to be in the same neighborhood.

In peer-to-peer networks, the concept of Node distance is often used for efficient Node discovery. However, since
the current Golem network layer relies on the Relay server for Node discovery, neighborhoods don't serve that
purpose. Although it’s conceivable that a fallback, such as a Kademlia-like implementation, could be used in the
event of a Relay server downtime, for now, the primary purpose of the neighborhood concept is broadcasting.

Broadcasting is a mechanism used by various algorithms to propagate information throughout the network. Its most
significant application is in [the Offer Propagation algorithm](#offer-propagation). While the implementation of
specific algorithms is handled by other modules, the network module provides the necessary operations as building
blocks for these processes.

Copy link
Contributor Author

@nieznanysprawiciel nieznanysprawiciel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 10 unresolved discussions (waiting on @dopiera)


arch-snapshot/arch.md line 563 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Moving a comment from #111
I'm sorry, but I genuinely do not understand any of this. In particular, I don't understand:

  • what a neighbor is
  • what a neighborhood is
  • what a broadcast function is
  • what a neighborhood function is
  • how that function affect the likelihood of a network partition

Does new chapter explaining neighborhood resolves this comment? Or is there something left to explain?


arch-snapshot/arch.md line 566 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

between some of the Golem Nodes

Done.


arch-snapshot/arch.md line 568 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I still see it

Hmm, done again


arch-snapshot/arch.md line 803 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

As mentioned in some other review - I need more context here. I do understand that neighborhood is a concept from some DHT designs. If so, please mention it and introduce the reader to the concept.

Does new chapter explaining neighborhood resolves this comment?


arch-snapshot/arch.md line 802 at r3 (raw file):

we're propagating only offer IDs and the offer themselves are then fetched by every provider directly from all providers

Full Offers are fetched from broadcast senders, not original authors

Copy link
Collaborator

@dopiera dopiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 9 unresolved discussions (waiting on @nieznanysprawiciel)


arch-snapshot/arch.md line 563 at r1 (raw file):

Previously, nieznanysprawiciel wrote…

Does new chapter explaining neighborhood resolves this comment? Or is there something left to explain?

It does, thank you.


arch-snapshot/arch.md line 802 at r3 (raw file):
ACK.

Regarding the essence of this comment and following up on our F2F discussion, how about adding this somewhere:

Although HybridNet remains centralized for node discovery and relaying communication between Nodes behind NATs, broadcasting has already been designed in a decentralized manner as a step toward full network decentralization.


arch-snapshot/arch.md line 784 at r4 (raw file):
Let's start this section by stating the problem which this part of the Network abstraction is supposed to solve.

How about something along these lines:

Every Requestor needs to have a complete view of the market. Each Requestor contacting every Provider on the network to learn of their Offers would be a drastic scalability limitation. Hence we need smarter ways of disseminating information. These ways are abstracted away by the Network module in the form of the "Broadcast" method. HybridNet's implementation is inspired by XYZ algorithm, which introduces the concept of neghborhoods.

Code quote:

##### Broadcasting and neighborhood

arch-snapshot/arch.md line 799 at r4 (raw file):

event of a Relay server downtime, for now, the primary purpose of the neighborhood concept is broadcasting.

Broadcasting is a mechanism used by various algorithms to propagate information throughout the network. Its most 

As per my comment below - I don't think that's true in the generally accepted sense. "Broadcasting" is not really a mechanism - it's a name for the act of sending something to everybody.

Code quote:

Broadcasting is a mechanism used by various algorithms to propagate information throughout the network.

arch-snapshot/arch.md line 804 at r4 (raw file):

blocks for these processes.

**Broadcasting**

I think I'm finally starting to understand where my confusion stems from. I think these are (correct me if I'm wrong) a rather unorthodox use of the word "broadcast" and a leaky abstraction of the HybridNet. It is my understanding that what you mean by "broadcast" is a way to send a similar message to all neighbors.

I consider the naming unorthodox is because usually broadcast means sending something to all participants of a network, not a subset (that's multicast).

I think that HybridNet's abstraction is leaky because the Markeplace needs to know of the network topology (i.e. what you call broadcast requires specifying some subset of nodes if I understood you correctly).

BTW, the fact that "propagation" has two meanings depending on the context doesn't help either:

  • the algorithm which Nodes obey (i.e. talking to neighbors and aggregating offers) to eventually disperse information from every Node to every Node
  • sending Offers from one Node to another

Here is a suggestion on how we can give words meanings to make it easier for readers:

  • broadcast - a process of sending a bit of information to all the nodes on the network; in a naive implementation, it may be every node speaking to every node or there might be a central aggregator; in a P2P implementation it the information might be routed through neighbors
  • propagation - the process built on top of Net, which makes sure that every Node sees every Offer
  • transmit - the the act of sending offers or IDs to particular Nodes
  • transmit to neighbors - the the act of sending offers or IDs to neighboring Nodes

Code quote:

**Broadcasting**

Copy link
Contributor Author

@nieznanysprawiciel nieznanysprawiciel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 9 unresolved discussions (waiting on @dopiera)


arch-snapshot/arch.md line 590 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Does this apply to relaying communication between nodes behind NATs as well? Is yes, then how?

There is Forwarding Network Traffic chapter describing this. Is there something lacking you would like there mentioned?


arch-snapshot/arch.md line 667 at r1 (raw file):
I define Session in chapter Connecting to Relay server. Later in chapter Peer-to-peer Session handshake differences between Session with Relay vs. p2p Session are described.

I changed the descriptions in all places to consequently use word Session instead of connection. In earlier chapters when Session was not defined yet, I replaced connection with more general word communication.

What makes those Sessions different from what you'd typically call a "connection"?

I think we used this word rather interchangeably. Session has more information in comparison to TCP connection for example. We have collection of all identities belonging on Node and some other information.
I checked definitions, it might be another non-standard usage of the word, but I would prefer to keep it, since we already use it this way.


arch-snapshot/arch.md line 802 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

ACK.

Regarding the essence of this comment and following up on our F2F discussion, how about adding this somewhere:

Although HybridNet remains centralized for node discovery and relaying communication between Nodes behind NATs, broadcasting has already been designed in a decentralized manner as a step toward full network decentralization.

Good idea. I'm adding this in the introduction to Hybrid Net chapter


arch-snapshot/arch.md line 784 at r4 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Let's start this section by stating the problem which this part of the Network abstraction is supposed to solve.

How about something along these lines:

Every Requestor needs to have a complete view of the market. Each Requestor contacting every Provider on the network to learn of their Offers would be a drastic scalability limitation. Hence we need smarter ways of disseminating information. These ways are abstracted away by the Network module in the form of the "Broadcast" method. HybridNet's implementation is inspired by XYZ algorithm, which introduces the concept of neghborhoods.

But this mixes problems from different layers. I wouldn't like to make chapters about net market-centric. Do you have any proposal that would be more neutral?


arch-snapshot/arch.md line 799 at r4 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

As per my comment below - I don't think that's true in the generally accepted sense. "Broadcasting" is not really a mechanism - it's a name for the act of sending something to everybody.

Changed to broadcasting operation


arch-snapshot/arch.md line 804 at r4 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I think I'm finally starting to understand where my confusion stems from. I think these are (correct me if I'm wrong) a rather unorthodox use of the word "broadcast" and a leaky abstraction of the HybridNet. It is my understanding that what you mean by "broadcast" is a way to send a similar message to all neighbors.

I consider the naming unorthodox is because usually broadcast means sending something to all participants of a network, not a subset (that's multicast).

I think that HybridNet's abstraction is leaky because the Markeplace needs to know of the network topology (i.e. what you call broadcast requires specifying some subset of nodes if I understood you correctly).

BTW, the fact that "propagation" has two meanings depending on the context doesn't help either:

  • the algorithm which Nodes obey (i.e. talking to neighbors and aggregating offers) to eventually disperse information from every Node to every Node
  • sending Offers from one Node to another

Here is a suggestion on how we can give words meanings to make it easier for readers:

  • broadcast - a process of sending a bit of information to all the nodes on the network; in a naive implementation, it may be every node speaking to every node or there might be a central aggregator; in a P2P implementation it the information might be routed through neighbors
  • propagation - the process built on top of Net, which makes sure that every Node sees every Offer
  • transmit - the the act of sending offers or IDs to particular Nodes
  • transmit to neighbors - the the act of sending offers or IDs to neighboring Nodes

I applied terminology fixes discussed on standup. Not a lot of them, but probably there will be more significant changes in other PRs for example about Offers propagation

Copy link
Contributor Author

@nieznanysprawiciel nieznanysprawiciel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 9 unresolved discussions (waiting on @dopiera)


arch-snapshot/arch.md line 579 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Could you elaborate more on the public key? Is it a part of the identity? Or is it ephemeral for network communication only?

I added chapter about Node identification to explain more.
And another chapter about encryption is on the way which is going to explain relationships between different keys

Copy link
Contributor Author

@nieznanysprawiciel nieznanysprawiciel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 9 unresolved discussions (waiting on @dopiera)


arch-snapshot/arch.md line 777 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Still over the same Session, right?

yes


arch-snapshot/arch.md line 777 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Could you please rephrase this sentence so that it is 100% clear that:

  • every reliable channel gets a new connection
  • every data transfer gets a new connection
  • everything else is squeezed into one connection (realiable?)

But it doesn't work like you said. We have only single channel of each type.
It would be nice to have it implemented with ability to create channel, but it was ad hoc implementation to solve a problem not a fully designed concept

Copy link
Collaborator

@dopiera dopiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 8 unresolved discussions (waiting on @nieznanysprawiciel)


arch-snapshot/arch.md line 777 at r1 (raw file):

Previously, nieznanysprawiciel wrote…

But it doesn't work like you said. We have only single channel of each type.
It would be nice to have it implemented with ability to create channel, but it was ad hoc implementation to solve a problem not a fully designed concept

I get it that channels in general are described in #112, I still have a limited understanding of their lifecycles. In particular:

  1. is there exactly one channel per triplet (NodeId pair, type of channel)?
  2. if I understand your description correctly, there is 1:1 relationship between channels and TCP connections, except for unreliable channels; is that right?
  3. is there exactly a single Session underpinning all channels with the same NodeId pair?

While the answer to the first question belongs to #112, I believe that answers to questions 2. and 3. should be answered here. Especially that I got it wrong as you say.


arch-snapshot/arch.md line 784 at r4 (raw file):
My intention is as follows:

  1. Provide rationale for why need a Broadcast operation in the first place - granted, this should end up in Networking #112.
  2. Emphasize that a naive approach would not work, hence we need something smarter, which is why we're introducing the concept of neighbors.

Given that 1. should end up in #112 perhaps we should leave only something along these lines:

Broadcasting information via a naive way, i.e. every Node contacting all other Nodes would be a drastic scalability limitation. Hence we need smarter ways of disseminating information. HybridNet's implementation is inspired by XYZ algorithm, which introduces the concept of neighborhoods.


arch-snapshot/arch.md line 799 at r4 (raw file):
I think I do understand what you mean but only because we talked about it before. How about rephrasing it slightly:

A technique used in various broadcasting algorithms (please link at least one) involves disseminating information through the network by having each node send broadcast messages only to its neighbors, relying on those neighbors to forward the information further. HybridNet uses it as well.


arch-snapshot/arch.md line 804 at r4 (raw file):

Previously, nieznanysprawiciel wrote…

I applied terminology fixes discussed on standup. Not a lot of them, but probably there will be more significant changes in other PRs for example about Offers propagation

I still see some places, but let me resolve this comment and create comments in relevant places.


arch-snapshot/arch.md line 818 at r7 (raw file):

Its most significant application is in [the Offer Propagation algorithm](#offer-propagation). While the implementation of 
specific algorithms is handled by other modules, the network module provides the necessary operations as building 
blocks for these processes.

Doesn't this apply to broadcasting in general rather than the technique, which HybridNet uses under the hood? IMO such a sentence should be in #112 to explain to the reader what the broadcast operation is for.

Code quote:

Its most significant application is in [the Offer Propagation algorithm](#offer-propagation). While the implementation of
specific algorithms is handled by other modules, the network module provides the necessary operations as building
blocks for these processes.

arch-snapshot/arch.md line 822 at r7 (raw file):
That's not consistent with how we intended to use the word "broadcast", right? How about:

HybridNet implements the broadcast operation via sending broadcast messages to the nearest neighborhood...

Code quote:

local broadcast operation that sends message to the nearest neighborhood of the Node.

arch-snapshot/arch.md line 822 at r7 (raw file):

**Broadcasting**

Hybrid Net implements local broadcast operation that sends message to the nearest neighborhood of the Node. To query 

nit: stray space

Code quote:

·

arch-snapshot/arch.md line 824 at r7 (raw file):

Hybrid Net implements local broadcast operation that sends message to the nearest neighborhood of the Node. To query 
its neighbors, a Node can send a `Neighborhood` request to the Relay server. The Relay server then responds with a 
list of Nodes that are closest to the querying Node, based on a predefined metric. 

nit: stray space

Code quote:

·

arch-snapshot/arch.md line 829 at r7 (raw file):

chapter on [communication](#establishing-connections-between-nodes). The neighborhood algorithm does not 
differentiate between Nodes capable of establishing peer-to-peer Sessions and those that require relayed 
communication. Unlike IP-level broadcasts, Hybrid Net uses reliable channels for message transmission. 

nit: trailing space - actually there are many more - can you please fix them all?

Code quote:

·

Copy link
Contributor Author

@nieznanysprawiciel nieznanysprawiciel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 5 unresolved discussions (waiting on @dopiera)


arch-snapshot/arch.md line 799 at r4 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I think I do understand what you mean but only because we talked about it before. How about rephrasing it slightly:

A technique used in various broadcasting algorithms (please link at least one) involves disseminating information through the network by having each node send broadcast messages only to its neighbors, relying on those neighbors to forward the information further. HybridNet uses it as well.

I linked to libp2p. I had hard time finding good description. I'm afraid that this will rather complicate things...

Copy link
Contributor Author

@nieznanysprawiciel nieznanysprawiciel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 5 unresolved discussions (waiting on @dopiera)


arch-snapshot/arch.md line 777 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I get it that channels in general are described in #112, I still have a limited understanding of their lifecycles. In particular:

  1. is there exactly one channel per triplet (NodeId pair, type of channel)?
  2. if I understand your description correctly, there is 1:1 relationship between channels and TCP connections, except for unreliable channels; is that right?
  3. is there exactly a single Session underpinning all channels with the same NodeId pair?

While the answer to the first question belongs to #112, I believe that answers to questions 2. and 3. should be answered here. Especially that I got it wrong as you say.

I clarified point 3.

Hmm there is another implementation detail that I missed when it comes to question 2. There are actually 2 tcp connections per transport type (channel). One is used to send messages, other is used to receive messages from other party.

This is implementation detail, done for simplicity. Otherwise there are problems with synchronization. If 2 Nodes try to establish tcp connection at the same time, we end up with 2 connections anyway. We could avoid this, by having rules, which connection should be closed and which one should remain. This would be more error prone than current version.
In general there is much to lose and not much to gain.

I'm wondering, should I describe it?
I wouldn't like to dive so deep into implementation.

Copy link
Contributor Author

@nieznanysprawiciel nieznanysprawiciel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 5 unresolved discussions (waiting on @dopiera)


arch-snapshot/arch.md line 784 at r4 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

My intention is as follows:

  1. Provide rationale for why need a Broadcast operation in the first place - granted, this should end up in Networking #112.
  2. Emphasize that a naive approach would not work, hence we need something smarter, which is why we're introducing the concept of neighbors.

Given that 1. should end up in #112 perhaps we should leave only something along these lines:

Broadcasting information via a naive way, i.e. every Node contacting all other Nodes would be a drastic scalability limitation. Hence we need smarter ways of disseminating information. HybridNet's implementation is inspired by XYZ algorithm, which introduces the concept of neighborhoods.

Done.


arch-snapshot/arch.md line 822 at r7 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

That's not consistent with how we intended to use the word "broadcast", right? How about:

HybridNet implements the broadcast operation via sending broadcast messages to the nearest neighborhood...

Done.

Copy link
Contributor Author

@nieznanysprawiciel nieznanysprawiciel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 5 unresolved discussions (waiting on @dopiera)


arch-snapshot/arch.md line 784 at r4 (raw file):

Previously, nieznanysprawiciel wrote…

Done.

Point 1 should be applied to the Network PR

Copy link
Contributor Author

@nieznanysprawiciel nieznanysprawiciel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so all changes targeting this PR are finished. There 2 comments that should be applied in https://reviewable.io/reviews/golemfactory/golem-architecture/112
I would prefer to apply them after we will have unified version. So maybe let's merge those PR first and later I will create new one to add necessary things?

Reviewable status: 0 of 1 files reviewed, 5 unresolved discussions (waiting on @dopiera)

Copy link
Collaborator

@dopiera dopiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @nieznanysprawiciel)


arch-snapshot/arch.md line 777 at r1 (raw file):

Previously, nieznanysprawiciel wrote…

I clarified point 3.

Hmm there is another implementation detail that I missed when it comes to question 2. There are actually 2 tcp connections per transport type (channel). One is used to send messages, other is used to receive messages from other party.

This is implementation detail, done for simplicity. Otherwise there are problems with synchronization. If 2 Nodes try to establish tcp connection at the same time, we end up with 2 connections anyway. We could avoid this, by having rules, which connection should be closed and which one should remain. This would be more error prone than current version.
In general there is much to lose and not much to gain.

I'm wondering, should I describe it?
I wouldn't like to dive so deep into implementation.

Let's ignore it.


arch-snapshot/arch.md line 794 at r10 (raw file):

types are distinguished by using separate TCP connections. This separation ensures that independent sender buffers 
are maintained, preventing messages in one channel from being blocked by messages in the other. Each transport type 
has its own single TCP connection. This means that independent transfers still can interfere with each other.

Sorry - dumb question. Are they even possible? I.e. is there some multiplexing on top of this TCP connection?

Code quote:

This means that independent transfers still can interfere with each other.

Copy link
Contributor Author

@nieznanysprawiciel nieznanysprawiciel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @dopiera)


arch-snapshot/arch.md line 794 at r10 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Sorry - dumb question. Are they even possible? I.e. is there some multiplexing on top of this TCP connection?

Yes, look at my answer in networking PR.

Copy link
Collaborator

@dopiera dopiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @nieznanysprawiciel)


arch-snapshot/arch.md line 794 at r10 (raw file):
Following our F2F discussion - I think the word "transfer" creates confusion here. How about we replace this sentence with this?

If multiple components within the same process need to use the transfer channel, they will share a single TCP connection, competing for bandwidth. As there is no fair scheduling mechanism, one component can easily starve another by producing messages at a sufficiently high rate.

@nieznanysprawiciel nieznanysprawiciel changed the base branch from describe-networking to master October 31, 2024 16:20
@nieznanysprawiciel nieznanysprawiciel changed the base branch from master to new-arch October 31, 2024 16:39
Copy link
Collaborator

@dopiera dopiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r11.
Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @nieznanysprawiciel)

@dopiera dopiera merged commit 55dca10 into new-arch Nov 3, 2024
@nieznanysprawiciel nieznanysprawiciel deleted the describe-hybrid-net branch November 4, 2024 09:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants