Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add autonat v2 spec #538

Merged
merged 26 commits into from
Oct 31, 2024
Merged
Show file tree
Hide file tree
Changes from 24 commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
d663611
add autonat v2 spec
sukunrt Apr 12, 2023
1db8613
use priority ordered list in requests for autonat-v2
sukunrt Apr 15, 2023
0ff8ac6
only send index of the dialed address
sukunrt Apr 21, 2023
f2a431c
accept a priority ordered list of addresses for dial requests
sukunrt Apr 21, 2023
62123df
Improve naming for messages
sukunrt Apr 25, 2023
0771bab
add interaction diagram
sukunrt Apr 25, 2023
3e57202
address review comments
sukunrt Apr 27, 2023
f6def9a
allow autonat v2 to dial all ips (#542)
sukunrt May 3, 2023
b769d79
use oneof for messages
sukunrt May 3, 2023
e4efaae
use a single nonce in DialRequest
sukunrt May 15, 2023
f28511c
drop terms node and peer in favour of client and server
sukunrt Jul 11, 2023
d4da279
client should not reuse listen port while making dial request
sukunrt Jul 11, 2023
5b8d37d
explicitly disallow probing for private addresses
sukunrt Jul 11, 2023
05a0de2
add explanation for amplification attack prevention mechanism
sukunrt Jul 12, 2023
8b52643
add recommendation for 30k - 100k bytes
sukunrt Jul 16, 2023
dd2750c
move DialStatus proto out of Response
sukunrt Aug 12, 2023
4e6ecaa
wrap data sent for amplification attack prevention in a protobuf
sukunrt Aug 12, 2023
2af3309
send only a single dial status
sukunrt Aug 16, 2023
6b1604b
add comment regarding nonce
sukunrt Aug 18, 2023
f979fac
rename attempt to dial-back
sukunrt Aug 20, 2023
209b215
fix ResponseStatus_OK
sukunrt Aug 21, 2023
094089b
IPv4 only servers should refuse requests for IPv6 addresses
sukunrt Sep 6, 2023
b4a856b
fix dial-request protocol name
sukunrt Oct 30, 2023
1c76613
add a response to the dialback stream
sukunrt Feb 5, 2024
03718ef
allow the client to send slightly more dial data
sukunrt Jun 20, 2024
0195203
add note that server should not dial any private address
sukunrt Oct 31, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions autonat/README.md
sukunrt marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# NAT Discovery <!-- omit in toc -->
> How we detect if we're behind a NAT.

Specifications:
- [autonat v1](autonat-v1.md)
- [autonat v2](autonat-v2.md)
3 changes: 0 additions & 3 deletions autonat/autonat-v1.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,3 @@
# NAT Discovery <!-- omit in toc -->
> How we detect if we're behind a NAT.
| Lifecycle Stage | Maturity | Status | Latest Revision |
|-----------------|----------------|--------|-----------------|
| 3A | Recommendation | Active | r1, 2023-02-16 |
Expand Down
17 changes: 17 additions & 0 deletions autonat/autonat-v2-amplification-attack-prevention.plantuml
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
@startuml
participant Cli
participant Srv

skinparam sequenceMessageAlign center
skinparam defaultFontName monospaced


== Amplification Attack Prevention ==

Cli -> Srv: [dial] DialRequest:{nonce: 0xabcd, addrs: (addr1, addr2, addr3)}
Srv -> Cli: [dial] DialDataRequest:{addrIdx: 1, numBytes: 120k}
Cli -> Srv: [dial] {120k bytes}
Copy link
Contributor

@thomaseizinger thomaseizinger Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have a message for this too, otherwise implementing the decoder for these messages is going to be quite annoying because you need to attempt deserializing them and only if that fails, consider it bare "data".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the protocol is if the client accepts the DialDataRequest, it starts sending that amount of data. If the client doesn't accept the DialDataRequest, it resets the stream. You do not need to deserialise anything after you send DialDataRequest here. Just read that many bytes.

Wrapping the entire data in a protobuf will make the protobuf size very large.

Do you think adding a DataRequestAccepted message before sending that amount of data as raw bytes will make things cleaner?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I am asking for this is because at least in rust-libp2p, we typically implement "codecs" which are modules that turn byte streams into typed streams (and sinks) of messages. Reading bare half-way through does not play that well with strong type systems.

What do you mean it will make the protobuf large? Can't you just have a dynamically sized byte buffer in the protobuf? What is the difference?

Copy link
Contributor

@thomaseizinger thomaseizinger Jul 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I am asking for this is because at least in rust-libp2p, we typically implement "codecs" which are modules that turn byte streams into typed streams (and sinks) of messages. Reading bare half-way through does not play that well with strong type systems.

I guess we could unwrap the stream again and read a specified number of bytes.

Thoughts @mxinden?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean it will make the protobuf large

You'll need to bring the whole serialised protobuf into memory before decoding. This will take memory of the order of 100kb. Not prohibitive, but I'd like to avoid this if we can.

If I understand correctly, you would prefer it to be a protobuf so you can have a typed stream which says first object received from the stream is of type A and the second object received is of type B, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean it will make the protobuf large

You'll need to bring the whole serialised protobuf into memory before decoding. This will take memory of the order of 100kb. Not prohibitive, but I'd like to avoid this if we can.

Hmm, not sure how much of a hack this is but technically, the receiving end could declare the protobuf without the payload field and thus avoid allocating it?

That might work, but you're now hardcoding assumptions about the wire format of the Protobuf, outside of your Protobuf library. I'd like to avoid that.

Correct. Rust has tagged enums, meaning we can have rich types describing all fields of each message. But I don't really want to argue for it from an impl PoV. It is not that we are limited by Rust here. Instead, I think Rust's strong type systems makes it obvious that this isn't a consistent design

Not sure I agree with this framing, but in the end, this boils down to a matter of taste.

How about repeatedly sending a smaller message (limited by our usual Protobuf size), until we've reached the required number of bytes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Rust has tagged enums, meaning we can have rich types describing all fields of each message. But I don't really want to argue for it from an impl PoV. It is not that we are limited by Rust here. Instead, I think Rust's strong type systems makes it obvious that this isn't a consistent design:

So in the current proposal because the number of bytes that you need to read in the second message is not fixed and is equal to the number you asked for in the DialDataRequest this prevents you from writing a generic reader, right?
If that's the case we can prefix the bytes sent with a unsigned-varint just like we prefix other protobuf messages sent. This will also be consistent with the other messages exchanged everywhere which I'd argue is the source of inconsistency in the design as opposed to bytes not being sent as a TLV object.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After doing a POC implementation, I think Marten's suggestion here is the most suited to implementation. The problem from a go perspective is buffering in the protobuf msg reader on the server side. This can happen if someone does a pipelined implementation on the client side which sends data as soon as it has sent the request. see documentation for NewDelimitedReader for another case where this happens. While this situation is unlikely, it is simpler to just use protobufs for transferring data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that makes sense. Would that message be a regular protobuf with a static array?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using bytes.


message DialDataResponse {
    bytes data = 1;
}

Srv -> Cli: [attempt]addr2 DialAttempt:{nonce: 0xabcd}
Srv -> Cli: [dial] DialResponse:{status: OK, dialStatuses:(E_TRANSPORT_NOT_SUPPORTED, OK)}

@enduml
1 change: 1 addition & 0 deletions autonat/autonat-v2-amplification-attack-prevention.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
295 changes: 295 additions & 0 deletions autonat/autonat-v2.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,295 @@
# AutonatV2: spec
sukunrt marked this conversation as resolved.
Show resolved Hide resolved


| Lifecycle Stage | Maturity | Status | Latest Revision |
|-----------------|--------------------------|--------|-----------------|
| 1A | Working Draft | Active | r2, 2023-04-15 |

Authors: [@sukunrt]

Interest Group: [@marten-seemann], [@marcopolo], [@mxinden]

[@sukunrt]: https://github.com/sukunrt
[@marten-seemann]: https://github.com/marten-seemann
[@mxinden]: https://github.com/mxinden
[@marcopolo]: https://github.com/marcopolo


## Overview

A priori, a node cannot know if it is behind a NAT / firewall or if it is
publicly reachable. Moreover, the node may be publicly reachable on some of its
addresses and not on others. Knowing reachability for its addresses is essential
for the node to be well-behaved in the network: A node doesn't need to advertise
its unreachable addresses to the rest of the network, preventing superfluous
dials from other peers. Furthermore, in case it has no publicly reachable
addresses, it might actively seek to improve its connectivity by finding a relay
server, which would allow other peers to establish a relayed connection.

In `autonat v2` client sends a request with a priority ordered list of addresses
and a nonce. On receiving this request the server dials the first address in the
list that it is capable of dialing and provides the nonce. Upon completion of
the dial, the server responds to the client with the response containing the
dial outcome.

As the server dials _exactly_ one address from the list, `autonat v2` allows
nodes to determine reachability for individual addresses. Using `autonat v2`
nodes can build an address pipeline where they can test individual addresses
discovered by different sources like identify, upnp mappings, circuit addresses
etc for reachability. Having a priority ordered list of addresses provides the
ability to verify low priority addresses. Implementations can generate low
priority address guesses and add them to requests for high priority addresses as
a nice to have. This is especially helpful when introducing a new transport.
Initially, such a transport will not be widely supported in the network.
Requests for verifying such addresses can be reused to get information about
other addresses

The client can verify the server did successfully dial an address of the same
transport as it reported in the response by checking the local address of the
connection on which the nonce was received on.

Compared to `autonat v1` there are three major differences
1. `autonat v1` allowed testing reachability for the node. `autonat v2` allows
testing reachability for an individual address.
2. `autonat v2` provides a mechanism for nodes to verify whether the peer
actually successfully dialled an address.
3. `autonat v2` provides a mechanism for nodes to dial an IP address different
from the requesting node's observed IP address without risking amplification
attacks. `autonat v1` disallowed such dials to prevent amplification attacks.


## AutoNAT V2 Protocol
sukunrt marked this conversation as resolved.
Show resolved Hide resolved

![Autonat V2 Interaction](autonat-v2.svg)

A client node wishing to determine reachability of its addresses sends a
`DialRequest` message to a server on a stream with protocol ID
`/libp2p/autonat/2/dial-request`. Each `DialRequest` is sent on a new stream.

This `DialRequest` message has a list of addresses and a fixed64 `nonce`. The
list is ordered in descending order of priority for verification. AutoNAT V2 is
only for testing reachability on Public Internet. Client SHOULD NOT send any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be "MUST NOT". and "The server MUST NOT dial any private address".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to implement these safely though. Both the client and the server need to check that the peer is connected over a private IP.

client 192.168.0.100 -> server 192.168.0.10
In this case it's reasonable for the client to ask the server to test its private IP reachability.

This is an edge case I'm willing to ignore though. Happy to change this to MUST, just that keeping it SHOULD allows some implementation to provide this feature if they're willing to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you know that you are indeed on the same private network?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not completely sure.

If you see local connection address in private IP range and remote connection address in private IP range, is that enough to conclude that you're in some private network?
Note you cannot rely on https://datatracker.ietf.org/doc/html/rfc1918 subnet masks as you can make a private network from a collection of smaller private networks.

Copy link
Contributor

@MarcoPolo MarcoPolo Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if the local address and remote address are in the same private subnet, then it would be okay.

How about adding "The server SHOULD NOT dial any private address"? This leaves the door open in the spec.

I'm not sure the usefulness of doing this though, but maybe others might have a use for it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

private address as defined in [RFC
1918](https://datatracker.ietf.org/doc/html/rfc1918#section-3) in the list.
sukunrt marked this conversation as resolved.
Show resolved Hide resolved

Upon receiving this request, the server selects an address from the list to
dial. The server SHOULD use the first address it is willing to dial. The server
MUST NOT dial any address other than this one. If this selected address has an
IP address different from the requesting node's observed IP address, server
initiates the Amplification attack prevention mechanism (see [Amplification
Attack Prevention](#amplification-attack-prevention) ). On completion, the
server proceeds to the next step. If the selected address has the same IP
address as the client's observed IP address, server proceeds to the next step
skipping Amplification Attack Prevention steps.

The server dials the selected address, opens a stream with Protocol ID
`/libp2p/autonat/2/dial-back` and sends a `DialBack` message with the nonce
received in the request. The client on receiving this message replies with
a `DialBackResponse` message with the status set to `OK`. The client MUST
close this stream after sending the response. The dial back response provides
the server assurance that the message was delivered so that it can close the
connection.

Upon completion of the dial back, the server sends a `DialResponse` message to
the client node on the `/libp2p/autonat/2/dial-request` stream. The response
contains `addrIdx`, the index of the address the server selected to dial and
`DialStatus`, a dial status indicating the outcome of the dial back. The
`DialStatus` for an address is set according to [Requirements for
DialStatus](#requirements-for-dialstatus). The response also contains an
appropriate `ResponseStatus` set according to [Requirements For
ResponseStatus](#requirements-for-responsestatus).

The client MUST check that the nonce received in the `DialBack` is the same as
the nonce it sent in the `DialRequest`. If the nonce is different, it MUST
discard this response.

The server MUST close the stream after sending the response. The client MUST
close the stream after receiving the response.


### Requirements for DialStatus

On receiving a `DialRequest`, the server first selects an address that it will
dial.

If server chooses to not dial any of the requested addresses, `ResponseStatus`
is set to `E_DIAL_REFUSED`. The fields `addrIdx` and `DialStatus` are
meaningless in this case. See [Requirements For
ResponseStatus](#requirements-for-responsestatus).

If the server selects an address for dialing, `addrIdx` is set to the
index(zero-based) of the address on the list and the `DialStatus` is set
according to the following consideration:

If the server was unable to connect to the client on the selected address,
`DialStatus` is set to `E_DIAL_ERROR`, indicating the selected address is not
publicly reachable.

If the server was able to connect to the client on the selected address, but an
error occured while sending an nonce on the `/libp2p/autonat/2/dial-back`
stream, `DialStatus` is set to `E_DIAL_BACK_ERROR`. This might happen in case of
resource limited situations on client or server, or when either the client or
the server is misconfigured.

If the server was able to connect to the client and successfully send a nonce on
the `/libp2p/autonat/2/dial-back` stream, `DialStatus` is set to `OK`.

### Requirements for ResponseStatus

The `ResponseStatus` sent by the server in the `DialResponse` message MUST be
set according to the following requirements

`E_REQUEST_REJECTED`: The server didn't serve the request because of rate
limiting, resource limit reached or blacklisting.

`E_DIAL_REFUSED`: The server didn't dial back any address because it was
incapable of dialing or unwilling to dial any of the requested addresses.

`E_INTERNAL_ERROR`: Error not classified within the above error codes occured on
server preventing it from completing the request.

`OK`: the server completed the request successfully. A request is considered
completed successfully when the server either completes a dial(successfully or
unsuccessfully) or rejects all addresses in the request as undialable.

Implementations MUST discard responses with status codes they do not understand.

### Amplification Attack Prevention
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mxinden mention that this design is takes inspiration from a different protocol but I forgot which one. Might be worth mentioning!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. related to the QUIC anti-amplification mechanism based on data size:

Prior to validating the client address, servers MUST NOT send more than three times as many bytes as the number of bytes they have received. This limits the magnitude of any amplification attack that can be mounted using spoofed source addresses.

https://www.rfc-editor.org/rfc/rfc9000.html#section-8.1

I am not aware of an exact origin of the AutoNATv2 mechanism.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QUIC is always a good source of inspiration :)

The specifics here is something that we came up with ourselves, in the discussion on the issue. If you're aware of any prior art (RFCs?), please let us know.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should I mention QUIC?


![Interaction](autonat-v2-amplification-attack-prevention.svg)

When a client asks a server to dial an address that is not the client's observed
IP address, the server asks the client to send some non trivial amount of bytes
as a cost to dial a different IP address. To make amplification attacks
unattractive, servers SHOULD ask for 30k to 100k bytes. Since most handshakes
cost less than 10k bytes in bandwidth, 30kB is sufficient to make attacks
unattractive.

On receiving a `DialRequest`, the server selects the first address it is capable
of dialing. If this selected address has a IP different from the client's
observed IP, the server sends a `DialDataRequest` message with the selected
address's index(zero-based) and `numBytes` set to a sufficiently large value on
the `/libp2p/autonat/2/dial-request` stream

Upon receiving a `DialDataRequest` message, the client decides whether to accept
or reject the cost of dial. If the client rejects the cost, the client resets
the stream and the `DialRequest` is considered aborted. If the client accepts
the cost, the client starts transferring `numBytes` bytes to the server. The
client transfers these bytes wrapped in `DialDataResponse` protobufs where the
`data` field in each individual protobuf is limited to 4096 bytes in length.
This allows implementations to use a small buffer for reading and sending the
data. Only the size of the `data` field of `DialDataResponse` protobufs is
counted towards the bytes transferred. Once the server has received numBytes
bytes, it proceeds to dial the selected address.

If an attacker asks a server to dial a victim node, the only benefit the
attacker gets is forcing the server and the victim to do a cryptographic
handshake which costs some bandwidth and compute. The attacker by itself can do
a lot of handshakes with the victim without spending any compute by using the
same key repeatedly. The only benefit of going via the server to do this attack
is not spending bandwidth required for a handshake. So the prevention mechanism
only focuses on bandwidth costs. There is a minor benefit of bypassing IP
blocklists, but that's made unattractive by the fact that servers may ask 5x
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can simply shrug this off. This is called a reflection attack and has been a huge issue for open DNS resolvers.

Fixing the amplification side does go a long way, but paying a 5x bandwidth cost for a bunch of free IP addresses seems like a pretty reasonable tradeoff from an attacker's standpoint (especially because said attacker isn't paying for the bandwidth, but likely needs to compromise one machine per IP address).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note: home NAT users likely don't need this feature. That is:

  1. They likely only need 1 dialable address.
  2. They likely don't care which one.
  3. Their outbound and inbound IPs are likely identical.

Being willing to dial other addresses does matter for, e.g., AWS and other special settings where there are separate ingress IP addresses. But, in that case, maybe the user should just configure their node correctly rather than relying on AutoNAT? AutoNAT specifically exists to enable home users.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It simplifies client implementations as they don't need to worry about IPv4 peer vs IPv6 peer. Though the benefit isn't huge since most IPv4 servers won't have IPv6 connectivity so they any way cannot check the IPv6 address.

Fixing the amplification side does go a long way, but paying a 5x bandwidth cost for a bunch of free IP addresses seems like a pretty reasonable tradeoff from an attacker's standpoint (especially because said attacker isn't paying for the bandwidth, but likely needs to compromise one machine per IP address).

Can you elaborate here? why isn't the attacker paying for the bandwidth.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • This features enables the following attack:
    • Contact a large number of autonat v2 servers.
    • Give them the target's address to connect to
    • When requested for data: provide the data slowly:
      * The stream timeout is 1 minute
      * In the period, with a 1Gbps connection, you can send 60Gb ~= 6GB
      * Maximum dial data requirement is 100kB
      * So, in theory, you can run this with 60_000 peers in parallel.
      * The servers have a random wait of up to 3 seconds precisely for this scenario. So in theory we can have 20k connections a second for 3 seconds to the target.
    • We can make a bunch of implementation improvements to reduce the harm here. The simplest ones being: Only wait 10 seconds for the dial data, and wait for 5 seconds before dialing. That would reduce the max new connection rate to 2k / second, which is very manageable.
    • The ideal solution is to introduce rate limits for new connections.
  • There's another problem with this feature related to implementation difficulty:
    • The primary use for this feature is to allow both IPv4 and IPv6 addresses to be tested without worrying about whether we have a v4 or a v6 connection. So you can ask a v4 peer to test your v6 address. This requires correctly reporting an error in case the server has no v6 connectivity, which is majority of servers.
    • I'm not sure if the rust implementation correctly handles this case. @umgefahren please correct me if I'm wrong here.
      * See discussion around this comment: feat(autonatv2): Implement autonat v2 umgefahren/rust-libp2p#1 (comment)
    • I'm also not sure if we can rely on other implementations to correctly handle this. They might just make a dial, fail, and report unreachable.
  • If we have to ensure that we check v6 addresses with a v6 peer, it might be better to just disable this feature.

@MarcoPolo @Stebalien @umgefahren thoughts?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The potential attack vector seems to be described correctly, however I'm not sure that rust-libp2p is affected. The whole process is allowed to take a maximum of 10 seconds:
https://github.com/libp2p/rust-libp2p/blob/8ceadaac5aec4b462463ef4082d6af577a3158b1/protocols/autonat/src/v2/server/handler/dial_request.rs#L66
However, we don't wait any time before dealing back. This mitigation is a quick fix, I can prepare.

Regarding IPv4 and IPv6 I stand with @thomaseizinger's comment on that matter: umgefahren/rust-libp2p#1 (comment)

So it correctly handles that case, in that we don't generate false positives.

more data than the bandwidth cost of a handshake.

## Implementation Suggestions

For any given address, client implementations SHOULD do the following
- Periodically recheck reachability status.
- Query multiple servers to determine reachability.

The suggested heuristic for implementations is to consider an address reachable
if more than 3 servers report a successful dial and to consider an address
unreachable if more than 3 servers report unsuccessful dials. Implementations
are free to use different heuristics than this one

Servers SHOULD NOT reuse their listening port when making a dial back. In case
the client has reused their listen port when dialing out to the server, not
reusing the listen port for attempts prevents accidental hole punches. Clients
SHOULD only rely on the nonce and not on the peerID for verifying the dial back
as the server is free to use a separate peerID for the dial backs.

Servers SHOULD determine whether they have IPv6 and IPv4 connectivity. IPv4 only servers SHOULD refuse requests for dialing IPv6 addresses and IPv6 only
servers SHOULD refuse requests for dialing IPv4 addresses.


## RPC Messages
sukunrt marked this conversation as resolved.
Show resolved Hide resolved

All RPC messages sent over a stream are prefixed with the message length in
bytes, encoded as an unsigned variable length integer as defined by the
[multiformats unsigned-varint spec][uvarint-spec].

All RPC messages on stream `/libp2p/autonat/2/dial-request` are of type
`Message`. A `DialRequest` message is sent as a `Message` with the `msg` field
set to `DialRequest`. `DialResponse` and `DialDataRequest` are handled
similarly.

On stream `/libp2p/autonat/2/dial-back`, a `DialAttempt` message is sent
directly

```proto3

message Message {
oneof msg {
DialRequest dialRequest = 1;
DialResponse dialResponse = 2;
DialDataRequest dialDataRequest = 3;
DialDataResponse dialDataResponse = 4;
}
}


message DialRequest {
repeated bytes addrs = 1;
fixed64 nonce = 2;
}


message DialDataRequest {
uint32 addrIdx = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of an index, why not send the address itself? In my eyes just a small complexity reduction. Feel free to ignore. Based on intuition, the additional bytes (index < address) doesn't have a performance impact.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your suggestion is right. I'll resolve this while taking up the implementation. This will help us avoid sharing the slice between the thread that makes the dial request and the thread that receives the response. Similar effect can be had by sharing the idx to the requesting thread, but this seems simpler to me. I'll resolve this while implementing.

uint64 numBytes = 2;
}


enum DialStatus {
UNUSED = 0;
E_DIAL_ERROR = 100;
E_DIAL_BACK_ERROR = 101;
OK = 200;
}


message DialResponse {
enum ResponseStatus {
E_INTERNAL_ERROR = 0;
E_REQUEST_REJECTED = 100;
E_DIAL_REFUSED = 101;
OK = 200;
}

ResponseStatus status = 1;
uint32 addrIdx = 2;
DialStatus dialStatus = 3;
}


message DialDataResponse {
bytes data = 1;
}


message DialBack {
fixed64 nonce = 1;
}

message DialBackResponse {
enum DialBackStatus {
OK = 0;
}

DialBackStatus status = 1;
}
```

[uvarint-spec]: https://github.com/multiformats/unsigned-varint

20 changes: 20 additions & 0 deletions autonat/autonat-v2.plantuml
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
@startuml
participant Cli
participant Srv

skinparam sequenceMessageAlign center
skinparam defaultFontName monospaced


== Dial Request Success==

Cli -> Srv: [dial] DialRequest:{nonce: 0xabcd, addrs: (addr1, addr2, addr3)}
Srv -> Cli: [attempt]addr2 DialAttempt:{nonce: 0xabcd}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really sure how to achieve this, but it would be nice if it could be made clear that this is happening on a new connection (and a different 5-tuple), whereas the rest of the exchange happens on the same stream.

Srv -> Cli: [dial] DialResponse:{status: OK, dialStatuses:(E_TRANSPORT_NOT_SUPPORTED, OK)}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mentions E_TRANSPORT_NOT_SUPPORTED but that is missing from the protobufs?


== Dial Request Failure==

Cli -> Srv: [dial] DialRequest:{nonce: 0xabcd, addrs: (addr1, addr2, addr3)}
Srv ->x Cli: [attempt]addr2 DialAttempt:{nonce: 0xabcd}
Srv -> Cli: [dial] DialResponse:{status: OK, dialStatuses:(E_TRANSPORT_NOT_SUPPORTED, E_DIAL_ERROR)}
@enduml
Loading