-
Notifications
You must be signed in to change notification settings - Fork 493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TLV failure message and length relaxation #1021
Conversation
Concept ACK. Note that this doesn't provide a lot of bytes for the tlv stream though, since we're limited to 256 bytes for the failure payload, but it's better than nothing. An important point to note is that the tlv stream mustn't be length-prefixed (same as our tlv extensions at the end of each lightning message). It's worth adding a test vector to make sure people get that right. |
Yes, I can add a test vector. But I think I'll do it just for the failure message bytes excluding padding and encryption. |
3860351
to
ff5b364
Compare
Added test vector to ensure that implementers won't prepend the tlv stream with a length. |
IMHO it would really be more helpful to future readers to update the existing e2e test vector... |
Discussed in spec meeting: pair this change with an increase in the recommended failure message size. Any thoughts on the new size? I was thinking 1024, to bring it roughly in line with the forward pass payload? |
1024 sounds good to me since we don't have a real use-case yet, as long as all implementations change their behavior to handle potentially longer messages in the future (this way we can change the spec recommendation later if a use-case comes up that requires a bigger length without having to update intermediate nodes). |
So just to get it right, you're saying we don't change the recommendation yet? If that's indeed the intention, we have to be aware that a later increase allows for fingerprinting of the destination node. Perhaps this needs an activation date... All implementations switch to 1024 byte failure messages on Jan 1, 2023 :) |
No, I think it's a good idea to change the recommendation to something bigger, and 1024 sounds good. I'm just saying we can potentially go much bigger (we're only restricted by the whole message being 65kB), but since we don't know yet how we're going to use it, a first step to 1024 is probably a good idea. But we should also change the spec to say that forwarding nodes should be able to handle messages with a different length than the recommended one, to give us freedom to change it in the future without requiring a network-wide upgrade. Does that make sense? |
Yes. Will also check to see what lnd does and does not support currently. |
It seems that The fix lightningnetwork/lnd#6913 is small, but critical to unblock the usage of longer tlv failures. |
82e35fd
to
7baffed
Compare
Updated PR with 1024 byte tlv failure test vector. For my own reference: patch used to generate vector lightningnetwork/lnd@master...bottlepay:lnd:failure-message-test-vector and lightningnetwork/lightning-onion@master...joostjager:lightning-onion:failure-message-test-vector |
7baffed
to
954665d
Compare
51cc167
to
344440a
Compare
@@ -823,9 +823,8 @@ channel. | |||
### Requirements | |||
|
|||
The _erring node_: | |||
- SHOULD set `pad` such that the `failure_len` plus `pad_len` is equal to 256. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this change a bit more: I think we might need either a new feature bit or a new "empty" TLV key in the onion for the switch over to work properly. Without some sort of signal, then a destination or intermediate node doesn't know if the sender can actually handle/decrypt errors larger than 256 bytes. This is effectively hard coded in the spec and most implementations, so just changing it here without any additional context-specific signalling won't allow for a smooth switchover.
As an example, on the lnd side we may end up black listing a node for sending a larger errors (we have a hard check, and would discard it as internal network funny business or w/e). Instead, if we had the sender signal in the onion to each node (intermediate and also final) that they understand larger errors, then the error'ing node will know if they can attach additional context or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good point. The empty TLV key option looks to be the smallest in size. But if we invent a new failure message format in the future that for example fixes error attribution, we'd need another empty TLV key, or use the original TLV with a non-empty value. In the latter case, we need to specify it so that no implementation will trip over a non-empty value from the start.
Perhaps defining a TLV record for sender feature bits from the get go is the cleanest solution. Trade-off is obviously the added size, which I think is one byte extra per hop compared to an empty tlv. This assumes we start fresh with the assignment of failure message feature bits, to not exceed one byte initially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feedback from the spec meeting on this: since we don't have a short-term use-case for longer error messages, we can take the slow approach that requires less changes and no signaling:
- Implementations add support now for reading failure messages of arbitrary length
- In 1 year or so, we can consider that senders do support it and may start writing failure messages of arbitrary length
- If a specific feature requires arbitrary length messages before that, the usage of that feature is an implicit signal that the sender knows how to read failure messages of arbitrary length, so it's ok to generate them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a short term use case: user defined extensions to the incorrect_details
failure. For example for hodl invoices that are cancelled, more details could be added as to why it was cancelled. Or (just brainstorming now) return a "redirect" invoice in a tlv field.
Given the right extension points in a node implementation (fail hodl invoice with reason and expose failure tlv to user of the sending node), users could add and read their own data and this may exceed the current 256 byte limit. I don't think 3) applies to this scenario? LND's htlc interceptor API also allows users to return custom failure messages today, so one receiver-side extension point already exists.
You could argue that it is not only up to the lightning devs to decide that there is no current use case.
Regarding 2): Is it safe to assume that almost all senders support the feature after one year? I don't think there is a direct indication of the proportion of nodes that have upgraded, and this isn't a feature bit that is part of the node announcement either? It could be part of it, but it doesn't seem to be useful because nodes in a path don't know the key of the origin node anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For inbound fees, we probably need this too. See lightningnetwork/lnd#6967, commit "signal failure message format".
The specification recommends using a length of 256 for onion errors, but it doesn't say that we should reject errors that use a different length. We may want to start creating errors with a bigger length than 256 if we need to transmit more data to the sender. In order to prepare for this, we keep creating 256-bytes onion errors, but allow receiving errors of arbitrary length. See the specification here: lightning/bolts#1021 Fixes #2438
Right, that's what I meant by "no failure reason", sorry it was unclear. My point is that if you don't upgrade for two years suffering from degraded route selection performance seems like a perfectly fine outcome. |
Maybe inbound routing fees become a success, and then we need to wait for two years before we can return a complete Of course everyone can have their own opinion about the viability of protocol extensions, the value of experimentation and whether that's worth the trade-off with privacy. |
I'm kinda confused why a fee_insufficient failure is > 256 bytes? There's a ton of headroom above existing error messages to play with. |
If the fee is insufficient, it could be that the sender either used an outdated inbound fee policy, an outdated outbound fee policy or both. The routing node only sees a total fee paid, and doesn't know which one of the cases applies. To cover all cases and ensure that the next attempt is successful, the routing node would need to return both the incoming channel update and the outgoing channel update with the Of course signed fee information can be returned in a more compact way, but that would deviate from the standard channel update format. I expect that to lead to lots of extra work and maintenance. The graph database for example would also need to support this new compact format. |
That sounds like the way forward if you want to do experimentation though? You really don't need to return complete channel updates, and returning only the values you're interested in lets you experiment today, without changes from any implementation nor the spec required. |
Yes, could do that. But the quicker way to go forward with the experiment is to just signal using a custom onion payload tlv record. Long failures are already relayed back, so no implementation changes required. And for the rest it's just between sender and routing node. I thought that it would be nice to make that signaling official if we see a future need for long failure messages anyway. But if not, I'll just go ahead and do my own thing. In that case, I can remove the increase of the recommended failure length to 1024 from this PR and note that longer failure messages are a risk because senders may not be able to parse. |
The specification recommends using a length of 256 for onion errors, but it doesn't say that we should reject errors that use a different length. We may want to start creating errors with a bigger length than 256 if we need to transmit more data to the sender. In order to prepare for this, we keep creating 256-bytes onion errors, but allow receiving errors of arbitrary length. See the specification here: lightning/bolts#1021 Fixes #2438
720911a
to
1a48cdd
Compare
Removed increase of the recommended message length to 1024 from PR and added comment about old nodes not being able to interpret long messages. |
Extend every onion failure with an optional tlv stream. See the specification here: lightning/bolts#1021
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 1a48cdd
The test vectors match with eclair (see ACINQ/eclair#2455).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 1a48cdd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🛹
lnd v0.16.0 is scheduled to ship with the relaxed behavior for the sender (the one decrypting the final payload).
@joostjager do you have a branch of lnd where I can force a node to emit an error bigger than 256 bytes that contains dummy tlvs to test cross-compatibility? |
If you run an lnd routing node with lightningnetwork/lnd#6967 and the sender does not include enough fee, you will get back a long failure message with two channel updates. The sender does need to signal long failure compatibility through a tlv record though. If this is all a bit too much for the purpose of interop testing, I can prep a simpler branch with a long failure for you. |
Don't worry, this should work well enough for my tests. Can you simply update the description of lightningnetwork/lnd#6967 with the details of the tlv that needs to be set in the onion to opt-in to receive the long failure? |
Description updated. |
Extend every onion failure with an optional tlv stream. See the specification here: lightning/bolts#1021
Extend every onion failure with an optional tlv stream. See the specification here: lightning/bolts#1021
Extend every onion failure with an optional tlv stream. Added to the specification by: lightning/bolts#1021
Extend the failure message specification with a tlv stream in line with the onion tlv payload and p2p messages.