Gossipsub message throttling / TTL ? #5504

Wiezzel · 2024-07-18T10:50:00Z

I have a live network with ~800 nodes which publish "ping" messages with their state every 10 seconds. This generates quite a lot of messages, but they quickly become obsolete: as soon as a new ping is broadcast, the previous one from the same peer is no longer of any interest. When a node joins the network and subscribes to the pings topic, I see it getting flooded with thousands of messages. Therefore, I would like to ask whether the gossipsub protocol supports either of the following features:

Message throttling – limiting the number of messages from a particular publisher accepted within a given time frame, or
TTL – discarding messages older than some given time (could be configurable per message or per topic)
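For illustration, the first option (per-publisher throttling) could be approximated on the application side with a fixed-window counter. This is only a sketch under stated assumptions: `PublisherThrottle` is a hypothetical helper, `String` stands in for a peer identifier, and nothing here is part of the gossipsub API.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Fixed-window rate limiter keyed by publisher (hypothetical helper).
/// `String` stands in for a peer identifier; this is not a gossipsub type.
pub struct PublisherThrottle {
    window: Duration,
    max_per_window: u32,
    // publisher -> (start of the current window, messages accepted in it)
    state: HashMap<String, (Instant, u32)>,
}

impl PublisherThrottle {
    pub fn new(window: Duration, max_per_window: u32) -> Self {
        Self { window, max_per_window, state: HashMap::new() }
    }

    /// Returns true if a message from `publisher` arriving at `now`
    /// still fits into that publisher's budget for the current window.
    pub fn allow(&mut self, publisher: &str, now: Instant) -> bool {
        let entry = self.state.entry(publisher.to_owned()).or_insert((now, 0));
        // Start a fresh window once the previous one has elapsed.
        if now.duration_since(entry.0) >= self.window {
            *entry = (now, 0);
        }
        if entry.1 < self.max_per_window {
            entry.1 += 1;
            true
        } else {
            false
        }
    }
}
```

With a 10-second window and a budget of one message, a second ping from the same publisher inside the window would be refused while other publishers are unaffected.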

Wiezzel · 2024-08-01T11:47:12Z

Author

@jxs @guillaumemichel Referring to our recent discussion, I have found out that dropping outdated messages without penalizing peers should be possible by submitting MessageAcceptance::Ignore as the validation result. This should work well with simply keeping the most recent sequence number for each publisher and ignoring all earlier messages originating from that peer.

It still puzzles me, though, how it is possible that, with the default gossipsub config, there are months-old messages circulating in my network. I would expect gossipsub messages to be naturally gone after a minute or two. Perhaps it's important: I see that about 50% of these very old messages come from a single propagation source.

5 replies

guillaumemichel Aug 2, 2024
Maintainer

Gossipsub has two components:

Pubsub: spread messages quickly (and probabilistically) throughout the network
Gossip: allows nodes that missed a message to learn about it and request it from a peer that has it

Hence the traffic that you see after months is certainly nodes catching up, through gossip, on the messages they missed in the last months.

Wiezzel Aug 2, 2024
Author

@guillaumemichel So if a new node joins the network it catches up with messages that were published before it went online? But if the default heartbeat is 1 second, and the default history length is 5 heartbeats, shouldn't it be able to get only the messages from approx. last 5 seconds? How are these old messages still stored?

guillaumemichel Aug 2, 2024
Maintainer

> if the default heartbeat is 1 second, and the default history length is 5 heartbeats, shouldn't it be able to get only the messages from approx. last 5 seconds?

Yes, that would make sense. I cannot explain why older messages are still circulating

Wiezzel Aug 2, 2024
Author

@guillaumemichel Do you have any idea how to debug this further? (Please keep in mind that I only control a few nodes in the network so the capabilities are limited here)

guillaumemichel Aug 2, 2024
Maintainer

I am not too familiar with the protocol, nor its implementation.

Maybe you can find more details in the Gossipsub paper or the specs to confirm that this behaviour is actually unexpected. And if it is the case, try to look for what's wrong in the implementation.

Wiezzel · 2024-08-02T09:56:11Z

Author

If this adds any context, I have been seeing a lot of these warnings in all nodes:

WARN libp2p_gossipsub::handler: The maximum number of outbound substream attempts has been exceeded

And more recently, one of the node operators also reported a bunch of these:

WARN libp2p_gossipsub::behaviour: Rejected message not in cache 313244334b...

anilaltuner · 2024-08-17T06:18:03Z

This is the big problem on the network right now. Any updates from you? @Wiezzel

Stebalien · 2024-08-18T14:39:37Z

We've been having this issue with go-libp2p in F3 (Lotus/Filecoin), I'm going to propose a spec update this week to fix it. The idea is to (optionally) repurpose the message nonce as an expiration timestamp.

anilaltuner · 2024-08-18T14:59:19Z

Hey @Stebalien!

I debugged the repo a bit and worked on an improvement in a forked repo.

So first of all, the problem is this:

If you look at this part, you will see that the on_connection_handler function synchronises all messages. This function does more work as the number of peers increases, and right at this part
https://github.com/libp2p/rust-libp2p/blob/master/protocols/gossipsub/src/behaviour.rs#L3022

it triggers the following function to receive messages like crazy and then puts these messages directly into memcache.
https://github.com/libp2p/rust-libp2p/blob/master/protocols/gossipsub/src/behaviour.rs#L1685

And the memory continues to swell. Also, these messages are propagated again when a new node arrives. I dug a little into where this message list comes from and realized that these messages first go through ValidationMode.

This is the part where the proto is directly decoded. I realized that ValidationMode validates three things: signature, sequence_no, and source.
https://github.com/libp2p/rust-libp2p/blob/master/protocols/gossipsub/src/protocol.rs#L235

In rust-libp2p, sequence_no works like this: when the node starts up, it takes the current timestamp and then increments it.

I changed sequence_no to be purely a timestamp. If you look at this part, before each message is sent, sequence_no is set to the timestamp of that moment:
https://github.com/anilaltuner/rust-libp2p/blob/master/protocols/gossipsub/src/behaviour.rs#L2639

In this section, I added a TTL check directly into the sequence_no verification. Messages that do not pass the filter go directly to the invalid list.
https://github.com/anilaltuner/rust-libp2p/blob/master/protocols/gossipsub/src/protocol.rs#L349

Incoming messages are blocked without any processing, data extraction or cache.
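A TTL check of this kind could look roughly like the sketch below. It is not the fork's code: `check_ttl`, `Freshness` and the `max_skew` tolerance are hypothetical, and it assumes the sequence number carries nanoseconds since the Unix epoch. The skew tolerance is included because a pure timestamp comparison depends on the clocks of sender and receiver agreeing.

```rust
use std::time::Duration;

/// Outcome of the TTL check (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum Freshness {
    Fresh,
    Expired,
    FromFuture,
}

/// Classifies a message by the timestamp carried in its sequence number.
/// Assumes `seq_no_nanos` is nanoseconds since the Unix epoch; that is a
/// convention of this sketch, not something the gossipsub spec guarantees.
pub fn check_ttl(
    seq_no_nanos: u64,
    now_nanos: u64,
    ttl: Duration,
    max_skew: Duration,
) -> Freshness {
    let sent = u128::from(seq_no_nanos);
    let now = u128::from(now_nanos);
    if sent > now + max_skew.as_nanos() {
        // The sender's clock is further ahead of ours than we tolerate.
        Freshness::FromFuture
    } else if now.saturating_sub(sent) > ttl.as_nanos() + max_skew.as_nanos() {
        // Older than the TTL, even after allowing for clock skew.
        Freshness::Expired
    } else {
        Freshness::Fresh
    }
}
```

How `Expired` and `FromFuture` messages are treated (dropped silently or counted against the sender) is a separate design choice that this sketch leaves open.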

The other thing is the memory issue.

Some apps do not need past messages, but libp2p still stores them in the memcache. So I added a way to control this.

Normally, libp2p uses a HashMap for it:

rust-libp2p/protocols/gossipsub/src/mcache.rs

Line 41 in d9ee266

msgs: HashMap<MessageId, (RawMessage, HashSet<PeerId>)>,

But I implemented lru_cache_time;
https://github.com/anilaltuner/rust-libp2p/blob/master/protocols/gossipsub/src/mcache.rs#L42

The idea is that messages start being deleted after a certain time or once a certain capacity is reached, so old messages are not kept in memory.
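As a rough illustration of a cache bounded by both age and size, here is a standard-library-only sketch. It is not the fork's implementation: `BoundedCache` is a hypothetical type, `String` ids and byte payloads are placeholders, and eviction is by insertion order rather than true least-recently-used order.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

/// Message cache bounded by both age and entry count (hypothetical type).
pub struct BoundedCache {
    ttl: Duration,
    capacity: usize,
    entries: HashMap<String, Vec<u8>>,
    // Insertion order, oldest first, with the insertion time of each id.
    order: VecDeque<(Instant, String)>,
}

impl BoundedCache {
    pub fn new(ttl: Duration, capacity: usize) -> Self {
        Self { ttl, capacity, entries: HashMap::new(), order: VecDeque::new() }
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    pub fn insert(&mut self, id: String, payload: Vec<u8>, now: Instant) {
        self.evict_expired(now);
        if self.capacity == 0 || self.entries.contains_key(&id) {
            return; // nothing to store, or already cached
        }
        // Make room by dropping the oldest entries first.
        while self.entries.len() >= self.capacity {
            match self.order.pop_front() {
                Some((_, oldest)) => {
                    self.entries.remove(&oldest);
                }
                None => break,
            }
        }
        self.order.push_back((now, id.clone()));
        self.entries.insert(id, payload);
    }

    pub fn get(&mut self, id: &str, now: Instant) -> Option<&[u8]> {
        self.evict_expired(now);
        self.entries.get(id).map(|v| v.as_slice())
    }

    // Removes every entry whose age has reached the TTL.
    fn evict_expired(&mut self, now: Instant) {
        while let Some(inserted) = self.order.front().map(|entry| entry.0) {
            if now.duration_since(inserted) < self.ttl {
                break;
            }
            if let Some((_, id)) = self.order.pop_front() {
                self.entries.remove(&id);
            }
        }
    }
}
```

Passing `now` explicitly keeps the sketch deterministic; a real integration would more likely read the clock internally.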

I'm testing the fork; after testing, I'm going to open a PR for the next version. You can use this configuration on the gossipsub behaviour like this.

Behaviour::new(
    MessageAuthenticity::Signed(id_keys),
    ConfigBuilder::default()
        .heartbeat_interval(Duration::from_secs(10))
        .message_id_fn(message_id_fn)
        // Option added in the fork: messages older than this are dropped.
        .message_ttl(Duration::from_secs(100))
        // Option added in the fork: bound on the number of cached messages.
        .message_capacity(100)
        .build()
        .expect("Valid config"),
)
.expect("Valid behaviour")

What do you think about that? Also @Wiezzel, does it fix your issue too? It seems we are implementing the same thing on top of libp2p.

Wiezzel · 2024-08-20T13:40:26Z

Author

@anilaltuner This is a viable solution to my problem. However, as @guillaumemichel pointed out to me, there are a couple of problems with it:
a) it does not conform to the gossipsub specs,
b) unexpected things may happen if nodes have out-of-sync clocks.

I personally implemented a simpler solution, using the message validation functionality. In my case only the most recent message from each publisher is interesting. So I just keep a hash map with the highest seen seq_no for each peer, and ignore messages with previous numbers.
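The approach described above can be sketched with the standard library alone. The names are assumptions made for illustration: `LatestSeqFilter` is hypothetical, `String` stands in for the publisher's peer id, and the local `Acceptance` enum only mirrors the two relevant outcomes; in a real node the result would be reported back through gossipsub's message validation.

```rust
use std::collections::HashMap;

/// Local stand-in for the two validation outcomes used here.
#[derive(Debug, PartialEq, Eq)]
pub enum Acceptance {
    Accept,
    Ignore,
}

/// Remembers the highest sequence number seen per publisher (hypothetical helper).
#[derive(Default)]
pub struct LatestSeqFilter {
    highest: HashMap<String, u64>,
}

impl LatestSeqFilter {
    /// Accepts a message only if it is newer than everything seen so far
    /// from the same publisher; older or repeated messages are ignored.
    pub fn check(&mut self, publisher: &str, seq_no: u64) -> Acceptance {
        match self.highest.get(publisher).copied() {
            Some(seen) if seq_no <= seen => Acceptance::Ignore,
            _ => {
                self.highest.insert(publisher.to_owned(), seq_no);
                Acceptance::Accept
            }
        }
    }
}
```

The map grows by one entry per publisher, so with roughly 800 nodes it stays small; a network with churning identities might additionally want to expire entries.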

AgeManning · 2024-09-02T05:26:32Z

Maintainer

Apologies for being late to this thread (I was away for a bit). I might be able to add some insight here.

The pubsub system is designed to publish messages as best it can throughout the network. There is no in-built mechanism to decide if a message is old or not.

Old messages can bounce around the network for a variety of reasons. Some that I've seen in the wild are:

An individual node has a long message cache configuration. The message cache stores messages and routinely gossips them. If another node joins the network, the first node will gossip about this old message and send it on to the new node. The new node may then store it for a long period of time and gossip it on to other new nodes that join the system, and the process repeats.
Message history is configured to be longer than the duplicate cache time. We have a cache that registers duplicate messages (via their message id). If a node has seen a message before, it just drops it. If the memcache retains messages longer than the duplicate cache, old messages can get gossiped, received again, and not filtered as duplicates because they have already been removed from the duplicate cache; they are then repropagated, potentially in an endless cycle. They should still get filtered, though, if they are in the local memcache.
Slow computers can build up a queue of messages to send out. If their upload speeds are really bad, these can queue for quite a while and then get sent super late. They look like old messages, but it's just that the node is really slow, or hit some long deadlock or something.
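For the second scenario, a quick sanity check is to compare how long a message can stay in the message cache (heartbeat interval times history length) with the duplicate cache time. The helpers below are hypothetical and only do that arithmetic; the parameter names follow the configuration options mentioned in this thread.

```rust
use std::time::Duration;

/// How long a message can remain in the message cache before it is shifted out.
pub fn gossip_window(heartbeat_interval: Duration, history_length: u32) -> Duration {
    heartbeat_interval * history_length
}

/// True if the duplicate cache remembers message ids for at least as long
/// as the message cache can keep the messages themselves.
pub fn duplicate_cache_covers_history(
    duplicate_cache_time: Duration,
    heartbeat_interval: Duration,
    history_length: u32,
) -> bool {
    duplicate_cache_time >= gossip_window(heartbeat_interval, history_length)
}
```

With the 1-second heartbeat and history length of 5 mentioned earlier in the thread, the window is 5 seconds; a 10-second heartbeat with a history length of 12 would already give 120 seconds, which a 60-second duplicate cache would not cover.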

In all of these scenarios, the easiest (and imo correct) way to handle this is to inform gossipsub about which messages are stale and which are not. This kind of logic is application-specific, so it has been left to the application. The way to do this is to set config.validate_messages(). This means that no message is forwarded on the network without explicit validation from the application. This allows the application to decide which messages should be forwarded and bounced around the network and which shouldn't. This applies not just to late messages, but also to malicious or invalid messages.

Once a message comes in, the application should then call report_message_validation_result() on the gossipsub behaviour with the message_id of the validated message. If the message is ok (i.e. not invalid or malicious) but you don't want to forward it, you use MessageAcceptance::Ignore. Sending MessageAcceptance::Accept means you will forward this message to other peers.
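The decision the application has to make can be sketched as a pure function. Everything about the payload here is an assumption for illustration: the hypothetical wire format puts an 8-byte big-endian Unix timestamp (seconds) at the start of the message, and the local `Acceptance` enum merely mirrors the three outcomes, which would then be handed back to gossipsub.

```rust
use std::time::Duration;

/// Local stand-in for the three validation outcomes.
#[derive(Debug, PartialEq, Eq)]
pub enum Acceptance {
    Accept,
    Reject,
    Ignore,
}

/// Decides what to do with a raw message.
/// Assumed (hypothetical) layout: 8-byte big-endian timestamp, then the payload.
pub fn validate(raw: &[u8], now_secs: u64, max_age: Duration) -> Acceptance {
    if raw.len() < 8 {
        // Malformed: cannot even contain the timestamp.
        return Acceptance::Reject;
    }
    let mut ts = [0u8; 8];
    ts.copy_from_slice(&raw[..8]);
    let sent_at = u64::from_be_bytes(ts);

    if now_secs.saturating_sub(sent_at) > max_age.as_secs() {
        // Well-formed but stale: do not forward, do not penalize the peer.
        Acceptance::Ignore
    } else {
        Acceptance::Accept
    }
}
```

Keeping the staleness rule in one small function makes it easy to unit-test independently of the network layer.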

Large queues can still cause messages to be late, however, if outbound messages exceed the capacity of the network to upload them.

It seems you've already found this solution, but I thought it might be useful to elaborate on the original design.

anilaltuner · 2024-09-02T06:57:48Z

Hey @AgeManning!

Firstly, thank you for elaborating; it is quite clear. The thing is, I've changed a few more things since my last update.

Yes, as you said, there is no built-in notion of whether a message is old or not. We can add this as custom logic in validate_messages, but there may be problems like these:

1. Messages come as a flood, and when we set MessageAuthenticity to Signed, thousands of messages hit verify_signature. It doesn't matter whether we validate or not, because these messages are validated when they reach other nodes and are kept in the message cache. This flood causes incredible CPU usage in verify_signature. Even if nothing is forwarded afterwards, the system becomes unusable for that node.
2. If we decide not to use signature verification, we need message transformation in order to use validate_messages. In this case, it will be necessary to parse and validate thousands of messages. If the timestamp of the message were included in the gossipsub message itself, without transforming the message (I turned the sequence number into this in my previous solution), there would be no need for parsing.
3. A HashMap is still used in the message cache. Wouldn't an LRU mechanism for this part, like the one in the duplicate cache, be an optimisation in terms of memory usage?

Apart from all these, I saw that the most basic problem I had was in send_queue #4572.

As a solution here, it is suggested that either send_queue should be limited or backpressure should be required between connections.

I solved it by limiting it, but is it still a valid suggestion or has a better solution been developed since then?

AgeManning · 2024-09-02T09:11:13Z

Maintainer

Hey, yeah, so sounds like you have huge amounts of burst in your network.

For 1, it sounds like the signature verification is too slow to handle the traffic. There's probably two options here, 1 - avoid signature verification, 2 - Backpressure and decide on what messages to drop. (I'll talk a bit about this later on).

We initially put the message transform function in there to handle optional compression. People could then use different compression libraries etc., but you could probably put some timing logic in there if the message-id is based on a time and then filter the messages out pretty quickly (I guess it would be a hack tho, and it's not obvious to me how to filter here straight away). I imagine the issue here isn't parsing the protobuf of so many messages; it's still message verification (i.e. sig verification). It sounds to me like you need to drop messages that are not relevant to you, and you want to drop them before the sig verification.

Gossipsub tries not to make decisions like these at the protocol level because they quickly get specialised and the protocol becomes very complex and less general. In fact, adding extra configurations into gossipsub was turned down by the maintainers at the time because they thought the protocol was already getting too complex. For this reason, gossipsub tries to be dumb about what messages are being sent, and passes functionality to the application of gossipsub to handle these details.

My initial reaction to your problem would be to implement backpressure via the MessageValidation functionality. Let gossipsub give you the burst of messages, dump them all into a queue in your application, and filter based on any metric (e.g. time); then, for the ones that are left, do sig verification and send back MessageAcceptance::Accept. If this is not possible, I'm not against adding a configuration parameter that takes a closure that can filter messages. The problem is that it can only rely on the gossipsub protobuf, not the application-specific encoding. So things like from, to, msg-id etc.
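One possible shape for such an application-side queue is sketched below, using only the standard library. All names (`ValidationQueue`, `Pending`, `Verdict`) are hypothetical, the age is measured from the time of receipt as one example metric, and the expensive check (for instance signature verification) is passed in as a closure so that it only runs for messages that survive the cheap filter.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Accept,
    Reject,
    Ignore,
}

/// A message waiting for validation (hypothetical shape).
pub struct Pending {
    pub id: u64,
    pub received_at: Instant,
    pub data: Vec<u8>,
}

/// Bounded queue that filters by age before running an expensive check.
pub struct ValidationQueue {
    max_len: usize,
    max_age: Duration,
    queue: VecDeque<Pending>,
}

impl ValidationQueue {
    pub fn new(max_len: usize, max_age: Duration) -> Self {
        Self { max_len, max_age, queue: VecDeque::new() }
    }

    /// Enqueues a message. If the queue is full, the oldest pending message
    /// is dropped to make room and its id is returned.
    pub fn push(&mut self, msg: Pending) -> Option<u64> {
        if self.max_len == 0 {
            return Some(msg.id);
        }
        let dropped = if self.queue.len() >= self.max_len {
            self.queue.pop_front().map(|m| m.id)
        } else {
            None
        };
        self.queue.push_back(msg);
        dropped
    }

    /// Drains the queue. Stale messages are ignored without calling `verify`;
    /// the rest are accepted or rejected based on its result.
    pub fn drain<F>(&mut self, now: Instant, mut verify: F) -> Vec<(u64, Verdict)>
    where
        F: FnMut(&[u8]) -> bool,
    {
        let max_age = self.max_age;
        self.queue
            .drain(..)
            .map(|m| {
                let verdict = if now.duration_since(m.received_at) > max_age {
                    Verdict::Ignore
                } else if verify(&m.data[..]) {
                    Verdict::Accept
                } else {
                    Verdict::Reject
                };
                (m.id, verdict)
            })
            .collect()
    }
}
```

Each returned verdict would then be reported back to gossipsub for the corresponding message id.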

The message cache is a weird thing. In the beginning, the gossipsub specification was the go implementation. The go implementation had this form of message cache. It was closely tied to the specification because of the configuration parameters like history_length and heartbeat. Essentially, it does behave like an LRU cache, only storing history_length heartbeats' worth of messages. Rather than being bound by space, it is bound by time. The specifications allow you to time-bound the history, but not length/space-bound it. So the memcache is a time-bound cache rather than an LRU. We implemented the duplicate cache ourselves, and naturally went with an LRU.

If you have a lot of burst messages, then the memcache will grow quite large, depending on these configuration parameters (which you can tweak). If we space-bound the memcache, then we'd run into footguns with the specification parameters like history_length, because we wouldn't be storing all the history for x heartbeats: we may have dropped some due to the size.
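To make the time-bound behaviour concrete, here is a simplified model of such a sliding-window cache. It is a sketch, not the rust-libp2p code: `WindowedCache` is a hypothetical type, message ids are plain `u64`s, and `shift` is meant to be called once per heartbeat.

```rust
use std::collections::{HashMap, VecDeque};

/// Message cache bounded by a number of heartbeats, not by size (simplified model).
pub struct WindowedCache {
    msgs: HashMap<u64, Vec<u8>>,
    // history[0] holds ids seen during the current heartbeat; the back is oldest.
    history: VecDeque<Vec<u64>>,
}

impl WindowedCache {
    pub fn new(history_length: usize) -> Self {
        // At least one slot is needed to hold the current heartbeat.
        let slots = history_length.max(1);
        let mut history = VecDeque::with_capacity(slots);
        for _ in 0..slots {
            history.push_back(Vec::new());
        }
        Self { msgs: HashMap::new(), history }
    }

    pub fn len(&self) -> usize {
        self.msgs.len()
    }

    pub fn get(&self, id: u64) -> Option<&[u8]> {
        self.msgs.get(&id).map(|v| v.as_slice())
    }

    /// Stores a message in the current heartbeat's slot. There is no size limit.
    pub fn put(&mut self, id: u64, data: Vec<u8>) {
        if self.msgs.contains_key(&id) {
            return;
        }
        self.msgs.insert(id, data);
        self.history[0].push(id);
    }

    /// Called on every heartbeat: drops the oldest slot and opens a new one.
    pub fn shift(&mut self) {
        if let Some(expired) = self.history.pop_back() {
            for id in expired {
                self.msgs.remove(&id);
            }
        }
        self.history.push_front(Vec::new());
    }
}
```

In this model a message disappears on the `history_length`-th shift after it was stored, while a burst within a single heartbeat is kept in full regardless of its size.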

Yes. There was a significant problem with send queues, which we also ran into quite a while ago. And you rightly found the solution: it should have backpressure. We resolved this; here is a useful comment to track changes: sigp/lighthouse#4918 (comment)

We needed a fix quickly and our changes were quite large to the gossipsub code base and we didn't have time to merge upstream. So we have forked from rust-libp2p and currently have our own gossipsub which handles the send queue backpressure, along with a few other fixes. Our implementation is here: https://github.com/sigp/lighthouse/tree/stable/beacon_node/lighthouse_network/gossipsub

We are planning/in the process of upstreaming our fixes to rust-libp2p but they are not in there yet.

Essentially, our fix creates channels with different priorities. Sending messages has a higher priority than sending IHAVE/IWANT gossip messages, for example. When we drop messages, lower-priority messages get dropped before higher-priority ones. These queues are also bounded and configurable, so you can choose a bound and they won't drop anything until we max them out.
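The general idea of a bounded, priority-aware send queue can be illustrated as follows. This is a deliberately simplified sketch and not the Lighthouse implementation: `SendQueue` and `Priority` are hypothetical, and both priorities share a single capacity here, whereas the description above mentions separate bounded channels.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Priority {
    High,
    Low,
}

/// Bounded send queue that sheds low-priority items first (simplified model).
pub struct SendQueue<T> {
    capacity: usize,
    high: VecDeque<T>,
    low: VecDeque<T>,
}

impl<T> SendQueue<T> {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, high: VecDeque::new(), low: VecDeque::new() }
    }

    pub fn len(&self) -> usize {
        self.high.len() + self.low.len()
    }

    /// Enqueues `item`. Returns whatever had to be dropped, if anything:
    /// nothing while there is room, otherwise a low-priority item when one
    /// exists, and the new item itself only as a last resort.
    pub fn push(&mut self, item: T, priority: Priority) -> Option<T> {
        if self.len() < self.capacity {
            match priority {
                Priority::High => self.high.push_back(item),
                Priority::Low => self.low.push_back(item),
            }
            return None;
        }
        match priority {
            // Full: a new low-priority item is not worth evicting anything for.
            Priority::Low => Some(item),
            // Full: make room by evicting the oldest low-priority item, if any.
            Priority::High => match self.low.pop_front() {
                Some(evicted) => {
                    self.high.push_back(item);
                    Some(evicted)
                }
                None => Some(item),
            },
        }
    }

    /// Dequeues the next item to send, high priority first.
    pub fn pop(&mut self) -> Option<T> {
        self.high.pop_front().or_else(|| self.low.pop_front())
    }
}
```

Nothing is dropped until the bound is reached, which matches the behaviour described above; which low-priority item to evict (oldest or newest) is a policy choice made arbitrarily in this sketch.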

Hope it helps :)
