
[Networking] Optimizing GossipSub RPC Handling Memory Usage Through Asynchronous Subscription Updates #4988

Merged
merged 62 commits into master from yahya/6870-fix-memory-intensive-issues-part-1
Nov 16, 2023

Conversation

Contributor

@yhassanzadeh13 yhassanzadeh13 commented Nov 9, 2023

Problem Statement:
Our current GossipSub implementation incurs significant memory overhead from the frequent updates triggered by the peer-scoring component. This component, which verifies peer subscriptions, becomes increasingly memory-intensive as RPC traffic rises. Each update rebuilds the router's topic list and peer roster; while individually lightweight, these updates cumulatively strain the system. A flame graph analysis highlighted this issue, pointing to a need for optimization.

Read more here: https://github.com/dapperlabs/flow-go/issues/6870

Proposed Solution:
This PR introduces a major optimization to the subscription update mechanism within the GossipSub router. We propose shifting from an event-based update system to a time-based approach. Specifically, this involves:

  1. Implementing an asynchronous logic to update subscriptions at predefined, longer intervals (e.g., once every 10 minutes).
  2. Introducing a configurable update interval, allowing system administrators to adjust the frequency according to network needs.
  3. Integrating HeroCache in place of sync.Map for monitoring subscriptions, aiming to reduce heap allocations and improve overall performance.

Key Features:

  • Asynchronous Update Logic: Regular, long-interval updates rather than event-based updates to minimize the frequency of memory-intensive operations.
  • Configurable Update Intervals: Customizable intervals for subscription updates, adjustable via a new flag.
  • HeroCache Integration: Leveraging HeroCache for more efficient subscription monitoring, optimizing memory usage.

@yhassanzadeh13 yhassanzadeh13 changed the title [Networking] Fixing part-1 of RPC inspection resource intensive operations [Networking] Fixing RPC inspection memory intensive operation (subscription validator) Nov 9, 2023
SubscriptionProviderConfig SubscriptionProviderParameters `mapstructure:",squash"`
}

type SubscriptionProviderParameters struct {
Contributor

Are there concrete minimum values we can use for the validate tags of this struct? For SubscriptionUpdateInterval, perhaps we can use 10m as the minimum, and for CacheSize, the current number of nodes on the network.

Contributor Author

Thanks for your input. Setting a fixed minimum seems risky. Presently, validation mandates that parameters be greater than zero, ensuring they are actively configured. Allowing node operators to adjust parameters down to minimal values as necessary seems more prudent.


type SubscriptionRecordCache struct {
c *stdmap.Backend
currentCycle atomic.Uint64
Contributor

Can you add some documentation describing the relationship between currentCycle and how it advances during each configured updateTopics interval? It's sort of implied here, but a short, concise description would be good.
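A rough illustration of the currentCycle idea being asked about, under the assumption that each update interval bumps the cycle and a peer's record counts as fresh only if it was stamped during the current cycle. All names here (cycleTracker, Advance, Touch, Fresh) are hypothetical stand-ins, not the PR's SubscriptionRecordCache API:

```go
package main

import "sync/atomic"

// cycleTracker sketches the currentCycle mechanism: every updateTopics
// interval advances the cycle, and a peer's subscription record is fresh
// only if it was stamped during the current cycle.
type cycleTracker struct {
	current atomic.Uint64
	stamped map[string]uint64 // peer ID -> cycle at last subscription update
}

func newCycleTracker() *cycleTracker {
	return &cycleTracker{stamped: make(map[string]uint64)}
}

// Advance is called once per update interval, implicitly expiring every
// record stamped in an earlier cycle.
func (c *cycleTracker) Advance() { c.current.Add(1) }

// Touch records that peer p's subscriptions were updated this cycle.
func (c *cycleTracker) Touch(p string) { c.stamped[p] = c.current.Load() }

// Fresh reports whether p's record belongs to the current cycle.
func (c *cycleTracker) Fresh(p string) bool { return c.stamped[p] == c.current.Load() }
```

The appeal of this scheme is that stale records never need to be scanned and deleted eagerly; advancing a single counter invalidates them all at once.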

Contributor

@kc1116 kc1116 left a comment

Left a few small comments, otherwise looks great!

Contributor

@peterargue peterargue left a comment

Looks good. Added a few small comments.

Comment on lines 146 to 149
case <-reg.validator.Ready():
reg.logger.Info().Msg("subscription validator started")
}
ready()
Contributor

I assume the worker isn't ready() unless the validator is.

Suggested change
case <-reg.validator.Ready():
reg.logger.Info().Msg("subscription validator started")
}
ready()
case <-reg.validator.Ready():
reg.logger.Info().Msg("subscription validator started")
ready()
}


Comment on lines 450 to 453
case <-scoreRegistry.Ready():
s.logger.Info().Msg("score registry started")
}
ready()
Contributor

same here

Suggested change
case <-scoreRegistry.Ready():
s.logger.Info().Msg("score registry started")
}
ready()
case <-scoreRegistry.Ready():
s.logger.Info().Msg("score registry started")
ready()
}

Contributor Author

@yhassanzadeh13 yhassanzadeh13 Nov 16, 2023

updatedTopics, err := s.cache.AddTopicForPeer(p, topic)
if err != nil {
// this is an irrecoverable error; hence, we crash the node.
ctx.Throw(fmt.Errorf("failed to update topics for peer %s: %w", p, err))
Contributor

nit: rather than passing the irrecoverable.SignalerContext, have this return an error, and the loop can throw if an error is returned.
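The refactor suggested in this nit might look roughly like the following sketch. Here throwFn stands in for irrecoverable.SignalerContext.Throw, and a plain map stands in for the real subscription cache; handleSubscription and runLoop are hypothetical names:

```go
package main

import (
	"errors"
	"fmt"
)

// throwFn stands in for irrecoverable.SignalerContext.Throw in this sketch.
type throwFn func(error)

// handleSubscription returns an error rather than throwing itself, so only
// the top-level loop decides that an error is irrecoverable.
func handleSubscription(topics map[string][]string, peer, topic string) ([]string, error) {
	if topics == nil {
		return nil, errors.New("subscription cache not initialized")
	}
	topics[peer] = append(topics[peer], topic)
	return topics[peer], nil
}

// runLoop shows the caller-side pattern suggested in the review comment:
// the loop throws if, and only if, the handler returns an error.
func runLoop(throw throwFn, topics map[string][]string, peer, topic string) {
	if _, err := handleSubscription(topics, peer, topic); err != nil {
		throw(fmt.Errorf("failed to update topics for peer %s: %w", peer, err))
	}
}
```

Keeping the throw at the loop level makes the handler independently testable and keeps the "crash the node" decision in one place.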


AddWorker(func(ctx irrecoverable.SignalerContext, ready component.ReadyFunc) {
logger.Debug().Msg("starting subscription validator")
v.subscriptionProvider.Start(ctx)
<-v.subscriptionProvider.Ready()
Contributor

does this component become ready quickly? if not, it's probably worth using a select with the context to allow graceful aborts


@yhassanzadeh13 yhassanzadeh13 added this pull request to the merge queue Nov 16, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 16, 2023
@yhassanzadeh13 yhassanzadeh13 added this pull request to the merge queue Nov 16, 2023
Merged via the queue into master with commit 4223549 Nov 16, 2023
54 checks passed
@yhassanzadeh13 yhassanzadeh13 deleted the yahya/6870-fix-memory-intensive-issues-part-1 branch November 16, 2023 23:52