feat: add error log when resource manager throttles crawler #772

Merged · 1 commit · May 26, 2022

Conversation

guseggert (Contributor)

Since this could leave the Accelerated DHT client in a degraded state,
due to not being able to completely populate its routing table, we
want to signal this to the user until we can modify the client to
degrade more gracefully when hitting resource limits.

Note that this logs only once per crawl, to avoid spamming the user.
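
For illustration only (not the PR's exact wiring): a minimal sketch of how a per-crawl sync.Once can gate the error so it fires at most once per crawl. The newCrawlErrHandler helper and the main function are hypothetical; peer.ID, network.ErrResourceLimitExceeded, and the logger mirror the snippet discussed below, and the import paths reflect the go-libp2p-core layout this repo used at the time.

package main

import (
	"errors"
	"sync"

	logging "github.com/ipfs/go-log/v2"
	"github.com/libp2p/go-libp2p-core/network"
	"github.com/libp2p/go-libp2p-core/peer"
)

var logger = logging.Logger("fullrtdht")

// newCrawlErrHandler is hypothetical: it returns a per-peer error callback
// that owns a fresh sync.Once, so the resource-limit error is logged at
// most once for the crawl that created the handler.
func newCrawlErrHandler() func(peer.ID, error) {
	var limitErrOnce sync.Once // scoped to a single crawl
	return func(p peer.ID, err error) {
		if errors.Is(err, network.ErrResourceLimitExceeded) {
			limitErrOnce.Do(func() {
				logger.Errorf("routing table refresh was throttled by the Resource Manager")
			})
		}
	}
}

func main() {
	handler := newCrawlErrHandler() // a new handler (and Once) per crawl
	handler(peer.ID(""), network.ErrResourceLimitExceeded)
	handler(peer.ID(""), network.ErrResourceLimitExceeded) // logged only once
}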

@aschmahmann (Contributor) left a comment

Seems reasonable to me; I don't have time to test this out at the moment (somewhat necessary given the abysmal test coverage of the accelerated DHT client 😬), but should be able to later today or tomorrow.

If you want another set of hands to run tests to see how helpful this is at finding the limits, @Jorropo may be free.

fullrt/dht.go Outdated
func(p peer.ID, err error) {
	if errors.Is(err, network.ErrResourceLimitExceeded) {
		// Log at most once per crawl to avoid spamming the user.
		limitErrOnce.Do(func() {
			logger.Errorf("Accelerated DHT client was unable to fully refresh its routing table due to Resource Manager limits, which may degrade content routing. Consider increasing resource limits. See debug logs for details.")
		})
	}
}
Contributor

How does it look when testing this out within go-ipfs? Have you been able to use this mechanism to discover what we should raise the default limits to such that this should not trigger for people?

See debug logs for details.

What types of logs are you expecting people are going to examine? For example, do they need to look at the dht-crawler logs rather than the fullrt ones, do they need to look at rcmgr logs, etc.

I'm not sure there's enough info here for a user to "figure it out" beyond just filing an issue on this repo. If we need to add text to the README or put a link to a wiki for this repo, we can do that; otherwise, if there's some simple text, we can use that.

Contributor Author

How does it look when testing this out within go-ipfs? Have you been able to use this mechanism to discover what we should raise the default limits to such that this should not trigger for people?

So the logs look like this (next to ipfs/kubo#8980):

2022-05-18T11:34:24.234-0400	ERROR	fullrtdht	fullrt/dht.go:304	Accelerated DHT client was unable to fully refresh its routing table due to Resource Manager limits, which may degrade content routing. Consider increasing resource limits. See debug logs for details.
2022-05-18T11:34:32.201-0400	ERROR	resourcemanager	libp2p/rcmgr_logging.go:51	Resource limits were exceeded 55405 times, consider inspecting logs and raising the resource manager limits.

There aren't details about which specific kind of limit was hit (outbound conns, streams, mem, FDs), or at which scope, so that error alone isn't enough to know which limits need to be raised. I can add more brains to ipfs/kubo#8980 to try to come up with some recommendation, but that will take some time, since we'd need to register DHT interactions w/ libp2p as a proper resource manager "system". I figured we could release it like this and see how frequently this happens (with raised default limits).

What types of logs are you expecting people are going to examine? For example, do they need to look at the dht-crawler logs rather than the fullrt ones, do they need to look at rcmgr logs, etc.

For these specific errors related to refreshing the accelerated DHT client's routing table, yeah it's just dht-crawler. I can add that into the error message, since this is specific to that case.

The accelerated DHT client more broadly uses at least three logging subsystems (dht-crawler, fullrtdht, dht) where RM errors might surface.
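
Hedged example, not part of this PR: inside a program that already imports the DHT packages, go-log's SetLogLevel can raise the level for the subsystems named above so the details behind the one-line error become visible (kubo users can alternatively set the GOLOG_LOG_LEVEL environment variable, e.g. GOLOG_LOG_LEVEL="dht-crawler=debug"). The helper below is illustrative.

package main

import (
	"fmt"

	logging "github.com/ipfs/go-log/v2"
)

func main() {
	// Subsystem names taken from the discussion above; SetLogLevel returns
	// an error if the level string is invalid or the subsystem is unknown.
	for _, name := range []string{"dht-crawler", "fullrtdht", "dht"} {
		if err := logging.SetLogLevel(name, "debug"); err != nil {
			fmt.Printf("could not enable debug logs for %s: %v\n", name, err)
		}
	}
}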
