Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gracefully handle termination and unexpected messages #17

Merged
merged 2 commits into from
Oct 30, 2024

Conversation

mjm
Copy link
Contributor

@mjm mjm commented Oct 30, 2024

We've recently tried to adopt Peep as a replacement for TelemetryMetricsStatsd. We just switched to using a UDS to send metrics to Datadog, and in the process found this bug in OTP: erlang/otp#8989. We optimistically hoped Peep's batching would avoid this issue.

It turned out to still hit that issue, which is fine: that's not Peep's bug. But the way it affected Peep was very destructive: when Peep received an unexpected message, it brought down the entire application! 😱

I believe the reason for this is that the error raised from Peep not having a matching handle_info clause for the message caused the Peep server to terminate for a reason other than :shutdown, which is the only reason its terminate callback is able to handle. So another function clause error is raised, this time while handling the trapped exit, and I think this error is what brings the whole application crashing down. I'm not an OTP expert (yet) but my intuition is that an error raised while handling a trapped exit breaks the normal restarting that would be handled by the server's supervisor, and it cascades up from there.

Even if you're not hitting the OTP bug mentioned above, this behavior is very easy to reproduce: lookup the PID for the name used for Peep with GenServer.whereis and then send it an arbitrary message it's not expecting.

So I've made two changes here that seem to address this issue for us. First, accept any reason for terminate rather than just :shutdown. It seems to be very important for terminate to always succeed, and I don't see a need to handle any specific exit reason differently.

That change is sufficient to make the Peep server restart correctly when it receives unexpected processes, and presumably if it were to exit for any other unexpected reason. But given that we are still being hit by that OTP bug, we would still rather the server not restart on us. So I also introduced a catch-all handle_info case that logs a warning for unexpected messages, which allows the server to keep running. This avoids it dropping its ETS table when we hit this bug. I can understand if you wouldn't want to accept that commit though on the basis that it's better to crash early for unexpected things happening.

Thanks for your work and thank you for your time!

@rkallos
Copy link
Owner

rkallos commented Oct 30, 2024

Hey Matt! Thank you for your PR.

First, accept any reason for terminate rather than just :shutdown. It seems to be very important for terminate to always succeed, and I don't see a need to handle any specific exit reason differently.

You're absolutely right. I wound up adding this exact change to a PR I've been working on, but I think I'd prefer to merge your stuff and rebase my work on your changes.

So I also introduced a catch-all handle_info case that logs a warning for unexpected messages, which allows the server to keep running.

Nice! Admittedly I don't use the Statsd parts of Peep anymore since changing jobs a bit over a year ago, but I'm a bit surprised I didn't run into this issue running Peep in containers in k8s in the past.

I'm not sure I agree with the approach of logging the unexpected message. My main concern is that a user of Peep can't really do anything with these log messages, particularly in the case of this bug you're seeing in Erlang/OTP, other than forking Peep and removing the Logger call.

I think I would prefer if a Peep process either silently discards all unknown messages, or it handles the specific {:inet_reply, _port, _ref} messages, and continues to crash on receiving unexpected messages. I'm currently leaning towards the first approach.

@mjm
Copy link
Contributor Author

mjm commented Oct 30, 2024

Fair point. I can update the PR to remove the logging, and maybe drop a comment in there instead?

It's possible we're hitting this because the sysctl net.unix.max_dgram_qlen is really low. It seems to be defaulting to 10 both in Docker Desktop and on our GKE nodes (or maybe it's Docker setting it low? I'm not sure). We need to figure out how to increase it in our infra, which might make us not really hit that particular issue anymore.

@rkallos
Copy link
Owner

rkallos commented Oct 30, 2024

I can update the PR to remove the logging, and maybe drop a comment in there instead?

Sounds good! A short comment explaining why Peep sinkholes unknown messages might be useful to curious readers.

Yeah, I read the issue you linked to. That does seem rather low. Nevertheless, this raises misbehaviour(s) in Peep that shouldn't happen, so I still think it's important to merge your changes, even if you manage to fix the underlying issue.

@mjm mjm force-pushed the push-ylkrxquusrrl branch from 1358fa8 to 2252345 Compare October 30, 2024 17:48
@mjm
Copy link
Contributor Author

mjm commented Oct 30, 2024

Alright I replaced the logs with a comment that hopefully isn't too verbose :) I linked back to this PR and the OTP issue in case someone wants more context in the future.

And yeah, I agree that it's best for Peep to be resilient to this kind of thing even once the bug that surfaced it is resolved.

Thanks for being so responsive!

@rkallos rkallos merged commit 21f0759 into rkallos:main Oct 30, 2024
1 check passed
@rkallos
Copy link
Owner

rkallos commented Oct 30, 2024

Thank you! <3

@mjm
Copy link
Contributor Author

mjm commented Oct 30, 2024

Quick question: do you think you'll be cutting a new Hex release with this soon (within the day)? No rush or anything, just wondering if I should update our app to point at main or wait for the release.

@mjm
Copy link
Contributor Author

mjm commented Oct 30, 2024

Nevermind, you just did! Thank you very much!

@mjm mjm deleted the push-ylkrxquusrrl branch October 30, 2024 18:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants