Gracefully handle termination and unexpected messages #17

mjm · 2024-10-30T00:02:22Z

We've recently tried to adopt Peep as a replacement for TelemetryMetricsStatsd. We just switched to using a UDS to send metrics to Datadog, and in the process found this bug in OTP: erlang/otp#8989. We optimistically hoped Peep's batching would avoid this issue.

It turned out to still hit that issue, which is fine: that's not Peep's bug. But the way it affected Peep was very destructive: when Peep received an unexpected message, it brought down the entire application! 😱

I believe the reason for this is that the error raised from Peep not having a matching handle_info clause for the message caused the Peep server to terminate for a reason other than :shutdown, which is the only reason its terminate callback is able to handle. So another function clause error is raised, this time while handling the trapped exit, and I think this error is what brings the whole application crashing down. I'm not an OTP expert (yet) but my intuition is that an error raised while handling a trapped exit breaks the normal restarting that would be handled by the server's supervisor, and it cascades up from there.

Even if you're not hitting the OTP bug mentioned above, this behavior is very easy to reproduce: lookup the PID for the name used for Peep with GenServer.whereis and then send it an arbitrary message it's not expecting.

So I've made two changes here that seem to address this issue for us. First, accept any reason for terminate rather than just :shutdown. It seems to be very important for terminate to always succeed, and I don't see a need to handle any specific exit reason differently.

That change is sufficient to make the Peep server restart correctly when it receives unexpected processes, and presumably if it were to exit for any other unexpected reason. But given that we are still being hit by that OTP bug, we would still rather the server not restart on us. So I also introduced a catch-all handle_info case that logs a warning for unexpected messages, which allows the server to keep running. This avoids it dropping its ETS table when we hit this bug. I can understand if you wouldn't want to accept that commit though on the basis that it's better to crash early for unexpected things happening.

Thanks for your work and thank you for your time!

rkallos · 2024-10-30T17:33:33Z

Hey Matt! Thank you for your PR.

First, accept any reason for terminate rather than just :shutdown. It seems to be very important for terminate to always succeed, and I don't see a need to handle any specific exit reason differently.

You're absolutely right. I wound up adding this exact change to a PR I've been working on, but I think I'd prefer to merge your stuff and rebase my work on your changes.

So I also introduced a catch-all handle_info case that logs a warning for unexpected messages, which allows the server to keep running.

Nice! Admittedly I don't use the Statsd parts of Peep anymore since changing jobs a bit over a year ago, but I'm a bit surprised I didn't run into this issue running Peep in containers in k8s in the past.

I'm not sure I agree with the approach of logging the unexpected message. My main concern is that a user of Peep can't really do anything with these log messages, particularly in the case of this bug you're seeing in Erlang/OTP, other than forking Peep and removing the Logger call.

I think I would prefer if a Peep process either silently discards all unknown messages, or it handles the specific {:inet_reply, _port, _ref} messages, and continues to crash on receiving unexpected messages. I'm currently leaning towards the first approach.

mjm · 2024-10-30T17:40:47Z

Fair point. I can update the PR to remove the logging, and maybe drop a comment in there instead?

It's possible we're hitting this because the sysctl net.unix.max_dgram_qlen is really low. It seems to be defaulting to 10 both in Docker Desktop and on our GKE nodes (or maybe it's Docker setting it low? I'm not sure). We need to figure out how to increase it in our infra, which might make us not really hit that particular issue anymore.

rkallos · 2024-10-30T17:47:50Z

I can update the PR to remove the logging, and maybe drop a comment in there instead?

Sounds good! A short comment explaining why Peep sinkholes unknown messages might be useful to curious readers.

Yeah, I read the issue you linked to. That does seem rather low. Nevertheless, this raises misbehaviour(s) in Peep that shouldn't happen, so I still think it's important to merge your changes, even if you manage to fix the underlying issue.

mjm · 2024-10-30T17:50:14Z

Alright I replaced the logs with a comment that hopefully isn't too verbose :) I linked back to this PR and the OTP issue in case someone wants more context in the future.

And yeah, I agree that it's best for Peep to be resilient to this kind of thing even once the bug that surfaced it is resolved.

Thanks for being so responsive!

rkallos · 2024-10-30T17:55:15Z

Thank you! <3

mjm · 2024-10-30T17:57:28Z

Quick question: do you think you'll be cutting a new Hex release with this soon (within the day)? No rush or anything, just wondering if I should update our app to point at main or wait for the release.

mjm · 2024-10-30T17:59:49Z

Nevermind, you just did! Thank you very much!

detach event handlers with any termination reason

458a236

ignore unexpected info messages instead of terminating

2252345

mjm force-pushed the push-ylkrxquusrrl branch from 1358fa8 to 2252345 Compare October 30, 2024 17:48

rkallos merged commit 21f0759 into rkallos:main Oct 30, 2024
1 check passed

mjm deleted the push-ylkrxquusrrl branch October 30, 2024 18:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gracefully handle termination and unexpected messages #17

Gracefully handle termination and unexpected messages #17

mjm commented Oct 30, 2024

rkallos commented Oct 30, 2024

mjm commented Oct 30, 2024

rkallos commented Oct 30, 2024

mjm commented Oct 30, 2024

rkallos commented Oct 30, 2024

mjm commented Oct 30, 2024

mjm commented Oct 30, 2024

Gracefully handle termination and unexpected messages #17

Gracefully handle termination and unexpected messages #17

Conversation

mjm commented Oct 30, 2024

rkallos commented Oct 30, 2024

mjm commented Oct 30, 2024

rkallos commented Oct 30, 2024

mjm commented Oct 30, 2024

rkallos commented Oct 30, 2024

mjm commented Oct 30, 2024

mjm commented Oct 30, 2024