Feature request: Logs for penalty box state changes #1478

Open
rraub opened this issue Aug 30, 2024 · 1 comment

@rraub
Contributor

rraub commented Aug 30, 2024

Describe the feature
I want to improve the observability of devices making their way into and out of the penalty box. Our standard info-level logs are a simple way to achieve this. The volume should be low, so I have no capacity concerns with this recommendation.

The problem I'm trying to address is automated monitoring of changes that push firmware out to a large number of devices. These logs would let us generate metrics, such as a graph over time of devices put in the box. They would also give useful debugging context to tools outside of NervesHub (assuming they can search the NervesHub logs), helping to explain why a device is not receiving its expected firmware updates.

We currently have these types of connect/disconnect logs:

19:03:25.932 [info] pid=<0.1078404.0> mfa=NervesHub.DeviceReporter.handle_event/4 identifier=XXX event=nerves_hub.devices.disconnect ref_id=XX Device disconnected
17:45:44.247 [info] pid=<0.8824444.0> mfa=NervesHub.DeviceReporter.handle_event/4 identifier=XXX event=nerves_hub.devices.connect firmware_uuid=XXX Device connected

We could build off of this model and introduce additional events:
nerves_hub.devices.penaltybox.in
nerves_hub.devices.penaltybox.out
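
To illustrate how such logs could feed a metric, here is a minimal Python sketch that parses `key=value` log lines in the format shown above and counts penalty box events per hour. It is only a sketch under assumptions: the two event names are the ones proposed in this issue and do not exist yet, and the sample lines (including their message text) are made up for illustration.

```python
import re
from collections import Counter

# Matches key=value tokens such as event=nerves_hub.devices.connect
KV_PATTERN = re.compile(r"(\w+)=(\S+)")

# Event names proposed in this issue (assumption: not emitted by NervesHub today).
PENALTY_EVENTS = {
    "nerves_hub.devices.penaltybox.in",
    "nerves_hub.devices.penaltybox.out",
}

# Hypothetical sample lines modeled on the existing connect/disconnect logs.
SAMPLE_LINES = [
    "19:03:25.932 [info] identifier=AAA event=nerves_hub.devices.penaltybox.in Device put in penalty box",
    "19:40:10.100 [info] identifier=BBB event=nerves_hub.devices.penaltybox.in Device put in penalty box",
    "19:55:01.000 [info] identifier=CCC event=nerves_hub.devices.connect firmware_uuid=XXX Device connected",
    "20:15:44.247 [info] identifier=AAA event=nerves_hub.devices.penaltybox.out Device released from penalty box",
]


def parse_line(line):
    """Return (timestamp, fields) for one log line, or None if it has no body."""
    parts = line.strip().split(maxsplit=1)
    if len(parts) < 2:
        return None
    timestamp, rest = parts
    return timestamp, dict(KV_PATTERN.findall(rest))


def count_penalty_events(lines):
    """Count penalty box events per (hour, event) bucket."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        timestamp, fields = parsed
        event = fields.get("event")
        if event in PENALTY_EVENTS:
            hour = timestamp.split(":")[0]  # "19:03:25.932" -> "19"
            counts[(hour, event)] += 1
    return counts


if __name__ == "__main__":
    for (hour, event), total in sorted(count_penalty_events(SAMPLE_LINES).items()):
        print(hour, event, total)
```

The per-hour counts are the series a dashboard would graph. Since the existing timestamps carry no date, a real pipeline would need the date from the log shipper.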

Additional context
Bonus points if the logs can include the reason for the change: did someone manually move the device in or out, or was it an automatic action based on thresholds?

@joshk
Collaborator

joshk commented Sep 10, 2024

I'm sorry for my delayed reply. I've been pondering this a bit and had an idea.

I agree we could log more, and that is a quick win.
I also think we should add some more telemetry, which is another quick win.

But the bigger idea I had was adding a 'key' to the Audit Logs table, which could allow us to show metrics based on audit log events across a product.

I need to play this out more. I essentially want to see more of this data in the UI so you can spot spikes quickly, without having to hunt for this info. I'd also like to see some alerting, most likely to Slack, so you can be warned of these issues as they start to appear.
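
As a rough illustration of what "spotting a spike" could mean, here is a small Python sketch that flags any hourly bucket whose count is well above the average of the preceding buckets. Everything in it is an assumption made for illustration: the function name, the window size, the multiplier, and the minimum count are hypothetical, and it only returns the flagged positions rather than sending anything to Slack.

```python
def find_spikes(hourly_counts, window=3, factor=3.0, min_count=5):
    """Return indexes of buckets that look like spikes.

    A bucket is flagged when its count is at least `min_count` and is more
    than `factor` times the mean of the previous `window` buckets.
    All thresholds are illustrative defaults, not tuned values.
    """
    spikes = []
    for i in range(window, len(hourly_counts)):
        baseline = sum(hourly_counts[i - window:i]) / window
        current = hourly_counts[i]
        if current >= min_count and current > factor * baseline:
            spikes.append(i)
    return spikes


if __name__ == "__main__":
    # Hypothetical counts of devices entering the penalty box per hour.
    print(find_spikes([1, 2, 1, 2, 20, 3]))
```

The `min_count` floor keeps a jump from 0 to 1 device from being reported as a spike on a quiet product.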
