Expose metrics about the database dump #821

Open
2 of 6 tasks
fridex opened this issue Jan 28, 2022 · 9 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/observability Categorizes an issue or PR as relevant to SIG Observability

Comments

@fridex
Contributor

fridex commented Jan 28, 2022

Is your feature request related to a problem? Please describe.

As a Thoth operator and Thoth administrator, I would like to have an overview of the database dumps available on Ceph. This way, I can be sure that periodic dumps are being done (and that they are running).

Describe the solution you'd like

Create a metric that exposes information about database dumps available in the deployment. If there are no database dumps for a certain period of time (let's say 2 days), the Thoth operator should be alerted about this fact.

  • list available PG dumps on Ceph
  • parse the datetime when they were created (part of the filename)
  • expose this information in a Grafana dashboard
  • create an alert if no database backup was done in the past X days (parametrizable)
  • expose information about the number of dumps available
  • alert if the number of database dumps goes below a certain number or reaches a certain number (parametrizable)
    • this indicates that the dumps are not properly cleaned up or that they are being dropped
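
The first steps in the list above (parsing the creation time from the filename, counting dumps, and finding the age of the newest one) could be sketched roughly as follows. This is only an illustration: the filename layout `pg_dump-YYMMDD-HHMMSS.sql` and the helper names `parse_dump_datetime` and `summarize_dumps` are assumptions, not the naming actually used in the deployment, and the listing of objects on Ceph is stood in for by a plain list of strings.

```python
import re
from datetime import datetime, timezone
from typing import Iterable, Optional

# Assumed (hypothetical) filename layout: pg_dump-YYMMDD-HHMMSS.sql
_DUMP_RE = re.compile(r"^pg_dump-(\d{6}-\d{6})\.sql$")


def parse_dump_datetime(filename: str) -> Optional[datetime]:
    """Return the creation time encoded in a dump filename, or None if it does not match."""
    match = _DUMP_RE.match(filename)
    if match is None:
        return None
    try:
        parsed = datetime.strptime(match.group(1), "%y%m%d-%H%M%S")
    except ValueError:
        # Digits were present but do not form a valid date/time.
        return None
    # Assumption: timestamps in filenames are in UTC.
    return parsed.replace(tzinfo=timezone.utc)


def summarize_dumps(filenames: Iterable[str], now: datetime) -> dict:
    """Compute the values a metric could expose: dump count and age of the newest dump."""
    created = sorted(
        d for d in (parse_dump_datetime(name) for name in filenames) if d is not None
    )
    latest = created[-1] if created else None
    age = (now - latest).total_seconds() if latest is not None else None
    return {"count": len(created), "latest": latest, "latest_age_seconds": age}
```

The two numbers returned here (`count` and `latest_age_seconds`) are the kind of values that could be exported as gauges and used for the dashboard and the alerts described above.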
@fridex added the kind/feature and triage/needs-information labels on Jan 28, 2022
@pacospace
Contributor

Thanks @fridex, what about having metrics in https://github.com/thoth-station/graph-backup-job that tell us whether a dump was created or not?

@fridex
Contributor Author

fridex commented Jan 28, 2022

Thanks @fridex, what about having metrics in https://github.com/thoth-station/graph-backup-job that tell us whether a dump was created or not?

That sounds great 👍🏻

@pacospace
Contributor

Thanks @fridex, what about having metrics in https://github.com/thoth-station/graph-backup-job that tell us whether a dump was created or not?

That sounds great 👍🏻

Added metrics in thoth-station/graph-backup-job#215 🚀

@pacospace self-assigned this on Feb 8, 2022
@pacospace added the lifecycle/active label and removed the triage/needs-information label on Feb 8, 2022
@pacospace
Contributor

Is your feature request related to a problem? Please describe.

As a Thoth operator and Thoth administrator, I would like to have an overview of the database dumps available on Ceph. This way, I can be sure that periodic dumps are being done (and that they are running).

Describe the solution you'd like

Create a metric that exposes information about database dumps available in the deployment. If there are no database dumps for a certain period of time (let's say 2 days), the Thoth operator should be alerted about this fact.

  • list available PG dumps on Ceph

@fridex do you mean exposing all PG dumps in a Grafana panel, so that all dates are visible?

  • parse the datetime when they were created (part of the filename)

  • expose this information in a Grafana dashboard

  • create an alert if no database backup was done in the past X days (parametrizable)

  • expose information about the number of dumps available

  • alert if the number of database dumps goes below a certain number or reaches a certain number (parametrizable)

    • this indicates that the dumps are not properly cleaned up or that they are being dropped

@goern added the priority/critical-urgent label on Feb 16, 2022
@codificat
Member

/sig observability
/unassign @pacospace
/remove-lifecycle active

@sesheta added the sig/observability label and removed the lifecycle/active label on May 2, 2022
@sesheta
Member

sesheta commented Jul 31, 2022

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@sesheta added the lifecycle/stale label on Jul 31, 2022
@mayaCostantini
Contributor

/remove lifecycle-stale
/lifecycle frozen

@sesheta added the lifecycle/frozen label and removed the lifecycle/stale label on Jul 31, 2022
@VannTen
Member

VannTen commented Sep 2, 2022

Thanks @fridex, what about having metrics in https://github.com/thoth-station/graph-backup-job that tell us whether a dump was created or not?

Is that really a safe way? I mean, we want a metric about the state, not the action, right?
@goern is this still priority critical?
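
To make the state-versus-action distinction concrete: a state-based check looks at which dumps are actually present right now, independent of whether the last backup job reported success. A minimal sketch, assuming the creation times have already been obtained from a listing of the stored dumps; the function name, the alert names, and the thresholds are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Sequence


def evaluate_dump_state(
    created: Sequence[datetime],
    now: datetime,
    max_age: timedelta,
    min_count: int,
    max_count: int,
) -> List[str]:
    """Return the names of alerts that should fire for the dumps currently stored."""
    alerts = []
    # No dump at all, or the newest dump is older than the allowed period.
    if not created or now - max(created) > max_age:
        alerts.append("no_recent_dump")
    # Fewer dumps than expected: dumps may have been dropped.
    if len(created) < min_count:
        alerts.append("too_few_dumps")
    # Upper limit reached: old dumps may not be cleaned up properly.
    if len(created) >= max_count:
        alerts.append("too_many_dumps")
    return alerts
```

Because the input is the set of dumps that exist, this check still fires if a job reports success but the resulting file is missing or later removed.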

@codificat
Member

/remove-priority critical-urgent
/priority important-longterm

@sesheta added the priority/important-longterm label and removed the priority/critical-urgent label on Feb 2, 2023
@goern added the priority/backlog label and removed the priority/important-longterm and lifecycle/frozen labels on Feb 7, 2023
Projects
Status: 🆕 New
Development

No branches or pull requests

7 participants