services/horizon/ingest: added 'horizon_ingest_errors_total' metric key #5302

Merged (7 commits) on May 10, 2024


@sreuland sreuland commented May 7, 2024

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

Added a new metrics counter key, 'horizon_ingest_errors_total', which the ingest FSM emits any time it traps an error returned from an ingestion state run.

Why

To configure a Prometheus alert against the new metric, giving early visibility when an ingestion halt starts to form: https://github.com/stellar/prometheus/pull/243

Closes: #5256

Known limitations


tamirms commented May 8, 2024

We log ingestion errors here:

https://github.com/stellar/go/blob/master/services/horizon/internal/ingest/main.go#L646

If you search on kibana you can see that these logs occur very rarely. I think it would make sense to have an ingestion error counter metric which is basically triggered every time we emit the log in https://github.com/stellar/go/blob/master/services/horizon/internal/ingest/main.go#L646. The metric definition should resemble the log message: a counter with 2 labels, current_state and next_state.
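To make the suggested metric shape concrete, here is a minimal sketch of a counter keyed by the two labels, current_state and next_state. It uses only the Go standard library as a stand-in for a real Prometheus counter, so it is not Horizon's implementation; the type names, method names, and state names are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// transition holds the two label values suggested in the review:
// current_state and next_state.
type transition struct {
	currentState string
	nextState    string
}

// errorCounter is a hypothetical stand-in for a labeled counter metric.
// Like a Prometheus counter, it only ever increases.
type errorCounter struct {
	mu     sync.Mutex
	counts map[transition]uint64
}

func newErrorCounter() *errorCounter {
	return &errorCounter{counts: make(map[transition]uint64)}
}

// Inc records one ingestion error for the given label pair.
func (c *errorCounter) Inc(currentState, nextState string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[transition{currentState, nextState}]++
}

// Value returns the count for one label pair (0 if never incremented).
func (c *errorCounter) Value(currentState, nextState string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[transition{currentState, nextState}]
}

func main() {
	errs := newErrorCounter()
	// The state names below are made up for illustration.
	errs.Inc("resume", "start")
	errs.Inc("resume", "start")
	errs.Inc("build", "start")
	fmt.Println(errs.Value("resume", "start")) // 2
	fmt.Println(errs.Value("build", "start"))  // 1
}
```

The idea is that Inc would be called at the same point where the error log line is emitted, so the metric and the log stay in step.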

We can then create the alert using a very simple heuristic: if the ingestion error counter is incremented at least n times in the last m minutes, fire the alert.

… new error counting metrics, per review feedback

socket-security bot commented May 9, 2024

New and removed dependencies detected. Learn more about Socket for GitHub ↗︎

Package: golang/github.com/stretchr/testify@v1.9.0
New capabilities: filesystem, network, shell, unsafe
Transitives: 0
Size: 616 kB
Publisher: (none listed)

🚮 Removed packages: golang/github.com/stretchr/testify@v1.8.4


@sreuland sreuland changed the title services/horizon/ingest: added 'horizon_ingest_error_restarts' metric key services/horizon/ingest: added 'horizon_ingest_errors_total' metric key May 9, 2024

sreuland commented May 9, 2024

> I think it would make sense to have an ingestion error counter metric which is basically triggered every time we emit the log in https://github.com/stellar/go/blob/master/services/horizon/internal/ingest/main.go#L646.

ah, thanks for pointer to that, updated - 3319801


@tamirms tamirms left a comment


I left a few minor comments but overall looks good!

@sreuland sreuland merged commit 38d28bb into stellar:master May 10, 2024
31 checks passed
Development

Successfully merging this pull request may close these issues.

services/horizon: add metrics for ingestions failures and alert on them