cf_exporter >= 1.0 doesn't handle apps & spaces being deleted during a scrape gracefully #85

Closed
risicle opened this issue May 5, 2023 · 3 comments · Fixed by #87 or cloudfoundry/prometheus-boshrelease#475

Comments

@risicle
Contributor

risicle commented May 5, 2023

I think the title summarizes it pretty well. #82 no longer appears to be an issue, but we're still getting scrape errors post-prometheus-boshrelease 28.x.

They seem to be related to error messages of the form `could not find app summary with guid 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'` and `unexpected status code 404 on request /v2/spaces/yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy/summary`. Each of these fails the scrape and leaves a hole in our metrics. This is presumably the result of an app or space having been listed at one point in the scrape, but no longer being present at a later point.

The problem is that this isn't a rare occurrence for us: we have continuous smoke tests creating and deleting apps, and acceptance tests that run occasionally and do the same.

@psycofdj
Contributor

psycofdj commented May 5, 2023

I'm glad that #82 has been resolved.

The exporter implements a strategy where the space list, space summaries and applications are scraped asynchronously. Space summary data is used to output metrics about the applications belonging to that space.

In your case, my guess is that you only have `{__name__=~"cf_.*_scrape_errors"} == 1` metrics, since at least one error occurred during the async scrape process.

Can you confirm that the expected behavior in such a case would be to have all metrics except those for the apps that were deleted between the app listing and the space summary fetch?

@risicle
Contributor Author

risicle commented May 9, 2023

Apologies, long weekend here. Yes, the most sensible thing would probably be to pretend such apps had never been seen by the first scrape. Many thanks for addressing this - we will try it soon.

@risicle
Contributor Author

risicle commented May 22, 2023

This is a lot better now, thanks. Ideally it wouldn't increment `last_<type>_scrape_error`, because that means we still get a lot of bogus alerts.

2 participants