Reporting client data loss stats to Jaeger backend #2005
Comments
I wouldn't expect that extracting metrics from clients is difficult, especially because applications should be exposing their own metrics anyway. Once metrics are collected, it should also be easy to get aggregated values with or without service names using any reasonably modern metrics backend, like Prometheus. But if you do need this internally, this might be a good indication that other people might benefit from it. I do have one question, though:
What do you have in mind for syncing this across all collectors in a cluster?
We did find this quite challenging, given how heterogeneous Uber's ecosystem is. Some apps, for example, still use statsd-style metrics that don't even support tags, and many more simply do not configure Jaeger clients with a metrics client because they are not interested in those metrics. So we want a more reliable way of collecting critical metrics like data loss (I don't want to replicate all 20 client metrics here).
I guess this approach will only work with the agents, where there is no need for syncing: each agent remembers the max batch seqNo it received from each unique client, and can use it to emit a metric of the total number of batches that the client thinks were sent. If batches arrive out of order (i.e. with a smaller seqNo than the agent remembers), it is safe to ignore them for the purpose of the metrics, because we will "catch up" once a batch with a higher seqNo arrives, since all counters included in the batch are cumulative.
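The per-client bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: the `clientStats` type, its field names, and the `tracker` helper are all hypothetical, and it assumes seqNo starts at 1 and that every counter in a batch is cumulative since client start.

```go
package main

import "sync"

// clientStats is a hypothetical payload attached to each batch.
// All counters are assumed to be cumulative since the client started.
type clientStats struct {
	ClientUUID   string // assumed to come from the client-uuid Process tag
	SeqNo        int64  // batch sequence number, assumed to start at 1
	DroppedSpans int64  // cumulative count of spans the client dropped
}

// tracker remembers the latest (highest-seqNo) stats seen per client.
type tracker struct {
	mu   sync.Mutex
	last map[string]clientStats
}

func newTracker() *tracker {
	return &tracker{last: make(map[string]clientStats)}
}

// observe returns the increments to add to the agent's global counters.
// A batch whose seqNo is not higher than the remembered one is ignored:
// because counters are cumulative, the next in-order batch catches up.
func (t *tracker) observe(s clientStats) (batchesDelta, droppedDelta int64) {
	t.mu.Lock()
	defer t.mu.Unlock()

	prev, seen := t.last[s.ClientUUID]
	if seen && s.SeqNo <= prev.SeqNo {
		return 0, 0 // out of order or duplicate: safe to skip
	}
	// For a client seen for the first time, prev is the zero value,
	// so the full cumulative counts are reported.
	t.last[s.ClientUUID] = s
	return s.SeqNo - prev.SeqNo, s.DroppedSpans - prev.DroppedSpans
}
```

With this shape, a batch with seqNo 3 arriving after seqNo 1 contributes a delta of 2 batches, and a late batch with seqNo 2 contributes nothing. The state is local to one agent, which is why no cross-instance syncing is needed in this sketch.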
I can see that, but aren't you then forcing the application into doing something it doesn't want, assuming this would be on by default?
Before merging the PRs for the clients, can we get an experimental version of the agent (draft PR would be fine), with a concrete example?
I don't plan on merging any PRs until I have an end to end solution working.
@jpkrohling fyi #2010
I left a review already. As it's marked as WIP, not sure you want another one already.
since Jaeger clients are deprecated, this is unlikely to move further, so closing
Requirement - what kind of business use case are you trying to solve?
We want to have accurate global measurements of the potential data loss in Jaeger clients.
Problem - what in Jaeger blocks you from solving the requirement?
Although Jaeger clients internally support integrations with metrics backends, not all users actually configure those, and even if they do, the metrics are often namespaced to the application, e.g. by including a tag service=xyz or using a custom prefix, which makes it more difficult to observe across the site.
Proposal - what do you suggest to solve the problem or improve the existing situation?
Allow clients to report certain data loss metrics when submitting span batches to Jaeger backend. The data elements should include:
The Jaeger clients (some, at least) are already including a client-uuid string tag in the Process, which will allow the agent/collector to monitor those reported stats and emit consistent metrics.