Reporting client data loss stats to Jaeger backend #2005

Closed
3 of 4 tasks
yurishkuro opened this issue Jan 6, 2020 · 8 comments
Labels
  • changelog:new-feature (Change that should be called out as new feature in CHANGELOG)
  • enhancement
  • help wanted (Features that maintainers are willing to accept but do not have cycles to implement)

Comments

@yurishkuro
Member

yurishkuro commented Jan 6, 2020

Requirement - what kind of business use case are you trying to solve?

We want to have accurate global measurements of the potential data loss in Jaeger clients.

Problem - what in Jaeger blocks you from solving the requirement?

Although Jaeger clients internally support integrations with metrics backends, not all users actually configure them. Even when they do, the metrics are often namespaced to the application, e.g. by including a service=xyz tag or using a custom prefix, which makes them more difficult to observe across the site.

Proposal - what do you suggest to solve the problem or improve the existing situation?

Allow clients to report certain data loss metrics when submitting span batches to Jaeger backend. The data elements should include:

  • batch sequence number to detect possible dropped packets (especially with UDP transports)
  • count of spans dropped by the client due to internal queue overflow
  • count of spans dropped by the client due to exceeding max packet size (e.g. 65000 bytes cap for UDP transport)

Some Jaeger clients already include a client-uuid string tag in the Process, which would allow the agent/collector to attribute the reported stats to a specific client and emit consistent metrics.

@ghost ghost added the needs-triage label Jan 6, 2020
@yurishkuro yurishkuro self-assigned this Jan 6, 2020
@yurishkuro yurishkuro changed the title Reporting some client stats to Jaeger backend Reporting client data loss stats to Jaeger backend Jan 6, 2020
@yurishkuro
Member Author

@jpkrohling @pavolloffay fyi

@jpkrohling
Contributor

I wouldn't expect that extracting metrics from clients is difficult, especially because applications should be exposing their own metrics anyway. Once metrics are collected, it should also be easy to get aggregated values with or without service names using any reasonably modern metrics backend, like Prometheus.

But if you do need this internally, this might be a good indication that other people might benefit from it.

I do have one question, though:

> batch sequence number to detect possible dropped packets (especially with UDP transports)

What do you have in mind for syncing this across all collectors in a cluster?

@yurishkuro
Member Author

yurishkuro commented Jan 7, 2020

> I wouldn't expect that extracting metrics from clients is difficult

We indeed found this quite challenging, given how heterogeneous Uber's ecosystem is. Some apps, for example, are still using statsd-style metrics that don't even support tags, and many more apps simply do not configure Jaeger clients with a metrics client because they are not interested in those metrics. So we want a more guaranteed way of collecting critical metrics like data loss (I don't want to replicate all 20 client metrics here).

> What do you have in mind for syncing this across all collectors in a cluster?

I guess this approach will only work with the agents, where there is no need for syncing: each agent remembers the max batch seqNo it received from each unique client and can use it to emit a metric of the total number of batches the client thinks were sent. If batches arrive out of order (i.e. with a smaller seqNo than what the agent remembers), it is safe to ignore them for the purpose of the metrics, because we will "catch up" once a batch with a higher seqNo arrives, since all counters included in the batch are cumulative.

@jpkrohling
Contributor

> apps simply do not configure Jaeger clients with a metrics client, because they are not interested in those metrics

I can see that, but aren't you then forcing the application into doing something it doesn't want, assuming this would be on by default?

> I guess this approach will only work with the agents, where there is no need for syncing: each agent remembers the max batch seqNo it received from each unique client and can use it to emit a metric of the total number of batches the client thinks were sent.

Before merging the PRs for the clients, can we get an experimental version of the agent (draft PR would be fine), with a concrete example?

@yurishkuro
Member Author

I don't plan on merging any PRs until I have an end-to-end solution working.

@yurishkuro
Member Author

@jpkrohling fyi #2010

@jpkrohling
Contributor

I left a review already. Since it's marked as WIP, I'm not sure you want another one yet.

@yurishkuro yurishkuro removed their assignment Nov 30, 2020
@yurishkuro yurishkuro added the help wanted Features that maintainers are willing to accept but do not have cycles to implement label Nov 30, 2020
@yurishkuro
Member Author

Since Jaeger clients are deprecated, this is unlikely to move further, so closing.
