Metrics instrumentation #522
What are the metrics interfaces to be supported? Where is this recorded? |
We use prometheus in ipfs and it works pretty well. We use the interfaces in go-metrics-interface and the implementation in go-metrics-prometheus so we can swap out metrics implementations. |
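A minimal sketch of that pattern, assuming the go-metrics-interface Creator/Gauge API and the Inject entry point of go-metrics-prometheus (the "swarm" scope and "peers_total" metric name are made up for illustration):

```go
package main

import (
	"context"

	metrics "github.com/ipfs/go-metrics-interface"
	mprome "github.com/ipfs/go-metrics-prometheus"
)

func main() {
	// Swap in the Prometheus implementation behind the generic interface.
	// Any other backend implementing go-metrics-interface could be injected instead.
	if err := mprome.Inject(); err != nil {
		panic(err)
	}

	// Subsystem code only ever talks to the interface, never to a concrete backend.
	ctx := metrics.CtxScope(context.Background(), "swarm")
	peers := metrics.NewCtx(ctx, "peers_total", "Number of connected peers").Gauge()
	peers.Set(42)
}
```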
yeah, let's continue down the prometheus/grafana train. as a note, i'd love to see our metrics interface extended to support "labels", arbitrary KV pairs. these labels will allow us to get deeper insight into our relatively abstract gauges/counters/timers. two things i'd love to see instrumented off the bat:
|
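Picking up the "labels" idea from the comment above: in the Prometheus Go client, labels are modelled as vectors of metrics keyed by label values. A minimal sketch of a labelled counter (the metric and label names are invented for illustration, not taken from any libp2p package):

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// A counter with arbitrary KV labels ("protocol", "direction"), so the same
// abstract counter can be sliced per protocol and per stream direction.
var streamsOpened = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "libp2p_streams_opened_total",
	Help: "Number of streams opened, by protocol and direction.",
}, []string{"protocol", "direction"})

func main() {
	streamsOpened.WithLabelValues("/ipfs/kad/1.0.0", "inbound").Inc()
}
```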
I was initially in favour of OpenCensus; however, after experimentation, here are my observations on it:
|
One advantage of using OC for both metrics and traces is that it automatically saves "exemplar" traces for values in each histogram bucket. I also asked for a clarification of the direction they're taking with the whole
It's not obvious from the issue that they understood you were asking about gauge functions. I've explicitly asked in a follow-up question. Representing {Phantom}Drift (libp2p visualiser), we can make the metrics export for the introspection protocol work with whichever system. I do think that lazily computed gauge functions are important to us, and that doesn't even appear to be on the roadmap or sensical for OC because they aspire to be primarily push-based (encouraging use of OpenCensus Agent/Service). Not sure if I'd tag this as a dealbreaker, but it's certainly close to it. Also, the maturity and stability of Prometheus should not be undervalued. OpenCensus has to mature a great deal, I'm sure there will be significant API changes (as evidenced by the stats vs. metrics thing now), and if we choose it, it would be more of a risky bet on the grounds of seeing potential that we're willing to support with time and effort, versus choosing the proven, robust option. OTOH, if we want to incorporate tracing, automatic "exemplars" are a nicety we only get with OC, but not a panacea either. |
OC does support lazy metrics calculation via thunks: census-instrumentation/opencensus-go#1080 (comment) |
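For reference, this is roughly the shape of a lazily computed (derived) gauge in OpenCensus-Go as I understand the API discussed in that issue; the registry/producer wiring and the metric name are illustrative, not prescribed:

```go
package main

import (
	"runtime"

	"go.opencensus.io/metric"
	"go.opencensus.io/metric/metricproducer"
)

func main() {
	// A registry whose metrics are picked up by whatever exporter is configured.
	r := metric.NewRegistry()
	metricproducer.GlobalManager().AddProducer(r)

	g, err := r.AddInt64DerivedGauge("process_goroutines",
		metric.WithDescription("Number of goroutines, computed lazily at report time"))
	if err != nil {
		panic(err)
	}

	// The thunk is evaluated when the metric is read, not when it is registered.
	if err := g.UpsertEntry(func() int64 { return int64(runtime.NumGoroutine()) }); err != nil {
		panic(err)
	}
}
```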
OC Metrics/Stats difference here: census-instrumentation/opencensus-specs#253 (comment) |
An OpenCensus
Please link to something that would take no more than 2 hours to port from Prometheus to OpenCensus, as evidence of this point. On top of this, you can still register these tools when using OC. Snippet from cluster:
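The original snippet isn't reproduced here; as a rough sketch of the pattern being described (registering the standard Go and process collectors on a registry that the OC Prometheus exporter also serves), assuming the contrib.go.opencensus.io exporter, with the namespace and port made up:

```go
package main

import (
	"net/http"

	ocprom "contrib.go.opencensus.io/exporter/prometheus"
	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Standard client_golang collectors can still be registered when OC fronts the metrics.
	registry := prometheus.NewRegistry()
	registry.MustRegister(
		prometheus.NewGoCollector(),
		prometheus.NewProcessCollector(prometheus.ProcessCollectorOpts{}),
	)

	// The OC exporter writes its data into the same registry and serves the scrape endpoint.
	exporter, err := ocprom.NewExporter(ocprom.Options{
		Namespace: "example",
		Registry:  registry,
	})
	if err != nil {
		panic(err)
	}

	http.Handle("/metrics", exporter)
	_ = http.ListenAndServe(":8888", nil)
}
```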
This may be a pain for you as developers of an upstream package, but that is what makes the metrics of your package documented and accessible to your downstream users.
Regardless of the OC Agent/Service aspect, which is secondary to this discussion, OC is neither push nor pull; that is determined by the backend you configure it to use. If you configure Prometheus as the metrics backend, you still end up with the metrics endpoint that Prometheus then scrapes.
I don't think OC is going to need your support; it's come out of Google and now has Microsoft support. Not a guarantee of success by any means, but this isn't some vapourware project that we are talking about.
Yes, it is proven and robust; hence work like OpenMetrics, a Cloud Native Computing Foundation Sandbox project. Edit: removed points addressed between refreshes of the issue. Added code snippet. |
This sounds like an appeal to authority, and it's a poor interpretation of my argument. They won't need us, but it's undeniable that we are investing more time and effort in wrapping our heads around OC than around Prometheus. The takeaway for several of us is that it's immature and heavily in evolution, and the communication around ongoing changes isn't great (e.g. the stats vs. metrics thing). So yes, it's an investment from our side to keep up with their evolution, much more than just going with stable Prometheus.
Can you point me to a project that relies on – or builds upon – OpenMetrics? Just to put some magnitude on this. Number of packages importing each:
|
So far, these are my conclusions. With a few details clarified (e.g. lazy gauges), they appear to be at feature parity. The devil might be in the details, though. Note that this evaluation is for the METRICS functionality. For tracing, I think we're onboard with using OpenCensus.

Prometheus:
👍 Safe bet, mature. Widely used in Go. Mature instrumentation for Node, Rust, Java, Python, C++.
👎 Ties us to a specific metrics backend (even if there are converters out there); export format is textual.

OpenCensus:
👍 Open standard not tied to any particular solution (good for libp2p; as a library we should not impose a world view onto our users); segues synergistically into tracing; push support.
👎 In heavy evolution, little traction since 2018, heavily WIP; configuring it to export to Prometheus in a specific way proved debatably challenging; verbose configuration; not great DX due to lack of docs and opacity around changes (e.g. the stats vs. metrics thing); will require active participation and involvement upstream. Multi-language support still in the works (many libraries are alpha; not good for libp2p). Seems like a leap of faith (which we might want to take); all backends supported by OpenCensus already support Prometheus scraping; key features we want to use are still experimental and carry big fat warnings of API instability.

Let's hear another round of feedback, then propose a way forward by Friday. Whichever one it is, let's set a checkpoint in 1 month to evaluate how it's playing out. We will mark this feature as experimental to convey to our users that things are subject to change under the hood. If in 1 month we find that we're subjectively unhappy with the choice, the system gets in our way, certain features don't work as advertised, DX is bad, we feel frustrated, etc., we can either get involved upstream, or make a switch if we lack the capacity. I doubt we'll have any of these issues with Prometheus. EDIT: added new pros/cons; clarified this evaluation is for the metrics aspect. |
I clearly added a caveat to clarify this wasn't an appeal to authority but a rebuttal to the implication that OC was some small-time project.
You misunderstood; I am agreeing that Prometheus is robust, hence OpenMetrics taking and extending its data model, with OC implementing the more generalised standard.
To add context to your numbers: OpenCensus was released in January 2018.
Evidence of this please? I had no issue doing so in Cluster...
Can you please explain what the issue is here? I don't have any trouble understanding the difference. |
Can't say I'm impressed by the traction. EDIT: Not implying it's not a useful investment, I'm just pointing out we'd be very early adopters, and we don't have a seat at the table.
@anacrolix, could you point to concrete code samples? I was summarising the first bullet point from your feedback.
Contextualising the citation:
What I'm saying is that these are two disjoint ways of generating metrics, and if I were an OC user who had invested in complying with their |
I meant summaries, which were in the linked document.
That's not a good argument, as much as I love implementing stuff.
https://godoc.org/go.opencensus.io/stats/view#SetReportingPeriod and https://github.com/anacrolix/go-libp2p-dht-tool/blob/feat/oc-metrics/main.go#L374 |
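For readers following the first link: SetReportingPeriod controls how often aggregated view data is flushed to the configured exporter(s). A minimal sketch of registering a view and setting the period, with measure name, buckets, and recorded value all invented for illustration:

```go
package main

import (
	"context"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// A latency measure recorded by application code.
var latencyMs = stats.Float64("dht/query_latency", "DHT query latency", stats.UnitMilliseconds)

func main() {
	// A view aggregates the measure into a distribution that exporters can read.
	if err := view.Register(&view.View{
		Name:        "dht/query_latency",
		Measure:     latencyMs,
		Description: "Distribution of DHT query latencies",
		Aggregation: view.Distribution(5, 25, 100, 500, 2500),
	}); err != nil {
		panic(err)
	}

	// How often registered views are reported to exporters.
	view.SetReportingPeriod(10 * time.Second)

	stats.Record(context.Background(), latencyMs.M(37))
}
```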
@raulk there are a lot of code links further down in my summary. Regarding the overhead, it took me most of a day to convert a small number of metrics from OC to Prometheus. Here are some conversions from OC metrics to Prometheus for go-libp2p-dht-tool, and for go-libp2p-kad-dht. Note that the Prometheus side of things has a few more features in the DHT, as I've continued adding metrics there. Some of the file diffs are collapsed. |
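To give a flavour of what such a conversion involves (not taken from either repo; the names are illustrative): an OC measure plus view pair typically collapses into a single client_golang metric, as sketched here.

```go
package dhtmetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// OpenCensus flavour (roughly what gets removed in such a conversion):
//
//	var queries = stats.Int64("dht/outbound_queries", "Outbound DHT queries", stats.UnitDimensionless)
//	_ = view.Register(&view.View{Measure: queries, Aggregation: view.Count()})
//	stats.Record(ctx, queries.M(1))
//
// Prometheus flavour: the measure + view pair collapses into one counter.
var outboundQueries = promauto.NewCounter(prometheus.CounterOpts{
	Name: "dht_outbound_queries_total",
	Help: "Outbound DHT queries.",
})

func recordQuery() {
	outboundQueries.Inc()
}
```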
After carefully weighing pros and cons, and speaking to a number of stakeholders (incl. @frrist and @scout today) for alignment, I recommend going with OpenCensus for both metrics and tracing, and really striving to make it work. We talked about a 1-month evaluation period, but I want us to try as hard as possible to make this a success. For me, that pathway is an ultimate backstop/escape hatch, and we'll need to justify it clearly and quantitatively if we use that card. Here are my closing notes:
Thank you all for the constructive input! Let's instrument all the things! 💪 |
thanks for collecting all of this, @raulk! sounds good to me! |
@raulk on the k8s front, searching the org does turn up results, but only on the tracing front; granted, k8s just rebuilt their metrics collection tooling (heapster -> metrics-server). Also, I have mentioned this elsewhere, but to reiterate: I am more than happy to help out on any rough edges and on getting this implemented smoothly across all the orgs. Thanks for diving deep on this, and sorry if the discussion got too passionate at times 🙏 |
Closing. The current metrics effort is tracked in #1356. |
We need to instrument the various subsystems for metrics collection.
This issue is here to track progress towards instrumentation.
Subsystems that need to be instrumented:
cc @Stebalien @raulk @bigs @anacrolix