Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate Monkey Patching for Diagnostic Instrumentation #134

Closed
mike-kaufman opened this issue Dec 17, 2017 · 37 comments
Closed

Eliminate Monkey Patching for Diagnostic Instrumentation #134

mike-kaufman opened this issue Dec 17, 2017 · 37 comments

Comments

@mike-kaufman
Copy link
Contributor

mike-kaufman commented Dec 17, 2017

Forked from issue #95 so we can track seperately:

@watson wrote:

Problem: Can we standardise the way tracing information is sent from an instrumented module (called module-x below) to an APM agent?

Suggestions:

  • Maybe we can use the V8 trace events API
    - Pros: No need for module-x to know about the APM agents
    - Cons:
    - Currently no way to bind the trace event to the current context
    - Might be expensive to cross the barrier into C-land
    - Might not be detailed enough
    - Overhead of running in production?
  • Alternatively invent a new API in JavaScript land
    - This could either be in Node core or a userland module
    - If it’s a userland module, then module-x needs to detect if it’s present before it can send events to it. > The APM agents need to do the same.
    - What if there’s multiple versions of this module installed?
    - Research if other languages have something similar

To frame the problem a bit more concisely: Current APM vendors have to monkey-patch libraries to produce diagnostic data. E.g., if you want to know specific details of a DB query, one would monkey-patch the DB driver to capture any necessary params and stats.

This is problematic because

  • monkey-patching is brittle (it necessarily relies on internal implementation details),
  • each APM vendor has to effectively do the same thing.
  • multiple APM vendors loaded into the same process can trample on each other.
  • ESM modules wil require a custom loader to support monkey-patching.

We would like to get to a solution that has the following characteristics:

  • de-facto standard APIs for producing & consuming messages. (i.e., major APM vendors should buy into this)
  • ability to "phase-in" use of APIs by monkey-patching at first, and then pushing library maintainers to write to the "event sink" internally, thus eliminating needs for monkey-patching.
  • API is back-compat (ideally down to Node v4)
  • Ability to turn off event publication and then have "near-zero" overhead of any libraries that are leveraging publication APIs.
  • a solution that accounts for asynchronous context. That is, data events necessarily need to be correlated with asynchronous context. That said, we shouldn't conflate the problem of "monkey-patching for data events" (what we're tryign to solve here) with "monkey-patching to track async continuations" (addressed by async-hooks).
@AndreasMadsen
Copy link
Member

What are the cons of using async_hooks? I can think of a few but was this not discussed?

@Flarna
Copy link
Member

Flarna commented Dec 18, 2017

AsyncHooks are fine to correlated initiator and corresponding callbacks for native calls. But I don't think it's really possible to extract the data passed to the functions.

Besides that I think AsyncHooks have some blind spots:

  • Native extensions not using the AsyncHooks Embedder API (e.a. plain NAN or V8 APIs). This could be improved by integrating AsyncHooks Embedder API to NAN APIs used in such cases.
  • APIs which are based on parsing some data stream where there is no 1:1 mapping of outgoing writes and incoming data - in the end AsyncHooks were never designed for such cases.

@AndreasMadsen
Copy link
Member

In AsyncHooks the embedder can pass a custom resource object where they can store any information they like. Such as the data passed to a function.

@watson
Copy link
Member

watson commented Dec 18, 2017

Wouldn't you normally only implement the Embedder API on async boundaries - when queueing a callback, just before calling it, just after calling it etc. In APM we need sometimes hooks into other places of the code. For example instrumenting all middleware functions in Express or templating languages that doesn't do any async calls are both quite common.

@AndreasMadsen
Copy link
Member

Wouldn't you normally only implement the Embedder API on async boundaries - when queueing a callback, just before calling it, just after calling it etc. In APM we need sometimes hooks into other places of the code.

You could also use the Embedder API for making the context flow easier to understand. A database request, even if there is no queue, can involve a hundred async operations. Wrapping that with async_hooks seems very reasonable.

For example instrumenting all middleware functions in Express or templating languages that doesn't do any async calls are both quite common.

Indeed, for the purely synchronous cases it doesn't make sense to use async_hooks.


I think trace_events is the way to go when async_hooks doesn't make sense (actually async_hooks emits trace_events).

Currently no way to bind the trace event to the current context

This we can fix. The V8 team is already working on it.

Might be expensive to cross the barrier into C-land

I've done a great deal of benchmarking with trace_events versus more naive logging. The price of crossing the JS/C++ boundary is offset by the great performance of trace_events logging versus more native logging.

If someone would like to benchmark this for themselves, I have already implemented a basic version of trace_events on process.binding('trace_events'). See usage example here: https://github.com/nodejs/node/blob/master/lib/internal/trace_events_async_hooks.js

Might not be detailed enough

What details do you need? You can emit arbitrary events in trace_events you can also link things together. Yes, by default it only supports having two extra arguments per event, but these can contain strings so one could just JSON.stringify if there are more. A more involved implementation can also support more than two extra arguments.

Overhead of running in production?

For sure trace_events will have the smallest overhead. It was built to be used in production.

@Flarna
Copy link
Member

Flarna commented Dec 18, 2017

Maybe I haven't fully understood Async Hooks yet. If my task is to trace e.a. a http.get() call or or just a fs.read() . Which data to I get from AyncHooks here?

In our case we usually monitor high level requests e.a. a DB request, not all the 100 async calls needed to get it done. As a result we would like to extract the high level data not the low level protocol messages.

Sure, the 100 async interactions in between need to be tracked also in a lot cases but just to correlate request and cb, without extracting specific data. Here async hooks look really fine.

@danielkhan
Copy link
Contributor

We started working with the OpenCensus folks to create a generic, vendor neutral tracing API. Maybe this is something we can incorporate into Node.js as well. Did anyone already research using anything like OpenCensus, Dapper, OpenTracing? I'd almost expect that, as Google is driving that.
If not, I will put something together over xmas.

@AndreasMadsen
Copy link
Member

Maybe I haven't fully understood Async Hooks yet. If my task is to trace e.a. a http.get() call or or just a fs.read() . Which data to I get from AyncHooks here?

async_hooks will as a minimum trace the "atomic" requests. But you could also trace the non-atomic requests. You will typically end up doing that anyway for queues, which is why I don't see a problem with doing that too for non-queue cases.

@mike-kaufman
Copy link
Contributor Author

Did anyone already research using anything like OpenCensus, Dapper, OpenTracing? I'd almost expect that, as Google is driving that. If not, I will put something together over xmas.

@danielkhan - I'm not aware of anything. Would be great to see something like this, at a minimum just an inventory of different approaches so we can be somewhat systematic. nodejs/node-eps#48 looks like an effort to explose tracing_events through JS, which also may suffice here.

Also, to be clear, we should scope this issue to addressing the "monkey-patching for gathering metadata about API calls" problem. As discussed above, ppl are monkey-patching for two distinct reasons - tracking async context and capturing details about API calls (e.g., params passed to a DB call). Async hooks is intended for the former. I'm not following suggestions to use Async Hooks APIs for the latter, but perhaps I'm missing something. If this is a valid approach, we should list in above list.

@tedsuo
Copy link

tedsuo commented Dec 18, 2017

Ted from OpenTracing here. Would you consider the OpenTracing C++11 API, or a C API? We're in the process of integrating into a number of technologies, now would be a good time to start working with you!

https://github.com/opentracing/opentracing-cpp
Consider dynamic loading as a way to continue to ship node without needing to recompile in order to include a tracer: opentracing/opentracing-cpp#28

@vdeturckheim
Copy link
Member

vdeturckheim commented Dec 18, 2017

@mike-kaufman if I understand well, you are suggesting we would want some kind of proper overriding tooling in Node.js?

@mike-kaufman
Copy link
Contributor Author

mike-kaufman commented Dec 19, 2017

@vdeturckheim - not sure what you mean by "proper overriding tooling". What's been expressed here is that monkey patching is invasive & fragile - i.e., it relies on internal details of some library, and when those details change, monkey-patching breaks. Consequently, there's a desire to get away from it.

I think we'd like to find some solution where

  1. module authors (e.g., DB client libraries) have a consistent way to publish "data events" about interesting things happening in their library.
  2. Interested consumers (e.g., APM vendors) can listen for those events & act on them.

@mhdawson
Copy link
Member

mhdawson commented Jan 3, 2018

@tedsuo last year Node.js integrated a trace engine from google and we have been discussing how to add tracepoints etc. I'm not sure if the use of OpenTracing is relevant to this thread but it might be interesting as a discussion on its own. Ideally it we would be able to plug in different trace engines under the APIs that we expose in C and Javascript in Node.js for tracing. I've not read enough on OpenTracing to see how it would fit into that model. If you have some cycles to thing about it and possibly give the diagnostics team a n overview I'd suggest opening a new issue to start that discussion.

@SergeyKanzhelev
Copy link

@mike-kaufman thanks for pointing me to this thread.

@danielkhan we are discussing node.js OpenCensus SDK with Google. I think diagnostics channel and OpenCensus are very aligned. We just need to understand the proper layering and requirements of those.

@ofrobots
Copy link
Contributor

ofrobots commented Jan 9, 2018

For background, OpenCensus (github) is a new project to make a language and vendor neutral distributed tracing API and SDK. This work is in concert with the w3c proposal for distributed trace context. We are hoping to extract Node.js instrumentation code from Stackdriver Trace and incorporate that into the OpenCensus SDK as a starting point. Some of the other languages already have fairly functional SDKs, but we haven't gotten that far on JavaScript just yet.

Even with this in place, the question asked by @mike-kaufman in the OP still applies. How do we connect producers of high level tracing data (e.g. express) to the consumers (APM, debugging tools). This may be trace-events (but OP has valid concerns) or it may be something like the diagnostic channel module from Microsoft. I see both of these attempting to do the same thing from different ends (low-level vs. high level), and one thing I would like to achieve at the upcoming diagnostic summit is whether there is a path to intersect.

Some thoughts on the specific concerns by OP:

Maybe we can use the V8 trace events API
Pros: No need for module-x to know about the APM agents
Cons:

  • Currently no way to bind the trace event to the current context

This is a problem. We need the notion of context to be well defined within Node core. Once this notion is available, it would be fairly easy to bind trace events to the context.

  • Might be expensive to cross the barrier into C-land

This is a problem. We will need a way for the VMs to accelerate (JIT) tracing calls from JavaScript. I think we would probably want the VMs to provide an intrinsic.

  • Might not be detailed enough

IMO, this is addressable.

  • Overhead of running in production?

If we can avoid the JavaScript to C transitions, it would hard to beat the performance of trace-events with a pure JS.

Trace-event is designed first of all as a low level tracing mechanism to be used primarily from C/C++. The additional benefit it provides is that it acts like a single bus that aggregates all performance event data. Diagnostic channel does the same thing, but it will be limited to only high-level (i.e. not too high frequency) performance event data. Perhaps diagnostic channel is a good starting point. Once/if we have the trace-event API available from JS, we can additionally inject the diagnostic channel data into trace events as well, giving us a single stream of all performance event data?

Just some thoughts.

@mike-kaufman
Copy link
Contributor Author

mike-kaufman commented Jan 24, 2018

Based on @ofrobots comment above, I think there's a path forward here by the following:

  1. Land on a JS API for a pub/sub model for events. (e.g., DiagnosticChannel or something equivalent)
  2. Start building up monkey-patching libraries that use above API. Consumers (e.g., APM vendors) can start consuming these, and each vendor doesn't need to do their own monkey patching.
  3. Open appopriate PRs on monkey-patched libraries to make use of the API above, and start deprecating the monkey-patch libraries.
  4. Resolve perf implications of using trace_events from JS.
  5. Integrate API from 1 above with trace_events.
  6. open-census messages can be implemented on top of messages produced from API. i.e., there will be an "open-census" translation library that will subscribe to events & transform them into open-census things.

Notes:

  • above list isn't necessarily ordered or sequential. Some things can happen in parallel & some things can be moved up/down.
  • RE context formalization, I think that can be addressed at the JS API at first through monkey-patching and/or async-hooks, and then evolve as formalization efforts crystalize.

Does this sound like a fair interpretation of thread so far? Please let me know if I'm missing something.

@mike-kaufman mike-kaufman changed the title Get out of monkey patching Eliminate Monkey Patching for diagnostic instrumentation Jan 24, 2018
@mike-kaufman mike-kaufman changed the title Eliminate Monkey Patching for diagnostic instrumentation Eliminate Monkey Patching for Diagnostic Instrumentation Jan 24, 2018
@ofrobots
Copy link
Contributor

@mike-kaufman 👍

@mcollina
Copy link
Member

I would also note that monkey patching requires a custom loader to work with ESM. I would like to add this as one of the problems that removing monkey patching would solve.

As part of the goals, we should consider the LTS cycle. When opening PRs to 3rd-party libraries, we should be focusing on making sure that the library could run without modifications on old node releases (down to 4 ideally).

Moreover, we should ensure that when turned off, this tracing layer has zero overhead. One of the advantages of monkey patching is that when it's turned off it's off.

@mike-kaufman
Copy link
Contributor Author

mike-kaufman commented Jan 25, 2018

@mcollina - Thanks, updated summary above to reflect your comments. Note I changed "zero overhead" to "near zero" in case anyone is being pedantic. With tracing disabled, overhead should be a few instructions.

@Flarna
Copy link
Member

Flarna commented Jan 25, 2018

I don't think that only emitting trace events will fit for all usecases. Currently we have to modify the arguments array (e.g. by wrapping the passed callback) in quite some cases.
For some outbound requests (HTTP, message queues) we have to modify outgoing data (e.g. HTTP headers) to get our trace tag transmitted.
For inbound HTTP requests we even modify the response data stream in some cases.

Maybe async hooks can help to avoid the need to wrap callbacks in most cases as we get the call context via this hooks.

If I compare NodeJS instrumentation with other techs like Java or .NET the main difference is in my opinion not the access to internals (for Java/.NET byte code is patched). The main difference is that in NodeJs monkey patching is done by a lot people - not just by a few APM vendors (which are anyway not used in the same application at the same time).

Catching up with changes in internals is definitely not my favorite task. But till now it has shown that it's harder and more cumbersome to find issues caused by combining modules using monkey patching.

@mike-kaufman
Copy link
Contributor Author

I don't think that only emitting trace events will fit for all usecases.

@Flarna - if you could be more explicit about use cases that aren't handled by emitting "trace events", and potential any work-around, that would be helpful here. E.g., for adding http headers, can you inject a piece of middleware?

@kjin
Copy link
Contributor

kjin commented Jan 25, 2018

@ofrobots and I were discussing this just now -- to summarize, we worked with the following three requirements for getting out of monkeypatching for tracing:

  1. Context propagation (to propagate trace context within the application)
  2. Information retrieval (to create and populate trace data, such as path/query/etc.)
  3. Outgoing message metadata injection (to handle distributed trace context; this refers to injecting HTTP headers as @Flarna mentioned, but also generalized to include other RPC types other than HTTP)

The ecosystem adoption of async_hooks fulfills (1), and the diagnostic-channel module mentioned here addresses (2) with its pub/sub API. However, these APIs alone don't satisfy (3). Injecting a piece of middleware would work somewhat but only applies to server responses, not outgoing requests (on top of that, I would consider injecting middleware a form of monkeypatching).

What we think is that Node core could expose some API where an APM vendor would be able to specify that they want a set of key-value pairs to be added to every outgoing HTTP request upon calling http.request. A second, similar API might exist to specify the same thing but for server responses.

@Flarna
Copy link
Member

Flarna commented Jan 26, 2018

@mike-kaufman - I will try to get more concrete by describing some samples. Please note that this does not describe every detail/corner case.

Example 1: Simple outgoing request without tagging; e.g. database requests mongodb.find()
Currently we monkey patch find(). Once our wrapper is called we do following actions:

  1. check if we should monitor this request or not
  2. start a trace to book CPU time on it (name of function and type of this is used here)
  3. record operation specific start data to this trace (e.g. hostname, query,... - extracted from this and/or function arguments and/or data captured via earlier, related calls)
  4. modify arguments by replacing the user supplied callback (if present) with a wrapped version
  5. call original function
  6. patch returned promise (if present, depends on (4))
  7. sometimes we also capture a stacktrace and attach it to the trace to get deeper code level insight
  8. end trace (stops CPU time booking)

Once the wrapped callback is called (or the promise then/catch function) we do similar actions:

  1. start a trace linked to above to book CPU time for callback (name of function and type of this is used here)
  2. extract end data from arguments (e.g. error object, result size,...) and store it to the trace
  3. call original callback
  4. end trace (stops CPU time booking)

In case another patched function like mongodb.find() is called within the callback (or later after a setImmediate(),...) the same as above happens. From call context we are able to link them together and split CPU time in case of direct calls.

I could imagine that this use case can be completely covered by events and async_hooks (assuming enough info is passed to the events, e.g. we have begin/end events for all functions,...)
In case this high level events issued hold some sort of transaction id/context (maybe this is created by events consumer to avoid a table lookup) it's even not needed at all to use low level async_hooks to connect them.

Example 2: "Complicated" outgoing request without tagging; e.g. database request via query/stream object
Similar as above we patch the the API which triggers the request and creates a trace including entry data. But instead of wrapping the callback (4) or patching the promise (6) we have to patch the returned query object. Depending on how this query/stream object is designed it's needed to monkey patch some APIs and wrap callback registered. Usually we have to store our transaction context in these objects.
When results are issued by these objects we do similar actions as above for the single callback.

Not sure if we are able to trace transactional context via async_hooks along such multi step operations. If not, the trace events emitted by the module could contain an unique identifier for the operation to allow to track context.

Example 3: Simple outgoing request with tagging; e.g. HTTP request
Approach is similar as above without tagging. But additionally we create a trace tag (unique for this request) at (2) and inject it to the transmitted message.
In case of HTTP this is a dedicated request header we either inject by modifying the options passed to http.request() or by calling req.setHeader() afterwards (a better place would be maybe _storeHeader()).
In case of outbound HTTP request we have to ensure also that we capture all emitted events on request/response objects - there is no single callback.

Besides events to extract data and async_hooks to track context we need some hook here to inject a HTTP header. For other RPC/Messaging protocols a similar hook would be needed. Important here is that the trace tag injected shall be unique to the operation and is created during the actual call (at (2)).

Example 4: Incoming request, e.g. http.Server
In this case we wrap the incoming message handler. In case of http this is done by wrapping the callback registered to the 'request' event of http.Server. Once this function is called we do following actions:

  1. check if we should monitor this request or not
  2. extract the trace tag from incoming data (e.a. our HTTP trace header)
  3. start a trace using the tag extracted to book CPU time on it (name of callback and type of this is used here)
  4. record operation specific start data to this trace (e.g. hostname, HTTP method,...)
  5. depending on concrete request additionally patch request object to allow us to monitor follow-up data (e.a. HTTP form data)
  6. patch request object to allow us to wrap other listers (e.g. 'close', 'error') and get notified when request is ended (e.g. req.end() is called)
  7. call original callback
  8. end trace (stops CPU time booking)

Once wrapped events or end() is called we do similar actions:

  1. start a trace linked to above to book CPU time for callback (name of function and type of this is used here)
  2. extract end data from arguments (e.g. error object, status code, response headers,...) and store it to the trace
  3. call original callback/function
  4. end trace (stops CPU time booking)

With trace events emitted we could for sure cover most or even all of above functionality. Not sure if async_hooks allow to track context for all possible combinations of events/requests on HTTP request/response objects. I would expect that we need at least something on top of it to link the request event and the corresponding resp.end(). We know that there are frameworks out there which do pooling.

Example 5: Incoming HTTP request injecting a JS agent
This one is like example 4 but additionally we have to

  • parse and potentially modify the HTTP response data
  • remove/modify incoming and outgoing HTTP headers
  • catch incoming requests from injected JS agent and completely handle them (not visible to real application at all)

I fear that this use case is too heavy and specific to be covered by generic hooks/events....

@ofrobots
Copy link
Contributor

@Flarna thanks for the detailed examples. Can you elaborate on what you mean by 'request injecting a JS agent'? I think I am understanding with the rest of your description, but I am not sure what a 'JS agent' is.

@mike-kaufman
Copy link
Contributor Author

Can you elaborate on what you mean by 'request injecting a JS agent'?

I interpreted this as needing to inject a script into a html response, but @Flarna please clarify. :)

@danielkhan
Copy link
Contributor

@mike-kaufman, @ofrobots Stepping in here. Yes it's about changing the response by adding a script tag to the body.

@tedsuo
Copy link

tedsuo commented Jan 26, 2018

@mhdawson replying in regards to using OpenTracing for C/JS instrumentation.

We are adding support for dynamic loading (opentracing/opentracing-cpp#45) and will be adding a C bridge as the next step. This will allow any API-compliant tracer to be dynamically linked to the nodejs binary, much like what we are doing with envoy, nginx, postgres, etc. This has the advantage of not requiring NodeJS to be tied to a particular tracing implementation or version, as the code is separated. I can start another issue to discuss this.

@Qard
Copy link
Member

Qard commented Jan 27, 2018

I'm not convinced it's possible to fully avoid monkey-patches. There will always be cases like needing to inspect a connection object of a database driver during a query to include host/port information with contextual data of the trace layer, needing to inspect the socket of an incoming http request to get the remote address data, and many other cases where the patches do a lot more than just naively capturing function arguments. It's very common to need to dig about in the function context or augment things to intercept things like errors as transparently as possible.

edit: had this issue open since yesterday and it hadn't refreshed with any of the posts from the large one from @Flarna on. Oops. 😅

@mike-kaufman
Copy link
Contributor Author

I'm not convinced it's possible to fully avoid monkey-patches. There will always be cases like...

@Qard - I agree in short term. Interesting exercise here will be to take use cases like you and @Flarna describe, and then see if we can get a system in place that provides a path to the required data.

@mike-kaufman
Copy link
Contributor Author

@Flarna - thanks for the detail above. Will be interesting to use your use cases as a function to see if we can get the right API in place. Won't be 100% at first, but should be able to evolve in that direction & the exercise will bring more clarity/definition to the shape of the necessary APIs.

@LewisJEllis
Copy link

LewisJEllis commented Jan 27, 2018

I agree in short term. Interesting exercise here will be to take use cases like you and Flarna describe, and then see if we can get a system in place that provides a path to the required data.

I think part of the difficulty is that given an arbitrarily implemented database driver, we can't predict how to reach (for example) its connection object automatically, and the implementor/maintainer of that driver cannot predict what information they should provide to some new potential tracing hooks/reporting API to satisfy the needs of every tool (APM or otherwise).

I have seen cases of monkeypatch instrumentation tracking state across nested objects or calls in ways that the instrumented module itself doesn't track (or otherwise need to). In these cases the instrumented module itself might have a hard time reporting the data collected by the monkeypatching instrumentation. For example, finding the full route of an express request currently requires monkeypatching every layer in the routing tree and pushing each layer's route fragment into a stack during traversal. Unless I am mistaken, Express does not otherwise keep track of or have a way to easily find this full route. I imagine there are more compelling examples of this difficulty; this is just one I am familiar with.

Maybe the potential tracing reporting API would allow Express to report the full route piece-by-piece, but then we're still relying on Express maintainers to understand and accept a pull request that does this. Maybe we can count on Express to do such a thing, but can we count on every tracing-relevant module out there to do such a thing? And to collate and report every potential data point that any tool out there might find interesting? And can we count on end-users to use updated versions of all these modules?

I think it could be within the realm of possibility to eliminate the need to monkeypatch Node core libraries, but similar to @Qard I am not convinced that it is possible to avoid monkeypatching 3rd-party ecosystem modules. I think it'll be enough of a challenge to have everyone in the ecosystem propagate asynchronous contexts correctly in the face of various queueing/pooling/promise-caching scenarios, much less spit out sufficient trace data for every tracing/diagnostic use case to fully avoid monkeypatching.

We probably don't want to end up with every APM/diagnostic-tool vendor trying to send PRs to every library they want to trace, trying to have them report some additional piece of diagnostic data that one vendor cares about but another doesn't. I imagine that with the natural competitive inclinations of vendors wanting to provide features/capabilities that others don't, we would either see monkeypatching continue or we might even see ecosystem module maintainers caught in the middle of an awkward "what all should this module report?" tug of war.

I don't mean to just be a naysayer - I think reducing the need for/incidence of monkeypatching would be excellent, and I think the plan/steps Mike laid out above can satisfy a huge portion of use cases and go a very long way toward that goal. I just hesitate at the notion that monkeypatching can be eliminated entirely; despite the downsides, it's practically a feature of JavaScript.

I would be very interested to read about how other languages/platforms handle this sort of need, if anybody is familiar with good resources or examples. I know some other languages have handled asynchronous context tracking brilliantly, so maybe there is also something to be learned in this adjacent area.

@danielkhan
Copy link
Contributor

I imagine that with the natural competitive inclinations of vendors wanting to provide features/capabilities that others don't, we would either see monkeypatching continue or we might even see ecosystem module maintainers caught in the middle of an awkward "what all should this module report?" tug of war.

Speaking as a vendor, I would appreciate a generic API. The agent part of an APM product is just a fraction of the value proposition of APM.

I don't think that there is a generic way to instrument every module there is.
I don't even think that this is needed, or maybe we have to reiterate on which problem we are trying to solve.

For me the MongoDB APM vendor API is a good example of how a module vendor provides a way for APM vendors to register callbacks instead of monkey patching.
So as APM vendor I know that this is MongoDB and I have to deliberately register myself there but the instrumentation happens deterministically in a publisher / subscriber way and isn't prone to break when the signature of a function changes.

@Flarna
Copy link
Member

Flarna commented Jan 29, 2018

I'm not convinced it's possible to fully avoid monkey-patches. There will always be cases like needing to inspect a connection object of a database driver during a query to include host/port...

Fully agree here. The trace events have to give more info then just function arguments. Depending on the concrete usecase it may be quite some effort to put all this data together.

And can we count on end-users to use updated versions of all these modules?

I fear not - we even see customers insisting on use of NodeJS 0.12. Others use ancient versions of NodeJs 4.

I would be very interested to read about how other languages/platforms handle this sort of need, if anybody is familiar with good resources or examples. I know some other languages have handled asynchronous context tracking brilliantly, so maybe there is also something to be learned in this adjacent area.

As far as I know the approach in Java and .NET is patching bytecode during loading. This requires you to be an "agent module" loaded in a special way - not like in NodeJS.
I would guess more knowledge of internals is needed here compared to NodeJs.
On the other hand you can be quite sure that your are the only one in a process doing something like this - again a major difference to NodeJs where patching of foreign objects is very common (not just for tracing/diagnostics).

@mike-kaufman
Copy link
Contributor Author

RE other runtimes, .net is now using something called DiagnosticSource, which is similar in function to DiagnosticChannel.

@mike-kaufman
Copy link
Contributor Author

mike-kaufman commented Jan 29, 2018

@LewisJEllis - thanks for your comments. A brief response to some of your points:

  • I don't think we'll see monkey-patching go away overnight; in fact the plan above has us relying on monkey-patching in the interim until appropriate APIs can get merged into target libraries. Someone doing this work will need to address surfacing necessary data, and this may very well involve refactoring on the target library. e.g., if express has challenges in tracking route info w/out monkey-patching, then we should consider a PR to refactor express to make this a supported scenario.

  • RE APM vendors all getting on board, each vendor is free to go their own way if they prefer. That said, I think there's enough benefits in what's being proposed to get the vendors + library maintainers to cooperate.

@isaachier
Copy link

Even if you want to use OpenCensus, they seem to be interested in implementing the OpenTracing interface. OpenTracing is an interface behind which the actual tracer can do tracer-specific work. Here is the issue regard the OpenCensus Go client, which easily applies to all of the clients: census-instrumentation/opencensus-go#502.

@mike-kaufman
Copy link
Contributor Author

closing in liue of #180.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests