Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: mark AsyncResource and AsyncLocalStorage as stable #35286

Closed
Qard opened this issue Sep 21, 2020 · 57 comments
Closed

Proposal: mark AsyncResource and AsyncLocalStorage as stable #35286

Qard opened this issue Sep 21, 2020 · 57 comments
Labels
async_hooks Issues and PRs related to the async hooks subsystem. discuss Issues opened for discussions and feedbacks. experimental Issues and PRs related to experimental features.

Comments

@Qard
Copy link
Member

Qard commented Sep 21, 2020

The AsyncResource API has been largely unchanged since its creation many majors ago, and AsyncLocalStorage has been out in the wild for awhile now and has proven its utility. I think both should be considered for stable status at this point. The challenge I see is that they live in the async_hooks module which has been marked as "experimental" at the top-level because the async_hooks feature itself exposes internals thus many (myself included) were unwilling to accept it as a "stable" API.

What I suggest is that AsyncResource and AsyncLocalStorage be considered as stable in the next major. Due to the module being named after the one not stable feature, it may also be worth moving the two APIs out to new top-level modules, but I'll leave that last point for discussion.

How should we proceed in getting these two APIs to stable status?

cc @nodejs/diagnostics

@Qard Qard added discuss Issues opened for discussions and feedbacks. async_hooks Issues and PRs related to the async hooks subsystem. diag-agenda Issues and PRs to discuss during the meetings of the diagnostics working group. experimental Issues and PRs related to experimental features. labels Sep 21, 2020
@mmarchini
Copy link
Contributor

cc @nodejs/tsc for awareness

@bmeck
Copy link
Member

bmeck commented Sep 21, 2020

I'm extremely not comfortable with enterWith or exit being considered non-experimental since they are an action at a distance, but the rest seems possible to me at least. If concrete and necessary use cases in the wild exist for those methods beyond being easier to consume than run I'd be open to conversation about why they are needed but I'm still not seeing wide adoption that forces them to exist. Those APIs are within the realm of what caused problems with domains in the post mortem and I'd be very hard pressed to understand why they are not the same danger.

@Qard
Copy link
Member Author

Qard commented Sep 21, 2020

Domains were global, so enter and exit applied globally. That meant calling either method could totally mess up some other components sense of what the context was. However, each instance of AsyncLocalStorage is separate, so you'd only ever get enterWith or exit called by the owner of that storage, thus they'd have more of a sense of what the "correct" thing to do is in their specific case. Messing with the context of an AsyncLocalStorage instance only impacts that instance, so I don't see how any of the issues with domains would emerge.

If people are really so concerned about those specific parts of the API though, I'm happy to consider marking the rest of the API stable and just leaving those specific parts as experimental. Also, I'm less sure about the stability of AsyncLocalStorage than AsyncResource, given its relative newness in comparison, though I'm fairly confident at this point there are no significant issues remaining.

@bmeck
Copy link
Member

bmeck commented Sep 22, 2020

@Qard The problem is just like how domain allows non-local and non-scoped manipulation of context in non-async boundaries. Your argument that only a "owner" will manipulate assumes that no-shared async storage would require that consumers have a specific coding style (in particular, not exposing a direct reference to an AsyncLocalStorage instance) which I don't find compelling. The docs show a local storage that is shared across multiple locations for example.

@Qard
Copy link
Member Author

Qard commented Sep 22, 2020

The docs show an AsyncLocalStorage instance shared within the same user code, which is the intent. Sharing a context object with other modules or other peoples code in general is a bad idea and not encouraged.

@bmeck
Copy link
Member

bmeck commented Sep 22, 2020

@Qard

Sharing a context object with other modules or other peoples code in general is a bad idea and not encouraged.

Agreed, but the concerns of sharing it with only .run are much different from those 2 methods, hence me not wanting to move those methods out of experimental. The .run method cannot alter non-locally. We can wait on more usage before coming back to those.

@Qard
Copy link
Member Author

Qard commented Sep 22, 2020

The enterWith() method is needed for APM vendors to be able to achieve patch-free instrumentation with the combination of AsyncLocalStorage and the work I'm currently doing on diagnostics_channel.

With diagnostics_channel, http servers can publish a message indicating the start of a connection, which a subscriber can then do als.enterWith(context) to bind the entire request to the context. If there's only als.run(...) then they would need to patch http servers to intercept the request handler, which is fragile and can negatively impact performance. There's also many ways a request handler can be added, and they could be removed and added again any time in the future, due to servers just being an event emitter. This makes the code to correctly patch an http server actually fairly complicated, whereas with the diagnostics_channel message the handler itself becomes irrelevant.

Personally, I disagree that there is any issue with enterWith() or exit(). If you're still concerned about it, perhaps we should add more explicit warnings to the docs about sharing AsyncLocalStorage contexts? Regardless of those specific functions, sharing a context is just a bad idea because it gives you a single object space that can be overridden at any time. The danger of some downstream code calling enterWith() is no more dangerous than something calling run() and messing up the context you expect to have in a callback. Sharing a context object between consumers is just universally a bad idea and this specific problem with domains is exactly why AsyncLocalStorage was created as a non-global system, allowing each user that wants to propagate context to have their own isolated object to do it with.

I entirely disagree that this is a repeat of the domains issues. Domains were problematic because they all shared one context globally--you can't be in more than one domain at a time. That's not the case here, each context is isolated.

@bmeck
Copy link
Member

bmeck commented Sep 22, 2020

@Qard I do not agree with that last comment, hence me not wanting to move those methods out of experimental. We can have this discussion here if necessary but it likely would be more useful to have it elsewhere. Perhaps an explanation of why that specific API design which does have non-local affects is necessary would be more helpful. I'm still unclear on why APM is unable to use .run. I will repeat that I am extremely uncomfortable with those APIs and having convenience in usage is not my priority in allowing those to move forward to stable. The rest of the APIs seem fine to me.

@AndreasMadsen
Copy link
Member

+1 on marking AsyncResource as stable. It being experimental is the primary reason module authors don't want to use it. Other parts of async_hooks still needs work, but I don't see it interact with the AsyncResource API.

I would suggest adding a documentation section, regarding what node.js would in the future, when the other parts are stable, consider a major/minor/patch change to the async context graph. This would serve as a default guideline for module authors when they change the async context graph.

PS: Be sure to also mark the C++ AsyncResource API stable as well.

@mcollina
Copy link
Member

I would prefer to split those two things:

  1. mark AsyncResource stable
  2. keep AsyncLocalStorage experimental for some more time

@mhdawson
Copy link
Member

My initial thinking is along the same lines as @mcollina with leaving AsyncLocalStorage in experimental for a bit longer.

@Qard
Copy link
Member Author

Qard commented Sep 22, 2020

Sounds good re: letting AsyncLocalStorage sit a bit longer. I'm mainly pushing for AsyncResource here because, as already explained by Andreas, module authors are refusing adoption because of it being marked as experimental and that is resulting in context loss scenarios.

I do think we should make a more specific decision at least around when we should re-evaluate if AsyncLocalStorage is ready though. We've sometimes let things sit in experimental forever and go stale, which I'd like to avoid here as this is an important feature for APMs.

@mhdawson
Copy link
Member

@Qard agreed on thinking about the timeline for AsyncLocalStorage. For some of the other Experimental features we agreed on a set of exit criteria in terms of usage, etc. and that was useful in making the case when it was the right time to exit Experimental.

@Qard
Copy link
Member Author

Qard commented Sep 22, 2020

@bmeck Here's a simple example of using a combination of AsyncLocalStorage and diagnostics_channel to enable patch-free instrumentation.

const dc = require('diagnostics_channel')
const onHttpRequestStart = dc.channel('http.server.request.start')
const onMySQLQuery = dc.channel('mysql.query')

const als = new AsyncLocalStorage()

// If an incoming request is received, create a Trace object
// and set it as the context for the request.
// This channel message does not intercept the handler
// as it should not have access to mutate arguments. But it
// triggers this message handler before running `request`
// events, so it can be used to enter the context first.
onHttpRequestStart.subscribe(() => {
  als.enterWith(new Trace())
})

onMySQLQuery.subscribe(() => {
  const trace = als.getStore()
  if (!trace) return
  // Report the MySQL event to the trace
})

Note that to be able to start the als context for the http request, we would need to either use als.enterWith(...) in the diagnostics_channel message handler or patch http servers to wrap the request handler in als.run(...). The patching method creates additional closures and, as explained before, there are many ways to attach a handler, and they can be removed too, so simply patching createServer(...) is not enough.

As for the comment about enterWith(...) not really being any more dangerous than run(...). Consider this app:

const app = express()
const als = new AsyncLocalStorage()

app.get('/user/:id', (req, res) => {
  const requestData = {
    userId: req.params.id
  }

  als.run(requestData, () => {
    // A function from some other module
    doAThing((err, user) => {
      console.log(als.getStore().userId)
    })
  })
})

If doAThing looks like this:

function doAThing(cb) {
  // BOOM. `requestData` is gone.
  als.run({ some: 'data' }, cb)
}

it will blow away the store entirely, because the two points are unaware of each other calling run. The console log will log undefined.

If doAThing looks like this:

function doAThing(cb) {
  als.getStore().userId = 'some id'
  cb()
}

it will replace the userId that was already in the store with a new one. The console log will log 'some id'. You could also run into situations where it checks for presence of a named field, expecting it to be set from one place but it's been set by something completely unrelated and causes all sorts of havoc.

tl;dr: Do not ever share a store with code you don't control. It's essentially the same concept as setting all your variables as globals. You're bound to run into a collision eventually and it's going to be hell to debug. Just don't do it. 😬

@bmeck
Copy link
Member

bmeck commented Sep 22, 2020

@Qard it is more complicated than that, consider:

const als = new AsyncLocalStorage();
emitter.on('my-event', () => {
  als.enterWith(store);
});
emitter.on('my-event', () => {
  als.getStore();
});
ee.emit('my-event');

The problem with these bleed through APIs unlike .run is you don't have fine delineation.

ee.prependListener('my-event', () => {
  als.getStore(); // gets the store BEFORE enterWith
});

Will escape this kind of implicit timing based side effect.

ee.emit('my-event');
// AFTER
als.getStore(); // still has the store meant to span 'my-event'

Will keep the store after the span it sought to encompass.

If you combine the 2, you can get stale store values. I can imagine plenty more scenarios where this is true, but its pretty simple to see. I'm not against solving the problem of needing to span across calls, I am only uncomfortable with the current APIs.


Per your example with .run that same sharing has a problem with .enterWith but is not the major concern of mine. I'd agree that sharing is dangerous regardless but other issues around accidental sharing as my example in this comment are the real worry.

@Qard
Copy link
Member Author

Qard commented Sep 22, 2020

That's intentional. It's setting the context for the rest of the sync tick, which all that is part of. That's actually specifically the functionality that is needed for using AsyncLocalStorage and diagnostics_channel together for patch-free instrumentation, and also why AsyncLocalStorage contexts are intentionally designed to be isolated from each other rather than having some shared global context.

My APM agent can say I want the http request diagnostics_channel event to trigger enterWith() on my AsyncLocalStorage context and that will not influence someone else's separate AsyncLocalStorage context in any way. The owner has the control, and that's the intent. If I hand off my context store to someone else and they mess it up, that's my own fault. That's like saying a stream based API shouldn't be allowed to return a stream and should instead just take a callback to receive the chunks, because if they return the stream someone could pause it--that's kind of the point.

@bmeck
Copy link
Member

bmeck commented Sep 23, 2020

@Qard I understand that is the design of the API, but disagree on the intent of the API being to be the API. The API is to solve the problem of wrapping around these contextual event spans. My example above is showing that the API is leaking outside of those spans. I understand how you can achieve some things with the APIs but I am not sold on the leakage being non-trivial. The problem that is attempting to be solved by those APIs seems to differ from the affects of the APIs.

@Qard
Copy link
Member Author

Qard commented Sep 23, 2020

It's not leaking though. It's capturing exactly the span it's supposed to, as indicated in the docs which call out this exact behaviour. It is intentional that the context continues for the rest of the sync tick and not just a lexical scope. It is not trying to capture only the local scope of the function where it is placed, that's what run() is for. see: https://nodejs.org/api/async_hooks.html#async_hooks_asynclocalstorage_enterwith_store

@bmeck
Copy link
Member

bmeck commented Sep 23, 2020

@Qard it goes beyond the events and outside of the emit. I do not dispute the documentation, just that the API is a leaking abstraction that goes beyond the goal of wrapping a specific time frame and instead wraps past that time frame. It would be good to understand why having the store set to something AFTER the emit() in my example completes is intended and needed. I can understand why it needs to go after the first listener, that is understood. Per my comment above, the current API can generate stale references.

@Qard
Copy link
Member Author

Qard commented Sep 23, 2020

It's not going beyond the goal. It's achieving exactly the intended goal. That API was created by me, and I wrote it specifically for the purpose of pairing it with my now in-progress work on diagnostics_channel to be able to tell an AsyncLocalStorage to enter a context at the point that diagnostics_channel emits the request start event just before the server emits the request event. At that point, everything in the rest of the sync tick should be linked to that context, not just the current lexical scope. Rather than run() which says "make a new async scope and set its context to this" the enterWith() method is saying "for the current scope, set its context to this".

@bmeck
Copy link
Member

bmeck commented Sep 23, 2020

@Qard It is unclear to me what you are talking about at this point, you are describing the API and not the problem it is trying to solve. I'm asking to understand why the API isn't considered to be leaking given that it doesn't appear to be wrapping the event but rather the event plus a long tail until the sync stack of anything unwinds. Repeating that you have achieve an intended goal is well and good, but it doesn't explain what the goal was, which I understand to be setting the store value to be something during the event so that it can be persisted. A more concrete example rephrasing my above comment:

var { AsyncLocalStorage } = require('async_hooks');
var { EventEmitter } = require('events');
var als = new AsyncLocalStorage();
var ee = new EventEmitter();
var incr = ((i = 1) => () => i++)();
ee.on('my-event', () => {
  var value = incr();
  console.log('pre-entering: %s previous: %s', value, als.getStore());
  als.enterWith(value);
});
ee.on('my-event', () => {
  console.log('post-entering: %s', als.getStore());
});
function run() {
  console.log('run');
  ee.emit('my-event');
  setTimeout(() => {
    console.log('unrelated timeout got: %s', als.getStore());
  }, 0);
}
run();
run();

You can see old values crossing into "previous:" and the timeout getting values even if it is outside the event. It would be good to understand why this is not problematic. I remain uncomfortable with those 2 APIs given examples like this one.

@Qard
Copy link
Member Author

Qard commented Sep 23, 2020

An async server is a tree of spans of sync execution. This is modelled with async_hooks in the form of execution and trigger ids along with lifecycle events. In an app with nothing but a single set immediate, you will have the top-level sync tick execution span which, between the start and end of that sync tick, will emit an init event for the setImmediate. Later, when the setImmediate callback is triggered, a before event will trigger before running the callback and an after will trigger after.

These before and after events will be linked to the init event through a shared async id, allowing one to traverse the async execution graph in id form and enabling correlation of anything between the before and after to be correlated to a point in that graph by using the current async id. All this complicated graph management is what AsyncLocalStorage does internally by attaching the given context object to an async resource and propagating it onto newly created async resources whenever an init happens. When that previously mentioned setImmediate triggers its init, the current value stored in the context (if there is one) will be attached to the newly created resource so that value will be used as the context between when the before and after of that resource are run.

The run() method of AsyncLocalStorage encapsulates a callback in a new async resource to ensure that everything within that callback gets attributed to that resource. What it actually does is just call enterWith() between the before and after of a newly constructed AsyncResource. This means that, because AsyncResource automatically becomes the current resource when the runInAsyncScope() method starts and is popped from the execution async resource stack when it ends, any async activity within the callback will be linked to that resource for the specific span of time that it is considered the current resource, which is the before/after boundary triggered around runInAsyncScope(). Lifting enterWith() out from running inside of run() just means it's binding to the existing current resource instead of creating a new one. This means that until the current execution async resource is popped from the stack when the after happens it will propagate the current value into all async branches initialized in that time. If enterWith() is called midway through a tick, it means for the remainder of that tick, or until a new context is set, propagate this object into any async resources initialized in this sync tick.

Being able to change the context object partway through the current tick is an intended feature and enables patch-free instrumentation. In the immediate term, it enables something like this:

// Say I have a simple http server that makes a SQL query
// I want to create a trace for the request and attribute the
// mysql query to that trace...
const server = http.createServer((req, res) => {
  mysql.query('...', (err, data) => {
    res.json(data)
  })
})

// ... a naive APM agent might do something like this:
const als = new AsyncLocalStorage()

// In a real-world scenario, we wouldn't have access to `server`
// so we'd have to patch the module loader to intercept the http
// when it is loaded and patch a large chunk of its API to basically
// just achieve what these three lines of code below are doing.
server.prependListener('request', () => {
  als.enterWith(new Trace())
})

// A more real-world looking patch to collect spans from MySQL
// would look something like this...
const original = mysql.query
mysql.query = function wrappedMySQLQuery(query, callback) {
  const trace = als.getStore()
  const span = trace.startSpan('mysql', 'query')
  
  return original.call(this, query, function wrappedMySQLQueryCallback(err, res) {
    span.end()
    return callback.call(this, err, res)
  })
}

With diagnostics_channel, we will effectively be getting an event immediately before that request event and through an API which requires no specific awareness on the users part of the existence of any specific server object. In the previous example, the server variable needed to be shared, which is rarely the case so more complex monkey-patching is made to intercept the entire module loading process just so we can do that single server.prependListener(...) on every server we see. The amount of hackery APMs need to do just to effectively run a handful of lines of code on some object we don't have direct access to is, frankly, absurd but also quite fragile and damaging to performance.

Anyway, my point is, the ability to change the context partway through a tick is an important feature for APMs in many places. The pre-http request example is the main one, but sometimes we need it for other types of servers. Another case is that APMs are encouraged to preload before app code via node -r $module, and if the user wants a trace of the entire process from the top-level, we can't wrap the top-level of the entry point module automatically without something like enterWith() to set our context object on the current async tick scope.

A ton of work has gone into getting these APIs to where they are now. Solving the problem of async execution graph tracking has literally been the core focus of my career for nearly a decade now, so I've put a ton of thought into it and built several APM companies agents around this model. While the AsyncLocalStorage API is still relatively new to core, it has been something I, and many others working in APM, have been pushing for in core for a long, long time.

@Qard
Copy link
Member Author

Qard commented Sep 23, 2020

Also, everything about your example is working exactly as intended. The context value is supposed to persist until the end of the tick so any async branches after that point will propagate that object into those async branches. If the object is changed midway through a tick, any future async inits in the remainder of that sync tick should propagate the new object. The "value" of the context between before and after hooks should be whatever the value was at the time the resource init event happened.

@Flarna
Copy link
Member

Flarna commented Sep 23, 2020

I'm +1 on promoting AsyncResource stable.

I would wait a little bit longer with AsyncLocalStorage for following reasons:

  • I doubt it's used a lot in the wild for production setups till now
  • If I see callstacks like this from a real user I'm not sure if this is really what we intended and we may have to rethink this once more
  • The NAPI integration of async resources is still missing/work in progress (e.g. n-api: napi_make_callback emit async init event with resource of async_context #32930)
  • There are too much modules in the wild where AsyncLocalStorage is effectively not working as it should because of userspace queuing,... and not using AsyncResource. I know this is not a bug in AsyncLocalStorage nor can this be fixed within node core but as a user I would expect that works better if exposed stable.

Regarding your discussion @bmeck @Qard
I'm with @Qard regarding the need of these APIs. I'm aware that they may cause problems if used wrong. But this is the nature of all APIs. Someone could argue that all open calls may result in a leak if close is not called...
The major difference between AsyncLocalStorage and domains is that domains consumed/redirected/discarded errors which was a global state. If one AsyncLocalStorage user garbles it's own als instance by using the wrong APIs it's a local problem. There is no global state change nor is there any impact to other als instances.
@bmeck Do you have some proposal how an API should look like to reach the requirement described by @Qard to attach to an existing resource for it's lifetime?

FWIW .NET offers an AsyncLocal class with is similar. They don't offer something like .run, just set a value.

@bmeck
Copy link
Member

bmeck commented Sep 23, 2020

Anyway, my point is, the ability to change the context partway through a tick is an important feature for APMs in many places.

I agree to this and do think we should have some way to solve it, but not to the points in my comments being non-trivial.

@bmeck Do you have some proposal how an API should look like to reach the requirement described by @Qard to attach to an existing resource for it's lifetime?

I have no real demands on the actual signature but it shouldn't be the end of the current tick that determines when it is changed. An API that matches the following would work fine to me:

Something that persists until a synchronous termination before the end of the tick. APIs like eventEmitter.emit('request') are a good example of something that the assignment should stop at the end of. I am not opposed to it escaping the local call frame of the event.

Right now, the API to me bleeds past the intended synchronous call frame that it was hoping to wrap and I still don't see an explanation of WHY going past that call frame is intended and necessary, just explanations of that it does and it is intended. I am familiar with these APIs and async queueing, I don't think everyone needs to establish their level of familiarity here. TC39 also has had presentations on these issues which I have been a part of and I have had plenty of discussions on why AsyncLocalStorage style APIs are rejected there, most recently with Async Contexts. I am not trying to be as absolute as I would be in TC39 and do want to find a middle ground, but find that subset of the current APIs to be non-starters without clarification and exhaustive production usage from many sources showing no issues (either with development or runtime).

@Qard
Copy link
Member Author

Qard commented Sep 23, 2020

It needs to be until the end of the tick though because the point is that we want to capture everything in that tick. An APM product wants to trace an entire request, not just part of it. At the start of the tick which runs the request handler until the end of it, we want to know everything that happens within that tick and within every async task descending from that tick so we can attribute that activity to the request that spawned it. If, for example, running the request handler was followed by running a setTimeout to timeout the request if it's not finished soon enough, we want to see that. We need our context object, which is probably our tracer, to be propagated into that so we have some way of finding the current trace within the descending async execution. I created enterWith() specifically because there are a few scenarios where we need to capture until the end of the tick. I was quite explicit in the documentation about warning exactly what this means. The suggested API for 90% of users is definitely the run() method, enterWith() is a special-case API.

@Shahor
Copy link

Shahor commented Oct 26, 2020

It's fine since the discussion is open :)

@Shahor
Copy link

Shahor commented Oct 28, 2020

Would it be possible to surface an als store through some events 🤔 ?

Considering the example in this comment

process.on('unhandledRejection', () => {
  als.getStore() // undefined here
})

This would always be undefined.
I suppose there is no way to interact with this EventEmitter in such a way that it would be ran under the ALS ?

@Qard
Copy link
Member Author

Qard commented Oct 28, 2020

An unhandled rejection could come from anywhere, so no there's no way to bind that to an AsyncLocalStorage instance. The context in an event handler will be whatever the context was where the emit was called. In this case, it'd be in Node.js internals somewhere, not from the promise that the error originated from.

@mmarchini
Copy link
Contributor

@Qard should we remove from diag-agenda until we have a proposal for exit criteria?

@Qard
Copy link
Member Author

Qard commented Nov 25, 2020

Sounds reasonable. :)

@vdeturckheim
Copy link
Member

vdeturckheim commented Feb 10, 2021

@gireeshpunathil thanks a lot for the list of exit criterea, here is my take on it for the follwoing scope:

  • AsyncLocalStorage constructor
  • AsyncLocalStorage.prototype.getStore()
  • AsyncLocalStorage.prototype.run(store, callback[, ...args])

I am restricting to this scope as I believe this is a viable minimal API for AsyncLocalStorage that has no strong controverial points - there might be discussions about the DX of this scope but AFAIK there are no strong opposition against these methods and this constructor. This scope is, IMO, the most likely to reach consensus.

user interfaces (flags, API semantics): are those finalized / ratified internally?

-> On the given scope, yes, the only debatable method is run and if we decide to make an alternative the mains canonical way of starting context spreading, there will always be a trivial way of building run over it.

platform coverage: is the capability available in all (identified) platforms?

-> I am not aware of any platform limitation for this scope

bugs: how many critical bugs outstanding? is the backlog ratified?

-> AFAIK there is no bug opened on the API per se. There is a know performance cost in promise-heavy environment but it should not be considered as a blocker: enabling AsyncLocalStorage is optional and the end user has the choice (directly or through packages they use) to pay that cost. However there are bugs linked to AsyncLocalStorage when ecosystem packages break context propagation (for instance here nodejs/help#3186). The solution is actually to move the API to stable to push adoption/compatibility forward and get rid of these bugs.

worker support: will feature work seamless within worker threads? or special settings required, and documented?

-> I believe there is no impact here

test coverage: unit test and smoke test. are all the APIs covered? are there known flakes?

-> There are 15 test files for the AsyncLocalStorage API covering multiple use cases and situations. (All the ones which name start with async-local-stoagre here https://github.com/nodejs/node/tree/master/test/async-hooks)

serviceability: upon production issues emanating from this feature, do we know how to debug?

-> This one is probably the one where we probably can improve the most. I have spent a lot of time debugging context spreading issues and I am probably too used to this to be able to give a satisfactory response here. When we built this API, I had the feeling that it would mostly be used by APM/RASP providers. We have opened issues about the API in this repo and in nodejs/help that actualy prove me wrong on this (which actually makes me very happy). There is probably room for a debugging guide or at least a few debugging tips in the doc and I woul dbe happy to add them to the PR that would mark the scope as out of experimental.

documentation: are the capabilities documented?

-> There have been a few iterations on the doc and I believe it is clear enough as of right now.

evidence of adoption: production deployments, npm modules, framework stacks.

-> It is hard to find numbers for this but the DataDog APM for Node.js actually uses this API which means there must be a large amount of Node.js live application using it through this module.

dependencies on experimental / flaky upstream APIs (such as v8): are there any, and if so, do we know mitigation plans.

-> AFAIK @Qard is still working on JS-side promise hooks, which will affect all of this, but only as an implementation detail and we can move forward without blocking on it.

This is a very personal analysis which has 2 major biases:

  • I have co-authored this API
  • I have mutliple years of experience in Node.js context tracking which means my experience is not at all representative of the ones of the experience of most Node.js users but I tried to go through this list with as much honnesty as I could.

Thanks again @gireeshpunathil for providing this great framework.
Based on this, I would like to suggest the defined scope to be marked as out of experimental and I will soon open a doc PR in that direction. Let me know what you think :)

@Aschen
Copy link
Contributor

Aschen commented Feb 10, 2021

We are using AsyncLocalStorage in Kuzzle since more than 7 months and we didn't encounter any issue or any bad API design flaws.

Despite we know our clients are using this feature in production since months, since we are developing a Node.js framework we can affirm that many other users are using AsyncLocalStorage without any inconvenient.

👍 for @vdeturckheim proposal to make this API stable so other products may want to take advantage of this feature as well

@bmeck
Copy link
Member

bmeck commented Feb 10, 2021

That scoping seems fine to me.

@Qard
Copy link
Member Author

Qard commented Feb 11, 2021

I agree with that scoping also. Nothing controversial in there to me, and all that has been battle-tested for months now in the Datadog APM agent. Any issues we've had have been with context loss due to AsyncResource not being marked stable yet and therefore some libraries declining to adopt it, so I suggest we mark both as stable for 16.x and start promoting adoption and migration from using async_hooks directly where possible.

@mhdawson
Copy link
Member

mhdawson commented Mar 4, 2021

+1 for making non-experimental. @vdeturckheim do you know when you might open the PR so I can keep an eye out for it ?

@vdeturckheim
Copy link
Member

@mhdawson I have a project ending early next week, so I planned to work on this in the next couple weeks (there might be some big doc updates to do)

@mcollina
Copy link
Member

mcollina commented Mar 4, 2021

+1 for making non-experimental.

@mhdawson
Copy link
Member

mhdawson commented Mar 4, 2021

@vdeturckheim thanks for the update and for doing the work to get it out of experimental. I think its a really important feature and appreciate you moving it forward.

@Lxxyx
Copy link
Member

Lxxyx commented Mar 15, 2021

+1 for making non-experimental. We are applying it in a production environment.

For the past two months I've been working with AsyncLocalStorage as a runtime for a full-stack web framework (https://github.com/midwayjs/midway). It has made Using "React Hooks" to develop the back-end into a reality.

Referenced code: https://github.com/midwayjs/hooks/blob/ecc0b83a4fae5554cfbec8e61ca6ba74bc3f6135/packages/hooks-core/src/runtime/index.ts#L1-L23

We have over 1000 Midway full-stack applications in our company and will be upgrading to use the AsyncLocalStorage version. So it is very important for us that AsyncLocalStorage becomes a stable Api.

@shahbazshueb
Copy link

+1 for making this non-experimental. We at Airlift are also seriously considering it for our production environment.

@vdeturckheim
Copy link
Member

We stabilised a subset of the API that should be prod ready in #37675 :)

@mhdawson
Copy link
Member

@vdeturckheim I think maybe we should close this for now as "complete" and open a new one to stabilize other parts at a later time when that makes sense?

@vdeturckheim
Copy link
Member

+1 @mhdawson . cc @Qard too?

@Qard
Copy link
Member Author

Qard commented Jun 24, 2021

We've stabilized what I wanted to see stabilized, so I'm going to call this done. Other parts of async_hooks, or more likely further higher-level APIs on top of it, can have their own stability timeline discussions elsewhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
async_hooks Issues and PRs related to the async hooks subsystem. discuss Issues opened for discussions and feedbacks. experimental Issues and PRs related to experimental features.
Projects
None yet
Development

No branches or pull requests