Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Chaining middleware using Promises #353

Closed
3 tasks done
aoberoi opened this issue Dec 20, 2019 · 17 comments
Closed
3 tasks done

Proposal: Chaining middleware using Promises #353

aoberoi opened this issue Dec 20, 2019 · 17 comments
Labels
discussion M-T: An issue where more input is needed to reach a decision
Milestone

Comments

@aoberoi
Copy link
Contributor

aoberoi commented Dec 20, 2019

This proposal is meant to address the following use cases:

The changes

Adjust ack() and next() to return Promises

All listeners (with the exception of those listening using app.event()) and all middleware will become async functions. The consequence of using a normal function where an async one was expected would be either to interpret it the same way as Promise.resolve(). Listeners and middleware’s returned promise should resolve when processing the event is complete. If it rejects, the middleware preceding it in the chain may catch, but if it doesn’t catch the middleware should also reject, and so on. Rejections that bubble to the first middleware are handled through the global error handler.

ack() returns Promise<void>. The promise resolves when the receiver is done acknowledging the event (typically when the HTTP response is done being written). By allowing the promise to reject, listeners can handle errors that occur during acknowledgement (like trying to send the HTTP response). Currently, receivers are expected to handle these sorts of errors, but they typically have no ability to do anything intelligent.

Code for work that should be performed after the incoming event is acknowledged should be performed after calling await ack() in the listener or middleware function. The consequence of forgetting the await keyword will likely be invisible. The returned promise would be unhandled and might reject, but that kind of error is very rare.

next() returns Promise<void>. The promise resolves when the next middleware in the chain’s returned promise resolves. This builds a chain through middleware where early middleware gain a signal for when processing has completed in all later middleware. That signal can then be passed onto the receiver. It also means middleware use await next() and follow with more code to post-process the event (such as logging middleware). This replaces the usage of next() where a callback function is passed in.

The global error handler can return a Promise

If a rejection bubbles through the first middleware, the global error handler is triggered. This is not new. But instead of only returning void it can optionally a Promise. If the returned promise rejects, that rejection is exposed to the receiver as described below. The default global error handler will log the error and then reject, which means by default all unhandled errors in listeners and middleware make their way back to the receiver.

Receivers call App.processEvent() instead of emitting

Receivers are no longer EventEmitters. Instead, they are provided with an App instance upon initialization. When the receiver has an event for the app to process, it calls the App.processEvent() method, which returns Promise<void>. The promise resolves when all middleware and listeners have finished processing the event. The promise rejects when the global error handler's returned promise rejects.

There’s no resolved value for the returned promise. The receiver is expected to remember the value passed to the ack() function (which the receiver created) and associate that value to the returned promise if it chooses to delay the HTTP response until all processing is complete. This allows us to built synchronous receivers that only respond once all processing is complete.

Dispatch algorithm changes

Events are dispatched through 2 separate phases:

  1. One global middleware chain
  2. Many (n=number of listeners attached) listener middleware chains.

Between these two phases, a special middleware manages dispatching to all the listener middleware chains in parallel. This will not change. However, the return values of the individual listener middleware chains will start to become meaningful, and need to be bubbled up through the global middleware chain. This special middleware between the phases will aggregate the returned promises and wait for them all to complete (a la Promise.all()). If any promise(s) reject, then the middleware will return a rejected promise (whose error is a wrapper of all the resolved promises) to bubble through the global middleware chain. Conversely, if all of them succeed, the middleware will return a resolved promise to bubble through the global middleware chain.

Middleware which choose not to handle an event should return a resolved promise without ever calling next().

Align say() and respond() to return Promises

Now that listeners and middleware are expected to be async functions, it follows that all the utility functions given to them which perform asynchronous work should also return Promises. This aligns with ack() and next() by similarly allowing listeners to handle errors when calling say() or when calling respond().

The consequence of forgetting the await keyword will likely be invisible. The returned promise would be unhandled and might reject, but that kind of error is very rare.

Disadvantages

This is a breaking change. The scope of the backwards incompatibility is pretty limited, and here are some thoughts about migration:

  • Make all your listeners and middleware async functions.
    • Even if you fail to do this, Bolt will warn you, so it shouldn’t cause too many issues.
  • Adjust your code to await ack() (and await next()/say()/respond()).
    • If you fail to do this, your app will likely continue working except for the extremely rare occurrence when finishing an HTTP response has an error, and then you’ll know you should adjust due to an UnhandledPromiseRejection error.
  • If you wrote a middleware, adjust your code to await next() instead of calling it directly
    • If you used post processing, move that code from inside the callback to after the await next().
    • If you fail to do this, your app will likely continue working until a later middleware or listener fails, and then you’ll know you should adjust due to an UnhandledPromiseRejection error.
    • Estimating by the issues created so far, the number of apps using middleware today is very small.

The performance might suffer. The change in the dispatch algorithm requires all the listener middleware chain returned promises to resolve. Most should resolve rather quickly (since the first middleware in most of the chains will filter out the event and immediately return). However, if some middleware needs to process the event asynchronously before it can decide to call next() or not, it could slow down the event from being fully processed by another much quicker listener middleware chain. Once again, we haven’t heard from many users who use listener middleware, especially for asynchronous tasks, so we think this impact is relatively small. We should measure the performance of our example code (and new examples) to understand whether or not this has a significant impact.

There will be less compatibility with Hubot and Slapp apps who wish to migrate and continue to use middleware they wrote. A design similar to this one was considered before releasing Bolt v1. It was deliberately rejected in an effort to maximize compatibility with Hubot and Slapp. However, at this point there’s no indication that many Hubot or Slapp developers are actively migrating their code, and no indication that this specific part of migration is a problem worth holding back on this design’s advantages for.

Other benefits

The old way could be implemented in terms of the new way (but not the other way around). This could be a useful way to assist developers in migrating. In fact, this could pave the way for how “routers” within Bolt are composed together to make a larger application. The special middleware between the two phases can be thought of as a router that you get by default, and could be instantiated separately in other files. They could all then be composed back into the main app.

Credit to @selfcontained, @seratch, @barlock, and @tteltrab. I’ve simply synthesized their ideas into a proposal.

Requirements (place an x in each of the [ ])

  • I've read and understood the Contributing guidelines and have done my best effort to follow them.
  • I've read and agree to the Code of Conduct.
  • I've searched for any related issues and avoided creating a duplicate issue.
@aoberoi aoberoi added the discussion M-T: An issue where more input is needed to reach a decision label Dec 20, 2019
@seratch
Copy link
Member

seratch commented Dec 22, 2019

Thank you for your great proposal here. I'm also excited to seek ways to move this framework forward to cover broader use cases!

I still may misunderstand some, but it seems the proposed changes are effective in addressing the known middleware issues mentioned.

I've got a few questions below. I would like to know your further thoughts.

  • Avoiding listener breaking changes
  • Safe way to run remote functions

The first one is a concern from the perspective of Bolt users. The second is about the effectiveness as a solution for FaaS (or some may say serverless) support.

Avoiding listener breaking changes

tl;dr - I'd like to seek ways to avoid breaking changes affecting all users

If I understand correctly, the primary motivation of the changes is to address the current issues around middleware (#248 #239). I agree that it may be inevitable to bring breaking changes to fix them.

Changing how to write middleware affects some users (I'm sorry for those people), but not all. Changing ack in listeners affects 100% existing users and makes all existing online resources about Bolt outdated. Modifying all existing listener functions to be async and put await in front of ack() calls doesn't bring any benefits to existing applications. It just brings costs for upgrades. I don't think these downsides are ignorable.

I know a listener is a kind of specialized middleware, but I'm still thinking ways to avoid introducing incompatibilities to listeners. Have you already verified it's impossible to keep ack the same as before?

Even if Bolt decides to change the communication between App and Receivers to Promise based one, I'm not sure if it's necessary to expose the Promise generated by ack() to listener functions. It may be possible to encapsulate the conversion of acknowledgment completion to Promise.

I haven't tried to implement your idea yet, but I think there are certainly some ways to achieve the goal without bringing changes to listeners.

Safe way to run remote functions

tl;dr - Context#defer may not be a good solution for running remote functions

For FaaS (e.g., AWS Lambda, Google's Cloud Functions, etc.) support, providing ways to run functions remotely after request acknowledgment is a required piece. "Remotely" here means running functions on another node/container with no shared memory. Also, the mechanism should be beneficial not only for FaaS but for others, as Slack apps have to avoid 3-second timeouts in some ways.

I believe we can assume that functions given to the Context#defer are executed remotely or at least with no shared memory. Otherwise, it looks like just a wrapper of Promise generation to me.

On that premise, supporting closure functions may be challenging. Let's imagine the following situation. I have no idea to safely serialize the code along with deep-copied function/value references in its lexical scope.

app.command("/hello", async ({ body, context, ack }) => {
  // Bolt needs to serialize the following code to run in another Lambda (or other equivalent)
  const localValue = await calculateSomething(body);
  context.defer(() => {
    return doSomethingWith(localValue)
      .then(successHandler)
      .catch(errorHandler);
  }); 
  return ack();
})

The above example may look a bit arbitrary to some. Let me show you another example. Probably, deferred functions tend to use some of the listener arguments. The arguments can be shallowed in the same lexical scope. That may induce more mistakes like directly referring to the context, body given to the outer function.

app.command("/hello", async ({ body, ack }) => {
  context.defer(({ context }) => {
    // body here is a reference to the outer lexical scope
    return doSomethingWith(context, body.text);
  }); 
   return ack();
})

Warnings like "You can not pass any closures to Context#defer method" in documentation may work. But the limitation will be a common pitfall.

My idea for FaaS (also it's possibly applicable to any use cases)

Let me share another option for FaaS support. I've already briefly shared this idea with @aoberoi and others before.

Take a look at the following code example. If Bolt introduces this new way to define listeners, dealing with FaaS limitations or designing Slack apps free from 3-second timeouts can be much simpler. I believe the design looks straight-forward for Slack apps.

const app = new TwoPhaseApp({
  token: process.env.SLACK_BOT_TOKEN,
  receiver: new AwsLambdaReceiver({
    signingSecret: process.env.SLACK_SIGNING_SECRET
  })
});

app.command('/lambda')
  .ack(({ ack }) => {
    // inside this method, no breaking changes
    ack('this text will be posted immediately');
  })
  .then(async ({ body, say }) => {
    // this function will be excuted in another `Event` type invocation using the same lambda function
    return say('How are you?').then(() => say("I'm good!"));
  });

// convert to AWS Lambda handler
export const main = app.receiver().toHandler();

This approach doesn't have the lexical scope issue. Even if the listener has references to values/functions (global ones, imported ones, etc), it's safe to access them.

The downsides caused by introducing yet another App class would be:

  • Having two ways to create Bolt apps may be confusing for beginners
  • The maintenance cost of Bolt may increase (although my prototype shares many of implementation with the existing parts, the costs is about not only code but documentation, etc.)

I may miss something about Ankur's idea. I would like to know others' comments on it. Also, I would love to know others' impressions/feedback on my prototype as well. If many are interested in the idea, I can create another GitHub issue to discuss it.

@stevengill
Copy link
Member

Personally, I like the idea of of everything returning promises and chaining them together. That makes sense to me coming form a JS background.

The migration steps @aoberoi described seem simple on first glance and potentially allow devs to leave their code as is and still work. Though I'm sure we will recommend everyone update their apps. Definitely going to be a good amount of work updating our own sample code + docs to this new model.

@barlock
Copy link
Contributor

barlock commented Dec 24, 2019

Great job @aoberoi synthesizing these ideas.

I'm generally less afraid of breaking changes than others, I think using them can actually be fairly effective. (Preferencing this: Only an opinion, I care significantly more about giving devs control over error handling and simpler more obvious testing approaches which is currently quite hard). I would lean towards switching the middleware function signature to be koa-styled for two reasons:

  1. The signature change tells the dev writing middleware that the function will behave differently than before
  2. Synergy in the JS community, devs familiar with koa can jump in and understand the concepts immediately

(Still an opinion, obviously maintainers choice) Attempting to make these changes without breaking the api I think is possible, but it would add a significant amount of code to bolt to maintain that backwards compatibility. If this were my project, I would opt for simpler code and breaking changes so that it would be easier to update and fix issues, and allow external folks to collaborate easily.

Safe Error Handling

This is one of my key concerns, in the current architecture, it's easy to create unhandled promise rejections. I think this proposal does a good job of addressing them, but there are a few things in contention that might leave problems. The big thing is this error from node:

Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Unhandled errors will prevent graceful shutdown and if the runtime doesn't reboot (which might not be a real thing now?) would cause the app to just be down.

Complex systems fail in complex ways. (Very real) Hypothetical that would bring down a bolt app: My datacenter's network has a glitch, my database connection breaks, that library can't reconnect either by design choice (node does this a lot), or perhaps my Docker container simply needs a restart on a new node without this networking issue. If my database connection throws an error in the middleware stage, think oauth or fetching the correct bot token for a workspace, that exception will go unhandled and the app will stay up (on the port), so health checks pass, but would be functionally dead causing an outage.

@seratch This leads me to your point here:

Modifying all existing listener functions to be async and put await in front of ack() calls doesn't bring any benefits to existing applications. It just brings costs for upgrades. I don't think these downsides are ignorable.

Putting await in front of ack gives developers the chance to handle errors. Bolt shouldn't remove the ability for developers who want to handle errors from being able to do so. The current api also assumes that ack is always successful, and so the developer will always move to business logic assuming an ack occurred. If it didn't, it could lead to very complex bugs that are hard to track down (same thing for say too actually)

There are also some benefits too, the await will guarantee the event loop yielding to that function, letting users get feedback faster. (This is a micro-optimization, but possibly worth mentioning)

Lastly, if a developer doesn't make this change, it actually wouldn't affect anything, bolt would behave exactly as it does now: The http response will be sent as soon as the event loop frees up, any errors will go unhandled, other processing will still happen assuming the response was sent correctly.

Improved testing ability

From this proposal, I read an assumption of a "testing receiver"? Possibly bolt would publish a receiver that's designed to send events programmatically and then resolve for assertions after processEvent is done?

If so this comment is possibly concerning:

The promise never rejects (all errors stop at the global error handler).

I agree that this seems right from a normal receiver's perspective. How do I test my error handling? It would be great to expect various errors.

Slightly off-topic for the heading, but wouldn't it also make sense for a processEvent failure to return a 500?

Performance concerns

@aeoberoi I'm curious to dig more into the performance issues. I think there possibly is something I'm missing?

However, if some middleware needs to process the event asynchronously before it can decide to call next() or not, it could slow down the event from being fully processed by another much quicker listener middleware chain.

Possibly this is different than my current expectations of how bolt works, but apps can do this now right? Middleware currently can do asynchronous processing and let the middleware continue by calling next, the difference here is that it's promise-based instead of callback based?

I wouldn't expect this change to effect performance noticeably or at all. Possibly the only change would be how v8 now has to process a Promise vs a callback which I assume takes a few extra instructions, but even that could be incorrect as v8 is quite good at optimizing common patterns.

Dispatch algorithm

I think the description was correct, I'm going to write a code sample just to specify clearly what I'd expect to happen, taken largely from #337

// console logs indicate call order,
// astrix indicate parallel processing and therefor call order is unpredictable
app.use(async ({ next }) => {
  try {
    console.log('1');
    await next();
    console.log('8 if `next` resolves');
  } catch (error) {
    // This block would be functionally equivalent to `app.error`
    console.error(error);

    // Still throw, but this way we could have error handlers at every step of the middleware chain
    if (!(error instanceof MyErrorType)) {
      throw error;
    }
  }
});

app.use(async ({ next }) => {
  console.log('2');
  await next();
  console.log('7');
});

app.command(
  '/command',
  async ({ next }) => {
    // Some interesting middleware
    console.log('3*');
    await next();
    console.log('6*');
  },
  async ({ say, ack }) => {
    // Normal handler
    console.log('4*');
    await ack();
    await say("Command response");
    console.log('5*');
  },
);

app.command(
  '/another-command',
  async ({ next }) => {
    // Some interesting middleware
    console.log('3*');
    await next();
    console.log('6*');
  },
  async ({ say, ack }) => {
    // Normal handler
    console.log('4*');
    await ack();
    await say("Command response");
    console.log('5*');
  },
);

app.error(async error => {
  // For those using transports, have the ability to log the failure,
  // flush any non-sent logs, then shutdown gracefully
  log.error(error);
  await log.flush();
  process.exit(1);
});

☝️ This aligning to how you see code working?

@aoberoi
Copy link
Contributor Author

aoberoi commented Dec 24, 2019

@seratch thanks for the detailed review of the proposal! i'll address some of your points here:

I'd like to seek ways to avoid breaking changes affecting all users

Modifying all existing listener functions to be async and put await in front of ack() calls doesn't bring any benefits to existing applications. It just brings costs for upgrades. I don't think these downsides are ignorable.

You're right that we could choose not to break the listener API, but it seems prudent to me that we comprehensively "correct" the leaky abstraction we've built. We've chosen to hide the fact that an asynchronous operation is occurring when the listener invokes ack(), and that has been very convenient for users who might find the async/await keywords intimidating. My assumption is that adding the await keyword will trigger some discomfort, but for most people that discomfort will be quick to get over since they are likely just copying the pattern they see in example code. If they don't add the await, as I described in the "Disadvantages" section above, there will likely be no bad consequence. The code will do about the same thing it does today. So my assessment is that most people will barely notice.

I don't agree that there is no benefit. The benefit is that errors during ack() will actually be visible to handle. If we do not make this change, the leaky abstraction will cost developers the ability to do anything, including just logging for awareness, in those situations. Even if they are lazy and do not surround the await ack() with a try/catch, using await allows those errors to bubble up to the global error handler, which I see as a benefit.

I think the biggest cost will be in the documentation effort and support effort of the maintainers. Let's be a little more specific about these costs:

  • The documentation effort should be relatively low. We have only a few tutorials/examples that are widely linked, which we can update with a simple find/replace. There are valuable resources created in the community, too. Most of those should continue to work, since calling ack() without await roughly leaves you in the same position as you are today. However, I optimistically think the community creating these examples is still highly engaged and will likely update for breaking changes now, as compared to 6 months or 1 year from now.

  • The support effort is more substantial. In the absence of a formalized support schedule on this project, I think we should follow the support schedule from the Node Slack SDK as much as possible. According to this schedule, we shouldn't end support for the current (1.x) API until we're removing support for a Node version. The lowest supported Node version for this package is currently the 10.x LTS, whose support won't expire until April 2021. This means if we want to make a breaking change, we should be willing to support the 1.x API in Bolt concurrently with a 2.x API for at least ~15 more months. This means helping developers using 1.x who create GH issues, backporting new features (when its possible), and keeping a migration guide up to date. While this is not free, I think given the scope of the breaking changes proposed here, it could be reasonable to handle.

I think we should continue to gather more input here. Let's keep this point open as a variation of the proposal. I'll update the description above to capture the variation, and let's work together to build a list of pros/cons for it. This should make it easier to answer the question "are the benefits worth the cost?".

On that premise, supporting closure functions may be challenging. Let's imagine the following situation. I have no idea to safely serialize the code along with deep-copied function/value references in its lexical scope.

You are totally right here. I did not think as carefully about the Context#defer() idea as you have. This is super helpful!

Maybe remote function execution with all the benefit of lexical scope capture is not feasible within this proposal. I think this proposal still offers a useful improvement for the problem of running Bolt apps on FaaS providers. The signal that would be gained regarding when the event is completely processed by middleware/listeners is valuable in the FaaS use case.

But this proposal does not offer a comprehensive solution, because it doesn't add clarity to how an app should deal with asynchronous work within an FaaS context. There are likely a few ways to fill this gap, including the TwoPhaseApp idea you've shared. Can we agree that solving this gap is outside of scope for this proposal? Do you think there's anything about this proposal that is in conflict with the TwoPhaseApp idea (or any other idea which could solve for the gap)?

I think there's a substantial amount of added complexity to deliver TwoPhaseApp alongside the existing App. I would like to make one common path for developers using Bolt on FaaS or on a long-lived process, as much as possible. If we change the API of App to what you've shown as TwoPhaseApp in the next major version, I feel that there's a significant added complexity burden on the users running Bolt in a long-lived process that otherwise they wouldn't have to deal with. I don't know if this tradeoff is worth it, so that pushes me back to delivering both App and TwoPhaseApp classes separately. This seems like a much higher documentation and support cost, and I'm looking for ways to lower it. How do you feel about these tradeoffs and costs?

@aoberoi
Copy link
Contributor Author

aoberoi commented Dec 24, 2019

@barlock thanks for the comments as well! i'll try to answer some of your questions.

I agree that this seems right from a normal receiver's perspective. How do I test my error handling? It would be great to expect various errors.

I'd suggest you do something like this:

const assert = require('assert');

// Example system under test.
// This listener throws an error when the action type is not 'button'
async function myListener({ action }) {
  if (action && action.type !== 'button') {
    throw new Error('myListener should only work with buttons');
  }
  // ...
}

// Example test suite
describe('myListener', () => {
  it('should fail when action type is not button', async () => {
    // Arrange
    let caughtError;
    const actionFixture = {
      body: {
        type: 'block_actions',
        actions: [{
          action_id: 'block_action_id'
        }],
        channel: {},
        user: {},
        team: {},
      },
      respond: noop,
      ack: noop,
    };
    const app = new App({
      token: '',
      receiver: {}, // dummy receiver, could replace with a TestReceiver impl
    });
    app.error(e => caughtError = e);
    
    // Act
	app.action('block_action_id', myListener);
    await app.processEvent(actionFixture);

    // Assert
    assert(caughtError instanceof Error);
  });
});

async function noop() { }

But if you trust that the App does what it's supposed to, your unit tests shouldn't even have to test the flow of an event through the receiver, into the listener, and then the error back to the global error handler. You could simply test your listener as a "pure" function.

Slightly off-topic for the heading, but wouldn't it also make sense for a processEvent failure to return a 500?

I don't think so. The value passed to ack() determines the HTTP response (when using an ExpressReceiver), which will likely happen before the App#processEvent() returned Promise resolves. This promise doesn't carry data, its simply a signal that listener/middleware processing completed.

Possibly this is different than my current expectations of how bolt works, but apps can do this now right?

While listener middleware can perform async work before it decides to call next() or not, that doesn't affect the performance of some other listener middleware chain. Recall that events will dispatch to every listener middleware chain, even if the listener at the end of that chain is completely irrelevant to the incoming event. In those cases, some listener middleware would "filter" the event's propogation to the listener by choosing not to call next(). But with this proposal, if some middleware decides to call next() or not after an async operation, the dispatch must await that operation. If the "right" listener completes quickly, and that "wrong" middleware's async operation decides not to call next(), the wrong middleware just slowed down the right one.

☝️ This aligning to how you see code working?

I didn't propose that say() would return a Promise, but besides that, yes 😄. I think the rationale behind making say() return a Promise is the same (ability to handle errors), but I think this might have a greater cost, since say() is often used more than once inside a listener. Should we add this as a variation on top of this proposal?

@seratch
Copy link
Member

seratch commented Dec 24, 2019

I didn't propose that say() would return a Promise, but besides that, yes 😄. I think the rationale behind making say() return a Promise is the same (ability to handle errors), but I think this might have a greater cost, since say() is often used more than once inside a listener. Should we add this as a variation on top of this proposal?

Good catch. In my other prototype, I was thinking of changing say/respond to return Promise (currently, their return types are void for both). I highly agree with this 👍

@seratch
Copy link
Member

seratch commented Dec 24, 2019

Change to ack()

Regarding the benefits of changing ack(), now I got the point thanks to:

@barlock's

Putting await in front of ack gives developers the chance to handle errors. Bolt shouldn't remove the ability for developers who want to handle errors from being able to do so.

@aoberoi's

I don't agree that there is no benefit. The benefit is that errors during ack() will actually be visible to handle. If we do not make this change, the leaky abstraction will cost developers the ability to do anything, including just logging for awareness, in those situations. Even if they are lazy and do not surround the await ack() with a try/catch, using await allows those errors to bubble up to the global error handler, which I see as a benefit.

I understand the benefits described above in common. As far as I know, none of the known issues (specifically #239 #248) is related to errors in ack() and we don't have any requests/reports on error handling for ack() yet. But it doesn't mean we should not improve the mechanism. I agree we should apply improvements regarding this matter even if it may cost (mentioned later - I believe the cost is reasonable enough).

@aoberoi's

If they don't add the await, as I described in the "Disadvantages" section above, there will likely be no bad consequence. The code will do about the same thing it does today. So my assessment is that most people will barely notice.

This is great. This assumption removes the main concerns I had.

This should make it easier to answer the question "are the benefits worth the cost?".

The costs you described here look reasonable enough. Backporting new features may be a bit tough but it'll be inevitable in any case in the future (even if we don't do major upgrade mainly for this matter).

FaaS support

Regarding FaaS topic:

Can we agree that solving this gap is outside of scope for this proposal?

Sounds great. I agree with this. I will start a separate discussion on FaaS support in another issue.

Do you think there's anything about this proposal that is in conflict with the TwoPhaseApp idea (or any other idea which could solve for the gap)?

No, I don't think so. As I added above, changing say/respond to return Promise is necessary for TwoPhaseApp idea. But I think there is no conflict with this proposal.

I think there's a substantial amount of added complexity to deliver TwoPhaseApp alongside the existing App. I would like to make one common path for developers using Bolt on FaaS or on a long-lived process, as much as possible.

My conclusion so far is supporting FaaS correctly with the current App design is almost impossible. That's why I came to the idea to create yet another App implementation.

If we change the API of App to what you've shown as TwoPhaseApp in the next major version, I feel that there's a significant added complexity burden on the users running Bolt in a long-lived process that otherwise they wouldn't have to deal with.

This is true. I agree that it won't be welcomed by everyone to change App APIs to have two ways to build apps.

I don't know if this tradeoff is worth it, so that pushes me back to delivering both App and TwoPhaseApp classes separately. This seems like a much higher documentation and support cost, and I'm looking for ways to lower it. How do you feel about these tradeoffs and costs?

I understand the points. Considering limited resources, it's not so easy to support yet another Bolt use cases at this moment.

One feasible idea is creating a FaaS support package as a 3rd-party one. That means the package won't be managed by @slackapi. The authors of such packages can be anyone. If someone starts such projects, I'll be more than happy to support them.

As a first step, I'll publish my prototype implementation in my own repository and will start a new GitHub issue in this repository. I hope it will be a good starting point for people interested in FaaS support. Even though my prototype is still in very alpha quality, checking it could be helpful for the folks that may start new projects.

@barlock
Copy link
Contributor

barlock commented Dec 26, 2019

But if you trust that the App does what it's supposed to, your unit tests shouldn't even have to test the flow of an event through the receiver, into the listener, and then the error back to the global error handler. You could simply test your listener as a "pure" function.

You're approach works, the spec would simply need to allow for multiple error receivers. 👍


Context for my thinking:

I agree this is generally a good approach and for error handling possibly it's the right approach, however, I find that it's useful to include as many happy path tests through as much of the code as possible. It helps a ton for catching bugs in libraries, quickly testing for compatibility of library upgrades (too many modules don't semver correctly), and ensuring that you're catching hard-to-find bugs at integration points between library and application code. Also, mocking code is a pain, it usually requires me to understand the library to mock it, which takes a bunch of time.

Given how tightly coupled listener functions are to bolt, most pure function test cases end up "testing that the code is the code" (eg: I called ack once, and say with these args), rather than the business logic.

@barlock
Copy link
Contributor

barlock commented Jan 3, 2020

What's next for making this proposal a reality? It would be great to start working on it.

To summarize the points of agreement I'm seeing:

  • next, ack, say, respond will all switch to promise based. (Did I miss any?)
  • next will act like koa-compose where logic after await next() will happen going up the middleware stack
  • Processing algorithm: Global middleware down => All listeners in parallel down then up => Global Middleware up. Errors anywhere are sent to the global error handler.
  • Receivers remove their event emitter and instead call app.processEvent which returns Promise<void> resolving after all middleware is done (completed or errored).
  • FaaS support will not be a priority for this proposal

Things still needing discussion: (My commentary as subpoints)

  • Function signatures for listener middleware
    • Receivers will break, technically a major semver update, do we communicate that to users with a change in listener middleware signature as well?
    • I think this would make the js community more cohesive (common signatures across projects) and make code-reuse easier (koa-compose vs handwritten). I don't feel strongly about this.
  • Function signature for receivers calling App.processEvent
    • Still feels wrong to me that the promise never rejects. Let me try another scenario to try to persuade. In the event of an error before an ack (fetching secrets, state, etc.), letting the receiver respond immediately with a non-200 error code will inform users that something wen't wrong with their interaction faster (Yey better ux) rather than spinning and timing out. Also, Slack's servers don't need to hold onto those connections as long. I would think receivers would only do this type of action rather than handle the error. The error would still also propagate to the global handler in parallel. I feel medium strong about this.
    • Seems like a good opportunity to clear a TODO and have processEvent handle signature verification rather than the receiver.

Did I miss anything? If there are parts that are good to go already, what are the steps to getting code changes? next branch to make prs into?

@selfcontained
Copy link
Contributor

Thanks for putting this all into a proposal @aoberoi ! All of the proposals around changes to make listeners and middleware async sound great to me. I was also particularly excited about this side-effect that I didn't even consider

next() returns Promise. The promise resolves when the next middleware in the chain’s returned promise resolves. This builds a chain through middleware where early middleware gain a signal for when processing has completed in all later middleware. That signal can then be passed onto the receiver. It also means middleware use await next() and follow with more code to post-process the event (such as logging middleware). This replaces the usage of next() where a callback function is passed in.

@aoberoi
Copy link
Contributor Author

aoberoi commented Jan 9, 2020

Thanks for all the input everyone! At this point, this proposal is being accepted and will move on to implementation. I'll update the top comment to describe the final state, but here's a summary for historical record:

Updates:

  1. say() and respond() will return Promises. The rationale is the same as making ack() return a Promise - visibility for errors during async operations. The cost of not updating the app code to use the returned Promise is negligible, since not handling them leaves you in roughly the same position as you are today (risk of UnhandledPromiseRejection at the process level).

  2. This proposal is not meant to address FaaS in its entirety. Specifically, Context#defer() is a bad idea. It might still help solutions for FaaS to become more feasible.

  3. The global error handler is expected to return a Promise or undefined (undefined is treated as Promise.resolve(undefined)). The Promise returned from App#processEvent() rejects when the return value from the global error handler rejects. The default implementation of the global error handler logs the error details and then rejects with the same error that triggered it. [@barlock: I think this solution accomplishes what you desire without incurring a confusing flow by triggering the global error handler and rejecting for the Receiver at the same time. I'd like to avoid that because it means there's no way for your global error handler to "recover" and allow the HTTP response to complete normally. Please verify whether this meets your needs].

Non-changes:

  1. Listeners will be expected to return a Promise or undefined (undefined is treated as Promise.resolve(undefined)). ack() will return a Promise. The cost of not updating listeners for these changes is negligible, since the observed behavior will be about the same as before. This will be a breaking change.

@barlock
Copy link
Contributor

barlock commented Jan 9, 2020

Thoughts on 3:

I think this solution accomplishes what you desire without incurring a confusing flow by triggering the global error handler and rejecting for the Receiver at the same time.

I agree that the error handlers shouldn't be called at the same time. They should have an order.

App developers should have a safe place to call process.exit(1) to signal to their runtime to restart. Ideally they wouldn't have to write or modify a receiver to do that. Depending on how I'm reading the chain of events, the proposal wouldn't allow a developer a chance to do that and give the receiver a chance to complete the http response normally.

It would though if App#processEvent was responsible for handling that in the event of an error.

(Forgive the psudo-code)

  async processEvent(event: ReceiverEvent) {
    try {
      verifySignature(event.body); // Throws if fails validation

      composeMiddleware(
        ...this.globalMiddleware,
        Promise.all(
          this.listeners.map(listener => composeMiddleware(listener)),
        ),
      );
    } catch (err) {
      // This needs a way to signal to `ack` that it should send a 500, not attached to this
      // Also, the receiver would need to be able to handle multiple `ack` calls, only doing anything on the first call.
      await event.ack(err); 

      this.globalErrorHandler(err);

      throw err;
    }

This would mean that the receiver would probably only want to retrow in the event of an error, and my global error handler would probably look like:

    app.error((err) => {
      if (err instanceof SomeErrorICanHandle) {
        // handle it
      } else if (receiver instanceof JestReceiver) { //or some way to know it's a test time
        throw err; // lets my test Receiver reject and I can inspect it
      } else {
        process.exit(1); // Reboot the things
      }
    })

I don't love it, not sure why, but I can't think of anything else that works. Thoughts?

@barlock
Copy link
Contributor

barlock commented Jan 17, 2020

An interesting note about discoveries when implementing this: #375 (comment)

Not sure where to discuss, here or in PR.

@aoberoi
Copy link
Contributor Author

aoberoi commented Feb 24, 2020

App developers should have a safe place to call process.exit(1) to signal to their runtime to restart.

From my understanding, this is considered a bad practice and should not be the recommended way to signal the process to terminate (and your process manager to possibly restart). From the Node.js Docs:

If it is necessary to terminate the Node.js process due to an error condition, throwing an uncaught error and allowing the process to terminate accordingly is safer than calling process.exit().

I think we might be talking about two different scenarios here:

  1. Graceful exit typically happens when an external signal (like Ctrl+C in the terminal) tells the process its time to shutdown and in those cases you want to allow the app enough time to finish its work which includes HTTP responses (or more generally the full bubble of the event back up the middleware chain into the receiver).
  2. Unexpected exit typically occurs when something went so bad that the program is in an unknown state and therefore it would be better to shut down than to continue.

For a graceful exit which involves a Receiver that implements an HTTP server, the Receiver would need to be aware of the external signal. It needs to tell its own HTTP server that it should no longer accept any incoming connections, and ensure existing open connections are closed in a timely manner. Closing them doesn't necessarily mean forcing the HTTP response to indicate success. For example, if I want my server to gracefully exit within 1 second, then I'd likely want to terminate connections which take longer by using a non-200 response since the work was not complete and the server is closing. That would allow Slack to either retry the event or show the user an error. I think this is a good enhancement for the default ExpressReceiver (and others) and I'd be happy to capture that in a new issue.

For an unexpected exit, the advice about throwing an Error to terminate the process is relevant. However, this sort of "my entire program is in an unknown state" situation is extremely uncommon. More typically, within the context of a single incoming event you could end up in an unknown state. In these situations, it's more common to bail on single event and keep the process running to serve other events. These can both be accomplished with the proposal without implementing a Receiver:

app.error(async (error) => {
  // when my handler wants to signal that the entire program is in a bad state and should terminate,
  // it will (re)throw a error with the globalFailure property set to true
  if (error.globalFailure) {
    // TODO: clean up global resources/handles
    process.nextTick(() => {
      // By indirecting through nextTick the call stack is outside the async function.
      // The following throw will cause process termination.
      throw new Error('Global Failure');
    });
    return;
  }

  // all other failures should only impact the processing of this event (more common)
  // TODO: clean up any event-specific resources/handles. middleware has already gotten a chance to clean up any resources it may have allocated
  // The following throw will reject the returned promise and allow the Receiver to determine what to do
  // NOTE: the default ExpressReceiver will respond with a 5xx level HTTP status
  throw new Error('Event Failure');
});

Depending on how I'm reading the chain of events, the proposal wouldn't allow a developer a chance to [terminate the process] and give the receiver a chance to complete the http response normally.

I've gotta be honest, I can't think of a use case for this. I think you're talking about an unexpected exit except, regardless of whether its a global failure or a failure for a single event, I can't see why you'd want the Receiver to complete normally. Yet, if that's what you want to do, you are able to do it with this proposal. You can use process.nextTick() to throw an error and return undefined (or a resolved Promise) as I've shown above. EDIT: I just realized this is wrong because process.nextTick()'s callback can execute before the Receiver has a chance to finish the HTTP response. Can you help me find a real use case for this?

The last example seems to indicate that you want a way for a custom Receiver to emit errors that can be inspected by the developer after the receiver's done its job. I don't see any use cases where this is a general need (but I'd like to hear them). Would you be able to handle that with an API on your custom Receiver? Then your receiver could choose to use the rejection from App#processEvent() by sending to a dedicated callback and still ack successfully. For example:

const appOptions = { ... };
if (process.env.NODE_ENV === 'test') {
  const receiver = new JestReceiver(...);
  receiver.onFailure((error) => {
    // some last resort processing and/or inspection
  });
  appOptions.receiver = receiver;
}
const app = new App(appOptions);

@barlock
Copy link
Contributor

barlock commented Feb 25, 2020

Sorry that process.exit(1) was distracting, I'll replace it with the psudocode gracefullExit() from here on out.

I've gotta be honest, I can't think of a use case for this. I think you're talking about an unexpected exit except, regardless of whether its a global failure or a failure for a single event, I can't see why you'd want the Receiver to complete normally.

What's interesting to me is that from the perspective of my runtime (currently k8s, but I think the same is true for lambda and heroku), restarts of my application are basically free. It's fast, and doesn't loose any uptime by doing it. So I actually view Unexpected exit a little differently than you, or more specifically, I'll add a possibly third type of exit, Unrecognized Error. Given that restarts are "free", whenever I encounter an error that I don't know how to handle, the safest thing is for me to shutdown and let my runtime reboot. (How that shutdown happens will differ very wildly based on each individual's stack and you pointed out that process.exit(1) isn't a good way to do that 👍). This is my use case.

Really, all I'm looking for is that developers using bolt, not bolt the framework, should have the last say in error handling.

In terms of why I'd want my http requests to complete normally, maybe normally is the confusing word in there? Just that it doesn't hang up. You kinda alluded to this already that the receivers should have a chance to complete any requests and respond with non-200s, I think we're talking about the same thing?

The last example seems to indicate that you want a way for a custom Receiver to emit errors that can be inspected by the developer after the receiver's done its job. I don't see any use cases where this is a general need (but I'd like to hear them).

Testing is the only one I know of, I think it's a very important one however. Related to above, I want to shutdown on any unrecognized error.

What about the case of a recognized error? I'd like to test that I'm handling it correctly.

Would you be able to handle that with an API on your custom Receiver? Then your receiver could choose to use the rejection from App#processEvent() by sending to a dedicated callback and still ack successfully.

I think so? I'm not sure exactly what your trying to discuss here.

Could you leave comments on my pull request to implement this so we can discuss over actual code? It's much easier to follow along and more concrete.

Possibly the best thing would be to implement failing test cases to show points of contention. I'm not following everything.

Relevant lines: https://github.com/slackapi/bolt/pull/380/files#diff-0286657a0789ea9446fa3de979ff3e09R569

@barlock
Copy link
Contributor

barlock commented Feb 25, 2020

I'm also still interested in a discussion over this comment (pasted below): #375 (comment)

An interesting thing to note here, it became quickly obvious to me why koa-compose chose to have their middleware signatures be (ctx, next) instead of just (ctx) with next inside ctx.

The reason here is that one of the purposes of ctx is to be passed around and modified in place, it's very awkward to do that when the processing has to modify the context mid stream. (As next needs to basically be the next function in the chain)

That lead to this object spread here. This will cause some (possibly) weirdness when it comes to listener middleware. Each listener middleware will have it's own context separate from the rest of them... sometimes. It would depend on the shape of context and how it was accessed and changed in processing.

I can see this being sorta helpful (partial isolation of middleware running at the same time), but could lead to unexpected results if folks are used to writing koa style middleware.

For this reason I'd still recommend using koa-style middleware rather than this one. But, if pseudo-backwards-compatibility is still more important, this works for the most part.

There's some unavoidable unexpected behavior when you embed next into the context object that could be avoided if next is it's own parameter.

@aoberoi
Copy link
Contributor Author

aoberoi commented Mar 27, 2020

We've gone ahead and landed the implementations we've converged on in the @slack/bolt@next branch! Closing this issue because I believe the discussion is over.

@aoberoi aoberoi closed this as completed Mar 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion M-T: An issue where more input is needed to reach a decision
Projects
None yet
Development

No branches or pull requests

5 participants