Provide a `pause()` helper to eachMessage/eachBatch #1364

brianphillips · 2022-05-17T17:52:29Z

This function will take care of pausing (and optionally, resuming) message consumption on the current topic/partition when processing messages either within the eachMessage or eachBatch handler functions.

src/consumer/runner.js

docs/Consuming.md

Nevon · 2022-05-23T16:16:54Z

What's the intended use-case that makes increasing the API surface here worth it? The example from the documentation, just for comparison:

await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
        try {
            await sendToDependency(message)
        } catch (e) {
            if (e instanceof TooManyRequestsError) {
                consumer.pause([{ topic, partitions: [partition] }])
                setTimeout(() => {
                    consumer.resume([{ topic, partitions: [partition] }])
                }, e.retryAfter * 1000)
            }
            throw e
        }
    },
})

And this proposal:

await consumer.run({ eachMessage: async ({ topic, message, pause }) => {
    try {
        await sendToDependency(message)
    } catch (e) {
        if (e instanceof TooManyRequestsError) {
            pause(e.retryAfter * 1000) // returns control to KafkaJS until timeout has expired
        }
        throw e
    }
}})

To me they look very similar. The new proposal is a bit terser, but I'm not sure it's really eliminating much complexity. You still need, as a user, to understand that it's pausing a specific topic-partition, even if that parameter is now hidden internally.

The one benefit that I find interesting is that it allows a way to communicate back to the calling code that we actually want to exit out of the message loop without having to throw an error and everything this entails. Instead, we have a way for the consumer to know that we should continue operating as usual, just without processing any further fetched messages on that topic for as long as it's paused. The way it's communicated, by throwing a special error, I'm not so keen on, but the same thing can be achieved in other ways.

Then there's the thing about keeping the timer for resuming. Any time I see stateful code that includes timers, I get a bit nervous, because it's so hard to predict whether things are in a valid state to proceed with what you're doing when the timer fires. For example, does the consumer still exist? Is it still running? Is it still subscribed to the same topics? Is it still assigned the relevant partition? What if the user explicitly pauses the topic afterwards? And so on. Some of these might not be relevant in this case - I'm just pointing out that whenever you're dealing with doing something in the future based on nothing but a timer, things tend to get complicated and lead to complex bugs down the line.
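To make the timer concern concrete, here is a minimal sketch of the checks a deferred resume would need once the timer fires. Everything in it is an assumption for illustration: `scheduleGuardedResume`, the `state` object with its `running`, `assigned` and `explicitlyPaused` fields, and the injectable `schedule` parameter are hypothetical names, not part of KafkaJS.

```javascript
// Hypothetical sketch: `state` stands in for whatever the consumer tracks internally.
// `schedule` defaults to setTimeout but can be swapped out (for example in tests).
const scheduleGuardedResume = ({ state, topic, partition, delayMs, resume, schedule = setTimeout }) =>
  schedule(() => {
    const isTarget = tp => tp.topic === topic && tp.partition === partition

    // Re-validate everything at fire time, because the world may have changed
    // between scheduling the timer and the timer firing.
    const stillValid =
      state.running && // the consumer is still running
      state.assigned.some(isTarget) && // the partition is still assigned to us
      !state.explicitlyPaused.some(isTarget) // the user has not paused it on purpose since

    if (stillValid) {
      resume(topic, partition)
    }
  }, delayMs)
```

Even this small guard needs three separate pieces of state to be kept accurate, which is the kind of complexity the comment above is warning about.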


brianphillips · 2022-06-01T22:04:29Z

@Nevon thank you for your feedback. I've tweaked the implementation and interface a bit based on your suggestions (timers are left to the user, no special exceptions are used for flow control).

> What's the intended use-case that makes increasing the API surface here worth it?

I do think the primary benefit of this means of pausing/resuming is being able to stop processing messages from a batch in the middle without having to keep track of whether the current message is from a topic/partition that was paused. The other thing this helps with is the ability to pause processing without passing around a reference to the consumer object (a minor issue, but one that I find annoying).

> ...but the same thing can be achieved in other ways.

Agreed. Would an implementation that simply checks whether the current topic/partition is paused after each eachMessage invocation be preferable to a callback that calls consumer.pause for you and then sets a boolean to trigger the loop to break? This would allow a user to just use the existing consumer.pause(...) method (the caveat being that the consumer object has to be passed into whatever implements your eachMessage handler, as mentioned above).

> Then there's the thing about keeping the timer for resuming. Any time I see stateful code that includes timers, I get a bit nervous

💯 This is a very good point, and it was short-sighted on my part. I've left one vestigial convenience in here for resuming (the pause() callback itself returns a callback for resuming the same topic/partition, which could be passed to setTimeout(...) if desired), but it could easily be removed as well if you prefer to just make use of consumer.resume() (although, again, this requires that the caller have access to the consumer object, just as they would if we relied simply on consumer.pause as described above).
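As a rough illustration of that shape (a sketch, not the code in this PR), the helper can be a thin closure over the consumer that binds the topic and partition once. `createFakeConsumer`, its `isPaused` method, and `createPauseHelper` are hypothetical names made up for this example; only the `pause`/`resume` argument shape (`[{ topic, partitions }]`) is taken from the documentation example quoted earlier in the thread.

```javascript
// Hypothetical in-memory stand-in for a consumer; it models only pause/resume bookkeeping.
const createFakeConsumer = () => {
  const paused = new Set()
  const key = (topic, partition) => `${topic}:${partition}`
  return {
    pause(topicPartitions) {
      for (const { topic, partitions } of topicPartitions) {
        partitions.forEach(p => paused.add(key(topic, p)))
      }
    },
    resume(topicPartitions) {
      for (const { topic, partitions } of topicPartitions) {
        partitions.forEach(p => paused.delete(key(topic, p)))
      }
    },
    // Assumed helper, present only so the example can inspect the state.
    isPaused: (topic, partition) => paused.has(key(topic, partition)),
  }
}

// Sketch of the convenience: delegate to consumer.pause for the bound topic-partition
// and hand back a function that resumes that same topic-partition.
const createPauseHelper = (consumer, topic, partition) => () => {
  const target = [{ topic, partitions: [partition] }]
  consumer.pause(target)
  return () => consumer.resume(target)
}

// Usage: the caller owns any timer.
const demoConsumer = createFakeConsumer()
const demoPause = createPauseHelper(demoConsumer, 'orders', 0)
const demoResume = demoPause()
demoResume() // or, for example: setTimeout(demoResume, retryAfterMs)
```

Because the helper holds no state of its own, pausing through it or through the consumer directly ends up in the same bookkeeping.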

There are a few paths forward, as I see them:

1. You like what you see here, maybe with some additional tweaks but we proceed
2. You'd prefer consumers use existing consumer.pause(...) and consumer.resume(...) methods without the convenience methods provided here but you like the idea of being able to stop processing a batch of messages early by checking for paused topics/partitions after the eachMessage handler is called
3. You don't find any of this worthwhile in which case I'll make do with the existing API and count this as a learning exercise since I understand some of the internals a lot better than when I started, and that will be perfectly fine

Let me know what you think, and thanks so much for your time reviewing and giving quality feedback.

Nevon · 2022-06-02T06:13:59Z

If the provided pause callback is just a simple convenience over consumer.pause so that you don't have to provide the topic and partition, then I think it's a reasonable convenience with a small downside. I like the design of pause returning a corresponding resume function.

However, we need to ensure that this functionality works the same whether you are pausing using pause or consumer.pause. Essentially, the implementation of pause can only be a simple delegation to consumer.pause. With the current design, the consumer will work differently depending on whether you've used the pause convenience function or consumer.pause, because only pause affects whether to break out of the consumer loop, which seems like a big footgun.

Funnily enough, someone else was doing similar work in #1382. What they found was that if you pause a topic-partition within the eachMessage or eachBatch functions and then throw an error, the error bubbles up to the retrier, which retries processing the same batch from the now-paused partition. There, the solution is to not invoke onBatch for paused topic-partitions (the list of paused topic-partitions can be obtained via the ConsumerGroup). The same thing can be done here by simply bailing out of handleEachMessage/handleEachBatch if the current topic-partition is paused, but rather than keeping a local paused variable, we can expose SubscriptionState.isPaused via ConsumerGroup and check whether the current topic-partition is paused through there. That way it will work the exact same way whether you're pausing through consumer.pause or the convenience pause function. Make sense?
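A minimal sketch of that idea, under stated assumptions: `createFakeConsumerGroup` and `handleEachMessageSketch` are hypothetical names, and the `isPaused(topic, partition)` signature is assumed for illustration rather than taken from the actual runner or ConsumerGroup code.

```javascript
// Hypothetical stand-in for the consumer group's pause bookkeeping.
const createFakeConsumerGroup = () => {
  const paused = new Set()
  const key = (topic, partition) => `${topic}:${partition}`
  return {
    pause: (topic, partition) => paused.add(key(topic, partition)),
    isPaused: (topic, partition) => paused.has(key(topic, partition)),
  }
}

// Sketch of a per-batch message loop that consults the shared pause state
// before handing each message to the user's handler.
// Returns the number of messages passed to eachMessage.
const handleEachMessageSketch = async ({ batch, eachMessage, consumerGroup }) => {
  const { topic, partition, messages } = batch
  let processed = 0

  for (const message of messages) {
    // Bail out no matter how the partition was paused: via the convenience
    // helper or via a direct pause call, the check reads the same state.
    if (consumerGroup.isPaused(topic, partition)) {
      break
    }
    await eachMessage({ topic, partition, message })
    processed += 1
  }

  return processed
}
```

Checking at the top of the loop covers both cases raised here: a pause issued mid-batch stops before the next message, and a retried batch for an already-paused partition processes nothing.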

Nevon · 2022-06-02T07:06:56Z

Just to set expectations, I will be going on vacation tomorrow and will be back on June 25th. So if you haven't heard from me before then, I'm not ghosting you, I'm just enjoying a coconut drink on a beach. 😄

If a topic/partition is paused within the `eachMessage` or `eachBatch` callback, we want to stop processing messages and avoid retrying the batch or processing additional messages in the current batch.


Nevon

Alright, I am back from vacation and have gone through this another round. Looks good to me overall, just have some notes on the comms as well as one possibly important note on handling errors that require rejoining.

Nevon · 2022-06-28T08:17:11Z

Merged master and resolved the conflicts. With #1382 the handling of paused partitions during error handling is already taken care of.

Nevon reviewed May 23, 2022


brianphillips added 3 commits June 1, 2022 16:46

Provide a pause(optionalTimeout) helper to eachMessage/eachBatch

c80dc75

This function will take care of pausing (and optionally, resuming) message consumption on the current topic/partition when processing messages either within the `eachMessage` or `eachBatch` handler functions.

doc tweak

e092c15

Rework slightly to not make use of timers or special exception classes

96ad17c

brianphillips force-pushed the pause-resume-helper branch from d2d48a9 to 96ad17c Compare June 1, 2022 21:46

Check status of current topic/partition, no matter how they were paused

e8feaf3

If a topic/partition is paused within the `eachMessage` or `eachBatch` callback, we want to stop processing messages and avoid retrying the batch or processing additional messages in the current batch.

brianphillips changed the title ~~Provide a pause(optionalTimeout) helper to eachMessage/eachBatch~~ Provide a pause() helper to eachMessage/eachBatch Jun 2, 2022

fix missing method on mock consumerGroup

3a9d472