Add job queue system to generate books #489

Open · wants to merge 6 commits into main
Conversation

samwilson
Member

Use the Symfony Messenger component to pass ebook generation requests from the web front-end to a server-processed queue.

Bug: T345406

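For reference, a minimal sketch of how a request could travel through Messenger (the class, property, and function names here are illustrative, not necessarily the ones used in this patch):

```php
<?php
// Illustrative sketch only: the real message and handler names in this PR may differ.

use Symfony\Component\Messenger\Attribute\AsMessageHandler;
use Symfony\Component\Messenger\MessageBusInterface;

// Value object describing one ebook generation request.
class GenerateBookRequest
{
    public function __construct(
        public readonly string $page,
        public readonly string $lang,
        public readonly string $format,
    ) {
    }
}

// Web front-end: dispatch the request onto the queue instead of generating inline.
function requestGeneration(MessageBusInterface $bus, string $page, string $lang, string $format): void
{
    $bus->dispatch(new GenerateBookRequest($page, $lang, $format));
}

// Worker side: picked up by the messenger:consume command.
#[AsMessageHandler]
class GenerateBookHandler
{
    public function __invoke(GenerateBookRequest $request): void
    {
        // ...run the actual ebook generation here...
    }
}
```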
@Tpt
Collaborator

Tpt commented Dec 18, 2023

Thank you!

I am afraid this patch will encounter a problem: if the messenger system works like a queue (first in, first out) and the submission rate is faster than the processing rate, the queue will grow without bound and the workers will end up processing only old requests for which an error 500 has already been returned.

Some ideas to mitigate that:

  • drop jobs older than x seconds
  • move to a stack (last in, first out); however, this gives an advantage to a user who spams the server with requests for the same book...
  • prioritise cheap jobs: we might choose to process PDFs only once all EPUB requests have been handled, to make sure all the cheap operations get done.

We also have a load-balancing problem: EPUB jobs are likely network-bound and PDF jobs CPU-bound, so sharing the same worker pool might lead to wasted CPU if we process a lot of EPUBs, or to overload if we process a lot of PDFs.

@samwilson
Member Author

Good points, thanks!

We could have two queues, one for EPUB and one for PDF (or all derived formats).

But that doesn't directly help with the problem of not processing jobs fast enough. Dropping jobs that were requested more than 30 seconds ago sounds like it might be the best option. I guess ultimately this is the crux of the issue: we don't have the resources to attend to every request, but at the moment we have no way of failing before running out of resources. If we drop old requests, then many users would get generation-failed errors, but the service would stay up and serve as many as it can. (And if we wanted to serve more, we'd add separate servers.)

Also, it seems like ideally we'd have a unique-job structure, where it wouldn't matter if someone spams requests: if they're for the same book + options, they'd just reuse the existing job. That can't work unless we store the generated book, though, so perhaps that should be a subsequent change? Or am I thinking about this wrongly?

Anyway, I'll start by a) using two queues; and b) dropping old jobs after 30 seconds.
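Roughly what I have in mind for the drop-old-jobs part (a sketch; it assumes the message carries the time the request was made, and the names and the 30-second cut-off are just for illustration):

```php
<?php
// Sketch: skip jobs that have been sitting in the queue for more than 30 seconds,
// on the assumption that the web request has already timed out by then.

use Symfony\Component\Messenger\Attribute\AsMessageHandler;

#[AsMessageHandler]
class GenerateBookHandler
{
    private const MAX_AGE_SECONDS = 30;

    public function __invoke(GenerateBookRequest $request): void
    {
        // Assumes GenerateBookRequest records a Unix timestamp when it is dispatched.
        if (time() - $request->requestedAt > self::MAX_AGE_SECONDS) {
            return; // Too old: the user has already seen an error, so don't waste a worker on it.
        }
        // ...generate the book...
    }
}

// Routing EPUB and PDF messages to two separate transports would then live in
// config/packages/messenger.yaml under framework.messenger.transports/routing.
```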

@Tpt
Collaborator

Tpt commented Dec 19, 2023

Dropping jobs that were requested more than 30 seconds ago sounds like it might be the best option

What about the case where we have a lot of jobs in the queue, so we only get to process jobs that are just short of the TTL (30 s)? The time before we return a 500 error is also ~30 s, so we would only be processing jobs for which a 500 error has very likely already been returned before the processing finishes.

Instead of a complex queue what about:

  • a rate limiter that rejects requests if more than X requests are currently being processed. This would let us either fail quickly or have a very high chance of the conversion succeeding. For that we would need a global atomic counter; I believe Memcache::increment and Memcache::decrement might help with that. In a follow-up, we could refine the counters by allowing a smaller number of concurrent PDFs than EPUBs.
  • a cache system: when converting a file we save it to disk, and when the same file is requested again we return the on-disk copy if it exists and "touch" it. A cron job removes files that are too old and caps the number of files. We also provide a way to purge the cache for editors who want to test their changes.

Sorry for not coming up with this suggestion sooner. What do you think about it?
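A minimal sketch of the counter idea using the PHP Memcached extension (the key name and the limit of 4 are only for illustration):

```php
<?php
// Sketch of a global concurrency limit backed by Memcached's atomic counters.

const COUNTER_KEY = 'ebook_conversions_in_progress';
const MAX_CONCURRENT = 4;

function tryStartConversion(Memcached $cache): bool
{
    $cache->add(COUNTER_KEY, 0);                // no-op if the key already exists
    $count = $cache->increment(COUNTER_KEY);    // atomic; returns the new value, or false on error
    if ($count === false) {
        return false;                           // cache unavailable: fail closed
    }
    if ($count > MAX_CONCURRENT) {
        // Too many conversions already running: undo our increment and fail fast.
        $cache->decrement(COUNTER_KEY);
        return false;
    }
    return true;
}

function finishConversion(Memcached $cache): void
{
    $cache->decrement(COUNTER_KEY);
}
```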

@samwilson
Member Author

That does sound simpler. :-)

The counter could use the existing cache service, couldn't it? Am I understanding it correctly that it'll have to a) increment when we start a conversion; b) throw an error when the count is over, say, 4; and c) decrement after creating the file?

@Tpt
Collaborator

Tpt commented Dec 20, 2023

The counter could use the existing cache service, couldn't it? Am I understanding it correctly that it'll have to a) increment when we start a conversion; b) throw an error when the count is over, say, 4; and c) decrement after creating the file?

Yes! A tricky thing is making sure to properly decrement the counter if the file creation process fails (even with things like out-of-memory errors), to avoid leaving "zombies" in the counter.
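For example (a sketch; tryStartConversion/finishConversion are the hypothetical helpers from the counter sketch above), a try/finally covers ordinary exceptions, though a hard crash such as the process being killed for running out of memory would still leave a stale increment behind:

```php
<?php
// Sketch: always release the counter slot, even if generation throws.
// A fatal error (e.g. the PHP process hitting its memory limit) would still skip
// the finally block, so some form of expiry or cleanup is needed on top of this.

function generateWithLimit(Memcached $cache, callable $generate): string
{
    if (!tryStartConversion($cache)) {
        throw new RuntimeException('Too many conversions in progress, please try again shortly.');
    }
    try {
        return $generate();
    } finally {
        finishConversion($cache);   // decrement even when $generate() fails
    }
}
```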

@samwilson
Member Author

Cool. I'm working on a patch. :)

Do you think it'd be good to record a timestamp for each increment of the counter? That way we could tell how long ago each slot was claimed, and if it was too long ago it could probably be disregarded and reused. E.g. a cache item called convert_processes or something, holding an array of limited length in which each value is a timestamp.
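Something like this, perhaps (a sketch; the convert_processes key is from the comment above, everything else is made up, and a real version would need Memcached::cas() or similar to keep the read-modify-write atomic):

```php
<?php
// Sketch: store one timestamp per in-progress conversion instead of a bare counter,
// so slots left behind by crashed workers can be detected and reused.

const SLOTS_KEY = 'convert_processes';
const MAX_SLOTS = 4;      // maximum concurrent conversions
const SLOT_TTL = 300;     // a slot older than this is assumed to belong to a dead process

function claimSlot(Memcached $cache): bool
{
    $slots = $cache->get(SLOTS_KEY) ?: [];
    // Drop entries that are too old to still be running.
    $slots = array_values(array_filter($slots, fn (int $t) => time() - $t < SLOT_TTL));
    if (count($slots) >= MAX_SLOTS) {
        return false;
    }
    $slots[] = time();
    $cache->set(SLOTS_KEY, $slots);  // not atomic; see the note above about cas()
    return true;
}
```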
