Enable auto-scaling for sidekiq #20

Open · michaelwittig opened this issue Dec 2, 2022 · 8 comments

Labels: enhancement (New feature or request)

Comments

@michaelwittig (Contributor)

See the earlier discussion in #1:

    My Sidekiq task is regularly pegging at 100% CPU utilization... definitely need some guidance on configuring scaling...

Originally posted by @scrappydog in #1 (comment)

    @scrappydog Same for us. I'm not sure if that is an issue. It likely doesn't matter if the background tasks utilize all resources as long as they finish without much delay. For us, we see spikes to 100%, but only for minutes. Do you see the same pattern?

Screenshot 2022-11-28 at 09 42 10

Originally posted by @michaelwittig in #1 (comment)

    That looks very similar to utilization on my instance.

    My inner system admin really "wants" to add another task... but I agree, as long as jobs are completing in a reasonable time it's not an immediate issue.

    BUT we are running tiny instances for testing... we NEED a way to scale up... :-)

Originally posted by @scrappydog in #1 (comment)

    I bumped the CPU allocation up on the Sidekiq task to CPU .5 vCPU | Memory 3 GB... 

    This feels happier for now... but it doesn't address the real scalability question...

Originally posted by @scrappydog in #1 (comment)

    ![image](https://user-images.githubusercontent.com/125875/204807795-541c039e-3b58-4bb2-922f-5f1e3d528938.png)

    Upgraded about halfway through this graph... definitely a lot better!

Originally posted by @scrappydog in #1 (comment)

@scrappydog

image

Status update after a couple of days with the Sidekiq task at 0.5 vCPU | 3 GB memory.

@compuguy commented Dec 4, 2022

There is a way to do auto-scaling for most of the Sidekiq queues, except for the scheduler: you can only have one of those. This article helped me with some of my experiments with scaling Sidekiq: https://nora.codes/post/scaling-mastodon-in-the-face-of-an-exodus/. At a minimum you need 1 GB of memory for each instance. I'm not sure about the thread count, though; the default is 5, but it might make sense to reduce it to maybe 2, depending on the number of CPU units each container instance has (I'm using 0.5 for each).
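For what it's worth, a minimal sketch of what target-tracking auto-scaling for the (non-scheduler) Sidekiq service could look like in the template. The resource names (`Cluster`, `SidekiqService`), the capacity range, and the 70% CPU target are illustrative assumptions, not values from this repo:

```yaml
# Hypothetical sketch: scale the Sidekiq worker service on average CPU.
# Assumes an ECS cluster `Cluster` and an ECS service `SidekiqService` exist.
SidekiqScalableTarget:
  Type: 'AWS::ApplicationAutoScaling::ScalableTarget'
  Properties:
    ServiceNamespace: ecs
    ScalableDimension: 'ecs:service:DesiredCount'
    ResourceId: !Sub 'service/${Cluster}/${SidekiqService.Name}'
    RoleARN: !Sub 'arn:aws:iam::${AWS::AccountId}:role/aws-service-role/ecs.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_ECSService'
    MinCapacity: 1   # never scale to zero; jobs must keep draining
    MaxCapacity: 4   # illustrative upper bound
SidekiqScalingPolicy:
  Type: 'AWS::ApplicationAutoScaling::ScalingPolicy'
  Properties:
    PolicyName: sidekiq-cpu-target-tracking
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref SidekiqScalableTarget
    TargetTrackingScalingPolicyConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
      TargetValue: 70.0       # assumed target; tune to taste
      ScaleOutCooldown: 60    # react quickly to queue spikes
      ScaleInCooldown: 300    # scale in slowly to avoid flapping
```

Target tracking keeps the service's average CPU near the target by adjusting DesiredCount, which matches the spiky-CPU pattern in the charts above better than fixed capacity would.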

@vesteinn commented Dec 8, 2022

Were you able to integrate these changes into the CloudFormation configuration, @compuguy? After increasing the Cpu and Memory flags I'm still seeing full load.

@compuguy commented Dec 11, 2022

I honestly went down a different road, @vesteinn. I moved the mail and scheduler queues to their own separate service with 0.25 vCPU and 0.5 GB of memory. You can only have one scheduler queue per Mastodon instance, so I left it together with the mail queue, which wasn't using much CPU or RAM. Then I made the SidekiqService container run the rest of the needed queues via AppCommand: 'bash,-c,bundle exec sidekiq -q default -q pull -q push -q ingress' with 0.5 vCPU and 1 GB of memory (see: https://github.com/compuguy/mastodon-on-aws/blob/istoleyourpw-deploy/mastodon.yaml#L269).

Memory seems to be good, but I still get way too many CPUUtilizationTooHighAlarms, especially when trends are updating. On the bright side, it is scaling up the instances when needed. I'm thinking of going to 1 vCPU, which would require upping the memory per container to 2 GB. Here's a CPU utilization chart for the past week (a sketch of the queue split follows the screenshot):

Screenshot from 2022-12-11 17-54-56
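A rough sketch of that split as two Fargate task definitions, assuming Mastodon's standard queue names (default, pull, push, ingress, mailers, scheduler). Resource names, the image tag, and the `-c 2` concurrency (picking up the thread-count idea from earlier) are hypothetical; execution role, environment, and logging are omitted for brevity:

```yaml
# Hypothetical sketch of the queue split described above.
SidekiqWorkerTask:            # scalable workers: default, pull, push, ingress
  Type: 'AWS::ECS::TaskDefinition'
  Properties:
    RequiresCompatibilities: [FARGATE]
    NetworkMode: awsvpc
    Cpu: '512'                # 0.5 vCPU
    Memory: '1024'            # 1 GB
    ContainerDefinitions:
      - Name: sidekiq
        Image: 'tootsuite/mastodon:v4.0.2'   # example tag
        Command: ['bash', '-c', 'bundle exec sidekiq -c 2 -q default -q pull -q push -q ingress']
SidekiqSchedulerTask:         # singleton: scheduler + mailers
  Type: 'AWS::ECS::TaskDefinition'
  Properties:
    RequiresCompatibilities: [FARGATE]
    NetworkMode: awsvpc
    Cpu: '256'                # 0.25 vCPU
    Memory: '512'             # 0.5 GB
    ContainerDefinitions:
      - Name: sidekiq-scheduler
        Image: 'tootsuite/mastodon:v4.0.2'
        # Only one scheduler process is allowed per Mastodon instance, so the
        # service running this task keeps DesiredCount: 1 and is never auto-scaled.
        Command: ['bash', '-c', 'bundle exec sidekiq -q scheduler -q mailers']
```

The point of the split is that only the worker service is attached to a scaling policy; the scheduler/mailers service stays fixed at one task.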

@pegli (Contributor) commented Dec 22, 2022

I wanted to share an incident report I created after a member of my instance reported problems uploading videos:

https://hub.montereybay.social/blog/degraded-service-video-transcoding-failures.html

tl;dr: iPhone video transcoding with ffmpeg was causing CPU and memory usage to spike on the Sidekiq service. Changing vCPUs from 0.25 -> 0.5 and memory from 0.5 GB -> 1 GB in the Task Definition and redeploying that service resolved the issue, at least temporarily.
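For reference, that change maps to the following task-definition values (Fargate expresses CPU in units, where 1024 units = 1 vCPU, and only permits certain CPU/memory pairings); the resource name is a placeholder:

```yaml
SidekiqTask:
  Type: 'AWS::ECS::TaskDefinition'
  Properties:
    Cpu: '512'      # was '256' (0.25 vCPU -> 0.5 vCPU)
    Memory: '1024'  # was '512' (0.5 GB -> 1 GB; a valid pairing with 512 CPU units)
```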

My instance is still pretty small at 19 users. If anyone would like me to report additional statistics, let me know what you want to see -- I'm happy to share operational metrics.

@michaelwittig (Contributor, Author)

@pegli We increased memory from 0.5 to 1 GB in #16. The CPU is still at 0.25 vCPU, which is not a lot of horsepower :)

Yes, we are interested in metrics! RequestCountPerTarget for both ALB target groups (web and streaming), as well as CPU and memory of web, streaming, and sidekiq.

@pegli (Contributor) commented Dec 22, 2022

At your service! https://hub.montereybay.social/Operations.html now has a public CloudWatch dashboard with all of those metrics.

@michaelwittig (Contributor, Author)

@pegli That's cool :) Do you mind sharing the JSON definition (open the dashboard in the CloudWatch UI, click Actions -> View/edit source) of the dashboard? We could add it to the template.
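If it helps, a rough sketch of how such a dashboard could be embedded in the template once the JSON source is known. The widget layout, metric dimensions, and resource names (`Cluster`, `SidekiqService`, `WebTargetGroup`, `StreamingTargetGroup`) are guesses covering the metrics requested above, not pegli's actual dashboard:

```yaml
# Hypothetical sketch: CloudWatch dashboard with the requested metrics.
Dashboard:
  Type: 'AWS::CloudWatch::Dashboard'
  Properties:
    DashboardName: !Sub '${AWS::StackName}-mastodon'
    DashboardBody: !Sub |
      {
        "widgets": [
          {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
              "title": "Sidekiq CPU / memory",
              "region": "${AWS::Region}",
              "stat": "Average",
              "metrics": [
                ["AWS/ECS", "CPUUtilization", "ClusterName", "${Cluster}", "ServiceName", "${SidekiqService.Name}"],
                ["AWS/ECS", "MemoryUtilization", "ClusterName", "${Cluster}", "ServiceName", "${SidekiqService.Name}"]
              ]
            }
          },
          {
            "type": "metric",
            "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
              "title": "RequestCountPerTarget (web / streaming)",
              "region": "${AWS::Region}",
              "stat": "Sum",
              "metrics": [
                ["AWS/ApplicationELB", "RequestCountPerTarget", "TargetGroup", "${WebTargetGroup.TargetGroupFullName}"],
                ["AWS/ApplicationELB", "RequestCountPerTarget", "TargetGroup", "${StreamingTargetGroup.TargetGroupFullName}"]
              ]
            }
          }
        ]
      }
```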
