High CPU usage for lemmy.ml #2511
Comments
I think the best way to implement this is by adding a daily check. We then have to store these stats in the database. When sending activities, we would check a certain condition for each instance (e.g. all requests failed in the last 7 days) and, in that case, not send to that instance. Even if an instance is considered down, we can continue the regular checks, in case it comes back later.
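To make the idea concrete, here is a minimal sketch of such a check, assuming hypothetical per-instance timestamps for the last successful and failed send (the names and the 7-day rule are illustrative only, not Lemmy's actual schema):

```rust
use chrono::{DateTime, Duration, Utc};

/// Hypothetical per-instance federation stats, as they might be stored
/// in the database (field names are illustrative, not the real schema).
struct InstanceStats {
    domain: String,
    last_successful_send: Option<DateTime<Utc>>,
    last_failed_send: Option<DateTime<Utc>>,
}

impl InstanceStats {
    /// Consider an instance dead if there was at least one failure and no
    /// success within the last 7 days.
    fn is_probably_dead(&self, now: DateTime<Utc>) -> bool {
        let window = now - Duration::days(7);
        let failed_recently = self.last_failed_send.map(|t| t > window).unwrap_or(false);
        let succeeded_recently = self
            .last_successful_send
            .map(|t| t > window)
            .unwrap_or(false);
        failed_recently && !succeeded_recently
    }
}

/// Skip sending to instances that look dead; keep checking them so they
/// can recover later.
fn should_send_to(stats: &InstanceStats) -> bool {
    !stats.is_probably_dead(Utc::now())
}
```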
As for lemmy.ml, I see this in the logs: 22k pending requests is certainly a lot, but it shouldn't do anything besides take up memory (which we have more than enough of). Only 20 tasks are running at a time, which should be no problem either. So I can't really imagine that this would affect performance.
You could add columns to the instance table, maybe a …
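Purely as an illustration of what such columns could look like, here is a Diesel-style sketch with made-up column names (not Lemmy's actual schema):

```rust
// Hypothetical extension of the `instance` table; the new columns
// `last_successful_send` and `last_failed_send` are made up for illustration.
diesel::table! {
    instance (id) {
        id -> Int4,
        domain -> Text,
        // New, hypothetical columns for tracking send health:
        last_successful_send -> Nullable<Timestamptz>,
        last_failed_send -> Nullable<Timestamptz>,
    }
}
```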
You might not see it because of the daily restarts now, but something does bump up the CPU usage.
FWIW, I have a 'breaker' system in my relay that counts the number of consecutive failures per domain, and auto-fails further requests once an arbitrary limit is reached, resetting every 24 hours.
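Roughly, that kind of breaker could look like the following toy sketch (names and limits are made up; this is not the relay's actual implementation):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Toy circuit breaker: after `max_failures` consecutive failures for a
/// domain, skip further sends until the 24-hour window rolls over.
struct Breakers {
    max_failures: u32,
    window: Duration,
    // domain -> (consecutive failures, window start)
    state: HashMap<String, (u32, Instant)>,
}

impl Breakers {
    fn new() -> Self {
        Self {
            max_failures: 10,
            window: Duration::from_secs(24 * 60 * 60),
            state: HashMap::new(),
        }
    }

    /// Should we even attempt a request to this domain right now?
    fn should_try(&self, domain: &str) -> bool {
        match self.state.get(domain) {
            Some((failures, since)) => {
                *failures < self.max_failures || since.elapsed() >= self.window
            }
            None => true,
        }
    }

    /// Record the outcome of a request.
    fn record(&mut self, domain: &str, success: bool) {
        if success {
            self.state.remove(domain);
        } else {
            let entry = self
                .state
                .entry(domain.to_string())
                .or_insert((0, Instant::now()));
            // Start a fresh window (and count) once the old one has expired.
            if entry.1.elapsed() >= self.window {
                *entry = (0, Instant::now());
            }
            entry.0 += 1;
        }
    }
}
```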
We need to find out what's causing these CPU spikes. I don't think it's useful to make random changes based on guesses. If it's really because of dead instances, I don't see why the problem would suddenly start happening now.
I like this approach; it also has the advantage that it doesn't need any changes in Lemmy, but can be implemented in the federation crate. Here is the code for reference: https://git.asonix.dog/asonix/relay/src/branch/main/src/requests.rs
Okay, it looks like you were right. The Lemmy container is constantly at 95% CPU, and the Lemmy logs are full of errors from activity sending (timeouts, TLS and DNS errors, etc.). The stats from the background_jobs crate look like this, which doesn't seem so bad, but maybe submitting new jobs gets very slow at this queue size.
I will look into implementing the breakers mentioned by asonix.
Some thoughts:
My main fear with doing it only in memory, or with the breaker approach, is that every restart will run up the server memory significantly while it tries to send to a bunch of long-dead instances.
After thinking about it more, the breakers actually seem like a very bad idea. The main purpose of these retries for activity sending is so that activities can still be delivered even if the target instance goes down temporarily, for maintenance or other reasons. But 10 failed sends can be reached very easily if the target is down for an hour, and after that it wouldn't receive any activities for an entire day. However, I noticed two other problems with activity sending that could easily be improved:
I believe that these two changes alone can significantly reduce the queue size for lemmy.ml, without affecting instances that are only down temporarily. Please tell me what you think about them, and if I should make any changes or merge directly.
When I checked yesterday there was plenty of free RAM, so I don't think that's the problem. The main problem seems to be CPU usage, or maybe operations on the queue data structure getting very slow when it gets too large.
See #2142. However, this could result in a lot of additional DB writes, which might also affect performance negatively. It might be better to write tasks to disk only on shutdown and read them back in on startup. This probably needs benchmarking.
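A rough sketch of the persist-on-shutdown idea, assuming the queued sends can be serialized (the `PendingActivity` type, its fields, and the file handling are made up for illustration; the real queue items in the background_jobs crate look different):

```rust
use serde::{Deserialize, Serialize};
use std::fs::File;
use std::io::{BufReader, BufWriter};

/// Made-up representation of a queued send.
#[derive(Serialize, Deserialize)]
struct PendingActivity {
    inbox_url: String,
    activity_json: String,
}

/// Write the in-memory queue to disk once, on shutdown.
fn persist_queue(path: &str, queue: &[PendingActivity]) -> anyhow::Result<()> {
    let file = BufWriter::new(File::create(path)?);
    serde_json::to_writer(file, queue)?;
    Ok(())
}

/// Read the queue back on startup; return an empty queue if the file is missing.
fn restore_queue(path: &str) -> anyhow::Result<Vec<PendingActivity>> {
    match File::open(path) {
        Ok(f) => Ok(serde_json::from_reader(BufReader::new(f))?),
        Err(e) if e.kind() == std::io::ErrorKind::NotFound => Ok(Vec::new()),
        Err(e) => Err(e.into()),
    }
}
```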
This is another possibility, but it is relatively complex, as we would need to store the times of successful and failed checks for each instance. Hopefully it won't be necessary with the changes I proposed above.
Those two changes sound good to me. We can leave this open and re-evaluate if there are still high CPU problems.
This doesn't appear to be an issue anymore, due to Nutomic's fixes above.
After lemmy.ml runs for a while, the CPU usage goes really high, even affecting things like voting, which seem delayed.
The CPU usage for lemmy (not the database) looks unusually high, and the logs are full of failed federation messages. I'm fairly certain this is due to some glitch in the apub code, where it continually retries sending activities to dead servers.
Restarting the server fixes the issue temporarily, so I've added a daily restart in the cron.
cc @Nutomic