
High CPU usage for lemmy.ml #2511

Closed

dessalines opened this issue Oct 24, 2022 · 12 comments
Labels
area: federation, bug

Comments

@dessalines
Member

After lemmy.ml runs for a while, the CPU usage goes really high, even affecting things like voting, which seem delayed.

The CPU usage for lemmy (not the database) looks unusually high, and the logs are full of failed federation messages. I'm fairly certain this is due to some glitch in the apub code, where it continually retries sending activities to dead servers.

Restarting the server fixes the issue temporarily, so I've added a daily restart in the cron.

cc @Nutomic

dessalines added the bug label Oct 24, 2022
@Nutomic
Member

Nutomic commented Oct 25, 2022

@Nutomic
Member

Nutomic commented Oct 28, 2022

I think the best way to implement this is by adding a daily check in scheduled_tasks.rs, which connects to every linked server to check if it is reachable. For now it should be enough to connect to the domain root, and ensure that status 200 is returned.

We then have to store these stats in the database. When sending activities, we would check a certain condition for each instance (e.g. all requests failed in the last 7 days), and skip sending to that instance if the condition is met. Even if the instance is considered down, we can continue the regular checks, in case it comes back online later.
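
A rough sketch of how that daily check could look (illustrative only: the function name, domain list, and timeout are made up, and reqwest is assumed for the HTTP client):

```rust
use std::time::Duration;

use reqwest::{Client, StatusCode};

/// Hypothetical daily task: probe the root of every linked instance and
/// report whether it answered with HTTP 200.
async fn check_instance_reachability(client: &Client, domains: &[String]) -> Vec<(String, bool)> {
    let mut results = Vec::with_capacity(domains.len());
    for domain in domains {
        let url = format!("https://{}/", domain);
        let alive = match client
            .get(&url)
            .timeout(Duration::from_secs(10))
            .send()
            .await
        {
            Ok(response) => response.status() == StatusCode::OK,
            Err(_) => false,
        };
        // In the real task each result would be written to the database,
        // e.g. updating a per-instance "last successful check" timestamp.
        results.push((domain.clone(), alive));
    }
    results
}
```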

Nutomic added the area: federation label Oct 28, 2022
@Nutomic
Member

Nutomic commented Oct 28, 2022

As for lemmy.ml, I see this in the logs:
lemmy_1 | 2022-10-28T14:01:41.948344Z INFO HTTP request{http.method=POST http.scheme="http" http.host=lemmy.ml http.target=/inbox otel.kind="server" request_id=396d9d35-8cc0-4d73-bb45-528089523f75}:shared_inbox:send:send_lemmy_activity: lemmy_apub_lib::activity_queue: Activity queue stats: pending:22112, running: 18, dead (this hour): 0, complete (this hour): 77

22k pending requests is certainly a lot, but it shouldn't do anything besides take up memory (of which we have more than enough). Only 20 tasks are running at a time, which should be no problem either. So I can't really imagine that this would affect performance.

@dessalines
Member Author

dessalines commented Oct 28, 2022

I think the best way to implement this is by adding a daily check in scheduled_tasks.rs, which connects to every linked server to check if it is reachable. For now it should be enough to connect to the domain root, and ensure that status 200 is returned.

You could add columns to the instance table, maybe a last_alive timestamp column, and check that it's been alive within the past day.
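
A minimal sketch of that send-time check, assuming a hypothetical last_alive column and chrono timestamps (the names are illustrative, not existing Lemmy code):

```rust
use chrono::{Duration, NaiveDateTime, Utc};

/// Hypothetical send-time check: skip instances whose last successful
/// aliveness check is older than the allowed window.
fn should_send_to_instance(last_alive: Option<NaiveDateTime>, max_age: Duration) -> bool {
    match last_alive {
        Some(seen) => Utc::now().naive_utc() - seen < max_age,
        // Never checked yet: send anyway, so new instances are not cut off.
        None => true,
    }
}

// Example: only send if the instance has responded within the past day.
// let send = should_send_to_instance(instance.last_alive, Duration::days(1));
```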

@dessalines
Member Author

22k pending requests is certainly a lot, but it shouldn't do anything besides take up memory (of which we have more than enough). Only 20 tasks are running at a time, which should be no problem either. So I can't really imagine that this would affect performance.

You might not see it because of the daily restarts now, but something does bump up /app/lemmy's CPU usage to 100% if it runs long enough.

@asonix
Collaborator

asonix commented Oct 28, 2022

fwiw I have a 'breaker' system in my relay that counts the number of consecutive failures per domain, and auto-fails further requests after an arbitrary limit is reached, resetting every 24 hours
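
As an illustration of that idea, a minimal sketch (not asonix's actual implementation, which is linked further down; the struct, method names, and thresholds are invented):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimal per-domain circuit breaker: after `max_failures` consecutive
/// failures, further sends to that domain are rejected until `reset_after`
/// (e.g. 24 hours) has elapsed since the last failure.
struct Breaker {
    max_failures: u32,
    reset_after: Duration,
    // domain -> (consecutive failures, time of last failure)
    state: HashMap<String, (u32, Instant)>,
}

impl Breaker {
    fn new(max_failures: u32, reset_after: Duration) -> Self {
        Self { max_failures, reset_after, state: HashMap::new() }
    }

    /// Should a request to this domain be attempted at all?
    fn allow(&mut self, domain: &str) -> bool {
        if let Some(&(failures, last_failure)) = self.state.get(domain) {
            if failures >= self.max_failures {
                if last_failure.elapsed() >= self.reset_after {
                    // Breaker window expired: forget the failures and try again.
                    self.state.remove(domain);
                    return true;
                }
                return false;
            }
        }
        true
    }

    /// Record the outcome of a request to this domain.
    fn record(&mut self, domain: &str, success: bool) {
        if success {
            self.state.remove(domain);
        } else {
            let entry = self.state.entry(domain.to_string()).or_insert((0, Instant::now()));
            entry.0 += 1;
            entry.1 = Instant::now();
        }
    }
}
```

The sending queue would call allow(domain) before each attempt and record(domain, ok) afterwards.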

@Nutomic
Member

Nutomic commented Nov 2, 2022

You might not see it because of the daily restarts now, but something does bump up /app/lemmy's CPU usage to 100% if it runs long enough.

We need to find out what's causing these CPU spikes. I don't think it's useful to make random changes based on guesses. If it's really because of dead instances, I don't see why the problem would suddenly start happening now.

fwiw I have a 'breaker' system in my relay that counts the number of consecutive failures per domain, and auto-fails further requests after an arbitrary limit is reached, resetting every 24 hours

I like this approach; it also has the advantage that it doesn't need any changes in Lemmy, but can be implemented in the federation crate. Here is the code for reference: https://git.asonix.dog/asonix/relay/src/branch/main/src/requests.rs

@Nutomic
Member

Nutomic commented Nov 15, 2022

Okay, it looks like you were right. The Lemmy container is constantly at 95% CPU, and the Lemmy logs are full of errors from activity sending (timeouts, TLS and DNS errors, etc.).

Stats from the background_jobs crate look like this, which doesn't seem so bad, but maybe submitting new jobs gets very slow at this queue size.

INFO HTTP request{http.method=POST http.scheme="http" http.host=lemmy.ml http.target=/inbox otel.kind="server" request_id=409f60ed-d561-4938-a1de-96bea03e7343}:shared_inbox:send:send_lemmy_activity: lemmy_apub_lib::activity_queue: Activity queue stats: pending:125111, running: 62, dead (this hour): 1635, complete (this hour): 649

I will look into implementing the breakers mentioned by asonix.

@dessalines
Member Author

Some thoughts:

  • Add a last_alive column to the instance table, check all the instances on server restart, and don't send to anything that hasn't been alive in the last day or week or month.
  • Possibly move the queue (or at least failed requests) out of memory and into the DB. This would reduce the memory burden, and allow jobs to be restored after server restarts.

My main fear with doing it only in memory, or the breaker approach, is that every restart will run up the server memory significantly while it tries to send to a bunch of long-dead instances.

@Nutomic
Member

Nutomic commented Nov 16, 2022

After thinking about it more, the breakers actually seem like a very bad idea. The main purpose of these retries for activity sending is so that activities can be delivered even if the target instance goes down temporarily for maintenance or other reasons. But 10 failed sends can be reached very easily if the target is down for an hour, and after that it wouldn't receive any activities for an entire day.

However, I noticed two other problems with activity sending that could easily be improved:

I believe that these two changes alone can significantly reduce the queue size for lemmy.ml, without affecting instances that are only down temporarily. Please tell me what you think about them, and if I should make any changes or merge directly.


My main fear with doing it only in memory, or the breaker approach, is that every restart will run up the server memory significantly while it tries to send to a bunch of long-dead instances.

When I checked yesterday, there was plenty of free RAM, so I don't think that's the problem. The main problem seems to be CPU usage, or maybe operations on the queue data structure getting very slow when it gets too large.

Possibly move the queue (or at least failed requests) out of memory and into the DB. This would reduce the memory burden, and allow jobs to be restored after server restarts.

See #2142. However, this could result in a lot of additional DB writes, which might also affect performance negatively. It might be better to only write tasks to disk on shutdown, and read them back in on startup. Probably needs benchmarking.
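
A rough sketch of that shutdown/startup variant (everything here is illustrative: the QueuedActivity type, the file handling, and the serde_json usage are assumptions, not what the background_jobs crate actually does):

```rust
use serde::{Deserialize, Serialize};
use std::fs;
use std::io;

/// Invented stand-in for whatever a queued activity actually contains.
#[derive(Serialize, Deserialize)]
struct QueuedActivity {
    inbox_url: String,
    activity_json: String,
}

/// On shutdown: dump the in-memory queue to disk instead of writing
/// every retry to the database.
fn persist_queue(path: &str, queue: &[QueuedActivity]) -> io::Result<()> {
    let json = serde_json::to_string(queue).map_err(io::Error::from)?;
    fs::write(path, json)
}

/// On startup: read the queue back in (an empty queue if the file is missing
/// or unreadable).
fn restore_queue(path: &str) -> Vec<QueuedActivity> {
    fs::read_to_string(path)
        .ok()
        .and_then(|json| serde_json::from_str(&json).ok())
        .unwrap_or_default()
}
```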

Add a last_alive column to the instance table, check all the instances on server restart, and don't send to anything that hasn't been alive in the last day or week or month.

This is another possibility, but relatively complex, as we need to store the times of successful and failed checks for each instance. Hopefully it won't be necessary with the changes I proposed above.

@dessalines
Member Author

Those two changes sound good to me. We can leave this open and re-evaluate if there are still high CPU problems.

@dessalines
Member Author

This doesn't appear to be an issue anymore, due to Nutomic's fixes above.
