
High CPU usage for lemmy.ml #2511

Closed

dessalines opened this issue Oct 24, 2022 · 12 comments
Labels
area: federation, bug

Comments

@dessalines
Member

After lemmy.ml runs for a while, the CPU usage goes really high, even affecting things like voting, which seem delayed.

The CPU usage for lemmy (not the database) looks unusually high, and the logs are full of failed federation messages. I'm fairly certain this is due to some glitch in the apub code, where it continually retries sending activities to dead servers.

Restarting the server fixes the issue temporarily, so I've added a daily restart in the cron.

cc @Nutomic

dessalines added the bug label Oct 24, 2022
@Nutomic
Member

Nutomic commented Oct 25, 2022

@Nutomic
Member

Nutomic commented Oct 28, 2022

I think the best way to implement this is by adding a daily check in scheduled_tasks.rs, which connects to every linked server to check if it is reachable. For now it should be enough to connect to the domain root, and ensure that status 200 is returned.

We then have to store these stats in the database. When sending activities, we would check a certain condition for each instance (e.g. all requests failed in the last 7 days), and skip sending to that instance if the condition is met. Even if the instance is considered down, we can continue the regular checks, in case it comes back online later.
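
A rough sketch of how that daily check could look (illustrative only: the function name, domain list, and timeout are made up, and reqwest is assumed for the HTTP client):

```rust
use std::time::Duration;

use reqwest::{Client, StatusCode};

/// Hypothetical daily task: probe the root of every linked instance and
/// report whether it answered with HTTP 200.
async fn check_instance_reachability(client: &Client, domains: &[String]) -> Vec<(String, bool)> {
    let mut results = Vec::with_capacity(domains.len());
    for domain in domains {
        let url = format!("https://{}/", domain);
        let alive = match client
            .get(&url)
            .timeout(Duration::from_secs(10))
            .send()
            .await
        {
            Ok(response) => response.status() == StatusCode::OK,
            Err(_) => false,
        };
        // In the real task each result would be written to the database,
        // e.g. updating a per-instance "last successful check" timestamp.
        results.push((domain.clone(), alive));
    }
    results
}
```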

Nutomic added the area: federation label Oct 28, 2022
@Nutomic
Member

Nutomic commented Oct 28, 2022

As for lemmy.ml, I see this in the logs:
lemmy_1 | 2022-10-28T14:01:41.948344Z INFO HTTP request{http.method=POST http.scheme="http" http.host=lemmy.ml http.target=/inbox otel.kind="server" request_id=396d9d35-8cc0-4d73-bb45-528089523f75}:shared_inbox:send:send_lemmy_activity: lemmy_apub_lib::activity_queue: Activity queue stats: pending:22112, running: 18, dead (this hour): 0, complete (this hour): 77

22k pending requests is certainly a lot, but it shouldn't do anything besides take up memory (of which we have more than enough). Only 20 tasks are running at a time, which should be no problem either. So I can't really imagine that this would affect performance.

@dessalines
Member Author

dessalines commented Oct 28, 2022

I think the best way to implement this is by adding a daily check in scheduled_tasks.rs, which connects to every linked server to check if it is reachable. For now it should be enough to connect to the domain root, and ensure that status 200 is returned.

You could add columns to the instance table, maybe a last_alive timestamp column, and check that it's been alive within the past day.
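
A minimal sketch of that send-time check, assuming a hypothetical last_alive column and chrono timestamps (the names are illustrative, not existing Lemmy code):

```rust
use chrono::{Duration, NaiveDateTime, Utc};

/// Hypothetical send-time check: skip instances whose last successful
/// aliveness check is older than the allowed window.
fn should_send_to_instance(last_alive: Option<NaiveDateTime>, max_age: Duration) -> bool {
    match last_alive {
        Some(seen) => Utc::now().naive_utc() - seen < max_age,
        // Never checked yet: send anyway, so new instances are not cut off.
        None => true,
    }
}

// Example: only send if the instance has responded within the past day.
// let send = should_send_to_instance(instance.last_alive, Duration::days(1));
```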

@dessalines
Member Author

22k pending requests is certainly a lot, but it shouldn't do anything besides take up memory (of which we have more than enough). Only 20 tasks are running at a time, which should be no problem either. So I can't really imagine that this would affect performance.

You might not see it because of the daily restarts now, but something does bump up /app/lemmy's CPU usage to 100% if it runs long enough.

@asonix
Collaborator

asonix commented Oct 28, 2022

fwiw I have a 'breaker' system in my relay that counts the number of consecutive failures per domain, and auto-fails further requests after an arbitrary limit is reached, resetting every 24 hours
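
As an illustration of that idea, a minimal sketch (not asonix's actual implementation, which is linked further down; the struct, method names, and thresholds are invented):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimal per-domain circuit breaker: after `max_failures` consecutive
/// failures, further sends to that domain are rejected until `reset_after`
/// (e.g. 24 hours) has elapsed since the last failure.
struct Breaker {
    max_failures: u32,
    reset_after: Duration,
    // domain -> (consecutive failures, time of last failure)
    state: HashMap<String, (u32, Instant)>,
}

impl Breaker {
    fn new(max_failures: u32, reset_after: Duration) -> Self {
        Self { max_failures, reset_after, state: HashMap::new() }
    }

    /// Should a request to this domain be attempted at all?
    fn allow(&mut self, domain: &str) -> bool {
        if let Some(&(failures, last_failure)) = self.state.get(domain) {
            if failures >= self.max_failures {
                if last_failure.elapsed() >= self.reset_after {
                    // Breaker window expired: forget the failures and try again.
                    self.state.remove(domain);
                    return true;
                }
                return false;
            }
        }
        true
    }

    /// Record the outcome of a request to this domain.
    fn record(&mut self, domain: &str, success: bool) {
        if success {
            self.state.remove(domain);
        } else {
            let entry = self.state.entry(domain.to_string()).or_insert((0, Instant::now()));
            entry.0 += 1;
            entry.1 = Instant::now();
        }
    }
}
```

The sending queue would call allow(domain) before each attempt and record(domain, ok) afterwards.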

@Nutomic
Member

Nutomic commented Nov 2, 2022

You might not see it because of the daily restarts now, but something does bump up /app/lemmy's CPU usage to 100% if it runs long enough.

We need to find out what's causing these CPU spikes. I don't think it's useful to make random changes based on guesses. If it's really because of dead instances, I don't see why the problem would suddenly start happening now.

fwiw I have a 'breaker' system in my relay that counts the number of consecutive failures per domain, and auto-fails further requests after an arbitrary limit is reached, resetting every 24 hours

I like this approach; it also has the advantage that it doesn't need any changes in Lemmy, but can be implemented in the federation crate. Here is the code for reference: https://git.asonix.dog/asonix/relay/src/branch/main/src/requests.rs

@Nutomic
Member

Nutomic commented Nov 15, 2022

Okay, it looks like you were right. The Lemmy container is constantly at 95% CPU, and the Lemmy logs are full of errors from activity sending (timeouts, TLS and DNS errors, etc.).

Stats from the background_jobs crate look like this, which doesn't seem so bad, but maybe submitting new jobs gets very slow at this queue size.

INFO HTTP request{http.method=POST http.scheme="http" http.host=lemmy.ml http.target=/inbox otel.kind="server" request_id=409f60ed-d561-4938-a1de-96bea03e7343}:shared_inbox:send:send_lemmy_activity: lemmy_apub_lib::activity_queue: Activity queue stats: pending:125111, running: 62, dead (this hour): 1635, complete (this hour): 649

I will look into implementing the breakers mentioned by asonix.

@dessalines
Member Author

Some thoughts:

  • Add a last_alive column to the instance table, check all the instances on server restart, and don't send to anything that hasn't been alive in the last day or week or month.
  • Possibly move the queue (or at least failed requests) out of memory and into the DB. This would reduce the memory burden, and allow jobs to be restored after server restarts.

My main fear with doing it only in memory, or the breaker approach, is that every restart will run up the server memory significantly while it tries to send to a bunch of long-dead instances.

@Nutomic
Member

Nutomic commented Nov 16, 2022

After thinking about it more, the breakers actually seem like a very bad idea. The main purpose of these retries for activity sending is so that activities can be delivered even if the target instance goes down temporarily for maintenance or other reasons. But 10 failed sends can be reached very easily if the target is down for an hour, and after that it wouldn't receive any activities for an entire day.

However, I noticed two other problems with activity sending that could easily be improved:

I believe that these two changes alone can significantly reduce the queue size for lemmy.ml, without affecting instances that are only down temporarily. Please tell me what you think about them, and if I should make any changes or merge directly.


My main fear with doing it only in memory, or the breaker approach, is that every restart will run up the server memory significantly while it tries to send to a bunch of long-dead instances.

When I checked yesterday, there was plenty of free RAM, so I don't think that's the problem. The main problem seems to be CPU usage, or maybe operations on the queue data structure getting very slow when it gets too large.

Possibly move the queue (or at least failed requests) out of memory and into the DB. This would reduce the memory burden, and allow jobs to be restored after server restarts.

See #2142. However, this could result in a lot of additional DB writes, which might also affect performance negatively. It might be better to only write tasks to disk on shutdown, and read them back in on startup. Probably needs benchmarking.
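
A rough sketch of that shutdown/startup variant (everything here is illustrative: the QueuedActivity type, the file handling, and the serde_json usage are assumptions, not what the background_jobs crate actually does):

```rust
use serde::{Deserialize, Serialize};
use std::fs;
use std::io;

/// Invented stand-in for whatever a queued activity actually contains.
#[derive(Serialize, Deserialize)]
struct QueuedActivity {
    inbox_url: String,
    activity_json: String,
}

/// On shutdown: dump the in-memory queue to disk instead of writing
/// every retry to the database.
fn persist_queue(path: &str, queue: &[QueuedActivity]) -> io::Result<()> {
    let json = serde_json::to_string(queue).map_err(io::Error::from)?;
    fs::write(path, json)
}

/// On startup: read the queue back in (an empty queue if the file is missing
/// or unreadable).
fn restore_queue(path: &str) -> Vec<QueuedActivity> {
    fs::read_to_string(path)
        .ok()
        .and_then(|json| serde_json::from_str(&json).ok())
        .unwrap_or_default()
}
```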

Add a last_alive column to the instance table, check all the instances on server restart, and don't send to anything that hasn't been alive in the last day or week or month.

This is another possibility, but relatively complex, as we need to store the times of successful and failed checks for each instance. Hopefully it won't be necessary with the changes I proposed above.

@dessalines
Member Author

Those two changes sound good to me. We can leave this open and re-evaluate if there are still high CPU problems.

@dessalines
Member Author

This doesn't appear to be an issue anymore, due to Nutomic's fixes above.
