
Check for dead federated instances (fixes #2221) #3427

Merged

Nutomic merged 15 commits into main on Jul 13, 2023

Conversation

@Nutomic (Member) commented Jun 30, 2023

A very basic check which tries to connect to every known instance once per day. If the connection fails or returns something other than HTTP 200, the instance is marked as dead and no federation activities will be sent to it.

This implementation is really basic; there can be false positives if an instance is temporarily down or unreachable during the check. It also rechecks all known instances every day, even if they have been down for years. Nevertheless it should be a major improvement, and we can add more sophisticated checks later.

Two problems mentioned in the comments still need to be fixed.
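As a rough sketch of the kind of probe described above (illustrative only - the actual check in this PR goes through nodeinfo, and the names here are placeholders), the liveness test boils down to something like:

    use reqwest::Client;

    // Counts an instance as alive only if an HTTP request to it succeeds with status 200;
    // any connection error or non-200 status would mark it as dead.
    async fn instance_is_alive(client: &Client, domain: &str) -> bool {
        match client.get(format!("https://{domain}/")).send().await {
            Ok(response) => response.status() == reqwest::StatusCode::OK,
            Err(_) => false,
        }
    }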

@@ -0,0 +1 @@
alter table site add column is_alive bool not null default true;
Collaborator commented:

What about alive_check_fails_count integer default 0 instead of just a boolean?

i.e. is_alive = (alive_check_fails_count == 0).

That way, an exponential backoff could be added later to the is_alive check, for example:

  • one scheduled task running once per hour which retries all sites where alive_check_fails_count = 1
  • one scheduled task running once per day which retries where alive_check_fails_count < 5
  • one scheduled task running once per week which retries where alive_check_fails_count >= 5
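As a sketch only, the tiered backoff described above could map the proposed (hypothetical) alive_check_fails_count value to a recheck interval roughly like this:

    use std::time::Duration;

    // Hypothetical helper built on the suggested alive_check_fails_count column:
    // the more consecutive failed checks, the less often the instance is rechecked.
    fn recheck_interval(alive_check_fails_count: u32) -> Duration {
        match alive_check_fails_count {
            0 => Duration::from_secs(24 * 60 * 60),     // healthy: regular daily check
            1 => Duration::from_secs(60 * 60),          // one failure: retry hourly
            2..=4 => Duration::from_secs(24 * 60 * 60), // a few failures: retry daily
            _ => Duration::from_secs(7 * 24 * 60 * 60), // five or more failures: retry weekly
        }
    }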

@RocketDerp

This comment was marked as abuse.

@Eskuero (Contributor) commented Jul 1, 2023

Following up on what @RocketDerp said, I guess it could be restructured the other way around: if you haven't received activity (comments, likes or posts) from a certain instance in X amount of time, actually check whether it is still alive.

This reduces the risk of false positives. Otherwise, effectively 24 hours of defederation would be a heavy punishment for maybe 30-60 seconds of downtime from scheduled backups, updates, etc.

@sunaurus (Collaborator) commented Jul 1, 2023

Going by incoming traffic is problematic, @RocketDerp - I've seen cases of instances which send traffic out but block incoming traffic, thus still tying up my federation workers.

@RocketDerp

This comment was marked as abuse.

@Nutomic (Member, Author) commented Jul 3, 2023

I've reworked this now to rely on the updated column for alive checks. Essentially there is a daily task which tries to connect to all known instances, and if this succeeds, they are marked as updated at that time. When sending out activities, it checks that the instance was updated at most 3 days ago; otherwise no activities are sent to it. This way it doesn't matter if one or two checks fail.

Right now the code is very messy and needs cleanup/error handling as well as testing.
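In other words, the rule on the sending side reduces to a three-day cutoff on the last successful check. A minimal sketch of that rule (names are illustrative, not the PR's actual code):

    use chrono::{Duration, NaiveDateTime, Utc};

    // An instance counts as dead if its last successful check (the updated column)
    // is more than three days in the past.
    fn instance_is_dead(last_successful_check: NaiveDateTime) -> bool {
        Utc::now().naive_utc() - last_successful_check > Duration::days(3)
    }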

@dessalines (Member) left a comment:

This would probably be better as a SQL-only solution, rather than dealing with OnceCell and a combination of SQL + memory.

I.e. before sending out any federation jobs, filter the instance list in SQL by your last_alive < X days column, rather than doing a contains(DEAD_INSTANCES).

let mut scheduler = AsyncScheduler::new();

// Check for dead federated instances
static CONTEXT: OnceCell<LemmyContext> = OnceCell::const_new();
Member commented:

Shouldn't the DeadInstances be the only thing you need in a OnceCell? Why are the other ones needed?

});

// Manually run the scheduler in an event loop
tokio::spawn(async move {
Member commented:

I'd make either both of these scheduled things use tokio, or neither - not one using tokio and the other a thread.
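For reference, driving the scheduler from tokio could look roughly like this - a sketch assuming a clokwerk AsyncScheduler as in the diff above, placed inside an async setup function:

    use std::time::Duration;
    use clokwerk::AsyncScheduler;

    // Run the scheduler from a tokio task instead of a dedicated OS thread,
    // so both scheduled jobs live on the same async runtime.
    let mut scheduler = AsyncScheduler::new();
    // ... jobs registered on scheduler here ...
    tokio::spawn(async move {
        loop {
            scheduler.run_pending().await;
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
    });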

use tokio::sync::OnceCell;
use tracing::{error, info};

pub async fn setup_federation_scheduled_tasks(
Member commented:

Not really necessary to make this async

src/lib.rs (outdated)
}
});
setup_database_scheduled_tasks(db_url.clone(), context.clone())?;
setup_federation_scheduled_tasks(db_url, context.clone()).await?;
Member commented:

Either both should be made async, or neither. I remember not being able to get clokwerk async to work correctly though.

@@ -28,6 +29,8 @@ pub mod protocol;

pub const FEDERATION_HTTP_FETCH_LIMIT: u32 = 50;

pub static DEAD_INSTANCES: RwLock<Vec<String>> = RwLock::new(Vec::new());
Member commented:

Shouldn't this be the OnceCell?

pub inbox_url: Option<DbUrl>,
pub private_key: Option<Option<String>>,
pub public_key: Option<String>,
pub last_alive: Option<NaiveDateTime>,
Member commented:

Think you forgot to add the migration here.

@Nutomic (Member, Author) commented Jul 4, 2023

Like I said, it was very messy and needed cleanup, which I have now done, along with another rework of the code. Most importantly, dead instances and blocklists are now stored in single-value moka caches. Much cleaner than using scheduled tasks to update them.

I also restored scheduled_tasks to the original implementation. However there is a problem: the check uses nodeinfo, which isn't required for ActivityPub and isn't present on some Fediverse instances (e.g. misskey.de). So it needs an alternative check that a request to the domain root returns HTTP 200. Also, these requests should really be async.
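The single-value cache pattern itself is only a few lines. A minimal sketch of the idea, with placeholder names and TTL rather than the PR's exact code:

    use std::time::Duration;
    use moka::future::Cache;
    use once_cell::sync::Lazy;

    // Illustrative TTL (the blocklist cache duration is later reduced to one minute).
    const CACHE_DURATION: Duration = Duration::from_secs(60);

    // A cache with max_capacity(1) and a single () key behaves like a memoized
    // async value with a time-to-live.
    static DEAD_INSTANCES: Lazy<Cache<(), Vec<String>>> = Lazy::new(|| {
        Cache::builder()
            .max_capacity(1)
            .time_to_live(CACHE_DURATION)
            .build()
    });

    // Returns the cached list, recomputing it via the supplied future once the entry expires.
    async fn dead_instances(fetch: impl std::future::Future<Output = Vec<String>>) -> Vec<String> {
        DEAD_INSTANCES.get_with((), fetch).await
    }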

.max_capacity(1)
.time_to_live(DB_QUERY_CACHE_DURATION)
.build()
});
@Nutomic (Member, Author) commented:

Bit weird to use caches with capacity one and no key, but seems like the easiest way to implement this.

@phiresky (Collaborator) commented Jul 4, 2023:

I wanted to mention that when writing my other PR, I accidentally made it construct a whole new moka cache for every single incoming event (so thousands per second) and insert a single value, and it didn't negatively affect performance at all. Just as a reference that constructing tiny moka caches is probably fine performance-wise (if maybe not for code beauty).

instance::table
.select(instance::domain)
// TODO: should use instance::published if updated is null
//.filter(instance::updated.lt(now - 3.days()))
@Nutomic (Member, Author) commented:

Not sure how to write this query.

Collaborator commented:

COALESCE? coalesce(instance::updated, instance::published).lt(now - 3.days())

https://diesel.rs/guides/extending-diesel.html

use diesel::sql_types::{Nullable, Text};
sql_function!(fn coalesce(x: Nullable<Text>, y: Text) -> Text);

@Nutomic (Member, Author) commented:

That works, thanks! Although I don't see how to make it generic, so I have to write it specifically for the timestamp type.
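For reference, a timestamp-specific declaration along these lines (matching the coalesce_time helper used in the diff further down) is enough:

    use diesel::sql_function;
    use diesel::sql_types::{Nullable, Timestamp};

    // Declared specifically for Timestamp, since writing a fully generic coalesce
    // with sql_function! is not straightforward.
    sql_function!(fn coalesce_time(x: Nullable<Timestamp>, y: Timestamp) -> Timestamp);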

@Nutomic marked this pull request as ready for review July 4, 2023 13:59
@Nutomic (Member, Author) commented Jul 4, 2023

Ready for review/merge now.

@dessalines (Member) left a comment:

I would much rather this be handled in SQL only, via an alive_instances query, and just optimize those queries. I foresee a lot of problems coming from adding a secondary store / caching layer. People won't understand why they've blocked an instance yet posts are still coming through, for example.

crates/db_schema/src/utils.rs (outdated, resolved)
let conn = &mut get_conn(pool).await?;
instance::table
.select(instance::domain)
.filter(coalesce_time(instance::updated, instance::published).lt(now - 3.days()))
Member commented:

Looks good.

fn update_instance_software(conn: &mut PgConnection, user_agent: &str) {
///
/// TODO: this should be async
/// TODO: if instance has been dead for a long time, it should be checked less frequently
Member commented:

For this one, in the select below, you could do let instances = instance::table.filter(coalesce(updated, published).gt(now - 1.months()));

Even better would be to add this as alive_instances in impls/instance.rs.

Another possibility would be to recheck the alive_instances every day, but only re-check all of them (even previously dead ones) every month. Up to you.

@Nutomic (Member, Author) commented:

Or check old instances with random probability, e.g. in 1% of all checks.

@Nutomic (Member, Author) commented:

Anyway this can be improved later, no need to include it in this PR.
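A probabilistic recheck would only be a few lines; a hypothetical sketch, not part of this PR:

    use rand::Rng;

    // Long-dead instances are only rechecked in roughly 1% of scheduler runs.
    fn should_recheck_dead_instance() -> bool {
        rand::thread_rng().gen_ratio(1, 100)
    }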

crates/db_schema/src/impls/instance.rs (resolved)
crates/apub/src/lib.rs (outdated, resolved)
@phiresky (Collaborator) commented Jul 4, 2023

People won't understand why they've blocked an instance yet posts are still coming through, for example.

Should be fairly easy to also update the cache where the query updates the database. Won't fix the issue if people are running multiple lemmy_server instances though

I foresee a lot of problems down the road as we start adding layers of stores and caches on top of each other.

Cache invalidation is definitely a non-trivial problem. The site just being down because it can't handle millions of queries is arguably a bigger problem though :)

@phiresky (Collaborator) commented Jul 5, 2023

I want to mention that fetch_local_site_data is the third most expensive function in the code base, and the most expensive function that is actually needed. I'd recommend either merging this, or I can create a minimal PR that just caches fetch_local_site_data for a few seconds.

The cache duration can be significantly reduced and still have a huge impact. Even though this function takes only 1 ms, it is called at a frequency of over 1000 Hz on lemmy.world. A cache duration of 5 seconds would be perfectly fine, and even 1 s would be useful.

@RocketDerp

This comment was marked as abuse.

}
};
@Nutomic (Member, Author) commented:

This code is quite confusing; I'm open to suggestions on how to simplify it.

@ciscprocess commented Jul 6, 2023:

Maybe something like this? I'm new to Lemmy and Rust, so apologies if it's not appropriate for me to post this here. It explicitly sets default_form on HTTP 500, but I suspect that would happen anyway with the OG code on a deserialization failure. I haven't tested this for correctness, but maybe it can be a template for something more readable?

    let form_result = client.get(&node_info_url).send()
      .map(|response| {
        response.error_for_status()
          .and_then(|response|  response.json::<NodeInfo>())
          .map_or(default_form, |node_info| {
            InstanceForm::builder()
              .domain(instance.domain)
              .updated(Some(naive_now()))
              .software(node_info.software.and_then(|s| s.name))
              .version(node_info.version.clone())
              .build()
          })
      });

      if let Ok(form) = form_result {
        diesel::update(instance::table.find(instance.id))
          .set(form)
          .execute(conn)?;
      }

Edit: Or perhaps even this.

    let form_result = client.get(&node_info_url).send()
      .and_then(|response| response.error_for_status())
      .and_then(|response| response.json::<NodeInfo>())
      .map(|node_info| InstanceForm::builder()
                .domain(instance.domain)
                .updated(Some(naive_now()))
                .software(node_info.software.and_then(|s| s.name))
                .version(node_info.version.clone())
                .build())
      .or_else(|err| if err.is_status() { Ok(default_form) } else { Err(err) });

@Nutomic (Member, Author) commented:

I gave this a try but feel like it's getting even more confusing, so I will leave it as is.

@Nutomic (Member, Author) commented Jul 5, 2023

Moved the blocklist caching to #3486 and decreased cache time to one minute to ensure that changes take effect quickly.

This PR will need more scrutiny and testing; let's leave it for 0.18.2.

@dessalines (Member) left a comment:

Seems fine, just needs conflicts fixed.

@Nutomic enabled auto-merge (squash) July 13, 2023 09:01
@Nutomic disabled auto-merge July 13, 2023 14:11
@Nutomic merged commit 7d8cb93 into main on Jul 13, 2023
Nutomic added a commit to cetra3/lemmy that referenced this pull request Jul 19, 2023
* Check for dead federated instances (fixes LemmyNet#2221)

* move to apub crate, use timestamp

* make it compile

* clippy

* use moka to cache blocklists, dead instances, restore orig scheduled tasks

* remove leftover last_alive var

* error handling

* wip

* fix alive check for instances without nodeinfo, add coalesce

* clippy

* move federation blocklist cache to LemmyNet#3486

* unused deps
Nutomic added a commit that referenced this pull request Jul 21, 2023, with the same commit message as above.
7 participants