
Check for dead federated instances (fixes #2221) #3427

Merged

Nutomic merged 15 commits into main on Jul 13, 2023

Conversation

@Nutomic (Member) commented Jun 30, 2023

A very basic check which tries to connect to every known instance once per day. If the connection fails or returns something other than HTTP 200, the instance is marked as dead and no federation activities will be sent to it.

This implementation is really basic; there can be false positives if an instance is temporarily down or unreachable during the check. It also rechecks all known instances every day, even if they have been down for years. Nevertheless it should be a major improvement, and we can add more sophisticated checks later.

Two problems mentioned in the comments still need to be fixed.
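As a rough sketch of the kind of probe described above (illustrative only - the actual check in this PR goes through nodeinfo, and the names here are placeholders), the liveness test boils down to something like:

    use reqwest::Client;

    // Counts an instance as alive only if an HTTP request to it succeeds with status 200;
    // any connection error or non-200 status would mark it as dead.
    async fn instance_is_alive(client: &Client, domain: &str) -> bool {
        match client.get(format!("https://{domain}/")).send().await {
            Ok(response) => response.status() == reqwest::StatusCode::OK,
            Err(_) => false,
        }
    }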

@@ -0,0 +1 @@
alter table site add column is_alive bool not null default true;
Collaborator commented:

What about alive_check_fails_count integer default 0 instead of just a boolean?

i.e. is_alive = (alive_check_fails_count == 0).

That way, an exponential backoff could be added later to the is_alive check, for example:

  • one scheduled task running once per hour which retries all sites where alive_check_fails_count = 1
  • one scheduled task running once per day which retries where alive_check_fails_count < 5
  • one scheduled task running once per week which retries where alive_check_fails_count >= 5
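As a sketch only, the tiered backoff described above could map the proposed (hypothetical) alive_check_fails_count value to a recheck interval roughly like this:

    use std::time::Duration;

    // Hypothetical helper built on the suggested alive_check_fails_count column:
    // the more consecutive failed checks, the less often the instance is rechecked.
    fn recheck_interval(alive_check_fails_count: u32) -> Duration {
        match alive_check_fails_count {
            0 => Duration::from_secs(24 * 60 * 60),     // healthy: regular daily check
            1 => Duration::from_secs(60 * 60),          // one failure: retry hourly
            2..=4 => Duration::from_secs(24 * 60 * 60), // a few failures: retry daily
            _ => Duration::from_secs(7 * 24 * 60 * 60), // five or more failures: retry weekly
        }
    }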

@RocketDerp

This comment was marked as abuse.

@Eskuero (Contributor) commented Jul 1, 2023

Following up on what @RocketDerp said, I guess it could be restructured the other way around: if you haven't received activity (comments, likes or posts) from a certain instance in X amount of time, actually check whether it is still alive.

This reduces the risk of false positives. Otherwise, effectively 24 hours of defederation would be a heavy punishment for maybe 30-60 seconds of downtime from scheduled backups, updates, etc.

@sunaurus (Collaborator) commented Jul 1, 2023

Going by incoming traffic is problematic, @RocketDerp - I've seen cases of instances which send traffic out but block incoming traffic, thus still tying up my federation workers.

@RocketDerp

This comment was marked as abuse.

@Nutomic (Member, Author) commented Jul 3, 2023

I've reworked this now to rely on the updated column for alive checks. Essentially there is a daily task which tries to connect to all known instances, and if this succeeds, they are marked as updated at that time. When sending out activities, it checks that the instance was updated at most 3 days ago; otherwise no activities are sent to it. This way it doesn't matter if one or two checks fail.

Right now the code is very messy and needs cleanup/error handling as well as testing.
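In other words, the rule on the sending side reduces to a three-day cutoff on the last successful check. A minimal sketch of that rule (names are illustrative, not the PR's actual code):

    use chrono::{Duration, NaiveDateTime, Utc};

    // An instance counts as dead if its last successful check (the updated column)
    // is more than three days in the past.
    fn instance_is_dead(last_successful_check: NaiveDateTime) -> bool {
        Utc::now().naive_utc() - last_successful_check > Duration::days(3)
    }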

@dessalines (Member) left a comment:

This would probably be better as a SQL-only solution, rather than dealing with OnceCell and a combination of SQL + memory.

I.e. before sending out any federation jobs, filter the instance list in SQL by your last_alive < X days column, rather than doing a contains(DEAD_INSTANCES).

let mut scheduler = AsyncScheduler::new();

// Check for dead federated instances
static CONTEXT: OnceCell<LemmyContext> = OnceCell::const_new();
Member commented:

Shouldn't the DeadInstances be the only thing you need in a OnceCell? Why are the other ones needed?

});

// Manually run the scheduler in an event loop
tokio::spawn(async move {
Member commented:

I'd make either both of these scheduled things use tokio, or neither - not one using tokio and the other a thread.
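For reference, driving the scheduler from tokio could look roughly like this - a sketch assuming a clokwerk AsyncScheduler as in the diff above, placed inside an async setup function:

    use std::time::Duration;
    use clokwerk::AsyncScheduler;

    // Run the scheduler from a tokio task instead of a dedicated OS thread,
    // so both scheduled jobs live on the same async runtime.
    let mut scheduler = AsyncScheduler::new();
    // ... jobs registered on scheduler here ...
    tokio::spawn(async move {
        loop {
            scheduler.run_pending().await;
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
    });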

use tokio::sync::OnceCell;
use tracing::{error, info};

pub async fn setup_federation_scheduled_tasks(
Member commented:

Not really necessary to make this async

src/lib.rs (outdated)
}
});
setup_database_scheduled_tasks(db_url.clone(), context.clone())?;
setup_federation_scheduled_tasks(db_url, context.clone()).await?;
Member commented:

Either both should be made async, or neither. I remember not being able to get clokwerk async to work correctly though.

@@ -28,6 +29,8 @@ pub mod protocol;

pub const FEDERATION_HTTP_FETCH_LIMIT: u32 = 50;

pub static DEAD_INSTANCES: RwLock<Vec<String>> = RwLock::new(Vec::new());
Member commented:

Shouldn't this be the OnceCell?

pub inbox_url: Option<DbUrl>,
pub private_key: Option<Option<String>>,
pub public_key: Option<String>,
pub last_alive: Option<NaiveDateTime>,
Member commented:

Think you forgot to add the migration here.

@Nutomic (Member, Author) commented Jul 4, 2023

Like I said, it was very messy and needed cleanup, which I have now done, along with another rework of the code. Most importantly, dead instances and blocklists are now stored in single-value moka caches. Much cleaner than using scheduled tasks to update them.

I also restored scheduled_tasks to the original implementation. However there is a problem: the check uses nodeinfo, which isn't required for ActivityPub and isn't present on some Fediverse instances (e.g. misskey.de). So it needs an alternative check that a request to the domain root returns HTTP 200. Also, these requests should really be async.
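The single-value cache pattern itself is only a few lines. A minimal sketch of the idea, with placeholder names and TTL rather than the PR's exact code:

    use std::time::Duration;
    use moka::future::Cache;
    use once_cell::sync::Lazy;

    // Illustrative TTL (the blocklist cache duration is later reduced to one minute).
    const CACHE_DURATION: Duration = Duration::from_secs(60);

    // A cache with max_capacity(1) and a single () key behaves like a memoized
    // async value with a time-to-live.
    static DEAD_INSTANCES: Lazy<Cache<(), Vec<String>>> = Lazy::new(|| {
        Cache::builder()
            .max_capacity(1)
            .time_to_live(CACHE_DURATION)
            .build()
    });

    // Returns the cached list, recomputing it via the supplied future once the entry expires.
    async fn dead_instances(fetch: impl std::future::Future<Output = Vec<String>>) -> Vec<String> {
        DEAD_INSTANCES.get_with((), fetch).await
    }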

.max_capacity(1)
.time_to_live(DB_QUERY_CACHE_DURATION)
.build()
});
@Nutomic (Member, Author) commented:

Bit weird to use caches with capacity one and no key, but seems like the easiest way to implement this.

@phiresky (Collaborator) commented Jul 4, 2023:

I wanted to mention that when writing my other PR, I accidentally made it construct a whole new moka cache for every single incoming event (so thousands per second) and insert a single value, and it didn't negatively affect performance at all. Just as a reference that constructing tiny moka caches is probably fine performance-wise (if maybe not for code beauty).

instance::table
.select(instance::domain)
// TODO: should use instance::published if updated is null
//.filter(instance::updated.lt(now - 3.days()))
@Nutomic (Member, Author) commented:

Not sure how to write this query.

Collaborator commented:

COALESCE? coalesce(instance::updated, instance::published).lt(now - 3.days())

https://diesel.rs/guides/extending-diesel.html

use diesel::sql_types::{Nullable, Text};
sql_function!(fn coalesce(x: Nullable<Text>, y: Text) -> Text);

@Nutomic (Member, Author) commented:

That works, thanks! Although I don't see how to make it generic, so I have to write it specifically for the timestamp type.
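For reference, a timestamp-specific declaration along these lines (matching the coalesce_time helper used in the diff further down) is enough:

    use diesel::sql_function;
    use diesel::sql_types::{Nullable, Timestamp};

    // Declared specifically for Timestamp, since writing a fully generic coalesce
    // with sql_function! is not straightforward.
    sql_function!(fn coalesce_time(x: Nullable<Timestamp>, y: Timestamp) -> Timestamp);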

@Nutomic marked this pull request as ready for review July 4, 2023 13:59
@Nutomic (Member, Author) commented Jul 4, 2023

Ready for review/merge now.

@dessalines (Member) left a comment:

I would much rather this be handled in SQL only, via an alive_instances query, and just optimize those queries. I foresee a lot of problems coming from adding a secondary store / caching layer. People won't understand why they've blocked an instance yet posts are still coming through, for example.

crates/db_schema/src/utils.rs (outdated, resolved)
let conn = &mut get_conn(pool).await?;
instance::table
.select(instance::domain)
.filter(coalesce_time(instance::updated, instance::published).lt(now - 3.days()))
Member commented:

Looks good.

fn update_instance_software(conn: &mut PgConnection, user_agent: &str) {
///
/// TODO: this should be async
/// TODO: if instance has been dead for a long time, it should be checked less frequently
Member commented:

For this one, in the select below, you could do let instances = instance::table.filter(coalesce(updated, published).gt(now - 1.months()));

Even better would be to add this as alive_instances in impls/instance.rs.

Another possibility would be to recheck the alive_instances every day, but only re-check all of them (even previously dead ones) every month. Up to you.

@Nutomic (Member, Author) commented:

Or check old instances with random probability, e.g. in 1% of all checks.

@Nutomic (Member, Author) commented:

Anyway this can be improved later, no need to include it in this PR.
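A probabilistic recheck would only be a few lines; a hypothetical sketch, not part of this PR:

    use rand::Rng;

    // Long-dead instances are only rechecked in roughly 1% of scheduler runs.
    fn should_recheck_dead_instance() -> bool {
        rand::thread_rng().gen_ratio(1, 100)
    }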

crates/db_schema/src/impls/instance.rs (resolved)
crates/apub/src/lib.rs (outdated, resolved)
@phiresky (Collaborator) commented Jul 4, 2023

People won't understand why they've blocked an instance yet posts are still coming through, for example.

Should be fairly easy to also update the cache where the query updates the database. Won't fix the issue if people are running multiple lemmy_server instances though

I foresee a lot of problems down the road as we start adding layers of stores and caches on top of each other.

Cache invalidation is definitely a non-trivial problem. The site just being down because it can't handle millions of queries is arguably a bigger problem though :)

@phiresky (Collaborator) commented Jul 5, 2023

I want to mention that fetch_local_site_data is the third most expensive function in the code base, and the most expensive function that is actually needed. I'd recommend either merging this, or I can create a minimal PR that just caches fetch_local_site_data for a few seconds.

The cache duration can be significantly reduced and still have a huge impact. Even though this function takes only 1 ms, it is called at a frequency of over 1000 Hz on lemmy.world. A cache duration of 5 seconds would be perfectly fine, and even 1 s would be useful.

@RocketDerp

This comment was marked as abuse.

}
};
@Nutomic (Member, Author) commented:

This code is quite confusing; I'm open to suggestions on how to simplify it.

@ciscprocess commented Jul 6, 2023:

Maybe something like this? I'm new to Lemmy and Rust, so apologies if it's not appropriate for me to post this here. It explicitly sets default_form on HTTP 500, but I suspect that would happen anyway with the OG code on a deserialization failure. I haven't tested this for correctness, but maybe it can be a template for something more readable?

    let form_result = client.get(&node_info_url).send()
      .map(|response| {
        response.error_for_status()
          .and_then(|response|  response.json::<NodeInfo>())
          .map_or(default_form, |node_info| {
            InstanceForm::builder()
              .domain(instance.domain)
              .updated(Some(naive_now()))
              .software(node_info.software.and_then(|s| s.name))
              .version(node_info.version.clone())
              .build()
          })
      });

      if let Ok(form) = form_result {
        diesel::update(instance::table.find(instance.id))
          .set(form)
          .execute(conn)?;
      }

Edit: Or perhaps even this.

    let form_result = client.get(&node_info_url).send()
      .and_then(|response| response.error_for_status())
      .and_then(|response| response.json::<NodeInfo>())
      .map(|node_info| InstanceForm::builder()
                .domain(instance.domain)
                .updated(Some(naive_now()))
                .software(node_info.software.and_then(|s| s.name))
                .version(node_info.version.clone())
                .build())
      .or_else(|err| if err.is_status() { Ok(default_form) } else { Err(err) });

@Nutomic (Member, Author) commented:

I gave this a try but feel like it's getting even more confusing, so I will leave it as is.

@Nutomic (Member, Author) commented Jul 5, 2023

Moved the blocklist caching to #3486 and decreased cache time to one minute to ensure that changes take effect quickly.

This PR will need more scrutiny and testing; let's leave it for 0.18.2.

@dessalines (Member) left a comment:

Seems fine, just needs conflicts fixed.

@Nutomic enabled auto-merge (squash) July 13, 2023 09:01
@Nutomic disabled auto-merge July 13, 2023 14:11
@Nutomic merged commit 7d8cb93 into main on Jul 13, 2023
Nutomic added a commit to cetra3/lemmy that referenced this pull request Jul 19, 2023
* Check for dead federated instances (fixes LemmyNet#2221)

* move to apub crate, use timestamp

* make it compile

* clippy

* use moka to cache blocklists, dead instances, restore orig scheduled tasks

* remove leftover last_alive var

* error handling

* wip

* fix alive check for instances without nodeinfo, add coalesce

* clippy

* move federation blocklist cache to LemmyNet#3486

* unused deps
Nutomic added a commit that referenced this pull request Jul 21, 2023, with the same commit message as above.
7 participants