Check for dead federated instances (fixes #2221) #3427
Conversation
@@ -0,0 +1 @@
alter table site add column is_alive bool not null default true;
What about alive_check_fails_count integer default 0 instead of just a boolean? I.e. is_alive = alive_check_fails_count == 0.
That way, an exponential backoff could be added later to the is_alive check, for example:
- one scheduled task running once per hour which retries all sites where alive_check_fails_count = 1
- one scheduled task running once per day which retries where alive_check_fails_count < 5
- one scheduled task running once per week which retries where alive_check_fails_count >= 5
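Purely as an illustration of the backoff idea, a sketch with clokwerk scheduled tasks; the alive_check_fails_count column and the retry_sites_where helper are hypothetical, not part of this PR:

use clokwerk::{Interval, Scheduler};

// Hypothetical helper: would re-run the alive check for sites whose
// alive_check_fails_count matches the given predicate.
fn retry_sites_where(_fails_matches: impl Fn(i32) -> bool) {
  // query site/instance rows by fail count and retry the check here
}

fn setup_backoff_schedules(scheduler: &mut Scheduler) {
  // hourly: sites that have failed exactly once
  scheduler.every(Interval::Hours(1)).run(|| retry_sites_where(|fails| fails == 1));
  // daily: sites with fewer than 5 failures
  scheduler.every(Interval::Days(1)).run(|| retry_sites_where(|fails| fails < 5));
  // weekly: sites with 5 or more failures
  scheduler.every(Interval::Weeks(1)).run(|| retry_sites_where(|fails| fails >= 5));
}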
This comment was marked as abuse.
Following up on what @RocketDerp said, I guess it could be restructured the other way around: if you haven't received activity (comments, likes or posts) from a certain instance in X amount of time, actually check whether it is still alive. This reduces the risk of false positives. Otherwise, effectively 24 hours of defederation would be a heavy punishment for a maybe 30-60 second downtime due to scheduled backups, updates, etc.
Going by incoming traffic is problematic @RocketDerp - I've seen some cases of instances which send traffic out, but block incoming traffic, thus still tying up my federation workers.
This comment was marked as abuse.
Force-pushed from 914585c to 7360b3d
I've reworked this now to rely on the last_alive column. Right now the code is very messy and needs cleanup/error handling as well as testing.
This would probably be better as a SQL-only solution, rather than dealing with OnceCell and a combination of SQL + memory. I.e. before sending out any federation jobs, filter the instance list in SQL by your last_alive < X days column, rather than doing a contains(DEAD_INSTANCES).
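A minimal sketch of that SQL-only filtering, assuming the last_alive timestamp column from this PR lives on the instance table and a synchronous diesel connection (the function name is illustrative):

use diesel::prelude::*;
use lemmy_db_schema::schema::instance;

/// Domains of instances that have been seen alive within the last 3 days.
fn alive_instance_domains(conn: &mut PgConnection) -> QueryResult<Vec<String>> {
  let cutoff = chrono::Utc::now().naive_utc() - chrono::Duration::days(3);
  instance::table
    .select(instance::domain)
    // only send federation jobs to instances seen alive recently
    .filter(instance::last_alive.gt(cutoff))
    .load::<String>(conn)
}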
src/federation_scheduled_tasks.rs
Outdated
let mut scheduler = AsyncScheduler::new();

// Check for dead federated instances
static CONTEXT: OnceCell<LemmyContext> = OnceCell::const_new();
Shouldn't the DeadInstances be the only thing you need in a OnceCell? Why are the other ones needed?
src/federation_scheduled_tasks.rs
Outdated
});

// Manually run the scheduler in an event loop
tokio::spawn(async move {
I'd make both of these scheduled things use tokio, or neither - not one using tokio and the other a thread.
src/federation_scheduled_tasks.rs
Outdated
use tokio::sync::OnceCell;
use tracing::{error, info};

pub async fn setup_federation_scheduled_tasks(
Not really necessary to make this async
src/lib.rs
Outdated
}
});
setup_database_scheduled_tasks(db_url.clone(), context.clone())?;
setup_federation_scheduled_tasks(db_url, context.clone()).await?;
Either both should be made async, or both not. I remember not being able to get clokwerk async to work correctly though.
crates/apub/src/lib.rs
Outdated
@@ -28,6 +29,8 @@ pub mod protocol;

pub const FEDERATION_HTTP_FETCH_LIMIT: u32 = 50;

pub static DEAD_INSTANCES: RwLock<Vec<String>> = RwLock::new(Vec::new());
Shouldn't this be the OnceCell?
crates/db_schema/src/source/site.rs
Outdated
pub inbox_url: Option<DbUrl>,
pub private_key: Option<Option<String>>,
pub public_key: Option<String>,
pub last_alive: Option<NaiveDateTime>,
Think you forgot to add the migration here.
Like I said, it was very messy and needed cleanup. I did that now and also did another rework of the code. Most importantly, dead instances and also blocklists are now stored in single-value moka caches. Much cleaner than using scheduled tasks to update them. I also restored scheduled_tasks to the original implementation. However there is a problem, because it checks nodeinfo, which isn't required for ActivityPub and not present on some Fediverse instances (e.g. misskey.de). So it needs to alternatively check that a request to the domain root returns HTTP 200. Also, these requests should really be async.
.max_capacity(1)
.time_to_live(DB_QUERY_CACHE_DURATION)
.build()
});
Bit weird to use caches with capacity one and no key, but seems like the easiest way to implement this.
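For illustration, a capacity-one cache keyed by () might look roughly like this - a sketch assuming moka's sync cache, with placeholder names and TTL, not the PR's exact code:

use std::time::Duration;

use moka::sync::Cache;
use once_cell::sync::Lazy;

// Single-value cache: `()` is the only key, so the whole Vec is the cached value.
static DEAD_INSTANCES: Lazy<Cache<(), Vec<String>>> = Lazy::new(|| {
  Cache::builder()
    .max_capacity(1)
    .time_to_live(Duration::from_secs(60))
    .build()
});

fn dead_instances() -> Vec<String> {
  // Re-runs the loader once the TTL has expired, otherwise returns the cached value.
  DEAD_INSTANCES.get_with((), load_dead_instances_from_db)
}

fn load_dead_instances_from_db() -> Vec<String> {
  // placeholder for the actual DB query
  Vec::new()
}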
I wanted to mention that when writing my other PR, I accidentally made it construct a whole new moka cache for every single incoming event (so 1000s per second) and insert a single value, and it didn't negatively affect performance at all. Just as a reference that constructing tiny moka caches is probably fine regarding performance (if maybe not code beauty).
instance::table
  .select(instance::domain)
  // TODO: should use instance::published if updated is null
  //.filter(instance::updated.lt(now - 3.days()))
Not sure how to write this query.
COALESCE? coalesce(instance::updated, instance::published).lt(now - 3.days())
https://diesel.rs/guides/extending-diesel.html
use diesel::sql_types::{Nullable, Text};
sql_function! { fn coalesce(x: Nullable<Text>, y: Text) -> Text; }
That works, thanks! Although I don't see how to make it generic, and have to write it specifically for the timestamp type.
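A sketch of what the timestamp-specific definition could look like; the coalesce_time name mirrors its later use in this PR, but this exact definition is an assumption:

use diesel::sql_types::{Nullable, Timestamp};

// COALESCE(updated, published) for timestamp columns: returns the first non-null argument.
diesel::sql_function! {
  fn coalesce_time(x: Nullable<Timestamp>, y: Timestamp) -> Timestamp;
}

It can then be used in filters such as coalesce_time(instance::updated, instance::published).lt(now - 3.days()), as shown further down in this thread.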
Ready for review/merge now.
I would much rather this be handled in SQL only, via an alive_instances query, and just optimize those queries. I foresee a lot of problems coming from adding a secondary store / caching layer. People won't understand why they've blocked an instance yet posts are still coming through, for example.
let conn = &mut get_conn(pool).await?;
instance::table
  .select(instance::domain)
  .filter(coalesce_time(instance::updated, instance::published).lt(now - 3.days()))
Looks good.
fn update_instance_software(conn: &mut PgConnection, user_agent: &str) {
///
/// TODO: this should be async
/// TODO: if instance has been dead for a long time, it should be checked less frequently
For this one, in the select below, you could do let instances = instance::table.filter(coalesce(updated, published).gt(now - 1.months()));
Even better would be to add this as an alive_instances query in impls/instance.rs.
Another possibility would be to recheck the alive_instances every day, but only re-check all of them (even previously dead ones) every month. Up to you.
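Roughly, the suggested alive_instances query (meant to live in impls/instance.rs) could look like this; the one-month window is taken from the comment above, and the coalesce_time helper from earlier in this thread is assumed to be in scope:

use diesel::{
  dsl::{now, IntervalDsl},
  prelude::*,
};
use lemmy_db_schema::schema::instance;

/// Domains of instances that have published or updated anything in the last month.
fn alive_instances(conn: &mut PgConnection) -> QueryResult<Vec<String>> {
  instance::table
    .select(instance::domain)
    .filter(coalesce_time(instance::updated, instance::published).gt(now - 1.months()))
    .load::<String>(conn)
}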
Or check old instances using random probability, e.g. in 1% of all checks.
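For example, a simple probabilistic gate (a sketch assuming the rand crate):

use rand::Rng;

/// Returns true in roughly 1% of calls, so long-dead instances still get re-checked occasionally.
fn should_recheck_dead_instance() -> bool {
  rand::thread_rng().gen_bool(0.01)
}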
Anyway this can be improved later, no need to include it in this PR.
Should be fairly easy to also update the cache where the query updates the database. It won't fix the issue if people are running multiple lemmy_server instances, though.
Cache invalidation is definitely a non-trivial problem. The site just being down because it can't handle millions of queries is arguably a bigger problem though :)
I want to mention that fetch_local_site_data is the third most expensive function in the code base, and the most expensive function that is actually needed. I'd recommend either merging this, or I can create a minimal PR that just caches fetch_local_site_data for a few seconds. The cache duration can be significantly reduced and still have a huge impact. Even though this function takes only 1 ms, it is called with a frequency of over 1000 Hz on lemmy.world. A cache duration of 5 seconds would be perfectly fine, and even 1 second would be useful.
This comment was marked as abuse.
}
};
This code is quite confusing, open for suggestions on how to simplify it.
Maybe something like this? I'm new to Lemmy and Rust, so apologies if it is not appropriate for me to post this here. It explicitly sets default_form on HTTP 500, but I suspect that would happen anyway with the OG code on deserialization failure. Haven't tested this for correctness, but maybe it can be a template for something more readable?
let form_result = client.get(&node_info_url).send()
  .map(|response| {
    response.error_for_status()
      .and_then(|response| response.json::<NodeInfo>())
      .map_or(default_form, |node_info| {
        InstanceForm::builder()
          .domain(instance.domain)
          .updated(Some(naive_now()))
          .software(node_info.software.and_then(|s| s.name))
          .version(node_info.version.clone())
          .build()
      })
  });
if let Ok(form) = form_result {
  diesel::update(instance::table.find(instance.id))
    .set(form)
    .execute(conn)?;
}
Edit: Or perhaps even this.
let form_result = client.get(&node_info_url).send()
  .and_then(|response| response.error_for_status())
  .and_then(|response| response.json::<NodeInfo>())
  .map(|node_info| InstanceForm::builder()
    .domain(instance.domain)
    .updated(Some(naive_now()))
    .software(node_info.software.and_then(|s| s.name))
    .version(node_info.version.clone())
    .build())
  .or_else(|err| if err.is_status() { Ok(default_form) } else { Err(err) });
I gave this a try but feel like it's getting even more confusing. So I will leave it as is.
Moved the blocklist caching to #3486 and decreased the cache time to one minute to ensure that changes take effect quickly. This PR will need more scrutiny and testing, let's leave it for 0.18.2.
Force-pushed from 79e3dc2 to a898aca
Force-pushed from 2ec4aa1 to 733bdd2
Force-pushed from 1a581b6 to 17be5e5
Force-pushed from 17be5e5 to 6c5c6e1
Seems fine, just needs conflicts fixed.
* Check for dead federated instances (fixes #2221)
* move to apub crate, use timestamp
* make it compile
* clippy
* use moka to cache blocklists, dead instances, restore orig scheduled tasks
* remove leftover last_alive var
* error handling
* wip
* fix alive check for instances without nodeinfo, add coalesce
* clippy
* move federation blocklist cache to #3486
* unused deps
A very basic check which, every day, tries to connect to every known instance. If the connection fails or returns something other than HTTP 200, the instance is marked as dead and no federation activities will be sent to it.
This implementation is really basic; there can be false positives if an instance is temporarily down or unreachable during the check. It also rechecks all known instances every day, even if they have been down for years. Nevertheless it should be a major improvement, and we can add more sophisticated checks later.
Still need to fix two problems mentioned in comments.
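A rough sketch of the root-URL part of such a check, using a blocking reqwest client (names are illustrative; the PR's actual implementation also consults nodeinfo where available):

use reqwest::blocking::Client;

/// Returns true if a GET to the instance's root URL answers with HTTP 200.
fn instance_is_alive(client: &Client, domain: &str) -> bool {
  client
    .get(format!("https://{domain}/"))
    .send()
    .map(|response| response.status() == reqwest::StatusCode::OK)
    .unwrap_or(false)
}

Instances for which this returns false would then be skipped when sending out federation activities, until a later check succeeds again.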