"table or event-trigger not found in schema cache" error in logs #5461
Update: running the two queries below seems to help; I haven't seen the error messages anymore.

```sql
DELETE FROM hdb_catalog.event_invocation_logs;
DELETE FROM hdb_catalog.event_log;
```
This has happened to us on multiple occasions, specifically when we delete event triggers. Hasura seems to go haywire with it until we clear the event logs. To diagnose, I recommend the following:

```sql
SELECT trigger_name, archived, count(*)
FROM hdb_catalog.event_log
GROUP BY trigger_name, archived;
```

There should be no event triggers listed that have been removed or look out of place. If there are, clear them individually to avoid deleting data you might still need from other event triggers. Replace `<event_trigger_name>` with the offending trigger name:

```sql
DELETE FROM hdb_catalog.event_invocation_logs
WHERE event_id IN (
  SELECT id FROM hdb_catalog.event_log
  WHERE trigger_name = '<event_trigger_name>'
);

DELETE FROM hdb_catalog.event_log
WHERE trigger_name = '<event_trigger_name>';
```

These queries have fixed the issue in our case. It'd be great for the Hasura team to understand what's really causing this, but as you mentioned it seems related to infinite retry loops or something similar.
The error message definitely needs improvement. This may happen when you have events in the queue for an event trigger that you just deleted. But it could also be some other race condition.
@tirumaraiselvan Thank you for clarifying! Would you be looking at fixing the issue, improving the error message, or both? We need to prevent this from happening, as Hasura creates a very high number of DB connections until it maxes out. We use Cloud SQL, which enforces a connection limit and cuts off other valid connections. Ultimately this results in intermittent failures that affect app functionality. We were looking at implementing a mitigation that detects the issue and then automatically clears out the affected event logs (see the sketch below), but we'll skip it if you're looking at a fix.
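For illustration only, a minimal sketch of such a detection query (not taken from this thread; it assumes the `hdb_catalog.event_log` and `hdb_catalog.event_triggers` tables that later comments reference) could flag events whose trigger no longer exists in Hasura's metadata:

```sql
-- Count logged events whose event trigger is no longer defined in
-- Hasura's metadata (hdb_catalog.event_triggers).
SELECT el.trigger_name, count(*) AS orphaned_events
FROM hdb_catalog.event_log AS el
WHERE NOT EXISTS (
    SELECT 1
    FROM hdb_catalog.event_triggers AS et
    WHERE et.name = el.trigger_name
)
GROUP BY el.trigger_name;
```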
You can always limit the number of connections Hasura creates via its configuration. We will definitely investigate the cause of this error as part of improving the error message. Currently, it is not clear why it is happening. The only likely situation I can think of is that the events were being processed while the event trigger was dropped, in which case it is rather benign.
For me it was an event trigger that had retried 4 times and failed every time.
@tirumaraiselvan this error still appears after updating to 1.3.2
Unfortunately, the improved error message in #5718 didn't get included in v1.3.2. We will definitely try to incorporate it in the next release. Meanwhile, please check the comment #5461 (comment) to identify the offending event trigger.
@tirumaraiselvan We've noticed a more serious issue related to these errors. It's now clear that these errors are leading to an overload of DB connections from Hasura. It can lead to failures in deployments, as a new instance can struggle to connect to the DB and get rejected due to max connections. We use Kubernetes with replicated instances and rolling updates, so we run into this often, unfortunately. You can see evidence of this in these two charts: DB connections match up perfectly with the number of these errors per second. The gap in the middle is when we cleared up the errors as mentioned in the previous comment (#5461). Improving the logs is definitely a step in the right direction, but ideally this would be fixed?
@petecorreia Thanks for adding more details. We are investigating this.
Just flagging that this happened to us too after an upgrade. The log file flooded with gigabytes of the same event-trigger error and crashed the server. After clearing the logs as per #5461 (comment) and manually deleting the log file, things seem OK for now.
Hi @tirumaraiselvan - I work with @petecorreia and I have a little more detail on this issue which might help you track things down; specifically, I think there are two separate but related bugs here that combine to cause this issue. Before detailing the two issues we found, a little context on our setup:
Issue 1 - Metadata not applying properly

On investigating our DB, which is currently suffering from this issue, it seems that at some point this metadata application on deploy failed, and once that had happened Hasura wasn't able to detect that the triggers had not been removed (I understand this may be difficult because there may also be user-created triggers within the DB). So at this point we have the unfortunate situation that Postgres still has triggers attached to a table, but from Hasura's perspective these triggers no longer exist.

Issue 2 - Event invocations for missing triggers do not update tries on the
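To make Issue 1 concrete, here is a hedged sketch (not part of the original comment) of a query that lists the `notify_hasura_*` triggers still physically present in Postgres, which can then be compared against what `hdb_catalog.event_triggers` says Hasura knows about:

```sql
-- List Hasura-generated triggers still present in Postgres. Comparing this
-- against hdb_catalog.event_triggers reveals triggers Hasura no longer tracks.
SELECT ns.nspname AS schema_name,
       cl.relname AS table_name,
       tg.tgname  AS trigger_name
FROM pg_trigger tg
JOIN pg_class cl     ON cl.oid = tg.tgrelid
JOIN pg_namespace ns ON ns.oid = cl.relnamespace
WHERE NOT tg.tgisinternal
  AND tg.tgname LIKE 'notify_hasura_%'
ORDER BY 1, 2, 3;
```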
Thanks for the insight. I also checked the event_log table on my end and I have the same problem. The number of logs grows drastically with the number of events that failed to deliver.
@hrgui - we have attempted to mitigate this by adding a manual migration to remove any invalid triggers, along the lines of:

```sql
DROP TRIGGER IF EXISTS "notify_hasura_table_name-change_UPDATE" ON table_name CASCADE;
DROP TRIGGER IF EXISTS "notify_hasura_table_name-change_INSERT" ON table_name CASCADE;
DROP TRIGGER IF EXISTS "notify_hasura_table_name-change_DELETE" ON table_name CASCADE;
```

where those trigger names are the leftover notify_hasura_* triggers we found in the database.

We also added this function to delete events belonging to invalid triggers:

```sql
CREATE OR REPLACE FUNCTION purge_invalid_events() RETURNS void AS $$
BEGIN
  WITH valid_trigger_names AS (
    -- event triggers Hasura currently knows about
    SELECT DISTINCT name FROM hdb_catalog.event_triggers
  ), invalid_event_ids AS (
    -- events whose trigger no longer exists
    SELECT id FROM hdb_catalog.event_log
    WHERE trigger_name NOT IN (SELECT * FROM valid_trigger_names)
  ), delete_event_invocations AS (
    -- remove the invocation logs for those events
    DELETE FROM hdb_catalog.event_invocation_logs
    WHERE event_id IN (SELECT * FROM invalid_event_ids)
  )
  -- finally remove the orphaned events themselves
  DELETE FROM hdb_catalog.event_log
  WHERE trigger_name NOT IN (SELECT * FROM valid_trigger_names);
END;
$$ LANGUAGE plpgsql;
```

which we call daily via a cron job. We're hoping this will mitigate the problem for the time being, but hopefully there will be a cleaner fix than this workaround, as we'll continue to need to check the database for any dangling invalid triggers.
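As a usage sketch (the comment doesn't say which scheduler is used; the pg_cron extension below is an assumption), the function can be run manually or scheduled inside Postgres:

```sql
-- Run the cleanup once, manually:
SELECT purge_invalid_events();

-- Or, if the pg_cron extension is installed, schedule it daily at 03:00:
SELECT cron.schedule('purge-invalid-events', '0 3 * * *',
                     'SELECT purge_invalid_events()');
```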
I have been able to reproduce this, finally. Thanks for your details and patience.
This was done by design so as not to fail silently for an event; hitherto this was known to be caused only by transient issues. We will see how we can improve this as well.
Basically, replace_metadata is broken when dropping event triggers. Repro:
Hey folks, while we roll out a fix for this in the next few days (v1.3.3), you can run this query to clear any invalid Postgres triggers:
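The exact query is not reproduced here; as a hedged sketch of the idea (assuming Hasura's `notify_hasura_<trigger-name>_<OP>` trigger naming and the `hdb_catalog.event_triggers` table), something like this drops Postgres triggers that no longer have a matching event trigger in the metadata:

```sql
-- Sketch only: drop notify_hasura_* triggers whose event trigger no longer
-- exists in hdb_catalog.event_triggers. Review before running in production.
DO $$
DECLARE
  rec record;
BEGIN
  FOR rec IN
    SELECT tg.tgname AS trigger_name,
           format('%I.%I', ns.nspname, cl.relname) AS table_name
    FROM pg_trigger tg
    JOIN pg_class cl     ON cl.oid = tg.tgrelid
    JOIN pg_namespace ns ON ns.oid = cl.relnamespace
    WHERE NOT tg.tgisinternal
      AND tg.tgname LIKE 'notify_hasura_%'
      AND NOT EXISTS (
        SELECT 1 FROM hdb_catalog.event_triggers et
        WHERE tg.tgname LIKE 'notify_hasura_' || et.name || '_%'
      )
  LOOP
    EXECUTE format('DROP TRIGGER IF EXISTS %I ON %s',
                   rec.trigger_name, rec.table_name);
  END LOOP;
END $$;
```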
Note that you may need to clear the events which might have been created due to these invalid triggers:
Courtesy: #5461 (comment). Another way is to run this SQL before any
Hey @tirumaraiselvan, this seems to be back in v2.0.0-alpha.10. The error is slightly different, though. In case someone finds them handy, we used these SQL bits to clean up the unused crons:
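The SQL itself is not reproduced above; purely as an assumption-laden sketch (the `hdb_cron_*` catalog tables named below are an assumption about the v2 scheduled-trigger catalog, not something stated in this thread), such a cleanup might look like:

```sql
-- Hypothetical: remove cron events (and their invocation logs) whose cron
-- trigger is no longer present in the metadata catalog.
DELETE FROM hdb_catalog.hdb_cron_event_invocation_logs
WHERE event_id IN (
  SELECT id FROM hdb_catalog.hdb_cron_events
  WHERE trigger_name NOT IN (SELECT name FROM hdb_catalog.hdb_cron_triggers)
);

DELETE FROM hdb_catalog.hdb_cron_events
WHERE trigger_name NOT IN (SELECT name FROM hdb_catalog.hdb_cron_triggers);
```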
Just upgraded from 1.2.2 to 1.3.0 and noticed the following error message popping up more than once:
How do we go about debugging this error, and what would be the cause? Hasura still functions normally, but the log message is spammed almost every second or minute.