feat(alert): Refines send recap alerts command #4672
Conversation
Thanks @ERosendo this looks good. Just a small suggestion.
Also, as we discussed on Slack, it seems that the root cause of this issue is that Elasticsearch is taking too long to complete the re-indexing process:
INFO Re-indexing task scheduled ID: UkMvqJ2UQQqjcVU9Dxq7_w:768120192
INFO Task progress: 22000/3000097 documents. Estimated time to finish: 8126.203466 seconds.
INFO Task progress: 22000/3000097 documents. Estimated time to finish: 1108154.855345 seconds.
From this log, it appears that after waiting 8,126 seconds, the task status reported the same progress of 22,000 documents, so the estimated time was recomputed to be much longer. This could indicate that ES was overwhelmed during that period, causing the task to slow down. It's also possible that the reindexing process simply takes several hours due to the large volume of documents (over 3 million) targeted for reindexing. If that's the case, we have a problem, as we can't afford to wait more than a couple of hours to send RT/daily alerts.
I'll try to gather some metrics from these task queries in Kibana to verify the estimated times.
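For what it's worth, the jump in the estimate is what you'd expect if the estimate is a simple linear extrapolation from progress so far. A minimal sketch under that assumption (the real compute_estimated_remaining_time may differ, but the logged numbers are consistent with it):

```python
def estimate_remaining(elapsed_seconds: float, done: int, total: int) -> float:
    """Linear extrapolation: seconds per document so far, times documents left."""
    return elapsed_seconds * (total - done) / done


total, done = 3_000_097, 22_000

# Roughly a minute after the task started, the projection looks manageable:
print(estimate_remaining(60, done, total))          # ~8,122 s (logged: 8,126 s)

# After sleeping those ~8,126 s with no new progress, the same formula
# projects a wait that is two orders of magnitude longer:
print(estimate_remaining(60 + 8_126, done, total))  # ~1,108,122 s (logged: 1,108,155 s)
```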
@@ -98,7 +98,7 @@ def compute_estimated_remaining_time(
         initial_wait,
     )

-    return estimated_time_remaining
+    return min(estimated_time_remaining, 7200)
Would it be possible to return the original estimated time here and use min(estimated_time_remaining, 7200) only within the sleep call inside the while loop? This way, we can still know the actual estimated time while checking the task status more frequently. Additionally, I think it would be more informative if we logged the estimated time more frequently, perhaps every 15 minutes (900 seconds) instead of every 2 hours (7,200 seconds)?
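Something along these lines (a rough sketch; the helper callables and names are hypothetical stand-ins for whatever the command actually uses to poll the task):

```python
import time

MAX_POLL_INTERVAL = 900  # seconds; re-check and log the ES task at least every 15 minutes


def wait_for_reindex(get_remaining_estimate, is_finished, logger):
    """Keep the uncapped estimate for logging, but cap only the sleep between
    status checks so progress is reported every 15 minutes at most."""
    while not is_finished():
        estimated_time_remaining = get_remaining_estimate()  # original, uncapped value
        logger.info(
            "Estimated time to finish: %.0f seconds.", estimated_time_remaining
        )
        time.sleep(min(estimated_time_remaining, MAX_POLL_INTERVAL))
```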
> only within the sleep call inside the while loop?

Sure!

> perhaps every 15 minutes (900 seconds) instead of every 2 hours (7200)?

Alright! I'll apply your suggestion.
You might have missed it, Alberto, but we had some issues with the indexes while you were out, and they were a bit degraded. It's possible that could be part of the cause here. If so, hopefully that's not something we'd need to deal with all the time. Why was it indexing 3M objects? Isn't that kind of weird? I thought it'd be more like 100k/day?
Yes, the reason is that even though we receive only around 100K dockets or RDs per day, we also need to re-index related documents to match cross-object alerts that could involve the docket and any of its RECAPDocuments. So we re-index:
I believe the query causing the number of documents that need to be re-indexed to grow significantly is: In this case, it might be matching dockets with thousands of documents that also need to be indexed. I wanted to replicate the queries using the dates shown in the logs to confirm those numbers, but unfortunately, the URL Ramiro shared to access Kibana or the Elasticsearch cluster endpoint is not working due to timeout issues.
This looks good. Thanks @ERosendo
I think we can merge it and test how it performs with these tweaks. This way, we’ll be able to measure how long the command takes to complete.
@mlissner as Eduardo mentioned, we’ll also need to clean up the Redis keys described above.
Additionally, it would be a good idea to check all related reindex tasks in Elasticsearch to gather their details. This data would help us understand how many documents these tasks have targeted over the last few days. And then we should cancel any unnecessary tasks so that only one task runs when the command executes.
To get the details:
GET _tasks?detailed=true&actions=*reindex
This will show a list of tasks with details, including their IDs. If you like, you could share this list so we can decide which tasks to cancel.
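If it's more convenient to do from a script, the same call (and the cancel endpoint) are available through the Python client. A sketch, with the endpoint and credentials as placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint/credentials

# Equivalent to: GET _tasks?detailed=true&actions=*reindex
tasks = es.tasks.list(detailed=True, actions="*reindex")
for node in tasks["nodes"].values():
    for task_id, task in node["tasks"].items():
        running_s = task["running_time_in_nanos"] / 1e9
        print(f"{task_id}  running for {running_s:,.0f}s  {task.get('description', '')}")

# To cancel an unnecessary task once we've decided which ones to keep:
# es.tasks.cancel(task_id="<node_id>:<task_number>")
```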
From the screenshot you shared, I noticed multiple instances are running the command. The issue is that when the cron job triggers, it starts a new command while previous instances are still running due to the long wait times. The idea was for only one instance of the command to run at a time; otherwise, alerts could be duplicated.
With the tweaks in this PR, we’ll see if the command can complete in less than a day. But to prevent any issues, it might also be worth including an additional Redis key to indicate whether another instance of the command is already running, so that a newly invoked instance can terminate early (sketched below).
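A minimal sketch of that guard using redis-py; the key name and TTL are hypothetical, and the real command would go through the project's Redis helpers rather than a raw client:

```python
import redis

r = redis.Redis()  # placeholder; production would use the project's Redis settings

LOCK_KEY = "alert_sweep:command_running"  # hypothetical key name
LOCK_TTL = 60 * 60 * 24                   # expire after a day so a crashed run can't block forever


def acquire_sweep_lock() -> bool:
    """SET NX EX: only succeeds if no other instance currently holds the lock."""
    return bool(r.set(LOCK_KEY, "1", nx=True, ex=LOCK_TTL))


if not acquire_sweep_lock():
    # Another instance of the command is still running; exit early to avoid duplicate alerts.
    raise SystemExit(0)
```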
However, if the command continues to take several hours to complete, we could consider using slicing to parallelize the reindexing process and speed it up.
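For reference, the reindex API supports slicing directly. A sketch of what that could look like through the Python client; the endpoint, source index, and query are placeholders, not the command's actual sweep query:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# slices="auto" lets ES pick one slice per shard so the reindex runs in parallel;
# wait_for_completion=False returns a task ID that can be polled, as the command does today.
response = es.reindex(
    body={
        "source": {"index": "source-index", "query": {"match_all": {}}},  # placeholders
        "dest": {"index": "recap_document_sweep"},
    },
    slices="auto",
    wait_for_completion=False,
)
print(response["task"])  # e.g. "<node_id>:<task_number>"
```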
Seems like a good idea. I guess we can monitor and see if we need this. Merging, thank you!
Ran the redis commands. We should get a deployment shortly. Do we just need to check on the cronjob tomorrow, then, I assume?
Yes, we could also re-check the ES query to confirm that the reindexing task is running.
This PR addresses issues described in #4646 (comment).
Key changes:
- Refines checks to prevent early termination of the recap_document_sweep reindex process.
- Enforces an upper bound on time-estimate calculations.
To ensure a clean slate for the next execution and prevent issues from previous failed attempts, we should run the following Python script in production to remove any residual keys from Redis:
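The script itself isn't reproduced here; roughly, it would look like the sketch below, with the key patterns as placeholders that must match whatever prefixes the command actually writes:

```python
import redis

r = redis.Redis()  # placeholder; production would use the project's Redis connection settings

# Hypothetical patterns; replace with the keys the sweep command actually leaves behind.
RESIDUAL_KEY_PATTERNS = ["alert_sweep:*"]

for pattern in RESIDUAL_KEY_PATTERNS:
    # scan_iter avoids blocking Redis the way KEYS would on a large keyspace.
    for key in r.scan_iter(match=pattern, count=1000):
        r.delete(key)
        print(f"Deleted {key!r}")
```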