feat(alert): Refines send recap alerts command #4672

Merged
merged 3 commits into main from 4646-refine-send-recap-alerts-command on Nov 14, 2024

Conversation

ERosendo (Contributor)

This PR addresses issues described in #4646 (comment).

Key changes:

  • Refines checks to prevent early termination of the recap_document_sweep reindex process.

  • Enforces an upper bound on time estimate calculations.

To ensure a clean slate for the next execution and prevent issues from previous failed attempts, we should run the following Python script in production to remove any residual keys from Redis:

from cl.lib.redis_utils import get_redis_interface

r = get_redis_interface("CACHE")

# Clear residual sweep state left behind by previously failed runs.
r.delete("alert_sweep:main_re_index_completed")
r.delete("alert_sweep:rd_re_index_completed")
r.delete("alert_sweep:query_date")
r.delete("alert_sweep:task_id")

@albertisfu (Contributor) left a comment

Thanks @ERosendo, this looks good. Just a small suggestion.

Also, as we discussed on Slack, it seems that the root cause of this issue is that Elasticsearch is taking too long to complete the re-indexing process:

INFO Re-indexing task scheduled ID: UkMvqJ2UQQqjcVU9Dxq7_w:768120192
INFO Task progress: 22000/3000097 documents. Estimated time to finish: 8126.203466 seconds.
INFO Task progress: 22000/3000097 documents. Estimated time to finish: 1108154.855345 seconds.   

From this log, it appears that after waiting 8126 seconds, the task status reported the same progress of 22,000 documents. As a result, the estimated time was recomputed to be much longer. This could indicate that ES was overwhelmed during that period, causing the task to slow down. Additionally, it's possible that the reindexing process might take several hours due to the large volume of documents (over 3 million) targeted for reindexing. If that's the case, we have a problem, as we can't afford to wait more than a couple of hours to send RT/daily alerts.
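
For intuition, the jump in the estimate is consistent with a simple linear-rate projection. A rough sketch of that arithmetic (illustrative names only, not the actual compute_estimated_remaining_time implementation):

import datetime

def linear_estimate(start_time, completed_docs, total_docs):
    # Remaining documents divided by the average indexing rate observed so far.
    elapsed = (datetime.datetime.now() - start_time).total_seconds()
    rate = completed_docs / elapsed  # documents per second
    return (total_docs - completed_docs) / rate

If the first check ran about a minute in, 22,000/3,000,097 documents yields an estimate of roughly 8,126 seconds; if progress is still 22,000 some 8,100 seconds later, the same formula balloons to roughly 1.1 million seconds.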

I'll try to gather some metrics from these task queries in Kibana to verify the estimated times.

@@ -98,7 +98,7 @@ def compute_estimated_remaining_time(
         initial_wait,
     )

-    return estimated_time_remaining
+    return min(estimated_time_remaining, 7200)
Contributor

Would it be possible to return the original estimated time here and use min(estimated_time_remaining, 7200) only within the sleep call inside the while loop? This way, we can still know the actual estimated time while checking the task status more frequently. Additionally, I think it would be more informative if we logged the estimated time more frequently, perhaps every 15 minutes (900 seconds) instead of every 2 hours (7200)?
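
A rough sketch of that suggestion (the helpers and names here are placeholders, not the command's actual code):

import logging
import time

logger = logging.getLogger(__name__)
POLL_CAP_SECONDS = 900  # re-check and log at least every 15 minutes

def wait_for_reindex(get_task_status, compute_estimated_remaining_time):
    # get_task_status() -> (completed, task_info); both callables stand in for
    # the command's existing helpers.
    completed, task_info = get_task_status()
    while not completed:
        estimated_time_remaining = compute_estimated_remaining_time(task_info)
        logger.info(
            "Estimated time to finish: %s seconds.", estimated_time_remaining
        )
        # Keep the real estimate for logging, but never sleep longer than the cap.
        time.sleep(min(estimated_time_remaining, POLL_CAP_SECONDS))
        completed, task_info = get_task_status()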

Contributor Author

> only within the sleep call inside the while loop?

Sure!

> perhaps every 15 minutes (900 seconds) instead of every 2 hours (7200)?

Alright! I'll apply your suggestion.

@mlissner (Member)

> it seems that the root cause of this issue is that Elasticsearch is taking too long to complete the re-indexing process:

You might have missed it, Alberto, but we had some issues with the indexes while you were out, and they were a bit degraded. It's possible that could be part of the cause here. If so, hopefully that's not something we'd need to deal with all the time.

Why was it indexing 3M objects? Isn't that kind of weird? I thought it'd be more like 100k/day?

@albertisfu (Contributor)

> Why was it indexing 3M objects? Isn't that kind of weird? I thought it'd be more like 100k/day?

Yes, the reason is that even if we receive around 100K dockets or RDs per day, we also need to re-index related documents to match cross-object alerts that could involve the docket and any of its RECAPDocuments.

So we re-index:

  • Dockets added/modified in the day
  • RECAPDocuments with parents added/modified in the day
  • RECAPDocuments added/modified in the day
  • Dockets that are parents of RECAPDocuments added/modified in the day

I believe the query that causes the number of documents to be re-indexed to grow significantly is the one for RECAPDocuments whose parents were added/modified on the same day.

In this case, it might be matching dockets with thousands of documents that also need to be indexed.
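
A hypothetical ORM sketch of that fan-out (model and field names are assumptions about the schema, not the sweep command's exact queries):

from datetime import date

from cl.search.models import Docket, RECAPDocument  # assumed import path

query_date = date(2024, 11, 13)  # illustrative date
modified_dockets = Docket.objects.filter(date_modified__date=query_date)
# Every RECAPDocument under a modified docket is pulled in, so a single busy
# docket can contribute thousands of documents to the re-index.
related_rds = RECAPDocument.objects.filter(
    docket_entry__docket__in=modified_dockets
)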

I wanted to replicate the queries using the dates shown in the logs to confirm those numbers, but unfortunately, the URL that Ramiro shared to access Kibana or the Elasticsearch cluster endpoint is not working due to timeout issues.

@albertisfu (Contributor) left a comment

This looks good. Thanks @ERosendo

I think we can merge it and test how it performs with these tweaks. This way, we’ll be able to measure how long the command takes to complete.

@mlissner as Eduardo mentioned, we’ll also need to clean up the Redis keys described above.

Additionally, it would be a good idea to check all related reindex tasks in Elasticsearch to gather their details. This data would help us understand how many documents these tasks have targeted over the last few days. Then we should cancel any unnecessary tasks so that only one task runs when the command executes.

To get the details:

GET _tasks?detailed=true&actions=*reindex

This will show a list of tasks with details, including their IDs. If you like, you could share this list so we can decide which tasks to cancel.
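
And, for reference, any task from that list can then be stopped with the task cancel API (replace <task_id> with an ID from the output above):

POST _tasks/<task_id>/_cancel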

From the screenshot you shared, I noticed multiple instances are running the command. The issue here is that when the cron job triggers, it starts a new command while previous instances are still running due to the long wait times. The idea was for only one instance of the command to run at a time; otherwise, alerts could be duplicated.

With the tweaks in this PR, we’ll see if the command can complete in less than a day. But to prevent problems, it might also be worth including an additional Redis key to indicate whether another instance of the command is already running. This way, if another instance is invoked, it can terminate early to avoid duplicate alerts.
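
A rough sketch of that guard, reusing the project's Redis helper (the key name and TTL are assumptions, not an agreed design):

import sys

from cl.lib.redis_utils import get_redis_interface

r = get_redis_interface("CACHE")
# SET with nx/ex only succeeds if no other instance already holds the key; the
# TTL keeps a crashed run from leaving the lock behind forever.
acquired = r.set("alert_sweep:command_running", "1", nx=True, ex=60 * 60 * 24)
if not acquired:
    print("Another instance of the command is already running; exiting early.")
    sys.exit(0)
# ... run the sweep, then release the lock:
r.delete("alert_sweep:command_running")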

However, if the command continues to take several hours to complete, we could consider using slicing to parallelize the reindexing process and speed it up.
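
For illustration, the Reindex API supports sliced, parallel execution via the slices parameter (index names here are placeholders for the command's real source and sweep indices):

POST _reindex?slices=auto&wait_for_completion=false
{
  "source": { "index": "<source-index>" },
  "dest": { "index": "<sweep-index>" }
}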

@mlissner (Member)

> With the tweaks in this PR, we’ll see if the command can complete in less than a day. But to prevent problems, it might also be worth including an additional Redis key to indicate whether another instance of the command is already running. This way, if another instance is invoked, it can terminate early to avoid duplicate alerts.

Seems like a good idea. I guess we can monitor and see if we need this.

Merging, thank you!

@mlissner merged commit 174aa8c into main on Nov 14, 2024
15 checks passed
@mlissner deleted the 4646-refine-send-recap-alerts-command branch on November 14, 2024 at 23:51
@mlissner (Member)

Ran the Redis commands. We should get a deployment shortly. Do we just need to check on the cronjob tomorrow, then, I assume?

@albertisfu (Contributor)

> Do we just need to check on the cronjob tomorrow, then, I assume?

Yes. We could also re-run the ES query to confirm that the reindexing task is running:

GET _tasks?detailed=true&actions=*reindex
