feat(alert): Refines send recap alerts command #4672

Merged
merged 3 commits into main from 4646-refine-send-recap-alerts-command on Nov 14, 2024

Conversation

ERosendo (Contributor)

This PR addresses issues described in #4646 (comment).

Key changes:

  • Refines checks to prevent early termination of the recap_document_sweep reindex process.

  • Enforces an upper bound on time estimate calculations.

To ensure a clean slate for the next execution and prevent issues from previous failed attempts, we should run the following Python script in production to remove any residual keys from Redis:

from cl.lib.redis_utils import get_redis_interface

r = get_redis_interface("CACHE")

# Clear residual sweep state left behind by previously failed runs.
r.delete("alert_sweep:main_re_index_completed")
r.delete("alert_sweep:rd_re_index_completed")
r.delete("alert_sweep:query_date")
r.delete("alert_sweep:task_id")

@albertisfu (Contributor) left a comment

Thanks @ERosendo, this looks good. Just a small suggestion.

Also, as we discussed on Slack, it seems that the root cause of this issue is that Elasticsearch is taking too long to complete the re-indexing process:

INFO Re-indexing task scheduled ID: UkMvqJ2UQQqjcVU9Dxq7_w:768120192
INFO Task progress: 22000/3000097 documents. Estimated time to finish: 8126.203466 seconds.
INFO Task progress: 22000/3000097 documents. Estimated time to finish: 1108154.855345 seconds.   

From this log, it appears that after waiting 8126 seconds, the task status reported the same progress of 22,000 documents. As a result, the estimated time was recomputed to be much longer. This could indicate that ES was overwhelmed during that period, causing the task to slow down. Additionally, it's possible that the reindexing process might take several hours due to the large volume of documents (over 3 million) targeted for reindexing. If that's the case, we have a problem, as we can't afford to wait more than a couple of hours to send RT/daily alerts.
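
For intuition, the jump in the estimate is consistent with a simple linear-rate projection. A rough sketch of that arithmetic (illustrative names only, not the actual compute_estimated_remaining_time implementation):

import datetime

def linear_estimate(start_time, completed_docs, total_docs):
    # Remaining documents divided by the average indexing rate observed so far.
    elapsed = (datetime.datetime.now() - start_time).total_seconds()
    rate = completed_docs / elapsed  # documents per second
    return (total_docs - completed_docs) / rate

If the first check ran about a minute in, 22,000/3,000,097 documents yields an estimate of roughly 8,126 seconds; if progress is still 22,000 some 8,100 seconds later, the same formula balloons to roughly 1.1 million seconds.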

I'll try to gather some metrics from these task queries in Kibana to verify the estimated times.

@@ -98,7 +98,7 @@ def compute_estimated_remaining_time(
         initial_wait,
     )

-    return estimated_time_remaining
+    return min(estimated_time_remaining, 7200)
Contributor

Would it be possible to return the original estimated time here and use min(estimated_time_remaining, 7200) only within the sleep call inside the while loop? This way, we can still know the actual estimated time while checking the task status more frequently. Additionally, I think it would be more informative if we logged the estimated time more frequently, perhaps every 15 minutes (900 seconds) instead of every 2 hours (7200)?
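
A rough sketch of that suggestion (the helpers and names here are placeholders, not the command's actual code):

import logging
import time

logger = logging.getLogger(__name__)
POLL_CAP_SECONDS = 900  # re-check and log at least every 15 minutes

def wait_for_reindex(get_task_status, compute_estimated_remaining_time):
    # get_task_status() -> (completed, task_info); both callables stand in for
    # the command's existing helpers.
    completed, task_info = get_task_status()
    while not completed:
        estimated_time_remaining = compute_estimated_remaining_time(task_info)
        logger.info(
            "Estimated time to finish: %s seconds.", estimated_time_remaining
        )
        # Keep the real estimate for logging, but never sleep longer than the cap.
        time.sleep(min(estimated_time_remaining, POLL_CAP_SECONDS))
        completed, task_info = get_task_status()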

Contributor Author

> only within the sleep call inside the while loop?

Sure!

> perhaps every 15 minutes (900 seconds) instead of every 2 hours (7200)?

Alright! I'll apply your suggestion.

@mlissner (Member)

> it seems that the root cause of this issue is that Elasticsearch is taking too long to complete the re-indexing process:

You might have missed it, Alberto, but we had some issues with the indexes while you were out, and they were a bit degraded. It's possible that could be part of the cause here. If so, hopefully that's not something we'd need to deal with all the time.

Why was it indexing 3M objects? Isn't that kind of weird? I thought it'd be more like 100k/day?

@albertisfu (Contributor)

> Why was it indexing 3M objects? Isn't that kind of weird? I thought it'd be more like 100k/day?

Yes, the reason is that even if we receive around 100K dockets or RDs per day, we also need to re-index related documents to match cross-object alerts that could involve the docket and any of its RECAPDocuments.

So we re-index:

  • Dockets added/modified in the day
  • RECAPDocuments with parents added/modified in the day
  • RECAPDocuments added/modified in the day
  • Dockets that are parents of RECAPDocuments added/modified in the day

I believe the query that causes the number of documents to be re-indexed to grow significantly is the one for RECAPDocuments whose parents were added/modified on the same day.

In this case, it might be matching dockets with thousands of documents that also need to be indexed.
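
A hypothetical ORM sketch of that fan-out (model and field names are assumptions about the schema, not the sweep command's exact queries):

from datetime import date

from cl.search.models import Docket, RECAPDocument  # assumed import path

query_date = date(2024, 11, 13)  # illustrative date
modified_dockets = Docket.objects.filter(date_modified__date=query_date)
# Every RECAPDocument under a modified docket is pulled in, so a single busy
# docket can contribute thousands of documents to the re-index.
related_rds = RECAPDocument.objects.filter(
    docket_entry__docket__in=modified_dockets
)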

I wanted to replicate the queries using the dates shown in the logs to confirm those numbers, but unfortunately, the URL that Ramiro shared to access Kibana or the Elasticsearch cluster endpoint is not working due to timeout issues.

@albertisfu (Contributor) left a comment

This looks good. Thanks @ERosendo

I think we can merge it and test how it performs with these tweaks. This way, we’ll be able to measure how long the command takes to complete.

@mlissner as Eduardo mentioned, we’ll also need to clean up the Redis keys described above.

Additionally, it would be a good idea to check all related reindex tasks in Elasticsearch to gather their details. This data would help us understand how many documents these tasks have targeted over the last few days. Then we should cancel any unnecessary tasks so that only one task runs when the command executes.

To get the details:

GET _tasks?detailed=true&actions=*reindex

This will show a list of tasks with details, including their IDs. If you like, you could share this list so we can decide which tasks to cancel.
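
And, for reference, any task from that list can then be stopped with the task cancel API (replace <task_id> with an ID from the output above):

POST _tasks/<task_id>/_cancel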

From the screenshot you shared, I noticed multiple instances are running the command. The issue here is that when the cron job triggers, it starts a new command while previous instances are still running due to the long wait times. The idea was for only one instance of the command to run at a time; otherwise, alerts could be duplicated.

With the tweaks in this PR, we’ll see if the command can complete in less than a day. But to prevent problems, it might also be worth including an additional Redis key to indicate whether another instance of the command is already running. This way, if another instance is invoked, it can terminate early to avoid duplicate alerts.
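
A rough sketch of that guard, reusing the project's Redis helper (the key name and TTL are assumptions, not an agreed design):

import sys

from cl.lib.redis_utils import get_redis_interface

r = get_redis_interface("CACHE")
# SET with nx/ex only succeeds if no other instance already holds the key; the
# TTL keeps a crashed run from leaving the lock behind forever.
acquired = r.set("alert_sweep:command_running", "1", nx=True, ex=60 * 60 * 24)
if not acquired:
    print("Another instance of the command is already running; exiting early.")
    sys.exit(0)
# ... run the sweep, then release the lock:
r.delete("alert_sweep:command_running")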

However, if the command continues to take several hours to complete, we could consider using slicing to parallelize the reindexing process and speed it up.
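
For illustration, the Reindex API supports sliced, parallel execution via the slices parameter (index names here are placeholders for the command's real source and sweep indices):

POST _reindex?slices=auto&wait_for_completion=false
{
  "source": { "index": "<source-index>" },
  "dest": { "index": "<sweep-index>" }
}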

@mlissner (Member)

> With the tweaks in this PR, we’ll see if the command can complete in less than a day. But to prevent problems, it might also be worth including an additional Redis key to indicate whether another instance of the command is already running. This way, if another instance is invoked, it can terminate early to avoid duplicate alerts.

Seems like a good idea. I guess we can monitor and see if we need this.

Merging, thank you!

@mlissner merged commit 174aa8c into main on Nov 14, 2024
15 checks passed
@mlissner deleted the 4646-refine-send-recap-alerts-command branch on November 14, 2024 at 23:51
@mlissner (Member)

Ran the Redis commands. We should get a deployment shortly. Do we just need to check on the cronjob tomorrow, then, I assume?

@albertisfu (Contributor)

> Do we just need to check on the cronjob tomorrow, then, I assume?

Yes. We could also re-run the ES query to confirm that the reindexing task is running:

GET _tasks?detailed=true&actions=*reindex
