
Optimize case queries in dump_domain_data #34010

Open · gherceg wants to merge 6 commits into master

Conversation

@gherceg (Contributor) commented Jan 22, 2024

Product Description

Technical Summary

This modifies the SQL dump code to support filtering models by case ID where appropriate.

There are a few models that have a foreign key to CommCareCase but no domain attribute, so the only way to determine whether they belong to a specific domain is to inspect the case's domain attribute. Taking CaseTransaction as an example, that requires iterating through the entire table to check whether each transaction's case is part of the specified domain. This process is slow, and becomes even slower once we run out of burst credits on our AWS instances and are limited to lower IOPS for disk reads.

If we instead fetch all of the case IDs associated with a domain and use that list to filter the target table (CaseTransaction), we don't need to iterate through the entire table, and we can take advantage of cases being indexed by domain.
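As a rough sketch of the difference (illustrative only; the helper name and chunk size are made up for this example, though CommCareCase and CaseTransaction are the real models in corehq.form_processor.models):

    from itertools import islice

    from corehq.form_processor.models import CaseTransaction, CommCareCase

    def iter_case_transactions(domain, db_alias, chunk_size=500):
        # Old approach: join through the foreign key, which the planner may
        # satisfy with a sequential scan of the whole CaseTransaction table:
        #   CaseTransaction.objects.using(db_alias).filter(case__domain=domain)
        #
        # New approach: cases are indexed by domain, so collecting their IDs
        # is cheap, and each chunk becomes an indexed lookup.
        case_ids = (
            CommCareCase.objects.using(db_alias)
            .filter(domain=domain)
            .values_list('case_id', flat=True)
            .iterator()
        )
        while True:
            chunk = list(islice(case_ids, chunk_size))
            if not chunk:
                break
            yield from CaseTransaction.objects.using(db_alias).filter(case_id__in=chunk)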

Concerns

Ensure this includes all of the objects that would have been included when filtering via the foreign key relationship. I'm pretty confident it does, as long as we are fetching all case IDs, but it's something for reviewers to keep in mind as well.

Feature Flag

Safety Assurance

Safety story

I ran dump_domain_data locally to ensure it still works. This is a strictly developer-facing area of code, so changes here aren't especially risky, though they do have implications for the dumped data generated by dump_domain_data.

Automated test coverage

I wrote a small test to ensure CaseIDFilter behaves as expected, but it does feel like we could use more test coverage here generally.

QA Plan

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

This enables querying a specific partition if necessary/useful.
For databases with lots of cases, it can be time-consuming to fetch all
models for a domain that rely on their foreign key relationship with a
case, for example CaseTransaction.objects.filter(case__domain=domain).
This requires iterating through all transactions to inspect the case in
the relationship.

Instead, we can fetch all case IDs for a domain with an inexpensive
query since cases are indexed by domain, and use those IDs to fetch the
related models, without ever needing to iterate over the entire table.
@gherceg marked this pull request as ready for review January 22, 2024 23:52
@dannyroberts (Member) left a comment

Just to clarify (and from the code I think this is true): you aren't pulling all cases for the domain from a given shard into memory at once anywhere here, right? You're paginating through all case_ids from the domain in each shard, and then for each "chunk" of case_ids, making a query to get the related model (CaseTransaction, etc.)?

If so, very nicely done!

@millerdev (Contributor) left a comment

Approved for urgency, but I'd like to see the count suggestion implemented in a follow-up PR if not this one.

@@ -21,7 +24,7 @@ class SimpleFilter(DomainFilter):
     def __init__(self, filter_kwarg):
         self.filter_kwarg = filter_kwarg

-    def get_filters(self, domain_name):
+    def get_filters(self, domain_name, db_alias):
Contributor:

Why does db_alias not have a default here while it does for some of the other filters?

Comment on lines 102 to 103:

    active_case_count = len(CommCareCase.objects.get_case_ids_in_domain(domain_name))
    deleted_case_count = len(CommCareCase.objects.get_deleted_case_ids_in_domain(domain_name))
Contributor:

This loads all of the domain's case ids into memory, which I think we want to avoid. It would be better to use a .count() query per shard. This could be implemented as CommCareCaseManager.count_cases_in_domain(domain_name, include_deleted=True).

The new manager method could simply raise NotImplementedError if include_deleted is false since there is no use case for that branch at this time.
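Something along these lines, for illustration (a sketch of the suggestion, not a prescribed implementation; the shard-alias helper and the manager wiring are assumptions):

    from django.db import models

    from corehq.sql_db.util import get_db_aliases_for_partitioned_query

    class CommCareCaseManager(models.Manager):
        def count_cases_in_domain(self, domain_name, include_deleted=False):
            if not include_deleted:
                # No current use case for live-only counts, per the note above.
                raise NotImplementedError
            # One cheap .count() per shard instead of materializing all IDs.
            return sum(
                self.using(db).filter(domain=domain_name).count()
                for db in get_db_aliases_for_partitioned_query()
            )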

Contributor:

If this is just for progress reporting, there is also the corehq.sql_db.util.estimate_row_count function, which uses the query plan to get an estimated count.
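For a sense of how a plan-based estimate works, a rough sketch built on Django's QuerySet.explain() (the actual estimate_row_count implementation may differ):

    import re

    def rough_row_estimate(queryset):
        # EXPLAIN output embeds the planner's row estimate, e.g.
        # "Seq Scan on form_processor_casetransaction (cost=... rows=12345 ...)"
        plan = queryset.explain()
        match = re.search(r'rows=(\d+)', plan)
        return int(match.group(1)) if match else 0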

Contributor (author):

Yeah, this was lazy on my part. Looking closer at where this is used, the Builder object references it here, but I don't see where the builder object's count method is called, and if I set a breakpoint in that method and run dump_domain_data, it doesn't get hit. So I'm tempted to just remove this count method altogether, but I can dig a bit more to see if we made an intentional change to the StatsCounter at some point instead.

Contributor (author):

So the count method was introduced after the StatsCounter object, and it looks like it was added in the context of ICDS (#28895). Is it possible this count code isn't currently applicable to dump_domain_data because only custom ICDS code took advantage of it?

Contributor (author):

Ahh, it is used in print_domain_stats.

Contributor (author):

So the count should return the count of objects the CaseIDFilter is set up for, not the count of cases. This made it a bit trickier, but I updated the filter in 6696b9b to handle this and added tests to verify that behavior.

@snopoke (Contributor) commented Jan 23, 2024

it requires iterating through the entire table to check whether a transaction's case is part of the specified domain or not

Can you explain why it requires iterating through the entire table? Is that how the join query gets executed?

I'm a bit concerned that loading ALL the case IDs for a large domain is going to be problematic as well.

Comment on lines 35 to 38:

     FilteredModelIteratorBuilder('form_processor.CaseAttachment', CaseIDFilter()),
     FilteredModelIteratorBuilder('form_processor.CaseTransaction', CaseIDFilter()),
     FilteredModelIteratorBuilder('form_processor.LedgerValue', SimpleFilter('domain')),
-    FilteredModelIteratorBuilder('form_processor.LedgerTransaction', SimpleFilter('case__domain')),
+    FilteredModelIteratorBuilder('form_processor.LedgerTransaction', CaseIDFilter()),
Contributor:

Is there caching in CaseIDFilter or will it re-fetch all the case IDs every time?

Contributor (author):

No caching at the moment. On my first attempt I did cache the list of all case IDs, but since I wanted to use pagination and a generator instead, I put that on the back burner (caching that result seemed a bit trickier). It certainly would be useful to cache, though, since this filter is used in multiple places.
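One possible shape for that caching, purely as a sketch of the trade-off being discussed (not the PR's code; it gives back the memory cost of holding all IDs in exchange for avoiding re-fetching):

    from functools import lru_cache

    from corehq.form_processor.models import CommCareCase

    @lru_cache(maxsize=8)
    def get_case_ids(domain, db_alias):
        # Materializes the full ID list, which is exactly the memory cost the
        # generator approach avoids; caching a generator would need more care.
        return tuple(
            CommCareCase.objects.using(db_alias)
            .filter(domain=domain)
            .values_list('case_id', flat=True)
        )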

@gherceg (author) commented Jan 24, 2024

Can you explain why it requires iterating through the entire table? Is that how the join query gets executed?

Ah, I didn't have a complete understanding of the existing query. I looked closer at it with Cal's help today, and came to the understanding that the change here only helps smaller domains. This is because the query planner assumes a domain will have ~17,000 cases in a shard, and chooses a plan that is not optimized for domains with significantly fewer cases. For the query CaseTransaction.objects.using('p1').filter(case__domain=domain), Postgres decides it is more efficient to do a sequential scan of the CaseTransaction table than it is to gather the case IDs via the domain index on CommCareCase and do individual lookups on CaseTransaction for each case ID.

The numbers are arbitrary, but this change effectively broke up one query that took 500 seconds into 500 queries that took 1 second each. And realistically, performance for this change is worse on larger domains, meaning Postgres was right to do a sequential scan on the CaseTransaction table.

Additionally, even when testing this on my personal domain, which has 7 cases total, the data dump lingered on CaseTransaction much longer than I expected, which was not what I observed when testing in a Django shell (even factoring in Postgres caching speeding things up).
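For anyone who wants to see the planner's choice directly, a quick check from a Django shell (a sketch; the exact plan text depends on Postgres version and table statistics, and the domain name here is hypothetical):

    from corehq.form_processor.models import CaseTransaction

    domain = 'example-domain'  # hypothetical domain name
    qs = CaseTransaction.objects.using('p1').filter(case__domain=domain)
    # Expect something like "Seq Scan on form_processor_casetransaction ..."
    # when the planner opts for the sequential scan described above.
    print(qs.explain())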

@gherceg added the "Open for review: do not merge" label Jan 25, 2024
@gherceg requested a review from esoergel as a code owner January 26, 2024 03:32
corehq/apps/dump_reload/sql/filters.py (outdated review thread, resolved)
corehq/util/queries.py (outdated review thread, resolved)

-        with self.assertNumQueries(4):
+        with self.assertNumQueries(3):
Contributor (author):

@esoergel given this behavior was defined in tests as well, I wanted to clarify whether it was intentional. Basically, is there a concern that breaking the pagination loop when the number of docs returned for a page is less than the limit could lead to prematurely exiting pagination? This maps to the addition of:

        if doc_count < limit:
            break

in queryset_to_iterator.
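For context, a stripped-down sketch of the loop in question (the real queryset_to_iterator paginates by configurable keys; this only shows where the early break sits):

    def queryset_to_iterator(queryset, limit=500):
        # Simplified keyset pagination by primary key.
        last_pk = None
        while True:
            page = queryset.order_by('pk')
            if last_pk is not None:
                page = page.filter(pk__gt=last_pk)
            docs = list(page[:limit])
            yield from docs
            if len(docs) < limit:
                # A short (or empty) page means the final query already saw
                # the end of the dataset, so no extra query is needed.
                break
            last_pk = docs[-1].pk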

Contributor (author):

(Also, I assumed you added this based on commit history, but I admittedly did not look at it very hard, so if you don't have context, totally fine.)

Contributor:

Yeah, I think I added this, and that does seem reasonable; not sure why I didn't do it in the first place. Reviewing this now, my initial thought was that the dataset can change while the query is executing, which could cause weirdness, but the limit is part of that last query, so the number of results returned should be a valid data point for determining whether any remain.

- Update comments to reflect order of keys/fields matters in paginate_by
- Clean up queryset_to_iterator code that checks if doc count is less
  than limit
@gherceg force-pushed the gh/data-dump/optimize-case-queries branch from 29d55f7 to 6696b9b January 26, 2024 19:48
@gherceg (author) commented Jan 26, 2024

I amended the most recent commit to fix a minor test issue. Apologies for the force push.

This has evolved a bit from my first understanding when making this PR. The performance issues with CaseTransaction.objects.using(db).filter(case__domain=domain) stem from pagination. Ignoring limits on how many results can be returned at once, this query on its own performs similarly to, if not better than, paginating over case IDs and fetching a group of case transactions per chunk (roughly 2 minutes on larger domains). However, since we have to paginate transactions (we can only load so many into memory), we end up needing to run the same 2-minute query N times, where N is the number of cases divided by the page size. Using realistic numbers, 1 million cases / 500 per page = 2,000 queries, and at roughly 2 minutes each that comes to well over an hour to fetch case transactions for one of five shards, so at least 5 hours to fetch all case transactions.

The new strategy allows for more efficient pagination. Since we are comparing the provided list of case_ids against the case index on the CaseTransaction table, we can scan that index in order when paginating, enabling the query planner to jump to the exact spot in the index it should start from, based on (case_id__gte=case_id, pk__gt=pk). Since there can be many transactions for a single case_id, this is more useful than paginating by pk alone.
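In Django terms, the resume condition looks roughly like this (a sketch only; the function name is made up, and field names follow the discussion above):

    from django.db.models import Q

    def next_transaction_page(qs, last_case_id, last_pk, limit=500):
        page = qs.order_by('case_id', 'pk')
        if last_case_id is not None:
            # Resume at the exact (case_id, pk) position in the index rather
            # than paginating by pk alone, since one case can have many
            # transactions.
            page = page.filter(
                Q(case_id__gt=last_case_id)
                | Q(case_id=last_case_id, pk__gt=last_pk)
            )
        return list(page[:limit])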
