
Optimize case queries in dump_domain_data #34010

Open · gherceg wants to merge 6 commits into master

Conversation

@gherceg (Contributor) commented Jan 22, 2024

Product Description

Technical Summary

This modifies the SQL dump code to support filtering models by case ID where appropriate.

There are a few models that have a foreign key to CommCareCase but no domain attribute, so the only way to determine whether they belong to a specific domain is to inspect the case's domain attribute. Taking CaseTransaction as an example, that requires iterating through the entire table to check whether each transaction's case is part of the specified domain. This process is slow, and becomes even slower once we run out of burst credits on our AWS instances and are limited to lower IOPS for disk reads.

If we instead fetch all of the case IDs associated with a domain and use that list to filter the target table (CaseTransaction), we don't need to iterate through the entire table, and we can take advantage of cases being indexed by domain.
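As a rough sketch of the difference (illustrative only; the helper name and chunk size are made up for this example, though CommCareCase and CaseTransaction are the real models in corehq.form_processor.models):

    from itertools import islice

    from corehq.form_processor.models import CaseTransaction, CommCareCase

    def iter_case_transactions(domain, db_alias, chunk_size=500):
        # Old approach: join through the foreign key, which the planner may
        # satisfy with a sequential scan of the whole CaseTransaction table:
        #   CaseTransaction.objects.using(db_alias).filter(case__domain=domain)
        #
        # New approach: cases are indexed by domain, so collecting their IDs
        # is cheap, and each chunk becomes an indexed lookup.
        case_ids = (
            CommCareCase.objects.using(db_alias)
            .filter(domain=domain)
            .values_list('case_id', flat=True)
            .iterator()
        )
        while True:
            chunk = list(islice(case_ids, chunk_size))
            if not chunk:
                break
            yield from CaseTransaction.objects.using(db_alias).filter(case_id__in=chunk)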

Concerns

Ensure this includes all of the objects that would have been included when filtering via the foreign key relationship. I'm pretty confident it does, as long as we are fetching all case IDs, but it's something for reviewers to keep in mind as well.

Feature Flag

Safety Assurance

Safety story

I ran dump_domain_data locally to ensure it still works. This is a strictly developer-facing area of code, so changes here aren't especially risky, though they do have implications for the dumped data generated by dump_domain_data.

Automated test coverage

I wrote a small test to ensure CaseIDFilter behaves as expected, but it does feel like we could use more test coverage here generally.

QA Plan

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

This enables querying a specific partition if necessary/useful.
For databases with lots of cases, it can be time-consuming to fetch all
models for a domain that rely on their foreign key relationship with a
case, for example CaseTransaction.objects.filter(case__domain=domain).
This requires iterating through all transactions to inspect the case in
the relationship.

Instead, we can fetch all case IDs for a domain with an inexpensive
query since cases are indexed by domain, and use those IDs to fetch the
related models, without ever needing to iterate over the entire table.
@gherceg marked this pull request as ready for review January 22, 2024 23:52
@dannyroberts (Member) left a comment

Just to clarify (and from the code I think this is true): you aren't pulling all cases for the domain from a given shard into memory at once anywhere here, right? You're paginating through all case_ids from the domain in each shard, and then for each "chunk" of case_ids, making a query to get the related model (CaseTransaction, etc.)?

If so, very nicely done!

@millerdev (Contributor) left a comment

Approved for urgency, but I'd like to see the count suggestion implemented in a follow-up PR if not this one.

@@ -21,7 +24,7 @@ class SimpleFilter(DomainFilter):
     def __init__(self, filter_kwarg):
         self.filter_kwarg = filter_kwarg

-    def get_filters(self, domain_name):
+    def get_filters(self, domain_name, db_alias):
Contributor:

Why does db_alias not have a default here while it does for some of the other filters?

Comment on lines 102 to 103:

    active_case_count = len(CommCareCase.objects.get_case_ids_in_domain(domain_name))
    deleted_case_count = len(CommCareCase.objects.get_deleted_case_ids_in_domain(domain_name))
Contributor:

This loads all of the domain's case ids into memory, which I think we want to avoid. It would be better to use a .count() query per shard. This could be implemented as CommCareCaseManager.count_cases_in_domain(domain_name, include_deleted=True).

The new manager method could simply raise NotImplementedError if include_deleted is false since there is no use case for that branch at this time.
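Something along these lines, for illustration (a sketch of the suggestion, not a prescribed implementation; the shard-alias helper and the manager wiring are assumptions):

    from django.db import models

    from corehq.sql_db.util import get_db_aliases_for_partitioned_query

    class CommCareCaseManager(models.Manager):
        def count_cases_in_domain(self, domain_name, include_deleted=False):
            if not include_deleted:
                # No current use case for live-only counts, per the note above.
                raise NotImplementedError
            # One cheap .count() per shard instead of materializing all IDs.
            return sum(
                self.using(db).filter(domain=domain_name).count()
                for db in get_db_aliases_for_partitioned_query()
            )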

Contributor:

If this is just for progress reporting, there is also the corehq.sql_db.util.estimate_row_count function, which uses the query plan to get an estimated count.
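For a sense of how a plan-based estimate works, a rough sketch built on Django's QuerySet.explain() (the actual estimate_row_count implementation may differ):

    import re

    def rough_row_estimate(queryset):
        # EXPLAIN output embeds the planner's row estimate, e.g.
        # "Seq Scan on form_processor_casetransaction (cost=... rows=12345 ...)"
        plan = queryset.explain()
        match = re.search(r'rows=(\d+)', plan)
        return int(match.group(1)) if match else 0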

Contributor (author):

Yeah, this was lazy on my part. Looking closer at where this is used, the Builder object references it here, but I don't see where the builder object's count method is called, and if I set a breakpoint in that method and run dump_domain_data, it doesn't get hit. So I'm tempted to just remove this count method altogether, but I can dig a bit more to see if we made an intentional change to the StatsCounter at some point instead.

Contributor (author):

So the count method was introduced after the StatsCounter object, and it looks like it was added in the context of ICDS (#28895). Is it possible this count code isn't currently applicable to dump_domain_data because only custom ICDS code took advantage of it?

Contributor (author):

Ahh, it is used in print_domain_stats.

Contributor (author):

So the count should return the count of objects the CaseIDFilter is set up for, not the count of cases. This made it a bit trickier, but I updated the filter in 6696b9b to handle this and added tests to verify that behavior.

@snopoke (Contributor) commented Jan 23, 2024

it requires iterating through the entire table to check whether a transaction's case is part of the specified domain or not

Can you explain why it requires iterating through the entire table? Is that how the join query gets executed?

I'm a bit concerned that loading ALL the case IDs for a large domain is going to be problematic as well.

Comment on lines 35 to 38:

     FilteredModelIteratorBuilder('form_processor.CaseAttachment', CaseIDFilter()),
     FilteredModelIteratorBuilder('form_processor.CaseTransaction', CaseIDFilter()),
     FilteredModelIteratorBuilder('form_processor.LedgerValue', SimpleFilter('domain')),
-    FilteredModelIteratorBuilder('form_processor.LedgerTransaction', SimpleFilter('case__domain')),
+    FilteredModelIteratorBuilder('form_processor.LedgerTransaction', CaseIDFilter()),
Contributor:

Is there caching in CaseIDFilter or will it re-fetch all the case IDs every time?

Contributor (author):

No caching at the moment. On my first attempt I did cache the list of all case IDs, but since I wanted to use pagination and a generator instead, I put that on the back burner (caching that result seemed a bit trickier). It certainly would be useful to cache, though, since this filter is used in multiple places.
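One possible shape for that caching, purely as a sketch of the trade-off being discussed (not the PR's code; it gives back the memory cost of holding all IDs in exchange for avoiding re-fetching):

    from functools import lru_cache

    from corehq.form_processor.models import CommCareCase

    @lru_cache(maxsize=8)
    def get_case_ids(domain, db_alias):
        # Materializes the full ID list, which is exactly the memory cost the
        # generator approach avoids; caching a generator would need more care.
        return tuple(
            CommCareCase.objects.using(db_alias)
            .filter(domain=domain)
            .values_list('case_id', flat=True)
        )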

@gherceg (author) commented Jan 24, 2024

Can you explain why it requires iterating through the entire table? Is that how the join query gets executed?

Ah, I didn't have a complete understanding of the existing query. I looked closer at it with Cal's help today, and came to the understanding that the change here only helps smaller domains. This is because the query planner assumes a domain will have ~17,000 cases in a shard, and chooses a plan that is not optimized for domains with significantly fewer cases. For the query CaseTransaction.objects.using('p1').filter(case__domain=domain), Postgres decides it is more efficient to do a sequential scan of the CaseTransaction table than it is to gather the case IDs via the domain index on CommCareCase and do individual lookups on CaseTransaction for each case ID.

The numbers are arbitrary, but this change effectively broke up one query that took 500 seconds into 500 queries that took 1 second each. And realistically, performance for this change is worse on larger domains, meaning Postgres was right to do a sequential scan on the CaseTransaction table.

Additionally, even when testing this on my personal domain, which has 7 cases total, the data dump lingered on CaseTransaction much longer than I expected, which was not what I observed when testing in a Django shell (even factoring in Postgres caching speeding things up).
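For anyone who wants to see the planner's choice directly, a quick check from a Django shell (a sketch; the exact plan text depends on Postgres version and table statistics, and the domain name here is hypothetical):

    from corehq.form_processor.models import CaseTransaction

    domain = 'example-domain'  # hypothetical domain name
    qs = CaseTransaction.objects.using('p1').filter(case__domain=domain)
    # Expect something like "Seq Scan on form_processor_casetransaction ..."
    # when the planner opts for the sequential scan described above.
    print(qs.explain())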

@gherceg added the "Open for review: do not merge" label Jan 25, 2024
@gherceg requested a review from esoergel as a code owner January 26, 2024 03:32
corehq/apps/dump_reload/sql/filters.py (outdated review thread, resolved)
corehq/util/queries.py (outdated review thread, resolved)

-        with self.assertNumQueries(4):
+        with self.assertNumQueries(3):
Contributor (author):

@esoergel given this behavior was defined in tests as well, I wanted to clarify whether it was intentional. Basically, is there a concern that breaking the pagination loop when the number of docs returned for a page is less than the limit could lead to prematurely exiting pagination? This maps to the addition of:

        if doc_count < limit:
            break

in queryset_to_iterator.
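For context, a stripped-down sketch of the loop in question (the real queryset_to_iterator paginates by configurable keys; this only shows where the early break sits):

    def queryset_to_iterator(queryset, limit=500):
        # Simplified keyset pagination by primary key.
        last_pk = None
        while True:
            page = queryset.order_by('pk')
            if last_pk is not None:
                page = page.filter(pk__gt=last_pk)
            docs = list(page[:limit])
            yield from docs
            if len(docs) < limit:
                # A short (or empty) page means the final query already saw
                # the end of the dataset, so no extra query is needed.
                break
            last_pk = docs[-1].pk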

Contributor (author):

(Also, I assumed you added this based on commit history, but I admittedly did not look at it very hard, so if you don't have context, totally fine.)

Contributor:

Yeah, I think I added this, and that does seem reasonable; not sure why I didn't do it in the first place. Reviewing this now, my initial thought was that the dataset can change while the query is executing, which could cause weirdness, but the limit is part of that last query, so the number of results returned should be a valid data point for determining whether any remain.

- Update comments to reflect order of keys/fields matters in paginate_by
- Clean up queryset_to_iterator code that checks if doc count is less
  than limit
@gherceg force-pushed the gh/data-dump/optimize-case-queries branch from 29d55f7 to 6696b9b January 26, 2024 19:48
@gherceg (author) commented Jan 26, 2024

I amended the most recent commit to fix a minor test issue. Apologies for the force push.

This has evolved a bit from my first understanding when making this PR. The performance issues with CaseTransaction.objects.using(db).filter(case__domain=domain) stem from pagination. Ignoring limits on how many results can be returned at once, this query on its own performs similarly to, if not better than, paginating over case IDs and fetching a group of case transactions per chunk (roughly 2 minutes on larger domains). However, since we have to paginate transactions (we can only load so many into memory), we end up needing to run the same 2-minute query N times, where N is the number of cases divided by the page size. Using realistic numbers, 1 million cases / 500 per page = 2,000 queries, and at roughly 2 minutes each that comes to well over an hour to fetch case transactions for one of five shards, so at least 5 hours to fetch all case transactions.

The new strategy allows for more efficient pagination. Since we are comparing the provided list of case_ids against the case index on the CaseTransaction table, we can scan that index in order when paginating, enabling the query planner to jump to the exact spot in the index it should start from, based on (case_id__gte=case_id, pk__gt=pk). Since there can be many transactions for a single case_id, this is more useful than paginating by pk alone.
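In Django terms, the resume condition looks roughly like this (a sketch only; the function name is made up, and field names follow the discussion above):

    from django.db.models import Q

    def next_transaction_page(qs, last_case_id, last_pk, limit=500):
        page = qs.order_by('case_id', 'pk')
        if last_case_id is not None:
            # Resume at the exact (case_id, pk) position in the index rather
            # than paginating by pk alone, since one case can have many
            # transactions.
            page = page.filter(
                Q(case_id__gt=last_case_id)
                | Q(case_id=last_case_id, pk__gt=last_pk)
            )
        return list(page[:limit])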
