Optimize case queries in dump_domain_data #34010
base: master
Conversation
This enables querying a specific partition if necessary/useful.
For databases with lots of cases, it can be time-consuming to fetch all models for a domain that rely on their foreign key relationship with a case. For example, CaseTransaction.objects.filter(case__domain=domain). This requires iterating through all transactions to inspect the case in the relationship. Instead, we can fetch all case IDs for a domain with an inexpensive query since cases are indexed by domain, and use those IDs to fetch the related models, without ever needing to iterate over the entire table.
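As a rough illustration of the strategy described above, here is a plain-Python stand-in (not the actual CommCare HQ code; the function and data shapes are hypothetical): collect the domain's case IDs first, then look up related rows through an index keyed on case_id, chunk by chunk, instead of scanning every transaction and following its foreign key.

```python
from collections import defaultdict

def fetch_domain_transactions(cases, transactions, domain, chunk_size=2):
    """Collect the domain's case IDs first (cheap, since cases are
    indexed by domain), then look up related rows via a case_id index
    in chunks, rather than scanning every transaction and checking
    its case's domain through the foreign key."""
    case_ids = [c["id"] for c in cases if c["domain"] == domain]

    # Stand-in for the case_id index on the related table.
    by_case = defaultdict(list)
    for txn in transactions:
        by_case[txn["case_id"]].append(txn)

    results = []
    for i in range(0, len(case_ids), chunk_size):
        for case_id in case_ids[i:i + chunk_size]:
            results.extend(by_case[case_id])
    return results

cases = [
    {"id": "c1", "domain": "a"},
    {"id": "c2", "domain": "b"},
    {"id": "c3", "domain": "a"},
]
transactions = [
    {"id": 1, "case_id": "c1"},
    {"id": 2, "case_id": "c2"},
    {"id": 3, "case_id": "c3"},
]
print(fetch_domain_transactions(cases, transactions, "a"))
```

In the real PR the chunks become `case_id__in` queries against the indexed CaseTransaction table, so the full transaction table is never scanned.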
Just to clarify (and from the code I think this is true): you aren't pulling all cases for the domain from a given shard into memory at once anywhere here, right? You're paginating through all case_ids from the domain in each shard, and then for each "chunk" of case_ids, making a query to get the related model (CaseTransaction, etc.)?
If so, very nicely done!
Approved for urgency, but I'd like to see the count suggestion implemented in a follow-up PR if not this one.
@@ -21,7 +24,7 @@ class SimpleFilter(DomainFilter):
     def __init__(self, filter_kwarg):
         self.filter_kwarg = filter_kwarg

-    def get_filters(self, domain_name):
+    def get_filters(self, domain_name, db_alias):
Why does db_alias not have a default here while it does for some of the other filters?
active_case_count = len(CommCareCase.objects.get_case_ids_in_domain(domain_name))
deleted_case_count = len(CommCareCase.objects.get_deleted_case_ids_in_domain(domain_name))
This loads all of the domain's case ids into memory, which I think we want to avoid. It would be better to use a .count() query per shard. This could be implemented as CommCareCaseManager.count_cases_in_domain(domain_name, include_deleted=True). The new manager method could simply raise NotImplementedError if include_deleted is false since there is no use case for that branch at this time.
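A minimal sketch of the suggested manager method, in plain Python rather than Django (the `shards` parameter here is a hypothetical stand-in for the per-shard querysets): sum one cheap count per shard so no case IDs are ever materialized in memory.

```python
def count_cases_in_domain(shards, domain_name, include_deleted=True):
    """Hypothetical sketch of the suggested count_cases_in_domain():
    one cheap count per shard, summed, with no case IDs loaded into
    memory. In Django this would be a .count() queryset per shard."""
    if not include_deleted:
        # No use case for this branch at this time.
        raise NotImplementedError
    return sum(
        sum(1 for case in shard if case["domain"] == domain_name)
        for shard in shards
    )

shards = [
    [{"domain": "a"}, {"domain": "b"}],
    [{"domain": "a"}, {"domain": "a"}],
]
print(count_cases_in_domain(shards, "a"))
```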
If this is just for progress, there is also the corehq.sql_db.util.estimate_row_count function which uses the query plan to get an estimated count.
Yeah, this was lazy on my part. Looking closer at where this is used, the Builder object references it here, but I don't see where the builder object's count method is called, and if I set a breakpoint in that method and run dump_domain_data it doesn't get hit. So I'm tempted to just remove this count method altogether, but I can dig a bit more to see if we made an intentional change to the StatsCounter at some point instead.
So the count method was introduced after the StatsCounter object, and it looks like it was in the context of ICDS #28895. Is it possible this count code isn't currently applicable to dump_domain_data because only custom ICDS code took advantage of it?
ahhh it is used in print_domain_stats
So the count should return the count of objects the CaseIDFilter is set up for, not the count of cases. This made it a bit trickier, but I updated the filter in 6696b9b to handle this and added tests to verify that behavior.
Can you explain why it requires iterating through the entire table? Is that how the join query gets executed? I'm a bit concerned that loading ALL the case IDs for a large domain is going to be problematic as well.
corehq/apps/dump_reload/sql/dump.py
     FilteredModelIteratorBuilder('form_processor.CaseAttachment', CaseIDFilter()),
     FilteredModelIteratorBuilder('form_processor.CaseTransaction', CaseIDFilter()),
     FilteredModelIteratorBuilder('form_processor.LedgerValue', SimpleFilter('domain')),
-    FilteredModelIteratorBuilder('form_processor.LedgerTransaction', SimpleFilter('case__domain')),
+    FilteredModelIteratorBuilder('form_processor.LedgerTransaction', CaseIDFilter()),
Is there caching in CaseIDFilter or will it re-fetch all the case IDs every time?
No caching at the moment. On my first attempt I did cache the list of all case ids, but since I wanted to use pagination and a generator instead, I put that on the backburner (seemed a bit trickier to cache that result). It certainly would be useful to cache though since this filter is used in multiple places.
Ah, I didn't have a complete understanding of the existing query. I looked closer at the existing query with Cal's help today, and came to the understanding that the change here only applies to smaller domains. This is because the query planner assumes a domain will have ~17,000 cases in a shard, and chooses a plan that is not optimized for domains with significantly fewer cases.

The numbers are arbitrary, but this change effectively broke up 1 query that took 500 seconds into 500 queries that took 1 second each. And realistically, performance for this change is worse on larger domains, meaning postgres was right in doing a sequential scan on the CaseTransaction table.

Additionally, even when testing this on my personal domain, which has 7 cases total, the data dump lingered on CaseTransaction much longer than I expected, which was not what I observed when testing this in a django shell (factoring in the potential for postgres caching to speed things up).
-        with self.assertNumQueries(4):
+        with self.assertNumQueries(3):
@esoergel given this behavior was defined in tests as well, I wanted to clarify whether it was intentional. Basically, is there a concern that breaking the pagination loop when the number of docs returned for a page is less than the limit could lead to prematurely exiting pagination? This maps to the addition of:
if doc_count < limit:
    break
in queryset_to_iterator.
(also I assumed you added this based on commit history, but I admittedly did not look at it very hard so if you don't have context, totally fine)
Yeah, I think I added this - that does seem reasonable; not sure why I didn't do that in the first place. Reviewing this now, my initial thought was that the dataset can change while the query is executing, which could cause weirdness, but the limit is part of that last query, so the number of results returned should be a valid data point in determining whether there are any remaining.
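The early-exit behavior being discussed can be sketched in isolation (a simplified stand-in using offset slicing over a list; the real queryset_to_iterator paginates a Django queryset by key values): when a page comes back shorter than the limit, the table has nothing more to return, so the final empty-page query can be skipped.

```python
def paginate(rows, limit):
    """Simplified sketch of the early-exit check added to
    queryset_to_iterator: stop as soon as a page comes back short,
    because the limit was part of that same query."""
    offset = 0
    while True:
        page = rows[offset:offset + limit]
        yield from page
        if len(page) < limit:
            # A short page means nothing remains, so the usual
            # trailing empty-page query is unnecessary.
            break
        offset += limit
```

When the row count is an exact multiple of the limit, the last full page is followed by one empty page, so the generator still terminates correctly in that edge case.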
- Update comments to reflect order of keys/fields matters in paginate_by
- Clean up queryset_to_iterator code that checks if doc count is less than limit
Force-pushed from 29d55f7 to 6696b9b
I amended the most recent commit to fix a minor test issue. Apologies for the force push.

This has evolved a bit from my first understanding when making this PR. The performance issues with the original approach led to a revised pagination strategy.

The new strategy allows for more efficient pagination. Since we are comparing the list of case_ids provided to the case index on the CaseTransaction table, we can scan that index in order when paginating, enabling the query planner to jump to the exact spot in the index it should start from based on (case_id__gte=case_id, pk__gt=pk). Since there can be many transactions for a single case_id, this is more useful than just paginating by pk.
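The resume-where-you-left-off idea above can be sketched with a strict (case_id, pk) keyset over a sorted list standing in for the index (a simplification of the actual `case_id__gte`/`pk__gt` filter; the function name and data shapes are hypothetical):

```python
from bisect import bisect_right

def paginate_by_case_id_pk(rows, limit):
    """Sketch of keyset pagination over (case_id, pk) tuples. The
    sorted list stands in for the ordered index; bisect_right stands
    in for the planner jumping straight past the last key seen,
    the way the (case_id__gte=..., pk__gt=...) filter does."""
    rows = sorted(rows)
    last_key = None
    while True:
        start = 0 if last_key is None else bisect_right(rows, last_key)
        page = rows[start:start + limit]
        if not page:
            break
        yield page
        last_key = page[-1]
```

Because many rows can share one case_id, paginating on the (case_id, pk) pair keeps pages well-defined where paginating by case_id alone would not.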
Product Description
Technical Summary
This modifies the SQL dump code to support filtering models by case ID where appropriate.
There are a few models that foreign key to CommCareCase and do not have a domain attribute, so the only way to determine if they are part of a specific domain is to inspect the case's domain attribute. Taking CaseTransaction as an example, it requires iterating through the entire table to check whether a transaction's case is part of the specified domain or not. This process is slow, and becomes even slower once we run out of burst credits on our AWS instances and are limited to slower IOPS for disk reads.
If we instead fetch all of the case ids associated with a domain, and use that list to filter the target table (CaseTransaction), we don't need to iterate through the entire table, and can take advantage of cases being indexed by domain.
Concerns
Ensure this includes all of the objects that would have been included if filtering via the foreign key relationship. I'm pretty sure this does as long as we are fetching all case ids, but it's something to keep in mind for reviewers as well.
Feature Flag
Safety Assurance
Safety story
I ran dump_domain_data locally to ensure it still works. This is a strictly developer-facing area of code, so it isn't too risky to make changes here, though it does have implications on the dumped data generated from dump_domain_data.
Automated test coverage
I wrote a small test to ensure CaseIDFilter behaves as expected, but it does feel like we could have more test coverage generally here.
QA Plan
Rollback instructions
Labels & Review