-
-
Notifications
You must be signed in to change notification settings - Fork 218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set max limit on mobile report size during restores #34704
Conversation
97420f6
to
84816f0
Compare
@@ -505,6 +506,8 @@ def get_report_element( | |||
row_index = 0 | |||
rows = report_data_cache.get_data(report_config.uuid, data_source) | |||
for row_index, row in enumerate(rows): | |||
if row_index + 1 >= settings.MAX_MOBILE_UCR_SIZE: | |||
raise MobileUCRTooLargeException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean for a restore that hits it? Does it mean the whole thing fails, or will it be caught somewhere and, for example, exclude this UCR from the restore but include everything else? Is the error communicated to the end user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I wanted to take another look at this. Currently, the entire restore fails and the user is unaware of why. I think it is fine to fail the entire restore since that is essentially the same behavior they would have if they had too many mobile reports included in their restore, but it would be nice to be able to communicate the reason for the failure back to them.
That being said, we have talked to solutions and came to an agreement that SaaS will implement a minimum necessary change to protect our infrastructure, and then solutions can own making this a better user experience. I'll take another pass at trying to communicate the error to the user, and update here with what I find.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok actually I misspoke. This only fails hard locally when running in debug mode. Typically, we silently fail on issues like this to avoid blocking restores entirely. The other question is who should we be notifying of a restore failure like this? The mobile worker likely isn't able to act on this error, so really we want to be notifying the app builder/project manager since it is actionable for them. I'm mainly thinking out loud, but am no longer optimistic that I will be able to add anything in this PR to make it clear to the user why their mobile reports may not be completely synced (if a report is too large).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amending my comment as of the most recent commit. In this case, we want to fail hard so I've created a new exception that can be used to bypass the fail_hard
flag in restores. This means that whenever this exception is raised, the restore will fail. Currently it still won't be obvious to the user why the restore is failing, but this is intended to protect our infrastructure first and foremost.
I looked more into making the error message less ambiguous and I think it is a straightforward change, but might require additional QA for mobile to ensure mobile is handling the error response correctly as well. To prevent slowing down this change from being deployed, I'll plan to do that separately. |
Going to rebase this off of #34795 since that has logic I'd like to use in this PR. |
I wasn't sure if report_data_cache returned a list or generator, but upon further inspection, it is a list, so calling len(rows) is safe.
40e6d18
to
1d2170d
Compare
1d2170d
to
a88b031
Compare
More technically correct, but wouldn't actually make a difference with the current implementation, since even if current_row_count is 0, the condition won't evaluate to True.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving, but I'd like to see the default changed so the type check can be removed, which would simplify the code.
@@ -489,11 +489,14 @@ def _format_last_sync_time(restore_user, sync_time=None): | |||
return ServerTime(sync_time).user_time(timezone).done().isoformat() | |||
|
|||
|
|||
def generate_rows_and_filters(report_data_cache, report_config, restore_user, row_to_element): | |||
def generate_rows_and_filters( | |||
report_data_cache, report_config, restore_user, row_to_element, current_row_count=None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to change the current_row_count
default to 0
(zero) rather than None
?
@@ -553,6 +565,11 @@ def get_report_element( | |||
f"Report {report_config.report_id} row count {len(rows)} exceeds max allowed row count " | |||
f"{settings.MAX_MOBILE_UCR_SIZE}" | |||
) | |||
if current_row_count and len(rows) + current_row_count > settings.MAX_MOBILE_UCR_SIZE * 2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This if
block will not execute if current_row_count == 0
, which I think means a restore would be attempted even if len(rows) > settings.MAX_MOBILE_UCR_SIZE * 2
.
Edit: this is addressed in 428d5c4. However, I still think it would be better to set the default of current_row_count=0
. That would allow this initial condition to be removed.
Product Description
Technical Summary
https://dimagi.atlassian.net/browse/SAAS-15544
We currently have no upper bound on the size of mobile UCRs included in a restore. This has caused memory issues on machines attempting to serve restore requests, since we have to load the entire report into memory before writing it to a file.
This change implements a max size of 250,000 rows, which is based on existing mobile UCR sizes (see the ticket for details). We've seen confirmed performance issues with mobile reports with 1 million rows, and the next largest mobile report on prod is ~210,000 rows. After that, there is a steep drop off where every report is below 100,000 rows. I arbitrarily picked 250,000 rows as a max limit based on this information, but we are also in the process of rolling out better metrics around mobile ucr sizes in restores, and can tune this number based on that.
This PR changes the report provider code to raise an exception if attempting to build a mobile report with too many rows (over 250k), or if attempting to build mobile reports on V1 that sum up to be greater than 500k (250k * 2). This is again arbitrary, but designed to provide a buffer for V1 users. Once this error is more prevalent to the user and not just in the restore xml, it should also incentivize upgrading to V2.
Feature Flag
MOBILE_UCR
Safety Assurance
Safety story
Tested locally and on staging.
Automated test coverage
QA Plan
Will request
Rollback instructions
Labels & Review