Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set max limit on mobile report size during restores #34704

Merged
merged 12 commits into from
Jun 20, 2024

Conversation

gherceg
Copy link
Contributor

@gherceg gherceg commented May 31, 2024

Product Description

image
Screenshot from 2024-06-19 23-06-07
image

Technical Summary

https://dimagi.atlassian.net/browse/SAAS-15544

We currently have no upper bound on the size of mobile UCRs included in a restore. This has caused memory issues on machines attempting to serve restore requests, since we have to load the entire report into memory before writing it to a file.

This change implements a max size of 250,000 rows, which is based on existing mobile UCR sizes (see the ticket for details). We've seen confirmed performance issues with mobile reports with 1 million rows, and the next largest mobile report on prod is ~210,000 rows. After that, there is a steep drop off where every report is below 100,000 rows. I arbitrarily picked 250,000 rows as a max limit based on this information, but we are also in the process of rolling out better metrics around mobile ucr sizes in restores, and can tune this number based on that.

This PR changes the report provider code to raise an exception if attempting to build a mobile report with too many rows (over 250k), or if attempting to build mobile reports on V1 that sum up to be greater than 500k (250k * 2). This is again arbitrary, but designed to provide a buffer for V1 users. Once this error is more prevalent to the user and not just in the restore xml, it should also incentivize upgrading to V2.

Feature Flag

MOBILE_UCR

Safety Assurance

Safety story

Tested locally and on staging.

Automated test coverage

QA Plan

Will request

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@gherceg gherceg requested a review from orangejenny as a code owner May 31, 2024 18:05
@gherceg gherceg force-pushed the gh/mobile-reports/limit-row-count branch from 97420f6 to 84816f0 Compare May 31, 2024 22:04
@@ -505,6 +506,8 @@ def get_report_element(
row_index = 0
rows = report_data_cache.get_data(report_config.uuid, data_source)
for row_index, row in enumerate(rows):
if row_index + 1 >= settings.MAX_MOBILE_UCR_SIZE:
raise MobileUCRTooLargeException
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean for a restore that hits it? Does it mean the whole thing fails, or will it be caught somewhere and, for example, exclude this UCR from the restore but include everything else? Is the error communicated to the end user?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I wanted to take another look at this. Currently, the entire restore fails and the user is unaware of why. I think it is fine to fail the entire restore since that is essentially the same behavior they would have if they had too many mobile reports included in their restore, but it would be nice to be able to communicate the reason for the failure back to them.

That being said, we have talked to solutions and came to an agreement that SaaS will implement a minimum necessary change to protect our infrastructure, and then solutions can own making this a better user experience. I'll take another pass at trying to communicate the error to the user, and update here with what I find.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok actually I misspoke. This only fails hard locally when running in debug mode. Typically, we silently fail on issues like this to avoid blocking restores entirely. The other question is who should we be notifying of a restore failure like this? The mobile worker likely isn't able to act on this error, so really we want to be notifying the app builder/project manager since it is actionable for them. I'm mainly thinking out loud, but am no longer optimistic that I will be able to add anything in this PR to make it clear to the user why their mobile reports may not be completely synced (if a report is too large).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amending my comment as of the most recent commit. In this case, we want to fail hard so I've created a new exception that can be used to bypass the fail_hard flag in restores. This means that whenever this exception is raised, the restore will fail. Currently it still won't be obvious to the user why the restore is failing, but this is intended to protect our infrastructure first and foremost.

@gherceg
Copy link
Contributor Author

gherceg commented Jun 18, 2024

I looked more into making the error message less ambiguous and I think it is a straightforward change, but might require additional QA for mobile to ensure mobile is handling the error response correctly as well. To prevent slowing down this change from being deployed, I'll plan to do that separately.

@gherceg
Copy link
Contributor Author

gherceg commented Jun 19, 2024

Going to rebase this off of #34795 since that has logic I'd like to use in this PR.

@gherceg gherceg force-pushed the gh/mobile-reports/limit-row-count branch from 40e6d18 to 1d2170d Compare June 20, 2024 03:22
@gherceg gherceg force-pushed the gh/mobile-reports/limit-row-count branch from 1d2170d to a88b031 Compare June 20, 2024 03:51
More technically correct, but wouldn't actually make a difference with
the current implementation, since even if current_row_count is 0, the
condition won't evaluate to True.
Copy link
Contributor

@millerdev millerdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving, but I'd like to see the default changed so the type check can be removed, which would simplify the code.

@@ -489,11 +489,14 @@ def _format_last_sync_time(restore_user, sync_time=None):
return ServerTime(sync_time).user_time(timezone).done().isoformat()


def generate_rows_and_filters(report_data_cache, report_config, restore_user, row_to_element):
def generate_rows_and_filters(
report_data_cache, report_config, restore_user, row_to_element, current_row_count=None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to change the current_row_count default to 0 (zero) rather than None?

@@ -553,6 +565,11 @@ def get_report_element(
f"Report {report_config.report_id} row count {len(rows)} exceeds max allowed row count "
f"{settings.MAX_MOBILE_UCR_SIZE}"
)
if current_row_count and len(rows) + current_row_count > settings.MAX_MOBILE_UCR_SIZE * 2:
Copy link
Contributor

@millerdev millerdev Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if block will not execute if current_row_count == 0, which I think means a restore would be attempted even if len(rows) > settings.MAX_MOBILE_UCR_SIZE * 2.

Edit: this is addressed in 428d5c4. However, I still think it would be better to set the default of current_row_count=0. That would allow this initial condition to be removed.

@millerdev millerdev merged commit 0560bc0 into master Jun 20, 2024
13 checks passed
@millerdev millerdev deleted the gh/mobile-reports/limit-row-count branch June 20, 2024 17:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants