Remove user data from couch user model #34805

Merged
merged 11 commits into from
Jun 27, 2024

Conversation

Contributor

@esoergel esoergel commented Jun 21, 2024

Product Description

Technical Summary

This is cleanup from the migration to SQL that happened earlier in the year. The Couch `user_data` field is no longer being updated and isn't in use anywhere, so this PR removes it. It turns out that removing the field breaks a couple of things that accessed the raw user document; fixes for those are included.

Since the model change needs to go live before the migration is run, this PR includes a management command to be run post-deploy. #34798 will include a Django migration that calls that command for third-party environments.

Feature Flag

Safety Assurance

There were apparently a handful of places in tests that took issue with this field going away, so there is a chance there are other unknown, untested areas that reference this.

Safety story

Automated test coverage

QA Plan

Migrations

This has a model change and a management command, but the automatic migration will come in a second PR.

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

esoergel added 4 commits June 19, 2024 12:58
It is overridden anyways in dehydrate
This was previously checking that user data changed before calling
`add_changes`, and then checking again inside `add_changes`, but this
time against the couch field, which shouldn't be in-use anymore
@dimagimon dimagimon added reindex/migration Reindex or migration will be required during or before deploy Risk: High Change affects files that have been flagged as high risk. Risk: Medium Change affects files that have been flagged as medium risk. labels Jun 21, 2024
The migration will come later
@esoergel esoergel force-pushed the es/user-data-prep branch from 8da070c to 983b087 Compare June 21, 2024 13:52
The query only returns docs that _don't_ have that value set, then
fetches those from couch and attempts to get that value.  However, this
information is no longer stored in couch anyways.
@esoergel esoergel force-pushed the es/user-data-prep branch from 675b859 to d04391c Compare June 21, 2024 17:35
@@ -367,32 +366,28 @@ def get_paginated_cases_without_gps(self, domain, page, limit):

 def _get_paginated_users_without_gps(domain, page, limit, query):
     location_prop_name = get_geo_user_property(domain)
-    query = (
+    res = (
Contributor Author

@zandre-eng What do you think of this change?

Contributor

Apologies for the late response on this as I was away on PTO. The changes look good, and it's nice that we're able to handle pagination now directly in the ES query.

@esoergel esoergel changed the title Es/user data prep Remove user data from couch user model Jun 21, 2024
These have been broken since the migration to SQL, apparently
@jingcheng16 jingcheng16 self-requested a review June 24, 2024 18:04
@esoergel esoergel marked this pull request as ready for review June 24, 2024 20:47
@esoergel esoergel requested review from AddisonDunn and Jtang-1 June 24, 2024 20:47
@esoergel esoergel force-pushed the es/user-data-prep branch from 77ba5e1 to a181308 Compare June 24, 2024 21:25
Contributor

@Jtang-1 Jtang-1 left a comment

Reviewed. I didn't have any additional comments, but I don't have enough confidence in the context to approve.

@@ -174,20 +174,16 @@ def update_user_data(self, data, uncategorized_data, profile_name, profiles_by_n
        if user_data.profile_id and user_data.profile_id != old_profile_id:
            self.logger.add_info(UserChangeMessage.profile_info(user_data.profile_id, profile_name))

        if old_user_data != user_data.raw:
            self.logger.add_changes({'user_data': user_data.raw}, skip_confirmation=True)
Contributor

I understand we did that before too. But is logging the user data a good idea?

Contributor Author

Can you elaborate? This code logs all changes made to a user for auditing purposes.

Contributor

Maybe it is just that the field is named user_data and in actuality it does not contain anything sensitive. But I wonder if there is a chance that something we don't want logged ends up in there. I think it would not be clear to every developer changing what goes in user_data that it will also end up in the logs.

Contributor Author

true - though logger here actually isn't a normal python logger, it's an instance of UserChangeLogger, which saves details of changes to postgres to power an auditing report.

Contributor

Fair
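
To make the distinction in this thread concrete, here is a minimal sketch of an audit-style change logger, as opposed to a standard Python `logging` logger. `ChangeLoggerSketch`, its attributes, and the `audit_log` list are hypothetical stand-ins and not the real `UserChangeLogger`. The `skip_confirmation` flag is modeled on the call in the diff above, under the assumption (taken from the commit message) that it bypasses the logger's own comparison against the stale document field.

```python
class ChangeLoggerSketch:
    """Collects field changes for an audit record (hypothetical, not UserChangeLogger)."""

    def __init__(self, original_doc):
        self.original = dict(original_doc)
        self.changes = {}

    def add_changes(self, fields, skip_confirmation=False):
        for name, value in fields.items():
            # Assumption: without skip_confirmation, the logger re-checks the
            # value against the original document before recording it.
            if skip_confirmation or self.original.get(name) != value:
                self.changes[name] = value

    def save(self, sink):
        # Persist only when something was recorded; `sink` stands in for a table.
        if self.changes:
            sink.append(dict(self.changes))


# The copy of user_data on the document is stale, so comparing against it
# would wrongly conclude that nothing changed.
stale_doc = {'email': 'a@example.com', 'user_data': {'role': 'nurse'}}
audit_log = []

logger = ChangeLoggerSketch(stale_doc)
logger.add_changes({'email': 'a@example.com'})  # unchanged, so not recorded
logger.add_changes({'user_data': {'role': 'nurse'}}, skip_confirmation=True)
logger.save(audit_log)
```

In this sketch the caller has already decided that user data changed, which is why the second check inside `add_changes` is skipped rather than repeated against a field that is no longer maintained.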

@@ -126,3 +127,25 @@ def all_domains_with_migrations_in_progress():
 def reset_caches(domain, slug):
     any_migrations_in_progress(domain, strict=True)
     get_migration_status(domain, slug, strict=True)


+def once_off_migration(slug):
Contributor

I like this.

@@ -61,7 +61,7 @@ class Meta(CustomResourceMeta):

 class CommCareUserResource(UserResource):
     groups = fields.ListField(attribute='get_group_ids')
-    user_data = fields.DictField(attribute='user_data')
+    user_data = fields.DictField()
Contributor

Won't the API still be trying to update the user_data attribute on the CommCareUser model? Or is that somehow otherwise accounted for?

Contributor Author

Yep, good question. That's handled here:

def _update_user_data(user, new_user_data, user_change_logger):
    try:
        profile_id = new_user_data.pop(PROFILE_SLUG, ...)
        changed = user.get_user_data(user.domain).update(new_user_data, profile_id=profile_id)
    except UserDataError as e:
        raise UpdateUserException(str(e))
    if user_change_logger and changed:
        user_change_logger.add_changes({
            'user_data': user.get_user_data(user.domain).raw
        })

Contributor

Ah I see, obj_update should manually override the update for this field.

from corehq.apps.users.model_log import UserModelAction

if action in [UserModelAction.CREATE, UserModelAction.DELETE]:
    changed_details = couch_user.to_json()
    changed_details['user_data'] = couch_user.get_user_data(for_domain).raw if for_domain else {}
Contributor

Just checking my understanding: the logs have been missing the user data field for the last few months because user data has already been migrated?

Contributor Author

that's right 😬

.sort('created_on', desc=True)
.start((page - 1) * limit)
.size(limit)
.run()
Contributor

Man was this pulling all the users every function call when you could just pass the page directly to ES? 💀

Contributor Author

yep. And then fetching the users again from couch. It does the same thing in the function below this one too - I just didn't want to get carried away with cleanup.

Contributor

Nice, I see tests cover this
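
The `.start((page - 1) * limit).size(limit)` calls discussed in this thread amount to offset-based pagination with 1-indexed pages. A minimal sketch of that arithmetic follows; `page_window` and `paginate_in_memory` are hypothetical helpers written for illustration and are not part of the codebase.

```python
def page_window(page, limit):
    """Translate a 1-indexed page number into an (offset, size) pair."""
    if page < 1 or limit < 1:
        raise ValueError("page and limit must both be at least 1")
    return (page - 1) * limit, limit


def paginate_in_memory(items, page, limit):
    """Apply the same window to a list, mimicking what the search backend does."""
    start, size = page_window(page, limit)
    return items[start:start + size]


all_users = [f"user-{n}" for n in range(25)]
third_page = paginate_in_memory(all_users, page=3, limit=10)
```

Passing the window to the search backend means only `limit` results come back per call, instead of fetching every match and slicing afterwards.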

    return None
if related_doc_type == 'CommCareUser':
    doc['user_data'] = CommCareUser.wrap(doc).get_user_data(domain).to_dict()
Contributor

I guess user data just hasn't been part of UCRs since the migration?

Contributor

Do these expressions not apply to WebUsers too?

Contributor Author

> I guess user data just hasn't been part of UCRs since the migration?

That's right - this has been broken since the migration.

You can't do a related doc lookup on a web user, since they don't have a top level domain attribute. For the same reason, they also don't have a user_data attribute (since it varies from one domain to the next), though I guess we could inject one in this context.

Contributor Author

For what it's worth, this means of directly accessing the user_data from the raw JSON has been broken to some degree for years - user data is only supposed to be accessed via helper methods, as fields controlled by the profile aren't stored directly in the user document. That means that the profile wasn't accounted for in the API or in UCR. I just broke it further 😆
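
A rough illustration of why reading `user_data` straight from the raw JSON misses profile-controlled fields. `UserDataSketch` is a hypothetical stand-in, not the real user data class. The `raw` attribute and `to_dict()` method are named after the accessors that appear in this PR, but the merge shown here (profile values layered over the raw values) is an assumption made for the example.

```python
class UserDataSketch:
    """Hypothetical stand-in for the user data helper object."""

    def __init__(self, raw, profile_fields=None):
        self.raw = dict(raw)  # fields stored for the user directly
        self.profile_fields = dict(profile_fields or {})  # fields defined by the profile

    def to_dict(self):
        # Assumption: profile-controlled values are layered over the raw values.
        return {**self.raw, **self.profile_fields}


# The user document only ever held the directly stored fields.
user_doc = {'_id': 'u1', 'user_data': {'favorite_color': 'blue'}}
user_data = UserDataSketch(
    raw={'favorite_color': 'blue'},
    profile_fields={'role': 'nurse'},
)

from_raw_json = user_doc.get('user_data', {})
from_helper = user_data.to_dict()
```

Under these assumptions, `from_raw_json` lacks the `role` field entirely, which matches the point above that the profile was not accounted for by code reading the document directly.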

corehq/dbaccessors/couchapps/all_docs.py
             + get_doc_count_by_type(db, 'CommCareUser'))
    all_ids = chain(iter_all_doc_ids(db, 'WebUser'),
                    iter_all_doc_ids(db, 'CommCareUser'))
    iter_update(db, _update_user, with_progress_bar(all_ids, count), verbose=True)
Contributor

To check my understanding, is this removing the couch data by just unwrapping and re-wrapping the doc without the user_data field?

Contributor Author

iter_update is akin to map - it fetches documents by ID in chunks, applies _update_user to them, and saves the return value (also in chunks) if anything changed. So here _update_user defines the actual change to each document, which is user_doc.pop('user_data', ...) (.pop differs from .get in that it mutates the object).

Contributor

Ah, gotcha. I wasn't registering that pop is removing the value (I know that it does, just sort of slips my mind sometimes). That makes sense.
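
The map-like behavior described in this thread can be sketched with an in-memory dictionary standing in for the database. Everything here (`iter_update_sketch`, `remove_user_data`, `chunked`, the `store` dict, and the stats keys) is hypothetical and does not reproduce the real `iter_update` signature or return value. It only illustrates the fetch in chunks, apply, and save-if-changed flow, using `...` as the `pop` sentinel in the same way as the code quoted above.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def remove_user_data(doc):
    """Return the doc if it was modified, or None to signal 'nothing to save'."""
    # dict.pop removes the key (mutating the doc); dict.get would only read it.
    if doc.pop('user_data', ...) is not ...:
        return doc
    return None


def iter_update_sketch(store, update_fn, doc_ids, chunk_size=2):
    stats = {'updated': 0, 'ignored': 0, 'not_found': 0}
    for ids in chunked(doc_ids, chunk_size):
        to_save = []
        for doc_id in ids:
            if doc_id not in store:
                stats['not_found'] += 1
                continue
            result = update_fn(dict(store[doc_id]))  # work on a copy
            if result is None:
                stats['ignored'] += 1
            else:
                to_save.append(result)
        for doc in to_save:  # save the whole chunk's changes together
            store[doc['_id']] = doc
        stats['updated'] += len(to_save)
    return stats


store = {
    'a': {'_id': 'a', 'user_data': {'x': 1}},
    'b': {'_id': 'b'},
    'c': {'_id': 'c', 'user_data': {}},
}
stats = iter_update_sketch(store, remove_user_data, ['a', 'b', 'c', 'missing'])
```

Documents that already lack the field are counted as ignored rather than re-saved, which is one plausible reading of the "ignored" and "updated" counts in the command output.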



class Command(BaseCommand):
    help = "Populate SQL user data from couch"
Contributor

@AddisonDunn AddisonDunn Jun 27, 2024

I think this migration just removes Couch data and doesn't populate SQL? Or is there some magic in saving the doc that populates SQL that I haven't discovered?

Contributor Author

oh whoops you're right - I forgot to update the help text. Let me do that now

@esoergel
Contributor Author

Just ran the management command on staging, where it succeeded without issues:

Elapsed time: 0:28:04
couldn't find 0 docs
ignored 86 docs
deleted 0 docs
updated 76144 docs

@esoergel esoergel merged commit 9d20392 into master Jun 27, 2024
13 checks passed
@esoergel esoergel deleted the es/user-data-prep branch June 27, 2024 21:13
@esoergel
Contributor Author

Ran the management command on this PR on prod:

Started at 2024-07-16 19:26:20
Processing [..................................................] 1395446/1395444 100% 1:02:32.965644 elapsed
Finished at 2024-07-16 20:28:53
Elapsed time: 1:02:33

india:

Started at 2024-07-16 19:27:33
Processing [..................................................] 159858/159858 100% 0:18:35.724473 elapsed
Finished at 2024-07-16 19:46:09
Elapsed time: 0:18:36

and swiss:

Started at 2024-07-16 19:28:49
Processing [..................................................] 935/935 100% 0:00:02.457244 elapsed
Finished at 2024-07-16 19:28:51
Elapsed time: 0:00:02
