-
-
Notifications
You must be signed in to change notification settings - Fork 218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove user data from couch user model #34805
Conversation
It is overridden anyways in dehydrate
This was previously checking that user data changed before calling `add_changes`, and then checking again inside `add_changes`, but this time against the couch field, which shouldn't be in-use anymore
The migration will come later
8da070c
to
983b087
Compare
The query only returns docs that _don't_ have that value set, then fetches those from couch and attempts to get that value. However, this information is no longer stored in couch anyways.
675b859
to
d04391c
Compare
@@ -367,32 +366,28 @@ def get_paginated_cases_without_gps(self, domain, page, limit): | |||
|
|||
def _get_paginated_users_without_gps(domain, page, limit, query): | |||
location_prop_name = get_geo_user_property(domain) | |||
query = ( | |||
res = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zandre-eng What do you think of this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the late response on this as I was away on PTO. The changes look good, and it's nice that we're able to handle pagination now directly in the ES query.
These have been broken since the migration to SQL, apparently
77ba5e1
to
a181308
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reviewed, didn't have any additional comments but not enough confidence of the context to approve
@@ -174,20 +174,16 @@ def update_user_data(self, data, uncategorized_data, profile_name, profiles_by_n | |||
if user_data.profile_id and user_data.profile_id != old_profile_id: | |||
self.logger.add_info(UserChangeMessage.profile_info(user_data.profile_id, profile_name)) | |||
|
|||
if old_user_data != user_data.raw: | |||
self.logger.add_changes({'user_data': user_data.raw}, skip_confirmation=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand we did that before too. But is logging the user data a good idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate? This code logs all changes made to a user for auditing purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it is just the name of the field being user_data
and in actuality it does not contain anything sensitive. But I wonder if there was a chance something that we don't want logged ends up in there. I think it would not be clear to every developer changing what goes in user_data
that it will also end up in the logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true - though logger
here actually isn't a normal python logger, it's an instance of UserChangeLogger
, which saves details of changes to postgres to power an auditing report.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair
@@ -126,3 +127,25 @@ def all_domains_with_migrations_in_progress(): | |||
def reset_caches(domain, slug): | |||
any_migrations_in_progress(domain, strict=True) | |||
get_migration_status(domain, slug, strict=True) | |||
|
|||
|
|||
def once_off_migration(slug): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this.
@@ -61,7 +61,7 @@ class Meta(CustomResourceMeta): | |||
|
|||
class CommCareUserResource(UserResource): | |||
groups = fields.ListField(attribute='get_group_ids') | |||
user_data = fields.DictField(attribute='user_data') | |||
user_data = fields.DictField() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't the API still be trying to update the user_data
attribute on the CommCareUser
model? Or is that somehow otherwise accounted for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, good question. That's handled here:
commcare-hq/corehq/apps/api/user_updates.py
Lines 112 to 122 in d3d0b80
def _update_user_data(user, new_user_data, user_change_logger): | |
try: | |
profile_id = new_user_data.pop(PROFILE_SLUG, ...) | |
changed = user.get_user_data(user.domain).update(new_user_data, profile_id=profile_id) | |
except UserDataError as e: | |
raise UpdateUserException(str(e)) | |
if user_change_logger and changed: | |
user_change_logger.add_changes({ | |
'user_data': user.get_user_data(user.domain).raw | |
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, obj_update
should manually override the update for this field.
from corehq.apps.users.model_log import UserModelAction | ||
|
||
if action in [UserModelAction.CREATE, UserModelAction.DELETE]: | ||
changed_details = couch_user.to_json() | ||
changed_details['user_data'] = couch_user.get_user_data(for_domain).raw if for_domain else {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking my understanding: the logs have been missing the user data field for the last few months cause user data is already migrated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's right 😬
.sort('created_on', desc=True) | ||
.start((page - 1) * limit) | ||
.size(limit) | ||
.run() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Man was this pulling all the users every function call when you could just pass the page directly to ES? 💀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep. And then fetching the users again from couch. It does the same thing in the function below this one too - I just didn't want to get carried away with cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I see tests cover this
return None | ||
if related_doc_type == 'CommCareUser': | ||
doc['user_data'] = CommCareUser.wrap(doc).get_user_data(domain).to_dict() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess user data just hasn't been part of UCRs since the migration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these expressions not apply to WebUser
s too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess user data just hasn't been part of UCRs since the migration?
That's right - this has been broken since the migration.
You can't do a related doc lookup on a web user, since they don't have a top level domain
attribute. For the same reason, they also don't have a user_data
attribute (since it varies from one domain to the next), though I guess we could inject one in this context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what it's worth, this means of directly accessing the user_data
from the raw JSON has been broken to some degree for years - user data is only supposed to be accessed via helper methods, as fields controlled by the profile aren't stored directly in the user document. That means that the profile wasn't accounted for in the API or in UCR. I just broke it further 😆
+ get_doc_count_by_type(db, 'CommCareUser')) | ||
all_ids = chain(iter_all_doc_ids(db, 'WebUser'), | ||
iter_all_doc_ids(db, 'CommCareUser')) | ||
iter_update(db, _update_user, with_progress_bar(all_ids, count), verbose=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To check my understanding, is this removing the couch data by just unwrapping and re-wrapping the doc without the user_data
field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iter_update is akin to map
- it fetches documents by ID in chunks, applies _update_user
to them, and saves the return value (also in chunks) if anything changed. So here _update_user
defines the actual change to each document, which is user_doc.pop('user_data', ...)
(.pop
differs from .get
in that it mutates the object).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, gotcha. I wasn't registering that pop
is removing the value (I know that it does, just sort of slips my mind sometimes). That makes sense.
|
||
|
||
class Command(BaseCommand): | ||
help = "Populate SQL user data from couch" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this migration just removes couch data not doesn't populate SQL? Or is there some magic in saving the doc that populates SQL I haven't discovered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh whoops you're right - I forgot to update the help text. Let me do that now
Just ran the management command on staging, where it succeeded without issues:
|
Ran the management command on this PR on prod:
india:
and swiss:
|
Product Description
Technical Summary
This is cleanup from the migration to SQL that happened earlier in the year. The couch field is no longer being updated and isn't in use anywhere, so this PR removes it. Turns out removing that field breaks a couple things that accessed the raw user document. Fixes for those are included.
Since the model change needs to go live before the migration is run, this PR includes a management command to be run post-deploy. #34798 will include a django migration that calls that command for 3rd-party environments.
Feature Flag
Safety Assurance
There were apparently a handful of places in tests that took issue with this field going away, so there is a chance there are other unknown, untested areas that reference this.
Safety story
Automated test coverage
QA Plan
Migrations
This has a model change and a management command, but the automatic migration will come in a second PR
Rollback instructions
Labels & Review