
Change Case and User Ownership (Script) #34588
Status: Open. Wants to merge 70 commits into base: master.
Conversation

@zandre-eng (Contributor) commented May 9, 2024:

Technical Summary

Link to ticket here.

Please note: This PR is not intended to be merged. It exists purely so that other devs can give feedback on a script that will be run on production.

The purpose of this script is to achieve two things for a specific domain (a rough sketch follows below):

(1) Change Location of Mobile Workers
All mobile workers that have a usertype user property matching rc will need to have their primary location changed to match what is saved under the user's rc_number property.

(2) Change Ownership of Cases
All cases matching a given set of case types will be queried and have their owner_id property changed to match the primary location of the mobile worker that created the specific case.
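For orientation, here is a minimal sketch of the two updates. iter_rc_users() and get_case_ids() are hypothetical helpers standing in for the actual queries, and the imports assume the current commcare-hq module layout:

```python
from casexml.apps.case.mock import CaseBlock
from corehq.apps.hqcase.utils import submit_case_blocks
from corehq.apps.locations.models import SQLLocation
from corehq.apps.users.models import CommCareUser
from corehq.form_processor.models import CommCareCase

DOMAIN = 'alafiacomm-prod'

# (1) Move each 'rc' mobile worker down to the child location named
# after their rc_number user property
for user in iter_rc_users(DOMAIN):  # hypothetical helper
    rc_number = user.get_user_data(DOMAIN)['rc_number']
    loc = SQLLocation.objects.get(
        domain=DOMAIN,
        parent__location_id=user.location_id,
        name=rc_number,
    )
    user.location_id = loc.location_id
    user.save(fire_signals=False)

# (2) Reassign each matching case to the primary location of the
# mobile worker who opened it
case_ids = get_case_ids(DOMAIN)  # hypothetical helper
for case_obj in CommCareCase.objects.iter_cases(case_ids, domain=DOMAIN):
    opener = CommCareUser.get_by_user_id(case_obj.opened_by, domain=DOMAIN)
    block = CaseBlock(create=False, case_id=case_obj.case_id, owner_id=opener.location_id)
    submit_case_blocks([block.as_text()], DOMAIN)
```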

The first draft of this script, though later made redundant, made use of SQLite for keeping track of which users/cases to update. Before executing the script, we would first do an initial run to retrieve and store all the relevant case and user IDs in SQLite. After this had been done, the script would retrieve and process chunks from the DB. Using this process alleviates the need for managing chunking in ES, and makes starting/stopping/reversing the script more manageable (since we can simply query the DB to get the cases/users we need).
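A minimal sketch of that first-draft bookkeeping, assuming a single-table schema (table and file names are illustrative):

```python
import sqlite3

conn = sqlite3.connect('case_ownership_script.db')  # illustrative filename
conn.execute(
    'CREATE TABLE IF NOT EXISTS case_ids ('
    ' case_id TEXT PRIMARY KEY,'
    ' processed INTEGER DEFAULT 0)'
)

def next_chunk(chunk_size=100):
    # Only unprocessed rows come back, so a stopped run resumes cleanly
    rows = conn.execute(
        'SELECT case_id FROM case_ids WHERE processed = 0 LIMIT ?',
        (chunk_size,),
    ).fetchall()
    return [case_id for (case_id,) in rows]

def mark_processed(case_ids):
    conn.executemany(
        'UPDATE case_ids SET processed = 1 WHERE case_id = ?',
        [(case_id,) for case_id in case_ids],
    )
    conn.commit()
```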

The updated approach didn't need to keep track of which IDs had been updated for the purpose of tracking or resuming the script in case of failures, so the use of a temporary DB was dropped.

)
# Set this new descendant location as user location
user.location_id = loc.location_id
user.save(fire_signals=False)
@Charl1996 (Contributor), May 9, 2024:

Can we maybe use bulk_save?
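For reference, the batched shape might look roughly like this (the PR does end up using CommCareUser.bulk_save; the loop variables and batch size here are illustrative):

```python
BATCH_SIZE = 100  # illustrative
users_to_save = []
for user in users:
    user.location_id = loc.location_id
    users_to_save.append(user)
    if len(users_to_save) >= BATCH_SIZE:
        CommCareUser.bulk_save(users_to_save)
        users_to_save = []
if users_to_save:  # flush the final partial batch
    CommCareUser.bulk_save(users_to_save)
```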


### Task 2 & 3
def transfer_case_ownership():
    print("---MOVING CASE OWNERSHIP---")
    case_ids = (
@zandre-eng (Contributor Author), May 9, 2024:

I'm curious to get thoughts from others on using ES to get the case IDs. When the script is run we will be dealing with quite a lot of cases, and I feel fetching them all at once in ES could lead to memory problems.

@gherceg (Contributor):

Yeah, you don't want to try to hold all IDs in memory at once. I don't have much experience paginating ES queries, but I have done something like this in SQL on CommCareCase. You could use paginate_query, with the downside being you'd have to iterate over each partition, but that isn't too bad. If helpful, here's a not-yet-merged example.

@zandre-eng (Contributor Author):

Thanks for the suggestion, this is very helpful. I've implemented something similar in my approach, where I'm storing case IDs in SQLite. The script retrieves the relevant cases using paginate_query_across_partitioned_databases(). Once we have the case IDs in the DB, we can then easily handle chunking the cases appropriately.
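A sketch of that combination; it assumes paginate_query_across_partitioned_databases from corehq.sql_db.util accepts a values kwarg, and conn is the SQLite connection from the tracking DB:

```python
from django.db.models import Q

from corehq.form_processor.models import CommCareCase
from corehq.sql_db.util import paginate_query_across_partitioned_databases

# Stream matching case IDs out of each partitioned DB; memory stays
# bounded because results arrive page by page
rows = paginate_query_across_partitioned_databases(
    CommCareCase,
    Q(domain=DOMAIN, type='seance_educative'),
    values=['case_id'],
)
for (case_id,) in rows:
    conn.execute('INSERT OR IGNORE INTO case_ids (case_id) VALUES (?)', (case_id,))
conn.commit()
```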

@gherceg (Contributor) left a comment:

Nice work. Just left a comment about paginating case ids. The Flake8 issue looks legit as well.

@mkangia (Contributor) left a comment:

Hey @zandre-eng

This looks like a serious migration, so I suggest we make it much more robust. I have shared a few ways to do so.

It looks like we are doing two updates in a single file? It is currently not clear what the entry point is. How about using vertical formatting, or converting this to 2 management commands?

It's good to check out some existing management commands for ideas. Examples:
https://github.com/dimagi/commcare-hq/blob/f0ad6f2433b0e43ab0ce29062ce10d6e1ade641a/corehq/motech/management/commands/create_repeat_records.py
https://github.com/dimagi/commcare-hq/blob/267dc437bfc0b2c158fad699948be654e6369a36/custom/covid/management/commands/update_owner_ids.py



success_count = fail_count = 0
for user in valid_users:
Contributor:

I could not find valid_users. Where is this assigned?

@zandre-eng (Contributor Author):

Apologies, this is not part of the actual script and was left in for my own testing. Please ignore all the code that comes after this; I'll be removing it.

loc = SQLLocation.objects.get(
    domain=DOMAIN,
    parent__location_id=user.location_id,
    name=user_data['rc_number']
Contributor:

Should we check if this value is even present before firing off the query?
Is the parent check only for the immediate parent? I believe that is what this query would do, though I'm not certain. You could look at get_queryset_descendants if you need all descendants.

Could searching by name yield multiple results? Are we expected to handle that?

Assuming there are quite a number of cases, you could extract this into a function and cache the fetch by name & parent_location_id, so we fetch each name only once. If the number of locations is small, you could simply create a dict mapping even before starting the updates.
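Something like this hypothetical cached helper would make each (parent, rc_number) pair hit the database only once; DoesNotExist and MultipleObjectsReturned would still need explicit handling:

```python
from functools import lru_cache

from corehq.apps.locations.models import SQLLocation

@lru_cache(maxsize=None)
def get_rc_location(parent_location_id, rc_number):
    # Cached per (parent, rc_number) pair; .get() still raises for a
    # missing or ambiguous name under that parent
    return SQLLocation.objects.get(
        domain=DOMAIN,  # assumes the script's module-level DOMAIN constant
        parent__location_id=parent_location_id,
        name=rc_number,
    )
```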

@zandre-eng (Contributor Author), May 10, 2024:

Please refer to this comment.

Contributor:

I think that link didn't open up as you wanted it to @zandre-eng

@zandre-eng (Contributor Author):

Apologies, the link has been fixed now.

try:
    loc = SQLLocation.objects.get(
        domain=DOMAIN,
        parent__location_id=user.location_id,
Contributor:

Should we check if this value is present before firing the query?

@zandre-eng (Contributor Author), May 10, 2024:

Please refer to this comment.

Contributor:

Same, that link isn't opening to a comment.

@zandre-eng (Contributor Author):

The link is fixed for this comment as well.

case_type('seance_educative'),
case_type('fiche_pointage')
)
.sort('opened_on') # sort so that we can continue
Contributor:

What if the run fails partway through? How do we resume?

How about dumping these IDs to a file first and then reading from there instead?

Contributor:

Do we need closed cases as well, or just open cases?

@zandre-eng (Contributor Author):

Both closed and open cases.

end_time = total_time = 0
success_count = fail_count = 0
start_time = time.time()
for case_obj in CommCareCase.objects.iter_cases(case_ids, domain=DOMAIN):
Contributor:

You should consider using chunked here along with with_progress_bar, like here. This way you can see progress and have better control over the chunk size, which is 100 by default; you could probably do more, though even 100 is okay.
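A sketch of the suggested pattern, assuming chunked from dimagi.utils.chunked and with_progress_bar from corehq.util.log:

```python
from dimagi.utils.chunked import chunked
from corehq.util.log import with_progress_bar

for case_id_chunk in chunked(with_progress_bar(case_ids, length=case_count), 100):
    for case_obj in CommCareCase.objects.iter_cases(list(case_id_chunk), domain=DOMAIN):
        ...  # build and submit the CaseBlock as before
```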

@zandre-eng (Contributor Author):

I've implemented chunking along with the with_progress_bar function for both users and cases (bf3b264)

case_block = CaseBlock(
    create=False,
    case_id=case_obj.case_id,
    owner_id=user.location_id,
Contributor:

Should we avoid the update if it's already this owner?

@zandre-eng (Contributor Author):

This is a good point. We could probably log these separately and then simply skip the update.

@zandre-eng (Contributor Author):

I've implemented the same for users, so that we don't save user updates if they are already at the correct location (bf3b264)
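The skip itself is small; skipped_case_ids here is a hypothetical list backing the separate log mentioned above:

```python
if case_obj.owner_id == user.location_id:
    skipped_case_ids.append(case_obj.case_id)  # log separately, skip the update
    continue
```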

user_to_save = []
for idx, user in enumerate(users):
    start_time = time.time()
    percentage_done = round((idx / user_count) * 100, 2)
Contributor:

You could use with_progress_bar to do this for you.

@zandre-eng (Contributor Author):

I'll take a look, thanks!

@zandre-eng (Contributor Author):

I've gotten rid of all the timing/percentage metrics and made use of with_progress_bar (bf3b264).

    continue

try:
    # Get a descendant of user location which has the same rc number
Contributor:

Do we need to look at just the immediate descendants?

@zandre-eng (Contributor Author):

I can confirm that all the mobile workers need to be moved one level down, so we only need to consider the immediate descendants.

Skipped: {skip_count},
Total Time: {round(total_time / 60, 2)} minutes"
)
CommCareUser.bulk_save(user_to_save)
Contributor:

Are we holding off on saving users until the end? This could be problematic and is prone to failing; we should update them as we go.

@zandre-eng (Contributor Author):

Yeah, I can see how this might become an issue. I'll change this so that we save the users in batches (similar to what we are doing for cases).

@zandre-eng (Contributor Author):

The user class now processes/saves users in the same way that we do for cases (bf3b264).

# DOMAIN = 'alafiacomm'
DOMAIN = 'alafiacomm-prod'

success_case_log_file_path = os.path.expanduser('~/script_success.log')
Contributor:

Nit: I prefer adding timestamps to filenames so that re-runs don't produce conflicting files.
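For example, something along these lines:

```python
import os
from datetime import datetime

timestamp = datetime.utcnow().strftime('%Y%m%d_%H%M%S')
success_case_log_file_path = os.path.expanduser(f'~/script_success_{timestamp}.log')
```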

@zandre-eng (Contributor Author):

Good idea! Given the scope of how many cases we'll be dealing with, do you think it will be fine to have everything inside a single file, or should we chunk our logging?

Contributor:

Oh yes, good point. A single file won't be great, though this is just a text file and we'd just be dumping data into it.

I think we should also consider breaking it down by iterating one case type at a time. This way you can run multiple of these at once!

@zandre-eng (Contributor Author):

This is a great suggestion, and I think one that is easily achievable with how the script has been set up now. We can simply store the data for each case type in a separate DB table for processing.

I've updated the script to make CaseUpdater only handle a single case type now (559dfad).

@zandre-eng (Contributor Author):

The number and length of functions was starting to make it difficult to follow along with the script, so I've refactored everything into classes, taking into account some of the feedback thus far.

@zandre-eng (Contributor Author):

> It looks like we are doing two updates in a single file? It is currently not clear what the entry point is. How about using vertical formatting, or converting this to 2 management commands?

@mkangia I've split everything off into relevant classes to make keeping track of everything easier. The entry point would simply be calling the relevant class's start() function. Let me know if this refactoring helps clear things up.

@mkangia (Contributor) left a comment:

Hey @zandre-eng

Looking better. Considering the number of cases, should we do one case type at a time and run them in parallel?

with open(file_path, 'a') as log:
    for id in ids:
        log.write(f'{id} {msg_out}\n')
log.close()
Contributor:

This is going to be a lot of file operations. Can we consider keeping the file open with a with block and, if needed, using flush to write to the file? Though the writer, or csv.writer, might do that automatically for you.
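A sketch of that shape, holding one handle open across all chunks (chunked as elsewhere in the thread; the flush is only needed if progress must be durable mid-run):

```python
from dimagi.utils.chunked import chunked

with open(file_path, 'a') as log:
    for id_chunk in chunked(ids, 1000):
        log.writelines(f'{id_} {msg_out}\n' for id_ in id_chunk)
        log.flush()  # optional; the context manager flushes on exit anyway
```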

@zandre-eng (Contributor Author):

Noting that this is no longer relevant, since the script has moved away from logging to a file and now stores all the necessary data in a SQLite DB.

user.set_location(location)
log(f"User {user.username}:{user.user_id} location updated to {location.location_id}")
user.set_location(new_location)
user.unset_location_by_id(existing_location.location_id)
Contributor:

Oh, this was needed?

Contributor:

Yes, based on the discussion with Adam, it was decided to have only the rc location for the mobile worker.
(Side note: having only 1 location that owns cases also helps avoid location-ambiguity issues with Case Sharing.)

Contributor:

Okay. It would be good to not do this if existing_location is the same as new_location; not sure if that is possible or not.

@ajeety4 (Contributor), Jul 1, 2024:

new_location will always be different from existing_location for this script, because it is a child location of the existing location.



@task(queue='background_queue', ignore_result=True)
def process_updates_for_village_async(domain, village, dry_run):
Contributor:

Celery is a great approach. Just one thing to note: once things are in celery you won't have much visibility into, or control over, the updates, as you lose the logging, and once the tasks are queued up you don't know when they will get picked up.

Just good to consider that before you finalize the approach.

Contributor:

Noted. That is a valid concern.
I have added a new logger (file) for the script; however, I am not sure if it will work when tasks are run across multiple celery machines.
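For reference, the fan-out would presumably look something like this, where villages is whatever iterable the script derives; once queued, monitoring shifts to the celery workers' logs:

```python
# Queue one task per village so the updates run in parallel on celery workers
for village in villages:
    process_updates_for_village_async.delay(DOMAIN, village, dry_run)
```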

@zandre-eng zandre-eng marked this pull request as ready for review July 4, 2024 12:56
@zandre-eng (Contributor Author):

I will be merging in this PR so that we can have the celery task available in the production environment for when we need to execute this script.

custom/benin/tasks.py (thread outdated; resolved)
@ajeety4 (Contributor) left a comment:

Looks good and safe to me; since this is a management command, it does not affect any current functionality.
