-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster CSV uploads #200
Faster CSV uploads #200
Conversation
Catch errors entirely within imd_for_postcode in preparation for calling the function from different places
in preparation for trying to async batch the calls when uploading a CSV
- Make all db calls in csv_upload async - Make iteration through the rows of the dataframe async in a task group - Add async wrapping to imd_for_postcode and validate_postcode but don't actually make them async yet - Wrap the login_and_otp_required decorator as it doesn't work with async views natively
…ehend that right now
It's no faster but there is async everywhere it should be (I think?!). How much faster will it be when we start using an async aware requests library? https://fly.io/django-beats/running-tasks-concurrently-in-django-asynchronous-views/ recommends httpx
@@ -57,15 +72,15 @@ def csv_upload(user, csv_file=None, organisation_ods_code=None, pdu_pz_code=None | |||
AuditCohort = apps.get_model("npda", "AuditCohort") | |||
|
|||
# set previous quarter to inactive | |||
AuditCohort.objects.filter( | |||
await AuditCohort.objects.filter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since Django provides async versions of the ORM functionality I've simply used that where possible to avoid as much sync_to_async wrapping (since that implies a bigger code refactor pulling out functions all over the place)
@@ -820,12 +836,19 @@ def save_row(row, timestamp, cohort_id): | |||
|
|||
nhs_number = row["NHS Number"].replace(" ", "") | |||
|
|||
postcode = row["Postcode of usual address"] | |||
|
|||
index_of_multiple_deprivation_quintile = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved from Patient.save
as that hook can't be async
return {"status": 500, "errors": error} | ||
|
||
return {"status": 200, "errors": None} | ||
|
||
|
||
@sync_to_async |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably trivial to make the calls within this function async but no need too
@@ -4,34 +4,50 @@ | |||
|
|||
# Standard imports | |||
import logging | |||
import requests | |||
import httpx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fly.io blog post suggested httpx and it looks great and is requests API compatible. I have no strong preference what we use though so long as it supports async and sadly httpx does bring some transitive dependencies.
@@ -206,6 +206,10 @@ def form_valid(self, form: BaseForm) -> HttpResponse: | |||
) | |||
patient.site = site | |||
patient.is_valid = True | |||
|
|||
if patient.postcode: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, needed to come out of Patient.save
for async support.
|
||
if patient.postcode: | ||
imd = imd_for_postcode(patient.postcode) | ||
if imd: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sneaky bug fix, don't overwrite an IMD score if the API transiently returns an error
275d55f
to
26f45e4
Compare
As opposed to looking them up within the Audit Cohort. This method is compatible with the async work in #200. We're able to do this as CSV uploads always involve creating a new cohort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this works for me - a bunch quicker
Fixes #148
Fixes #147
Make processing CSV uploads an order of magnitude faster by making all the external API requests in parallel across all rows.
I've done this by making the parent view async along with all the csv_upload functions and the
validate_postcode
andimd_for_postcode
third party API lookups. The CSV upload code then creates a Python TaskGroup and submits a task to process each row.The tasks share an httpx AsyncClient so they can co-operatively submit and wait for the responses. For the existing sync calling code on patient create and update I've added wrappers to create a client on demand for the duration of the requests.
I got the approach from this blog post: https://fly.io/django-beats/running-tasks-concurrently-in-django-asynchronous-views/. It's my first time with async Python though so it's very possible I've done something wrong or inefficiently. In particular this approach is naive in proportion to the number of rows in the file. Submitting a CSV with thousands of rows will launch thousands of requests simulataneously which is almost certainly not the behaviour we want. That said, this speeds up our current demo with
dummy_sheet.csv
so much I think it's probably worth merging as is and iterating on. Async code is fiddly by nature so we should not adopt this approach across the board, only where we deem it necessary.Time to upload
dummy_sheet.csv
: