Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster CSV uploads #200

Closed
wants to merge 10 commits into from
Closed

Faster CSV uploads #200

wants to merge 10 commits into from

Conversation

mbarton
Copy link
Member

@mbarton mbarton commented Jul 18, 2024

Fixes #148
Fixes #147

Make processing CSV uploads an order of magnitude faster by making all the external API requests in parallel across all rows.

I've done this by making the parent view async along with all the csv_upload functions and the validate_postcode and imd_for_postcode third party API lookups. The CSV upload code then creates a Python TaskGroup and submits a task to process each row.

The tasks share an httpx AsyncClient so they can co-operatively submit and wait for the responses. For the existing sync calling code on patient create and update I've added wrappers to create a client on demand for the duration of the requests.

I got the approach from this blog post: https://fly.io/django-beats/running-tasks-concurrently-in-django-asynchronous-views/. It's my first time with async Python though so it's very possible I've done something wrong or inefficiently. In particular this approach is naive in proportion to the number of rows in the file. Submitting a CSV with thousands of rows will launch thousands of requests simulataneously which is almost certainly not the behaviour we want. That said, this speeds up our current demo with dummy_sheet.csv so much I think it's probably worth merging as is and iterating on. Async code is fiddly by nature so we should not adopt this approach across the board, only where we deem it necessary.

Time to upload dummy_sheet.csv:

  • Before: 11.91 seconds
  • After: 3.01 seconds! 🚀

Catch errors entirely within imd_for_postcode in preparation for calling the function from different places
in preparation for trying to async batch the calls when uploading a CSV
- Make all db calls in csv_upload async
- Make iteration through the rows of the dataframe async in a task group
- Add async wrapping to imd_for_postcode and validate_postcode but don't actually make them async yet
- Wrap the login_and_otp_required decorator as it doesn't work with async views natively
It's no faster but there is async everywhere it should be (I think?!). How much faster will it be when we start using an async aware requests library?

https://fly.io/django-beats/running-tasks-concurrently-in-django-asynchronous-views/ recommends httpx
@mbarton mbarton self-assigned this Jul 18, 2024
@@ -57,15 +72,15 @@ def csv_upload(user, csv_file=None, organisation_ods_code=None, pdu_pz_code=None
AuditCohort = apps.get_model("npda", "AuditCohort")

# set previous quarter to inactive
AuditCohort.objects.filter(
await AuditCohort.objects.filter(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since Django provides async versions of the ORM functionality I've simply used that where possible to avoid as much sync_to_async wrapping (since that implies a bigger code refactor pulling out functions all over the place)

@@ -820,12 +836,19 @@ def save_row(row, timestamp, cohort_id):

nhs_number = row["NHS Number"].replace(" ", "")

postcode = row["Postcode of usual address"]

index_of_multiple_deprivation_quintile = None
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved from Patient.save as that hook can't be async

return {"status": 500, "errors": error}

return {"status": 200, "errors": None}


@sync_to_async
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably trivial to make the calls within this function async but no need too

@@ -4,34 +4,50 @@

# Standard imports
import logging
import requests
import httpx
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fly.io blog post suggested httpx and it looks great and is requests API compatible. I have no strong preference what we use though so long as it supports async and sadly httpx does bring some transitive dependencies.

project/npda/views/home.py Show resolved Hide resolved
@@ -206,6 +206,10 @@ def form_valid(self, form: BaseForm) -> HttpResponse:
)
patient.site = site
patient.is_valid = True

if patient.postcode:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, needed to come out of Patient.save for async support.


if patient.postcode:
imd = imd_for_postcode(patient.postcode)
if imd:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sneaky bug fix, don't overwrite an IMD score if the API transiently returns an error

@mbarton mbarton force-pushed the mbarton/faster-csv-uploads branch from 275d55f to 26f45e4 Compare July 18, 2024 10:14
mbarton added a commit that referenced this pull request Jul 23, 2024
As opposed to looking them up within the Audit Cohort. This method is compatible with the async work in #200. We're able to do this as CSV uploads always involve creating a new cohort.
Copy link
Member

@eatyourpeas eatyourpeas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this works for me - a bunch quicker

@mbarton
Copy link
Member Author

mbarton commented Jul 24, 2024

To properly integrate this after #212 I'll need to refactor the CSV validation functions to the point it makes sense to do #31 and #47 first before rebuilding this PR on top of live again

@mbarton mbarton closed this Jul 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants