Faster CSV uploads #200

mbarton · 2024-07-18T09:15:11Z

Fixes #148
Fixes #147

Make processing CSV uploads several times faster (about 4x on dummy_sheet.csv, per the timings below) by making all the external API requests in parallel across all rows.

I've done this by making the parent view async, along with all the csv_upload functions and the validate_postcode and imd_for_postcode third-party API lookups. The CSV upload code then creates an asyncio TaskGroup and submits a task to process each row.

The tasks share an httpx AsyncClient so they can co-operatively submit and wait for the responses. For the existing sync calling code on patient create and update I've added wrappers to create a client on demand for the duration of the requests.

I got the approach from this blog post: https://fly.io/django-beats/running-tasks-concurrently-in-django-asynchronous-views/. It's my first time with async Python, though, so it's very possible I've done something wrong or inefficiently. In particular, the number of concurrent requests grows in proportion to the number of rows in the file: submitting a CSV with thousands of rows will launch thousands of requests simultaneously, which is almost certainly not the behaviour we want. That said, this speeds up our current demo with dummy_sheet.csv so much that I think it's probably worth merging as is and iterating on. Async code is fiddly by nature, so we should not adopt this approach across the board, only where we deem it necessary.

Time to upload dummy_sheet.csv:

Before: 11.91 seconds
After: 3.01 seconds! 🚀

- Make all db calls in csv_upload async
- Make iteration through the rows of the dataframe async in a task group
- Add async wrapping to imd_for_postcode and validate_postcode but don't actually make them async yet
- Wrap the login_and_otp_required decorator as it doesn't work with async views natively
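For the decorator point, a minimal sketch of how a view decorator can support both sync and async views is to branch on `inspect.iscoroutinefunction`. This is an illustration under assumptions, not the actual `login_and_otp_required` implementation: `require_check`, the `check` callable and the `"forbidden"` return value are hypothetical, and the request is a plain dict so the example runs without Django.

```python
import asyncio
import functools
import inspect


def require_check(check):
    """Hypothetical decorator factory: run `check(request)` before the view."""

    def decorator(view):
        if inspect.iscoroutinefunction(view):
            # Async views need an async wrapper so the result can be awaited.
            @functools.wraps(view)
            async def async_wrapper(request, *args, **kwargs):
                if not check(request):
                    return "forbidden"
                return await view(request, *args, **kwargs)

            return async_wrapper

        # Sync views keep a plain wrapper.
        @functools.wraps(view)
        def sync_wrapper(request, *args, **kwargs):
            if not check(request):
                return "forbidden"
            return view(request, *args, **kwargs)

        return sync_wrapper

    return decorator
```

In a real Django view the check would likely touch the session or database, so in the async branch it would probably need wrapping (for example with asgiref's `sync_to_async`) rather than being called directly.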

mbarton · 2024-07-18T09:17:10Z

project/npda/general_functions/csv_upload.py

@@ -57,15 +72,15 @@ def csv_upload(user, csv_file=None, organisation_ods_code=None, pdu_pz_code=None
    AuditCohort = apps.get_model("npda", "AuditCohort")

    # set previous quarter to inactive
-    AuditCohort.objects.filter(
+    await AuditCohort.objects.filter(


Since Django provides async versions of the ORM functionality, I've simply used those where possible to avoid as much sync_to_async wrapping as I can (that wrapping implies a bigger code refactor, pulling out functions all over the place).

mbarton · 2024-07-18T09:17:46Z

project/npda/general_functions/csv_upload.py

@@ -820,12 +836,19 @@ def save_row(row, timestamp, cohort_id):

        nhs_number = row["NHS Number"].replace(" ", "")

+        postcode = row["Postcode of usual address"]
+
+        index_of_multiple_deprivation_quintile = None


Moved from Patient.save as that hook can't be async

mbarton · 2024-07-18T09:18:28Z

project/npda/general_functions/csv_upload.py

        return {"status": 500, "errors": error}

    return {"status": 200, "errors": None}

-
+@sync_to_async


Probably trivial to make the calls within this function async, but there's no need to.

mbarton · 2024-07-18T09:19:51Z

project/npda/general_functions/index_multiple_deprivation.py

@@ -4,34 +4,50 @@

 # Standard imports
 import logging
-import requests
+import httpx


The fly.io blog post suggested httpx; it looks great and has a broadly requests-compatible API. I have no strong preference about what we use, so long as it supports async. Sadly, httpx does bring some transitive dependencies.

project/npda/views/home.py

mbarton · 2024-07-18T09:23:04Z

project/npda/views/patient.py

@@ -206,6 +206,10 @@ def form_valid(self, form: BaseForm) -> HttpResponse:
        )
        patient.site = site
        patient.is_valid = True
+
+        if patient.postcode:


As above, needed to come out of Patient.save for async support.

mbarton · 2024-07-18T09:23:20Z

project/npda/views/patient.py

+
+        if patient.postcode:
+            imd = imd_for_postcode(patient.postcode)
+            if imd:


Sneaky bug fix: don't overwrite an IMD score if the API transiently returns an error.
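The idea behind that fix, combined with catching errors inside the lookup itself, can be illustrated with a small standalone sketch. Everything here is hypothetical (`PatientRecord`, `safe_imd_lookup`, `update_imd`, and the injected `fetch` callable standing in for the third-party API); it is not the code from patient.py.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PatientRecord:
    """Hypothetical stand-in for the Patient model."""

    postcode: str
    index_of_multiple_deprivation_quintile: Optional[int] = None


def safe_imd_lookup(postcode: str, fetch: Callable[[str], int]) -> Optional[int]:
    # Errors are caught inside the lookup so callers only ever see a value or None.
    # A real implementation would likely catch narrower exceptions and log them.
    try:
        return fetch(postcode)
    except Exception:
        return None


def update_imd(patient: PatientRecord, fetch: Callable[[str], int]) -> PatientRecord:
    if patient.postcode:
        imd = safe_imd_lookup(patient.postcode, fetch)
        if imd:
            # Only overwrite when the lookup succeeded, so a transient API
            # failure leaves any previously stored value in place.
            patient.index_of_multiple_deprivation_quintile = imd
    return patient
```

With this shape, a failed lookup is indistinguishable from "no data" for the caller, which keeps the call sites simple at the cost of hiding the reason for the failure.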

As opposed to looking them up within the Audit Cohort. This method is compatible with the async work in #200. We're able to do this as CSV uploads always involve creating a new cohort.

eatyourpeas

this works for me - a bunch quicker

mbarton · 2024-07-24T10:55:02Z

To properly integrate this after #212, I'll need to refactor the CSV validation functions to the point that it makes sense to do #31 and #47 first, before rebuilding this PR on top of live again.

mbarton added 9 commits July 17, 2024 14:32

Put back unknown postcode checking accidentally removed in #197

ba92b61

Catch errors entirely within imd_for_postcode in preparation for calling the function from different places

Calculate IMD outside of patient.save

8339652

in preparation for trying to async batch the calls when uploading a CSV

add a todo for a proper async login decorator as my brain can't compr…

45c5421

…ehend that right now

Finish the async plumbing

e6ec656

It's no faster but there is async everywhere it should be (I think?!). How much faster will it be when we start using an async aware requests library? https://fly.io/django-beats/running-tasks-concurrently-in-django-asynchronous-views/ recommends httpx

make validate_postcode async

3d43834

it lives!

f339132

remove debugging

6bd7e51

Put back the try catch around csv uploads

2c55265

mbarton added the performance label Jul 18, 2024

mbarton self-assigned this Jul 18, 2024

mbarton commented Jul 18, 2024

Fix login_and_otp_required to work with async views

26f45e4

mbarton force-pushed the mbarton/faster-csv-uploads branch from 275d55f to 26f45e4 Compare July 18, 2024 10:14

mbarton mentioned this pull request Jul 22, 2024

Fix duplicate patients with multiple visits in CSV upload #212

Merged

mbarton added the DO NOT MERGE label Jul 22, 2024

eatyourpeas approved these changes Jul 23, 2024

mbarton closed this Jul 24, 2024
