Increase DB chunk size for respondent files #2360

Merged
merged 1 commit into main on Aug 12, 2024

Conversation

matiasgarciaisaia
Member

Respondent files are usually large (interactions files can grow up to 1M rows), and the "low" limit in queries made the DB work much harder than needed (we've observed 99% CPU usage in the mysqld process when generating a 1M-row interactions file with 1000 rows per query).

Increasing this limit makes the app issue fewer queries to the DB, effectively driving the CPU usage down to about 30% instead.

There's probably more room for improvement (the file generation is still CPU-bound rather than network-bound), but that's on the app itself - we should profile the app's code to further improve performance.

See #2350
See #2359

Co-authored-by: Gustavo Giráldez <ggiraldez@manas.tech>
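
For context, here's a minimal sketch of the Stream.resource pagination pattern this PR tunes (module, table, and field names are hypothetical, and @db_chunk_size is an illustrative value - the point is that each fetch costs one query, so a bigger chunk means fewer round-trips):

```elixir
defmodule ExportSketch do
  import Ecto.Query
  alias MyApp.Repo  # hypothetical; stands in for the app's Ecto repo

  # Illustrative value - the PR raises the per-query limit from 1_000.
  @db_chunk_size 100_000

  def stream_interactions(survey_id) do
    Stream.resource(
      # start: begin at offset 0
      fn -> 0 end,
      # next: fetch one chunk per DB query
      fn offset ->
        rows =
          from(i in "respondent_interactions",  # hypothetical table
            where: i.survey_id == ^survey_id,
            order_by: i.id,
            limit: ^@db_chunk_size,
            offset: ^offset,
            select: %{id: i.id, respondent_id: i.respondent_id}
          )
          |> Repo.all()

        case rows do
          [] -> {:halt, offset}
          rows -> {rows, offset + length(rows)}
        end
      end,
      # cleanup: nothing to release
      fn _offset -> :ok end
    )
  end
end
```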
@matiasgarciaisaia
Member Author

We suspect the CSV encoder may be suboptimal.

We've also tested with 100k and 50k limits for the interactions files, but the performance was slightly worse (startup took a bit longer, and the overall speed didn't improve further).
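
If the encoder does turn out to be the hot spot, one untried idea (a sketch only - nothing here names the encoder the app actually uses) would be a compiled encoder like NimbleCSV, which dumps rows as iodata that can be piped straight into a file stream:

```elixir
# Hypothetical sketch, not part of this PR.
NimbleCSV.define(ExportCSV, separator: ",", escape: "\"")

rows = [~w(id state), ~w(1 active), ~w(2 completed)]

rows
|> Stream.map(&ExportCSV.dump_to_iodata([&1]))  # one iodata chunk per row
|> Stream.into(File.stream!("interactions.csv"))
|> Stream.run()
```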

@matiasgarciaisaia
Member Author

CC: @ggiraldez in case you want to add anything else.

@ggiraldez
Member

> CC: @ggiraldez in case you want to add anything else.

I know very little of Elixir or Ecto, but you may also want to explore streaming directly from the database. The MySQL driver supports it, although it requires a transaction, which may perhaps be a deal breaker? Anyway, see https://hexdocs.pm/ecto/Ecto.Repo.html#c:stream/2
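
Roughly, the shape would be something like this (hedged sketch - I'm assuming a MyApp.Repo module and a placeholder encoder; the transaction wrapper is what Ecto requires, since the stream keeps a cursor open until it's consumed):

```elixir
defmodule StreamExportSketch do
  alias MyApp.Repo  # hypothetical repo module

  def export(query, path) do
    # Repo.stream/2 is lazy and must run inside Repo.transaction/2
    Repo.transaction(
      fn ->
        query
        |> Repo.stream(max_rows: 1_000)
        |> Stream.map(&encode_row/1)
        |> Stream.into(File.stream!(path))
        |> Stream.run()
      end,
      timeout: :infinity  # large exports can outlive the default timeout
    )
  end

  # Placeholder for whatever turns a row into a CSV line.
  defp encode_row(row), do: [inspect(row), "\n"]
end
```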

@anaPerezGhiglia
Contributor

In line with @ggiraldez's suggestion, I realized that the incentives file is built using Repo.stream instead of Stream.resource.
I think it's worth the effort to try changing the other three files to be built this way and see how the servers behave.

@matiasgarciaisaia
Member Author

I tried changing the queries to use Ecto's Repo.stream (instead of manually doing Stream.resource and paginating from the app), but preloads are not supported on streams 🫠

There may still be room for doing the CSV streaming straight from the database (i.e., make MySQL output CSV) as @ggiraldez suggested, but I'm not sure whether that'll work - I'll leave this as is; we might explore that optimization if we eventually need it. Given we're about to make the file generation async (in #2350), improving the times won't be that important, either.
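
For the record, the workaround we'd probably try if we pick this up again (untested sketch; the association and writer names are made up) is preloading per chunk after the stream materializes rows, since Repo.preload/2 does work on plain lists:

```elixir
# Untested sketch: preloads aren't supported on Repo.stream/2 itself,
# but can be applied to each materialized chunk.
Repo.transaction(fn ->
  query
  |> Repo.stream()
  |> Stream.chunk_every(1_000)
  |> Stream.flat_map(&Repo.preload(&1, [:responses]))  # hypothetical assoc
  |> Stream.each(&write_csv_row/1)                     # hypothetical writer
  |> Stream.run()
end)
```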

@matiasgarciaisaia merged commit 98ba225 into main on Aug 12, 2024
2 checks passed