refact: optimize stelae database by hashing composite keys; small ergonomic improvements #44

n-dusan · 2024-06-11T16:49:02Z

Closes #39

A couple of notable changes:

Database improvements:

- Moved database functions from `statements/inserts.rs` and `statements/queries.rs` to `models/*/manager.rs` modules. This change shows up as a very large git diff for all those functions, but it is mostly just codebase reorganization.
- Added the `md-5` crate to hash the composite key columns into the primary key, instead of using composite keys everywhere. This brings the database size down significantly, since we no longer store expensive duplicate text column values in tables.
Commits:

Ergonomics and QoL improvements:

Small adjustments to stelae in preparation for a 0.3.0 release.
- Renamed `stelae insert-history` to `stelae update`.
- Added an update instrument and small tracing improvements.
- Progress on centralizing errors. Unfortunately I wasn't able to complete this rewrite in the `app.rs` and `routes.rs` modules, since returning/propagating errors from closures when we return an opaque `App` type was causing compilation issues. I tried to explain the reasoning behind the issue in an `app.rs` comment. I think we can resolve this at a later point?
Commits:

Two notable changes:

- Remove `inserts.rs` and `queries.rs` modules and move existing methods to their own `models/*/manager.rs` modules. This change makes it really clear which model each method belongs to, and hopefully increases codebase legibility. It should now be quite obvious where in our codebase we're calling out to database operations.
- Introduce `TxManager` trait, which holds all the methods that work with a database transaction. Currently these methods are functions that insert change objects in the database. Ideally we would want `Manager` and `TxManager` to be a single trait, but we would need to know at compile time whether we use a Pool or a Transaction object. This seemed too tricky to figure out quickly, so instead opt to call out to `TxManager` when managing transactions, and `Manager` when managing regular connection calls (outside of a transaction).

Fix insert for multiple stelae. Previously everything was getting applied to a connection, instead of a transaction. This was because we used the `DatabaseConnection` struct everywhere. What we wanted instead was the `DatabaseTransaction` struct, which runs the sqlx queries/inserts within a transaction.

Add error logging to git2::walk method and fix the `find_blob` method to not fail during the git repo walk.
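The split between `Manager` and `TxManager` can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: `Connection`, `Transaction`, `find_changes`, and `insert_change` are hypothetical in-memory stand-ins for the sqlx-backed types and methods.

```rust
/// Hypothetical stand-in for a regular database connection.
#[derive(Default)]
pub struct Connection {
    rows: Vec<String>,
}

/// Hypothetical stand-in for a transaction: writes are staged until `commit`.
pub struct Transaction<'c> {
    conn: &'c mut Connection,
    staged: Vec<String>,
}

impl Connection {
    /// Start a transaction that borrows this connection exclusively.
    pub fn begin(&mut self) -> Transaction<'_> {
        Transaction { conn: self, staged: Vec::new() }
    }
}

impl Transaction<'_> {
    /// Apply all staged writes at once; dropping without commit discards them.
    pub fn commit(self) {
        let Transaction { conn, staged } = self;
        conn.rows.extend(staged);
    }
}

/// Queries that run against a regular connection, outside of a transaction.
pub trait Manager {
    fn find_changes(&self) -> Vec<String>;
}

/// Inserts that must run inside a transaction.
pub trait TxManager {
    fn insert_change(&mut self, change: &str);
}

impl Manager for Connection {
    fn find_changes(&self) -> Vec<String> {
        self.rows.clone()
    }
}

impl TxManager for Transaction<'_> {
    fn insert_change(&mut self, change: &str) {
        self.staged.push(change.to_owned());
    }
}
```

Because each trait is implemented for exactly one type in this sketch, a write cannot accidentally be applied to a plain connection, which is the class of bug the multiple-stelae fix addresses.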

- This commit aims at optimizing the `db.sqlite3` size.
- The sqlite db size is now down from 47 MB to 30 MB for one stele by hashing composite keys.
- Redundant string columns are removed.
- Add `utils/md5.rs` module for utility hashing. The composite columns are now hashed to make the db size more compact.
- Update versions endpoint following the refactor.
- Update database model, structs and SQL statements following the optimization.
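The idea of collapsing a composite key into one hashed column can be sketched as below. The PR uses the `md-5` crate; to keep this sketch dependency-free it substitutes a hand-written 64-bit FNV-1a hash, so the digests differ from real MD5 output. The function names and the separator character are assumptions for illustration.

```rust
/// 64-bit FNV-1a, used here only as a dependency-free stand-in for MD5.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for byte in bytes {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Collapse the columns of a composite key into one fixed-width hex string
/// (hypothetical helper, not the function from `utils/md5.rs`).
pub fn hash_composite_key(columns: &[&str]) -> String {
    // Join with a separator so ("ab", "c") and ("a", "bc") hash differently.
    let joined = columns.join("\u{1f}");
    format!("{:016x}", fnv1a_64(joined.as_bytes()))
}
```

Each row then stores one short, fixed-width key instead of repeating every text column of the composite key, which is where the size reduction comes from.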

The purpose of `stelae update` will be twofold: 1) Pull down any changes to the default stelae using TAF updater. 2) Update database. Currently stelae only does the latter. Leave an inline comment with our long term goals.

Progress on #36.

Added `CliError` enum that maps any/all errors related to working with the stelae CLI. Updated `run` to not return any errors, but instead to decide how to stop the process. For now we exit with error code 1 on any stelae error and instruct users to inspect the local logs.

We still have issues with centralizing errors found while initializing the actix `App` instance. This error handling is tricky because we're passing the generic, opaque `App` type to both `app.rs` and `routes.rs`. I tried calling `init_app` in `app.rs` outside of `HttpServer::new(..)`, but `App` does not implement the `Clone` trait, so I could not get it working quickly. This is something we'd like to resolve in the future, so leave the process exits uncentralized for now, and leave a comment in `app.rs`.
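The shape of this error handling might look roughly like the sketch below. It is an illustration under assumptions: the `CliError` variants, the `execute` helper, and the `fail` flag are hypothetical, and the real `run` takes different arguments.

```rust
use std::fmt;

/// Errors that can surface while running the CLI.
/// The variants are hypothetical; the commit message does not list them.
#[derive(Debug)]
pub enum CliError {
    /// Something went wrong while talking to the database.
    Database(String),
    /// Anything else that should stop the process.
    Generic(String),
}

impl fmt::Display for CliError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Database(msg) => write!(f, "database error: {msg}"),
            Self::Generic(msg) => write!(f, "error: {msg}"),
        }
    }
}

impl std::error::Error for CliError {}

/// Hypothetical command body: all fallible work returns `CliError`.
fn execute(fail: bool) -> Result<(), CliError> {
    if fail {
        return Err(CliError::Database("could not open db.sqlite3".to_owned()));
    }
    Ok(())
}

/// `run` never returns an error; it decides how the process should stop.
pub fn run(fail: bool) -> i32 {
    match execute(fail) {
        Ok(()) => 0,
        Err(err) => {
            // Report the error and point the user at the logs.
            eprintln!("{err}. Check the local logs for details.");
            1
        }
    }
}
```

A `main` function would then pass the returned code to `std::process::exit`, keeping the exit decision in one place.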

tombh

Looks good.

The only thing I'm unclear on is why you have some single-use traits for database transactions. For example, the trait that defines `find_all_document_versions_by_mpath_and_publication` is only ever implemented once. I usually think of a trait as indicating that something is going to be implemented more than once.

tombh · 2024-06-14T17:16:01Z

src/db/models/status/mod.rs

+
+    /// Convert a `Status` enum to an integer.
+    #[must_use]
+    pub const fn to_int(&self) -> i64 {


Very minor, but Postgres supports enums: https://www.postgresql.org/docs/current/datatype-enum.html. I don't know how easy it is to actually map Rust enums to Postgres enums, but it's worth bearing in mind, because seeing, say, `ElementAdded` when debugging raw SQL output is much nicer than just seeing 0.

Thanks for the suggestion. We'll probably completely transition away from using Postgres. Then I'll remove all the Postgres statements and generic connections that we currently have in the code, so this might not be important?
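For context, an integer mapping like the one under review could look like the sketch below. Only `ElementAdded` and the `to_int` signature appear in this thread; the other variants, the concrete integer values, and `from_int` are assumptions added for illustration.

```rust
/// Change status of an element. Only `ElementAdded` is named in the review;
/// the remaining variants and all integer values are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    ElementAdded,
    ElementChanged,
    ElementRemoved,
}

impl Status {
    /// Convert a `Status` enum to an integer.
    #[must_use]
    pub const fn to_int(&self) -> i64 {
        match *self {
            Self::ElementAdded => 0,
            Self::ElementChanged => 1,
            Self::ElementRemoved => 2,
        }
    }

    /// Convert a stored integer back into a `Status`, if it is known
    /// (hypothetical helper).
    #[must_use]
    pub const fn from_int(value: i64) -> Option<Self> {
        match value {
            0 => Some(Self::ElementAdded),
            1 => Some(Self::ElementChanged),
            2 => Some(Self::ElementRemoved),
            _ => None,
        }
    }
}
```

A reverse mapping of this kind is one way to turn raw integers back into readable names when inspecting query output.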

tombh · 2024-06-14T17:18:57Z

src/server/app.rs

-            tracing::error!("Error: {:?}", err);
-            process::exit(1);
+            tracing::error!("Error: {err:?}");
+            // NOTE: We should not need to exit code 1 here (or in any of the closures in `routes.rs`).


Interesting. I'm surprised the error here doesn't get propagated to the error in run() that you're already catching. But I don't think panicking here is so bad, because it's just at app startup anyway. What do you think?

Agree, I'll say we leave it here for now

n-dusan · 2024-06-17T17:49:50Z

Thanks for the review. My thought for traits is that they were a convenient way of mapping out all the queries, because then we can mock/test the database queries in our testing framework. Do you think there might be a better way of doing it? I'll test my hypothesis by adding tests hopefully in the near future!

tombh · 2024-06-18T15:36:56Z

Ah yes, traits make sense for testing. I can't say I've ever actually mocked DB calls, in any language. What are the advantages? I suppose the main one is that it speeds up tests? It seems a bit brittle to me: the mocked responses could become stale, creating false positives, and some extra coverage is gained by testing against a real database.

But anyway, using traits to aid tests is a legitimate use case, so all good.
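A trait used this way might be mocked roughly as in the sketch below. The method name comes from the earlier review comment, but its signature is an assumption, and `DocumentQueries`, `MockDb`, and `latest_version` are hypothetical names introduced only for this example.

```rust
/// Query interface; the parameter and return types here are assumed.
pub trait DocumentQueries {
    fn find_all_document_versions_by_mpath_and_publication(
        &self,
        mpath: &str,
        publication: &str,
    ) -> Vec<String>;
}

/// Test double that returns canned rows instead of hitting a database.
pub struct MockDb {
    pub versions: Vec<String>,
}

impl DocumentQueries for MockDb {
    fn find_all_document_versions_by_mpath_and_publication(
        &self,
        _mpath: &str,
        _publication: &str,
    ) -> Vec<String> {
        self.versions.clone()
    }
}

/// Code under test depends only on the trait, so a mock can be swapped in.
pub fn latest_version<Q: DocumentQueries>(
    db: &Q,
    mpath: &str,
    publication: &str,
) -> Option<String> {
    db.find_all_document_versions_by_mpath_and_publication(mpath, publication)
        .into_iter()
        .max()
}
```

The trade-off raised in the thread still applies: the canned rows in `MockDb` only stay useful as long as they match what the real queries return.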

n-dusan added 6 commits June 7, 2024 18:11

refact(model): use new to initialize sqlx structs

fad31c1

refact(cli): rename cli insert-history to update

e0a6285

refact: add instrument span to insert changes, small tracing improvem…

35cc374

…ents

n-dusan requested a review from tombh June 11, 2024 16:49

n-dusan self-assigned this Jun 11, 2024

chore(build): tag pre-release

461c50c

tombh reviewed Jun 14, 2024

n-dusan requested a review from tombh June 17, 2024 17:50

tombh approved these changes Jun 18, 2024

n-dusan merged commit c4a5936 into main Jun 19, 2024
10 checks passed

n-dusan deleted the ndusan/optimize-history-database branch June 19, 2024 09:32
