Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refact: optimize stelae database by hashing composite keys; small ergonomic improvements #44

Merged
merged 7 commits into from
Jun 19, 2024

Conversation

n-dusan
Copy link
Contributor

@n-dusan n-dusan commented Jun 11, 2024

Closes #39

A couple of notable changes:

Database improvements.

Ergonomics and QoL improvements:

n-dusan added 6 commits June 7, 2024 18:11
Two notable changes:

Remove `inserts.rs` and `queries.rs` modules and move existing methods
to it's own `models/*/manager.rs` module.
This change makes it really clear which model the method belongs to, and
hopefully increase codebase legibility. It should now be quite obvious
where in our codebase we're calling out to database operations.

Introduce `TxManager` trait, which holds all the methods that work with
a database transaction. Currently these methods are functions that
insert change objects in the database. Ideally we would want the
`Manager` and `TxManager` to be a single trait, but we would need to
know which Pool or Transaction object we use at compile time. This
seemed too tricky to figure out quickly, so instead opt to call out to
`TxManager` when managing transactions, and `Manager` when managing
regular connection calls (outside of a transaction).

Fix insert for multiple stele. Previously everything was getting applied
to a connection, instead of a transaction. This was because we used
`DatabaseConnection` struct everywhere. What we instead wanted was the
`DatabaseTransaction` struct, which runs the sqlx queries/inserts within
a transaction.

Add error logging to git2::walk method and fix the `find_blob` method to
not fail during the git repo walk.
- This commit aims at optimizing the `db.sqlite3` size.
- The sqlite db size is now down from 47 MB to 30 MB for one stele by hashing composite keys.
- Redundant string columns are removed.
- Add `utils/md5.rs` module for utility hashing. The composite columns
  are now hashed to make the db size more compact.
- Update versions endpoint following the refactor
- Update database model, structs and SQL statements following the
  optimization.
The purpose of `stelae update` will be twofold:

1) Pull down any changes to the default stelae using TAF updater.
2) Update database.

Currently stelae only does the latter. Leave an inline comment with our
long term goals.
Progress on #36

Added `CliError` enum that maps any/all errors related to working with
stelae CLI.

Updated `run` to not return any errors, but instead to decide how to
stop the process. We expect to currently return error code 1 on any
found stelae errors, and instruct the users to inspect the local logs.

Still have issues with centralizing errors found during initializing
actix `App` instance. The reason this error handling is tricky is
because we're passing in the generic, opaque `App` type to both `app.rs`
and `routes.rs`. I tried initializing the `init_app` in `app.rs` outside
of the `HttpServer::new(..)`, but the problem was that `App` did not
implement `Clone` trait, so could not get it working quickly. This is
something we'd like to resolve in the future, so leave the process
exits uncentralized for now, and leave a comment in `app.rs`.
@n-dusan n-dusan requested a review from tombh June 11, 2024 16:49
@n-dusan n-dusan self-assigned this Jun 11, 2024
Copy link
Contributor

@tombh tombh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

The only thing I'm unclear on is why you have some single use traits for database transactions. For example the trait that defines find_all_document_versions_by_mpath_and_publication, it's only ever implemented once. I usually think of a trait indicating that something is going to be implemented more than once.


/// Convert a `Status` enum to an integer.
#[must_use]
pub const fn to_int(&self) -> i64 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor, but Postgres supports enums https://www.postgresql.org/docs/current/datatype-enum.html I don't know how easy it is to actually map Rust enums to Postgres enums, but it's something worth bearing in mind. Because seeing say ElementAdded when debugging raw SQL output is much nicer than just seeing 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. We'll probably completely transition away from using Postgres. Then I'll remove all the Postgres statements and generic connections that we currently have in the code, so this might not be important?

tracing::error!("Error: {:?}", err);
process::exit(1);
tracing::error!("Error: {err:?}");
// NOTE: We should not need to exit code 1 here (or in any of the closures in `routes.rs`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I'm surprised the error here doesn't get propogated to the error in run() that your already catching. But I don't think panicing here is so bad, because it's just at app startup anyway, what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, I'll say we leave it here for now

@n-dusan
Copy link
Contributor Author

n-dusan commented Jun 17, 2024

Thanks for the review. My thought for traits is that they were a convenient way of mapping out all the queries, because then we can mock/test the database queries in our testing framework. Do you think there might be a better way of doing it? I'll test my hypothesis by adding tests hopefully in the near future!

@n-dusan n-dusan requested a review from tombh June 17, 2024 17:50
@tombh
Copy link
Contributor

tombh commented Jun 18, 2024

Ah yes, traits make sense for testing. I can't say I've ever actually mocked DB calls, in any language. What are the advantages? I suppose the main one is that it speeds up tests? It seems a bit brittle to me, that the mocked responses could become stale, creating false positives. And that some extra coverage is gained by testing against a real database.

But anyway, using traits to aid tests is a legitimate use case, so all good.

@n-dusan n-dusan merged commit c4a5936 into main Jun 19, 2024
10 checks passed
@n-dusan n-dusan deleted the ndusan/optimize-history-database branch June 19, 2024 09:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize sqlite database
2 participants