Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: errors per module [WPB-14354] #791

Merged
merged 31 commits into from
Dec 20, 2024

Conversation

coriolinus
Copy link
Contributor

@coriolinus coriolinus commented Nov 29, 2024

What's new in this PR

Refactors core-crypto such that each logical module defines its own internal error type, and adds context whenever wrapping another module's error type.

  • core_crypto::e2e_identity
  • core_crypto::mls::client
  • core_crypto::mls::conversation
  • core_crypto::mls::credential
  • core_crypto::mls
  • core_crypto

Logical modules are assumed to be the set of peer modules at a particular level in the directory tree.

PR notes

Verbose {context, source} variants

There's quite a lot of manual coding here, and the usage pattern is ok, but it could be improved. I put together a macro library (context-err) a while ago which is intended to improve the situation, but I don't recommend it for production use yet. Right now the macro is very verbose, but still restrictive: it doesn't allow for boxing or 'static str contexts. We should think about whether we want to take this more in that direction.

General-purpose variants which should be collapsed

There are at least a few general-purpose {context, source} variants which could usefully be collapsed into more specific variants. I haven't yet done the analysis to identify all of these. In general the rule I will use is, if there are two or fewer distinct contexts in use, then they should just be variants of their own. If there are three or more, we should use the more general form.

Overall factoring

I've factored down the error types as finely as I think I reasonably can. Usage is relatively straightforward, and we end up with nice error stacks with a lot of context.

The downside of this is that when you actually get into the weeds of it, the error model in this crate is now insanely complex. Lots of simple pieces attaching to each other and converting between each other in non-simple ways.

I think this is the best we can do for now, but if we decide to do a Big Refactor after the Whiteboard Week in January, then we should definitely refactor and simplify the error model at that time as well.


PR Submission Checklist for internal contributors
  • The PR Title
    • conforms to the style of semantic commits messages¹ supported in Wire's Github Workflow²
    • contains a reference JIRA issue number like SQPIT-764
    • answers the question: If merged, this PR will: ... ³
  1. https://sparkbox.com/foundry/semantic_commit_messages
  2. https://github.com/wireapp/.github#usage
  3. E.g. feat(conversation-list): Sort conversations by most emojis in the title #SQPIT-764.

@coriolinus
Copy link
Contributor Author

Currently there are errors when running cargo nextest run -p core-crypto --no-fail-fast:

  • mls::conversation::leaf_node_validation::tests::stages::should_validate_leaf_node_when_adding (x3): also appears to fail in main

~10 other errors. The set of failing cases changes each time I run the test command, and running these cases individually always seems to produce a passing output. So I'm going to mark this down to flakiness and not worry too much about fixing them.

@coriolinus coriolinus force-pushed the prgn/refactor/errors-per-module branch from f966cbb to a14bad0 Compare December 9, 2024 15:19
Copy link

github-actions bot commented Dec 9, 2024

🐰 Bencher Report

Branchprgn/refactor/errors-per-module
Testbedubuntu-latest
Click to view all benchmark results
BenchmarkLatencynanoseconds (ns)
Commit add f(group size)/cs1/mem/1002📈 view plot
🚷 view threshold
19,569,000.00
Commit add f(group size)/cs1/mem/2📈 view plot
🚷 view threshold
7,052,400.00
Commit add f(group size)/cs1/mem/202📈 view plot
🚷 view threshold
9,395,500.00
Commit add f(group size)/cs1/mem/402📈 view plot
🚷 view threshold
12,120,000.00
Commit add f(group size)/cs1/mem/602📈 view plot
🚷 view threshold
14,604,000.00
Commit add f(group size)/cs1/mem/802📈 view plot
🚷 view threshold
17,073,000.00
Commit add f(number clients)/cs1/mem/1002📈 view plot
🚷 view threshold
992,590,000.00
Commit add f(number clients)/cs1/mem/2📈 view plot
🚷 view threshold
6,886,200.00
Commit add f(number clients)/cs1/mem/202📈 view plot
🚷 view threshold
84,651,000.00
Commit add f(number clients)/cs1/mem/402📈 view plot
🚷 view threshold
221,870,000.00
Commit add f(number clients)/cs1/mem/602📈 view plot
🚷 view threshold
429,750,000.00
Commit add f(number clients)/cs1/mem/802📈 view plot
🚷 view threshold
682,910,000.00
Commit pending proposals f(group size)/cs1/mem/1002📈 view plot
🚷 view threshold
117,850,000.00
Commit pending proposals f(group size)/cs1/mem/2📈 view plot
🚷 view threshold
28,654,000.00
Commit pending proposals f(group size)/cs1/mem/202📈 view plot
🚷 view threshold
47,226,000.00
Commit pending proposals f(group size)/cs1/mem/402📈 view plot
🚷 view threshold
62,326,000.00
Commit pending proposals f(group size)/cs1/mem/602📈 view plot
🚷 view threshold
79,591,000.00
Commit pending proposals f(group size)/cs1/mem/802📈 view plot
🚷 view threshold
95,394,000.00
Commit pending proposals f(pending size)/cs1/mem/1📈 view plot
🚷 view threshold
19,427,000.00
Commit pending proposals f(pending size)/cs1/mem/101📈 view plot
🚷 view threshold
116,690,000.00
Commit pending proposals f(pending size)/cs1/mem/21📈 view plot
🚷 view threshold
36,101,000.00
Commit pending proposals f(pending size)/cs1/mem/41📈 view plot
🚷 view threshold
57,381,000.00
Commit pending proposals f(pending size)/cs1/mem/61📈 view plot
🚷 view threshold
76,328,000.00
Commit pending proposals f(pending size)/cs1/mem/81📈 view plot
🚷 view threshold
95,746,000.00
Commit remove f(group size)/cs1/mem/1002📈 view plot
🚷 view threshold
27,643,000.00
Commit remove f(group size)/cs1/mem/2📈 view plot
🚷 view threshold
7,044,600.00
Commit remove f(group size)/cs1/mem/202📈 view plot
🚷 view threshold
9,172,200.00
Commit remove f(group size)/cs1/mem/402📈 view plot
🚷 view threshold
11,702,000.00
Commit remove f(group size)/cs1/mem/602📈 view plot
🚷 view threshold
17,078,000.00
Commit remove f(group size)/cs1/mem/802📈 view plot
🚷 view threshold
22,308,000.00
Commit remove f(number clients)/cs1/mem/1002📈 view plot
🚷 view threshold
30,169,000.00
Commit remove f(number clients)/cs1/mem/2📈 view plot
🚷 view threshold
137,310,000.00
Commit remove f(number clients)/cs1/mem/202📈 view plot
🚷 view threshold
115,120,000.00
Commit remove f(number clients)/cs1/mem/402📈 view plot
🚷 view threshold
93,815,000.00
Commit remove f(number clients)/cs1/mem/602📈 view plot
🚷 view threshold
71,952,000.00
Commit remove f(number clients)/cs1/mem/802📈 view plot
🚷 view threshold
51,076,000.00
Commit update f(group size)/cs1/mem/1002📈 view plot
🚷 view threshold
140,020,000.00
Commit update f(group size)/cs1/mem/2📈 view plot
🚷 view threshold
7,051,600.00
Commit update f(group size)/cs1/mem/202📈 view plot
🚷 view threshold
33,810,000.00
Commit update f(group size)/cs1/mem/402📈 view plot
🚷 view threshold
59,886,000.00
Commit update f(group size)/cs1/mem/602📈 view plot
🚷 view threshold
86,436,000.00
Commit update f(group size)/cs1/mem/802📈 view plot
🚷 view threshold
111,820,000.00
🐰 View full continuous benchmarking report in Bencher

@coriolinus coriolinus force-pushed the prgn/refactor/errors-per-module branch from a14bad0 to fc14d7b Compare December 9, 2024 15:32
@coriolinus
Copy link
Contributor Author

Looks like the flaky cases have vanished after this rebase; presumably the work done in the interim made the difference.

@coriolinus coriolinus force-pushed the prgn/refactor/errors-per-module branch 2 times, most recently from a62ed73 to 0971126 Compare December 13, 2024 09:24
@codecov-commenter
Copy link

codecov-commenter commented Dec 13, 2024

Codecov Report

Attention: Patch coverage is 56.97813% with 1082 lines in your changes missing coverage. Please review.

Project coverage is 74.21%. Comparing base (97cf75b) to head (e93cee6).
Report is 34 commits behind head on main.

Files with missing lines Patch % Lines
crypto-ffi/src/generic/mod.rs 0.00% 276 Missing ⚠️
crypto/src/mls/mod.rs 40.52% 113 Missing ⚠️
crypto/src/mls/conversation/commit.rs 5.10% 93 Missing ⚠️
crypto/src/proteus.rs 65.62% 77 Missing ⚠️
crypto/src/mls/external_commit.rs 47.00% 53 Missing ⚠️
crypto/src/mls/client/mod.rs 68.69% 36 Missing ⚠️
crypto/src/mls/conversation/decrypt.rs 50.70% 35 Missing ⚠️
crypto-ffi/src/generic/context/mod.rs 0.00% 30 Missing ⚠️
crypto/src/context.rs 43.13% 29 Missing ⚠️
crypto/src/mls/client/key_package.rs 68.13% 29 Missing ⚠️
... and 34 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #791      +/-   ##
==========================================
- Coverage   75.15%   74.21%   -0.94%     
==========================================
  Files         107      116       +9     
  Lines       19599    20729    +1130     
==========================================
+ Hits        14729    15384     +655     
- Misses       4870     5345     +475     
Files with missing lines Coverage Δ
crypto-ffi/src/lib.rs 0.00% <ø> (ø)
crypto/src/e2e_identity/refresh_token.rs 100.00% <100.00%> (ø)
crypto/src/e2e_identity/stash.rs 98.14% <100.00%> (+0.32%) ⬆️
crypto/src/error/keystore.rs 100.00% <100.00%> (ø)
crypto/src/error/mod.rs 100.00% <100.00%> (ø)
crypto/src/error/wrapper.rs 100.00% <100.00%> (ø)
crypto/src/lib.rs 37.50% <ø> (ø)
crypto/src/mls/buffer_external_commit.rs 97.54% <100.00%> (+0.28%) ⬆️
crypto/src/mls/client/id.rs 59.25% <ø> (ø)
crypto/src/mls/client/identifier.rs 100.00% <100.00%> (ø)
... and 58 more

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 97cf75b...e93cee6. Read the comment docs.

@coriolinus coriolinus force-pushed the prgn/refactor/errors-per-module branch from fc70892 to 7f3296d Compare December 13, 2024 14:48
@coriolinus coriolinus marked this pull request as ready for review December 13, 2024 15:42
@coriolinus coriolinus requested a review from a team as a code owner December 13, 2024 15:42
By always returning a local error type, we open the door to removing
the centralized error types entirely in favor of a more specific error stack.

Note that this commit introduces a deprecated wrapper aroud `CryptoError`
within `e2e_identity::error::Error`. This will be removed shortly, but
it is necessary as an interim step, so that we don't have to rewrite
actually everything, all at once.

Note also that this affects some files within `mls`. This is required
in order to keep everything buildable.
@coriolinus coriolinus force-pushed the prgn/refactor/errors-per-module branch from f8a756b to 3dedc9e Compare December 17, 2024 14:49
crypto/src/proteus.rs Outdated Show resolved Hide resolved
crypto/src/error/legacy.rs Outdated Show resolved Hide resolved
crypto-ffi/src/generic/mod.rs Show resolved Hide resolved
crypto-ffi/src/wasm/mod.rs Show resolved Hide resolved
Comment on lines +121 to +123
.mls_client()
.await
.map_err(RecursiveError::root("getting mls client"))?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This repeated in quite a few places. Are we ever getting calling mls_client() with another intention than "getting mls client"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question also applies for mls_provider()

Copy link
Contributor Author

@coriolinus coriolinus Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but I don't see a better way to structure this.

The .mls_client() method (and mls_provider() as well) returns a root error: core_crypto::Error, implemented in crypto/src/error/mod.rs. But we call that method from all over the place, and we need to create all kinds of different errors from that. In this case what the function actually returns is a conversation error: core_crypto::mls::conversation::Error, defined in crypto/src/mls/conversation/error.rs.

You can see for yourself what's going on here: RecursiveError::root returns a function which wraps the root error and adds the context line, and then the ? operator uses the From<RecursiveError> implementation (from the #[from] annotation) to actually convert the error. But the point of the context field is to distinguish the cases. Quite likely, this "getting mls client" error is not the top-level error. It's common to get a MLS client and provider very near each other, and both operations will end up being mapped with RecursiveError::root. Having this context allows us to distinguish which of those operations failed.

Maybe you're thinking that the answer would be to have the mls_client method do its own wrapping: instead of returning Result<Client, core_crypto::Error>, it might return Result<Client, E> where RecursiveError: Into<E>. That would allow us to inline the context at least, avoid explicitly specifying that we're getting the mls client each time we get the mls client. Unfortunately, there are problems with that approach:

  1. It breaks the pattern that each group of module always returns the internal error type for that group of modules. Now we're returning something based on this internal error type, but we can't say exactly what until we know the context of our caller.
  2. It forces us to wrap everything everywhere in RecursiveError. That makes it more difficult, not less, to determine what the internal error type is.
  3. It breaks the ? operator. The reason for this is that this defines the returned type in terms of Into. But the ? operator also applies an Into conversion. So desugaring everything looks like:
    let client = match self.mls_client() {
      Ok(client) => client,
      Err(e) => return e.into(),
    };
    But Rust can't infer what type e should have there, because any number of types might potentially satisfy both into conditions.

Maybe I'm missing something. But I do think that this somewhat-repetitive pattern is the best option we have for now.

(But it's worth mentioning that avoiding this kind of mess is a big driver of my desire to do a big refactoring that we can talk about next month! I don't have the domain knowledge yet to come into that week with a plan for what precisely we should do. But my hope is by the end of the week we will have designed a structure that lets avoid RecursiveError entirely.)

crypto/src/mls/conversation/renew.rs Outdated Show resolved Hide resolved
crypto/src/mls/credential/mod.rs Outdated Show resolved Hide resolved
crypto/src/mls/credential/mod.rs Outdated Show resolved Hide resolved
crypto/src/mls/error.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@SimonThormeyer SimonThormeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, impressive work :)

@istankovic
Copy link
Member

I'm wondering if we can find a better name for LeafError. Leaf nodes are part of MLS and this error name may cause the reader to think that the error has something to do with those nodes, so it would be nice if we could avoid the ambiguity.

crypto/src/proteus.rs Outdated Show resolved Hide resolved
@istankovic
Copy link
Member

I would drop the two empty commits:
378e53de095fca72cb52cc05655b626ab3a9191a
4d791f57f67b5591f5d9eb1c483813341e787f9f

to avoid confusing whoever comes across them later.

…al error type

Note that there are a number of test failures as of this commit,
as the error types change. I am intentionally permitting this,
with the intent to fix them all up once we have arrived at the
final state of the error stacks.
Also eliminates module-internal references to the `CryptoError` type.

Intentionally does not yet remove the `CryptoError` type, because it
is likely that I've missed something in different build configurations
than I am testing by default, so I want to keep it around until I have
more confidence that it is really obsolete.
Also get rid of the entirely superfluous documentation requirements.
Error variants are simple, and have explicit documentation anyway
in the form of the display implementation.
…m various modules

Certain error variants were duplicated across a variety of module-specific errors.
This could lead to problems in case we wanted a special-case handler for one
of those error variants.

This commit moves leaf error types into a common error type which can be extracted
from module-specific errors.
Three main objectives in this commit:

- introduce a small module hierarchy for errors, so that it is possible
  to better navigate through the error types
- reduce code duplication by factoring out common patterns
- remove unnecessary error dynamism by utilizing existing error kind enums

We're finally beginning to approach a situation I would consider satisfactory
with regard to error handling:

- There is a top-level `error` module containing general-purpose crate-level errors
- It also contains specialty errors with which we can construct general-purpose wrapped error stacks
- Each code module group (i.e. folder) also contains its own internal error type, which
  contains unique leaf errors and sufficient information to wrap other crate error types.

I still need to analyze and eliminate leaf types which are never constructed,
fix the tests, and get rid of the legacy `CryptoError` entirely, but finally
the end of this work is in sight.
At this point the next failing case appears to be related to logic
errors, not type errors. Which is unfortunate, as none of the logic
was supposed to change.
Though running `cargo nextest run -p core-crypto --no-fail-fast` always
produces approx 10 failing tests, going through that list is not profitable.
The failing cases fall into two buckets:

- `mls::conversation::leaf_node_validation::tests::stages::should_validate_leaf_node_when_adding`,
  which also appears to fail on `main`
- the rest of the errors. This set varies each time I run a test job,
  and each case I've tried passes when run in isolation. So I have to
  assume that there is some kind of concurrency problem in the test suite,
  which is outside the scope of this PR.

As such I am deciding to move on to broader testing instead of continuing
to debug these test cases.
The error-handling status quo in `core_crypto` is wildly different now,
so we needed to change quite a lot about how `core_crypto_ffi` used
those errors.

This commit causes `core-crypto-ffi` to build properly.

Weirdly, this was _much_ simpler in wasm than doing so for non-wasm. I'll take
it though!
These are _very sensitive_ to Cargo conditions though, on my machine at least.
So now that they've passed once locally I am going to very carefully
set them on the shelf and not breathe on them.
Basic error reporting and debug representations of things hide `Box`es.
It's also not trivial to downcast into them, particularly in the context
of a macro like this!

Non-trivial does not mean impossible, though. Adding this capability
to the macro means that we can avoid an ugly workaround in the test suite.
@coriolinus coriolinus force-pushed the prgn/refactor/errors-per-module branch from 80ae45a to e93cee6 Compare December 20, 2024 09:45
Copy link
Member

@istankovic istankovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a heroic effort. Well done! 🍷

@coriolinus coriolinus merged commit adfa067 into main Dec 20, 2024
33 checks passed
@coriolinus coriolinus deleted the prgn/refactor/errors-per-module branch December 20, 2024 10:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants