refactor: errors per module [WPB-14354] #791

coriolinus · 2024-11-29T16:56:09Z

What's new in this PR

Refactors core-crypto such that each logical module defines its own internal error type, and adds context whenever wrapping another module's error type.

Logical modules are assumed to be the set of peer modules at a particular level in the directory tree.

PR notes

Verbose `{context, source}` variants

There's quite a lot of manual coding here, and the usage pattern is ok, but it could be improved. I put together a macro library (context-err) a while ago which is intended to improve the situation, but I don't recommend it for production use yet. Right now the macro is very verbose, but still restrictive: it doesn't allow for boxing or 'static str contexts. We should think about whether we want to take this more in that direction.

General-purpose variants which should be collapsed

There are at least a few general-purpose {context, source} variants which could usefully be collapsed into more specific variants. I haven't yet done the analysis to identify all of these. In general the rule I will use is, if there are two or fewer distinct contexts in use, then they should just be variants of their own. If there are three or more, we should use the more general form.

Overall factoring

I've factored down the error types as finely as I think I reasonably can. Usage is relatively straightforward, and we end up with nice error stacks with a lot of context.

The downside of this is that when you actually get into the weeds of it, the error model in this crate is now insanely complex. Lots of simple pieces attaching to each other and converting between each other in non-simple ways.

I think this is the best we can do for now, but if we decide to do a Big Refactor after the Whiteboard Week in January, then we should definitely refactor and simplify the error model at that time as well.

PR Submission Checklist for internal contributors

The PR Title
- conforms to the style of semantic commits messages¹ supported in Wire's Github Workflow²
- contains a reference JIRA issue number like SQPIT-764
- answers the question: If merged, this PR will: ... ³

https://sparkbox.com/foundry/semantic_commit_messages
https://github.com/wireapp/.github#usage
E.g. feat(conversation-list): Sort conversations by most emojis in the title #SQPIT-764.

coriolinus · 2024-12-09T13:47:59Z

Currently there are errors when running cargo nextest run -p core-crypto --no-fail-fast:

mls::conversation::leaf_node_validation::tests::stages::should_validate_leaf_node_when_adding (x3): also appears to fail in main

~10 other errors. The set of failing cases changes each time I run the test command, and running these cases individually always seems to produce a passing output. So I'm going to mark this down to flakiness and not worry too much about fixing them.

github-actions · 2024-12-09T15:25:22Z

Bencher Report

Branch	prgn/refactor/errors-per-module
Testbed	ubuntu-latest

Click to view all benchmark results

Benchmark	Latency	nanoseconds (ns)
Commit add f(group size)/cs1/mem/1002	📈 view plot 🚷 view threshold	19,569,000.00
Commit add f(group size)/cs1/mem/2	📈 view plot 🚷 view threshold	7,052,400.00
Commit add f(group size)/cs1/mem/202	📈 view plot 🚷 view threshold	9,395,500.00
Commit add f(group size)/cs1/mem/402	📈 view plot 🚷 view threshold	12,120,000.00
Commit add f(group size)/cs1/mem/602	📈 view plot 🚷 view threshold	14,604,000.00
Commit add f(group size)/cs1/mem/802	📈 view plot 🚷 view threshold	17,073,000.00
Commit add f(number clients)/cs1/mem/1002	📈 view plot 🚷 view threshold	992,590,000.00
Commit add f(number clients)/cs1/mem/2	📈 view plot 🚷 view threshold	6,886,200.00
Commit add f(number clients)/cs1/mem/202	📈 view plot 🚷 view threshold	84,651,000.00
Commit add f(number clients)/cs1/mem/402	📈 view plot 🚷 view threshold	221,870,000.00
Commit add f(number clients)/cs1/mem/602	📈 view plot 🚷 view threshold	429,750,000.00
Commit add f(number clients)/cs1/mem/802	📈 view plot 🚷 view threshold	682,910,000.00
Commit pending proposals f(group size)/cs1/mem/1002	📈 view plot 🚷 view threshold	117,850,000.00
Commit pending proposals f(group size)/cs1/mem/2	📈 view plot 🚷 view threshold	28,654,000.00
Commit pending proposals f(group size)/cs1/mem/202	📈 view plot 🚷 view threshold	47,226,000.00
Commit pending proposals f(group size)/cs1/mem/402	📈 view plot 🚷 view threshold	62,326,000.00
Commit pending proposals f(group size)/cs1/mem/602	📈 view plot 🚷 view threshold	79,591,000.00
Commit pending proposals f(group size)/cs1/mem/802	📈 view plot 🚷 view threshold	95,394,000.00
Commit pending proposals f(pending size)/cs1/mem/1	📈 view plot 🚷 view threshold	19,427,000.00
Commit pending proposals f(pending size)/cs1/mem/101	📈 view plot 🚷 view threshold	116,690,000.00
Commit pending proposals f(pending size)/cs1/mem/21	📈 view plot 🚷 view threshold	36,101,000.00
Commit pending proposals f(pending size)/cs1/mem/41	📈 view plot 🚷 view threshold	57,381,000.00
Commit pending proposals f(pending size)/cs1/mem/61	📈 view plot 🚷 view threshold	76,328,000.00
Commit pending proposals f(pending size)/cs1/mem/81	📈 view plot 🚷 view threshold	95,746,000.00
Commit remove f(group size)/cs1/mem/1002	📈 view plot 🚷 view threshold	27,643,000.00
Commit remove f(group size)/cs1/mem/2	📈 view plot 🚷 view threshold	7,044,600.00
Commit remove f(group size)/cs1/mem/202	📈 view plot 🚷 view threshold	9,172,200.00
Commit remove f(group size)/cs1/mem/402	📈 view plot 🚷 view threshold	11,702,000.00
Commit remove f(group size)/cs1/mem/602	📈 view plot 🚷 view threshold	17,078,000.00
Commit remove f(group size)/cs1/mem/802	📈 view plot 🚷 view threshold	22,308,000.00
Commit remove f(number clients)/cs1/mem/1002	📈 view plot 🚷 view threshold	30,169,000.00
Commit remove f(number clients)/cs1/mem/2	📈 view plot 🚷 view threshold	137,310,000.00
Commit remove f(number clients)/cs1/mem/202	📈 view plot 🚷 view threshold	115,120,000.00
Commit remove f(number clients)/cs1/mem/402	📈 view plot 🚷 view threshold	93,815,000.00
Commit remove f(number clients)/cs1/mem/602	📈 view plot 🚷 view threshold	71,952,000.00
Commit remove f(number clients)/cs1/mem/802	📈 view plot 🚷 view threshold	51,076,000.00
Commit update f(group size)/cs1/mem/1002	📈 view plot 🚷 view threshold	140,020,000.00
Commit update f(group size)/cs1/mem/2	📈 view plot 🚷 view threshold	7,051,600.00
Commit update f(group size)/cs1/mem/202	📈 view plot 🚷 view threshold	33,810,000.00
Commit update f(group size)/cs1/mem/402	📈 view plot 🚷 view threshold	59,886,000.00
Commit update f(group size)/cs1/mem/602	📈 view plot 🚷 view threshold	86,436,000.00
Commit update f(group size)/cs1/mem/802	📈 view plot 🚷 view threshold	111,820,000.00

🐰 View full continuous benchmarking report in Bencher

coriolinus · 2024-12-09T15:39:44Z

Looks like the flaky cases have vanished after this rebase; presumably the work done in the interim made the difference.

codecov-commenter · 2024-12-13T13:54:00Z

Codecov Report

Attention: Patch coverage is 56.97813% with 1082 lines in your changes missing coverage. Please review.

Project coverage is 74.21%. Comparing base (97cf75b) to head (e93cee6).
Report is 34 commits behind head on main.

Files with missing lines	Patch %	Lines
crypto-ffi/src/generic/mod.rs	0.00%	276 Missing ⚠️
crypto/src/mls/mod.rs	40.52%	113 Missing ⚠️
crypto/src/mls/conversation/commit.rs	5.10%	93 Missing ⚠️
crypto/src/proteus.rs	65.62%	77 Missing ⚠️
crypto/src/mls/external_commit.rs	47.00%	53 Missing ⚠️
crypto/src/mls/client/mod.rs	68.69%	36 Missing ⚠️
crypto/src/mls/conversation/decrypt.rs	50.70%	35 Missing ⚠️
crypto-ffi/src/generic/context/mod.rs	0.00%	30 Missing ⚠️
crypto/src/context.rs	43.13%	29 Missing ⚠️
crypto/src/mls/client/key_package.rs	68.13%	29 Missing ⚠️
... and 34 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #791      +/-   ##
==========================================
- Coverage   75.15%   74.21%   -0.94%     
==========================================
  Files         107      116       +9     
  Lines       19599    20729    +1130     
==========================================
+ Hits        14729    15384     +655     
- Misses       4870     5345     +475

Files with missing lines	Coverage Δ
crypto-ffi/src/lib.rs	`0.00% <ø> (ø)`
crypto/src/e2e_identity/refresh_token.rs	`100.00% <100.00%> (ø)`
crypto/src/e2e_identity/stash.rs	`98.14% <100.00%> (+0.32%)`	⬆️
crypto/src/error/keystore.rs	`100.00% <100.00%> (ø)`
crypto/src/error/mod.rs	`100.00% <100.00%> (ø)`
crypto/src/error/wrapper.rs	`100.00% <100.00%> (ø)`
crypto/src/lib.rs	`37.50% <ø> (ø)`
crypto/src/mls/buffer_external_commit.rs	`97.54% <100.00%> (+0.28%)`	⬆️
crypto/src/mls/client/id.rs	`59.25% <ø> (ø)`
crypto/src/mls/client/identifier.rs	`100.00% <100.00%> (ø)`
... and 58 more

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 97cf75b...e93cee6. Read the comment docs.

…memorize ascii

By always returning a local error type, we open the door to removing the centralized error types entirely in favor of a more specific error stack. Note that this commit introduces a deprecated wrapper aroud `CryptoError` within `e2e_identity::error::Error`. This will be removed shortly, but it is necessary as an interim step, so that we don't have to rewrite actually everything, all at once. Note also that this affects some files within `mls`. This is required in order to keep everything buildable.

crypto/src/proteus.rs

crypto/src/error/legacy.rs

crypto-ffi/src/generic/mod.rs

crypto-ffi/src/wasm/mod.rs

typfel · 2024-12-17T16:36:49Z

crypto/src/mls/conversation/commit.rs

+            .mls_client()
+            .await
+            .map_err(RecursiveError::root("getting mls client"))?;


This repeated in quite a few places. Are we ever getting calling mls_client() with another intention than "getting mls client"?

Same question also applies for mls_provider()

No, but I don't see a better way to structure this.

The .mls_client() method (and mls_provider() as well) returns a root error: core_crypto::Error, implemented in crypto/src/error/mod.rs. But we call that method from all over the place, and we need to create all kinds of different errors from that. In this case what the function actually returns is a conversation error: core_crypto::mls::conversation::Error, defined in crypto/src/mls/conversation/error.rs.

You can see for yourself what's going on here: RecursiveError::root returns a function which wraps the root error and adds the context line, and then the ? operator uses the From<RecursiveError> implementation (from the #[from] annotation) to actually convert the error. But the point of the context field is to distinguish the cases. Quite likely, this "getting mls client" error is not the top-level error. It's common to get a MLS client and provider very near each other, and both operations will end up being mapped with RecursiveError::root. Having this context allows us to distinguish which of those operations failed.

Maybe you're thinking that the answer would be to have the mls_client method do its own wrapping: instead of returning Result<Client, core_crypto::Error>, it might return Result<Client, E> where RecursiveError: Into<E>. That would allow us to inline the context at least, avoid explicitly specifying that we're getting the mls client each time we get the mls client. Unfortunately, there are problems with that approach:

It breaks the pattern that each group of module always returns the internal error type for that group of modules. Now we're returning something based on this internal error type, but we can't say exactly what until we know the context of our caller.

It forces us to wrap everything everywhere in RecursiveError. That makes it more difficult, not less, to determine what the internal error type is.

It breaks the ? operator. The reason for this is that this defines the returned type in terms of Into. But the ? operator also applies an Into conversion. So desugaring everything looks like:
let client = match self.mls_client() { Ok(client) => client, Err(e) => return e.into(), };
But Rust can't infer what type e should have there, because any number of types might potentially satisfy both into conditions.

Maybe I'm missing something. But I do think that this somewhat-repetitive pattern is the best option we have for now.

(But it's worth mentioning that avoiding this kind of mess is a big driver of my desire to do a big refactoring that we can talk about next month! I don't have the domain knowledge yet to come into that week with a plan for what precisely we should do. But my hope is by the end of the week we will have designed a structure that lets avoid RecursiveError entirely.)

crypto/src/mls/conversation/renew.rs

crypto/src/mls/credential/mod.rs

crypto/src/mls/error.rs

crypto/src/mls/client/key_package.rs

crypto-ffi/src/generic/mod.rs

crypto-ffi/src/wasm/mod.rs

crypto/src/e2e_identity/rotate.rs

SimonThormeyer

Amazing, impressive work :)

crypto/src/mls/conversation/commit.rs

istankovic · 2024-12-19T13:23:26Z

I'm wondering if we can find a better name for LeafError. Leaf nodes are part of MLS and this error name may cause the reader to think that the error has something to do with those nodes, so it would be nice if we could avoid the ambiguity.

crypto/src/proteus.rs

istankovic · 2024-12-19T15:42:57Z

I would drop the two empty commits:
378e53de095fca72cb52cc05655b626ab3a9191a
4d791f57f67b5591f5d9eb1c483813341e787f9f

to avoid confusing whoever comes across them later.

…al error type Note that there are a number of test failures as of this commit, as the error types change. I am intentionally permitting this, with the intent to fix them all up once we have arrived at the final state of the error stacks.

…rror type

Also eliminates module-internal references to the `CryptoError` type. Intentionally does not yet remove the `CryptoError` type, because it is likely that I've missed something in different build configurations than I am testing by default, so I want to keep it around until I have more confidence that it is really obsolete.

Also get rid of the entirely superfluous documentation requirements. Error variants are simple, and have explicit documentation anyway in the form of the display implementation.

…m various modules Certain error variants were duplicated across a variety of module-specific errors. This could lead to problems in case we wanted a special-case handler for one of those error variants. This commit moves leaf error types into a common error type which can be extracted from module-specific errors.

Three main objectives in this commit: - introduce a small module hierarchy for errors, so that it is possible to better navigate through the error types - reduce code duplication by factoring out common patterns - remove unnecessary error dynamism by utilizing existing error kind enums We're finally beginning to approach a situation I would consider satisfactory with regard to error handling: - There is a top-level `error` module containing general-purpose crate-level errors - It also contains specialty errors with which we can construct general-purpose wrapped error stacks - Each code module group (i.e. folder) also contains its own internal error type, which contains unique leaf errors and sufficient information to wrap other crate error types. I still need to analyze and eliminate leaf types which are never constructed, fix the tests, and get rid of the legacy `CryptoError` entirely, but finally the end of this work is in sight.

…fError`

…ssible

…ible

…e possible

At this point the next failing case appears to be related to logic errors, not type errors. Which is unfortunate, as none of the logic was supposed to change.

Though running `cargo nextest run -p core-crypto --no-fail-fast` always produces approx 10 failing tests, going through that list is not profitable. The failing cases fall into two buckets: - `mls::conversation::leaf_node_validation::tests::stages::should_validate_leaf_node_when_adding`, which also appears to fail on `main` - the rest of the errors. This set varies each time I run a test job, and each case I've tried passes when run in isolation. So I have to assume that there is some kind of concurrency problem in the test suite, which is outside the scope of this PR. As such I am deciding to move on to broader testing instead of continuing to debug these test cases.

The error-handling status quo in `core_crypto` is wildly different now, so we needed to change quite a lot about how `core_crypto_ffi` used those errors. This commit causes `core-crypto-ffi` to build properly. Weirdly, this was _much_ simpler in wasm than doing so for non-wasm. I'll take it though!

These are _very sensitive_ to Cargo conditions though, on my machine at least. So now that they've passed once locally I am going to very carefully set them on the shelf and not breathe on them.

Basic error reporting and debug representations of things hide `Box`es. It's also not trivial to downcast into them, particularly in the context of a macro like this! Non-trivial does not mean impossible, though. Adding this capability to the macro means that we can avoid an ugly workaround in the test suite.

istankovic

This was a heroic effort. Well done! 🍷

coriolinus force-pushed the prgn/refactor/errors-per-module branch from f966cbb to a14bad0 Compare December 9, 2024 15:19

coriolinus force-pushed the prgn/refactor/errors-per-module branch from a14bad0 to fc14d7b Compare December 9, 2024 15:32

coriolinus force-pushed the prgn/refactor/errors-per-module branch 2 times, most recently from a62ed73 to 0971126 Compare December 13, 2024 09:24

coriolinus force-pushed the prgn/refactor/errors-per-module branch from fc70892 to 7f3296d Compare December 13, 2024 14:48

coriolinus marked this pull request as ready for review December 13, 2024 15:42

coriolinus requested a review from a team as a code owner December 13, 2024 15:42

coriolinus added 4 commits December 17, 2024 15:49

build(mls/client): byte literals exist so that humans do not have to …

4d5244b

…memorize ascii

refactor!(mls): simplify get_conversation api

d548298

refactor!(mls/client): client module uses a module-internal error type

c4a2b1c

coriolinus force-pushed the prgn/refactor/errors-per-module branch from f8a756b to 3dedc9e Compare December 17, 2024 14:49

typfel reviewed Dec 17, 2024

View reviewed changes