-
Notifications
You must be signed in to change notification settings - Fork 137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
State migration fails in some cases #3192
Comments
@turbolent Maybe we should add a debug flag in state migration program to report health status of input data (no migration). It will confirm if errors related to "slab not found" and "slab not referenced" are already in existing state. |
I think what happens for the dictionary key problem is that we always perform the key removal with the "legacy form" of the key. For example, for reference types, we encode the key with the old boolean authorization flag ( However, if this happens after the entitlements migration, e.g. in the capability migration, then the key is already in the new "new form", with an authorization that is an entitlements set. I'm currently trying to write a reproduction test case for this hypothesis. @SupunS Do you think this could be the cause? |
While analyzing the logs of the latest run of the migration against a more recent TN state snapshot, I realized I missed an error: the entitlements migration encounters |
In the latest run of the migration against a more recent TN state snapshot, the first two errors are not encountered anymore (capability static type without borrow type is handled properly, and key removal succeeds) |
Unfortunately, the atree related errors are not due to prior state issues, but seem to be caused by the migration itself. For example (note the "after migration", which is only reported if the health check succeeded before the migration):
I'll continue to investigate why / how these occur. |
@turbolent where you able to identify why the |
@SupunS No, I wasn't able to identify yet why this happens. I tried to reproduce the issue by adding a test case in https://github.com/onflow/cadence/pull/3205/files, but it succeeds as expected. I'll create a reduced state snapshot with just the core accounts + the problematic accounts, which cause the remaining errors (AuthAccount fails to load, atree health check errors), and share it, so we can locally run the migration and debug it. |
@turbolent one of the accounts is key collision from interface ordering, I guess. https://f.dnz.dev/0x289f9fe703f5a8ec/raw/storage/CapabilityFactory0x96b15ff6dfde11fe |
@bluesign Oh wow, good catch, thank you for looking into this! Great lead 👍 The static type migration is normalizing the ordering of the interfaces in intersection types and it's not accounting for the possibility that the new key already stores something. The old value is simply overwritten, leaving the slabs of the old data unreferenced. I'll add a check and error for when the new key already exists. I don't think we can resolve this situation automatically and probably need to ask the owner of the account to resolve this manually before the migration. |
yeah not much possible to do there automatically. |
.@SupunS mentioned that the value migrations do not fail on first error, so it's possible that the entitlements migration encounters a deprecated type that should have been migrated by the static type migration before, if the static type migration itself failed. I'll check if there are any prior errors in the logs |
@SupunS in regards to the AuthAccount type loading error: Looking at the logs for just one particular account for which the error occurs, there are no prior static type migration errors, e.g.
|
@SupunS Created |
Managed to debug the The problem with We need to handle these forms and also re-write them if they are deprecated. In general we should maybe use this chance and clean up the state and rewrite all such occurrences (deprecated or not). I wonder how the type ended up being stored like this. Maybe during import of the type? |
Probably from the type constructors i.e: e.g: cadence/runtime/tests/interpreter/runtimetype_test.go Lines 342 to 347 in c574c0a
|
nice find btw! |
We should maybe also fix the type constructor, to return the pre-defined type if they are builtin. But not sure what the implications are. e.g: will it still be considered a "CompositeType"? Does that really matter? etc.. |
Account 5d63c34d7f05e5a4 nicely reproduces the slab not found with address 0x0 (stack) issue, it only has 27 payloads |
Account 289f9fe703f5a8ec nicely reproduces the slab not referenced issue: with #3211 it reports:
after which the health check fails (as expected):
|
Further investigating the last remaining issue, slabs not findable, it turned out that I made a mistake when adding the health check in onflow/flow-go#5591: I mixed up the parameter order, and the health check before the migration was not run. When running with the fix, onflow/flow-go#5618, the unhealthy storage is correctly reported:
This resolves the last outstanding issue for the migration :-) However, we might still want to do something about this case. The data is gone, but for example, when this occurs for a dictionary or array, we could remove the broken slab references / elements, to at least "repair" the value and make it accessible again. The unhealthy storage / broken value seems to be related to https://www.notion.so/dapperlabs/Mainnet-contracts-affected-by-data-loss-bug-fb22d83d0c9c4704a74f2cd859b0f05c |
All known issues for healthy data are now complete, going to open a separate issue for the remaining issues with unhealthy data. |
Opened #3220 and onflow/flow-go#5634 |
Current Behavior
The state migration still runs into some errors. In particular, the log for running the migration on the full TN state (https://www.notion.so/flowfoundation/State-Migration-for-Crescendo-Release-56fe231eb56a4a5383f919ffe444d109?pvs=4#679a28687bbf49e9b3b52697820bc7ae) revealed the following problems:
Problem: During the static type migration, values are encountered which have nil types, even though they should not.
For example:
In particular, the following cases are encountered (in
v1.0.0-preview.14
):statictypes/statictype_migration.go:144
: Dictionarystatictypes/statictype_migration.go:173
statictypes/statictype_migration.go:185
: Capabilitystatictypes/statictype_migration.go:450
Analysis:
Capability
's type parameter is optional, but the static type migration assumes it is always set / never nil.Fix: Handle unparameterized Capability static types #3196
Problem: A particular key of a dictionary fails to get migrated, because the migrated key could not get removed:
Fix: Only encode reference static type in legacy format if it was decoded as such #3199
Problem: The atree storage health check fails, as storage refers to zero address slabs, which are not stored in accounts. For example:
Analysis:
Problem: The atree storage health check fails, because storage contains slabs which are never referenced. For example:
Analysis:
Problem: The entitlements migration encounters
AuthAccount
and it fails to load the type (as expected, it does not exist anymore)For example:
Analysis: Built-in types may not always be represented as
PrimitiveStaticType
s, but also asCompositeStaticType
s.Fix: Fix migration of built-in types #3205
Expected Behavior
The migration should succeed and not report errors.
Steps To Reproduce
Run latest state migration on recent, full TN state
Environment
The text was updated successfully, but these errors were encountered: