
Atree storage migration #4633

Merged: 98 commits merged into master from janez/atree-migration on Jan 23, 2024
Conversation

@janezpodhostnik (Contributor) commented Aug 16, 2023

This is the migration that was planned for atree register inlining.

Changes and Migrations

The migration has 4 steps (a rough sketch follows this list):

  • pre-migration hashing of cadence values
  • atree register inlining
  • contract register deduplication for 2 mainnet accounts
  • post-migration hashing of cadence values and comparison with the pre-migration values
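A rough sketch of how these four steps could be chained per account (hashAccountValues, migrateAtreeRegisters and deduplicateContracts are hypothetical placeholders, not the functions introduced in this PR; ledger.Payload is flow-go's ledger payload type):

package migrations

import (
    "bytes"
    "fmt"

    "github.com/onflow/flow-go/ledger"
)

// Placeholders for the real steps; declared as variables so the sketch compiles.
var (
    hashAccountValues     func(payloads []*ledger.Payload) ([]byte, error)
    migrateAtreeRegisters func(payloads []*ledger.Payload) ([]*ledger.Payload, error)
    deduplicateContracts  func(payloads []*ledger.Payload) ([]*ledger.Payload, error)
)

// migrateAccountSketch chains the four steps listed above for a single account.
func migrateAccountSketch(payloads []*ledger.Payload) ([]*ledger.Payload, error) {
    // 1. pre-migration hashing of cadence values
    before, err := hashAccountValues(payloads)
    if err != nil {
        return nil, err
    }

    // 2. atree register inlining
    if payloads, err = migrateAtreeRegisters(payloads); err != nil {
        return nil, err
    }

    // 3. contract register deduplication (only for the 2 affected mainnet accounts)
    if payloads, err = deduplicateContracts(payloads); err != nil {
        return nil, err
    }

    // 4. post-migration hashing and comparison with the pre-migration hashes
    after, err := hashAccountValues(payloads)
    if err != nil {
        return nil, err
    }
    if !bytes.Equal(before, after) {
        return nil, fmt.Errorf("cadence value hashes changed during migration")
    }
    return payloads, nil
}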

Other changes:

  • the account migration code was changed:
    • to do grouping by sorting, to make it faster (see the sketch after this list)
    • to have the ability to do multiple operations (migrations) on one account
  • progress logging was changed to include an ETA and to print something every 10% or every 1 minute
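For illustration, a minimal sketch of the grouping-by-sorting approach (groupByAccount and the ownerOf helper are hypothetical names, not the code added in this PR): sort the payload slice by owner once, then cut it into contiguous per-account sub-slices instead of building a map of account groups.

package util

import (
    "bytes"
    "sort"

    "github.com/onflow/flow-go/ledger"
)

// groupByAccount sorts payloads by owner in place and returns contiguous
// per-account sub-slices, avoiding an intermediate map of account groups.
func groupByAccount(
    payloads []*ledger.Payload,
    ownerOf func(*ledger.Payload) []byte, // extracts the owner address of a payload
) [][]*ledger.Payload {
    sort.Slice(payloads, func(i, j int) bool {
        return bytes.Compare(ownerOf(payloads[i]), ownerOf(payloads[j])) < 0
    })

    var groups [][]*ledger.Payload
    start := 0
    for i := 1; i <= len(payloads); i++ {
        // close a group whenever the owner changes (or at the end of the slice)
        if i == len(payloads) || !bytes.Equal(ownerOf(payloads[i]), ownerOf(payloads[start])) {
            groups = append(groups, payloads[start:i])
            start = i
        }
    }
    return groups
}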

Current state of the code

  • The migration isn't yet using the correct version of atree that does the inlining, but it is set up in a way where nothing will need to change when the atree (and cadence) versions are upgraded.
  • The migration works on canary, testnet, and mainnet.
  • The migration takes a few minutes on testnet and canary, but ~7 h on mainnet with 160 cores, and uses approximately 1.5 TB of memory at peak.
  • The migration has an exception prepared for the mainnet cricketMoment collection (which is the bulk of the time during migration) that should speed up the migration, but it has a bug that makes the post-migration validation fail.
  • output from mainnet: top_longest_migrations=["30cf5dcf6ea8d379: 12m27.007314972s","1e3c78c6d580273b: 12m37.598222755s","e9a85b21a91040b1: 14m34.614031384s","e2e1689b53e92a82: 14m14.566103677s","b6f2481eba4df97b: 17m11.768844197s","321d8fcde05f6e8c: 28m7.706118076s","87ca73a41bb50ad5: 26m22.046354959s","481914259cb9174e: 15m39.537646583s","8528d8dd8a586f0d: 16m17.15586826s","20187093790b9aef: 20m31.733136401s","ecfad18ba9582d4f: 2h33m22.364125919s","bd56134d152d0ae2: 1h1m3.288347443s","fa57101aa0d55954: 47m38.798327044s","5eb12ad3d5a99945: 1h32m59.467484076s","81e95660ab5308e1: 41m8.044854272s","e4cf4bdc1751c65d: 59m55.150660767s","cecd8c81c050eef1: 19m44.674823294s","5cd62ecc83793917: 21m14.581682407s","e1f2a091f7bb5245: 54m24.543389847s","4eded0de73020ca5: 4h2m13.039145307s"]
  • The last successful run on mainnet was on mainnet23-execution-us-central1-a-20231025163946-0465sras, which is from Oct 25, 2023.

janezpodhostnik self-assigned this Aug 16, 2023
janezpodhostnik changed the title from "Atree migration initial commit" to "Atree migration" Aug 16, 2023
janezpodhostnik changed the title from "Atree migration" to "Atree storage migration" Aug 16, 2023

codecov-commenter commented Aug 16, 2023

Codecov Report

Attention: 1018 lines in your changes are missing coverage. Please review.

Comparison is base (cafb1a6) 56.02% compared to head (759ba39) 55.62%.
Report is 2128 commits behind head on master.

Files Patch % Lines
...util/ledger/migrations/cadence_value_validation.go 43.62% 172 Missing and 27 partials ⚠️
...md/util/ledger/util/migration_runtime_interface.go 0.00% 150 Missing ⚠️
...util/ledger/migrations/atree_register_migration.go 52.10% 104 Missing and 21 partials ⚠️
cmd/execution_builder.go 0.00% 98 Missing ⚠️
...d/util/ledger/migrations/storage_used_migration.go 2.40% 81 Missing ⚠️
cmd/util/ledger/util/util.go 0.00% 79 Missing ⚠️
.../util/ledger/migrations/cadence_data_validation.go 58.91% 64 Missing and 12 partials ⚠️
.../util/ledger/migrations/account_based_migration.go 70.53% 50 Missing and 11 partials ⚠️
cmd/util/ledger/util/payload_grouping.go 83.63% 19 Missing and 8 partials ⚠️
cmd/scaffold.go 0.00% 26 Missing ⚠️
... and 13 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4633      +/-   ##
==========================================
- Coverage   56.02%   55.62%   -0.41%     
==========================================
  Files         965     1003      +38     
  Lines       89671    96785    +7114     
==========================================
+ Hits        50241    53836    +3595     
- Misses      35678    38881    +3203     
- Partials     3752     4068     +316     
Flag Coverage Δ
unittests 55.62% <44.06%> (-0.41%) ⬇️

Flags with carried forward coverage won't be shown.

@fxamacker (Member) left a comment

Nice! Thanks for doing this! 👍 I mostly looked at the payload migration code.

Just an idea, maybe we can use 2 storages for migration:

  • Original payloads can be loaded into the first storage and migrated payloads can be stored in the second storage.
  • Two storages can more completely test migration results by comparing all Cadence values before and after migration.
  • Not sure, but maybe there is also a performance benefit, because we don't need to recursively remove migrated payloads from the first storage (StorageMap.SetValue).
  • Since it will use more memory, we can enable the 2-storage migration with a flag.

Thoughts?
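For illustration only, the flag-gated variant could look roughly like this; valueStore and all the helpers below are hypothetical stand-ins, not the actual runtime/atree storage APIs:

package migrations

// Hypothetical stand-ins so the sketch compiles; not real APIs.
type valueStore interface{}

var (
    migrateInPlace func(original valueStore) error           // the existing single-storage path
    newEmptyStore  func() valueStore                         // second storage for migrated payloads
    cloneInto      func(original, migrated valueStore) error // migrate by cloning into the second storage
    compareAll     func(original, migrated valueStore) error // compare every Cadence value before/after
)

func migrateWithOptionalSecondStorage(original valueStore, useTwoStorages bool) error {
    if !useTwoStorages {
        // default: migrate within a single storage, as the PR currently does
        return migrateInPlace(original)
    }
    migrated := newEmptyStore()
    if err := cloneInto(original, migrated); err != nil {
        return err
    }
    // with two storages, every value can be compared before and after migration
    return compareAll(original, migrated)
}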

@janezpodhostnik (Contributor, Author)

I was considering that, but I opted for a single storage for simplicity.

When we move values to a new storage, we also have to move non-atree registers (the account status register, account key registers, and the contract names register), and also fix the storageUsed field, either before moving all the other registers or in a separate migration (which we already have).

If you think there is a possibility that it will make it faster, I can try.

@bluesign (Contributor)

I think if 2 storages will be used (and considering reliability > speed), one more stage to check for orphan slabs could be good. After comparing cadence values (storage2 == storage1), we can delete everything from storage 1 and, at the end, assert nothing is left in storage 1.

@fxamacker (Member)

@janezpodhostnik

When we move values to a new storage, we also have to move non-atree registers (the account status register, account key registers, and the contract names register), and also fix the storageUsed field, either before moving all the other registers or in a separate migration (which we already have).

Yes. All non-atree registers can be moved as is, except for storage key registers.

How is the storageUsed field computed? Does it require the data migration to complete first?

If you think there is a possibility that it will make it faster, I can try.

Yes, I think using 2 storages might speed up the process by not recursively removing all atree registers after cloning.

The main benefit is being able to fully compare the migration result during testing and also during the spork (if the speed is acceptable).

So even if it is the same speed or slightly worse, it would be worthwhile.

@fxamacker (Member)

@bluesign

I think if 2 storages will be used (and considering reliability > speed), one more stage to check for orphan slabs could be good. After comparing cadence values (storage2 == storage1), we can delete everything from storage 1 and, at the end, assert nothing is left in storage 1.

Good point! This check for orphan slabs would be great to have during data migration dry runs, etc. 👍 Maybe this check can be enabled by a flag.

There is already a function in Atree to check for orphan slabs so I think we can use it during data migration as an option.
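A rough sketch of that extra stage (all helpers here are hypothetical placeholders; the existing atree health-check function would be used in place of remainingSlabCount):

package migrations

import "fmt"

// Hypothetical placeholders so the sketch compiles; not actual atree or flow-go APIs.
var (
    compareAllValues     func(storage1, storage2 any) error // the storage2 == storage1 check
    deleteMigratedValues func(storage1 any) error           // remove everything that was migrated
    remainingSlabCount   func(storage1 any) int             // what atree's health check would report
)

// assertNoOrphanSlabs runs after migration: verify storage2 matches storage1,
// delete the migrated values from storage1, and assert nothing is left behind.
func assertNoOrphanSlabs(storage1, storage2 any) error {
    if err := compareAllValues(storage1, storage2); err != nil {
        return err
    }
    if err := deleteMigratedValues(storage1); err != nil {
        return err
    }
    if n := remainingSlabCount(storage1); n != 0 {
        return fmt.Errorf("%d orphan slabs left in the original storage", n)
    }
    return nil
}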

@janezpodhostnik (Contributor, Author)

Those are good points. I'll switch to having two storages.

The storage used field updates automatically if you go through the environment.StatefulAccounts object, which the migration does. But if there are two storages, the new storage used will be recorded incorrectly...

If storageUsed starts at X:

  • when removing and adding within the same storage, storage used ends up as X - A + B
  • when moving to a new storage, nothing gets deleted in the new storage, so it is just X + B

But this is an issue I can fix.
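For a concrete example with made-up numbers: if storage used starts at X = 100 bytes, the old value takes A = 30 bytes and the migrated value takes B = 32 bytes, then the single-storage path ends at 100 - 30 + 32 = 102, while the two-storage path never deletes anything in the new storage and ends at 100 + 32 = 132, so the migration would have to subtract A explicitly (or recompute storage used afterwards).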

@fxamacker (Member) left a comment

I took a look (not a full PR review) at the skipped registers problem. Nothing obvious jumped out related to skipped registers, but I left a few comments.

Maybe it would be useful to have an atree register dump for a small account with known skipped registers:

  • before migration
  • after migration

Thoughts?

Review comments (outdated, resolved) on: cmd/util/ledger/util/util.go, cmd/util/ledger/migrations/atree_register_migration.go (×2)
janezpodhostnik force-pushed the janez/atree-migration branch 2 times, most recently from 27f763a to c1d7866 on September 1, 2023 11:35
janezpodhostnik changed the base branch from master to janez/upgrade-cadence-and-change-to-read-random on September 1, 2023 11:35
@fxamacker (Member) left a comment

@janezpodhostnik Thanks for sharing the latest findings today!

You also mentioned needing more memory on the test machine, so I took a look for quick memory savings.

Review comments (outdated, resolved) on: cmd/util/ledger/migrations/account_based_migration.go (×3)
@bluesign (Contributor) commented Sep 8, 2023

@janezpodhostnik Sharding groups.Accounts and using multiple passes could decrease memory usage a lot, I guess: the first pass handles accounts ending with 1, the second pass accounts ending with 2, etc.
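A minimal sketch of that multi-pass sharding idea, with a hypothetical Account type and processAccount step (not the PR's actual types): each pass only processes accounts whose address ends with the pass's nibble, so per-pass state only needs to be held for roughly 1/16 of the accounts at a time.

package migrations

// Hypothetical stand-ins so the sketch compiles.
type Account struct {
    Address []byte
}

var processAccount func(Account) error

// migrateInPasses shards accounts into 16 passes by the last nibble of the address.
func migrateInPasses(accounts []Account) error {
    for nibble := byte(0); nibble < 16; nibble++ {
        for _, a := range accounts {
            if len(a.Address) == 0 || a.Address[len(a.Address)-1]&0x0f != nibble {
                continue // this account belongs to another pass
            }
            if err := processAccount(a); err != nil {
                return err
            }
        }
    }
    return nil
}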

@janezpodhostnik (Contributor, Author)

@bluesign yeah, something like that might be necessary. I was also thinking of just dumping intermediate results to disk.

@fxamacker (Member) commented Sep 8, 2023

@janezpodhostnik Maybe we can eliminate the map of account groups. Instead of creating a map, we can sort the input payload slice by address in place.

@dsainati1 (Contributor) left a comment

I just took a look at this from a Cadence perspective, but had a couple of thoughts.

}

// NopMemoryGauge is a no-op implementation of the MemoryGauge interface
type NopMemoryGauge struct{}
Contributor:

I am not 100% sure, but I don't think it's necessary to define a new Nop type here for a memory gauge that doesn't do anything; the Cadence code should be set up such that nil can be passed anywhere that expects a MemoryGauge if you don't care about tracking memory usage.

Comment on lines 273 to 301
// Attachments are enabled everywhere except for Mainnet
AttachmentsEnabled: true,
// Capability Controllers are enabled everywhere except for Mainnet
CapabilityControllersEnabled: true,
Contributor:

I assume these are enabled so this migration can be run on testnet too?

Base automatically changed from janez/upgrade-cadence-and-change-to-read-random to master September 14, 2023 19:46
@fxamacker (Member)

@janezpodhostnik

Thanks for sharing the results yesterday of the dry run for the migration without atree inlining.

The largest account took hours longer than we'd like, so we can try to parallelize the migration within large accounts by:

  • value.Clone() can be done by multiple goroutines for accounts with a large number of payloads. For example, a goroutine can receive a cadence.Value from an input channel, clone the value, and then push the cloned value into an output channel (see the sketch after this comment). This should only be used for large accounts because of the overhead and extra memory.

err = capturePanic(func() {
    // force the value to be read entirely
    value = value.Clone(inter)
})
if err != nil {
    return fmt.Errorf("failed to clone value for key %s: %w", key, err)
}

  • storageMap.SetValue() writes the new value and removes the old value recursively. It effectively reads and traverses all registers of the removed value, which can be time-consuming for a large account. Maybe we can switch to the two-storage approach discussed earlier, or use a new storage map so SetValue() doesn't trigger the recursive traversal and removal. However, this speedup may use more memory.

err = capturePanic(func() {
    // set value will first purge the old value
    storageMap.SetValue(inter, key, value)
})
if err != nil {
    return fmt.Errorf("failed to set value for key %s: %w", key, err)
}

  • We can implement batch decoding of atree payloads at the atree.SlabStorage level. Currently, most of the atree register decoding happens during value.Clone(), one register at a time. By decoding all payloads in a batch, value.Clone() would be faster, and memory use ends up the same eventually. This would benefit all accounts. 😄

Thoughts?
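A rough sketch of the worker-pool idea from the first bullet (the channel plumbing and worker count are illustrative, not part of this PR; it reuses the value.Clone(inter) call shown in the snippet above, and assumes the shared interpreter and atree read cache are safe to use concurrently, which is exactly the open question discussed further down):

package migrations

import (
    "sync"

    "github.com/onflow/cadence/runtime/interpreter"
)

// cloneValuesConcurrently clones values with a small pool of goroutines.
// Only worth doing for very large accounts; for small accounts the channel
// overhead and extra memory outweigh the gain.
func cloneValuesConcurrently(
    inter *interpreter.Interpreter,
    in <-chan interpreter.Value,
    nWorkers int,
) <-chan interpreter.Value {
    out := make(chan interpreter.Value, nWorkers)

    var wg sync.WaitGroup
    wg.Add(nWorkers)
    for i := 0; i < nWorkers; i++ {
        go func() {
            defer wg.Done()
            for v := range in {
                // force the value to be read entirely, as in the snippet above
                out <- v.Clone(inter)
            }
        }()
    }

    go func() {
        wg.Wait()
        close(out)
    }()
    return out
}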

This is a temporary (and inefficient) workaround to check the number of fields in a Cadence CompositeValue.

In the future, CompositeValue.FieldCount() will be added to
Cadence, so that function can be used to improve this workaround.
…e-value-validation

Replace hash-based validation of migrated Cadence values to use `Equal()`
@fxamacker (Member) left a comment

I took another look and left some minor comments. Also ran some dry-runs this week with/without non-hash validation (PR 5204) and will take a closer look at the dry-run logs.

The dry runs used this PR updated with follow-up PRs 5122, 5128, 5143, and 5204. The total dry-run duration was 5 hours with the new validation enabled; without validation enabled, it took about 4 hours using the mainnet24 root. Peak memory use (max RSS) was 1.95 TB - 2.05 TB with/without validation.

Nice improvements on the migration duration!

Review comments (outdated, resolved) on: cmd/util/ledger/migrations/migrator_runtime.go, cmd/util/ledger/migrations/atree_register_migration.go (×4)
@fxamacker (Member)

Migration dry-runs with/without the inbox domain produced:

  • 163 accounts with unmigrated payloads when run without inbox domain
  • 4 accounts with unmigrated payloads when run with inbox domain added

@fxamacker (Member) left a comment

Nice! LGTM!

Remove cricket moments references from atree migration
}

// Iterate through all domains and compare cadence values.
for _, domain := range domains {
Member:

Can we validate the different storage domains in parallel?

Member:

We can, but most of the data is in one domain, so it is likely to remain a bottleneck (not much to gain from parallelism).

Member:

Could we validate the storage map keys in parallel though?

Member:

Could we validate the storage map keys in parallel though?

Good question. That would allow parallelizing within the same domain, but we would need to determine whether updating the atree read cache needs to support parallelism for this.

Review comment (outdated, resolved) on: cmd/util/ledger/migrations/cadence_data_validation.go
return true
})

if len(vFieldNames) != len(otherFieldNames) {
Member:

Do we need to compare whether they contain the same items?

If so, what about building a map, removing each field from the map when iterating through otherComposite, and verifying whether any field is left?
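A tiny sketch of that map-based check (illustrative only; as the reply below explains, it turns out not to be needed here):

// sameFieldNames reports whether both composite values have exactly the same
// field names, using the build-a-map-and-delete approach suggested above.
func sameFieldNames(vFieldNames, otherFieldNames []string) bool {
    if len(vFieldNames) != len(otherFieldNames) {
        return false
    }
    remaining := make(map[string]struct{}, len(vFieldNames))
    for _, name := range vFieldNames {
        remaining[name] = struct{}{}
    }
    for _, name := range otherFieldNames {
        if _, ok := remaining[name]; !ok {
            return false
        }
        delete(remaining, name)
    }
    // every field of the other composite matched one of ours, and none are left over
    return len(remaining) == 0
}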

Member:

Do we need to compare whether they contain the same items?

Not needed here, because:

  • we compare the number of fields of both composite values
  • we compare the fields of the 2 composite values by iterating over the fields of the 1st composite value and comparing the field values (of the matching field name) in the 2nd composite value
  • if the composite values had different fields, we would fail to get a field value by the same field name in the previous step.

nWorker int, // number of concurrent workers to migrate payloads
runMigrations bool,
Contributor (Author):

@zhangchiqing as discussed, re-adding this

janezpodhostnik added this pull request to the merge queue Jan 23, 2024
var s string
err := capturePanic(
    func() {
        s = value.RecursiveString(interpreter.SeenReferences{})
Member:

Not 100% sure if the string description of all values includes all the data of the value.

Member:

That file (cadence_data_validation.go) isn't used any more.

So maybe the file can be removed since we're using cadence_value_validation.go instead.

}

// Iterate through all domains and compare cadence values.
for _, domain := range domains {
Member:

Could we validate the storage map keys in parallel though?

Merged via the queue into master with commit 0979fb2 Jan 23, 2024
51 checks passed
janezpodhostnik deleted the janez/atree-migration branch January 23, 2024 22:03
Comment on lines +423 to +431
}

var domainsLookupMap = map[string]struct{}{
    common.PathDomainStorage.Identifier(): {},
    common.PathDomainPrivate.Identifier(): {},
    common.PathDomainPublic.Identifier():  {},
    runtime.StorageDomainContract:         {},
    stdlib.InboxStorageDomain:             {},
}
Member:

Given that Testnet already had Cap Cons enabled, those also have to be migrated, i.e. stdlib.CapabilityControllerStorageDomain needs to be added

Suggested change

Before:

}

var domainsLookupMap = map[string]struct{}{
    common.PathDomainStorage.Identifier(): {},
    common.PathDomainPrivate.Identifier(): {},
    common.PathDomainPublic.Identifier():  {},
    runtime.StorageDomainContract:         {},
    stdlib.InboxStorageDomain:             {},
}

After:

    stdlib.CapabilityControllerStorageDomain,
}

var domainsLookupMap = map[string]struct{}{
    common.PathDomainStorage.Identifier():    {},
    common.PathDomainPrivate.Identifier():    {},
    common.PathDomainPublic.Identifier():     {},
    runtime.StorageDomainContract:            {},
    stdlib.InboxStorageDomain:                {},
    stdlib.CapabilityControllerStorageDomain: {},
}
