channeldb: add optional migration to prune revocation logs #6469

Merged: 13 commits merged into lightningnetwork:master on Aug 5, 2022

Conversation

@yyforyongyu (Member) commented Apr 27, 2022

This is a follow-up PR on #6347.

This PR introduces an optional migration that's fault-tolerant. Since the potential scale is quite large, we perform the update across multiple db transactions to prevent OOM issues. We cannot safely assume the process won't be interrupted, so we need to be able to pick up where we left off and continue the migration if that does happen.

This PR also introduces an optional meta that's similar to the existing meta but stores info about optional migrations. We need to manage the two separately because otherwise, once a future mandatory migration is applied, the user would have trouble applying the optional migration.
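Roughly, the chunked, resumable approach looks something like the following sketch (illustration only, not the code in this PR; the bucket names, progress key, and migrateOne helper are made up):

package sketch

import (
	"github.com/lightningnetwork/lnd/kvdb"
)

var (
	oldLogBucket = []byte("revocation-log-old") // hypothetical name
	newLogBucket = []byte("revocation-log-new") // hypothetical name
	progressKey  = []byte("migration-progress") // hypothetical name
)

// recordsPerTx caps how many logs are migrated in a single db transaction so
// a huge database never needs one mega transaction.
const recordsPerTx = 1000

// migrateInChunks migrates the logs in multiple small transactions and
// persists a progress marker, so an interrupted run resumes where it left
// off on the next start.
func migrateInChunks(db kvdb.Backend) error {
	done := false
	for !done {
		err := kvdb.Update(db, func(tx kvdb.RwTx) error {
			oldBucket := tx.ReadWriteBucket(oldLogBucket)
			if oldBucket == nil {
				done = true
				return nil
			}
			newBucket, err := tx.CreateTopLevelBucket(newLogBucket)
			if err != nil {
				return err
			}

			// Resume after the last key we migrated, if any.
			cursor := oldBucket.ReadWriteCursor()
			k, v := cursor.First()
			if last := newBucket.Get(progressKey); last != nil {
				cursor.Seek(last)
				k, v = cursor.Next()
			}

			// Migrate at most recordsPerTx entries in this tx.
			for i := 0; i < recordsPerTx; i++ {
				if k == nil {
					done = true
					return nil
				}
				if err := migrateOne(newBucket, k, v); err != nil {
					return err
				}
				if err := newBucket.Put(progressKey, k); err != nil {
					return err
				}
				k, v = cursor.Next()
			}
			return nil
		}, func() {
			// Reset in-memory state in case the closure is retried.
			done = false
		})
		if err != nil {
			return err
		}
	}
	return nil
}

// migrateOne stands in for converting one old log into the new format.
func migrateOne(newBucket kvdb.RwBucket, k, v []byte) error {
	return newBucket.Put(k, v)
}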

@yyforyongyu (Member, Author) commented Apr 27, 2022

Here are some benchmark results. Maybe there's something wrong with the setup, but the migration is really slow.

Benchmark for 2000 logs

    2000     214401347 ns/op     7181999 B/op      92167 allocs/op
PASS
ok      github.com/lightningnetwork/lnd/channeldb/migration26   576.254s



Benchmark for 5000 logs

    5000     250081490 ns/op     9413385 B/op     125960 allocs/op
PASS
ok      github.com/lightningnetwork/lnd/channeldb/migration26   1549.976s

Most of the time is spent deriving keys and computing scripts to find the output indexes, which is slow. A progress log has been added to help track the process.
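For reference, the progress log is just a percentage over the total log count, along the lines of this small helper (illustrative only; lnd uses its own logger):

package sketch

import "log"

// logProgress prints how far the migration has come, similar to the
// "Migration progress: X%, still have: N" lines shown later in this thread.
// The standard library logger is used here only for brevity.
func logProgress(total, remaining int) {
	migrated := total - remaining
	progress := float64(migrated) / float64(total) * 100
	log.Printf("Migration progress: %.3f%%, still have: %d",
		progress, remaining)
}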

@yyforyongyu force-pushed the 5388-migration branch 2 times, most recently from 28c2169 to 31ff6b7 on April 28, 2022 13:33
@yyforyongyu force-pushed the 5388-migration branch 2 times, most recently from df3e677 to 334f458 on May 1, 2022 07:53
@yyforyongyu (Member, Author):

Decided to put the script computation inside a goroutine, and with a bug fixed in the iterator, the migration is much faster now.

Building test data for 1000 logs...
Finished building test data, migrating...
    1000        195708 ns/op       11526 B/op        178 allocs/op
PASS
ok      github.com/lightningnetwork/lnd/channeldb/migration26   0.626s

Building test data for 5000 logs...
Finished building test data, migrating...
    5000         82152 ns/op       11615 B/op        177 allocs/op
PASS
ok      github.com/lightningnetwork/lnd/channeldb/migration26   0.873s


Building test data for 1000000 logs...
Finished building test data, migrating...
 1000000         56819 ns/op       11370 B/op        175 allocs/op
PASS
ok      github.com/lightningnetwork/lnd/channeldb/migration26   71.224s
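For illustration, the goroutine-based speed-up described above amounts to fanning the CPU-heavy script computation out to a worker pool, along these lines (a rough sketch; logEntry and computeScripts are made-up stand-ins, not this PR's actual code):

package sketch

import (
	"runtime"
	"sync"
)

// logEntry pairs a raw revocation log with the scripts computed for it.
type logEntry struct {
	key     []byte
	rawLog  []byte
	scripts [][]byte
	err     error
}

// computeAllScripts fans the expensive key-derivation/script computation out
// to a bounded pool of goroutines, so it no longer dominates the migration
// time while the db transaction only does cheap reads and writes.
func computeAllScripts(entries []*logEntry) {
	var wg sync.WaitGroup
	sem := make(chan struct{}, runtime.NumCPU())

	for _, e := range entries {
		wg.Add(1)
		sem <- struct{}{}
		go func(e *logEntry) {
			defer wg.Done()
			defer func() { <-sem }()
			e.scripts, e.err = computeScripts(e.rawLog)
		}(e)
	}
	wg.Wait()
}

// computeScripts stands in for deriving the commitment keys and building the
// output scripts for one revocation log.
func computeScripts(rawLog []byte) ([][]byte, error) {
	return nil, nil
}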

@Roasbeef (Member) left a comment

Completed a very cursory pass to get a better grasp of the proposed approach and also how the new optional migration scaffolding has been added. Great job on the set of test cases! They're very comprehensive, and I also like that we started to benchmark the migrations as well, to get a handle on how long it'll take a canned node to apply the migration. Ideally in the near future we never need to do such manual migrations again, but this feature is certainly urgently needed.

Will circle back with a deeper review, and also some test runs (the --dry-run-migration flag can be super useful here). Speaking of --dry-run-migration: as is, the migration in this PR isn't strictly compatible with it, since this PR commits its progress in smaller transactions, vs the existing approach where one mega transaction is used for everything.


// Iterate the buckets. If we received an exit signal, return the
// locator.
err := iterateBuckets(openChanBucket, nil, cb)
Roasbeef (Member):

Shouldn't we be passing in a specific node's bucket here rather than the entire open channel bucket? As we need to know the last location for each channel, not the last global key we migrated?

yyforyongyu (Member, Author):

Actually it's designed to locate the global key. Maybe I should've named the functions better, but findNextMigrateHeight is designed to find the next key within the scope of a channel, and locateNextUpdateNum is to find the global position.

If we restart a previously interrupted migration, we need to:

-> iterate the node's buckets until,
    -> iterate the chainhash buckets until,
        -> iterate the channel buckets until,
            -> a channel's next revocation log height is located, which signals the exit.

In other words, a restart means we need to find out which channel inside which node has unfinished jobs and continue the migration there.
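A rough sketch of that walk (illustration only; the real iterator in this PR differs in the details, and the locator fields and sentinel error here are made up):

package sketch

import (
	"errors"

	"github.com/lightningnetwork/lnd/kvdb"
)

// errExitSignal aborts the bucket walk once the next un-migrated channel is
// found.
var errExitSignal = errors.New("found next un-migrated channel")

// updateLocator remembers where the previous run stopped: which node, chain
// hash and channel to resume at, plus the next revocation height to migrate.
type updateLocator struct {
	nodePub    []byte
	chainHash  []byte
	fundingOp  []byte
	nextHeight []byte
}

// locateNextUpdateNum walks node bucket -> chain-hash bucket -> channel
// bucket and stops at the first channel that still has un-migrated logs.
func locateNextUpdateNum(openChanBucket kvdb.RwBucket) (*updateLocator, error) {
	var locator *updateLocator

	err := openChanBucket.ForEach(func(nodePub, _ []byte) error {
		nodeBucket := openChanBucket.NestedReadWriteBucket(nodePub)
		if nodeBucket == nil {
			return nil
		}
		return nodeBucket.ForEach(func(chainHash, _ []byte) error {
			chainBucket := nodeBucket.NestedReadWriteBucket(chainHash)
			if chainBucket == nil {
				return nil
			}
			return chainBucket.ForEach(func(chanPoint, _ []byte) error {
				chanBucket := chainBucket.NestedReadWriteBucket(chanPoint)
				if chanBucket == nil {
					return nil
				}

				// findNextMigrateHeight returns nil once a
				// channel has been fully migrated.
				height := findNextMigrateHeight(chanBucket)
				if height == nil {
					return nil
				}

				locator = &updateLocator{
					nodePub:    nodePub,
					chainHash:  chainHash,
					fundingOp:  chanPoint,
					nextHeight: height,
				}

				// Returning a sentinel error exits the walk early.
				return errExitSignal
			})
		})
	})
	if err != nil && !errors.Is(err, errExitSignal) {
		return nil, err
	}

	// A nil locator means every channel has been migrated.
	return locator, nil
}

// findNextMigrateHeight stands in for finding the next un-migrated
// revocation height within a single channel bucket.
func findNextMigrateHeight(chanBucket kvdb.RwBucket) []byte {
	return nil
}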

@yyforyongyu (Member, Author):

Speaking of --dry-run-migration: as is, the migration in this PR isn't strictly compatible with it, since this PR commits its progress in smaller transactions, vs the existing approach where one mega transaction is used for everything.

Yeah, that's indeed an issue, as the migration relies on what's already written on disk to decide its next move. Maybe we could delete the new buckets if we are in dry-run mode? While it complicates things a bit more, I think we can leverage the reset callback inside kvdb.Update to wipe out data that was previously written.
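Something along these lines could work for the dry-run case (a sketch only; the bucket name is hypothetical, and whether the new bucket can simply be dropped depends on how the real migration is structured):

package sketch

import "github.com/lightningnetwork/lnd/kvdb"

// newLogBucket is a hypothetical name for the bucket the migration writes.
var newLogBucket = []byte("revocation-log-new")

// wipeDryRunState deletes whatever a partial dry-run migration wrote, so the
// db is left exactly as it was before the run.
func wipeDryRunState(db kvdb.Backend) error {
	return kvdb.Update(db, func(tx kvdb.RwTx) error {
		err := tx.DeleteTopLevelBucket(newLogBucket)
		if err != nil && err != kvdb.ErrBucketNotFound {
			return err
		}
		return nil
	}, func() {})
}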

@yyforyongyu force-pushed the 5388-migration branch 2 times, most recently from 4605f4f to f74363d on May 6, 2022 08:40
@Roasbeef added this to the v0.15.1 milestone on May 12, 2022
@yyforyongyu force-pushed the 5388-migration branch 6 times, most recently from 3e58b0d to b3343d3 on May 20, 2022 03:31
@TonyGiorgio (Contributor):

Ran this PR on my 30 GB channel.db that wouldn't compact down any further, and it was able to compact it down to 129 MB. Giving it a thumbs up, thanks 👍

@Roasbeef requested a review from bhandras on June 28, 2022 22:22
cb := func(chanBucket kvdb.RwBucket, l *updateLocator) error {
locator = l
updateNum := findNextMigrateHeight(chanBucket)

Roasbeef (Member):

I think we should log the height here, so we can use it to print some more progress during the migration. Otherwise, users may just see it sitting there for minutes and wonder what's going on.

yyforyongyu (Member, Author):

Added more verbose logs and debug logs.

// migrations will be applied.
//
// NOTE: only support the prune_revocation_log optional migration atm.
func (d *DB) applyOptionalVersions(cfg OptionalMiragtionConfig) error {
Roasbeef (Member):

So we'd thread through that quit channel here to make the migration resumable.

yyforyongyu (Member, Author):

So I guess to achieve this, we'll pass the ctx used in BuildDatabase through BuildDatabase -> CreateWithBackend -> applyOptionalVersions -> version.migration and catch the cancel signal there?
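Conceptually, that would boil down to something like this (a sketch under the assumption that progress is committed per chunk; runOneChunk is a made-up stand-in):

package sketch

import (
	"context"

	"github.com/lightningnetwork/lnd/kvdb"
)

// migrateWithCancel checks the context between db transactions, so a
// shutdown stops cleanly at a chunk boundary and the next startup resumes
// from the persisted progress.
func migrateWithCancel(ctx context.Context, db kvdb.Backend) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
		}

		done, err := runOneChunk(db)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
	}
}

// runOneChunk stands in for migrating one capped batch of records in a
// single db transaction; it reports whether the migration has finished.
func runOneChunk(db kvdb.Backend) (bool, error) {
	return true, nil
}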


// With the commitment point generated, we can now derive the key ring
// which will be used to generate the output scripts.
keyRing := DeriveCommitmentKeys(
Roasbeef (Member):

If we can perform this same check w/o needing to derive the commitment keys, then we should be able to speed up the migration significantly.

yyforyongyu (Member, Author):

Not sure I understand. I think we need to derive the key ring here to compute the scripts so we can derive the output indexes?

@Roasbeef (Member) commented Jul 1, 2022

So I tried to profile an active run to see what all the goroutines were up to, but I think the usage of internal closures is messing things up: rather than seeing all the stack traces, all I see is:

1090 @ 0x4710a1

@Roasbeef (Member) commented Jul 1, 2022

Will do a deeper dive tomorrow, but something is up w/ the progress logging, and due to the issue I mentioned above, I'm not able to get a proper stack trace or profile to see what's up:

2022-06-30 23:10:45.215 [INF] CHDB: Performing database optional migration: prune revocation log
2022-06-30 23:10:45.216 [INF] CHDB: Migrating revocation logs, might take a while...
2022-06-30 23:10:45.224 [INF] CHDB: Total un-migrated logs: 59661
2022-06-30 23:11:21.543 [INF] CHDB: Migration progress: 33.523%, still have: 39661
2022-06-30 23:11:58.590 [INF] CHDB: Migration progress: 67.045%, still have: 19661

It's been sitting there for 30 mins or so now as well. FWIW, this is a pretty tiny box (1 vCPU), 2 GB of RAM.

It also picked back up from zero when it was restarted prior:

2022-06-30 22:16:44.410 [INF] CHDB: Performing database optional migration: prune revocation log
2022-06-30 22:16:44.411 [INF] CHDB: Migrating revocation logs, might take a while...
2022-06-30 22:16:44.419 [INF] CHDB: Total un-migrated logs: 59661
2022-06-30 22:17:19.624 [INF] CHDB: Migration progress: 33.523%, still have: 39661
2022-06-30 22:17:54.893 [INF] CHDB: Migration progress: 67.045%, still have: 19661
2022-06-30 23:10:17.330 [INF] LTND: Received interrupt
2022-06-30 23:10:17.332 [INF] LTND: Shutting down...
2022-06-30 23:10:17.333 [INF] LTND: Gracefully shutting down.

FWIW, this is a mega old node. I think it turned 5 earlier this year...

This commit adds relevant code from revocation_log.go and the lnwallet
package. The code is needed to migrate the data, and we choose to copy
the code instead of importing it to preserve the version, such that a
future change won't affect the current migration. An alternative would
be tagging each of the imported packages.

This commit adds several utility functions to assist the migration. In
particular, an updateLocator is added to give us the next un-migrated
position in our buckets. This locator helps us continue the job in case
of an interrupted migration. It also serves as an indicator of whether
the migration is finished.

This commit adds supporting functions that will be used in the unit
tests. The testing data is also added as hard-coded values. We choose to
copy most of the testing data from our itest results so that a) they are
"real" data that can be used to calculate scripts, and b) we preserve
the results generated by the current code so a future change won't
affect our tests.

This commit adds the migration that's used to convert the old revocation
logs into the new format. The migration is fault-tolerant, meaning the
process can be interrupted and the migration will pick up where it left
off when running again. We also cap how many records are processed in
each db transaction to prevent OOM.

This commit enables the db to run optional migrations that are specified
by config flags. To achieve this, an optional meta is introduced to
manage the optional migrations. We keep the two types of migrations
separate so they're easier to manage, since otherwise a future mandatory
migration could make it hard to determine the db version.

This commit changes how we locate the next migration height by including
the scenario where `lnd@v0.15.0` is active. In the new version, we will
see a mix of new and old logs under the same open channel bucket. Hence,
we need to alter how we locate the next un-migrated height.

This commit adds a new method `validateMigration` to be used prior to
the deletion of the old buckets, to be extra cautious.
@yyforyongyu (Member, Author) commented Jul 12, 2022

It's been sitting there for 30 mins or so now as well. FWIW, this is a pretty tiny box (1 vCPU), 2 GB of RAM.

Here's what went wrong:

  • Previously the migration had a defense check when locating the next migration height: if the height key found in the new bucket was greater than expected, the bucket was considered corrupted and the migration would be applied again.
  • This was fine prior to v0.15.0, as no new-format logs were expected, so the heights saved in the new buckets couldn't be greater than the old un-migrated heights.
  • With v0.15.0 things are different: new logs are already saved to the new buckets, and their heights must be greater.
  • This causes an infinite loop: 1) the new heights are greater -> 2) the current migration is redone.
  • This is now fixed by taking the above situation into account (a rough sketch follows below).
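A rough illustration of the kind of check involved (not the actual fix in this PR; the bucket handles and sentinel error are made up):

package sketch

import (
	"errors"

	"github.com/lightningnetwork/lnd/kvdb"
)

var errFoundHeight = errors.New("found next un-migrated height")

// findNextMigrateHeight looks for the first height that exists in the old
// log bucket but not yet in the new one. Logs written natively into the new
// bucket by lnd@v0.15.0 are never visited here, so a higher height there no
// longer looks like corruption.
func findNextMigrateHeight(oldLogs, newLogs kvdb.RBucket) ([]byte, error) {
	var next []byte

	err := oldLogs.ForEach(func(height, _ []byte) error {
		// Already migrated: keep looking.
		if newLogs != nil && newLogs.Get(height) != nil {
			return nil
		}

		// First un-migrated height found; copy it and stop the walk.
		next = append([]byte(nil), height...)
		return errFoundHeight
	})
	if err != nil && !errors.Is(err, errFoundHeight) {
		return nil, err
	}

	return next, nil
}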

Also tested with the channel.db file you sent; it can now finish the whole thing. The dummy unit test script:

package migration // place this test in the package that defines MigrateRevocationLog and UseLogger.

import (
	"testing"

	"github.com/btcsuite/btclog"
	"github.com/lightningnetwork/lnd/build"
	"github.com/lightningnetwork/lnd/kvdb"
	"github.com/stretchr/testify/require"
)

func TestFoo(t *testing.T) {
	dbPath := "" // point this to the channel.db file.
	db, err := kvdb.Open(
		kvdb.BoltBackendName, dbPath, true, kvdb.DefaultDBTimeout,
	)
	// Check the open error before deferring Close.
	require.NoError(t, err)
	defer db.Close()

	backend := btclog.NewBackend(&build.LogWriter{})
	logger := backend.Logger("test")
	logger.SetLevel(btclog.LevelDebug)
	UseLogger(logger)

	err = MigrateRevocationLog(db)
	require.NoError(t, err)
}

@Roasbeef (Member):

@yyforyongyu great work! Confirmed that I was able to successfully rerun the migration locally. This time around the entire thing was done in less than a minute or so. Now that this is fixed, I'll circle back again re review+testing for this PR.

@lightninglabs-deploy:

@bhandras: review reminder
@Roasbeef: review reminder
@yyforyongyu, remember to re-request review from reviewers when ready

@Roasbeef (Member) left a comment

LGTM 🧜‍♀️

The latest diff fixed the issue on my testnet node, and I've run this on one of my mainnet nodes successfully!

@bhandras (Collaborator) left a comment

LGTM 🎉 Some parts of the code are new to me, so I tried to do a more code-oriented review; there's not much to add, however. Beautiful work 🥇

// We use the remote's channel config for the csv delay.
theirDelay := uint32(chanState.RemoteChanCfg.CsvDelay)

// If we are the initiator of this channel, then it's be false from the
bhandras (Collaborator):

nit: it's => it'll

}

return nil
}, func() {})
bhandras (Collaborator):

Even though we don't know of people running nodes on etcd, if they do, it's safer to reset any external state properly. For example, in this current block we update the om.Versions map, which is external to the closure.
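For example, something along these lines (a sketch only; the optionalMeta type and field here are illustrative, not the PR's exact definitions):

package sketch

import "github.com/lightningnetwork/lnd/kvdb"

// optionalMeta is an illustrative stand-in for the optional migration meta.
type optionalMeta struct {
	Versions map[uint64]string
}

// markMigrationsDone records the applied optional migrations. Because the
// update closure can be retried on replicated backends such as etcd, the
// reset callback restores the in-memory Versions map to its original state.
func markMigrationsDone(db kvdb.Backend, om *optionalMeta, applied []uint64) error {
	// Snapshot the map before mutating it inside the closure.
	backup := make(map[uint64]string, len(om.Versions))
	for k, v := range om.Versions {
		backup[k] = v
	}

	return kvdb.Update(db, func(tx kvdb.RwTx) error {
		for _, version := range applied {
			om.Versions[version] = "prune_revocation_log"
			// ... persist om into its bucket here ...
		}
		return nil
	}, func() {
		// Restore the external state before the closure is retried.
		restored := make(map[uint64]string, len(backup))
		for k, v := range backup {
			restored[k] = v
		}
		om.Versions = restored
	})
}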

@Roasbeef merged commit 115b041 into lightningnetwork:master on Aug 5, 2022
@yyforyongyu deleted the 5388-migration branch on August 5, 2022 07:37
@C-Otto (Contributor) commented Aug 5, 2022

For those who are curious: I was able to run the migration (via the TestFoo script above) on a copy of my channel.db (16 GByte, c-otto.de). It took 249 seconds on my Intel i5 12600k for around 3.7 million logs.

@rkfg commented Aug 5, 2022

Out of curiosity, what was the size of the database after migration? My db is only 3.8 GB, so compacting it probably isn't an emergency; I'll wait for more reports.

@C-Otto (Contributor) commented Aug 5, 2022

[INF] CHDB: DB compaction of /.../channel.db successful, 16960901120 -> 4039794688 bytes (gain=4.20x)

15.8 GByte down to 3.8 GByte

@rkfg commented Aug 5, 2022

Nice, though not as drastic as in #6469 (comment). I wonder why... has something changed since then? Maybe the final version of the PR prunes less data?

@C-Otto (Contributor) commented Aug 5, 2022

My database didn't include a lot of other cruft (that a regular compaction would remove). Also, I've been using lnd 0.15 for a while now, so several newer channels don't benefit from this patch.

@yyforyongyu (Member, Author):

I wonder why... has something changed since that? Maybe the final version of the PR prunes less data?

Nope, it's the same. You may not see such a gain because you probably have lots of payments and invoices, which can take up some disk space.
