channeldb: add optional migration to prune revocation logs #6469

Merged: 13 commits merged into lightningnetwork:master on Aug 5, 2022

Conversation

@yyforyongyu (Member) commented Apr 27, 2022

This is a follow-up PR on #6347.

This PR introduces an optional migration that's fault-tolerant. Since the potential scale is quite large, we perform the update across multiple db transactions to prevent OOM issues. We cannot safely assume the process won't be interrupted, so we need to be able to pick up where we left off and continue the migration if that does happen.

This PR also introduces an optional meta that's similar to the existing meta but stores info about optional migrations. We need to manage the two separately because otherwise, once a future mandatory migration is applied, the user would have trouble applying the optional migration.
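Roughly, the chunked, resumable approach looks something like the following sketch (illustration only, not the code in this PR; the bucket names, progress key, and migrateOne helper are made up):

package sketch

import (
	"github.com/lightningnetwork/lnd/kvdb"
)

var (
	oldLogBucket = []byte("revocation-log-old") // hypothetical name
	newLogBucket = []byte("revocation-log-new") // hypothetical name
	progressKey  = []byte("migration-progress") // hypothetical name
)

// recordsPerTx caps how many logs are migrated in a single db transaction so
// a huge database never needs one mega transaction.
const recordsPerTx = 1000

// migrateInChunks migrates the logs in multiple small transactions and
// persists a progress marker, so an interrupted run resumes where it left
// off on the next start.
func migrateInChunks(db kvdb.Backend) error {
	done := false
	for !done {
		err := kvdb.Update(db, func(tx kvdb.RwTx) error {
			oldBucket := tx.ReadWriteBucket(oldLogBucket)
			if oldBucket == nil {
				done = true
				return nil
			}
			newBucket, err := tx.CreateTopLevelBucket(newLogBucket)
			if err != nil {
				return err
			}

			// Resume after the last key we migrated, if any.
			cursor := oldBucket.ReadWriteCursor()
			k, v := cursor.First()
			if last := newBucket.Get(progressKey); last != nil {
				cursor.Seek(last)
				k, v = cursor.Next()
			}

			// Migrate at most recordsPerTx entries in this tx.
			for i := 0; i < recordsPerTx; i++ {
				if k == nil {
					done = true
					return nil
				}
				if err := migrateOne(newBucket, k, v); err != nil {
					return err
				}
				if err := newBucket.Put(progressKey, k); err != nil {
					return err
				}
				k, v = cursor.Next()
			}
			return nil
		}, func() {
			// Reset in-memory state in case the closure is retried.
			done = false
		})
		if err != nil {
			return err
		}
	}
	return nil
}

// migrateOne stands in for converting one old log into the new format.
func migrateOne(newBucket kvdb.RwBucket, k, v []byte) error {
	return newBucket.Put(k, v)
}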

@yyforyongyu (Member, Author) commented Apr 27, 2022

Here are some benchmark results. Maybe there's something wrong with the setup, but the migration is really slow.

Benchmark for 2000 logs

    2000     214401347 ns/op     7181999 B/op      92167 allocs/op
PASS
ok      github.com/lightningnetwork/lnd/channeldb/migration26   576.254s



Benchmark for 5000 logs

    5000     250081490 ns/op     9413385 B/op     125960 allocs/op
PASS
ok      github.com/lightningnetwork/lnd/channeldb/migration26   1549.976s

Most of the time is spent deriving keys and computing scripts to find the output indexes, which is slow. A progress log has been added to help track the process.
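For reference, the progress log is just a percentage over the total log count, along the lines of this small helper (illustrative only; lnd uses its own logger):

package sketch

import "log"

// logProgress prints how far the migration has come, similar to the
// "Migration progress: X%, still have: N" lines shown later in this thread.
// The standard library logger is used here only for brevity.
func logProgress(total, remaining int) {
	migrated := total - remaining
	progress := float64(migrated) / float64(total) * 100
	log.Printf("Migration progress: %.3f%%, still have: %d",
		progress, remaining)
}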

@yyforyongyu force-pushed the 5388-migration branch 2 times, most recently from 28c2169 to 31ff6b7 on April 28, 2022 13:33
@yyforyongyu force-pushed the 5388-migration branch 2 times, most recently from df3e677 to 334f458 on May 1, 2022 07:53
@yyforyongyu (Member, Author):

Decided to put the script computation inside a goroutine, and with a bug fixed in the iterator, the migration is much faster now.

Building test data for 1000 logs...
Finished building test data, migrating...
    1000        195708 ns/op       11526 B/op        178 allocs/op
PASS
ok      github.com/lightningnetwork/lnd/channeldb/migration26   0.626s

Building test data for 5000 logs...
Finished building test data, migrating...
    5000         82152 ns/op       11615 B/op        177 allocs/op
PASS
ok      github.com/lightningnetwork/lnd/channeldb/migration26   0.873s


Building test data for 1000000 logs...
Finished building test data, migrating...
 1000000         56819 ns/op       11370 B/op        175 allocs/op
PASS
ok      github.com/lightningnetwork/lnd/channeldb/migration26   71.224s
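For illustration, the goroutine-based speed-up described above amounts to fanning the CPU-heavy script computation out to a worker pool, along these lines (a rough sketch; logEntry and computeScripts are made-up stand-ins, not this PR's actual code):

package sketch

import (
	"runtime"
	"sync"
)

// logEntry pairs a raw revocation log with the scripts computed for it.
type logEntry struct {
	key     []byte
	rawLog  []byte
	scripts [][]byte
	err     error
}

// computeAllScripts fans the expensive key-derivation/script computation out
// to a bounded pool of goroutines, so it no longer dominates the migration
// time while the db transaction only does cheap reads and writes.
func computeAllScripts(entries []*logEntry) {
	var wg sync.WaitGroup
	sem := make(chan struct{}, runtime.NumCPU())

	for _, e := range entries {
		wg.Add(1)
		sem <- struct{}{}
		go func(e *logEntry) {
			defer wg.Done()
			defer func() { <-sem }()
			e.scripts, e.err = computeScripts(e.rawLog)
		}(e)
	}
	wg.Wait()
}

// computeScripts stands in for deriving the commitment keys and building the
// output scripts for one revocation log.
func computeScripts(rawLog []byte) ([][]byte, error) {
	return nil, nil
}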

@Roasbeef (Member) left a comment

Completed a very cursory pass to get a better grasp of the proposed approach and also how the new optional migration scaffolding has been added. Great job on the set of test cases! They're very comprehensive, and I also like that we started to benchmark the migrations as well, to get a handle on how long it'll take a canned node to apply the migration. Ideally in the near future we never need to do such manual migrations again, but this feature is certainly urgently needed.

Will circle back with a deeper review, and also some test runs (the --dry-run-migration flag can be super useful here). Speaking of --dry-run-migration: as is, the migration in this PR isn't strictly compatible with it, since this PR commits its progress in smaller transactions, vs the existing approach where one mega transaction is used for everything.


// Iterate the buckets. If we received an exit signal, return the
// locator.
err := iterateBuckets(openChanBucket, nil, cb)
Roasbeef (Member):

Shouldn't we be passing in a specific node's bucket here rather than the entire open channel bucket? As we need to know the last location for each channel, not the last global key we migrated?

yyforyongyu (Member, Author):

Actually it's designed to locate the global key. Maybe I should've named the functions better, but findNextMigrateHeight is designed to find the next key within the scope of a channel, and locateNextUpdateNum is to find the global position.

If we restart a previously interrupted migration, we need to:

-> iterate the node's buckets until,
    -> iterate the chainhash buckets until,
        -> iterate the channel buckets until,
            -> a channel's next revocation log height is located, which signals the exit.

In other words, a restart means we need to find out which channel inside which node has unfinished jobs and continue the migration there.
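A rough sketch of that walk (illustration only; the real iterator in this PR differs in the details, and the locator fields and sentinel error here are made up):

package sketch

import (
	"errors"

	"github.com/lightningnetwork/lnd/kvdb"
)

// errExitSignal aborts the bucket walk once the next un-migrated channel is
// found.
var errExitSignal = errors.New("found next un-migrated channel")

// updateLocator remembers where the previous run stopped: which node, chain
// hash and channel to resume at, plus the next revocation height to migrate.
type updateLocator struct {
	nodePub    []byte
	chainHash  []byte
	fundingOp  []byte
	nextHeight []byte
}

// locateNextUpdateNum walks node bucket -> chain-hash bucket -> channel
// bucket and stops at the first channel that still has un-migrated logs.
func locateNextUpdateNum(openChanBucket kvdb.RwBucket) (*updateLocator, error) {
	var locator *updateLocator

	err := openChanBucket.ForEach(func(nodePub, _ []byte) error {
		nodeBucket := openChanBucket.NestedReadWriteBucket(nodePub)
		if nodeBucket == nil {
			return nil
		}
		return nodeBucket.ForEach(func(chainHash, _ []byte) error {
			chainBucket := nodeBucket.NestedReadWriteBucket(chainHash)
			if chainBucket == nil {
				return nil
			}
			return chainBucket.ForEach(func(chanPoint, _ []byte) error {
				chanBucket := chainBucket.NestedReadWriteBucket(chanPoint)
				if chanBucket == nil {
					return nil
				}

				// findNextMigrateHeight returns nil once a
				// channel has been fully migrated.
				height := findNextMigrateHeight(chanBucket)
				if height == nil {
					return nil
				}

				locator = &updateLocator{
					nodePub:    nodePub,
					chainHash:  chainHash,
					fundingOp:  chanPoint,
					nextHeight: height,
				}

				// Returning a sentinel error exits the walk early.
				return errExitSignal
			})
		})
	})
	if err != nil && !errors.Is(err, errExitSignal) {
		return nil, err
	}

	// A nil locator means every channel has been migrated.
	return locator, nil
}

// findNextMigrateHeight stands in for finding the next un-migrated
// revocation height within a single channel bucket.
func findNextMigrateHeight(chanBucket kvdb.RwBucket) []byte {
	return nil
}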

@yyforyongyu (Member, Author):

Speaking of --dry-run-migration: as is, the migration in this PR isn't strictly compatible with it, since this PR commits its progress in smaller transactions, vs the existing approach where one mega transaction is used for everything.

Yeah, that's indeed an issue, as the migration relies on what's already written on disk to decide its next move. Maybe we could delete the new buckets if we are in dry-run mode? While it complicates things a bit more, I think we can leverage the reset callback inside kvdb.Update to wipe out data that was previously written.
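Something along these lines could work for the dry-run case (a sketch only; the bucket name is hypothetical, and whether the new bucket can simply be dropped depends on how the real migration is structured):

package sketch

import "github.com/lightningnetwork/lnd/kvdb"

// newLogBucket is a hypothetical name for the bucket the migration writes.
var newLogBucket = []byte("revocation-log-new")

// wipeDryRunState deletes whatever a partial dry-run migration wrote, so the
// db is left exactly as it was before the run.
func wipeDryRunState(db kvdb.Backend) error {
	return kvdb.Update(db, func(tx kvdb.RwTx) error {
		err := tx.DeleteTopLevelBucket(newLogBucket)
		if err != nil && err != kvdb.ErrBucketNotFound {
			return err
		}
		return nil
	}, func() {})
}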

@yyforyongyu force-pushed the 5388-migration branch 2 times, most recently from 4605f4f to f74363d on May 6, 2022 08:40
@Roasbeef added this to the v0.15.1 milestone on May 12, 2022
@yyforyongyu force-pushed the 5388-migration branch 6 times, most recently from 3e58b0d to b3343d3 on May 20, 2022 03:31
@TonyGiorgio (Contributor):

Ran this PR on my 30 GB channel.db that wouldn't compact down any further, and it was able to compact it down to 129 MB. Giving it a thumbs up, thanks 👍

@Roasbeef requested a review from bhandras on June 28, 2022 22:22
cb := func(chanBucket kvdb.RwBucket, l *updateLocator) error {
locator = l
updateNum := findNextMigrateHeight(chanBucket)

Roasbeef (Member):

I think we should log the height here, so we can use it to print some more progress during the migration. Otherwise, users may just see it sitting there for minutes and wonder what's going on.

yyforyongyu (Member, Author):

Added more verbose logs and debug logs.

// migrations will be applied.
//
// NOTE: only support the prune_revocation_log optional migration atm.
func (d *DB) applyOptionalVersions(cfg OptionalMiragtionConfig) error {
Roasbeef (Member):

So we'd thread through that quit channel here to make the migration resumable.

yyforyongyu (Member, Author):

So I guess to achieve this, we'll pass the ctx used in BuildDatabase through BuildDatabase -> CreateWithBackend -> applyOptionalVersions -> version.migration and catch the cancel signal there?
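Conceptually, that would boil down to something like this (a sketch under the assumption that progress is committed per chunk; runOneChunk is a made-up stand-in):

package sketch

import (
	"context"

	"github.com/lightningnetwork/lnd/kvdb"
)

// migrateWithCancel checks the context between db transactions, so a
// shutdown stops cleanly at a chunk boundary and the next startup resumes
// from the persisted progress.
func migrateWithCancel(ctx context.Context, db kvdb.Backend) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
		}

		done, err := runOneChunk(db)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
	}
}

// runOneChunk stands in for migrating one capped batch of records in a
// single db transaction; it reports whether the migration has finished.
func runOneChunk(db kvdb.Backend) (bool, error) {
	return true, nil
}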


// With the commitment point generated, we can now derive the key ring
// which will be used to generate the output scripts.
keyRing := DeriveCommitmentKeys(
Roasbeef (Member):

If we can perform this same check w/o needing to derive the commitment keys, then we should be able to speed up the migration significantly.

yyforyongyu (Member, Author):

Not sure I understand. I think we need to derive the key ring here to compute the scripts so we can derive the output indexes?

@Roasbeef (Member) commented Jul 1, 2022

So I tried to profile an active run to see what all the goroutines were up to, but I think the usage of internal closures is messing things up: rather than seeing all the stack traces, all I see is:

1090 @ 0x4710a1

@Roasbeef (Member) commented Jul 1, 2022

Will do a deeper dive tomorrow, but something is up w/ the progress logging, and due to the issue I mentioned above, I'm not able to get a proper stack trace or profile to see what's up:

2022-06-30 23:10:45.215 [INF] CHDB: Performing database optional migration: prune revocation log
2022-06-30 23:10:45.216 [INF] CHDB: Migrating revocation logs, might take a while...
2022-06-30 23:10:45.224 [INF] CHDB: Total un-migrated logs: 59661
2022-06-30 23:11:21.543 [INF] CHDB: Migration progress: 33.523%, still have: 39661
2022-06-30 23:11:58.590 [INF] CHDB: Migration progress: 67.045%, still have: 19661

It's been sitting there for 30 mins or so now as well. FWIW, this is a pretty tiny box (1 vCPU), 2 GB of RAM.

It also picked back up from zero when it was restarted prior:

2022-06-30 22:16:44.410 [INF] CHDB: Performing database optional migration: prune revocation log
2022-06-30 22:16:44.411 [INF] CHDB: Migrating revocation logs, might take a while...
2022-06-30 22:16:44.419 [INF] CHDB: Total un-migrated logs: 59661
2022-06-30 22:17:19.624 [INF] CHDB: Migration progress: 33.523%, still have: 39661
2022-06-30 22:17:54.893 [INF] CHDB: Migration progress: 67.045%, still have: 19661
2022-06-30 23:10:17.330 [INF] LTND: Received interrupt
2022-06-30 23:10:17.332 [INF] LTND: Shutting down...
2022-06-30 23:10:17.333 [INF] LTND: Gracefully shutting down.

FWIW, this is a mega old node. I think it turned 5 earlier this year...

This commit adds relevant code from revocation_log.go and the lnwallet
package. The code is needed to migrate the data, and we choose to copy
the code instead of importing it to preserve the version, such that a
future change won't affect the current migration. An alternative would
be tagging each of the imported packages.

This commit adds several utility functions to assist the migration. In
particular, an updateLocator is added to give us the next un-migrated
position in our buckets. This locator helps us continue the job in case
of an interrupted migration. It also serves as an indicator of whether
the migration is finished.

This commit adds supporting functions that will be used in the unit
tests. The testing data is also added as hard-coded values. We choose to
copy most of the testing data from our itest results so that a) they are
"real" data that can be used to calculate scripts, and b) we preserve
the results generated by the current code so a future change won't
affect our tests.

This commit adds the migration that's used to convert the old revocation
logs into the new format. The migration is fault-tolerant, meaning the
process can be interrupted and the migration will pick up where it left
off when running again. We also cap how many records are processed in
each db transaction to prevent OOM.

This commit enables the db to run optional migrations that are specified
by config flags. To achieve this, an optional meta is introduced to
manage the optional migrations. We keep the two types of migrations
separate so they're easier to manage, since otherwise a future mandatory
migration could make it hard to determine the db version.

This commit changes how we locate the next migration height by including
the scenario where `lnd@v0.15.0` is active. In the new version, we will
see a mix of new and old logs under the same open channel bucket. Hence,
we need to alter how we locate the next un-migrated height.

This commit adds a new method `validateMigration` to be used prior to
the deletion of the old buckets, to be extra cautious.
@yyforyongyu (Member, Author) commented Jul 12, 2022

It's been sitting there for 30 mins or so now as well. FWIW, this is a pretty tiny box (1 vCPU), 2 GB of RAM.

Here's what went wrong:

  • Previously the migration had a defense check when locating the next migration height: if the height key found in the new bucket was greater than expected, the bucket was considered corrupted and the migration would be applied again.
  • This was fine prior to v0.15.0, as no new-format logs were expected, so the heights saved in the new buckets couldn't be greater than the old un-migrated heights.
  • With v0.15.0 things are different: new logs are already saved to the new buckets, and their heights must be greater.
  • This causes an infinite loop: 1) the new heights are greater -> 2) the current migration is redone.
  • This is now fixed by taking the above situation into account (a rough sketch follows below).
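A rough illustration of the kind of check involved (not the actual fix in this PR; the bucket handles and sentinel error are made up):

package sketch

import (
	"errors"

	"github.com/lightningnetwork/lnd/kvdb"
)

var errFoundHeight = errors.New("found next un-migrated height")

// findNextMigrateHeight looks for the first height that exists in the old
// log bucket but not yet in the new one. Logs written natively into the new
// bucket by lnd@v0.15.0 are never visited here, so a higher height there no
// longer looks like corruption.
func findNextMigrateHeight(oldLogs, newLogs kvdb.RBucket) ([]byte, error) {
	var next []byte

	err := oldLogs.ForEach(func(height, _ []byte) error {
		// Already migrated: keep looking.
		if newLogs != nil && newLogs.Get(height) != nil {
			return nil
		}

		// First un-migrated height found; copy it and stop the walk.
		next = append([]byte(nil), height...)
		return errFoundHeight
	})
	if err != nil && !errors.Is(err, errFoundHeight) {
		return nil, err
	}

	return next, nil
}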

Also tested with the channel.db file you sent; it can now finish the whole thing. The dummy unit test script:

package migration // place this test in the package that defines MigrateRevocationLog and UseLogger.

import (
	"testing"

	"github.com/btcsuite/btclog"
	"github.com/lightningnetwork/lnd/build"
	"github.com/lightningnetwork/lnd/kvdb"
	"github.com/stretchr/testify/require"
)

func TestFoo(t *testing.T) {
	dbPath := "" // point this to the channel.db file.
	db, err := kvdb.Open(
		kvdb.BoltBackendName, dbPath, true, kvdb.DefaultDBTimeout,
	)
	// Check the open error before deferring Close.
	require.NoError(t, err)
	defer db.Close()

	backend := btclog.NewBackend(&build.LogWriter{})
	logger := backend.Logger("test")
	logger.SetLevel(btclog.LevelDebug)
	UseLogger(logger)

	err = MigrateRevocationLog(db)
	require.NoError(t, err)
}

@Roasbeef (Member):

@yyforyongyu great work! Confirmed that I was able to successfully rerun the migration locally. This time around the entire thing was done in less than a minute or so. Now that this is fixed, I'll circle back again re review+testing for this PR.

@lightninglabs-deploy:

@bhandras: review reminder
@Roasbeef: review reminder
@yyforyongyu, remember to re-request review from reviewers when ready

@Roasbeef (Member) left a comment

LGTM 🧜‍♀️

The latest diff fixed the issue on my testnet node, and I've run this on one of my mainnet nodes successfully!

@bhandras (Collaborator) left a comment

LGTM 🎉 Some parts of the code are new to me, so I tried to do a more code-oriented review; there's not much to add, however. Beautiful work 🥇

// We use the remote's channel config for the csv delay.
theirDelay := uint32(chanState.RemoteChanCfg.CsvDelay)

// If we are the initiator of this channel, then it's be false from the
bhandras (Collaborator):

nit: it's => it'll

}

return nil
}, func() {})
bhandras (Collaborator):

Even though we don't know of people running nodes on etcd, if they do, it's safer to reset any external state properly. For example, in this current block we update the om.Versions map, which is external to the closure.
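For example, something along these lines (a sketch only; the optionalMeta type and field here are illustrative, not the PR's exact definitions):

package sketch

import "github.com/lightningnetwork/lnd/kvdb"

// optionalMeta is an illustrative stand-in for the optional migration meta.
type optionalMeta struct {
	Versions map[uint64]string
}

// markMigrationsDone records the applied optional migrations. Because the
// update closure can be retried on replicated backends such as etcd, the
// reset callback restores the in-memory Versions map to its original state.
func markMigrationsDone(db kvdb.Backend, om *optionalMeta, applied []uint64) error {
	// Snapshot the map before mutating it inside the closure.
	backup := make(map[uint64]string, len(om.Versions))
	for k, v := range om.Versions {
		backup[k] = v
	}

	return kvdb.Update(db, func(tx kvdb.RwTx) error {
		for _, version := range applied {
			om.Versions[version] = "prune_revocation_log"
			// ... persist om into its bucket here ...
		}
		return nil
	}, func() {
		// Restore the external state before the closure is retried.
		restored := make(map[uint64]string, len(backup))
		for k, v := range backup {
			restored[k] = v
		}
		om.Versions = restored
	})
}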

@Roasbeef merged commit 115b041 into lightningnetwork:master on Aug 5, 2022
@yyforyongyu deleted the 5388-migration branch on August 5, 2022 07:37
@C-Otto (Contributor) commented Aug 5, 2022

For those who are curious: I was able to run the migration (via the TestFoo script above) on a copy of my channel.db (16 GByte, c-otto.de). It took 249 seconds on my Intel i5 12600k for around 3.7 million logs.

@rkfg commented Aug 5, 2022

Out of curiosity, what was the size of the database after migration? My db is only 3.8 GB, so compacting it probably isn't an emergency; I'll wait for more reports.

@C-Otto (Contributor) commented Aug 5, 2022

[INF] CHDB: DB compaction of /.../channel.db successful, 16960901120 -> 4039794688 bytes (gain=4.20x)

15.8 GByte down to 3.8 GByte

@rkfg commented Aug 5, 2022

Nice, though not as drastic as in #6469 (comment). I wonder why... has something changed since then? Maybe the final version of the PR prunes less data?

@C-Otto (Contributor) commented Aug 5, 2022

My database didn't include a lot of other cruft (that a regular compaction would remove). Also, I've been using lnd 0.15 for a while now, so several newer channels don't benefit from this patch.

@yyforyongyu (Member, Author):

I wonder why... has something changed since that? Maybe the final version of the PR prunes less data?

Nope, it's the same. You may not see such a gain because you probably have lots of payments and invoices, which can take up some disk space.
