channeldb: add optional migration to prune revocation logs #6469
Conversation
Here are some benchmark results,
Most of the time is taken deriving keys and computing scripts to find the output indexes, which is slow. A progress log is added to help track the process.
Force-pushed from 28c2169 to 31ff6b7.
Force-pushed from df3e677 to 334f458.
Decided to put the script computation inside a goroutine, plus a bug fixed in
Completed a very cursory pass to get a better grasp of the proposed approach and also how the new optional migration scaffolding has been added. Great job on the set of test cases! They're very comprehensive, and I also like that we started to benchmark the migrations as well, to get a handle on how long it'll take a canned node to apply the migration. Ideally in the near future we never need to do such manual migrations again, but this feature is certainly urgently needed.
Will circle back with a deeper review, and also some test runs (the `--dry-run-migration` flag can be super useful here). Speaking of `--dry-run-migration`, as is, the migration in this PR isn't strictly compatible with that, as the PR will do smaller transactions to commit the progress within, vs the existing approach where one mega transaction is used for everything.
channeldb/migration26/iterator.go (outdated)
```go
// Iterate the buckets. If we received an exit signal, return the
// locator.
err := iterateBuckets(openChanBucket, nil, cb)
```
Shouldn't we be passing in a specific node's bucket here rather than the entire open channel bucket, since we need to know the last location for each channel, not the last global key we migrated?
Actually it's designed to locate the global key. Maybe I should've named the functions better, but `findNextMigrateHeight` is designed to find the next key within the scope of a channel, and `locateNextUpdateNum` is to find the global position.
Suppose we restart a previously interrupted migration; we need to,
-> iterate the node's buckets until,
-> iterate the chainhash buckets until,
-> iterate the channel buckets until,
-> a channel's next revocation log height is located, which signals the exit.
In other words, a restart means we need to find out which channel inside which node has unfinished jobs and continue the migration there.
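For illustration, here is a rough sketch of that nested lookup, assuming lnd's kvdb API and a `findNextMigrateHeight` helper like the one named above; the locator fields, bucket handling, and the `errExitSignal` sentinel are assumptions for this sketch, not necessarily the PR's actual code:

```go
// errExitSignal aborts the bucket walk early once the locator is found.
var errExitSignal = errors.New("exit bucket iteration")

// updateLocator records which node/chain/channel still has un-migrated
// revocation logs, and the next height to migrate within that channel.
type updateLocator struct {
	nodePub    []byte
	chainHash  []byte
	chanPoint  []byte
	nextHeight []byte
}

// locateNextUpdateNum walks node -> chainhash -> channel buckets and returns
// a locator for the first unfinished channel, or nil once everything has
// been migrated.
func locateNextUpdateNum(openChanBucket kvdb.RwBucket) (*updateLocator, error) {
	var locator *updateLocator

	err := openChanBucket.ForEach(func(nodePub, _ []byte) error {
		nodeBucket := openChanBucket.NestedReadWriteBucket(nodePub)
		if nodeBucket == nil {
			return nil
		}

		return nodeBucket.ForEach(func(chainHash, _ []byte) error {
			chainBucket := nodeBucket.NestedReadWriteBucket(chainHash)
			if chainBucket == nil {
				return nil
			}

			return chainBucket.ForEach(func(chanPoint, _ []byte) error {
				chanBucket := chainBucket.NestedReadWriteBucket(chanPoint)
				if chanBucket == nil {
					return nil
				}

				// findNextMigrateHeight returns nil when this
				// channel has been fully migrated.
				height := findNextMigrateHeight(chanBucket)
				if height == nil {
					return nil
				}

				locator = &updateLocator{
					nodePub:    nodePub,
					chainHash:  chainHash,
					chanPoint:  chanPoint,
					nextHeight: height,
				}

				// Found the resume point, stop iterating.
				return errExitSignal
			})
		})
	})

	if err != nil && !errors.Is(err, errExitSignal) {
		return nil, err
	}

	return locator, nil
}
```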
Yeah, that's indeed an issue, as the migration relies on what's already written on disk to decide its next move. Maybe we could delete the new buckets if we are in dry-run mode? While it complicates things more, I think we can leverage the reset callback inside
Force-pushed from 4605f4f to f74363d.
Force-pushed from 3e58b0d to b3343d3.
Ran this PR on my 30GB channel.db that wouldn't compact down any further, and it was able to compact it down to 129MB. Giving a thumbs up, thanks 👍
channeldb/migration29/iterator.go (outdated)
```go
cb := func(chanBucket kvdb.RwBucket, l *updateLocator) error {
	locator = l
	updateNum := findNextMigrateHeight(chanBucket)
```
I think we should log the height here, so we can use it to print some more progress during the migration. Otherwise, users may just see it sitting there for minutes and wonder what's going on.
Added more verbose logs and debug logs.
```go
// migrations will be applied.
//
// NOTE: only support the prune_revocation_log optional migration atm.
func (d *DB) applyOptionalVersions(cfg OptionalMiragtionConfig) error {
```
So we'd thread through that quit channel here to make the migration resumable.
So I guess to achieve this, we'll pass the `ctx` used in `BuildDatabase` through `BuildDatabase -> CreateWithBackend -> applyOptionalVersions -> version.migration` and catch the cancel signal there?
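As a sketch of that idea (illustrative only; `migrateOneBatch` and the signatures here are assumptions, not the PR's final API), the per-batch loop inside the migration would check the threaded-through context between batches:

```go
// migrateWithCancel runs the migration batch by batch, bailing out cleanly
// when the context passed down from BuildDatabase is cancelled. Because the
// migration is resumable, stopping between batches loses no work.
func migrateWithCancel(ctx context.Context, db kvdb.Backend) error {
	for {
		select {
		case <-ctx.Done():
			// Shutdown requested; the next run picks up here.
			return ctx.Err()
		default:
		}

		// migrateOneBatch is a hypothetical helper that commits one
		// capped db transaction and reports whether any work is left.
		done, err := migrateOneBatch(db)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
	}
}
```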
channeldb/migration29/lnwallet.go (outdated)
```go
// With the commitment point generated, we can now derive the king ring
// which will be used to generate the output scripts.
keyRing := DeriveCommitmentKeys(
```
If we can perform this same check w/o needing to derive the commitment keys, then we should be able to speed up the migration significantly.
Not sure I understand. I think we need to derive the key ring here to compute the scripts so we can derive the output indexes?
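For context, the script-to-index matching half of that lookup can be sketched generically, assuming btcd's `wire` and lnd's `input` packages; this is an illustration, not the migration's actual code. Once a witness script has been computed from the derived key ring, the output index is found by hashing it into a P2WSH pkScript and matching against the commitment transaction's outputs:

```go
// findOutputIndex returns the index of the commitment output whose pkScript
// commits to the given witness script, or an error if no output matches.
func findOutputIndex(commitTx *wire.MsgTx, witnessScript []byte) (int32, error) {
	// P2WSH pkScript: OP_0 <sha256(witnessScript)>.
	pkScript, err := input.WitnessScriptHash(witnessScript)
	if err != nil {
		return -1, err
	}

	for i, txOut := range commitTx.TxOut {
		if bytes.Equal(txOut.PkScript, pkScript) {
			return int32(i), nil
		}
	}

	return -1, errors.New("no matching output on the commitment tx")
}
```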
So I tried to profile an active run to see what all the goroutines were up to, but I think the usage of internal closures is messing things up; rather than seeing all the stack traces, all I see instead is:
Will do a deeper dive tomorrow, but something is up w/ the progress logging, and due to the issue I mentioned above, I'm not able to get a proper stack trace or profile to see what's up:
It's been sitting there for 30 mins or so now as well. FWIW, this is a pretty tiny box (1 vCPU, 2 GB of RAM). It also picked back up from zero when it was restarted prior:
FWIW, this is a mega old node. I think it turned 5 earlier this year...
This commit adds relevant code from revocation_log.go and the lnwallet package. The code is needed to migrate the data, and we choose to copy the code instead of importing it to preserve the version, such that a future change won't affect the current migration. An alternative would be tagging each of the imported packages.
Force-pushed from b3343d3 to 262bf43.
This commit adds several utility functions to assist the migration. In particular, an updateLocator is added to give us the next un-migrated position in our buckets. This locator helps us continue the job in case of an interrupted migration. It also serves as an indicator of whether the migration is finished.
This commit adds supporting functions that will be used in the unit tests. The testing data are also hard-coded. We choose to copy most of the testing data from our itest results such that a) they are "real" data that can be used to calculate scripts, and b) we preserve the results generated by the current code so a future change won't affect our tests.
This commit adds the migration that's used to convert the old revocation logs into the new format. The migration is fault-tolerant, meaning the process can be interrupted and the migration will pick up what's left when running again. We also cap how many records are processed in each db transaction to prevent OOM.
This commit enables the db to run optional migrations that are specified by config flags. To achieve this, an optional meta is introduced to manage the optional migrations. We distinguish the two types of migrations because, if we didn't, a future mandatory migration could make it difficult to determine the db version.
This commit changes how we locate the next migration height by including the scenario where `lnd@v0.15.0` is active. In the new version, we will see a mix of new and old logs under the same open channel bucket. Hence, we need to alter how we locate the next un-migrated height.
This commit adds a new method `validateMigration` to be used prior to the deletion of the old buckets, to be extra cautious.
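Putting the commit descriptions above together, here is a hedged sketch of the batched, resumable loop; the record cap, helper names (`locateNextUpdateNum`, `migrateBatch`, `openChannelBucket`), and bucket handling are illustrative assumptions, not the PR's actual constants or code:

```go
// recordsPerTx caps how many revocation logs are converted per db
// transaction so a single commit never balloons in memory.
const recordsPerTx = 1000 // illustrative value

func pruneRevocationLogs(db kvdb.Backend) error {
	finished := false

	for !finished {
		err := kvdb.Update(db, func(tx kvdb.RwTx) error {
			// Resume from wherever the previous batch (or a
			// previously interrupted run) left off on disk.
			// openChannelBucket stands in for the top-level
			// open channel bucket key.
			locator, err := locateNextUpdateNum(
				tx.ReadWriteBucket(openChannelBucket),
			)
			if err != nil {
				return err
			}

			// A nil locator means every channel is done.
			if locator == nil {
				finished = true
				return nil
			}

			// Convert at most recordsPerTx old logs into the new
			// format; progress is committed with this transaction.
			return migrateBatch(tx, locator, recordsPerTx)
		}, func() {
			// Reset in-memory state if the transaction is retried.
			finished = false
		})
		if err != nil {
			return err
		}
	}

	return nil
}
```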
Here's what went wrong:
Also tested with the following:

```go
func TestFoo(t *testing.T) {
	dbPath := "" // Point this to the channel.db file.

	// Open the bolt backend; check the error before deferring Close so we
	// don't call Close on a nil handle if Open fails.
	db, err := kvdb.Open(
		kvdb.BoltBackendName, dbPath, true, kvdb.DefaultDBTimeout,
	)
	require.NoError(t, err)
	defer db.Close()

	// Wire up a debug-level logger so the migration's progress logs are
	// visible while it runs.
	backend := btclog.NewBackend(&build.LogWriter{})
	logger := backend.Logger("test")
	logger.SetLevel(btclog.LevelDebug)
	UseLogger(logger)

	// Run the revocation log migration against the opened db.
	err = MigrateRevocationLog(db)
	require.NoError(t, err)
}
```
Force-pushed from 262bf43 to d391514.
@yyforyongyu great work! Confirmed that I was able to successfully rerun the migration locally. This time around, the entire thing was done in less than a minute or so. Now that this is fixed, I'll circle back again re: review + testing for this PR.
@bhandras: review reminder
LGTM 🧜♀️
The latest diff fixed the issue on my testnet node, and I've run this on one of my mainnet nodes successfully!
LGTM 🎉 Some parts of the code are new to me, so I tried to do a more code-oriented review; however, there's not much to add. Beautiful work 🥇
channeldb/migration29/lnwallet.go (outdated)
```go
// We use the remote's channel config for the csv delay.
theirDelay := uint32(chanState.RemoteChanCfg.CsvDelay)

// If we are the initiator of this channel, then it's be false from the
```
nit: it's => it'll
```go
	}

	return nil
}, func() {})
```
Even though we don't know of people running nodes on etcd, if they do, it's safer if we reset any external state properly. For example, in this current block we update the `om.Versions` map, which is external to the closure.
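A minimal sketch of that suggestion, with illustrative names (`optionalMeta`, `om.Versions`, and the bucket key are stand-ins, not necessarily the PR's actual types): whatever the update closure mutates outside the db transaction gets undone in the reset callback, so an etcd STM retry never sees the half-applied change.

```go
// optionalMeta is a hypothetical stand-in for the structure tracking which
// optional migrations have been applied.
type optionalMeta struct {
	Versions map[uint64]string
}

// markOptionalApplied persists an applied optional migration and mirrors it
// in memory, undoing the in-memory write if the transaction is retried.
func markOptionalApplied(db kvdb.Backend, om *optionalMeta, version uint64,
	name string) error {

	return kvdb.Update(db, func(tx kvdb.RwTx) error {
		bucket, err := tx.CreateTopLevelBucket([]byte("optional-meta"))
		if err != nil {
			return err
		}

		var versionBytes [8]byte
		binary.BigEndian.PutUint64(versionBytes[:], version)
		if err := bucket.Put([]byte(name), versionBytes[:]); err != nil {
			return err
		}

		// This map lives outside the closure, so it's the external
		// state the reset callback below has to restore.
		om.Versions[version] = name

		return nil
	}, func() {
		// Reset callback: runs before the transaction is retried.
		delete(om.Versions, version)
	})
}
```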
For those who are curious: I was able to run the migration (via the
Out of curiosity, what was the size of the database after migration? My db is only 3.8 GB, so compacting it probably isn't an emergency; I'll wait for more reports.
15.8 GB down to 3.8 GB.
Nice, though not as drastic as in #6469 (comment), I wonder why... has something changed since then? Maybe the final version of the PR prunes less data?
My database didn't include a lot of other cruft (that a regular compaction would remove). Also, I've been using lnd 0.15 for a while now, so several newer channels don't benefit from this patch.
Nope, it's the same. You may not see such a gain because you probably have lots of payments and invoices, which can take up some disk space.
This is a follow-up PR on #6347.
This PR introduces an optional migration that's fault-tolerant. Since the potential scale is quite large, to prevent OOM we need to perform the update across multiple db transactions. We cannot safely assume the process won't be interrupted, thus we need to be able to pick up where we left off and continue the migration if that does happen.
This PR also introduces an optional meta that's similar to meta but saves the optional migration info. We need to separate the management of the two because if we don't, after a future mandatory migration is applied, the user would have trouble applying the optional migration.