
lnd v0.12.1-beta consumes 3-4 GB against 1-2 GB of lnd v0.11 #5066

Closed · LNBIG-COM opened this issue Mar 3, 2021 · 26 comments

@LNBIG-COM commented Mar 3, 2021

I noticed something strange.

I upgraded 5 servers to LND v0.12.1-beta, and on all 5 the lnd process now sits at 3-4 GB of resident memory.
Other servers still on LND v0.11.* sit at 1-1.5 GB of resident memory.

Is this expected with the new version?

@gkrizek (Contributor) commented Mar 3, 2021

I saw this issue and it made me want to check my node. 0.12.1 has not been kind to me either: CPU is actually running way hotter, and I've got a memory leak. I'll do some profiles a little later. My upgrade was from 0.12.0 to 0.12.1.
[screenshot: Screen Shot 2021-03-03 at 1 46 44 PM]

Logs: tons of entries like the following:

2021-03-03 19:35:33.374 [INF] DISC: GossipSyncer(02a140fa83a6fec74c748b642950bb22ac28992352ab9f6a4695577270488d4bfa): fetching chan anns for 7 chans
2021-03-03 19:35:33.323 [INF] DISC: GossipSyncer(02a140fa83a6fec74c748b642950bb22ac28992352ab9f6a4695577270488d4bfa): sending final chan range chunk, size=7620
2021-03-03 19:35:33.320 [INF] DISC: GossipSyncer(02a140fa83a6fec74c748b642950bb22ac28992352ab9f6a4695577270488d4bfa): sending range chunk of size=7999
2021-03-03 19:35:33.317 [INF] DISC: GossipSyncer(02a140fa83a6fec74c748b642950bb22ac28992352ab9f6a4695577270488d4bfa): sending range chunk of size=7997
2021-03-03 19:35:33.314 [INF] DISC: GossipSyncer(02a140fa83a6fec74c748b642950bb22ac28992352ab9f6a4695577270488d4bfa): sending range chunk of size=8000
2021-03-03 19:35:33.312 [INF] DISC: GossipSyncer(02a140fa83a6fec74c748b642950bb22ac28992352ab9f6a4695577270488d4bfa): sending range chunk of size=8000
2021-03-03 19:35:33.293 [INF] DISC: GossipSyncer(02a140fa83a6fec74c748b642950bb22ac28992352ab9f6a4695577270488d4bfa): filtering chan range: start_height=0, num_blocks=673029
2021-03-03 19:35:17.310 [INF] CRTR: Processed channels=0 updates=59 nodes=6 in last 59.999938893s
2021-03-03 19:35:13.172 [INF] DISC: GossipSyncer(025134c77d8c8c5b16a31020bda9f081bc9547120c3e2846c05c73738afd93c4f9): fetching chan anns for 8 chans
2021-03-03 19:35:13.007 [INF] DISC: GossipSyncer(025134c77d8c8c5b16a31020bda9f081bc9547120c3e2846c05c73738afd93c4f9): sending final chan range chunk, size=7620
2021-03-03 19:35:13.004 [INF] DISC: GossipSyncer(025134c77d8c8c5b16a31020bda9f081bc9547120c3e2846c05c73738afd93c4f9): sending range chunk of size=7999
2021-03-03 19:35:13.002 [INF] DISC: GossipSyncer(025134c77d8c8c5b16a31020bda9f081bc9547120c3e2846c05c73738afd93c4f9): sending range chunk of size=7997
2021-03-03 19:35:12.999 [INF] DISC: GossipSyncer(025134c77d8c8c5b16a31020bda9f081bc9547120c3e2846c05c73738afd93c4f9): sending range chunk of size=8000
2021-03-03 19:35:12.996 [INF] DISC: GossipSyncer(025134c77d8c8c5b16a31020bda9f081bc9547120c3e2846c05c73738afd93c4f9): sending range chunk of size=8000
2021-03-03 19:35:12.978 [INF] DISC: GossipSyncer(025134c77d8c8c5b16a31020bda9f081bc9547120c3e2846c05c73738afd93c4f9): filtering chan range: start_height=0, num_blocks=673029
2021-03-03 19:34:37.903 [INF] DISC: Broadcasting 108 new announcements in 11 sub batches

@gkrizek (Contributor) commented Mar 3, 2021

Here are some profiles and logs.

lnd-0121-profile.zip

@Roasbeef (Member) commented Mar 3, 2021

@gkrizek have you tweaked any settings related to pinned syncers or the historical sync interval (i.e., gossip-sync-related params)?
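
For reference, these are the sorts of lnd.conf lines I mean. A minimal sketch, with option names as they appear in sample-lnd.conf (double-check against your lnd version); the pinned-syncer pubkey is only a placeholder:

# number of peers to actively sync the channel graph with
numgraphsyncpeers=3

# how often a full historical gossip sync is performed
historicalsyncinterval=1h

[gossip]
# peers to keep permanently gossip-synced ("pinned" syncers)
gossip.pinned-syncers=<peer_pubkey>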

@gkrizek (Contributor) commented Mar 3, 2021

@Roasbeef I can say for sure when I get back to my computer but I don't remember ever adjusting gossip settings.

@gkrizek (Contributor) commented Mar 4, 2021

@Roasbeef I just confirmed I have no settings for gossip. Is there additional information I can give for investigation?

@gkrizek (Contributor) commented Mar 4, 2021

Adding some more info here. This node has about 400 channels. I have another node (also running 0.12.1) with 5 channels that doesn't experience this problem, so I wonder if the problem gets worse with the number of channels. I also tried restarting this node; that cleared out the memory, obviously, but CPU shot right back up and you can see memory slowly starting to climb.

[screenshot: Screen Shot 2021-03-03 at 11 25 24 PM]

@gkrizek (Contributor) commented Mar 6, 2021

Anything I can do to assist in debugging this? I have another server that runs lots of small nodes, and it is running out of memory as well. You can see the upgrade to 0.12.1 on 03/01:

[screenshot: Screen Shot 2021-03-05 at 10 42 40 PM]

I enabled profiling on one of those small nodes, but it will be a few days before it starts piling up memory. It seems like the memory leak is widespread, but I think the really high CPU usage is only on my 400+ channel node. I can confirm in the coming days, but here's a goroutine profile from the large node that looks troubling:

goroutine profile: total 8253
798 @ 0x47220 0x7591c 0x758e9 0x8c000 0xa56de0 0x79454
#	0x758e8		sync.runtime_notifyListWait+0xc8					/usr/local/go/src/runtime/sema.go:513
#	0x8bfff		sync.(*Cond).Wait+0xaf							/usr/local/go/src/sync/cond.go:56
#	0xa56ddf	github.com/lightningnetwork/lnd/peer.(*msgStream).msgConsumer+0x12f	/go/src/github.com/getvoltage/lnd/peer/brontide.go:1058

525 @ 0x47220 0x57000 0x6d2458 0x714710 0x79454
#	0x6d2457	github.com/lightningnetwork/lnd/routing.(*ValidationBarrier).WaitForDependants+0x187		/go/src/github.com/getvoltage/lnd/routing/validation_barrier.go:204
#	0x71470f	github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).networkHandler.func1+0x9f	/go/src/github.com/getvoltage/lnd/discovery/gossiper.go:1009

430 @ 0x47220 0x57000 0x6cf678 0x706eb8 0x714868 0x79454
#	0x6cf677	github.com/lightningnetwork/lnd/routing.(*ChannelRouter).AddEdge+0x117					/go/src/github.com/getvoltage/lnd/routing/router.go:2213
#	0x706eb7	github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).processNetworkAnnouncement+0x4ef7	/go/src/github.com/getvoltage/lnd/discovery/gossiper.go:1694
#	0x714867	github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).networkHandler.func1+0x1f7		/go/src/github.com/getvoltage/lnd/discovery/gossiper.go:1027

413 @ 0x47220 0x57000 0x485234 0x79454
#	0x485233	github.com/lightningnetwork/lnd/queue.(*ConcurrentQueue).start.func1+0x573	/go/src/github.com/getvoltage/lnd/queue/queue.go:68

404 @ 0x47220 0x40764 0x73d94 0xd4f08 0xd5e80 0xd5e65 0x1abbf4 0x1bcfd4 0xb3748 0x737cc4 0x737c8d 0x733174 0xa56774 0xa57b54 0x79454
#	0x73d93		internal/poll.runtime_pollWait+0x43					/usr/local/go/src/runtime/netpoll.go:222
#	0xd4f07		internal/poll.(*pollDesc).wait+0x37					/usr/local/go/src/internal/poll/fd_poll_runtime.go:87
#	0xd5e7f		internal/poll.(*pollDesc).waitRead+0x15f				/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#	0xd5e64		internal/poll.(*FD).Read+0x144						/usr/local/go/src/internal/poll/fd_unix.go:159
#	0x1abbf3	net.(*netFD).Read+0x43							/usr/local/go/src/net/fd_posix.go:55
#	0x1bcfd3	net.(*conn).Read+0x73							/usr/local/go/src/net/net.go:182
#	0xb3747		io.ReadAtLeast+0x77							/usr/local/go/src/io/io.go:314
#	0x737cc3	io.ReadFull+0x53							/usr/local/go/src/io/io.go:333
#	0x737c8c	github.com/lightningnetwork/lnd/brontide.(*Machine).ReadHeader+0x1c	/go/src/github.com/getvoltage/lnd/brontide/noise.go:851
#	0x733173	github.com/lightningnetwork/lnd/brontide.(*Conn).ReadNextHeader+0x63	/go/src/github.com/getvoltage/lnd/brontide/conn.go:127
#	0xa56773	github.com/lightningnetwork/lnd/peer.(*Brontide).readNextMessage+0x63	/go/src/github.com/getvoltage/lnd/peer/brontide.go:924
#	0xa57b53	github.com/lightningnetwork/lnd/peer.(*Brontide).readHandler+0x123	/go/src/github.com/getvoltage/lnd/peer/brontide.go:1280

routing-lnd.zip
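
For context on the second-largest stack above: the goroutines parked in ValidationBarrier.WaitForDependants are gossip-validation jobs waiting for the announcement they depend on (for example, a channel update waiting on its channel announcement) to finish validating first. A rough, hypothetical sketch of that pattern in Go, simplified and not lnd's actual implementation, shows why slow parent validation turns directly into piled-up goroutines:

// Illustrative sketch only; names are hypothetical and much simplified
// compared to lnd's routing.ValidationBarrier.
package barrier

import "sync"

// Barrier lets dependent validation jobs wait until their parent
// announcement has been validated.
type Barrier struct {
	mu      sync.Mutex
	signals map[string]chan struct{} // parent ID -> closed once validated
}

func New() *Barrier {
	return &Barrier{signals: make(map[string]chan struct{})}
}

// InitJob registers a parent announcement that later jobs may depend on.
func (b *Barrier) InitJob(parentID string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if _, ok := b.signals[parentID]; !ok {
		b.signals[parentID] = make(chan struct{})
	}
}

// WaitForParent blocks the calling goroutine until the parent has been
// validated; this is where the goroutines in the profile are parked.
func (b *Barrier) WaitForParent(parentID string) {
	b.mu.Lock()
	sig, ok := b.signals[parentID]
	b.mu.Unlock()
	if !ok {
		return // no known dependency, proceed immediately
	}
	<-sig
}

// SignalDone marks the parent as validated and releases every waiter.
func (b *Barrier) SignalDone(parentID string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if sig, ok := b.signals[parentID]; ok {
		close(sig)
		delete(b.signals, parentID)
	}
}

Each waiter costs a goroutine stack plus whatever announcement it is holding, so a flood of gossip combined with slow parent validation shows up as both goroutine count and memory growth.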

@Roasbeef (Member):

@gkrizek this should help things; we realized we had accidentally reverted some of the gossip throttling while improving the algo: #5068

@gkrizek (Contributor) commented Mar 10, 2021

Sweet! I'll try it out tonight.

@gkrizek (Contributor) commented Mar 10, 2021

@Roasbeef I cherry-picked #5068 onto v0.12.1 and it seems to be working. CPU has dropped way down. I'll watch to see whether memory stays low too. So far so good.
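
For anyone wanting to try the same thing, a rough sketch of how the cherry-pick can be done, assuming a git checkout of lnd and that the PR's commits apply cleanly on the tag (adjust to taste):

git clone https://github.com/lightningnetwork/lnd && cd lnd
git checkout -b v0.12.1-patched v0.12.1-beta
git fetch origin pull/5068/head     # GitHub exposes PR heads as fetchable refs
git cherry-pick FETCH_HEAD          # or the individual commits, if the PR has several
make install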

@gkrizek (Contributor) commented Mar 13, 2021

Been running this for a few days. CPU looks way better, but memory is still an issue.

[screenshot: Screen Shot 2021-03-13 at 10 52 53 AM]

I'm trying to debug what's using so much memory but I can't find the root cause. I think I'm at the limits of my Go memory debugging knowledge. How can we best debug this? Send you guys a heap snapshot?

@wpaulino (Contributor):

> I'm trying to debug what's using so much memory but I can't find the root cause. I think I'm at the limits of my Go memory debugging knowledge. How can we best debug this? Send you guys a heap snapshot?

You can provide us with a heap profile the same way as the other profiles through http://localhost:PROFILE_PORT/debug/pprof/heap.
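
For anyone following along, a typical way to grab it, assuming lnd was started with the --profile=<port> option (9736 below is just a placeholder port):

# raw heap profile, suitable for attaching here or sending over Slack
curl -s http://localhost:9736/debug/pprof/heap -o lnd.heap

# or inspect it locally with the standard Go tooling
go tool pprof -inuse_space http://localhost:9736/debug/pprof/heap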

@gkrizek (Contributor) commented Mar 16, 2021

@wpaulino Here's a PNG. Let me know if alloc_space is helpful as well. This was also from a few days ago. LND reached ~5.5 GB before I had to restart it to prevent a crash.

[pprof graph: inuse_space]

@gkrizek (Contributor) commented Mar 16, 2021

Well, everything was running fine until I restarted the node because memory was almost exhausted. After the restart, CPU is back to being really high again. No changes to the code (still running the patch for #5068).

[screenshot: Screen Shot 2021-03-16 at 12 30 19 PM]

[pprof graph: profile001]


goroutine profile: total 8438
828 @ 0x47220 0x7591c 0x758e9 0x8c000 0xa56de0 0x79454
#	0x758e8		sync.runtime_notifyListWait+0xc8					/usr/local/go/src/runtime/sema.go:513
#	0x8bfff		sync.(*Cond).Wait+0xaf							/usr/local/go/src/sync/cond.go:56
#	0xa56ddf	github.com/lightningnetwork/lnd/peer.(*msgStream).msgConsumer+0x12f	/go/src/github.com/getvoltage/lnd/peer/brontide.go:1058

567 @ 0x47220 0x57000 0x6cf678 0x706eb8 0x714868 0x79454
#	0x6cf677	github.com/lightningnetwork/lnd/routing.(*ChannelRouter).AddEdge+0x117					/go/src/github.com/getvoltage/lnd/routing/router.go:2213
#	0x706eb7	github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).processNetworkAnnouncement+0x4ef7	/go/src/github.com/getvoltage/lnd/discovery/gossiper.go:1694
#	0x714867	github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).networkHandler.func1+0x1f7		/go/src/github.com/getvoltage/lnd/discovery/gossiper.go:1027

420 @ 0x47220 0x40764 0x73d94 0xd4f08 0xd5e80 0xd5e65 0x1abbf4 0x1bcfd4 0xb3748 0x737cc4 0x737c8d 0x733174 0xa56774 0xa57b54 0x79454
#	0x73d93		internal/poll.runtime_pollWait+0x43					/usr/local/go/src/runtime/netpoll.go:222
#	0xd4f07		internal/poll.(*pollDesc).wait+0x37					/usr/local/go/src/internal/poll/fd_poll_runtime.go:87
#	0xd5e7f		internal/poll.(*pollDesc).waitRead+0x15f				/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#	0xd5e64		internal/poll.(*FD).Read+0x144						/usr/local/go/src/internal/poll/fd_unix.go:159
#	0x1abbf3	net.(*netFD).Read+0x43							/usr/local/go/src/net/fd_posix.go:55
#	0x1bcfd3	net.(*conn).Read+0x73							/usr/local/go/src/net/net.go:182
#	0xb3747		io.ReadAtLeast+0x77							/usr/local/go/src/io/io.go:314
#	0x737cc3	io.ReadFull+0x53							/usr/local/go/src/io/io.go:333
#	0x737c8c	github.com/lightningnetwork/lnd/brontide.(*Machine).ReadHeader+0x1c	/go/src/github.com/getvoltage/lnd/brontide/noise.go:851
#	0x733173	github.com/lightningnetwork/lnd/brontide.(*Conn).ReadNextHeader+0x63	/go/src/github.com/getvoltage/lnd/brontide/conn.go:127
#	0xa56773	github.com/lightningnetwork/lnd/peer.(*Brontide).readNextMessage+0x63	/go/src/github.com/getvoltage/lnd/peer/brontide.go:924
#	0xa57b53	github.com/lightningnetwork/lnd/peer.(*Brontide).readHandler+0x123	/go/src/github.com/getvoltage/lnd/peer/brontide.go:1280

420 @ 0x47220 0x57000 0x485234 0x79454
#	0x485233	github.com/lightningnetwork/lnd/queue.(*ConcurrentQueue).start.func1+0x573	/go/src/github.com/getvoltage/lnd/queue/queue.go:68

420 @ 0x47220 0x57000 0x70f904 0x79454
#	0x70f903	github.com/lightningnetwork/lnd/discovery.(*GossipSyncer).replyHandler+0xb3	/go/src/github.com/getvoltage/lnd/discovery/syncer.go:648

420 @ 0x47220 0x57000 0xa5b770 0x79454
#	0xa5b76f	github.com/lightningnetwork/lnd/peer.(*Brontide).writeHandler+0x28f	/go/src/github.com/getvoltage/lnd/peer/brontide.go:1805

420 @ 0x47220 0x57000 0xa5c138 0x79454
#	0xa5c137	github.com/lightningnetwork/lnd/peer.(*Brontide).queueHandler+0x737	/go/src/github.com/getvoltage/lnd/peer/brontide.go:1936

420 @ 0x47220 0x57000 0xa5c670 0x79454
#	0xa5c66f	github.com/lightningnetwork/lnd/peer.(*Brontide).pingHandler+0x13f	/go/src/github.com/getvoltage/lnd/peer/brontide.go:1966

420 @ 0x47220 0x57000 0xa5ce28 0x79454
#	0xa5ce27	github.com/lightningnetwork/lnd/peer.(*Brontide).channelManager+0x137	/go/src/github.com/getvoltage/lnd/peer/brontide.go:2067

@wpaulino (Contributor):

@gkrizek could you provide the raw profile files instead for both heap and CPU?

@gkrizek (Contributor) commented Mar 17, 2021

@wpaulino probably a stupid question, but do I need to password-protect the heap snapshot?

@wpaulino (Contributor):

@gkrizek I don't think so, but feel free to DM it over Slack just to be safe.

@Roasbeef (Member):

@gkrizek the number of goroutines there hanging on AddEdge looks relatively high; can you confirm that you have the commits in this PR? #5043
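
A quick way to check, assuming you built from a git checkout (the <merge-commit> placeholder stands for the PR's merge commit, which you'd look up on GitHub):

git merge-base --is-ancestor <merge-commit> HEAD && echo included || echo missing

# lncli version also prints the commit the running lnd was built from
lncli version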

@Roasbeef (Member):

The optimizations laid out in #5075 should also address some of the heap usage you're seeing w.r.t. keeping blocks around in memory.

@gkrizek (Contributor) commented Mar 18, 2021

@Roasbeef, #5043 is included in 0.12.1, right? If so, then yes, I'm including that. I'm running 0.12.1 with #5068 on top.

Also, is #5075 specific to btcd? If I used bitcoind, would I avoid the heap growth problem?

@gkrizek (Contributor) commented Apr 3, 2021

Following up here. CPU seems fine, but every node I have connecting to btcd seems to be leaking memory. (I don't have any connected to bitcoind, so I can only compare to neutrino.) I've attached a heap snapshot of a node whose memory has been growing.

[screenshot: Screen Shot 2021-04-03 at 7 47 32 AM]

lnd.heap.zip

@wpaulino (Contributor) commented Apr 5, 2021

The profiles show that it's all due to network updates. #5138 will help avoid propagating a good portion of the graph, and #5156 will cache blocks to help reduce memory pressure.
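
As a rough illustration of that second point (this is not the actual #5156 implementation, just a sketch with placeholder types), a capacity-bounded LRU keyed by block hash lets a busy node serve hot blocks from a fixed amount of memory instead of holding on to, or re-fetching, arbitrarily many of them:

// Illustrative sketch only: a tiny capacity-bounded LRU block cache.
// Not concurrency-safe; the []byte payload stands in for a real block type.
package blockcache

import "container/list"

type entry struct {
	hash  string
	block []byte
}

type LRUCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // block hash -> element in order
}

func New(capacity int) *LRUCache {
	return &LRUCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Get returns a cached block and marks it as recently used.
func (c *LRUCache) Get(hash string) ([]byte, bool) {
	if el, ok := c.items[hash]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*entry).block, true
	}
	return nil, false
}

// Put inserts a block, evicting the least recently used entry when the
// cache exceeds its capacity.
func (c *LRUCache) Put(hash string, block []byte) {
	if el, ok := c.items[hash]; ok {
		el.Value.(*entry).block = block
		c.order.MoveToFront(el)
		return
	}
	c.items[hash] = c.order.PushFront(&entry{hash: hash, block: block})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).hash)
	}
}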

@Talkless commented Apr 24, 2021

Same issue on my Debian 10 (buster) ARM machine:

[screenshot]

I'm using bitcoind v0.21.0.

@Talkless commented Apr 24, 2021

My profile, taken at the moment htop shows ~850 MB usage:
[pprof graph: profile]

@wpaulino (Contributor):

@Talkless we have several fixes in the upcoming release that should address this, as well as the high CPU usage that is also commonly seen.

@Roasbeef (Member):

Closing for now since 0.13.1 has been released; we've received reports that the prior memory blow-ups due to p2p activity have been remedied. 0.14 will have more optimizations that reduce bursty GC activity as well.
