
Memory leak on go-ethereum 1.8.7 #16728

Closed
vogelito opened this issue May 11, 2018 · 40 comments


vogelito commented May 11, 2018

System information

Geth version:

Geth
Version: 1.8.7-stable
Git Commit: 66432f3821badf24d526f2d9205f36c0543219de
Architecture: amd64
Protocol Versions: [63 62]
Network Id: 1
Go Version: go1.10
Operating System: linux
GOPATH=
GOROOT=/usr/local/go

OS & Version: Linux

Expected behaviour

Memory usage of go-ethereum should be stable.

Actual behaviour

Something happens which causes memory usage to increase substantially.

Steps to reproduce the behaviour

Unknown

pprof top

This wasn't hugely useful to us, but we hope it means something to the folks in here. Happy to provide additional information. This is currently happening roughly once a day.

# go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in /root/pprof/pprof.geth.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
File: geth
Build ID: fb815f00e769fd84f7336a86cf929a66ba8ab66f
Type: inuse_space
Time: May 9, 2018 at 9:40pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 5332.62MB, 84.39% of 6318.78MB total
Dropped 349 nodes (cum <= 31.59MB)
Showing top 10 nodes out of 145
      flat  flat%   sum%        cum   cum%
 1186.93MB 18.78% 18.78%  1186.93MB 18.78%  runtime.malg
  861.97MB 13.64% 32.43%   861.97MB 13.64%  github.com/ethereum/go-ethereum/vendor/github.com/syndtr/goleveldb/leveldb/util.(*BufferPool).Get
  708.04MB 11.21% 43.63%   708.04MB 11.21%  math/big.nat.setBytes
  529.06MB  8.37% 52.00%   529.06MB  8.37%  reflect.unsafe_New
  509.52MB  8.06% 60.07%  1251.55MB 19.81%  github.com/ethereum/go-ethereum/rlp.decodeBigInt
  453.14MB  7.17% 67.24%   464.65MB  7.35%  github.com/ethereum/go-ethereum/trie.(*hasher).store
     388MB  6.14% 73.38%      388MB  6.14%  github.com/ethereum/go-ethereum/vendor/github.com/syndtr/goleveldb/leveldb/memdb.New
  264.43MB  4.18% 77.56%   264.43MB  4.18%  github.com/ethereum/go-ethereum/vendor/gopkg.in/fatih/set%2ev0.(*Set).Add
  228.51MB  3.62% 81.18%   482.26MB  7.63%  github.com/ethereum/go-ethereum/eth.(*peer).SendTransactions
  203.02MB  3.21% 84.39%   203.02MB  3.21%  github.com/ethereum/go-ethereum/event.(*Feed).Send
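
As an aside for anyone else chasing this: since the node already exposes --pprof, successive heap snapshots can be saved and diffed offline with `go tool pprof -base old.pb.gz new.pb.gz`. Below is a minimal sketch of such a helper (our own illustration, not part of geth; the endpoint, file names and interval are assumptions):

```go
// Hypothetical helper, not part of geth: periodically save the heap profile
// exposed by --pprof so snapshots can be compared later with
// `go tool pprof -base`. Endpoint, file names and interval are assumptions.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	for i := 0; ; i++ {
		if err := snapshot(fmt.Sprintf("heap-%03d.pb.gz", i)); err != nil {
			log.Println("snapshot failed:", err)
		}
		time.Sleep(30 * time.Minute)
	}
}

func snapshot(path string) error {
	resp, err := http.Get("http://localhost:6060/debug/pprof/heap")
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, resp.Body)
	return err
}
```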

Metrics

Last month:
screen shot 2018-05-10 at 7 17 03 pm

Last 2 months:
screen shot 2018-05-10 at 7 19 15 pm

Additional info

For reference, this host was migrated to run:
v1.8.2 on March 5
v1.8.3 on March 27
v1.8.6 on Apr 26
v1.8.6 with PR #16576 on May 1
v1.8.7 on May 4

We run our node with the following command-line arguments: --rpc --rpcapi "debug,personal,eth,web3" --targetgaslimit 1000000 --cache 2048 --gpopercentile 30 --maxpeers 25

On May 1st we added --metrics --pprof --pprofaddr=127.0.0.1 and started tracking various metrics. We thought this was the source of the problem, so earlier today we restarted the node without the additional flags, but unfortunately the memory leak happened again.

This is the memory breakdown the last 48 hours:
screen shot 2018-05-10 at 7 25 28 pm

And since upgrading to 1.8.7:
screen shot 2018-05-10 at 7 26 39 pm

@vogelito (Author)

Note: geth is basically the only thing running on this box, and we have made sure the memory increases on the box are all from geth. The sawtooth shape in the memory graphs is caused by geth restarts.


corverroos commented May 11, 2018

We are experiencing a similar issue:

Geth is started in a Docker container on Ubuntu with:

docker run \
    --memory=8g \
    -v /mnt/ethereum:/root/.ethereum \
    -p 8545:8545 \
    -p 30303:30303 \
    -p 6060:6060 \
    -d \
    ethereum/client-go:v1.8.7 \
    --pprof --pprofaddr=0.0.0.0 --rpc --rpcapi=eth,personal,txpool,debug,admin,net --rpcaddr=0.0.0.0 --cache=1024 --datadir=/root/.ethereum --gpopercentile=50 --syncmode=full --gcmode=archive --rpcvhosts=*

We introduced the 8GB memory limit as a workaround so that Docker auto-restarts the container.

Output of memStats (time in UTC+2)
screen shot on may 11th at 10_23 am

Output of /debug/pprof/heap at 2018-05-10 17:00 UTC+2:
geth_heap_20180510_1515

Output of /debug/pprof/heap at 2018-05-11 02:00 UTC+2:
geth_heap_20180511_0005
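
For comparison, the same kind of numbers shown in the memStats graph can also be logged directly from a Go process with the standard library; a minimal sketch (not geth code, and the one-minute interval is arbitrary):

```go
// Minimal sketch, not geth code: log the same kind of numbers shown in the
// memStats graph above, using only the standard library.
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	for range time.Tick(time.Minute) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("HeapAlloc=%dMiB HeapInuse=%dMiB Sys=%dMiB Goroutines=%d",
			m.HeapAlloc>>20, m.HeapInuse>>20, m.Sys>>20, runtime.NumGoroutine())
	}
}
```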

@vogelito (Author)

We just downgraded to v1.8.6 with PR #16576. Will keep you posted.

@mykolasmith

+1, we are also experiencing this issue.

@vogelito (Author)

Confirmed that 1.8.6 with PR #16576 also sees this issue. We're downgrading to 1.8.3, which is the last version that ran stably for us.

@ryanschneider suspects that the memory issues were always there (at least in 1.8.x) but have been exacerbated by the recent uptick in txs.

@mykolasmith

@vogelito Just wondering, are you using filter queries? (experimental feature)

@vogelito (Author)

@mykolasmith - no, we are most definitely not.

@vogelito (Author)

This is the node's memory consumption since Friday morning.

screen shot 2018-05-12 at 10 59 04 pm

1.8.3 seems to be doing better, but if the issue is indeed caused by an uptick in txs, then it's possible things only look fine because it's the weekend.

@vogelito (Author)

Update on memory usage. 1.8.3 still doing ok:

image

@vogelito (Author)

@karalabe - I saw some memory-management-related improvements in the latest v1.8.8 release.
Do you think there's a chance they address the memory-related problems we're seeing here?

@karalabe (Member)

Unfortunately not.

@karalabe (Member)

@vogelito Out of curiosity, could you try reverting #16494 locally from master and seeing if that fixes it? I don't see how it would; I'm just playing a guessing game as to what we changed in 1.8.4 that might cause this.

@vogelito (Author)

@karalabe - happy to. However, I was planning to wait a bit longer to see whether the problem is indeed gone in 1.8.3.

We ran 1.8.6 without problems for a while until there was an uptick in network transactions.

@karalabe (Member)

Another interesting experiment might be to run 1.8.7 or 1.8.8 with --pprof, and when you see the memory spiking, load http://localhost:6060/memsize and generate a report (Scan "node"). This will use some experimental memory stats gathering written by @fjl to try and report how much memory different parts of our code use. That might help pinpoint what part is leaky.

@karalabe (Member)

Do note, however, that 'scanning all objects and measuring their memory use' will probably hit the node hard for 30+ seconds, so take care when doing this on a production system.


vogelito commented May 14, 2018

Happy to do whichever one you find more useful; just let me know (and thanks for the heads-up!).

@vogelito (Author)

I can confirm that this is still an issue in 1.8.3. We just started seeing the leak happening on 1.8.3 today.

screen shot 2018-05-17 at 12 41 15 pm

We are restarting the node on 1.8.8 and will run with --pprof.


vogelito commented May 17, 2018

Our node (now running 1.8.8) is not exhibiting the parabolic memory-usage behavior yet. However, memory usage is already quite high (~14GB), and this is the report.

screen shot 2018-05-17 at 3 09 27 pm

Edit: for some reason this message wasn't posted when I first intended it to be, so please check the timestamp on the image.

@vogelito (Author)

The node is now at 18.69GB, and this is what the report shows:
screen shot 2018-05-17 at 6 59 27 pm

@vogelito (Author)

Last 24 hours running 1.8.8:

image

Process killed by the OS:

# dmesg | grep oom
[14196605.693635] geth invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
[14196605.693664]  [<ffffffff81192e72>] oom_kill_process+0x202/0x3c0
[14196605.693750] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name


karalabe commented May 18, 2018

We're running a few tests ourselves now too. A PR from Gary (#16720) could help with the huge number of goroutines by creating fewer of them, and it also fixes some pending-block-generation issues that made tx posting slow. I can't give you meaningful numbers since the syncs only just finished, but let's see how the two behave in comparison after a day or so.

After this is in, I also have an update lined up that removes a blockage from block and transaction propagation into the network. That should avoid any goroutine leak caused by slow remote peers. I'll set up a benchmark for it on Monday (I want to run the above PR until then).
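
For anyone curious, the general idea is to make broadcasts to slow peers non-blocking instead of parking a goroutine per send; here is a generic illustration of that pattern (made-up types, not the actual PR code):

```go
// Generic illustration of the non-blocking broadcast idea (made-up types,
// not the actual PR): if a peer's queue is full, drop the announcement
// instead of parking a goroutine on the channel send.
package main

import "fmt"

type txAnnounce struct{ hash string }

// tryBroadcast queues the announcement if there is room and reports whether
// it was accepted; it never blocks on a slow peer.
func tryBroadcast(queue chan txAnnounce, t txAnnounce) bool {
	select {
	case queue <- t:
		return true
	default:
		return false // peer too slow: skip it rather than block
	}
}

func main() {
	slowPeer := make(chan txAnnounce, 1)
	slowPeer <- txAnnounce{hash: "0x01"} // queue already full

	fmt.Println(tryBroadcast(slowPeer, txAnnounce{hash: "0x02"})) // false: dropped, no goroutine leaked
}
```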

@vogelito (Author)

Thanks. I will try to provide you with a memory report of the node later today. If there's anything else I can do to help you track this problem, please let me know.

@mykolasmith

@karalabe We're going to try this out as well. We'll report back results in a few hours!

@vogelito (Author)

Current situation and memory report

screen shot 2018-05-18 at 3 16 42 pm

screen shot 2018-05-18 at 3 20 22 pm

@vogelito (Author)

Update: the node exhausted memory again.

Memory scan
screen shot 2018-05-19 at 5 06 32 am

Last month
screen shot 2018-05-19 at 5 03 59 am

Last week

screen shot 2018-05-19 at 5 04 15 am

Last 2 days
screen shot 2018-05-19 at 5 04 41 am

Last 24 hours
screen shot 2018-05-19 at 5 05 14 am

@mykolasmith

screen shot 2018-05-19 at 12 22 21 pm

Still seeing the memory leak as of #16720.

@vogelito (Author)

Per @holiman's post in #16673, I'm attaching the output of http://localhost:6060/debug/pprof/goroutine?debug=1

This was with 1.8.8 at over 24GB. The node had 2.3M goroutines on the event feed (99.9934% of all goroutines).

goroutine profile: total 2364583
2364427 @ 0x44272a 0x4427de 0x419c42 0x4198fb 0x6c1fec 0x4706e1
#	0x6c1feb	github.com/ethereum/go-ethereum/event.(*Feed).Send+0xcb	/home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/event/feed.go:133

17 @ 0x44272a 0x4427de 0x453344 0x45305d 0x4871b8 0x4880bd 0x886b44 0x8868ce 0xa8f821 0xa8f0a6 0xa9d90b 0x985ce6 0x4706e1
#	0x45305c	sync.runtime_SemacquireMutex+0x3c					/usr/local/go/src/runtime/sema.go:71
#	0x4871b7	sync.(*Mutex).Lock+0x107						/usr/local/go/src/sync/mutex.go:134
#	0x4880bc	sync.(*RWMutex).Lock+0x2c						/usr/local/go/src/sync/rwmutex.go:93
#	0x886b43	github.com/ethereum/go-ethereum/core.(*TxPool).addTxs+0x53		/home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/core/tx_pool.go:801

The full output file here
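
The pile-up in event.(*Feed).Send is consistent with Send blocking until every subscriber channel accepts the value, so a single subscriber that stops draining its channel parks every subsequent Send. Below is a minimal, hypothetical reproduction (assuming the go-ethereum event package is on the import path; this is not code taken from geth itself):

```go
// Hypothetical standalone reproduction, not code from geth: event.Feed.Send
// blocks until every subscriber channel has accepted the value, so one
// subscriber that never drains its channel parks every Send call in turn.
package main

import (
	"fmt"
	"runtime"
	"time"

	"github.com/ethereum/go-ethereum/event"
)

func main() {
	var feed event.Feed

	// A subscriber that never reads: stands in for a slow consumer.
	stuck := make(chan int)
	feed.Subscribe(stuck)

	// Each Send blocks forever waiting on the stuck channel, leaking a goroutine.
	for i := 0; i < 1000; i++ {
		go feed.Send(i)
	}

	time.Sleep(time.Second)
	fmt.Println("goroutines:", runtime.NumGoroutine()) // roughly 1000 parked in Feed.Send
}
```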

@karalabe (Member)

Master currently contains some optimizations that should help avoid one potential bottleneck from the transaction feed. Trying that might be a good data point for us.

On top of master, I have a PR, #16769, which removes another potential congestion point from the transaction feed. Trying that out might also be a valuable data point.

No promise that these will fix the issue, but they should most definitely fix some issues that could lead to this scenario on insanely busy nodes.


Quick question: could you detail your setup a bit? How many peers do you run with? What is your API usage? What kind of requests are you serving? Anything else that might be helpful for finding congested pathways?

@vogelito (Author)

I see that #16769 has been merged into master. I'll try a new build off of master and report back.

Regarding our setup, here are the details:

  • startup args: --rpc --rpcapi "debug,personal,eth,web3" --targetgaslimit 1000000 --cache 2048 --gpopercentile 30 --maxpeers 25 --metrics --pprof --pprofaddr=127.0.0.1
  • We heavily use the JSONRPC API with the following calls:
    • eth_blockNumber: Continuously
    • eth_getBlockByNumber: Once a new block we haven't seen is mined
    • eth_getTransactionReceipt: Low volume (on transactions we care about)
    • eth_getTransactionByHash: Low volume (on transactions we care about)
    • eth_getBalance: twice / minute
    • eth_gasPrice: varies but constantly
    • personal_unlockAccount: varies but constantly
    • eth_sendTransaction: varies but constantly
    • personal_newAccount: Low volume

I hope this is useful. We're restarting our node now with a build from master and will report back.
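
For completeness, the continuous eth_blockNumber polling mentioned above looks roughly like the following sketch (simplified, using the go-ethereum rpc client; the endpoint and interval are placeholders rather than our exact code):

```go
// Simplified sketch of the eth_blockNumber polling described above, using
// the go-ethereum rpc client. The endpoint and interval are placeholders.
package main

import (
	"context"
	"log"
	"time"

	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	client, err := rpc.Dial("http://127.0.0.1:8545")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	for range time.Tick(time.Second) {
		var head string // hex-encoded block number, e.g. "0x5aa0f9"
		if err := client.CallContext(context.Background(), &head, "eth_blockNumber"); err != nil {
			log.Println("eth_blockNumber failed:", err)
			continue
		}
		log.Println("current head:", head)
	}
}
```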


vogelito commented May 21, 2018

For reference, this is what we're running now:

# /home/ubuntu/geth/current/geth version
Geth
Version: 1.8.9-unstable
Git Commit: 415969f534d8122e717e18abafdc8ec7bb913a84
Architecture: amd64
Protocol Versions: [63 62]
Network Id: 1
Go Version: go1.10
Operating System: linux
GOPATH=
GOROOT=/usr/local/go


crypt1d commented May 21, 2018

I've been having this issue on both 1.8.1 and 1.8.8, but it only started appearing recently (in the last couple of weeks, I think) and has gotten to a point where I have to restart geth on a daily basis to prevent it from crashing unexpectedly. FWIW, I don't think 1.8.x should be considered 'stable' until this is addressed.

@karalabe (Member)

@crypt1d I don't think it's 1.8.x being unstable; rather, the network usage pattern is changing and surfacing a dormant bug.

@vogelito (Author)

Things are looking better, but it's perhaps still too early to tell.

Reminder:

  • We downgraded to 1.8.3 on Friday May 11th and it ran "ok(ish)" until Thursday May 17th.
  • We upgraded to 1.8.8 on Thursday May 17th and had to restart our node daily until this morning (May 21st), when we upgraded to master at commit 415969f

Memory graph of last week:
screen shot 2018-05-21 at 3 30 42 pm

@vogelito (Author)

Daily update: Memory still looking ok here.

@mykolasmith

screen shot 2018-05-22 at 12 15 50 pm

Memory looking OK for us as well.

@vogelito (Author)

Daily update: Memory still looking ok here.

@mykolasmith

Update: memory seems to have been stable as of #16769; however, we experienced several spikes in memory usage prior to 18:00 EST yesterday, after which all peers began to time out and the node never recovered.

screen shot 2018-05-23 at 2 55 20 pm

screen shot 2018-05-23 at 2 55 14 pm

May 23 15:51:21 geth[22809]: WARN [05-23|15:51:21] Synchronisation failed, dropping peer    peer=90bccbf1fccc80ae err=timeout
May 23 16:49:58 geth[22809]: WARN [05-23|16:49:58] Synchronisation failed, dropping peer    peer=6011df96954d3708 err=timeout

@vogelito (Author)

Things are still looking OK for us memory-wise, and there are no other known issues at the moment.

image

@vogelito (Author)

Closing this as the latest code fixes it.

@karalabe (Member)

@mykolasmith We may have just found the hang issue you experienced: #16840.
