
Memory leak on go-ethereum 1.8.7 #16728

Closed
vogelito opened this issue May 11, 2018 · 40 comments


vogelito commented May 11, 2018

System information

Geth version:

Geth
Version: 1.8.7-stable
Git Commit: 66432f3821badf24d526f2d9205f36c0543219de
Architecture: amd64
Protocol Versions: [63 62]
Network Id: 1
Go Version: go1.10
Operating System: linux
GOPATH=
GOROOT=/usr/local/go

OS & Version: Linux

Expected behaviour

Memory usage of go-ethereum should be stable.

Actual behaviour

Something happens which causes memory usage to increase substantially.

Steps to reproduce the behaviour

Unknown

pprof top

This wasn't hugely useful to us, but we hope it means something to the folks in here. Happy to provide additional information. This is currently happening roughly once a day.

# go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in /root/pprof/pprof.geth.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
File: geth
Build ID: fb815f00e769fd84f7336a86cf929a66ba8ab66f
Type: inuse_space
Time: May 9, 2018 at 9:40pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 5332.62MB, 84.39% of 6318.78MB total
Dropped 349 nodes (cum <= 31.59MB)
Showing top 10 nodes out of 145
      flat  flat%   sum%        cum   cum%
 1186.93MB 18.78% 18.78%  1186.93MB 18.78%  runtime.malg
  861.97MB 13.64% 32.43%   861.97MB 13.64%  github.com/ethereum/go-ethereum/vendor/github.com/syndtr/goleveldb/leveldb/util.(*BufferPool).Get
  708.04MB 11.21% 43.63%   708.04MB 11.21%  math/big.nat.setBytes
  529.06MB  8.37% 52.00%   529.06MB  8.37%  reflect.unsafe_New
  509.52MB  8.06% 60.07%  1251.55MB 19.81%  github.com/ethereum/go-ethereum/rlp.decodeBigInt
  453.14MB  7.17% 67.24%   464.65MB  7.35%  github.com/ethereum/go-ethereum/trie.(*hasher).store
     388MB  6.14% 73.38%      388MB  6.14%  github.com/ethereum/go-ethereum/vendor/github.com/syndtr/goleveldb/leveldb/memdb.New
  264.43MB  4.18% 77.56%   264.43MB  4.18%  github.com/ethereum/go-ethereum/vendor/gopkg.in/fatih/set%2ev0.(*Set).Add
  228.51MB  3.62% 81.18%   482.26MB  7.63%  github.com/ethereum/go-ethereum/eth.(*peer).SendTransactions
  203.02MB  3.21% 84.39%   203.02MB  3.21%  github.com/ethereum/go-ethereum/event.(*Feed).Send
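
As an aside for anyone else chasing this: since the node already exposes --pprof, successive heap snapshots can be saved and diffed offline with `go tool pprof -base old.pb.gz new.pb.gz`. Below is a minimal sketch of such a helper (our own illustration, not part of geth; the endpoint, file names and interval are assumptions):

```go
// Hypothetical helper, not part of geth: periodically save the heap profile
// exposed by --pprof so snapshots can be compared later with
// `go tool pprof -base`. Endpoint, file names and interval are assumptions.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	for i := 0; ; i++ {
		if err := snapshot(fmt.Sprintf("heap-%03d.pb.gz", i)); err != nil {
			log.Println("snapshot failed:", err)
		}
		time.Sleep(30 * time.Minute)
	}
}

func snapshot(path string) error {
	resp, err := http.Get("http://localhost:6060/debug/pprof/heap")
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, resp.Body)
	return err
}
```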

Metrics

Last month:
screen shot 2018-05-10 at 7 17 03 pm

Last 2 months:
screen shot 2018-05-10 at 7 19 15 pm

Additional info

For reference, this host was migrated to run:
v1.8.2 on March 5
v1.8.3 on March 27
v1.8.6 on Apr 26
v1.8.6 with PR #16576 on May 1
v1.8.7 on May 4

We run our node with the following command-line arguments: --rpc --rpcapi "debug,personal,eth,web3" --targetgaslimit 1000000 --cache 2048 --gpopercentile 30 --maxpeers 25

On May 1st we added --metrics --pprof --pprofaddr=127.0.0.1 and started tracking various metrics. We thought this was the source of the problem, so earlier today we restarted the node without the additional flags, but unfortunately the memory leak happened again.

This is the memory breakdown the last 48 hours:
screen shot 2018-05-10 at 7 25 28 pm

And since upgrading to 1.8.7:
screen shot 2018-05-10 at 7 26 39 pm

@vogelito (Author)

Note: geth is basically the only thing running on this box, and we have made sure the memory increases on the box are all from geth. The sawtooth shape in the memory graphs is caused by geth restarts.


corverroos commented May 11, 2018

We are experiencing a similar issue:

Geth is started in a Docker container on Ubuntu with:

docker run \
    --memory=8g \
    -v /mnt/ethereum:/root/.ethereum \
    -p 8545:8545 \
    -p 30303:30303 \
    -p 6060:6060 \
    -d \
    ethereum/client-go:v1.8.7 \
    --pprof --pprofaddr=0.0.0.0 --rpc --rpcapi=eth,personal,txpool,debug,admin,net --rpcaddr=0.0.0.0 --cache=1024 --datadir=/root/.ethereum --gpopercentile=50 --syncmode=full --gcmode=archive --rpcvhosts=*

We introduced the 8GB memory limit as a workaround so that Docker auto-restarts the container.

Output of memStats (time in UTC+2)
screen shot on may 11th at 10_23 am

Output of /debug/pprof/heap at 2018-05-10 17:00 UTC+2:
geth_heap_20180510_1515

Output of /debug/pprof/heap at 2018-05-11 02:00 UTC+2:
geth_heap_20180511_0005
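
For comparison, the same kind of numbers shown in the memStats graph can also be logged directly from a Go process with the standard library; a minimal sketch (not geth code, and the one-minute interval is arbitrary):

```go
// Minimal sketch, not geth code: log the same kind of numbers shown in the
// memStats graph above, using only the standard library.
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	for range time.Tick(time.Minute) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("HeapAlloc=%dMiB HeapInuse=%dMiB Sys=%dMiB Goroutines=%d",
			m.HeapAlloc>>20, m.HeapInuse>>20, m.Sys>>20, runtime.NumGoroutine())
	}
}
```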

@vogelito (Author)

We just downgraded to v1.8.6 with PR #16576. Will keep you posted.

@mykolasmith

+1, we are also experiencing this issue.

@vogelito (Author)

Confirmed that 1.8.6 with PR #16576 also sees this issue. We're downgrading to 1.8.3, which is the last version that ran stably for us.

@ryanschneider suspects that the memory issues were always there (at least in 1.8.x) but have been exacerbated by the recent uptick in txs.

@mykolasmith

@vogelito Just wondering, are you using filter queries? (experimental feature)

@vogelito (Author)

@mykolasmith - no, we are most definitely not.

@vogelito (Author)

This is the node's memory consumption since Friday morning.

screen shot 2018-05-12 at 10 59 04 pm

1.8.3 seems to be doing better, but if the issue is indeed caused by an uptick in txs, then it's possible things only look fine because it's the weekend.

@vogelito (Author)

Update on memory usage. 1.8.3 still doing ok:

image

@vogelito (Author)

@karalabe - I saw some memory-management-related improvements in the latest v1.8.8 release.
Do you think there's a chance they address the memory-related problems we're seeing here?

@karalabe (Member)

Unfortunately not.

@karalabe (Member)

@vogelito Out of curiosity, could you try reverting #16494 locally from master and seeing if that fixes it? I don't see how it would; I'm just playing a guessing game as to what we changed in 1.8.4 that might cause this.

@vogelito (Author)

@karalabe - happy to. However, I was planning to wait a bit longer to see whether the problem is indeed gone in 1.8.3.

We ran 1.8.6 without problems for a while until there was an uptick in network transactions.

@karalabe (Member)

Another interesting experiment might be to run 1.8.7 or 1.8.8 with --pprof, and when you see the memory spiking, load http://localhost:6060/memsize and generate a report (Scan "node"). This will use some experimental memory stats gathering written by @fjl to try and report how much memory different parts of our code use. That might help pinpoint what part is leaky.

@karalabe (Member)

Do note, however, that 'scanning all objects and measuring their memory use' will probably hit the node hard for 30+ seconds, so take care when doing this on a production system.


vogelito commented May 14, 2018

Happy to do whichever one you find more useful; just let me know (and thanks for the heads-up!).

@vogelito (Author)

I can confirm that this is still an issue in 1.8.3. We just started seeing the leak happening on 1.8.3 today.

screen shot 2018-05-17 at 12 41 15 pm

We are restarting the node on 1.8.8 and will run with --pprof.


vogelito commented May 17, 2018

Our node (now running 1.8.8) is not exhibiting the parabolic memory-usage behavior yet. However, memory usage is already quite high (~14GB), and this is the report.

screen shot 2018-05-17 at 3 09 27 pm

Edit: for some reason this message wasn't posted when I first intended it to be, so please check the timestamp on the image.

@vogelito (Author)

The node is now at 18.69GB, and this is what the report shows:
screen shot 2018-05-17 at 6 59 27 pm

@vogelito (Author)

Last 24 hours running 1.8.8:

image

Process killed by the OS:

# dmesg | grep oom
[14196605.693635] geth invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
[14196605.693664]  [<ffffffff81192e72>] oom_kill_process+0x202/0x3c0
[14196605.693750] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name


karalabe commented May 18, 2018

We're running a few tests ourselves now too. A PR from Gary (#16720) could help with the huge number of goroutines by creating fewer of them, and it also fixes some pending-block-generation issues that made tx posting slow. I can't give you meaningful numbers since the syncs only just finished, but let's see how the two behave in comparison after a day or so.

After this is in, I also have an update lined up that removes a blockage from block and transaction propagation into the network. That should avoid any goroutine leak caused by slow remote peers. I'll set up a benchmark for it on Monday (I want to run the above PR until then).
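
For anyone curious, the general idea is to make broadcasts to slow peers non-blocking instead of parking a goroutine per send; here is a generic illustration of that pattern (made-up types, not the actual PR code):

```go
// Generic illustration of the non-blocking broadcast idea (made-up types,
// not the actual PR): if a peer's queue is full, drop the announcement
// instead of parking a goroutine on the channel send.
package main

import "fmt"

type txAnnounce struct{ hash string }

// tryBroadcast queues the announcement if there is room and reports whether
// it was accepted; it never blocks on a slow peer.
func tryBroadcast(queue chan txAnnounce, t txAnnounce) bool {
	select {
	case queue <- t:
		return true
	default:
		return false // peer too slow: skip it rather than block
	}
}

func main() {
	slowPeer := make(chan txAnnounce, 1)
	slowPeer <- txAnnounce{hash: "0x01"} // queue already full

	fmt.Println(tryBroadcast(slowPeer, txAnnounce{hash: "0x02"})) // false: dropped, no goroutine leaked
}
```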

@vogelito (Author)

Thanks. I will try to provide you with a memory report of the node later today. If there's anything else I can do to help you track this problem, please let me know.

@mykolasmith

@karalabe We're going to try this out as well. We'll report back results in a few hours!

@vogelito (Author)

Current situation and memory report

screen shot 2018-05-18 at 3 16 42 pm

screen shot 2018-05-18 at 3 20 22 pm

@vogelito (Author)

Update: the node exhausted memory again.

Memory scan
screen shot 2018-05-19 at 5 06 32 am

Last month
screen shot 2018-05-19 at 5 03 59 am

Last week

screen shot 2018-05-19 at 5 04 15 am

Last 2 days
screen shot 2018-05-19 at 5 04 41 am

Last 24 hours
screen shot 2018-05-19 at 5 05 14 am

@mykolasmith

screen shot 2018-05-19 at 12 22 21 pm

Still seeing the memory leak as of #16720.

@vogelito (Author)

Per @holiman's post in #16673, I'm attaching the output of http://localhost:6060/debug/pprof/goroutine?debug=1

This was with 1.8.8 at over 24GB. The node had 2.3M goroutines on the event feed (99.9934% of all goroutines).

goroutine profile: total 2364583
2364427 @ 0x44272a 0x4427de 0x419c42 0x4198fb 0x6c1fec 0x4706e1
#	0x6c1feb	github.com/ethereum/go-ethereum/event.(*Feed).Send+0xcb	/home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/event/feed.go:133

17 @ 0x44272a 0x4427de 0x453344 0x45305d 0x4871b8 0x4880bd 0x886b44 0x8868ce 0xa8f821 0xa8f0a6 0xa9d90b 0x985ce6 0x4706e1
#	0x45305c	sync.runtime_SemacquireMutex+0x3c					/usr/local/go/src/runtime/sema.go:71
#	0x4871b7	sync.(*Mutex).Lock+0x107						/usr/local/go/src/sync/mutex.go:134
#	0x4880bc	sync.(*RWMutex).Lock+0x2c						/usr/local/go/src/sync/rwmutex.go:93
#	0x886b43	github.com/ethereum/go-ethereum/core.(*TxPool).addTxs+0x53		/home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/core/tx_pool.go:801

The full output file here
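
The pile-up in event.(*Feed).Send is consistent with Send blocking until every subscriber channel accepts the value, so a single subscriber that stops draining its channel parks every subsequent Send. Below is a minimal, hypothetical reproduction (assuming the go-ethereum event package is on the import path; this is not code taken from geth itself):

```go
// Hypothetical standalone reproduction, not code from geth: event.Feed.Send
// blocks until every subscriber channel has accepted the value, so one
// subscriber that never drains its channel parks every Send call in turn.
package main

import (
	"fmt"
	"runtime"
	"time"

	"github.com/ethereum/go-ethereum/event"
)

func main() {
	var feed event.Feed

	// A subscriber that never reads: stands in for a slow consumer.
	stuck := make(chan int)
	feed.Subscribe(stuck)

	// Each Send blocks forever waiting on the stuck channel, leaking a goroutine.
	for i := 0; i < 1000; i++ {
		go feed.Send(i)
	}

	time.Sleep(time.Second)
	fmt.Println("goroutines:", runtime.NumGoroutine()) // roughly 1000 parked in Feed.Send
}
```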

@karalabe (Member)

Master currently contains some optimizations that should help avoid one potential bottleneck from the transaction feed. Trying that might be a good data point for us.

On top of master, I have a PR, #16769, which removes another potential congestion point from the transaction feed. Trying that out might also be a valuable data point.

No promise that these will fix the issue, but they should most definitely fix some issues that could lead to this scenario on insanely busy nodes.


Quick question: could you detail your setup a bit? How many peers do you run with? What is your API usage? What kind of requests are you serving? Anything else that might be helpful for finding congested pathways?

@vogelito (Author)

I see that #16769 has been merged into master. I'll try a new build off of master and report back.

Regarding our setup, here are the details:

  • startup args: --rpc --rpcapi "debug,personal,eth,web3" --targetgaslimit 1000000 --cache 2048 --gpopercentile 30 --maxpeers 25 --metrics --pprof --pprofaddr=127.0.0.1
  • We heavily use the JSONRPC API with the following calls:
    • eth_blockNumber: Continuously
    • eth_getBlockByNumber: Once a new block we haven't seen is mined
    • eth_getTransactionReceipt: Low volume (on transactions we care about)
    • eth_getTransactionByHash: Low volume (on transactions we care about)
    • eth_getBalance: twice / minute
    • eth_gasPrice: varies but constantly
    • personal_unlockAccount: varies but constantly
    • eth_sendTransaction: varies but constantly
    • personal_newAccount: Low volume

I hope this is useful. We're restarting our node now with a build from master and will report back.
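
For completeness, the continuous eth_blockNumber polling mentioned above looks roughly like the following sketch (simplified, using the go-ethereum rpc client; the endpoint and interval are placeholders rather than our exact code):

```go
// Simplified sketch of the eth_blockNumber polling described above, using
// the go-ethereum rpc client. The endpoint and interval are placeholders.
package main

import (
	"context"
	"log"
	"time"

	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	client, err := rpc.Dial("http://127.0.0.1:8545")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	for range time.Tick(time.Second) {
		var head string // hex-encoded block number, e.g. "0x5aa0f9"
		if err := client.CallContext(context.Background(), &head, "eth_blockNumber"); err != nil {
			log.Println("eth_blockNumber failed:", err)
			continue
		}
		log.Println("current head:", head)
	}
}
```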


vogelito commented May 21, 2018

For reference, this is what we're running now:

# /home/ubuntu/geth/current/geth version
Geth
Version: 1.8.9-unstable
Git Commit: 415969f534d8122e717e18abafdc8ec7bb913a84
Architecture: amd64
Protocol Versions: [63 62]
Network Id: 1
Go Version: go1.10
Operating System: linux
GOPATH=
GOROOT=/usr/local/go


crypt1d commented May 21, 2018

I've been having this issue on both 1.8.1 and 1.8.8, but it only started appearing recently (in the last couple of weeks, I think) and has gotten to a point where I have to restart geth on a daily basis to prevent it from crashing unexpectedly. FWIW, I don't think 1.8.x should be considered 'stable' until this is addressed.

@karalabe (Member)

@crypt1d I don't think it's 1.8.x being unstable; rather, the network usage pattern is changing and surfacing a dormant bug.

@vogelito (Author)

Things are looking better, but it's perhaps still too early to tell.

Reminder:

  • We downgraded to 1.8.3 on Friday May 11th and it ran "ok(ish)" until Thursday May 17th.
  • We upgraded to 1.8.8 on Thursday May 17th and had to restart our node daily until this morning (May 21st), when we upgraded to master at commit 415969f

Memory graph of last week:
screen shot 2018-05-21 at 3 30 42 pm

@vogelito (Author)

Daily update: Memory still looking ok here.

@mykolasmith

screen shot 2018-05-22 at 12 15 50 pm

Memory looking OK for us as well.

@vogelito (Author)

Daily update: Memory still looking ok here.

@mykolasmith

Update: memory seems to have been stable as of #16769; however, we experienced several spikes in memory usage prior to 18:00 EST yesterday, after which all peers began to time out and the node never recovered.

screen shot 2018-05-23 at 2 55 20 pm

screen shot 2018-05-23 at 2 55 14 pm

May 23 15:51:21 geth[22809]: WARN [05-23|15:51:21] Synchronisation failed, dropping peer    peer=90bccbf1fccc80ae err=timeout
May 23 16:49:58 geth[22809]: WARN [05-23|16:49:58] Synchronisation failed, dropping peer    peer=6011df96954d3708 err=timeout

@vogelito (Author)

Things are still looking OK for us memory-wise, and there are no other known issues at the moment.

image

@vogelito (Author)

Closing this as the latest code fixes it.

@karalabe (Member)

@mykolasmith We may have just found the hang issue you experienced: #16840.
