geth --fast stalls before crossing finish line #15001

Closed
Dirksterson opened this issue Aug 18, 2017 · 152 comments

Comments

@Dirksterson

Dirksterson commented Aug 18, 2017

System information

Geth version: geth version 1.5.9-stable, Go1.7.4
OS & Version: OSX 10.12.6 MacMini 4GB RAM (latest MacMini doesn't support field RAM upgrade anymore) VDSL connection with an average of 20-40Mbit throughput. Ethereum Wallet 0.9.0
Commit hash : (if develop)

Expected behaviour

fast sync to current latest block followed by auto disabling

Actual behaviour

Stalls anywhere from a few thousand down to a few hundred blocks short of the current latest block. It tries to catch up, but new blocks arrive faster than fast sync can import them. Fast sync mode is never auto-disabled.

Steps to reproduce the behaviour

Run geth removedb, then geth --fast --cache=1024. Repeated 5 times on that machine over the last weeks.

Fast sync is already my workaround, starting a fresh fast sync from scratch. Before I was unsuccessful on that machine trying to sync with existing blockchain data instead. This was also a lost race of catching up to the latest block on that machine. This workaround was good until now.

Today even the fast sync workaround (--cache=1024) no longer loads the blockchain completely. It gets to within a few hundred blocks of the latest block and stalls for hours. By the time it catches up a few hundred blocks, the latest block has moved ahead again. The closer geth gets to the latest block (4173161 at the time of writing), the slower it gets. It no longer catches up. I have tried 5 times over the last weeks, giving up after around 4-5 days each time.

Does the machine no longer meet today's minimum hardware requirements, or is this a major bug?

Backtrace

latest block 13 hours ago (!)

I0818 00:15:26.444933 core/blockchain.go:805] imported 148 receipts in 2.775s. #4169952 [e3f556fc… / 36f4d3c9…]

...

latest header chain 50 minutes ago

I0818 12:47:45.107445 core/headerchain.go:342] imported 1 headers in 4.954ms. #4173009 [350d1426… / 350d1426…]

...

currently importing nothing but state entries

I0818 13:36:41.103101 eth/downloader/downloader.go:966] imported 172 state entries in 10.009s: processed 10010213, pending at least 129361
I0818 13:36:41.103131 eth/downloader/downloader.go:966] imported 384 state entries in 783.519ms: processed 10010597, pending at least 129361
I0818 13:36:41.103154 eth/downloader/downloader.go:966] imported 381 state entries in 6.963s: processed 10010978, pending at least 129361
I0818 13:36:41.103167 eth/downloader/downloader.go:966] imported 25 state entries in 87.654ms: processed 10011003, pending at least 129360
I0818 13:36:46.014244 eth/downloader/downloader.go:966] imported 384 state entries in 2.482s: processed 10011387, pending at least 127584
I0818 13:36:49.074483 eth/downloader/downloader.go:966] imported 381 state entries in 7.082s: processed 10011768, pending at least 127105
I0818 13:36:49.074553 eth/downloader/downloader.go:966] imported 384 state entries in 7.971s: processed 10012152, pending at least 127105
I0818 13:36:49.074574 eth/downloader/downloader.go:966] imported 384 state entries in 3.772s: processed 10012536, pending at least 127105
I0818 13:36:49.074603 eth/downloader/downloader.go:966] imported 162 state entries in 5.822s: processed 10012698, pending at least 127105
I0818 13:36:49.074622 eth/downloader/downloader.go:966] imported 25 state entries in 4.050s: processed 10012723, pending at least 127105
I0818 13:36:49.074639 eth/downloader/downloader.go:966] imported 381 state entries in 3.060s: processed 10013104, pending at least 127105
I0818 13:36:49.074742 eth/downloader/downloader.go:966] imported 85 state entries in 7.117s: processed 10013189, pending at least 127105
I0818 13:36:49.074765 eth/downloader/downloader.go:966] imported 375 state entries in 2.219s: processed 10013564, pending at least 127105
I0818 13:36:49.074782 eth/downloader/downloader.go:966] imported 87 state entries in 3.915s: processed 10013651, pending at least 127105
I0818 13:36:49.074795 eth/downloader/downloader.go:966] imported 23 state entries in 271.734ms: processed 10013674, pending at least 127104

@kevingentile

I have been having a similar issue recently. Ubuntu 16.04. Stalling on the last ~100-200 blocks. Restarting the geth client has allowed for some of those missing blocks to be processed but it does not keep up with the highest block. The only fluctuation I see in eth.syncing is the number of knownStates and pulledStates.

@Shem-Tov

I am having the exact same issue as Laughing Cabbage has described, also on Ubuntu 16.04, and also stuck on the last few hundred blocks.
I am running geth1.6.7, at the moment. I have also tried versions 1.7.0, 1.6.6 and 1.6.5, with the same issue. I have tried applying --fast, and have tried without it.
When I restart geth, it usually gets a few more blocks in, and starts "downloading" the chain structure from 0. Downloading is in quotation marks because, when I check the folder into which it should be downloading, the folder access date and time do not change, nor can I find any other folder to which it saves the chain structure.
Leaving it overnight, will get a chain structure in the millions, but the blocks will still not sync.
Searching the web, I have seen this problem exists for many people, for a very long time, across every platform, and with every version of geth, and no one has come up with any kind of solution. And since I am at best an amateur programmer, I have given up with geth.
I will try parity.io now, hopefully they have allowed people with little or no programming skills to connect to ethereum, and if not, then my solution is to give up on ethereum altogether. That will solve the headache this issue is starting to create :-)
I'll check back with geth when it reaches version 2.

@kevingentile

If any current devs think they might have a lead as to where a good starting point might be for tracking this issue I'm happy to do some bug hunting, please let me know.

@tomtom87

tomtom87 commented Aug 21, 2017

@Dirksterson @laughingcabbage I have had exactly the same issues for the past week and so have many of my colleagues.

After the latest advice to run --fast --cache=1024 I now get the following:

WARN [08-21|11:48:26] Stalling state sync, dropping peer       peer=655c0278c317a012
WARN [08-21|11:48:26] Stalling state sync, dropping peer       peer=f26dce0aea871dc8
WARN [08-21|11:48:26] Stalling state sync, dropping peer       peer=0fb49536fda319d3
WARN [08-21|11:48:27] Stalling state sync, dropping peer       peer=ae8de9feee4df4e6
WARN [08-21|11:48:27] Stalling state sync, dropping peer       peer=e7a69c447cb83857
WARN [08-21|11:48:30] Stalling state sync, dropping peer       peer=8e8edc9627fedc6b
WARN [08-21|11:48:32] Stalling state sync, dropping peer       peer=606587b48a16fd10
WARN [08-21|11:48:32] Node data write error                    err="state node 638deb…cf0f09 failed with all peers (4 tries, 4 peers)"

@csillag

csillag commented Aug 21, 2017

Same here. I am also on v1.6.7.

Current status, after running it for more than a week:

Downloading block 4,179,697 of 4,179,911,
Downloading chain structure 8,242,414 of 8,246,476

@csillag

csillag commented Aug 21, 2017

Isn't this a duplicate of #14988 and also #14995?

@darksh1ne

A similar issue here. On Aug 16th I had an almost fully synced blockchain, just 10-20 hours behind the current block. I then started geth as:

$ geth --syncmode=fast --cache=$(( 1024 + 512 ))

Geth has stayed behind the current block the whole time. Currently (Aug 21st) its state is:

> eth.syncing
{
  currentBlock: 4181084,
  highestBlock: 4182536,
  knownStates: 0,
  pulledStates: 0,
  startingBlock: 4179967
}

whereas etherscan.io shows 4185672 as the last block.

There are no errors in geth's output. It is in its normal state of slowly importing new segments and using the HDD at 5-10 MB/s (both reading and writing). No high CPU usage.

INFO [08-21|14:27:00] Imported new chain segment               blocks=1 txs=60  mgas=6.645  elapsed=26.766s   mgasps=0.248  number=4181082 hash=036737…8ef0ce
INFO [08-21|14:27:16] Imported new chain segment               blocks=1 txs=77  mgas=1.748  elapsed=16.123s   mgasps=0.108  number=4181083 hash=d498b7…8c64a9
INFO [08-21|14:28:44] Imported new chain segment               blocks=1 txs=137 mgas=6.699  elapsed=1m28.060s mgasps=0.076  number=4181084 hash=b8153c…a3bcbf
INFO [08-21|14:30:44] Imported new chain segment               blocks=1 txs=62  mgas=6.691  elapsed=1m59.831s mgasps=0.056  number=4181085 hash=4e7b58…7f71d5

My geth is:

$ geth attach
Welcome to the Geth JavaScript console!

instance: Geth/v1.6.7-stable/linux-amd64/go1.8
coinbase: <hidden>
at block: 4166508 (Wed, 16 Aug 2017 23:59:48 EEST)
 datadir: <hidden>
 modules: admin:1.0 debug:1.0 eth:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
$ geth version
Geth
Version: 1.6.7-stable
Architecture: amd64
Protocol Versions: [63 62]
Network Id: 1
Go Version: go1.8
Operating System: linux
GOPATH=
GOROOT=/usr/lib/go

@tomtom87

Same issue here, started around the same time. Looks like this is affecting everyone, and Parity users too now.

@gdassori

Hello, Ubuntu 16.04 here and same issue: got stuck on the last ~2000 blocks.

@tomtom87

tomtom87 commented Aug 29, 2017 via email

@MrHash

MrHash commented Sep 1, 2017

Same problem. Can't sync last ~100 blocks on 1.6.7. Restarting gets close but lots of Stalling state sync, dropping peer messages. SSD and fibre connection.

@tomtom87

tomtom87 commented Sep 1, 2017 via email

@wtfiwtz

wtfiwtz commented Sep 2, 2017

Is this related? 0042f13 and #14460

@alfkors

alfkors commented Sep 3, 2017

@wtfiwtz I don't really know enough about the whole process, but I would say yeah probably... for what it's worth...

@wtfiwtz

wtfiwtz commented Sep 6, 2017

I was able to get it to successfully sync up, after switching from fast sync to normal sync and giving it a day or two to catch up on the last 35,000 or so blocks, using a 2012-era MacBook Pro with an SSD drive. It was necessary to be on the latest block to be able to successfully submit a transaction with the Ethereum "Mist" wallet (otherwise you get an error about insufficient gas).

Not sure if the light mode would make any difference, but I think you need to do it with a brand new wallet, not an existing blockchain download.

@wtfiwtz

wtfiwtz commented Sep 23, 2017

Ok I had to restart the sync from the beginning and have hit this problem again...

This is what I have found... Blocks are getting discarded from peers because the chain height is incorrectly set to 0.

wtfiwtz@f34b775

INFO [09-23|23:16:30] Loaded most recent local full block      number=0       hash=d4e567…cb8fa3 td=17179869184
INFO [09-23|23:16:30] Loaded most recent local fast block      number=4304570 hash=657bf3…912f25 td=1006522706491316931004
INFO [09-23|23:21:06] Peer discarded announcement              peer=d9c3012a7a0dfb3f number=4304681 hash=7e9da0…8db154 distance=4304681
INFO [09-23|23:21:06] ** Block number                          num=4304681
INFO [09-23|23:21:06] ** Chain height                          num=0
WARN [09-23|23:21:06] Discarded propagated block, too far away peer=d9c3012a7a0dfb3f number=4304681 hash=7e9da0…8db154 distance=4304681
INFO [09-23|23:21:06] Peer discarded announcement              peer=a8aafc6f4437be4f number=4304681 hash=7e9da0…8db154 distance=4304681
INFO [09-23|23:21:06] Peer discarded announcement              peer=ea4587bcfb02c92d number=4304681 hash=7e9da0…8db154 distance=4304681
INFO [09-23|23:21:06] ** Block number                          num=4304681
INFO [09-23|23:21:06] ** Chain height                          num=0
WARN [09-23|23:21:06] Discarded propagated block, too far away peer=ea4587bcfb02c92d number=4304681 hash=7e9da0…8db154 distance=4304681

The height is retrieved from a callback function such as this:

	heighter := func() uint64 {
		return blockchain.CurrentBlock().NumberU64()
	}

So this is probably an issue with switching between fast and normal sync modes, where the chain height is assumed to be 0 when it should be equal to the fast chain height on initialization.
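
To make the failure mode concrete, here is a minimal Go sketch of the distance check implied by the logs above. The names `distance` and `tooFarAway` and the limit of 32 blocks are assumptions for illustration, not copied from geth's source. The point is only that a chain height of 0 makes every propagated block look millions of blocks away.

```go
package main

import "fmt"

// maxQueueDist is a hypothetical limit on how far ahead of the local chain
// a propagated block may be before it is discarded (assumed value).
const maxQueueDist = 32

// distance returns how far a propagated block is from the local chain height.
func distance(blockNumber, chainHeight uint64) int64 {
	return int64(blockNumber) - int64(chainHeight)
}

// tooFarAway reports whether a propagated block would be discarded.
func tooFarAway(blockNumber, chainHeight uint64) bool {
	return distance(blockNumber, chainHeight) > maxQueueDist
}

func main() {
	// Height taken from the full chain (0), as in the logs above.
	fmt.Println(distance(4304681, 0), tooFarAway(4304681, 0))
	// Height taken from the fast chain head (4304570) instead.
	fmt.Println(distance(4304681, 4304570), tooFarAway(4304681, 4304570))
}
```

With the full chain height of 0 the distance equals the block number itself, matching the `distance=4304681` field in the log. With the fast chain head it would be 111, which is still over the assumed limit but at least shrinks as sync progresses.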

Is this an area you are familiar with @karalabe since you did the original fast sync implementation?

@wtfiwtz

wtfiwtz commented Sep 23, 2017

If the peer's total difficulty is much lower, does that mean they are only on the full sync mode and won't work with a fast sync peer?

INFO [09-24|07:31:28] ** Total difficulty                      ours="{neg:false abs:[13700445755005557100 54]}" theirs=17179869184
INFO [09-24|07:31:28] ** fast sync?                            peer=3000f1cf9e63ce38 enabled=1

Pretty much can't find any peers that are not with a significantly lower total difficulty!

This worries me because of the following comment:

// synchronise will select the peer and use it for synchronising. If an empty string is given
// it will use the best peer possible and synchronize if it's TD is higher than our own. If any of the
// checks fail an error will be returned. This method is synchronous
func (d *Downloader) synchronise(id string, hash common.Hash, td *big.Int, mode SyncMode) error {

@wtfiwtz

wtfiwtz commented Sep 24, 2017

Ok I left it running this morning, and at some point, it flipped from fast to full mode when it received just 1 more chain segment or block receipts (it hadn't received any for at least 1.5 hours since I last restarted):

INFO [09-24|09:07:27] Peer discarded announcement              peer=b057fec043b525ed number=4305853 hash=37f333…4a74d9 distance=4305853
INFO [09-24|09:07:27] ** Total difficulty                      ours="{neg:false abs:[14195334218426315772 54]}" theirs=17179869184
INFO [09-24|09:07:27] ** fast sync?                            peer=b057fec043b525ed enabled=1
INFO [09-24|09:07:27] ** Block number                          num=4305854
INFO [09-24|09:07:27] ** Chain height                          num=0
WARN [09-24|09:07:27] Discarded propagated block, too far away peer=b057fec043b525ed number=4305854 hash=2e8a61…ae021f distance=4305854
INFO [09-24|09:07:27] Imported new state entries               count=448  elapsed=1.479ms   processed=2608239 pending=2047  retry=2   duplicate=2846 unexpected=8434
INFO [09-24|09:07:29] Imported new state entries               count=779  elapsed=3.995ms   processed=2609018 pending=2225  retry=22  duplicate=2846 unexpected=8434
INFO [09-24|09:07:29] ** Total difficulty                      ours="{neg:false abs:[14195334218426315772 54]}" theirs=17179869184
INFO [09-24|09:07:29] ** fast sync?                            peer=479032d8362da82d enabled=1
INFO [09-24|09:07:31] Imported new state entries               count=1089 elapsed=10.173ms  processed=2610107 pending=1483  retry=1   duplicate=2846 unexpected=8434
INFO [09-24|09:07:35] Imported new state entries               count=1081 elapsed=14.713ms  processed=2611188 pending=48    retry=0   duplicate=2846 unexpected=8434
INFO [09-24|09:07:35] Imported new state entries               count=35   elapsed=853.5µs   processed=2611223 pending=0     retry=0   duplicate=2846 unexpected=8434
INFO [09-24|09:07:35] Imported new block receipts              count=0    elapsed=3.752ms   bytes=0 number=4305451 hash=ac92d6…397f6c ignored=1
INFO [09-24|09:07:35] Committed new head block                 number=4305451 hash=ac92d6…397f6c
INFO [09-24|09:07:35] Imported new chain segment               blocks=1 txs=17 mgas=0.442 elapsed=28.174ms  mgasps=15.701 number=4305452 hash=4a61da…5f72e4
ERROR[09-24|09:07:35]
########## BAD BLOCK #########
Chain config: {ChainID: 1 Homestead: 1150000 DAO: 1920000 DAOSupport: true EIP150: 2463000 EIP155: 2675000 EIP158: 2675000 Byzantium: 9223372036854775807 Engine: ethash}

Number: 4305453
Hash: 0x6c4471bed33ac85f132153650f4f69230e9ef972ff33cba1e79795fb72130c66


Error: unknown ancestor
##############################

WARN [09-24|09:07:35] Synchronisation failed, dropping peer    peer=cb8ebbf8130355a7 err="retrieved hash chain is invalid"
ERROR[09-24|09:07:35] Fast sync complete, auto disabling
INFO [09-24|09:07:35] Removing p2p peer                        id=cb8ebbf8130355a7 conn=inbound duration=1h32m36.442s peers=24 req=false err="useless peer"
INFO [09-24|09:07:36] Ethereum peer connected                  id=8453dbef52518caf conn=dyndial name=Geth/v1.6.7-stable-ab5646c5/linux-amd64/go1.8.1
INFO [09-24|09:07:36] ** Total difficulty                      ours="{neg:false abs:[14195334218426315772 54]}" theirs=1009137134152556054860
INFO [09-24|09:07:36] ** fast sync?                            peer=479032d8362da82d enabled=0
WARN [09-24|09:07:36] Ethereum handshake failed                id=8453dbef52518caf conn=dyndial err="Genesis block mismatch - 6577484f58748da6 (!= d4e56740f876aef8)"
INFO [09-24|09:07:36] Removing p2p peer                        id=8453dbef52518caf conn=dyndial duration=279.836ms    peers=24 req=false err="Genesis block mismatch - 6577484f58748da6 (!= d4e56740f876aef8)"
INFO [09-24|09:07:37] Peer discarded announcement              peer=ca40c7662d6ac5ed number=4305853 hash=37f333…4a74d9 distance=402
INFO [09-24|09:07:37] Peer discarded announcement              peer=ca40c7662d6ac5ed number=4305854 hash=2e8a61…ae021f distance=403
INFO [09-24|09:07:38] Ethereum peer connected                  id=6949cab8fc6d09bd conn=inbound name=Geth/v1.6.2-unstable-2a41e76b/linux-amd64/go1.8.3

The key log messages here are Committed new head block, Imported new block receipts and Imported new chain segment, which allow the full head blockchain count to update.

So I'm guessing that the network is starved of fast blocks, and they haven't yet reached their intended pivot point... before they flip to full mode.

Also note that you can't force it to use full mode on the command line, it doesn't work.

Is there some way to force this flipping from fast to full mode prematurely? Perhaps if we haven't received a new chain segment for over an hour? Or find a peer that has what we are looking for with a broader peer search?

@tomtom87

tomtom87 commented Sep 24, 2017 via email

@wtfiwtz

wtfiwtz commented Sep 25, 2017

You'll probably find it much easier to be on Parity (https://parity.io) - the wallet can do a light-mode sync in around 20-30 minutes... this is a good short-to-medium term solution. However, on Mac you need to be on OS X Sierra (or use the brew install instead)

I think someone needs to re-architect the fast sync in geth as the client needs to reach out to more diverse peers when it gets "stuck" for long periods of time like this. I have a few ideas, but very limited time, and it really needs to be done (or reviewed) by someone who knows what they are doing :P

@tomtom87

tomtom87 commented Sep 25, 2017 via email

@wtfiwtz

wtfiwtz commented Apr 15, 2018

@Mergathal that is the nature of a blockchain-based approach. Since Bitcoin only targets a block every 10 minutes, the throughput is lower and the number of blocks is lower.

Ethereum generates a new block roughly every 15 seconds, allowing more transactions and faster response times. There will naturally be more data generated due to this approach. The data would need to be pruned somehow to keep it at a reasonable level.

Interestingly, in http://www.freekpaans.nl/2018/04/anatomy-geth-fast-sync/, a completed fast sync took only 77 GB of locally stored blockchain data. I've routinely destroyed fast syncs with much more data than that (I have limited space on my MacBook Pro). It seems to me that the longer you are pulling down the state tries, the more data is stored locally. It may also depend on how long you are "full syncing" once the fast sync is complete. I'm yet to fully understand why, but it's an interesting observation.

@garyng2000

We constantly 'refresh' by fast syncing from scratch to keep the size in check. An initial fast sync is only around 60 GB (as of maybe a month ago), then the size grows. After one month we are seeing 140 GB. Not sure if it is because older state needs to be pulled in or what. Does anyone with a 'true' full sync know the current disk size?

@wtfiwtz

wtfiwtz commented Apr 15, 2018

@garyng2000 a full sync took 220 GB according to the articles linked above. So it would be approximately 80 GB a month as a "fast sync" switches to a "full sync".

@garyng2000

@wtfiwtz
That is something that puzzles me. If it is 80 GB a month we are talking TB of data soon, so how come a 'true' full sync is only 220 GB? If that is the case, maybe I should do a true full sync (from scratch), which can take a bit of time but the disk growth rate would be slower? Strange.

@wtfiwtz

wtfiwtz commented Apr 15, 2018

@garyng2000 it could be because the accumulated state is bigger when you participate in the immediate verification of the transactions, whereas post-verification there is not as much information to download from peers. However, you would need someone more knowledgeable about Ethereum's inner workings to confirm or deny that.

@CryptoKiddies

CryptoKiddies commented Apr 20, 2018

I'm on geth v1.8.4 and Ubuntu 16.04. Not only is geth stopping before final sync, but it completely stalls around 30-60 minutes after starting a sync. The CPU usage drops to ~3% of capacity and stays there.

[Screenshot taken 2018-04-19 at 6:10:37 PM]

I see continuous error messages for connecting to nodes, and the state and blocks completely stop updating. I have to restart geth (I use systemd restart). This is very concerning because I don't want my node to stall in the middle of serving our dapp.

@suspended

@GeeeCoin you might want to try v1.8.3. I had a similar issue to yours when I moved from .3 to .4.

@CryptoKiddies

CryptoKiddies commented Apr 21, 2018

@suspended v1.8.6 has the same unresolved issue. Downgrading to geth v1.8.3 worked for about 3 weeks, but now I'm facing the same issues.

@mtj151

mtj151 commented May 13, 2018

I am also having the same sync problems... dropping peers etc. I am almost synced (about 50-100 blocks behind if I let it run). If I restart geth it catches up until peers start to drop again.

Using Ubuntu 16.04. I have tried different versions of Geth down to 1.8.2. Built the dev version too with no change.

I have lots of experience running a node having done it since the start... but I did re-download the block chain a month or 2 ago.

I use a SATA 500GB SSD but it is encrypted on the drive level and the home directory which is where the blockchain is stored. The encryption means that the read/write abilities are slower and using a disk monitor it shows a high level of activity constantly while geth is running.

I understand storing/using the blockchain on encrypted drive is probably not the best setup (for speed and amount of read writes/life of SSD) so I'm guessing the next thing I should try is a new separate un-encrypted SSD to store the chain... but I have not got round to doing so yet (having another SSD purely for eth blockchain is fairly expensive option). Currently my chaindata folder is 358.8GB

Looks like Ubuntu 16.04 is a consistent part of this thread/problem?

@CryptoKiddies

@mtj151 good observation. I'm not ruling out any factors at this point. Is anyone using AWS by any chance?

@mtj151

mtj151 commented May 21, 2018

I have also noticed that I am unable to send transactions while I am getting the "Synchronisation failed, retrying err="block download cancelled (requested)"" warnings.

I sent one transaction fine but then the warnings come up and it wouldn't let me send another transaction (even after the messages stopped and syncing started again). I had to completely restart geth to be able to send the transaction.

@ghost

ghost commented May 25, 2018

@GeeeCoin I was unable to get a Geth node to stay up to date with chaintip on AWS in any meaningful time without using Provisioned IOPS SSDs on EBS-optimized instances or the i3 storage-optimized instances with 8GB RAM or greater. Even then, I had to write a watchdog to kick geth over every now and then for when it would drop all its peers or lag too much behind the chaintip. Now I just have dedicated boxes for geth nodes running NVMe SSD in the datacenter, and a NUC for LAN dev which has a 1 TB SATA SSD, 8GB RAM and a quad-core processor.
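
A watchdog like the one described above could be built around a decision function such as the following. This is only a sketch of the idea, not the actual watchdog: the function name `shouldRestart`, its parameters, and the lag threshold are all hypothetical, and how the peer count and block heights are obtained is left out.

```go
package main

import "fmt"

// shouldRestart reports whether a hypothetical watchdog would restart the node:
// either it has no peers left, or it lags the chain tip by more than maxLag blocks.
func shouldRestart(peers int, localHead, networkHead, maxLag uint64) bool {
	if peers == 0 {
		return true
	}
	return networkHead > localHead && networkHead-localHead > maxLag
}

func main() {
	// Block heights reported earlier in this thread: local 4181084, network 4185672.
	fmt.Println(shouldRestart(10, 4181084, 4185672, 100))
	// A healthy node that is only a few blocks behind.
	fmt.Println(shouldRestart(10, 4185670, 4185672, 100))
}
```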

@CryptoKiddies

@10A7 appreciate the data point. If NUC is outperforming a quad core with 8GB in AWS, that's a problem. Amazon may have network latency that hasn't been optimized with the t. class. The i3 looks like an option. We're taking a look at Quarian; thanks for building that out!

@mtj151

mtj151 commented Jun 7, 2018

Sounds like 10a7 had the same problem with lagging behind the chain tip... good description of the problem. Did NVMe SSD fix the problem?? I'm looking at getting one in the coming weeks to run geth.

@ghost

ghost commented Jun 7, 2018

@mtj151 NVMe SSD doesn't seem to matter. I have no trouble keeping SATA SSDs and bcache-fronted magnetic arrays intact and synced I/O wise.

If you are synced and "importing new chain segment", it seems to mostly be network issues that cause my nodes to fall behind. Restarting geth often helps to get different peers. Geth sync-after-fast-pivot is also much more reliable for me if I am not behind a NAT, and can forward/open 30303/tcp.

@jdowning100

FWIW I was able to get geth to fully sync by waiting until eth.blockNumber is near the numbers in eth.syncing and then restarting geth. I was able to do this at ~160m states. After restarting geth, it took about 20 min to catch up to the blockchain and now eth.syncing is false and the only output now is 'imported new chain segment' every time a new block is found.
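
The "how far behind am I" check can be sketched in Go as a parser for an eth_syncing result. This is an illustration under assumptions: it assumes the JSON-RPC result carries `currentBlock` and `highestBlock` as 0x-prefixed hex strings (or the literal `false` when not syncing), and the names `syncStatus`, `parseHex` and `blocksBehind` are made up for this example. As explained later in the thread, a small block gap alone does not mean the state download has finished.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// syncStatus holds the two block fields of an eth_syncing result
// (assumed to be 0x-prefixed hex strings over JSON-RPC).
type syncStatus struct {
	CurrentBlock string `json:"currentBlock"`
	HighestBlock string `json:"highestBlock"`
}

// parseHex converts a 0x-prefixed hex quantity into a uint64.
func parseHex(s string) (uint64, error) {
	return strconv.ParseUint(strings.TrimPrefix(s, "0x"), 16, 64)
}

// blocksBehind returns highestBlock - currentBlock and whether the node
// reports that it is syncing at all.
func blocksBehind(raw []byte) (uint64, bool, error) {
	if strings.TrimSpace(string(raw)) == "false" {
		return 0, false, nil
	}
	var s syncStatus
	if err := json.Unmarshal(raw, &s); err != nil {
		return 0, true, err
	}
	cur, err := parseHex(s.CurrentBlock)
	if err != nil {
		return 0, true, err
	}
	high, err := parseHex(s.HighestBlock)
	if err != nil {
		return 0, true, err
	}
	if high < cur {
		return 0, true, nil
	}
	return high - cur, true, nil
}

func main() {
	// Block numbers taken from the eth.syncing output quoted earlier in this thread.
	raw := []byte(fmt.Sprintf(`{"currentBlock":"0x%x","highestBlock":"0x%x"}`, 4181084, 4182536))
	behind, syncing, err := blocksBehind(raw)
	fmt.Println(behind, syncing, err)
}
```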

@karalabe
Member

karalabe commented Oct 4, 2018

Syncing Ethereum is a pain point for many people, so I'll try to detail what's happening behind the scenes so there might be a bit less confusion.

The current default mode of sync for Geth is called fast sync. Instead of starting from the genesis block and reprocessing all the transactions that ever occurred (which could take weeks), fast sync downloads the blocks, and only verifies the associated proof-of-works. Downloading all the blocks is a straightforward and fast procedure and will relatively quickly reassemble the entire chain.

Many people falsely assume that because they have the blocks, they are in sync. Unfortunately this is not the case, since no transaction was executed, so we do not have any account state available (i.e. balances, nonces, smart contract code and data). These need to be downloaded separately and cross checked with the latest blocks. This phase is called the state trie download and it actually runs concurrently with the block downloads; alas it takes a lot longer nowadays than downloading the blocks.

So, what's the state trie? In the Ethereum mainnet, there are a ton of accounts already, which track the balance, nonce, etc of each user/contract. The accounts themselves are however insufficient to run a node, they need to be cryptographically linked to each block so that nodes can actually verify that the accounts are not tampered with. This cryptographic linking is done by creating a tree data structure above the accounts, each level aggregating the layer below it into an ever smaller layer, until you reach the single root. This gigantic data structure containing all the accounts and the intermediate cryptographic proofs is called the state trie.

Ok, so why does this pose a problem? This trie data structure is an intricate interlink of hundreds of millions of tiny cryptographic proofs (trie nodes). To truly have a synchronized node, you need to download all the account data, as well as all the tiny cryptographic proofs to verify that no one in the network is trying to cheat you. This itself is already a crazy number of data items. The part where it gets even messier is that this data is constantly morphing: at every block (15s), about 1000 nodes are deleted from this trie and about 2000 new ones are added. This means your node needs to synchronize a dataset that is changing 200 times per second. The worst part is that while you are synchronizing, the network is moving forward, and state that you began to download might disappear while you're downloading, so your node needs to constantly follow the network while trying to gather all the recent data. But until you actually do gather all the data, your local node is not usable since it cannot cryptographically prove anything about any accounts.
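
The 200 changes per second figure follows directly from the numbers in the paragraph above. A small Go sketch of that arithmetic (the function name `churnPerSecond` is just for illustration):

```go
package main

import "fmt"

// churnPerSecond returns trie-node modifications per second, given the number
// of nodes deleted and added per block and the block interval in seconds.
func churnPerSecond(deleted, added int, blockSeconds float64) float64 {
	return float64(deleted+added) / blockSeconds
}

func main() {
	// About 1000 deletions and 2000 additions every 15 seconds.
	fmt.Println(churnPerSecond(1000, 2000, 15))
}
```

That is (1000 + 2000) / 15 = 200 modifications per second that a syncing node has to keep up with.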

If you see that you are 64 blocks behind mainnet, you aren't yet synchronized, not even close. You are just done with the block download phase and still running the state downloads. You can see this yourself via the seemingly endless Imported state entries [...] stream of logs. You'll need to wait that out too before your node comes truly online.


Q: The node just hangs on importing state entries?!

A: The node doesn't hang, it just doesn't know how large the state trie is in advance so it keeps on going and going and going until it discovers and downloads the entire thing.

The reason is that a block in Ethereum only contains the state root, a single hash of the root node. When the node begins synchronizing, it knows about exactly 1 node and tries to download it. That node can refer to up to 16 new nodes, so in the next step, we'll know about 16 new nodes and try to download those. As we go along the download, most of the nodes will reference new ones that we didn't know about until then. This is why you might be tempted to think it's stuck on the same numbers. It is not, rather it's discovering and downloading the trie as it goes along.
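
The discovery process described above can be illustrated with a toy breadth-first walk in Go. This is a simplified sketch, not geth's state sync: the `node` type, the `syncTrie` function and the in-memory `remote` map standing in for network peers are all assumptions made for the example. It shows why the pending count says nothing about the total size: children only become known once their parent has been fetched.

```go
package main

import "fmt"

// node is a toy trie node that only lists the hashes (here, names) of its children.
type node struct {
	children []string
}

// syncTrie walks a trie breadth-first starting from nothing but the root hash.
// fetch stands in for a network request (hypothetical). It returns the number
// of nodes processed and the largest pending queue observed along the way.
func syncTrie(root string, fetch func(string) node) (processed, maxPending int) {
	pending := []string{root}
	for len(pending) > 0 {
		if len(pending) > maxPending {
			maxPending = len(pending)
		}
		hash := pending[0]
		pending = pending[1:]
		n := fetch(hash)
		processed++
		// Children are only discovered now that the parent has been downloaded.
		pending = append(pending, n.children...)
	}
	return processed, maxPending
}

func main() {
	// A tiny remote trie; real trie nodes may reference up to 16 children.
	remote := map[string]node{
		"root": {children: []string{"a", "b"}},
		"a":    {children: []string{"a1", "a2", "a3"}},
		"b":    {children: []string{"b1"}},
	}
	// Hashes missing from the map are treated as leaves with no children.
	fetch := func(h string) node { return remote[h] }
	processed, maxPending := syncTrie("root", fetch)
	fmt.Println(processed, maxPending)
}
```

In this toy trie the walk starts with a single pending entry, yet ends up processing 7 nodes, and the pending queue never exceeds 4 at any point.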

Q: I'm stuck at 64 blocks behind mainnet?!

A: As explained above, you are not stuck, just finished with the block download phase, waiting for the state download phase to complete too. This latter phase nowadays takes a lot longer than just getting the blocks.

Q: Why does downloading the state take so long, I have good bandwidth?

A: State sync is mostly limited by disk IO, not bandwidth.

The state trie in Ethereum contains hundreds of millions of nodes, most of which take the form of a single hash referencing up to 16 other hashes. This is a horrible way to store data on a disk, because there's almost no structure in it, just random numbers referencing even more random numbers. This makes any underlying database weep, as it cannot optimize storing and looking up the data in any meaningful way.

Not only is storing the data very suboptimal, but due to the 200 modifications / second and pruning of past data, we cannot even download it in a properly pre-processed way to make it import faster without the underlying database shuffling it around too much. The end result is that even a fast sync nowadays incurs a huge disk IO cost, which is too much for a mechanical hard drive.

Q: Wait, so I can't run a full node on an HDD?

A: Unfortunately not. Doing a fast sync on an HDD will take more time than you're willing to wait with the current data schema. Even if you do wait it out, an HDD will not be able to keep up with the read/write requirements of transaction processing on mainnet.

You however should be able to run a light client on an HDD with minimal impact on system resources. If you wish to run a full node however, an SSD is your only option.

@karalabe closed this as completed Oct 4, 2018

@CryptoKiddies

@karalabe Thanks for breaking this down again. We knew most of this about Geth/Eth already, but I'm really surprised as to how suboptimal the state trie system is at being stored to disk; I thought the whole point of building ethereum this way (with modified patricia trees etc.) was to minimize footprint/disk mods, but looks like innovation in storage structures is still needed.

@hustnn

hustnn commented Oct 11, 2018

@karalabe Nice introduction. I understand the fast sync internals better now.

@quietnan

@karalabe So is there any way of knowing how close you are to being finished syncing? None of the metrics from the eth_syncing call seems to carry meaningful information about this.

@nyetwurk
Contributor

nyetwurk commented Jan 27, 2020

> @karalabe So is there any way of knowing how close you are to being finished syncing? None of the metrics from the eth_syncing call seems to carry meaningful information about this.

#16558
https://eips.ethereum.org/EIPS/eip-2029

If those are actually implemented, you'll at least be able to scrape the number of states from an external reference.
