
FYI - Sync mode SNAP does not work #23228

Closed

ryny24 opened this issue Jul 17, 2021 · 44 comments

ryny24 commented Jul 17, 2021

I run several geth nodes, but I've always used sync mode "fast". I am building a new machine, so I figured I would try "snap" (the new default). This is running on an HP DL380 G5 with 24 GB of RAM and a 1 TB SSD.

Using snap sync, geth periodically goes into a database-compaction phase that completely halts the sync process, sometimes for over an hour. I let it continue anyway. Now, this morning, my sync appears to be stuck in "state heal in progress", and it looks like it has been there for hours. I give up; I am going back to fast.

@ryny24 ryny24 changed the title FYI - Sync mode SNAP fails FYI - Sync mode SNAP does not work Jul 17, 2021
holiman (Contributor) commented Jul 17, 2021 via email

mohamedmansour (Member) commented Jul 17, 2021

I am having the same issue; ever since the latest update, I cannot sync my node completely.

[screenshot attachment]

Should I go back to fast sync, or should I wipe the database and try again? Is there anything I can do to help gather more info so the geth team can understand why?

holiman (Contributor) commented Jul 17, 2021 via email

mohamedmansour (Member) commented

The "Unexpected trienode heal packet" warning, is that normal? It is about 90% of my logs at this point. If it is normal, should we downgrade it to INFO instead?

holiman (Contributor) commented Jul 17, 2021 via email

mohamedmansour (Member) commented Jul 18, 2021

This is my second time doing a fresh sync, and I am getting many "Unexpected trienode heal packet" warnings; there are quite a lot of them. Is there anything incorrect with my setup that would lead to this?

WARN [07-18|16:45:26.500] Unexpected trienode heal packet          peer=a0378d3b         reqid=4,622,035,716,730,946,788
WARN [07-18|16:45:26.565] Unexpected trienode heal packet          peer=d9d3b409         reqid=5,186,876,070,523,757,744
WARN [07-18|16:45:26.784] Unexpected trienode heal packet          peer=5532d114         reqid=4,913,062,635,039,342,268
WARN [07-18|16:45:27.002] Unexpected trienode heal packet          peer=765491aa         reqid=8,537,381,729,530,772,935
WARN [07-18|16:45:27.134] Unexpected trienode heal packet          peer=17935a2e         reqid=398,999,258,026,622,278
WARN [07-18|16:45:27.285] Unexpected trienode heal packet          peer=fb395765         reqid=2,540,630,659,186,956,989
WARN [07-18|16:45:27.394] Unexpected trienode heal packet          peer=6d63f63c         reqid=3,060,663,354,523,276,049
WARN [07-18|16:45:29.757] Unexpected trienode heal packet          peer=a8523f94         reqid=9,026,963,612,786,922,325
WARN [07-18|16:45:29.859] Unexpected trienode heal packet          peer=889cc785         reqid=1,566,881,240,675,074,312
WARN [07-18|16:45:30.026] Unexpected trienode heal packet          peer=a6d04daf         reqid=2,158,928,981,022,095,841
WARN [07-18|16:45:30.152] Unexpected trienode heal packet          peer=9e9cac22         reqid=3,592,232,586,991,666,835
WARN [07-18|16:45:31.019] Unexpected trienode heal packet          peer=5d8589a4         reqid=393,438,786,560,958,185
WARN [07-18|16:45:31.181] Unexpected trienode heal packet          peer=91bbf60e         reqid=4,523,948,219,739,688,930
WARN [07-18|16:45:31.372] Unexpected trienode heal packet          peer=d775bde9         reqid=4,708,330,904,872,538,623
WARN [07-18|16:45:32.379] Unexpected trienode heal packet          peer=c4532ff9         reqid=6,874,370,675,488,852,123
WARN [07-18|16:45:32.380] Unexpected trienode heal packet          peer=4fa42e20         reqid=3,267,196,924,822,701,039
WARN [07-18|16:45:32.542] Unexpected trienode heal packet          peer=f1b13a02         reqid=884,270,897,609,630,641
INFO [07-18|16:45:43.977] State heal in progress                   accounts=36222@1.73MiB        slots=8316@630.08KiB       codes=12@101.38KiB       nodes=397,802@152.75MiB pending=123,495
WARN [07-18|16:45:43.978] Unexpected trienode heal packet          peer=5d664149         reqid=3,710,167,488,769,797,833
WARN [07-18|16:45:43.981] Unexpected trienode heal packet          peer=e3dbfe41         reqid=2,073,374,780,961,819,677
WARN [07-18|16:45:43.983] Unexpected trienode heal packet          peer=bb71328b         reqid=8,121,458,505,920,535,863
WARN [07-18|16:45:43.985] Unexpected trienode heal packet          peer=a1ea248b         reqid=267,783,068,321,922,388
WARN [07-18|16:45:43.991] Unexpected trienode heal packet          peer=a5201f7c         reqid=1,439,199,076,782,710,385
WARN [07-18|16:45:44.056] Unexpected trienode heal packet          peer=61551118         reqid=3,545,386,287,666,985,539
WARN [07-18|16:45:44.056] Unexpected trienode heal packet          peer=85c41b4d         reqid=2,200,301,475,214,792,901
WARN [07-18|16:45:44.057] Unexpected trienode heal packet          peer=cb4a957a         reqid=151,712,146,799,871,033
WARN [07-18|16:45:44.062] Unexpected trienode heal packet          peer=8320ab75         reqid=7,465,864,124,670,898,925
WARN [07-18|16:45:44.064] Unexpected trienode heal packet          peer=3319911a         reqid=1,676,101,697,579,000,381
WARN [07-18|16:45:44.065] Unexpected trienode heal packet          peer=bc0a22b2         reqid=7,641,458,910,862,074,927
WARN [07-18|16:45:44.125] Unexpected trienode heal packet          peer=3fe6f503         reqid=7,863,934,179,564,415,364
WARN [07-18|16:45:44.129] Unexpected trienode heal packet          peer=22350643         reqid=4,517,462,049,592,695,959
WARN [07-18|16:45:44.130] Unexpected trienode heal packet          peer=6db8fffa         reqid=3,966,230,954,419,389,835
WARN [07-18|16:45:44.452] Unexpected trienode heal packet          peer=b6dc91f8         reqid=1,988,555,966,599,462,679
WARN [07-18|16:45:44.502] Unexpected trienode heal packet          peer=dd94e72d         reqid=4,450,099,697,928,108,082
WARN [07-18|16:45:44.504] Unexpected trienode heal packet          peer=5b858358         reqid=8,925,946,666,813,154,737
WARN [07-18|16:45:44.509] Unexpected trienode heal packet          peer=23a85aa5         reqid=3,080,196,199,118,744,386
WARN [07-18|16:45:44.511] Unexpected trienode heal packet          peer=eba98804         reqid=8,349,544,363,420,424,856
WARN [07-18|16:45:44.513] Unexpected trienode heal packet          peer=ea43a6fe         reqid=1,184,655,664,638,635,660
WARN [07-18|16:45:44.516] Unexpected trienode heal packet          peer=12b7db81         reqid=5,793,738,489,717,339,364
WARN [07-18|16:45:44.516] Unexpected trienode heal packet          peer=e70811b4         reqid=2,177,353,076,724,056,341
WARN [07-18|16:45:44.763] Unexpected trienode heal packet          peer=22e17f9c         reqid=8,853,990,770,070,346,560
WARN [07-18|16:45:44.811] Unexpected trienode heal packet          peer=e6382c70         reqid=739,770,737,457,955,743
INFO [07-18|16:45:44.850] Imported new block headers               count=1    elapsed=6.160ms     number=12,851,749 hash=7a3243..6dae55 age=1m23s
WARN [07-18|16:45:44.871] Unexpected trienode heal packet          peer=cd828ba4         reqid=8,798,951,724,798,149,997
WARN [07-18|16:45:44.873] Unexpected trienode heal packet          peer=6e5635a0         reqid=8,705,205,715,673,999,671
WARN [07-18|16:45:44.875] Unexpected trienode heal packet          peer=cbdd12a9         reqid=1,750,396,375,655,342,341
WARN [07-18|16:45:44.876] Unexpected trienode heal packet          peer=9f0aaa30         reqid=4,072,861,921,597,398,313
WARN [07-18|16:45:44.879] Unexpected trienode heal packet          peer=a0378d3b         reqid=8,386,183,962,563,361,933
WARN [07-18|16:45:44.879] Unexpected trienode heal packet          peer=d9d3b409         reqid=8,295,232,788,244,893,801
WARN [07-18|16:45:44.882] Unexpected trienode heal packet          peer=5532d114         reqid=3,623,043,430,561,675,816
WARN [07-18|16:45:44.911] Unexpected trienode heal packet          peer=765491aa         reqid=7,752,667,725,631,141,893
WARN [07-18|16:45:44.958] Unexpected trienode heal packet          peer=17935a2e         reqid=4,857,961,192,605,514,856
WARN [07-18|16:45:45.009] Unexpected trienode heal packet          peer=fb395765         reqid=2,054,719,662,305,818,389
WARN [07-18|16:45:45.058] Unexpected trienode heal packet          peer=6d63f63c         reqid=6,657,913,199,915,039,014
WARN [07-18|16:45:45.081] Unexpected trienode heal packet          peer=a8523f94         reqid=7,638,890,637,230,011,266
WARN [07-18|16:45:45.114] Unexpected trienode heal packet          peer=889cc785         reqid=2,929,668,062,071,838,105
WARN [07-18|16:45:45.169] Unexpected trienode heal packet          peer=a6d04daf         reqid=8,942,089,650,430,097,371
WARN [07-18|16:45:45.171] Unexpected trienode heal packet          peer=9e9cac22         reqid=4,503,892,313,274,969,736
WARN [07-18|16:45:45.229] Unexpected trienode heal packet          peer=5d8589a4         reqid=3,594,601,607,111,485,883
WARN [07-18|16:45:45.231] Unexpected trienode heal packet          peer=91bbf60e         reqid=5,875,472,871,662,902,193
WARN [07-18|16:45:45.259] Unexpected trienode heal packet          peer=d775bde9         reqid=1,079,832,331,273,526,425
INFO [07-18|16:45:45.378] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=243.11KiB throttle=1079

mohamedmansour (Member) commented

I checked again: 93% of my log lines are "Unexpected trienode heal packet".

If there is anything I can do to help gather more logs for understanding how to fix it, let me know.

begetan commented Jul 21, 2021

@mohamedmansour

The "State heal in progress" stage takes several hours, depending on your hardware specs. How many CPU cores do you have? What is your network bandwidth?

mohamedmansour (Member) commented Jul 21, 2021

> @mohamedmansour
>
> The stage State heal in progress takes several hours depending on your hardware specs. How much CPU core have you? What is the network bandwidth?

I have 4 CPUs, 16GB RAM, 2Gbit network on Azure VM https://azure.microsoft.com/en-us/blog/introducing-b-series-our-new-burstable-vm-size/ (Standard_B4ms) = $140/month

alextes commented Jul 23, 2021

AWS c6g.xlarge. That's 4 high-performance cores, 16GiB, and I added a 6000 IOPS, 1000MiB/s throughput EBS volume, up to 10 Gb network. One night of snapsync.

WARN [07-23|07:18:10.749] Unexpected trienode heal packet          peer=a6d04daf         reqid=5,628,752,454,634,351,532
WARN [07-23|07:18:10.749] Unexpected trienode heal packet          peer=bfdaf535         reqid=2,129,613,245,800,917,217
WARN [07-23|07:18:10.759] Unexpected trienode heal packet          peer=30ac3a13         reqid=2,746,357,763,380,049,576
WARN [07-23|07:18:12.273] Unexpected trienode heal packet          peer=e28df4e4         reqid=1,660,082,010,578,520,634
WARN [07-23|07:18:13.454] Unexpected trienode heal packet          peer=580f3239         reqid=38,957,340,314,670,213
INFO [07-23|07:18:18.379] State heal in progress                   accounts=228,655@11.25MiB     slots=474,354@35.58MiB     codes=104@867.88KiB      nodes=2,520,229@667.74MiB pending=22518
INFO [07-23|07:18:18.821] Imported new block headers               count=1    elapsed=5.535ms   number=12,881,249 hash=1a723b..7d6a91
WARN [07-23|07:18:21.471] Checkpoint challenge timed out, dropping id=4537608e56285b37 conn=inbound addr=65.21.95.47:34618     type=erigon/v2021.07.4/li...
INFO [07-23|07:18:26.424] State heal in progress                   accounts=228,751@11.26MiB     slots=474,672@35.60MiB     codes=104@867.88KiB      nodes=2,521,908@668.17MiB pending=22194
WARN [07-23|07:18:26.453] Unexpected trienode heal packet          peer=e28df4e4         reqid=7,019,785,347,053,342,247
WARN [07-23|07:18:31.475] Unexpected trienode heal packet          peer=30ac3a13         reqid=8,104,986,034,277,065,248
WARN [07-23|07:18:31.492] Unexpected trienode heal packet          peer=805b1ebf         reqid=847,938,211,426,238,144
WARN [07-23|07:18:32.924] Unexpected trienode heal packet          peer=436cfb8d         reqid=5,826,258,251,133,243,579
INFO [07-23|07:18:35.204] State heal in progress                   accounts=228,838@11.26MiB     slots=474,941@35.62MiB     codes=104@867.88KiB      nodes=2,523,516@668.63MiB pending=22234
WARN [07-23|07:18:35.205] Unexpected trienode heal packet          peer=580f3239         reqid=2,897,966,948,454,458,237
WARN [07-23|07:18:35.231] Unexpected trienode heal packet          peer=e28df4e4         reqid=6,338,490,710,152,459,505
WARN [07-23|07:18:35.240] Unexpected trienode heal packet          peer=0f8d93b7         reqid=6,558,336,474,454,724,048
WARN [07-23|07:18:36.250] Unexpected trienode heal packet          peer=c34194de         reqid=4,258,489,952,416,859,887
WARN [07-23|07:18:36.253] Unexpected trienode heal packet          peer=61551118         reqid=2,074,155,154,236,670,882
WARN [07-23|07:18:36.255] Unexpected trienode heal packet          peer=54080ee5         reqid=6,077,024,708,527,433,002
WARN [07-23|07:18:37.657] Unexpected trienode heal packet          peer=01f352da         reqid=9,041,030,120,532,686,969
WARN [07-23|07:18:38.767] Unexpected trienode heal packet          peer=a6d04daf         reqid=9,015,027,624,447,844,791
WARN [07-23|07:18:39.993] Unexpected trienode heal packet          peer=30ac3a13         reqid=708,368,138,623,231,300
INFO [07-23|07:18:40.155] Imported new block headers               count=1    elapsed=5.390ms   number=12,881,250 hash=75e84f..8e55ae
INFO [07-23|07:18:40.196] Imported new block headers               count=1    elapsed=5.099ms   number=12,881,251 hash=203c55..a9c8ba
INFO [07-23|07:18:40.299] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=243.22KiB throttle=1078
INFO [07-23|07:18:43.248] State heal in progress                   accounts=229,070@11.27MiB     slots=475,165@35.64MiB     codes=104@867.88KiB      nodes=2,525,270@669.09MiB pending=22579
INFO [07-23|07:18:51.645] State heal in progress                   accounts=229,577@11.30MiB     slots=475,451@35.66MiB     codes=104@867.88KiB      nodes=2,527,573@669.63MiB pending=22001
INFO [07-23|07:18:59.679] State heal in progress                   accounts=229,769@11.31MiB     slots=475,578@35.67MiB     codes=105@871.31KiB      nodes=2,529,013@670.06MiB pending=22249
INFO [07-23|07:19:04.571] Imported new block headers               count=1    elapsed=5.083ms   number=12,881,252 hash=ade0fb..eb0a79
INFO [07-23|07:19:07.654] Imported new block headers               count=1    elapsed=5.050ms   number=12,881,253 hash=69ff20..5b4202
INFO [07-23|07:19:08.080] State heal in progress                   accounts=230,031@11.32MiB     slots=475,688@35.67MiB     codes=105@871.31KiB      nodes=2,530,515@670.51MiB pending=22411
INFO [07-23|07:19:10.778] Imported new block headers               count=1    elapsed=5.106ms   number=12,881,254 hash=ce4e76..b337cd
WARN [07-23|07:19:15.211] Unexpected bytecode heal packet          peer=a6d04daf         reqid=5,663,218,223,262,397,112
INFO [07-23|07:19:16.520] State heal in progress                   accounts=230,257@11.33MiB     slots=475,836@35.69MiB     codes=105@871.31KiB      nodes=2,532,116@670.95MiB pending=22693
WARN [07-23|07:19:21.465] Unexpected bytecode heal packet          peer=54080ee5         reqid=3,077,866,949,983,522,985
INFO [07-23|07:19:24.589] State heal in progress                   accounts=230,503@11.35MiB     slots=476,007@35.70MiB     codes=105@871.31KiB      nodes=2,533,755@671.39MiB pending=22916
INFO [07-23|07:19:29.071] Imported new block headers               count=1    elapsed=5.203ms   number=12,881,255 hash=260e4c..1afdb0
INFO [07-23|07:19:32.599] State heal in progress                   accounts=230,782@11.36MiB     slots=476,225@35.71MiB     codes=106@876.23KiB      nodes=2,535,524@671.84MiB pending=23256
INFO [07-23|07:19:40.710] State heal in progress                   accounts=230,994@11.37MiB     slots=476,453@35.73MiB     codes=106@876.23KiB      nodes=2,537,311@672.30MiB pending=23830
INFO [07-23|07:19:41.280] Imported new block headers               count=1    elapsed=5.200ms   number=12,881,256 hash=9d6a30..18ed89
INFO [07-23|07:19:41.330] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=236.02KiB throttle=1111
INFO [07-23|07:19:48.871] State heal in progress                   accounts=231,125@11.37MiB     slots=476,910@35.77MiB     codes=106@876.23KiB      nodes=2,539,278@672.76MiB pending=24121
INFO [07-23|07:19:56.999] State heal in progress                   accounts=231,256@11.38MiB     slots=477,398@35.80MiB     codes=106@876.23KiB      nodes=2,541,302@673.21MiB pending=24316
INFO [07-23|07:20:05.161] State heal in progress                   accounts=231,423@11.39MiB     slots=478,108@35.86MiB     codes=106@876.23KiB      nodes=2,543,607@673.68MiB pending=23817
INFO [07-23|07:20:13.392] State heal in progress                   accounts=231,622@11.40MiB     slots=478,790@35.91MiB     codes=106@876.23KiB      nodes=2,545,936@674.16MiB pending=23275
INFO [07-23|07:20:21.912] State heal in progress                   accounts=231,762@11.41MiB     slots=479,440@35.96MiB     codes=106@876.23KiB      nodes=2,548,160@674.63MiB pending=22951
INFO [07-23|07:20:29.985] State heal in progress                   accounts=231,913@11.41MiB     slots=479,988@36.00MiB     codes=106@876.23KiB      nodes=2,550,053@675.07MiB pending=22448
INFO [07-23|07:20:29.992] Imported new block headers               count=1    elapsed=5.098ms   number=12,881,257 hash=830c05..7bd316 age=1m21s
WARN [07-23|07:20:34.268] Unexpected trienode heal packet          peer=0f8d93b7         reqid=5,110,344,849,506,036,345
WARN [07-23|07:20:34.624] Unexpected trienode heal packet          peer=c34194de         reqid=176,407,516,446,904,836
WARN [07-23|07:20:34.875] Unexpected trienode heal packet          peer=61551118         reqid=7,745,011,524,738,817,427
WARN [07-23|07:20:36.752] Unexpected trienode heal packet          peer=01f352da         reqid=8,576,927,767,064,613,316
WARN [07-23|07:20:37.340] Unexpected trienode heal packet          peer=a6d04daf         reqid=7,790,792,082,625,331,242
INFO [07-23|07:20:38.767] State heal in progress                   accounts=232,034@11.42MiB     slots=480,415@36.03MiB     codes=106@876.23KiB      nodes=2,551,824@675.51MiB pending=22355
WARN [07-23|07:20:40.802] Unexpected trienode heal packet          peer=580f3239         reqid=2,584,556,318,708,171,106
WARN [07-23|07:20:41.044] Unexpected trienode heal packet          peer=e28df4e4         reqid=4,683,100,952,475,139,577
WARN [07-23|07:20:41.063] Unexpected trienode heal packet          peer=0f8d93b7         reqid=7,292,729,526,860,797,397
INFO [07-23|07:20:46.982] State heal in progress                   accounts=232,141@11.42MiB     slots=480,997@36.08MiB     codes=107@881.79KiB      nodes=2,553,866@675.96MiB pending=22692
INFO [07-23|07:20:55.447] State heal in progress                   accounts=232,233@11.43MiB     slots=481,804@36.14MiB     codes=107@881.79KiB      nodes=2,556,200@676.44MiB pending=22977
INFO [07-23|07:21:00.452] Imported new block headers               count=1    elapsed=6.045ms   number=12,881,258 hash=793478..ba029d age=1m33s
INFO [07-23|07:21:01.368] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=237.05KiB throttle=1106
INFO [07-23|07:21:03.583] State heal in progress                   accounts=232,331@11.43MiB     slots=482,485@36.19MiB     codes=107@881.79KiB      nodes=2,558,193@676.88MiB pending=22942
INFO [07-23|07:21:12.387] State heal in progress                   accounts=232,439@11.44MiB     slots=483,346@36.25MiB     codes=107@881.79KiB      nodes=2,560,507@677.37MiB pending=22521
INFO [07-23|07:21:20.939] State heal in progress                   accounts=232,555@11.44MiB     slots=484,069@36.31MiB     codes=107@881.79KiB      nodes=2,562,649@677.83MiB pending=22119
INFO [07-23|07:21:29.133] State heal in progress                   accounts=232,663@11.45MiB     slots=484,669@36.35MiB     codes=107@881.79KiB      nodes=2,564,592@678.28MiB pending=21908

Same result as @mohamedmansour (also, hey, love watchtheburn ❤️! I happen to be building something very similar).

If I'm reading the resource use correctly (everything is way under capacity, and the chain data is only 376GiB), while the node count keeps going up, then this node is now completely constrained by how quickly it can get the data it needs from other nodes? From the looks of it, I could go down to a machine half this size.

Sounds like what @mohamedmansour and I are seeing is perhaps not entirely expected, but unrelated to the bug flagged in this issue. Correct?

Unrelated to the issue, if it helps anyone: the machine is mostly CPU-constrained during the sync (fascinating!), sits at 50% RAM use, and the IO I reserved was way overkill; it never crossed 3,000 IOPS / 180MiB/s.
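
For anyone who wants to eyeball progress from lines like the ones above, here is a small illustrative sketch (not geth code; the timestamp format is copied from the logs above, which omit the year, so the year is passed in) that estimates the node-healing rate between the first and last progress lines:

```python
import re
from datetime import datetime

# Matches the timestamp and the cumulative node count in a heal-progress line,
# e.g.  INFO [07-23|07:18:18.379] State heal in progress ... nodes=2,520,229@667.74MiB ...
HEAL = re.compile(
    r"\[(\d{2})-(\d{2})\|(\d{2}):(\d{2}):(\d{2})\.(\d{3})\].*"
    r"State heal in progress.*nodes=([\d,]+)@"
)

def heal_rate(lines, year=2021):
    """Estimate healed trie nodes per second between the first and last
    'State heal in progress' lines; returns None if fewer than two match."""
    samples = []
    for ln in lines:
        m = HEAL.search(ln)
        if m:
            mo, day, h, mi, s, ms, nodes = m.groups()
            t = datetime(year, int(mo), int(day), int(h), int(mi), int(s), int(ms) * 1000)
            samples.append((t, int(nodes.replace(",", ""))))
    if len(samples) < 2:
        return None
    (t0, n0), (t1, n1) = samples[0], samples[-1]
    dt = (t1 - t0).total_seconds()
    return (n1 - n0) / dt if dt > 0 else None

lines = [
    "INFO [07-23|07:18:00.000] State heal in progress accounts=1@1B nodes=1,000@1MiB pending=5",
    "INFO [07-23|07:18:10.000] State heal in progress accounts=2@1B nodes=2,000@1MiB pending=4",
]
print(heal_rate(lines))  # 100.0
```

Note that a nodes/second rate cannot give a real ETA, because (as discussed further down the thread) there is no fixed target node count; it only shows whether healing is moving at all.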

alextes commented Jul 23, 2021

Heh, doing a fast sync now, the machine above is averaging 80% CPU, 80% memory, 100-250MB/s IO, and 200Mb/s network; it's basically on fire 🔥. Hats off to the Geth team for the efficiency of snap sync 👌.

Let me know if I should open a new issue if fast sync ends up beating out snap sync 😬.

mohamedmansour (Member) commented

I am doubling my cloud instance now to 20K IOPS and will see how state healing does when it reaches that stage. Bare-metal VMs are perhaps going to be the solution. I wish Geth would honestly log that the hardware is not good enough when it can detect that.

alextes commented Jul 25, 2021

Snap sync done 👌. It looks like it took 3 days. Surprisingly, only 384G of disk use.
Fast sync is done too, at 355G used. I'm not sure it was faster; in any case, the difference seems quite small for these 'medium power' machines.

Snap sync seems to work fine for me with the above machine 🙌.

begetan commented Jul 26, 2021

Snap sync took 1 day on an AWS c6g.xlarge machine, while it may run 3 days on an AWS a1.xlarge, which is only 20-30% lower in CPU performance.

4 high-speed (not low-end) CPU cores are the bare minimum for the latest Geth version.

f5b7 commented Aug 18, 2021

What is the "target" number of accounts/slots/codes/nodes, currently? Interested to try to estimate my ETA ;-)

f5b7 commented Aug 19, 2021

FWIW: My last State heal in progress log line was accounts=803,857@40.54MiB slots=1,760,361@129.99MiB codes=436@3.18MiB nodes=8,983,522@2.28GiB pending=0. After that it was Rebuilding state snapshot (which seems to need another ~ 24 hours).

xanoni commented Aug 20, 2021

So is there some metric we can look at to understand (absolute) sync progress? Some target number it needs to hit? Which? Thank you!
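
There is no universal target for the accounts/slots/nodes counters (each sync ends at different totals, as later comments confirm), but f5b7's final heal line above shows pending=0 right before geth moved on to rebuilding the state snapshot, so the pending work queue draining to zero is the closest thing to a completion signal. A small illustrative sketch (not part of geth) for pulling that counter out of the logs:

```python
import re

# The heal lines end with a 'pending' work-queue counter; the thread's logs
# show it written either as 'pending=22518' or padded as 'pending=      1'.
PENDING = re.compile(r"State heal in progress.*pending=\s*([\d,]+)")

def pending_series(lines):
    """Extract the pending counter from heal-progress lines. It rises and
    falls while new state keeps arriving, so only the trend toward zero,
    not the absolute value, indicates progress."""
    out = []
    for ln in lines:
        m = PENDING.search(ln)
        if m:
            out.append(int(m.group(1).replace(",", "")))
    return out

lines = [
    "INFO [08-31|08:40:43.441] State heal in progress accounts=0@0.00B nodes=12,480,934@6.12GiB pending=18993",
    "INFO [08-31|08:40:51.000] State heal in progress accounts=0@0.00B nodes=12,490,000@6.12GiB pending=      1",
]
print(pending_series(lines))  # [18993, 1]
```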

mohamedmansour (Member) commented

I believe snap sync requires at least 6 vCPUs and an SSD with high IOPS to work.

xanoni commented Aug 24, 2021

This might be related:

#23191 (comment)

Similar problem here ...

I'm ~8 days in now ... the state sync took ~2 days (including some downtimes while I was setting up the rest of the server). The other 6 days it's been state healing.

RaspberryPi ARM-64 (8 GiB RAM, overclocked @ 4 x 2,100 Mhz)
Disk: SSD (200+ MiB / sec)
Net: 2 Gbit/s

Not sure if relevant, but I had >40,000 of the below messages in the 6 days it's been state healing:
WARN [08-22|00:03:07.286] Unexpected trienode heal packet peer=XYZ reqid=XYZ

The Geth folder currently utilizes 400 GiB (no archive).

xanoni commented Aug 24, 2021

Is the "accounts" number supposed to just keep increasing? I noticed in the logs that it reset itself to 0 several times in the last 6 days, I think whenenver the node was restarted.

So-what so far:

  • 40 thousand "Unexpected trienode heal packets" in 6 days
  • Accounts go back to 0 whenever the node is restarted

This is what it shows during shutdown ... I'm not sure why it's not considered a clean shutdown; I send it gentle signals, just the default stuff that systemd sends.

Failed to journal state snapshot         err="snapshot [0xYYYYYYYYYYYYY....YYYYYYYYYYYYYYYYY] missing"
Writing clean trie cache to disk         path=/mnt/ssd/geth-data/geth/triecache threads=4
Persisted the clean trie cache           path=/mnt/ssd/geth-data/geth/triecache elapsed=42.197ms
Blockchain stopped

Once it resumes, I see this:

[...]
Loaded last fast-sync pivot marker       number=13,080,857
Snapshot maintenance disabled (syncing)
Loaded local transaction journal         transactions=0 dropped=0
Upgrading chain index                    type=bloombits percentage=99
Regenerated local transaction journal    transactions=0 accounts=0
[...]
Unclean shutdown detected                booted=2021-08-xxTxx:xx:xx-xxxx age=6d12h9m

and then it restarts here:

State heal in progress                   accounts=0@0.00B slots=0@0.00B codes=1695@12.83MiB nodes=37,526,187@10.08GiB pending=      1

ttibbetts commented

I'm experiencing this same issue. Same setup as #23191 (comment) (Raspberry Pi 8GB system with SSD), running the latest Ethereum on ARM build. Geth version info:

Geth
Version: 1.10.8-stable
Git Commit: 26675454bf93bf904be7a43cce6b3f550115ff90
Git Commit Date: 20210824
Architecture: arm64
Go Version: go1.16.4
Operating System: linux
GOPATH=
GOROOT=go

Using the default snap sync and the log is full of "Unexpected trienode heal packet" entries.

> #23191 (comment)
>
> Similar problem here ...
> I'm ~8 days in now ... the state sync took ~2 days (including some downtimes while I was setting up the rest of the server). The other 6 days it's been state healing.
> RaspberryPi ARM-64 (8 GiB RAM, overclocked @ 4 x 2,100 Mhz)
> Disk: SSD (200+ MiB / sec)
> Net: 2 Gbit/s
> Not sure if relevant, but I had >40,000 of the below messages in the 6 days it's been state healing:
> WARN [08-22|00:03:07.286] Unexpected trienode heal packet peer=XYZ reqid=XYZ
> The Geth folder currently utilizes 400 GiB (no archive).

xanoni commented Aug 28, 2021

I switched to fast sync (maintaining the already downloaded data), but it was still not done after 2-3 days. Thus, I rsync'd the whole geth folder to an aarch64 machine and completed the snap sync there, which took 1-2 days. Then I rsync'd everything back and it's running smoothly on the Pi now.

Not ideal, but worked for me.

lrc123456 commented

> I run several geth nodes, but I've always used Sync mode fast. I am building a new machine, so I figured I would try snap (the new default). This is running on a HP DL 380 G5 with 24 GB of RAM and a 1TB SSD.
>
> Using SNAP sync, geth will go into these modes of compacting the database and it completely halts the sync process. It does this for over an hour sometimes. But, I let it continue. Now, this morning my sync appears to be stuck in "state heal in progress". It looks like it's been at this for hours. I give up. I am going back to Fast.

How did you go back to fast without abandoning the data you had already downloaded? What's the command? Thanks!

anhbaysgalan1 commented

If my state healing looks like t=2021-10-05T13:53:30+0000 lvl=info msg="State heal in progress" accounts=74838@3.60MiB slots=14839@1.14MiB codes=96@830.12KiB nodes=408,723@122.32MiB pending=55412, should this nodes count reach 8 million+?

begetan commented Oct 6, 2021

@ttibbetts

A Raspberry Pi with 4 cores is probably not capable of doing snap sync. Try disabling this feature and using a regular sync.

Fillicia commented Oct 6, 2021

If this can help someone: I had this issue, and in my case it was because I started geth from the config.toml file alone and the node was resetting the cache attribute to its default. Starting it with --cache xxx --config /path/to/config solved it for me.

delbonis commented Oct 8, 2021

Trying to do a snap sync on a recent Ryzen machine (not just a Pi; it should be performant enough), and it seems to have been sitting at this for over a day now.

INFO [10-08|17:56:51.721] State heal in progress                   accounts=0@0.00B slots=0@0.00B codes=107@910.49KiB nodes=1,677,182@354.95MiB pending=1

It seems to have imported receipts and headers up to the tip (or near it, at least):

INFO [10-08|17:52:15.561] Imported new block receipts              count=27  elapsed=51.550ms   number=13,379,325 hash=744aa1..d23c2c age=1h5m56s  size=2.88MiB
INFO [10-08|17:52:15.730] Imported new block receipts              count=25  elapsed=56.574ms   number=13,379,350 hash=56a47b..4a45e8 age=1h2m33s  size=2.77MiB
INFO [10-08|17:52:15.887] Imported new block receipts              count=19  elapsed=63.894ms   number=13,379,369 hash=e15d90..6a1567 age=58m20s   size=2.88MiB
INFO [10-08|17:52:16.025] Imported new block receipts              count=23  elapsed=52.420ms   number=13,379,392 hash=2b93bf..9dc915 age=54m      size=2.66MiB
INFO [10-08|17:52:16.302] Imported new block receipts              count=79  elapsed=213.982ms  number=13,379,471 hash=d8a9e1..1b1d62 age=35m11s   size=11.19MiB

crites commented Oct 8, 2021

The nodes-count is increasing -- it's progressing.

@holiman How high does the nodes-count need to go?

I'm at ".. nodes=86,597,863"

pryce-turner commented

I'd also like to know this. I'm (sadly) only at nodes=23,591,000. Wondering whether I should let the state heal complete or switch to fast sync, as it seems others have had success with that. Thank you for all the hard work you folks put in!!

derekmahar commented Mar 1, 2022

My Geth 1.10.16 node running snap sync mode also reports regular "State heal in progress" and a frequent stream of repeated "Unexpected trienode heal packet" log messages. It also reports regular "Snapshot extension registration failed" messages. I don't recall seeing these messages when I was running Geth 1.10.15. I switched Geth's sync mode from "fast" to "snap" starting with Geth 1.10.15 because "fast" sync mode was no longer available.

derekmahar commented

After downgrading Geth from 1.10.16 to 1.10.15, Geth still reports frequent "Snapshot extension registration failed" and "State heal in progress" log messages, but it reported no "Unexpected trienode heal packet" log messages until after I pressed Ctrl-C to shut down the node.

anonqoder commented

I have been waiting for more than a day after a snap sync, on a beefy machine with a beefy internet connection.
It is stuck in "State heal in progress" with no indication of an ETA whatsoever.
What's the point of downloading the state from peers to speed up the sync, just to waste time doing a "state heal" later?
👎

anonqoder commented

Mar 14 15:47:21 host1 geth[75528]: INFO [03-14|15:47:21.853] State heal in progress                   accounts=103,774@4.94MiB      slots=180,664@13.09MiB      codes=120@1010.50KiB     nodes=901,246@216.60MiB pending=104,371

If this "heal" is gonna "fix" 175M Ethereum accounts (it is now on 103774), this will take more than a month to complete, what exactly is it going to go?

nuliknol commented Jul 22, 2022

> I wish Geth honesty places logs saying hardware is not good enough if it can detect that.

Totally concur. If you are providing an app bound to hardware, you must benchmark the hardware before you start the app; otherwise the app won't work, and since you are the one responsible for providing the app, you are also implicitly responsible for letting the user run it on slow hardware. The Ethereum team has been giving us low-quality engineering since 2015.

@leontodd

leontodd commented Aug 24, 2022

Mar 14 15:47:21 host1 geth[75528]: INFO [03-14|15:47:21.853] State heal in progress                   accounts=103,774@4.94MiB      slots=180,664@13.09MiB      codes=120@1010.50KiB     nodes=901,246@216.60MiB pending=104,371

If this "heal" is gonna "fix" 175M Ethereum accounts (it is now at 103,774), this will take more than a month to complete. What exactly is it going to do?

The state heal process is a little deceptive. I was at first worried that my Raspberry Pi 4B + 2TB SSD was having trouble syncing, as during state heal the accounts, slots, and nodes were going up slowly relative to how many accounts/slots there are in total.

My state heal finished in about 21 hours at 1,182,357 accounts, 2,327,247 slots, and 12,186,486 nodes. I also had many instances of Unexpected trienode heal packet, which didn't seem to affect anything. I have seen other people finish at numbers larger than those, so I think it's very dependent on your own sync, but you definitely don't need to reach 175M accounts!

@kazazor

kazazor commented Aug 25, 2022

That's good to know @leontodd .

But the question remains: how do I get any indication of the progress/ETA of the state heal process?
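One rough workaround I can think of (not a built-in geth feature, just a sketch using the standard eth_syncing JSON-RPC call, assuming the node was started with --http) is to poll the healedTrienodes counter that geth exposes during snap sync and compute a rate from two samples:

```python
# Sketch, not a geth feature: poll eth_syncing over JSON-RPC and estimate
# the trienode heal rate. Field names match geth's eth.syncing output.
import json
import time
import urllib.request

def healed_trienodes(url="http://127.0.0.1:8545"):
    """Return the healedTrienodes counter, or None if the node is not syncing."""
    payload = {"jsonrpc": "2.0", "id": 1, "method": "eth_syncing", "params": []}
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)["result"]
    if result is False:  # fully synced: eth_syncing returns false
        return None
    return int(result["healedTrienodes"], 16)  # raw RPC encodes quantities as hex

def heal_rate(samples):
    """Trienodes healed per second between the first and last (time, count) sample."""
    (t0, n0), (t1, n1) = samples[0], samples[-1]
    return (n1 - n0) / (t1 - t0)

# Usage against a live node:
#   a = (time.time(), healed_trienodes())
#   time.sleep(60)
#   b = (time.time(), healed_trienodes())
#   print(f"~{heal_rate([a, b]):.1f} trienodes/s")
```

It still won't give a real ETA, because the heal target keeps moving with the chain head, but a rate that stays near zero over an hour is at least a strong hint that the sync is stuck.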

@Yamakuzure

I would like to know this too.
On my first attempt, geth suddenly reported that something was not in order and that the blockchain had to be rewound with target=0 (aka "start from scratch"), which I then did; I am now in state heal.
Sync started (again) on August 4th, the sync was done in three days, and state heal has been running since August 7th (24 days now).

Output looks like this:

INFO [08-31|08:40:43.441] State heal in progress                   accounts=0@0.00B slots=0@0.00B codes=42@208.94KiB nodes=12,480,934@6.12GiB pending=18993
INFO [08-31|08:40:43.451] State heal in progress                   accounts=0@0.00B slots=0@0.00B codes=42@208.94KiB nodes=12,480,934@6.12GiB pending=18993
INFO [08-31|08:40:43.542] Imported new block headers               count=1   elapsed=3.267ms  number=15,445,101 hash=ee34c1..82f2c5
WARN [08-31|08:40:43.550] Unexpected trienode heal packet          peer=4efd3711 reqid=7,328,963,860,793,291,079
WARN [08-31|08:40:43.550] Unexpected trienode heal packet          peer=9bf27c44 reqid=934,092,868,083,411,094
WARN [08-31|08:40:43.550] Unexpected trienode heal packet          peer=5a070bad reqid=7,559,473,032,996,682,308
WARN [08-31|08:40:43.550] Unexpected trienode heal packet          peer=1290f9f1 reqid=7,402,543,419,581,238,141
WARN [08-31|08:40:43.551] Unexpected trienode heal packet          peer=a85cfbac reqid=5,676,764,028,462,137,471
WARN [08-31|08:40:43.551] Unexpected trienode heal packet          peer=de7b486d reqid=3,356,032,926,609,925,242
WARN [08-31|08:40:43.551] Unexpected trienode heal packet          peer=9bd73479 reqid=3,878,834,481,290,902,674
WARN [08-31|08:40:43.551] Unexpected trienode heal packet          peer=468924e2 reqid=6,636,285,917,303,600,443
WARN [08-31|08:40:43.553] Unexpected trienode heal packet          peer=92adf5a4 reqid=3,533,469,970,467,382,671
WARN [08-31|08:40:43.554] Unexpected trienode heal packet          peer=703154b2 reqid=4,629,776,562,703,112,673
WARN [08-31|08:40:43.698] Unexpected trienode heal packet          peer=b70c1667 reqid=367,965,809,434,615,351
INFO [08-31|08:40:44.452] Downloader queue stats                   receiptTasks=0   blockTasks=0   itemSize=173.41KiB throttle=1512
WARN [08-31|08:40:44.452] Pivot became stale, moving               old=15,444,960 new=15,445,039
INFO [08-31|08:40:44.536] Imported new block receipts              count=79  elapsed=83.833ms number=15,445,038 hash=d3de2b..2731b3 age=12m51s   size=11.35MiB
INFO [08-31|08:40:44.536] State heal in progress                   accounts=0@0.00B slots=0@0.00B codes=42@208.94KiB nodes=12,480,934@6.12GiB pending=1
WARN [08-31|08:40:44.812] Unexpected trienode heal packet          peer=4883dc3a reqid=3,218,495,347,431,903,237
INFO [08-31|08:40:47.007] Imported new block headers               count=1   elapsed=3.586ms  number=15,445,102 hash=d8caad..e1f5ef
INFO [08-31|08:40:53.705] Imported new block headers               count=1   elapsed=3.250ms  number=15,445,103 hash=4eae4e..75c64a
INFO [08-31|08:40:58.106] State heal in progress                   accounts=0@0.00B slots=0@0.00B codes=42@208.94KiB nodes=12,480,951@6.12GiB pending=273

"pending" normally rises until it is somewhere between 110,000 and 130,000. Then a bunch of "Unexpected trienode heal packet" lines appear, and "pending" restarts at 1. It is rare for this to happen with "pending" below 20k, as above.

I also get lines telling me that geth doesn't see nimbus (it is running and fully synced; the connection to geth works, according to nimbus's output):
WARN [08-31|08:41:47.076] Merge is configured, but no beacon client seen. Please ensure you have one available before the transition arrives!

What really puzzles me is the output accounts=0@0.00B slots=0@0.00B codes=42@208.94KiB in the state heal progress. I remember that these had much higher numbers on my first try.

And I am really anxious about the disk space used. This goes beyond anything I have read online that an eth1 node would need:

 $ LC_ALL=C du -hsc * 2>/dev/null
628G    chaindata
5.0K    jwtsecret
512     lightchaindata
512     LOCK
4.5K    nodekey
5.7M    nodes
512     transactions.rlp
2.8M    triecache
628G    total

(But it has not been growing much in the last 2 weeks.)

This is my current sync state, and while there is some progress, it has been minimal over the past 3 weeks.

Welcome to the Geth JavaScript console!

instance: Geth/v1.10.23-stable/linux-amd64/go1.19
coinbase: <removed>
at block: 0 (Thu Jan 01 1970 01:00:00 GMT+0100 (CET))
 datadir: <removed>
 modules: admin:1.0 debug:1.0 engine:1.0 eth:1.0 ethash:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0

To exit, press ctrl-d or type exit
> eth.syncing
{
  currentBlock: 15445102,
  healedBytecodeBytes: 213959,
  healedBytecodes: 42,
  healedTrienodeBytes: 6573144668,
  healedTrienodes: 12481207,
  healingBytecode: 0,
  healingTrienodes: 0,
  highestBlock: 15445199,
  startingBlock: 15442752,
  syncedAccountBytes: 44351333509,
  syncedAccounts: 229300287,
  syncedBytecodeBytes: 4036242554,
  syncedBytecodes: 622973,
  syncedStorage: 788616132,
  syncedStorageBytes: 168788862624
}

I am really thinking about starting from scratch...

@holiman
Contributor

holiman commented Aug 31, 2022

Sync started (again) on August 4th, sync was done in three days, state heal runs since August 7th (24 days now).

If it's been trying to heal for 24 days, then it's not going to succeed. The thing is, in this phase it has a nearly accurate state trie, with some errors in it, and it starts trying to patch it up. The errors are due to the block progression during the earlier phases, so if the first part was done "in three days", then basically there are three days of state changes that need to be applied to the state trie.

Now, while the state is healing, blocks are also progressed. So the state keeps moving. Therefore, every ~192 blocks or so, we need to move the 'pivot point' along, to target a more recent state.

WARN [08-31|08:40:44.452] Pivot became stale, moving               old=15,444,960 new=15,445,039

Whenever we do so, we need to start from the root again, and go downwards while healing the trie. That is why pending becomes zero, so that's fine and as it should be.

But, if the blockchain state changes more rapidly than you are able to heal, then you will never catch up. These are the things that factor in:

  • The amount of gas spent in each block (more gas == more changes)
  • The speed of your disk (healing the trie is heavy on disk)
  • The speed/latency of your internet connection
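As a back-of-the-envelope sketch (the numbers here are purely illustrative, not measured from geth), the condition is simply that heal throughput must exceed the rate at which new blocks dirty the trie:

```python
# Illustrative model of the three factors above: healing only converges if
# the disk can heal trie nodes faster than new blocks dirty them.
def heal_converges(dirty_nodes_per_block, heal_nodes_per_sec, block_time_sec=13.0):
    """True if heal throughput exceeds the rate at which blocks dirty the trie."""
    dirty_nodes_per_sec = dirty_nodes_per_block / block_time_sec
    return heal_nodes_per_sec > dirty_nodes_per_sec

# A disk healing ~2000 nodes/s keeps up with ~5000 dirtied nodes per block...
print(heal_converges(5000, 2000))  # True
# ...but at ~200 nodes/s the pending set grows forever, pivot after pivot.
print(heal_converges(5000, 200))   # False
```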

@Yamakuzure

Now, while the state is healing, blocks are also progressed. So the state keeps moving. Therefore, every ~192 blocks or so, we need to move the 'pivot point' along, to target a more recent state.

WARN [08-31|08:40:44.452] Pivot became stale, moving               old=15,444,960 new=15,445,039

Whenever we do so, we need to start from the root again, and go downwards while healing the trie. That is why pending becomes zero, so that's fine and as it should be.

Thanks for the explanation!

But, if the blockchain state changes more rapidly than you are able to heal, then you will never catch up. These are the things that factor in:

  • The amount of gas spent in each block (more gas == more changes)
  • The speed of your disk (healing the trie is heavy on disk)
 $ LC_ALL=C dd if=/dev/zero of=.gethdata/test1.img bs=4096 count=262144 
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.84334 s, 1.3 GB/s
  • The speed/latency of your internet connection
    Using github.com/taganaka/SpeedTest, the measured download speed varies between 100 and 380 Mbit/s. Upstream is always about 52 Mbit/s.
    Ping varies between 8-12 ms and jitter is always <5ms (Usually 0).
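For what it's worth, that dd run only measures buffered sequential writes, which says little about the random 4 KiB accesses that dominate trie healing. A closer approximation (assuming fio is installed; the job parameters are just a common ballpark, not an official geth benchmark) would be a random-read job:

```ini
; fio job: random 4 KiB reads over a 1 GiB test file, bypassing the page cache
[global]
ioengine=libaio
direct=1
bs=4k
size=1G
runtime=60
time_based=1

[randread-test]
rw=randread
iodepth=64
```

Run it with `fio randread.fio` and look at the reported IOPS; numbers only in the low thousands would suggest the disk, not the network, is the bottleneck.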

This all makes me think that there must be something wrong within my database which causes the heal to stall and never catch up. Best to throw it all away, I guess... 😒

It would be nice if geth could detect such a state, though, and inform the user. Whether that's even possible is another question. 😉

@leontodd

I ran geth removedb and removed both the local database and ancient database. When resyncing, the state heal was done in under 24 hours.

I tried this again but this time only removed the local database, and kept the ancient database. This time when resyncing the state heal went on for days without completing.

I suspect that there are too many states that need healing when resyncing from the ancient database, especially on slower hardware. This fits with your endless state heal, @Yamakuzure, as you resynced by rewinding the chain. I don't know much about the technical aspects of the state heal process, but resyncing from scratch has definitely helped in my case.

@holiman
Contributor

holiman commented Aug 31, 2022

Possible fix: #25651 .
People who have reported problems -- did you run with any particular settings for the amount of peers?

@Yamakuzure

Yamakuzure commented Sep 4, 2022

I suspect that there are too many states that need healing when resyncing from the ancient database, especially on slower hardware. This fits with your endless state heal @Yamakuzure as you resynced by rewinding the chain. I don't know much about the technical aspects of the state heal process, but resyncing from scratch has definitely helped in my case.

Oh, sorry, I broke off the (super-slow) rewinding, as a target of 0 meant an rm -rf * would do the same anyway, just faster.
I started from scratch (i.e. an empty geth data directory) every time and never kept the nodes or ancient data directories.

Possible fix: #25651 . People who have reported problems -- did you run with any particular settings for the amount of peers?

No.

@mutazmq1

If your healing is taking forever, this is how I solved it after weeks of trying:

  • geth removedb y n as @MariusVanDerWijden recommended
  • Upgrade your geth to the latest unstable version
  • Get fast hardware, around 11k IOPS or above; the higher the better
  • Wait for 24-48 hours and you should be fine

@MariusVanDerWijden
Member

This issue seems to be quite stale now. Snap sync has improved significantly since this issue was created. Compaction during sync is still an issue, but not as bad as it was, and it will be even less of a problem once we move to Pebble. I am going to close this now. Thank you for reporting!
