
mainnet 0 storage growing 10x faster than 6GB/month norm #4106

Closed · dckc opened this issue Nov 22, 2021 · 35 comments
Labels: agd Agoric (Golang) Daemon

dckc (Member) commented Nov 22, 2021

edit: Diagnosis / Resolution

The 33GB in 15 days likely includes the fast sync period, when storage is consumed considerably faster than steady-state. ... Our experience does support the estimate of 6GB/month in the runbook. Discussion of pruning settings continues.
-- March 29

Describe the bug

I see reports of storage space growing by 33GB in 15 days:
https://discord.com/channels/585576150827532298/755164695849205942/911761990349905920

Why is it growing that fast?

Expected behavior

Our hardware baseline estimates ~6GB/month.
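
For a rough sense of the multiple (my arithmetic, not from the report): 33GB in 15 days extrapolates to about 66GB/month, i.e. roughly 11x the ~6GB/month baseline. A quick shell check:

    # Extrapolate the reported growth to a monthly rate and compare with the baseline.
    echo "scale=1; 33 / 15 * 30" | bc        # ~66 GB/month observed
    echo "scale=1; (33 / 15 * 30) / 6" | bc  # ~11x the ~6 GB/month baseline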

Platform Environment

Default pruning, transaction indexing, etc.

Additional context

mainnet 0

dckc added the bug (Something isn't working) and agd (Agoric (Golang) Daemon) labels Nov 22, 2021
shivlim commented Nov 22, 2021

1.3G /root/.agoric/data/state.db
16K /root/.agoric/data/snapshots/metadata.db
20K /root/.agoric/data/snapshots
39G /root/.agoric/data/application.db
6.4G /root/.agoric/data/blockstore.db
1014M /root/.agoric/data/cs.wal
878M /root/.agoric/data/tx_index.db
20K /root/.agoric/data/evidence.db

Most of the data is in application.db, and we don't even have agoric-sdk deployed?

dckc changed the title from "mainnet 0 storage gowing unexpectedly raplidly" to "mainnet 0 storage growing unexpectedly rapidly" Nov 22, 2021
michaelfig (Member) commented:

Most of the data is in application.db, and we don't even have agoric-sdk deployed?

Just to note, the "application" in this context means the "Tendermint Application", which is the Cosmos SDK; it has nothing to do with the yet-to-be-enabled agoric-sdk layer. So this should be something that folks familiar with Cosmos can help with.

dckc (Member, Author) commented Dec 8, 2021

It's less clear to me that this is an unusual amount of disk space.

On Nov 28 I asked in #validators: is this growth rate consistent with other Cosmos SDK chains? @shivlim replied:

Guess it depends. I have seen some taking less and some taking more.

Another data point, from Nov 22: figment.io reported:

Looks like we've got 32.2GB of growth over the past 2 weeks, ~2.4GB/day. Seems consistent with what's described in the issue.

I asked if they would try to diagnose the problem, but I don't recall seeing any reply.

p.s. this seems relevant to #2292

nathan-at-least commented Dec 13, 2021

Hi folks, glancing at disk usage in ~/.agoric/data I see:

~/.agoric/data$ du -sh *
61G     application.db
14G     blockstore.db
1022M   cs.wal
24K     evidence.db
4.0K    priv_validator_state.json
24K     snapshots
2.7G    state.db
1.8G    tx_index.db

I believe blockstore.db, evidence.db, state.db, and tx_index.db are all general to Tendermint Core, based on this article. Is application.db specific to Agoric? It's the lion's share of disk space at this point in time.

dckc (Member, Author) commented Dec 13, 2021

... Is application.db specific to Agoric?

It's common to Cosmos-SDK chains, as noted in #4106 (comment)

We're looking for an experienced Cosmos validator to look into this more closely and let us know whether 61G is normal and, if not, to suggest a diagnosis.

Tartuffo (Contributor) commented Feb 2, 2022

@dtribble We need to find a validator who can reliably give us statistics.

dtribble (Member) commented Feb 2, 2022

This seems not to actually be an issue in practice. @Tomas-Eminger can you provide concrete numbers from your validator? Should we close this issue?

Tomas-Eminger commented Feb 2, 2022

agoric@agoric-mainnet:~/.agoric/data$ du -sh *
76G     application.db
30G     blockstore.db
1019M   cs.wal
24K     evidence.db
4.0K    priv_validator_state.json
28K     snapshots
6.0G    state.db
3.9G    tx_index.db
4.0K    upgrade-info.json

It is growing pretty fast IMO...

pruning = "default"

dckc (Member, Author) commented Feb 2, 2022

Hi @Tomas-Eminger ... any thoughts on why? It would be great if you could help us diagnose this.

Tartuffo (Contributor) commented Feb 3, 2022

@FUDCo @warner FYI

FUDCo (Contributor) commented Feb 3, 2022

I'm not sure what's going on here (and it might be a genuine problem), but it's outside the scope of SwingSet, whose database is in a different set of files entirely.

Tartuffo (Contributor) commented Feb 4, 2022

@FUDCo @warner So who should this be assigned to?

FUDCo (Contributor) commented Feb 4, 2022

I'd start with @michaelfig. If he's not the one, it's likely he'll have a better idea of who should be.

Tartuffo (Contributor) commented Feb 4, 2022

@michaelfig please put an estimate on this, at least for the work to find the problem, and give it the appropriate area label. There are some numbers from one of the validators above.

michaelfig removed the cosmic-swingset (package: cosmic-swingset) label Feb 7, 2022
nitronit commented Feb 11, 2022

@michaelfig

These are our current figures:

84G	application.db
37G	blockstore.db
1017M	cs.wal
24K	evidence.db
4.0K	priv_validator_state.json
309M	snapshots
6.6G	state.db
4.3G	tx_index.db
4.0K	upgrade-info.json

And this is our overall disk space growth over the last couple of days, in gigabytes:

Date Daily growth (GB)
2022-02-01 3.03
2022-02-02 0.57
2022-02-03 0.5
2022-02-04 2.94
2022-02-05 1.66
2022-02-06 1.2
2022-02-07 1.1
2022-02-08 1.29
2022-02-09 1.14
2022-02-10 1.34
2022-02-11 1.2

*Edit:
pruning = "default"
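
For anyone who wants to collect the same kind of daily figures, here is a minimal sketch (not an official tool; it assumes a cron job or systemd timer runs it once a day and that the data directory is ~/.agoric/data, as in the du listings above):

    #!/bin/sh
    # Append a dated size sample for the Agoric data directory to a log file.
    # Differences between consecutive samples give the daily delta reported above.
    DATA_DIR="$HOME/.agoric/data"
    LOG="$HOME/agoric-disk-growth.log"
    SIZE_KB=$(du -sk "$DATA_DIR" | cut -f1)
    echo "$(date -u +"%F %H:%M") $SIZE_KB KiB" >> "$LOG"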

michaelfig (Member) commented:

I'd like someone to enable ag0 start --trace-store=<filename>, run it for a day or so, and post the output file somewhere for analysis. This can be done on a follower node rather than a validator.

I'll be running a follower soon, so will be able to see this myself, but other folks' data would be helpful too.
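
A hedged sketch of how a follower operator might capture that trace (the --trace-store flag is as named in the comment above; the run length, shutdown, and compression steps are my own suggestion, not a verified procedure):

    # Run a follower with KV-store tracing enabled for roughly a day, then
    # compress the trace for sharing. The trace file can get large quickly.
    TRACE="$HOME/agoric-trace-store.$(date -u +%F).log"
    ag0 start --trace-store="$TRACE" &
    AG_PID=$!
    sleep $((24 * 3600))     # let it run for about a day
    kill "$AG_PID"           # or stop it via your service manager
    gzip "$TRACE"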

dckc changed the title from "mainnet 0 storage growing unexpectedly rapidly" to "mainnet 0 storage growing 10x as fast as runbook says" Feb 11, 2022
shivlim commented Feb 11, 2022

du -h data
6.6G data/state.db
20K data/snapshots/metadata.db
24K data/snapshots
82G data/application.db
34G data/blockstore.db
1021M data/cs.wal
4.3G data/tx_index.db
24K data/evidence.db

From my box: 127GB so far.

Also take a look at this:
cosmos/cosmos-sdk#9859

It could be we need to set a custom pruning strategy, as the default seems to have an issue?

zmanian (Contributor) commented Feb 12, 2022

Here is a 6hr kv store trace.

[attachment: stategrowth (3 GB)]

michaelfig (Member) commented:

From @zmanian's --trace-store output, indeed, there is about 1GB of changed writes per day (almost exactly 70KB per block), just from incrementing some per-validator counter and writing out the validator set.

Here is a typical diff between two block's writes: https://gist.github.com/michaelfig/2badd5fd418798acca2ce883a6f48a6b

So it really looks like the default pruning is not working. @shivlim can you try setting some custom pruning?
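
As a sanity check on that figure (my arithmetic, assuming roughly 13,300 blocks/day, which matches the per-day block counts in the tables later in this thread): 70KB of writes per block works out to just under 1GB/day.

    # ~70 KB of writes per block, at roughly 13,300 blocks/day (~6.5s block time)
    echo "scale=2; 70 * 13300 / 1024 / 1024" | bc   # ~0.9 GB/day of application.db writes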

gavinly commented Mar 4, 2022

How are things going here?

dckc (Member, Author) commented Mar 4, 2022

tldr: we need a hero to diagnose this.

Agoric has budgeted a day of @michaelfig 's time, but other things are higher priorities, so it's not likely to happen soon.

In discord I saw a new validator ask how much storage would be used and the answer (148GB on March 3) shocked them.

I thought others followed up by sharing their pruning settings, but I can't find them.

informal snapshot sharing

see Agoric/testnet-notes#42

Catching up, but slowly

One validator reported their sync times 10 minutes apart; I projected 56hrs to catch up on 764hrs of chain time.

Block height   Block time (CST)        Chain time elapsed   Sync time             Sync time elapsed
3598850 2022-02-03 11:35:29 CST   2022-03-07 14:54:00  
3600083 2022-02-03 13:40:54 CST 02:05:25.00 2022-03-07 15:04:00 00:10:00
4017741 2022-03-07 08:12:31 CST 764:37:02.00 2022-03-09 23:31:19 56:27:20
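
The projection is just a rate extrapolation from the two samples taken 10 minutes apart (a sketch of my arithmetic, using the block heights in the table above):

    # Blocks synced in the 10-minute sample, and blocks remaining to the chain tip at the time.
    echo "3600083 - 3598850" | bc                    # 1233 blocks in 10 minutes
    echo "4017741 - 3600083" | bc                    # 417658 blocks remaining
    # Hours to catch up at that rate: remaining / (blocks per minute) / 60
    echo "scale=1; 417658 / (1233 / 10) / 60" | bc   # ~56 hours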

dckc (Member, Author) commented Mar 14, 2022

@warner just set up a follower node. After spending a couple of days catching up, he reports disk usage of about 165GiB.

dckc (Member, Author) commented Mar 23, 2022

@arirubinstein asked: how much do our nodes need to keep available?

@michaelfig says for IBC, our RPC nodes need to remember the whole bonding period (21 days).
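
To put that 21-day requirement in pruning terms (my arithmetic, assuming the ~6.5s block times implied by the per-day block counts elsewhere in this thread; the exact keep-recent value to use is not settled in this issue):

    # 21 days of blocks at ~6.5s per block -- the minimum history an IBC-serving
    # RPC node would need to keep if pruning by recent-block count.
    echo "21 * 24 * 3600 / 6.5" | bc   # ~279000 blocks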

gavinly commented Mar 24, 2022

This could be an upstream Tendermint issue.
I've heard that Osmosis is having similar issues and that they're banking on the next Tendermint upgrade.

dckc (Member, Author) commented Mar 29, 2022

These settings were just shared in a message in discord:

# default: ...
# custom: allow pruning options to be manually specified through 'pruning-keep-recent', 'pruning-keep-every', 'pruning-interval'
pruning = "custom"

# These are applied if and only if the pruning strategy is custom
pruning-keep-recent = "500"
pruning-keep-every = "1000"
pruning-interval = "10"

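
For anyone wanting to try these, a minimal sketch of applying them (it assumes the standard Cosmos SDK home layout, i.e. the settings live in ~/.agoric/config/app.toml, and that the node is restarted afterwards; this is not an official recommendation from this thread):

    # Edit the pruning options in app.toml, then restart the node.
    APP_TOML="$HOME/.agoric/config/app.toml"
    sed -i \
      -e 's/^pruning *=.*/pruning = "custom"/' \
      -e 's/^pruning-keep-recent *=.*/pruning-keep-recent = "500"/' \
      -e 's/^pruning-keep-every *=.*/pruning-keep-every = "1000"/' \
      -e 's/^pruning-interval *=.*/pruning-interval = "10"/' \
      "$APP_TOML"
    sudo systemctl restart ag0   # or however your node is supervised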

dckc changed the title from "mainnet 0 storage growing 10x as fast as runbook says" to "mainnet 0 storage growing 10x faster than 6GB/month norm" Mar 29, 2022
dckc (Member, Author) commented Mar 29, 2022

@arirubinstein and I talked this over. The 33GB in 15 days likely includes the fast sync period, when storage is consumed considerably faster than steady-state.

Our experience does support the estimate of 6GB/month in the runbook. Our default-pruning node is still syncing, but with the following pruning settings:

    "--pruning=custom",
    "--pruning-interval=10",
    "--pruning-keep-every=0",
    "--pruning-keep-recent=100"

-- https://github.com/Agoric/followers_k8s/blob/main/k8s/overlays/mainnet_prune_custom/config.json#L6-L

... we see 3GB/month (14.35 - 14.25 = 0.1 GB / day):

[chart: Kubernetes Pod - Volume usage (MEAN) for mainnet-prune-(everything_custom)]

For reference, from another Cosmos-SDK chain: Desmos docs on pruning
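
As a command line, the flags quoted above amount to the following (same values as the linked config; whether these exact values suit a validator, as opposed to a follower, is discussed further below):

    # Equivalent command-line invocation of the follower's pruning settings.
    ag0 start \
      --pruning=custom \
      --pruning-interval=10 \
      --pruning-keep-every=0 \
      --pruning-keep-recent=100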

dckc closed this as completed Mar 29, 2022
dckc removed the bug (Something isn't working) and needs-design labels Mar 29, 2022
Tomas-Eminger commented:

Recently I migrated my Agoric validator node to a new server. Previously I had been using default pruning; my new node was using:
"--pruning=custom",
"--pruning-interval=10",
"--pruning-keep-every=0",
"--pruning-keep-recent=100"

When I switched the nodes, I was able to sign other validators' blocks, but when my node was responsible for the block I was unable to sign it.

(This error was present before changing the node to a validator.)

Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=94aab58ef9bfade78470ca635a2ec4a6663cf1e6 round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=a26158a5cbb1df581dd6843ac427191af76d6d5d round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=f095bb53006ebddcbbf29c8df70dddcba6419e36 round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=b2406ba97421a9030bed25560c99b25965b6c336 round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=a45c88e13ba01392693abfb3e36f098277dcb2e2 round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=3445f4b73fdc63a1bf78c638afb122f69cb0bd4a round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=6fc23ee451a5969853825d861532676b84d7bf0c round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=ef12448f0f8671a195ab38c590cac713ad703a8b round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=173aced2385b7170fb8c55a0e639441c2e8d42b5 round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=875f8b359148f0d2a4bb501f8ae8a0cd4560bff3 round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=47c35c8137ad2098e0b2a79077fea93a530034d8 round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=c84170667fcf54024b24f05b2f9dd6608570ac8c round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=2c03e71116d1a2f9ba39a63a97058fcdeabfe2be round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=135074060a2b4f0a8a52377ed9c49189843ff898 round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=71bd0265037393f31ee9947a8e32fa494e51b637 round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=68c9c4e8388ed6936ff147ffe6b9913e79328957 round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=33bf3dde71141b5bd2d380cf87c6582762b00e0a round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=f1966845bebd30816f18635a20b86e6781211616 round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=128219027a0ac8e964e14bb906d91237bc826932 round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=afc01879fbbf293ba94ae6833cdbb6de3dcae2d8 round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=457dd6e21b7470519eb0362a2fca53c6523ca70d round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=99968808ecae7bc41b14df3bcb51b724ee5f782f round=0
Apr 13 14:55:42 agoric-mainnet-val-01 ag0[1547839]: 2:55PM ERR failed to process message err="error invalid proposal signature" height=4511445 module=consensus msg_type=*consensus.ProposalMessage peer=0c370d803934e3273c61b2577a0c6e91b9f677e0 round=0

tmkms told me that I'm double signing:

ERROR tmkms::session: [agoric-3@tcp://10.37.11.10:26658] attempted double sign PreCommit at h/r/s: 4511506/0/2 (969D5D4981 != 09D3480237)
failed to verify vote with ChainID agoric-3 and PubKey PubKeyEd25519{227506FF3CD1494629991101C0AFDA222B556DE11EA3A99D38B2B779DABD3518}: invalid signature" module=consensus

IDK if it's relevant, but I'd rather share my experience here.

When I started over with unsafe-reset-all, downloaded a Polkachu snapshot, and set pruning to default, everything started working normally :)

nitronit commented Apr 14, 2022

An "update" of my disk: the delta looks to be ≈1.3GB a day on average.
My pruning settings are 'default'.

Edit: Added start and end blocks for the period.

Time Delta in GB Start Block End Block
2022-02-28 15:00 1.82 3 911 349 3 924 634
2022-03-01 15:00 1.55 3 924 634 3 937 925
2022-03-02 15:00 1.51 3 937 925 3 951 256
2022-03-03 15:00 1.53 3 951 256 3 964 565
2022-03-04 15:00 1.42 3 964 565 3 977 807
2022-03-05 15:00 1.4 3 977 807 3 991 078
2022-03-06 15:00 1.48 3 991 078 4 004 350
2022-03-07 15:00 1.47 4 004 350 4 017 622
2022-03-08 15:00 1.22 4 017 622 4 030 896
2022-03-09 15:00 1.24 4 030 896 4 044 184
2022-03-10 15:00 1.3 4 044 184 4 057 468
2022-03-11 15:00 1.19 4 057 468 4 070 657
2022-03-12 15:00 1.17 4 070 657 4 083 947
2022-03-13 15:00 1.09 4 083 947 4 097 278
2022-03-14 15:00 1.19 4 097 278 4 110 723
2022-03-15 15:00 1.17 4 110 723 4 124 146
2022-03-16 15:00 1.15 4 124 146 4 137 588
2022-03-17 15:00 1.1 4 137 588 4 151 003
2022-03-18 15:00 1.18 4 151 003 4 164 364
2022-03-19 15:00 1.12 4 164 364 4 177 689
2022-03-20 15:00 1.16 4 177 689 4 191 123
2022-03-21 15:00 1.21 4 191 123 4 204 565
2022-03-22 15:00 1.23 4 204 565 4 218 018
2022-03-23 15:00 1.03 4 218 018 4 231 491
2022-03-24 15:00 1.16 4 231 491 4 244 927
2022-03-25 15:00 1.16 4 244 927 4 258 349
2022-03-26 15:00 1.16 4 258 349 4 271 774
2022-03-27 16:00 1.16 4 271 774 4 285 222
2022-03-28 16:00 1.22 4 285 222 4 298 667
2022-03-29 16:00 1.69 4 298 667 4 312 115
2022-03-30 16:00 1.72 4 312 115 4 325 532
2022-03-31 16:00 1.97 4 325 532 4 338 861
2022-04-01 16:00 1.95 4 338 861 4 352 013
2022-04-02 16:00 1.85 4 352 013 4 365 214
2022-04-03 16:00 1.57 4 365 214 4 378 362
2022-04-04 16:00 1.89 4 378 362 4 391 560
2022-04-05 16:00 1.96 4 391 560 4 404 860
2022-04-06 16:00 1.93 4 404 860 4 418 183
2022-04-07 16:00 1.66 4 418 183 4 431 515
2022-04-08 16:00 1.55 4 431 515 4 444 812
2022-04-09 16:00 1.39 4 444 812 4 458 042
2022-04-10 16:00 1.4 4 458 042 4 471 280
2022-04-11 16:00 1.34 4 471 280 4 484 508
2022-04-12 16:00 0.41 4 484 508 4 497 746
2022-04-13 16:00 0.36 4 497 746 4 510 976

nitronit commented May 6, 2022

I know the issue is closed, but here is an update on the size of our Agoric data folder.
*Pruning settings are at default.

Time Size in GB Delta in GB
2022-04-15 18:00 134.91 0.36
2022-04-16 18:00 135.41 0.5
2022-04-17 18:00 136 0.58
2022-04-18 18:00 136.43 0.43
2022-04-19 18:00 136.74 0.31
2022-04-20 18:00 137.3 0.56
2022-04-21 18:00 138.03 0.73
2022-04-22 18:00 138.86 0.83
2022-04-23 18:00 140.39 1.53
2022-04-24 18:00 141.83 1.44
2022-04-25 18:00 143.01 1.18
2022-04-26 18:00 144.29 1.28
2022-04-27 18:00 145.67 1.38
2022-04-28 18:00 146.84 1.17
2022-04-29 18:00 148.22 1.37
2022-04-30 18:00 149.37 1.15
2022-05-01 18:00 150.79 1.42
2022-05-02 18:00 151.91 1.12
2022-05-03 18:00 153.12 1.21
2022-05-04 18:00 154.48 1.35
I can't really understand how your internal node (the one on default) is not showing the same growth. @michaelfig @dckc

zmanian (Contributor) commented May 6, 2022

I wonder if this has something to do with compaction/garbage collection in LevelDB, and whether LevelDB might be delaying compaction if a lot of disk space is available?

dckc (Member, Author) commented May 6, 2022

@nitronit , @arirubinstein is the one who really knows what's going on around here. I'm leaving this to him.

nitronit commented May 7, 2022

I wonder if this has something to do with compaction/garbage collection in LevelDB, and whether LevelDB might be delaying compaction if a lot of disk space is available?

First of all, a useful clarification on my side: we use GoLevelDB.

@zmanian - Thanks, I think you might be onto something: a garbage-collection/compaction problem would make sense, given that the growth 'matches' #4106 (comment). I can theoretically see the benefit of delaying compaction, since it enhances block speed(?), but I can't find any reference to that in the GoLevelDB docs/repo, which makes me lean towards an error rather than deliberate delaying.

In addition, @gavinly also hinting that this might be an upstream Tendermint issue (#4106 (comment)) makes it more likely, in my ('amateur') opinion.

I will keep an eye on whether chains using newer Tendermint versions see the same issues with default settings, and keep monitoring our other Agoric nodes running with custom pruning, which indeed looks to work. Sorry to bother you all.
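
One rough way to watch for lagging compaction, if that hypothesis is right (a sketch only; it assumes goleveldb's usual on-disk layout, where table files end in .ldb, which is not something verified in this thread):

    # Count and total the goleveldb table files under application.db over time.
    # A steadily growing file count alongside stable logical state would be
    # consistent with compaction falling behind.
    DB="$HOME/.agoric/data/application.db"
    find "$DB" -name '*.ldb' | wc -l
    du -sh "$DB"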

dckc (Member, Author) commented May 7, 2022

... Sorry to bother you all.

On the contrary! Please do continue to share your experience!
