Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

introduce memory watchdog; LOTUS_MAX_HEAP #5101

Merged
merged 7 commits into from
Dec 3, 2020
Merged

Conversation

raulk
Copy link
Member

@raulk raulk commented Dec 2, 2020

Subsumes #4930.

This PR introduces major improvements to memory management:

  1. it introduces the memory watchdog, a utility that observes a user-defined heap limit, falling back to a discovered system memory limit (total memory available), and forces GC based on a watermark policy.
    • see https://github.com/raulk/go-watchdog/ for more information.
    • watermarks are 0.50, 0.60, 0.70, 0.85, 0.90, 0.925, 0.95; this means that the watchdog will trigger GC every time that utilisation grows above one of those marks.
    • emergency watermark is 0.95; this means that if utilisation is above 95%, the watchdog will always force GC even if within the silence period.
    • silence period is 45s; this means that the watchdog will refrain from forcing GC if a GC run finished less than 45s ago.
  2. users can now set a maximum heap size in the LOTUS_MAX_HEAP env variable.
    • format: bytes (12345678) or SI bytes (32GiB).
    • if a maximum heap is set, the watchdog will evaluate the policy against heap utilisation.
    • if a maximum heap is not set, the watchdog will evaluate the policy against total system usage / available.
  3. it sets a size for the badger index cache. This value was previously unset, which badger interpreted as "keep all indices in memory", rather than "do not keep anything in memory" (which is what was reasonable to expect when caches are unset...) See description in badger options: enable IndexCache and CompactL0OnClose. #4930 for more information.
    • the value is set at 20% of the maximum heap, or total system memory, clamped at a minimum bound of 256MiB, and a maximum bound of 1GiB.
    • EDIT: see comments.
  4. it also sets CompactL0OnClose = true, to enable read-only open by processes like Sentinel.

Relates to #4445, #4877, #1876, #1895, #4487.
Fixes #5058.
Epic: #4753.

@raulk
Copy link
Member Author

raulk commented Dec 2, 2020

Ok, discovery about badger:

  • if do not set IndexCacheSize to a value, it falls back to keeping all indices in memory. This is around 2MiB for the index (block offsets) + 5MiB for the bloom filter per table. An archive node may have 3000 tables. That’s around 15GiB of memory gone to indices. We knew this.
  • if you set IndexCacheSize to, for example, 1GiB, badger will keep 1GiB worth of index and bloom filters in memory. However, the cost of a miss is really, really high. It’s what I was seeing in the profiles above. A miss has to deserialise the entire index and the bloom filter (very expensive), again. And it’s likely that the cache will reject that add, so you will keep incurring in this cost over and over. We didn't know this.

There could be a sweet value for IndexCacheSize, but it’s going to vary wildly from store to store, based on the amount of data the store holds, the compaction state, etc. I think there’s no other option than to bite the bullet for now and keep the indices + blooms in memory.

With the hot/cold store, we can revisit this conversation, because it certainly will make zero sense to hold all the cold tables in memory. That is, if we haven't moved away from badger already.

@raulk
Copy link
Member Author

raulk commented Dec 2, 2020

I will back out the change to set the IndexCacheSize to a non-zero value.

Copy link
Contributor

@magik6k magik6k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just one question

// stop isn't nice.
opts.CompactL0OnClose = false
// read-only and efficiently queried.
opts.CompactL0OnClose = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be problematic/lead to data loss if user triggers shutdown, and after a few seconds seeing that the process isn't stopping, kills the process during compaction?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how Lotus used to behave before the native badger blockstore:

opts.Options = dgbadger.DefaultOptions("").WithTruncate(true).
WithValueThreshold(1 << 10)

The default value of this flag is true.

Flipping it to false was actually a regression, and it was caught by the Sentinel team because their lens component opens the store in read-only mode. Since L0 (unsorted level) hadn't been compacted to L1 (sorted level), badger refused to open it.

If we didn't notice corruption issues before, we shouldn't notice them now. However, I won't put my hand in the fire for badger...

@raulk
Copy link
Member Author

raulk commented Dec 3, 2020

I'm thinking that we might want to introduce a feature flag (env variable) to disable the memory watchdog entirely in case we find that some OS reports wrong system information.

@raulk
Copy link
Member Author

raulk commented Dec 3, 2020

^^ Done in f9cbf6d.

@raulk raulk changed the title introduce memory watchdog; LOTUS_MAX_HEAP; adjust badger cache sizes introduce memory watchdog; LOTUS_MAX_HEAP Dec 3, 2020
@raulk raulk mentioned this pull request Dec 3, 2020
@magik6k magik6k merged commit dc06d30 into master Dec 3, 2020
@magik6k magik6k deleted the raulk/memory-watchdog branch December 3, 2020 14:46
@raulk
Copy link
Member Author

raulk commented Dec 3, 2020

Documented in filecoin-project/filecoin-docs#609.

@kernelogic
Copy link

So if I don't set this variable, watch dog still works to prevent going over 95% of memory?

@raulk
Copy link
Member Author

raulk commented Dec 3, 2020

@kernelogic yes, it should automatically discover your total system memory and use it as a limit. It will then trigger GC at the watermarks listed above, and it will go into emergency mode when memory utilisation goes above 95%, at which point it will trigger GC every time it is called.

You should see the effective limits in log statements like this:

initialized watermark watchdog policy; watermarks: [0.5 0.6 0.7 0.85 0.9 0.925 0.95]; emergency watermark: 0.950000, thresholds: [66826909696 80192291635 93557673574 113605746483 120288437452 123629782937 126971128422]; silence period: 0s
max heap size not provided; memory watchdog will apply total system memory constraint: 133653819392 bytes

There is probably some tuning to do still, so it would be awesome if you can take this for a spin and report back. It is recommended that you run Lotus with this env variable. You will see much better working set / RSS reporting with this (which will become the default with go1.16):

GODEBUG="madvdontneed=1"

@kernelogic
Copy link

kernelogic commented Dec 3, 2020

I am setting this variable to 28GiB on my 32GiB server to see how it goes. They have been either showing "Killed" for no reason or just hung after sometime.

@raulk
Copy link
Member Author

raulk commented Dec 3, 2020

@kernelogic if those were OOM Kills (likely), this patch should help. Please report back!

@kernelogic
Copy link

OK It's been running for 24 hours without being killed. Any chance this can be looked at?
#5102

@kernelogic
Copy link

@raulk lotus crashed again, can't tell easily what went wrong except a "Killed" as the last line of the log.

2020-12-05T22:20:31.795Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreig4qnqhidvxyhf2e2jb5giv76fui3d76fc6lgnshlfo5uyahsddva", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealUnknown"}
2020-12-05T22:20:31.851Z        INFO    chain   chain/sync.go:718       block validation        {"took": 1.056991604, "height": "296681", "age": 1.851946503}
2020-12-05T22:20:31.855Z        INFO    chainstore      store/store.go:513      New heaviest tipset! [bafy2bzaceahaonzvgpi7xu3kxtwbkhxvjzba2lqcofzaflolxz2qjjsfjyacy bafy2bzacebw5itgp7opq5iupyamoaelym2rw2ipbdoxfruofp2e4vnqlquanw bafy2bzacedgkpqmxi7vc3z4sodjinsw6o6mcjiz3eh6o3br43zbnycs5uihbo bafy2bzacecmgbvmbe3yzhavir4ie5mz7njixxjv3jcodh2vmmg2kddlymyt72 bafy2bzacecmw6ixvibcvcs2v3aeb644nhiorxb3g54cezbox7tfcop2ati342] (height=296681)
2020-12-05T22:20:31.857Z        INFO    chain   chain/sync_manager.go:322       worker 37182 done; took 1.162468834s
2020-12-05T22:20:32.006Z        WARN    storagemarket_impl      clientstates/client_states.go:176       error when querying provider deal state: %!w(*xerrors.wrapError=&{failed to read deal status response 0xc000192100 {[9789323 25641765 25653558]}})
2020-12-05T22:20:32.010Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreibc23linpdfblm3m2cxrhzhkreqasoxzxxsuazphkyzmov62gevyy", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealUnknown"}
2020-12-05T22:20:32.104Z        INFO    chain   chain/sync_manager.go:232       selected sync target: [bafy2bzaceahaonzvgpi7xu3kxtwbkhxvjzba2lqcofzaflolxz2qjjsfjyacy bafy2bzacebw5itgp7opq5iupyamoaelym2rw2ipbdoxfruofp2e4vnqlquanw bafy2bzacedgkpqmxi7vc3z4sodjinsw6o6mcjiz3eh6o3br43zbnycs5uihbo bafy2bzacecmgbvmbe3yzhavir4ie5mz7njixxjv3jcodh2vmmg2kddlymyt72 bafy2bzacecmw6ixvibcvcs2v3aeb644nhiorxb3g54cezbox7tfcop2ati342]
2020-12-05T22:20:32.105Z        INFO    chain   chain/sync_manager.go:314       worker 37184 syncing in [bafy2bzaceahaonzvgpi7xu3kxtwbkhxvjzba2lqcofzaflolxz2qjjsfjyacy bafy2bzacebw5itgp7opq5iupyamoaelym2rw2ipbdoxfruofp2e4vnqlquanw bafy2bzacedgkpqmxi7vc3z4sodjinsw6o6mcjiz3eh6o3br43zbnycs5uihbo bafy2bzacecmgbvmbe3yzhavir4ie5mz7njixxjv3jcodh2vmmg2kddlymyt72 bafy2bzacecmw6ixvibcvcs2v3aeb644nhiorxb3g54cezbox7tfcop2ati342]
2020-12-05T22:20:32.105Z        INFO    chain   chain/sync_manager.go:322       worker 37184 done; took 12.74µs
2020-12-05T22:20:32.210Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreifegmuyu7bwrhlrrmeycqrexije4hqmbjtg3odpurk5koodv6izvq", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealUnknown"}
2020-12-05T22:20:32.220Z        WARN    storagemarket_impl      clientstates/client_states.go:176       error when querying provider deal state: %!w(*xerrors.wrapError=&{failed to read deal status response 0xc000192100 {[9789323 25641765 25653558]}})
2020-12-05T22:20:32.256Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreialdpzuzepplxczriiiplmnalness7mxe2khf37fkghv3bxwuzt5a", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealUnknown"}
2020-12-05T22:20:32.422Z        WARN    storagemarket_impl      clientstates/client_states.go:176       error when querying provider deal state: %!w(*xerrors.wrapError=&{failed to read deal status response 0xc000192100 {[9789323 25641765 25653558]}})
2020-12-05T22:20:32.468Z        WARN    storagemarket_impl      clientstates/client_states.go:176       error when querying provider deal state: %!w(*xerrors.wrapError=&{failed to read deal status response 0xc000192100 {[9789323 25641765 25653558]}})
2020-12-05T22:20:32.670Z        WARN    vm      vm/runtime.go:332       Abortf: partition already proven: {{[12] false} map[] map[]}
2020-12-05T22:20:32.674Z        WARN    vm      vm/runtime.go:145       VM.Call failure: partition already proven: {{[12] false} map[] map[]} (RetCode=16):
    github.com/filecoin-project/specs-actors/v2/actors/builtin/miner.Actor.SubmitWindowedPoSt.func1
        /root/go/pkg/mod/github.com/filecoin-project/specs-actors/v2@v2.3.2/actors/builtin/miner/miner_actor.go:448
2020-12-05T22:20:32.674Z        WARN    vm      vm/vm.go:528    Send actor error        {"from": "f3vnghdrvacjgelomaukbj3f2tzx7jg4ugfol6qt43ul2mks7hkstes5wlwsj5qxpslsnfsfktishpyynydj6a", "to": "f020452", "nonce": 1616, "method": "5", "height": "296681", "error": "partition already proven: {{[12] false} map[] map[]} (RetCode=16):\n    github.com/filecoin-project/specs-actors/v2/actors/builtin/miner.Actor.SubmitWindowedPoSt.func1\n        /root/go/pkg/mod/github.com/filecoin-project/specs-actors/v2@v2.3.2/actors/builtin/miner/miner_actor.go:448"}
2020-12-05T22:20:32.707Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreidhcwhhg5b5pqdjpt7thbt72wf5bsgdb5ahhiklbvcadfuv2zesfa", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealUnknown"}
2020-12-05T22:20:32.925Z        WARN    storagemarket_impl      clientstates/client_states.go:176       error when querying provider deal state: %!w(*xerrors.wrapError=&{failed to read deal status response 0xc000192100 {[9789323 25641765 25653558]}})
2020-12-05T22:20:33.088Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreig4qnqhidvxyhf2e2jb5giv76fui3d76fc6lgnshlfo5uyahsddva", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealUnknown"}
2020-12-05T22:20:33.334Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreibc23linpdfblm3m2cxrhzhkreqasoxzxxsuazphkyzmov62gevyy", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealUnknown"}
2020-12-05T22:20:33.533Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreifegmuyu7bwrhlrrmeycqrexije4hqmbjtg3odpurk5koodv6izvq", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealUnknown"}
2020-12-05T22:20:33.596Z        WARN    storagemarket_impl      clientstates/client_states.go:176       error when querying provider deal state: %!w(*xerrors.wrapError=&{failed to read deal status response 0xc000192100 {[9789323 25641765 25653558]}})
2020-12-05T22:20:33.590Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreialdpzuzepplxczriiiplmnalness7mxe2khf37fkghv3bxwuzt5a", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealUnknown"}
2020-12-05T22:20:34.158Z        WARN    storagemarket_impl      clientstates/client_states.go:176       error when querying provider deal state: %!w(*xerrors.wrapError=&{failed to read deal status response 0xc000192100 {[9789323 25641765 25653558]}})
2020-12-05T22:20:34.209Z        WARN    storagemarket_impl      clientstates/client_states.go:176       error when querying provider deal state: %!w(*xerrors.wrapError=&{failed to read deal status response 0xc000192100 {[9789323 25641765 25653558]}})
2020-12-05T22:20:34.257Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreidhcwhhg5b5pqdjpt7thbt72wf5bsgdb5ahhiklbvcadfuv2zesfa", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealUnknown"}
2020-12-05T22:20:34.504Z        WARN    storagemarket_impl      clientstates/client_states.go:176       error when querying provider deal state: %!w(*xerrors.wrapError=&{failed to read deal status response 0xc000192100 {[9789323 25641765 25653558]}})
2020-12-05T22:20:34.811Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreig4qnqhidvxyhf2e2jb5giv76fui3d76fc6lgnshlfo5uyahsddva", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealUnknown"}
2020-12-05T22:20:35.252Z        WARN    storagemarket_impl      clientstates/client_states.go:176       error when querying provider deal state: %!w(*xerrors.wrapError=&{failed to read deal status response 0xc000192100 {[9789323 25641765 25653558]}})
2020-12-05T22:20:35.459Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreifegmuyu7bwrhlrrmeycqrexije4hqmbjtg3odpurk5koodv6izvq", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealUnknown"}
2020-12-05T22:20:35.600Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreibc23linpdfblm3m2cxrhzhkreqasoxzxxsuazphkyzmov62gevyy", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealUnknown"}
2020-12-05T22:20:35.736Z        WARN    storagemarket_impl      clientstates/client_states.go:176       error when querying provider deal state: %!w(*xerrors.wrapError=&{failed to read deal status response 0xc000192100 {[9789323 25641765 25653558]}})
2020-12-05T22:20:35.898Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreialdpzuzepplxczriiiplmnalness7mxe2khf37fkghv3bxwuzt5a", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealUnknown"}
Killed
root@li1979-101:/opt/lotus# 

@kernelogic
Copy link

@raulk I am seeing a lot of emergency GC from watchdog - a few times per minute. But memory usage still high.

@raulk
Copy link
Member Author

raulk commented Dec 8, 2020

@kernelogic Thanks for the info! Could you please take a heap dump during emergency state, and upload it here?

$ wget -O kernelogic.heap.out http://localhost:1234/debug/pprof/heap

@kernelogic
Copy link

kernelogic commented Dec 16, 2020

My daemon crashed today on a 64GB RAM machine. You can see in the logs emergency triggers 5 times within 30 seconds and eventually crashed. Therefore I cannot get the heap. lotus version 1.2.2+git.a999e4167

2020-12-15T23:54:07.360Z        INFO    watchdog        go-watchdog@v0.0.1/watermarks.go:85     watermark policy: emergency trigger; perc: 0.993900/0.950000 (% used/% emergency), utilization: 67035275264/70997580530/67447701504 (used/emergency/limit), used: %!d(MISSING), limit: %!d(MISSING),
2020-12-15T23:54:07.361Z        INFO    watchdog        go-watchdog@v0.0.1/watchdog.go:242      watchdog policy fired
2020-12-15T23:54:07.361Z        INFO    watchdog        go-watchdog@v0.0.1/watchdog.go:249      watchdog is triggering GC
2020-12-15T23:54:07.675Z        WARN    vm      vm/runtime.go:145       VM.Call failure: not enough gas: used=54337951, available=54484550, use=716340 (RetCode=7):
    github.com/filecoin-project/lotus/chain/vm.(*Runtime).chargeGasInternal
        /opt/lotus/chain/vm/runtime.go:575
2020-12-15T23:54:07.675Z        WARN    vm      vm/runtime.go:376       vmctx send failed: to: f04, method: 8: ret: [], err: not enough gas: used=54337951, available=54484550, use=716340 (RetCode=7)
2020-12-15T23:54:07.675Z        WARN    vm      vm/runtime.go:332       Abortf: failed to submit proof for bulk verification
2020-12-15T23:54:07.675Z        WARN    vm      vm/runtime.go:145       VM.Call failure: failed to submit proof for bulk verification (RetCode=7):
    github.com/filecoin-project/specs-actors/v2/actors/builtin.RequireSuccess
        /root/go/pkg/mod/github.com/filecoin-project/specs-actors/v2@v2.3.2/actors/builtin/shared.go:49
2020-12-15T23:54:07.675Z        WARN    vm      vm/vm.go:528    Send actor error        {"from": "f3wabdxwvlqg2svdsjhxidyj2fwrbenqeonlw2hi5isk455vj3c3444ok4gla2dj3y2vd2pzari77xcw5ll3ra", "to": "f03223", "nonce": 90656, "method": "7", "height": "325668", "error": "failed to submit proof for bulk verification (RetCode=7):\n    github.com/filecoin-project/specs-actors/v2/actors/builtin.RequireSuccess\n        /root/go/pkg/mod/github.com/filecoin-project/specs-actors/v2@v2.3.2/actors/builtin/shared.go:49"}
2020-12-15T23:54:08.108Z        INFO    watchdog        go-watchdog@v0.0.1/watchdog.go:253      watchdog-triggered GC finished; took: 746.534364ms; current heap allocated: 56371899368 bytes
2020-12-15T23:54:09.779Z        WARN    vm      vm/runtime.go:145       VM.Call failure: not enough gas: used=47056674, available=47561757, use=716340 (RetCode=7):
    github.com/filecoin-project/lotus/chain/vm.(*Runtime).chargeGasInternal
        /opt/lotus/chain/vm/runtime.go:575
2020-12-15T23:54:09.779Z        WARN    vm      vm/runtime.go:376       vmctx send failed: to: f04, method: 8: ret: [], err: not enough gas: used=47056674, available=47561757, use=716340 (RetCode=7)
2020-12-15T23:54:09.779Z        WARN    vm      vm/runtime.go:332       Abortf: failed to submit proof for bulk verification
2020-12-15T23:54:09.779Z        WARN    vm      vm/runtime.go:145       VM.Call failure: failed to submit proof for bulk verification (RetCode=7):
    github.com/filecoin-project/specs-actors/v2/actors/builtin.RequireSuccess
        /root/go/pkg/mod/github.com/filecoin-project/specs-actors/v2@v2.3.2/actors/builtin/shared.go:49
2020-12-15T23:54:09.779Z        WARN    vm      vm/vm.go:528    Send actor error        {"from": "f3rl3a7aalxsnsub3bkaejoqtxnvgabyp6cdrsns2immqwsh5wz4zgj22fq74okf7btouro4dmp4fdf5d6aeaa", "to": "f0100033", "nonce": 123991, "method": "7", "height": "325668", "error": "failed to submit proof for bulk verification (RetCode=7):\n    github.com/filecoin-project/specs-actors/v2/actors/builtin.RequireSuccess\n        /root/go/pkg/mod/github.com/filecoin-project/specs-actors/v2@v2.3.2/actors/builtin/shared.go:49"}
2020-12-15T23:54:13.125Z        INFO    watchdog        go-watchdog@v0.0.1/watermarks.go:85     watermark policy: emergency trigger; perc: 0.995000/0.950000 (% used/% emergency), utilization: 67113721856/70997580530/67447701504 (used/emergency/limit), used: %!d(MISSING), limit: %!d(MISSING),
2020-12-15T23:54:13.129Z        INFO    watchdog        go-watchdog@v0.0.1/watchdog.go:242      watchdog policy fired
2020-12-15T23:54:13.129Z        INFO    watchdog        go-watchdog@v0.0.1/watchdog.go:249      watchdog is triggering GC
2020-12-15T23:54:13.947Z        INFO    watchdog        go-watchdog@v0.0.1/watchdog.go:253      watchdog-triggered GC finished; took: 817.600687ms; current heap allocated: 56490022320 bytes
2020-12-15T23:54:18.954Z        INFO    watchdog        go-watchdog@v0.0.1/watermarks.go:85     watermark policy: emergency trigger; perc: 0.996900/0.950000 (% used/% emergency), utilization: 67236605952/70997580530/67447701504 (used/emergency/limit), used: %!d(MISSING), limit: %!d(MISSING),
2020-12-15T23:54:18.958Z        INFO    watchdog        go-watchdog@v0.0.1/watchdog.go:242      watchdog policy fired
2020-12-15T23:54:18.959Z        INFO    watchdog        go-watchdog@v0.0.1/watchdog.go:249      watchdog is triggering GC
2020-12-15T23:54:19.708Z        INFO    watchdog        go-watchdog@v0.0.1/watchdog.go:253      watchdog-triggered GC finished; took: 748.413373ms; current heap allocated: 56650042680 bytes
2020-12-15T23:54:22.902Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreicljwntloiblrck5k6yqbneibxph36tyolgmxl5trmhchztew32wa", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:22.909Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreift7klgue7kmfmkolvtviebkiubijv2bdqczn7io6ceznvevilzaq", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:22.909Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreifde4l4uuhwk2b2umaqc66abwe37tpzsni32cumlvv6yc4opfhjpu", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:22.909Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreif65pc5czkia7i4qs2lkk5z6dy5gvffoove7zd3hpcjeqrx5qcsvq", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:24.489Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreicljwntloiblrck5k6yqbneibxph36tyolgmxl5trmhchztew32wa", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:24.497Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreif65pc5czkia7i4qs2lkk5z6dy5gvffoove7zd3hpcjeqrx5qcsvq", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:24.497Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreift7klgue7kmfmkolvtviebkiubijv2bdqczn7io6ceznvevilzaq", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:24.497Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreifde4l4uuhwk2b2umaqc66abwe37tpzsni32cumlvv6yc4opfhjpu", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:24.721Z        INFO    watchdog        go-watchdog@v0.0.1/watermarks.go:85     watermark policy: emergency trigger; perc: 0.997800/0.950000 (% used/% emergency), utilization: 67300225024/70997580530/67447701504 (used/emergency/limit), used: %!d(MISSING), limit: %!d(MISSING),
2020-12-15T23:54:24.721Z        INFO    watchdog        go-watchdog@v0.0.1/watchdog.go:242      watchdog policy fired
2020-12-15T23:54:24.721Z        INFO    watchdog        go-watchdog@v0.0.1/watchdog.go:249      watchdog is triggering GC
2020-12-15T23:54:25.419Z        INFO    watchdog        go-watchdog@v0.0.1/watchdog.go:253      watchdog-triggered GC finished; took: 695.939947ms; current heap allocated: 56727460056 bytes
2020-12-15T23:54:26.261Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreicljwntloiblrck5k6yqbneibxph36tyolgmxl5trmhchztew32wa", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:26.287Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreifde4l4uuhwk2b2umaqc66abwe37tpzsni32cumlvv6yc4opfhjpu", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:26.287Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreift7klgue7kmfmkolvtviebkiubijv2bdqczn7io6ceznvevilzaq", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:26.287Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreif65pc5czkia7i4qs2lkk5z6dy5gvffoove7zd3hpcjeqrx5qcsvq", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:28.130Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreicljwntloiblrck5k6yqbneibxph36tyolgmxl5trmhchztew32wa", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:28.138Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreift7klgue7kmfmkolvtviebkiubijv2bdqczn7io6ceznvevilzaq", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:28.138Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreifde4l4uuhwk2b2umaqc66abwe37tpzsni32cumlvv6yc4opfhjpu", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:28.138Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreif65pc5czkia7i4qs2lkk5z6dy5gvffoove7zd3hpcjeqrx5qcsvq", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:29.921Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreift7klgue7kmfmkolvtviebkiubijv2bdqczn7io6ceznvevilzaq", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:29.934Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreicljwntloiblrck5k6yqbneibxph36tyolgmxl5trmhchztew32wa", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:30.128Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreif65pc5czkia7i4qs2lkk5z6dy5gvffoove7zd3hpcjeqrx5qcsvq", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:30.137Z        INFO    markets loggers/loggers.go:15   storage event   {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreifde4l4uuhwk2b2umaqc66abwe37tpzsni32cumlvv6yc4opfhjpu", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealPublishing"}
2020-12-15T23:54:30.375Z        INFO    chain   chain/sync_manager.go:232       selected sync target: [bafy2bzaceccaooj5oucsznyhdensj3lgmi3sqq2sqytvruujcsilasmmav3wq]
2020-12-15T23:54:30.380Z        INFO    chain   chain/sync_manager.go:314       worker 542 syncing in [bafy2bzaceccaooj5oucsznyhdensj3lgmi3sqq2sqytvruujcsilasmmav3wq]
2020-12-15T23:54:30.481Z        INFO    watchdog        go-watchdog@v0.0.1/watermarks.go:85     watermark policy: emergency trigger; perc: 0.999400/0.950000 (% used/% emergency), utilization: 67404775424/70997580530/67447701504 (used/emergency/limit), used: %!d(MISSING), limit: %!d(MISSING),
2020-12-15T23:54:30.490Z        INFO    watchdog        go-watchdog@v0.0.1/watchdog.go:242      watchdog policy fired
2020-12-15T23:54:30.491Z        INFO    watchdog        go-watchdog@v0.0.1/watchdog.go:249      watchdog is triggering GC
2020-12-15T23:54:30.575Z        INFO    chain   chain/sync_manager.go:232       selected sync target: [bafy2bzaceasrp4tnkgywzrz7wcjvzusjiyrl442bqopj7fayr5vblakzbvvba]
2020-12-15T23:54:30.582Z        INFO    chain   chain/sync_manager.go:314       worker 543 syncing in [bafy2bzaceasrp4tnkgywzrz7wcjvzusjiyrl442bqopj7fayr5vblakzbvvba]
2020-12-15T23:54:30.586Z        INFO    chain   chain/sync_manager.go:232       selected sync target: [bafy2bzacecyv6mmqu6fup5px45r3z5r7arczgax6crkkhp27aonyc2twxmdai]
2020-12-15T23:54:30.586Z        INFO    chain   chain/sync_manager.go:314       worker 544 syncing in [bafy2bzacecyv6mmqu6fup5px45r3z5r7arczgax6crkkhp27aonyc2twxmdai]
2020-12-15T23:54:31.606Z        INFO    chain   chain/sync.go:718       block validation        {"took": 1.192934433, "height": "325669", "age": 1.606299839}
2020-12-15T23:54:31.630Z        INFO    chainstore      store/store.go:513      New heaviest tipset! [bafy2bzaceccaooj5oucsznyhdensj3lgmi3sqq2sqytvruujcsilasmmav3wq] (height=325669)
2020-12-15T23:54:31.651Z        INFO    chain   chain/sync_manager.go:322       worker 542 done; took 1.269114769s
2020-12-15T23:54:31.822Z        INFO    chain   chain/sync.go:718       block validation        {"took": 1.066193716, "height": "325669", "age": 1.822802653}
2020-12-15T23:54:31.812Z        INFO    chain   chain/sync.go:718       block validation        {"took": 1.041664067, "height": "325669", "age": 1.812612601}
2020-12-15T23:54:31.835Z        INFO    chainstore      store/store.go:513      New heaviest tipset! [bafy2bzaceccaooj5oucsznyhdensj3lgmi3sqq2sqytvruujcsilasmmav3wq bafy2bzaceasrp4tnkgywzrz7wcjvzusjiyrl442bqopj7fayr5vblakzbvvba bafy2bzacecyv6mmqu6fup5px45r3z5r7arczgax6crkkhp27aonyc2twxmdai] (height=325669)
2020-12-15T23:54:31.841Z        INFO    chain   chain/sync_manager.go:322       worker 543 done; took 1.253361997s
2020-12-15T23:54:31.841Z        INFO    chain   chain/sync_manager.go:322       worker 544 done; took 1.254838722s
2020-12-15T23:54:31.917Z        INFO    watchdog        go-watchdog@v0.0.1/watchdog.go:253      watchdog-triggered GC finished; took: 1.420877493s; current heap allocated: 56961550536 bytes
2020-12-15T23:54:36.577Z        INFO    pubsub  go-libp2p-pubsub@v0.4.0/gossipsub.go:1499       peer 16Uiu2HAmRH6MMzsPXNRWzQvFyQgV8CVe3e6hsffjNBm5FavWDhJJ didn't follow up in 1 IWANT requests; adding penalty
2020-12-15T23:54:37.069Z        INFO    watchdog        go-watchdog@v0.0.1/watermarks.go:85     watermark policy: emergency trigger; perc: 1.000000/0.950000 (% used/% emergency), utilization: 67447701504/70997580530/67447701504 (used/emergency/limit), used: %!d(MISSING), limit: %!d(MISSING),
Killed

@kernelogic
Copy link

Another advice, I think the first water mark 0.5 is a bit...low. The minimum RAM for miners is set to 128GB and when doing wdpost it will use like 80GB of RAM. This GC will just keep running every 45 seconds because it always exceed 0.5.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

implement a memory watchdog
3 participants