
Add cache for selection proof signatures #3223

Closed
macladson wants to merge 3 commits from the vc-sig-cache branch

Conversation

macladson
Member

Issue Addressed

#3216

Proposed Changes

Add a cache to each VC for selection proofs that have recently been computed. This ensures that duplicate selection proofs are processed quickly.

Additional Info

The maximum cache size is 64 signatures; once the cache is full, the oldest signature (or rather, the signature referencing the oldest message) is removed to make room for the next one.
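
For illustration, here is a minimal Rust sketch of a bounded cache with that evict-the-oldest behaviour. The names (SelectionProofCache, MAX_CACHE_SIZE) and the plain key/value types are placeholders rather than this PR's actual identifiers; in Lighthouse the key and signature would use its own Slot / Hash256 / Signature types, and the cache would sit behind a lock shared by the duty tasks.

use std::collections::BTreeMap;

// Placeholder types standing in for Lighthouse's real Slot / Hash256 / Signature.
type Slot = u64;
type SigningRoot = [u8; 32];
type Signature = Vec<u8>;

// Evict once the cache holds this many proofs.
const MAX_CACHE_SIZE: usize = 64;

#[derive(Default)]
struct SelectionProofCache {
    // Keyed by (slot, signing root); the BTreeMap ordering means the first
    // entry is the one referencing the oldest message.
    proofs: BTreeMap<(Slot, SigningRoot), Signature>,
}

impl SelectionProofCache {
    // Duplicate requests are served from the cache instead of re-signing.
    fn get(&self, slot: Slot, root: SigningRoot) -> Option<&Signature> {
        self.proofs.get(&(slot, root))
    }

    // Insert a freshly computed proof, evicting the entry for the oldest
    // message once the cache is full.
    fn insert(&mut self, slot: Slot, root: SigningRoot, sig: Signature) {
        if self.proofs.len() >= MAX_CACHE_SIZE {
            self.proofs.pop_first(); // smallest (slot, root), i.e. the oldest message (Rust 1.66+)
        }
        self.proofs.insert((slot, root), sig);
    }
}

fn main() {
    let mut cache = SelectionProofCache::default();
    cache.insert(100, [0u8; 32], vec![0xAA]);
    assert!(cache.get(100, [0u8; 32]).is_some());
}

Keying by slot first means pop_first always discards the entry referencing the oldest message, matching the eviction rule described above.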

WIP: Still need to remove the pre-computed selection proofs if these are no longer desired.

@macladson added the work-in-progress label May 26, 2022
@michaelsproul
Member

Thinking on it some more, I think it may be useful to keep the pre-computed sync selection proofs, because they also have the advantage of happening in the background, off the hot path of trying to sign and publish a selection proof.

@jmcruz1983

jmcruz1983 commented Jun 3, 2022

Good work guys, this feature gave a massive improvement in memory footprint compared to v2.2.1:

[Screenshot: lighthouse-4]

@michaelsproul
Member

@jmcruz1983 what about the total number of signing requests?

We're curious whether there are fewer signing requests with this PR, and whether the increases you were seeing previously were evenly spaced or due to big spikes.

@jmcruz1983

jmcruz1983 commented Jun 6, 2022

@michaelsproul
The number of signing requests didn't change much; we are running at similar numbers.
Check the before screenshot:
[Screenshot: before]
and the after screenshot:
[Screenshot: after]

Everything is working, with some decrease in Web3SignerRequestFailed errors.
Thanks

@michaelsproul
Member

Thanks for the info @jmcruz1983!

How many validators are on the machine from which that graph was drawn?

You mentioned previously that Lighthouse was doing a lot more signing than Teku, would you be able to send a similar graph for Teku? Or let us know Teku's average number of signatures per validator per epoch so we can compare.

Thanks! 🙏

@jmcruz1983

@michaelsproul
The chart combines signing ops from Teku & Lighthouse validators.
In this chart, from our test environment, we have around 4k active keys.

@jmcruz1983

I managed to isolate the Lighthouse signing requests, and the difference in the number of requests before & after is not significant:
[Screenshot: lighthouse reqs]
Here we are talking about 2 validators (each one with around 800 active keys).

@jmcruz1983

jmcruz1983 commented Jun 13, 2022

Any idea when this PR will be released?
Thanks

@michaelsproul
Member

@jmcruz1983 has the memory usage remained lower even after running this PR for a few days? We haven't been able to reproduce that on our nodes and don't know why that would occur.

@jmcruz1983

@michaelsproul
Yes, memory has stayed lower since the upgrade:
[Screenshot: mem]

@jmcruz1983

Any update here?

@michaelsproul
Member

@jmcruz1983 Would you mind trying the VC from v2.3.1 on your test system? I'm wondering if maybe the memory reduction was due to upgrading from v2.2.1.

@jmcruz1983

Hi @michaelsproul
I did the v2.3.1 upgrade and memory increased back to the previous level.
So from my metrics I can observe that the memory reduction is due to this PR's VC.
[Screenshot: lighthouse-4_mem]

@michaelsproul
Member

@jmcruz1983 Thanks again!

I'd like to understand if there are any other differences between the binaries besides the version. My guess is that the memory difference might be due to the optimisation level used to compile our core crypto library (blst).

Are you using the Docker images from DockerHub for v2.3.1? Do you use the v2.3.1-modern tag, or just v2.3.1?

When you compile the Docker image yourself for this branch, I guess you're using a recently-released CPU? In which case you'll get the equivalent of v2.3.1-modern.

As a test, would you mind trying v2.3.1-modern on your infra?

@jmcruz1983

Hi @michaelsproul
Yes, I am using images from DockerHub with the modern flavour.
So this PR's VC still had a lower memory footprint than v2.3.1-modern.

@michaelsproul
Member

michaelsproul commented Jun 16, 2022

It's a real mystery! Do you mind trying the unoptimised v2.3.1 tag instead then?

(edit: cc @jmcruz1983)

@jmcruz1983

@michaelsproul
With unoptimised v2.3.1 the memory footprint is the same:
[Screenshot: lighthouse-4_mem2]

@michaelsproul
Member

@jmcruz1983 Thanks! Can you please try building an image manually for the stable branch, following the same process you use to build this PR?

@jmcruz1983

jmcruz1983 commented Jun 17, 2022

@michaelsproul
I upgraded the validator with a custom build from the stable branch and memory consumption did decrease:
[Screenshot: lighthouse-4_mem3]

FYI,
this is the Dockerfile I use for the custom builds:

FROM library/rust:1.58.1-bullseye AS builder
RUN apt-get update && apt-get -y upgrade && apt-get install -y cmake libclang-dev

# Fetch specific https://github.com/sigp/lighthouse/pull/3223
# RUN git clone --quiet --depth 1 --branch vc-sig-cache https://github.com/macladson/lighthouse.git

# Fetch main stable branch
RUN git clone --quiet --depth 1 --branch stable https://github.com/sigp/lighthouse.git

ARG FEATURES
ENV FEATURES $FEATURES
RUN cd lighthouse && make

FROM ubuntu:latest
RUN apt-get clean \
  && rm -rf /var/lib/apt/lists/*
COPY --from=builder /usr/local/cargo/bin/lighthouse /usr/local/bin/lighthouse

# Adding non-root user
RUN adduser --disabled-password --gecos "" --home /opt/lighthouse lighthouse
RUN chown lighthouse:lighthouse /opt/lighthouse

# USER lighthouse
WORKDIR /opt/lighthouse

@michaelsproul
Member

Hi @jmcruz1983, I've opened a PR to help us debug the issue here: #3279. Once that lands in unstable we'll be able to compare the SigP-compiled :latest-unstable image against a manual build of unstable, by observing the malloc metrics. We have a handy Grafana dashboard for these metrics here: https://github.com/sigp/lighthouse-metrics/blob/master/dashboards/MallocInfoGNU.json

bors bot pushed a commit that referenced this pull request Jun 20, 2022
## Issue Addressed

Following up from #3223 (comment), it has been observed that the validator client uses vastly more memory in some compilation configurations than others. Compiling with Cross and then putting the binary into an Ubuntu 22.04 image seems to use 3x more memory than compiling with Cargo directly on Debian bullseye.

## Proposed Changes

Enable malloc metrics for the validator client. This will hopefully allow us to see the difference between the two compilation configs and compare heap fragmentation. This PR doesn't enable malloc tuning for the VC because it was found to perform significantly worse. The `--disable-malloc-tuning` flag is repurposed to just disable the metrics.
@michaelsproul
Member

@jmcruz1983 the metrics have landed in unstable if you'd like to test manually-compiled unstable vs :latest-unstable 🙏

@jmcruz1983

jmcruz1983 commented Jun 21, 2022

@michaelsproul
I just set up the dashboard and now I am running latest-unstable,
so what specifics in the dashboard do you want me to check?

@michaelsproul
Member

@jmcruz1983 If my suspicions are correct, I expect a large fraction of your VC's memory usage will be freed memory that hasn't been reclaimed, which should show up under Free block memory. This is one of our Prater nodes with 5K validators:

[Screenshot: free_block_mem]

The other main ones to keep an eye on are Mem-mapped bytes (which I think should be low) and Non-mmap memory (which should be the bulk of useful memory). We can compare these 3 counts (mmap, non-mmap, free) to the total as reported by the dashboard, or to htop's resident set size (which might be slightly larger).
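
(Side note: those three quantities roughly correspond to glibc allocator counters. If anyone wants to sanity-check them outside Grafana, a small sketch like the one below reads them via the libc crate's mallinfo2 binding; it assumes a Linux target with glibc >= 2.33 and the libc crate as a dependency, and the field-to-panel mapping is my reading of man mallinfo2 rather than Lighthouse's exact metric definitions.)

// Rough sketch: dump glibc malloc statistics (glibc >= 2.33, `libc` crate).
fn main() {
    // SAFETY: mallinfo2 only copies allocator counters into a plain struct.
    let info = unsafe { libc::mallinfo2() };

    println!("non-mmapped arena bytes: {}", info.arena);     // roughly "Non-mmap memory"
    println!("mmapped bytes:           {}", info.hblkhd);    // roughly "Mem-mapped bytes"
    println!("in-use allocated bytes:  {}", info.uordblks);
    println!("free, unreclaimed bytes: {}", info.fordblks);  // roughly "Free block memory"
}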

@jmcruz1983

jmcruz1983 commented Jun 21, 2022

@michaelsproul
I ran latest-unstable, then upgraded to a custom build from the unstable branch, and I captured some screenshots from the Malloc Info dashboard:

  • Blue line belongs to the latest-unstable build pulled from Docker
  • Orange line belongs to the custom build from the unstable branch

You can observe that the custom build shows a better memory footprint.
[Screenshots: malloc_info1, malloc_info2, malloc_info3]

@michaelsproul
Member

Thanks @jmcruz1983. We'll do some more investigations on our end to see what could be using so much memory when the VC uses web3signer.

@jmcruz1983

jmcruz1983 commented Jun 29, 2022

As requested by @paulhauner, I took a snapshot of the vc_signature_cache_hits metric:

[Screenshot: cache hits]

@macladson
Member Author

I was able to reproduce the memory issues locally and have written up an issue for it at #3302.
Thanks for your help in discovering this @jmcruz1983 🙏

@macladson
Member Author

Going to close this for now; it seems like we get little to no benefit from this particular implementation. If these issues come up again, we could try investigating different avenues.

@macladson closed this Sep 12, 2022
@macladson deleted the vc-sig-cache branch January 27, 2023 18:20