Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: extract gossipsub / discv5 dependency gauge prometheus metrics #9710

Merged
merged 38 commits into from
Nov 7, 2024

Conversation

Maddiaa0
Copy link
Member

@Maddiaa0 Maddiaa0 commented Nov 4, 2024

image

peers for each topic
Uploading image.png…

Get access to all of the p2p metrics counts to help investigate issues

fixes: #9691

The conversion is non trivial so only gauge operations are activated for now

@Maddiaa0 Maddiaa0 changed the title wip: extract gossipsub / libp2p / discv5 dependency prometheus metrics feat: extract gossipsub / discv5 dependency gauge prometheus metrics Nov 5, 2024
@Maddiaa0 Maddiaa0 marked this pull request as ready for review November 5, 2024 06:12
Copy link
Member Author

Maddiaa0 commented Nov 5, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @Maddiaa0 and the rest of your teammates on Graphite Graphite

Maddiaa0 and others added 14 commits November 5, 2024 06:35
This redoes #9704
with the PR title pattern needed for an ARM build

---------

Co-authored-by: Tom French <tom@tomfren.ch>
Was broken due to not all tests being ran when using e2e all when
implementing tagging outgoing logs.
Adds notes to the docs that indicate that authwits only work in single
player mode, describes the problems with not simulating simulations, and
updates the bberg readme install instructions

closes AztecProtocol/dev-rel#422
closes AztecProtocol/dev-rel#433
closes #9256
closes AztecProtocol/dev-rel#423
closes #6865
Minimal repro of issue in public processor, in which state is not
updated during simulation, so each tx comes out with the same
`.startState`. This hasn't been caught because all of our testing either
does:

- One tx per block with public calls, so the initial db state is correct
- Passes the *same db* to the public processor and the txHandler (see
`TestContext`), so when the handler runs
processedTxHandler.addNewTx(processedTx) it actually updates the db of
both itself and the public processor

EDIT: Note that failure is visible in `prover-client` CI tests
[here](https://github.com/AztecProtocol/aztec-packages/actions/runs/11618042996/job/32356873092#step:5:788)

---

Fix now included in this PR:

- Update `this.db` for each processed tx in `public_processor` (this is
eventually queried in `public_kernel_tail_simulator.ts` to get the state
reference to use for the kernel tail)
- Ensure that if a `txHandler` is supplied, its db is independent of the
public processor's
…rics collection (#9754)

Simplifies local docker compose using otel-lgtm, allow native metrics
collection using local testnet by just setting m flag if the docker
compose is running
I had seen a sequencer trying to build a block with 100+ transactions,
which results in a death spiral.

Also increase the timeout on discv5 for when that is supported across
multiple nodes.
This PR removes the RethrowableError hack in the AVM simulator, and
relies on the PublicContext's
[rethrow](https://github.com/AztecProtocol/aztec-packages/blob/master/noir-projects/aztec-nr/aztec/src/context/public_context.nr#L88)
to propagate the errors. There are two caveats.

First, because currently Aztec-nr does not keep track of the cause
chain, it would be impossible to have the call stack and original
contract address available, so that the PXE can interpret the error and
show debug information. Solidity has the same problem. I'm introducing a
heuristic to keep track of the call stack for the simple case where the
caller always rethrows: the simulator keeps a running trace in the
machineState, and the caller uses this trace IF the revertData
coincides. That is, if you are (re)throwing the same as what we were
tracking.

Second, while this all works well for errors in user code (e.g.,
`assert` in Noir), because they generate a revertData with an error
selector and data, the "intrinsic" errors from the simulator (aka
exceptional halts) do not work as well. E.g., "division by zero",
"duplicated nullifier", "l1 to l2 blah blah". These errors are
exceptions in typescript and do not have an associated error selector,
and do not add to the revertdata. This _could_ be done with enshrined
error selectors. That's easy in the simulator, but it's not easy in the
circuit for several reasons that are beyond the scope of this
description. The current solution is to propagate the error message (the
user will see the right error) but you cannot "catch and process" the
error in aztec.nr because there is no selector. This is not a limitation
right now because there's no interface in the PublicContext that would
let you catch errors. To be continued.

Part of #9061.
maramihali and others added 10 commits November 6, 2024 10:05
Construct the _hiding_ circuit, which recursively verifies the last
folding proof and the decider proof in Client IVC and amend the e2e
prover accordingly. The ClientIVC proof becomes a mega proof for the
hiding circuit and a goblin proof which simplifies the work required to
transform this in a zero knowledge proof.
Per #9746 , mac os compilation on ARM fails by default because of
`std::execution::par_unseq`.

This PR fixes that by explicitly disabling it on MacOS since apple clang
doesn't support it. There's probably a better plug somewhere but I'm not
ultra-familiar with how you setup the build system, should this be a
preset, should this be based on apple-clang specifically, etc. Can adapt
the PR as needed.

I also rename DISABLE_TBB here because it seems to me that this should
be part of the native C++20 support on other platforms, but I have to
admit I'm really unfamiliar with the details here and I'm not sure if
intel TBB is needed, or indeed what it does exactly. Can cut this from
the PR.

Likewise, this turns parallel algorithms ON by default on the same C++20
assumption, but that could well not work.
TODO also stop in prover node
# Change Log

- **UDP and TCP load balancers for Validator stateful set replicas**
When using `--set network.public=true`, each Validator stateful set
replica gets 2 new load balancers. This configuration can also be added
for other stateful sets if needed in the future.

- **Removed invalid `{{- if not .Values.ethereum.external }}` template
code**
`not` is not a recognized based case within Helm templating. The code
being removed was not functional in Kubernetes deployments. More
information:
https://helm.sh/docs/chart_template_guide/control_structures/#ifelse

- **Removed duplicate Otel env vars in `prover-node.yaml`**
This resolves the following Helm output warning during deployment: 

```
W1105 14:56:12.948976   57996 warnings.go:70] spec.template.spec.containers[0].env[19]: hides previous definition of "OTEL_EXPORTER_OTLP_METRICS_ENDPOINT", which may be dropped when using apply
W1105 14:56:12.949029   57996 warnings.go:70] spec.template.spec.containers[0].env[20]: hides previous definition of "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT", which may be dropped when using apply
W1105 14:56:12.949040   57996 warnings.go:70] spec.template.spec.containers[0].env[21]: hides previous definition of "OTEL_EXPORTER_OTLP_LOGS_ENDPOINT", which may be dropped when using apply
```

Previous prover ENV VARS:

```
            - name: OTEL_EXPORTER_OTLP_METRICS_ENDPOINT
              value: {{ include "aztec-network.otelCollectorMetricsEndpoint" . | quote }}
            - name: OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
              value: {{ include "aztec-network.otelCollectorTracesEndpoint" . | quote }}
            - name: OTEL_EXPORTER_OTLP_LOGS_ENDPOINT
              value: {{ include "aztec-network.otelCollectorLogsEndpoint" . | quote }}
           
           ... (additional vars excluded for brevity) ...
           
            - name: OTEL_EXPORTER_OTLP_METRICS_ENDPOINT
              value: {{ include "aztec-network.otelCollectorMetricsEndpoint" . | quote }}
            - name: OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
              value: {{ include "aztec-network.otelCollectorTracesEndpoint" . | quote }}
            - name: OTEL_EXPORTER_OTLP_LOGS_ENDPOINT
              value: {{ include "aztec-network.otelCollectorLogsEndpoint" . | quote }}
```
subrepo:
  subdir:   "barretenberg"
  merged:   "c97c11293d"
upstream:
  origin:   "https://github.com/AztecProtocol/barretenberg"
  branch:   "master"
  commit:   "c97c11293d"
git-subrepo:
  version:  "0.4.6"
  origin:   "???"
  commit:   "???" [skip ci]
subrepo:
  subdir:   "noir-projects/aztec-nr"
  merged:   "ac58205331"
upstream:
  origin:   "https://github.com/AztecProtocol/aztec-nr"
  branch:   "master"
  commit:   "ac58205331"
git-subrepo:
  version:  "0.4.6"
  origin:   "???"
  commit:   "???" [skip ci]
@Maddiaa0 Maddiaa0 requested review from just-mitch and removed request for fcarreiro, charlielye and dbanks12 November 6, 2024 10:11
@@ -257,6 +260,8 @@ export class LibP2PService extends WithTracer implements P2PService {
heartbeatInterval: config.gossipsubInterval,
mcacheLength: config.gossipsubMcacheLength,
mcacheGossip: config.gossipsubMcacheGossip,
metricsRegister: otelMetricsAdapter,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow. Nice work.

@just-mitch just-mitch merged commit 58e75cd into master Nov 7, 2024
68 checks passed
@just-mitch just-mitch deleted the md/gossip-sub-metrics branch November 7, 2024 18:04
ludamad pushed a commit that referenced this pull request Nov 7, 2024
…9710)

Get access to all of the p2p metrics counts to help investigate issues

fixes: #9691
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

observability(p2p): get gossipsub metrics exported into our metrics