-
Notifications
You must be signed in to change notification settings - Fork 234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: gating test #9918
base: master
Are you sure you want to change the base?
feat: gating test #9918
Conversation
7ad1018
to
9202220
Compare
9202220
to
c17fe1d
Compare
@@ -78,7 +78,7 @@ http://{{ include "aztec-network.fullname" . }}-boot-node-0.{{ include "aztec-ne | |||
{{- if .Values.validator.externalTcpHost -}} | |||
http://{{ .Values.validator.externalTcpHost }}:{{ .Values.validator.service.nodePort }} | |||
{{- else -}} | |||
http://{{ include "aztec-network.fullname" . }}-validator-0.{{ include "aztec-network.fullname" . }}-validator.{{ .Release.Namespace }}.svc.cluster.local:{{ .Values.validator.service.nodePort }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adjusts validatorUrl
to load balance over all validators.
# the PXE has its node set to the the validator service. | ||
# and that's all we need to know to bootstrap, | ||
# since it will get a good node ENR from whatever node the PXE connects to. | ||
echo "{{ include "aztec-network.pxeUrl" . }}" > /shared/pxe_url |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted in the comment above, this looping is really just a safety/sanity check, since as long as one validator is up, the PXE should be routed to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
smallest set where we can kill 25%.
@@ -3,4 +3,4 @@ set -eu | |||
|
|||
cd "$(dirname "${BASH_SOURCE[0]}")" | |||
|
|||
helm upgrade metrics . -n metrics --values "./values/prod.yaml" --install --create-namespace --atomic $@ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to revert this change, but removing atomic was necessary to try to debug.
[SequencerState.PUBLISHING_BLOCK_TO_PEERS]: 5 + this.maxTxsPerBlock * 2, // if we take 5 seconds to create block, then 4 transactions at 2 seconds each | ||
[SequencerState.WAITING_FOR_ATTESTATIONS]: 5 + this.maxTxsPerBlock * 2 + 3, // it shouldn't take 3 seconds to publish to peers | ||
[SequencerState.PUBLISHING_BLOCK]: 5 + this.maxTxsPerBlock * 2 + 3 + 5, // wait 5 seconds for attestations | ||
[SequencerState.WAITING_FOR_TXS]: 5, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had to bump these times, as we were often trying to transition to "WAITING_FOR_TXS" at ~3.5 seconds, which was resulting in no blocks getting produced, which resulted in reorgs due at least in part to lack of support for #9404
} | ||
|
||
const txHashes = validTxs.map(tx => tx.getTxHash()); | ||
|
||
this.isFlushing = false; | ||
this.log.verbose('Collecting attestations'); | ||
this.log.info('Collecting attestations'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ludamad I'm actually not seeing these logs under in loki even with debug on. Thoughts? I'm not sure if it is all verbose
logs, or what. That was part of the impetus for upping the level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed #9958
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I am not sure. We should write good tests along with moving to a non-node debugger (winston? pino?)
ba1c2ab
to
8efe7ea
Compare
8efe7ea
to
58ce04b
Compare
This test: - updates the aztec network deployment, allowing validators to use each other as boot nodes - applies the "network-requirements" network shaping - permanently disables the boot node - runs 3 epochs during which it: - kills 25% of the validators - asserts that we miss less than 50% of slots Other work in this branch includes: - add `ignoreDroppedReceiptsFor` TX wait options - this allows sending a TX to one node, and awaiting it on another since we need time for p2p propagation - we need this since we have shifted the PXE to point at the top-level validator service, which load balances across individuals - this may help with #9613 - scalable loki deployment for prod - more visible logging for core sequencer operations - better error handling during the setup of l2 contracts - better error handling in the pxe - rename the network shaping charts to "aztec-chaos-scenarios"
1fada01
to
c5152e9
Compare
Taking a look at this one now... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked into all non-k8s stuff
export async function awaitL2BlockNumber( | ||
rollupCheatCodes: RollupCheatCodes, | ||
blockNumber: bigint, | ||
timeoutSeconds: number, | ||
logger: Logger, | ||
) { | ||
logger.info(`Waiting for L2 Block ${blockNumber}`); | ||
let tips = await rollupCheatCodes.getTips(); | ||
const endTime = Date.now() + timeoutSeconds * 1000; | ||
while (tips.pending < blockNumber && Date.now() < endTime) { | ||
logger.info(`At L2 Block ${tips.pending}`); | ||
await sleep(1000); | ||
tips = await rollupCheatCodes.getTips(); | ||
} | ||
logger.info(`Reached L2 Block ${tips.pending}`); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to throw loudly if we reach the timeout without hitting the target block number.
.catch(err => { | ||
this.log.error(err); | ||
throw err; | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the job queue handle jobs that throw ok? Either way, let's add some context to the error being logged.
.catch(err => { | ||
this.log.error(err); | ||
throw err; | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
await this.node.sendTx(tx).catch(err => { | ||
this.log.error(err); | ||
throw err; | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here
.catch(err => { | ||
this.log.error(err); | ||
throw err; | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of bumping so many verbose
to info
, it'll make all tests so much more verbose. But if we need this inorder to see those logs in k8s tests, I'm fine with it until we move to a better logger.
if (this.state === SequencerState.STOPPED && force !== true) { | ||
this.log.warn( | ||
`Cannot set sequencer from ${this.state} to ${proposedState} as it is stopped. Set force=true to override.`, | ||
); | ||
return; | ||
} | ||
const secondsIntoSlot = getSecondsIntoSlot(this.l1GenesisTime, this.aztecSlotDuration); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not pass in the current slot number, and keep the getSecondsIntoSlot
call here? Seems like it would reduce some verbosity by not having to manually call getSecondsIntoSlot
before each setState
.
@@ -32,7 +32,7 @@ spec: | |||
echo "L2 contracts initialized" | |||
env: | |||
- name: PXE_URL | |||
value: {{ include "aztec-network.pxeUrl" . | quote }} | |||
value: {{ include "aztec-network.bootNodeUrl" . | quote }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why has this changed?
|
||
import { exec, spawn } from 'child_process'; | ||
import path from 'path'; | ||
import { promisify } from 'util'; | ||
import { z } from 'zod'; | ||
|
||
import type { RollupCheatCodes } from '../../../aztec.js/src/utils/cheat_codes.js'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
turbonit: dodgy update import here
@@ -1,3 +1,6 @@ | |||
nameOverride: null | |||
fullnameOverride: null | |||
|
|||
global: | |||
# When deploying, override the namespace to where spartan will deploy to, this will apply all chaos experiments to all pods within that namespace | |||
# run deployment with --values global.namespace=your-namespace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks for including this note. It wasn't clear to me at first why the values files used different namespace refs.
@@ -9,7 +9,6 @@ telemetry: | |||
otelCollectorEndpoint: http://metrics-opentelemetry-collector.metrics:4318 | |||
|
|||
validator: | |||
debug: "aztec:*,-aztec:avm_simulator:*,-aztec:libp2p_service" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this level of logging causing stability issues? (for my own understanding)
Main feature of this branch (unlike the branch name indicates) is the
gating_passive.test.ts
.This test:
Other work in this branch includes:
ignoreDroppedReceiptsFor
TX wait optionsFix #9713
Fix #9883