
feat: gating test #9918

Open · wants to merge 2 commits into master
Conversation

just-mitch (Contributor) commented Nov 13, 2024

Main feature of this branch (despite what the branch name suggests) is gating_passive.test.ts.

This test (sketched below):

  • updates the aztec network deployment, allowing validators to use each other as boot nodes
  • applies the "network-requirements" network shaping
  • permanently disables the boot node
  • runs 3 epochs during which it:
    • kills 25% of the validators
    • asserts that we miss less than 50% of slots
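For orientation, a rough sketch of the test's shape; the helper names (`disableBootNode`, `killRandomValidators`, `getMissedSlotStats`) are illustrative stand-ins, not the actual helpers in gating_passive.test.ts:

const EPOCH_COUNT = 3;
const KILL_FRACTION = 0.25; // kill 25% of the validators each epoch
const MAX_MISSED_SLOT_RATE = 0.5; // assert we miss less than 50% of slots

it('passive gating: survives validator churn with the boot node down', async () => {
  await disableBootNode(); // permanently disabled for the duration of the test
  for (let epoch = 0; epoch < EPOCH_COUNT; epoch++) {
    await killRandomValidators(KILL_FRACTION);
    const { missed, total } = await getMissedSlotStats(epoch);
    expect(missed / total).toBeLessThan(MAX_MISSED_SLOT_RATE);
  }
});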

Other work in this branch includes:

  • allow sequencers to start building as soon as they see they can propose in the next L1 slot
  • add `ignoreDroppedReceiptsFor` TX wait options (usage sketched below)
    • this allows sending a TX to one node and awaiting it on another, since p2p propagation takes time
    • we need this because the PXE now points at the top-level validator service, which load balances across individual validators
    • this may help with #9613
  • a scalable loki deployment for prod
  • more visible logging for core sequencer operations
  • better error handling during the setup of L2 contracts
  • better error handling in the PXE
  • rename the network shaping charts to "aztec-chaos-scenarios"
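A hedged usage sketch of the new wait option; the option name comes from this PR, but the exact shape, units, and surrounding contract API details here are assumptions:

// send a TX through the load-balanced PXE; the node that later serves
// the receipt query may not have seen the TX yet
const tx = token.methods.transfer(recipient, amount).send();
// tolerate "dropped" receipts for a while, giving p2p propagation time to catch up
await tx.wait({ ignoreDroppedReceiptsFor: 30_000 /* duration; unit assumed */ });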

Fix #9713
Fix #9883

@just-mitch just-mitch linked an issue Nov 13, 2024 that may be closed by this pull request
@just-mitch just-mitch marked this pull request as ready for review November 13, 2024 02:13
@just-mitch just-mitch added the network-all (Run this CI job.) and hotfix (A PR/issue that needs to be cherrypicked back to the RC) labels on Nov 13, 2024
@@ -78,7 +78,7 @@ http://{{ include "aztec-network.fullname" . }}-boot-node-0.{{ include "aztec-ne
{{- if .Values.validator.externalTcpHost -}}
http://{{ .Values.validator.externalTcpHost }}:{{ .Values.validator.service.nodePort }}
{{- else -}}
http://{{ include "aztec-network.fullname" . }}-validator-0.{{ include "aztec-network.fullname" . }}-validator.{{ .Release.Namespace }}.svc.cluster.local:{{ .Values.validator.service.nodePort }}
just-mitch (Author):

Adjusts validatorUrl to load balance over all validators.

# the PXE has its node set to the validator service,
# and that's all we need to know to bootstrap,
# since it will get a good node ENR from whatever node the PXE connects to.
echo "{{ include "aztec-network.pxeUrl" . }}" > /shared/pxe_url
just-mitch (Author):

As noted in the comment above, this looping is really just a safety/sanity check, since as long as one validator is up, the PXE should be routed to it.

just-mitch (Author):

Smallest set where we can kill 25% (i.e., 4 validators, the smallest count where 25% is a whole validator).

@@ -3,4 +3,4 @@ set -eu

cd "$(dirname "${BASH_SOURCE[0]}")"

helm upgrade metrics . -n metrics --values "./values/prod.yaml" --install --create-namespace --atomic $@
just-mitch (Author):

Happy to revert this change, but removing --atomic was necessary for debugging: with --atomic, a failed upgrade is rolled back automatically, which removes the resources you'd want to inspect.

[SequencerState.PUBLISHING_BLOCK_TO_PEERS]: 5 + this.maxTxsPerBlock * 2, // if we take 5 seconds to create block, then 4 transactions at 2 seconds each
[SequencerState.WAITING_FOR_ATTESTATIONS]: 5 + this.maxTxsPerBlock * 2 + 3, // it shouldn't take 3 seconds to publish to peers
[SequencerState.PUBLISHING_BLOCK]: 5 + this.maxTxsPerBlock * 2 + 3 + 5, // wait 5 seconds for attestations
[SequencerState.WAITING_FOR_TXS]: 5,
just-mitch (Author):

Had to bump these times, as we were often trying to transition to "WAITING_FOR_TXS" at ~3.5 seconds into the slot, which resulted in no blocks getting produced, which in turn caused reorgs, due at least in part to lack of support for #9404.

}

const txHashes = validTxs.map(tx => tx.getTxHash());

this.isFlushing = false;
this.log.verbose('Collecting attestations');
this.log.info('Collecting attestations');
just-mitch (Author):

@ludamad I'm actually not seeing these logs in loki even with debug on. Thoughts? I'm not sure if it's all verbose logs, or what. That was part of the impetus for upping the level.

just-mitch (Author):

I filed #9958

Collaborator:

Unfortunately I am not sure. We should write good tests along with moving off the node debug module to a proper logger (winston? pino?).

@just-mitch just-mitch force-pushed the 9883-chore-bump-loki-space branch 2 times, most recently from ba1c2ab to 8efe7ea on November 14, 2024 17:23
@just-mitch just-mitch enabled auto-merge (squash) November 15, 2024 02:12
@just-mitch just-mitch removed the network-all (Run this CI job.) label on Nov 15, 2024
@just-mitch just-mitch enabled auto-merge (squash) November 15, 2024 17:16
@stevenplatt stevenplatt self-requested a review November 20, 2024 14:54
stevenplatt (Collaborator) commented:
Taking a look at this one now...

spalladino (Collaborator) left a comment:

Looked into all non-k8s stuff

Comment on lines +339 to +354
export async function awaitL2BlockNumber(
  rollupCheatCodes: RollupCheatCodes,
  blockNumber: bigint,
  timeoutSeconds: number,
  logger: Logger,
) {
  logger.info(`Waiting for L2 Block ${blockNumber}`);
  let tips = await rollupCheatCodes.getTips();
  const endTime = Date.now() + timeoutSeconds * 1000;
  while (tips.pending < blockNumber && Date.now() < endTime) {
    logger.info(`At L2 Block ${tips.pending}`);
    await sleep(1000);
    tips = await rollupCheatCodes.getTips();
  }
  logger.info(`Reached L2 Block ${tips.pending}`);
}
spalladino (Collaborator):

We probably want to throw loudly if we reach the timeout without hitting the target block number.
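For example, the loop could be followed by a check like this (a sketch of the suggestion, not committed code):

if (tips.pending < blockNumber) {
  throw new Error(
    `Timed out after ${timeoutSeconds}s waiting for L2 block ${blockNumber}; pending is only ${tips.pending}`,
  );
}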

Comment on lines +508 to +511
.catch(err => {
  this.log.error(err);
  throw err;
});
spalladino (Collaborator):

Does the job queue handle jobs that throw ok? Either way, let's add some context to the error being logged.
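For instance (the message text is illustrative, not the wording that landed):

.catch(err => {
  this.log.error(`Error while processing block proposal: ${err}`);
  throw err;
});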

Comment on lines +565 to +568
.catch(err => {
  this.log.error(err);
  throw err;
});
spalladino (Collaborator):

Same here

Comment on lines +577 to +580
await this.node.sendTx(tx).catch(err => {
  this.log.error(err);
  throw err;
});
spalladino (Collaborator):

And here

Comment on lines +602 to +605
.catch(err => {
  this.log.error(err);
  throw err;
});
spalladino (Collaborator):

And here

spalladino (Collaborator):

I'm not a fan of bumping so many verbose to info; it'll make all tests much more verbose. But if we need this in order to see those logs in k8s tests, I'm fine with it until we move to a better logger.

if (this.state === SequencerState.STOPPED && force !== true) {
  this.log.warn(
    `Cannot set sequencer from ${this.state} to ${proposedState} as it is stopped. Set force=true to override.`,
  );
  return;
}
const secondsIntoSlot = getSecondsIntoSlot(this.l1GenesisTime, this.aztecSlotDuration);
spalladino (Collaborator):

Why not pass in the current slot number, and keep the getSecondsIntoSlot call here? Seems like it would reduce some verbosity by not having to manually call getSecondsIntoSlot before each setState.
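Roughly like this (the signature and the validation helper are assumed for illustration):

// callers would invoke e.g. this.setState(SequencerState.CREATING_BLOCK, slotNumber);
private setState(proposedState: SequencerState, currentSlotNumber: bigint, force = false): void {
  if (this.state === SequencerState.STOPPED && force !== true) {
    this.log.warn(`Cannot set sequencer from ${this.state} to ${proposedState} as it is stopped.`);
    return;
  }
  // derive timing here instead of at every call site
  const secondsIntoSlot = getSecondsIntoSlot(this.l1GenesisTime, this.aztecSlotDuration);
  this.checkTimingForTransition(proposedState, currentSlotNumber, secondsIntoSlot); // hypothetical helper
  this.state = proposedState;
}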

@@ -32,7 +32,7 @@ spec:
echo "L2 contracts initialized"
env:
- name: PXE_URL
value: {{ include "aztec-network.pxeUrl" . | quote }}
value: {{ include "aztec-network.bootNodeUrl" . | quote }}
Member:

why has this changed?


import { exec, spawn } from 'child_process';
import path from 'path';
import { promisify } from 'util';
import { z } from 'zod';

import type { RollupCheatCodes } from '../../../aztec.js/src/utils/cheat_codes.js';
Maddiaa0 (Member) commented Nov 20, 2024:

turbonit: dodgy update import here
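A more conventional alternative would import from the package's public entry point rather than reaching across package boundaries into src (the exact export path is an assumption):

import type { RollupCheatCodes } from '@aztec/aztec.js'; // assumed re-export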

@@ -1,3 +1,6 @@
nameOverride: null
fullnameOverride: null

global:
# When deploying, override the namespace to where spartan will deploy to, this will apply all chaos experiments to all pods within that namespace
# run deployment with --values global.namespace=your-namespace
Collaborator:

Nice, thanks for including this note. It wasn't clear to me at first why the values files used different namespace refs.

@@ -9,7 +9,6 @@ telemetry:
otelCollectorEndpoint: http://metrics-opentelemetry-collector.metrics:4318

validator:
debug: "aztec:*,-aztec:avm_simulator:*,-aztec:libp2p_service"
Collaborator:

Does this level of logging cause stability issues? (for my own understanding)

Labels
hotfix: A PR/issue that needs to be cherrypicked back to the RC
Development

Successfully merging this pull request may close these issues.

chore: bump loki space
feat: Add chaos ability to kill validators/boot-nodes
5 participants