Alpha4 #525

Merged

merged 139 commits into from Oct 15, 2024

Conversation

@blxdyx (Collaborator) commented Sep 29, 2024

No description provided.

taratorio and others added 30 commits September 6, 2024 12:05
…tech#11887)

Previously I tried to re-use `fixCanonicalChain` in the astrid stage for
handling fork choice updates; however, after more testing I realised that
was wrong, since it causes the below issue upon unwinds:
```
INFO[09-05|08:27:38.289] [4/6 Execution] Unwind Execution         from=11588734 to=11588733
EROR[09-05|08:27:38.289] Staged Sync                              err="[4/6 Execution] domains.GetDiffset(11588734, 0x04f1528479c5efae05ac05e38e1402c4e59155049ff44ce6bf5302acb2c25fdb): not found"
```

That is due to the fact that `fixCanonicalChain` updates the canonical hash
as it traverses the header chain backwards. In the context of updating the
fork choice, this is something that should be done only after the Unwind
has succeeded, otherwise we get the above error when unwinding
execution.

This also causes the second error below, which is a result of the Execution
Unwind failing and rolling back the RwTx of the stage loop (the inserted
block changes get lost). This situation will be further stabilised when
working on erigontech#11533

This PR fixes the first problem by creating a new function, `connectTip`,
which traverses the header chain backwards and collects new nodes and bad
nodes, but does not update the canonical hash while doing so. Instead, the
canonical hashes get updated in `updateForkChoiceForward` after the
unwind has been successfully executed.
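
For illustration, here is a minimal, self-contained sketch of the `connectTip` idea; the types and helpers below are hypothetical stand-ins, not Erigon's actual API:

```go
package sketch

import "fmt"

// Hypothetical, simplified stand-ins for the header/DB machinery.
type Header struct {
	Number     uint64
	Hash       [32]byte
	ParentHash [32]byte
}

type chainDB interface {
	CanonicalHash(number uint64) ([32]byte, bool)        // current canonical hash at a height, if any
	Header(hash [32]byte, number uint64) (*Header, bool) // header lookup by hash+number
}

// connectTip walks backwards from the new tip, collecting headers that are not
// yet canonical and the canonical hashes they would replace, WITHOUT updating
// the canonical hash table. The caller updates canonical hashes only after the
// unwind has succeeded.
func connectTip(db chainDB, tip *Header) (newNodes []*Header, badHashes [][32]byte, err error) {
	for h := tip; ; {
		canonical, ok := db.CanonicalHash(h.Number)
		if ok && canonical == h.Hash {
			return newNodes, badHashes, nil // reached the fork point
		}
		if ok {
			badHashes = append(badHashes, canonical) // old canonical block displaced by the fork
		}
		newNodes = append(newNodes, h)
		parent, found := db.Header(h.ParentHash, h.Number-1)
		if !found {
			return nil, nil, fmt.Errorf("missing parent %x at height %d", h.ParentHash, h.Number-1)
		}
		h = parent
	}
}
```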

Full Logs
```
INFO[09-05|08:27:36.210] [2/6 PolygonSync] forward                progress=11588734
DBUG[09-05|08:27:38.222] [bridge] processing new blocks           from=11588729 to=11588729 lastProcessedBlockNum=11588720 lastProcessedBlockTime=1725521200 lastProcessedEventID=2702
DBUG[09-05|08:27:38.222] [sync] inserted blocks                   len=1 duration=1.882458ms
DBUG[09-05|08:27:38.286] [bridge] processing new blocks           from=11588734 to=11588734 lastProcessedBlockNum=11588720 lastProcessedBlockTime=1725521200 lastProcessedEventID=2702
DBUG[09-05|08:27:38.287] [sync] inserted blocks                   len=1 duration=945.75µs
DBUG[09-05|08:27:38.287] [bor.heimdall] synchronizing spans...    blockNum=11588734
DBUG[09-05|08:27:38.287] [bridge] synchronizing events...         blockNum=11588734 lastProcessedBlockNum=11588720
INFO[09-05|08:27:38.287] [2/6 PolygonSync] update fork choice     block=11588734 age=1s hash=0x04f1528479c5efae05ac05e38e1402c4e59155049ff44ce6bf5302acb2c25fdb
INFO[09-05|08:27:38.287] [2/6 PolygonSync] new fork - unwinding and caching fork choice unwindNumber=11588733 badHash=0x3e1f67072996aec05806d298de3bb281bdcf23566da0dc254c4670ac385768d4 cachedTipNumber=11588734 cachedTipHash=0x04f1528479c5efae05ac05e38e1402c4e59155049ff44ce6bf5302acb2c25fdb cachedNewNodes=1
DBUG[09-05|08:27:38.289] UnwindTo                                 block=11588733 block_hash=0x3e1f67072996aec05806d298de3bb281bdcf23566da0dc254c4670ac385768d4 err=nil stack="[sync.go:171 stage_polygon_sync.go:1425 stage_polygon_sync.go:1379 stage_polygon_sync.go:1538 stage_polygon_sync.go:494 stage_polygon_sync.go:175 default_stages.go:479 sync.go:531 sync.go:410 stageloop.go:249 stageloop.go:101 asm_arm64.s:1222]"
DBUG[09-05|08:27:38.289] [2/6 PolygonSync] DONE                   in=2.078894042s
INFO[09-05|08:27:38.289] [4/6 Execution] Unwind Execution         from=11588734 to=11588733
EROR[09-05|08:27:38.289] Staged Sync                              err="[4/6 Execution] domains.GetDiffset(11588734, 0x04f1528479c5efae05ac05e38e1402c4e59155049ff44ce6bf5302acb2c25fdb): not found"
INFO[09-05|08:27:38.792] [4/6 Execution] Unwind Execution         from=11588734 to=11588733
INFO[09-05|08:27:38.792] aggregator unwind                        step=24 txUnwindTo=38128876 stepsRangeInDB="accounts:0.4, storage:0.4, code:0.4, commitment:0.0, logaddrs: 0.4, logtopics: 0.4, tracesfrom: 0.4, tracesto: 0.4"
DBUG[09-05|08:27:38.818] [1/6 OtterSync] DONE                     in=2.958µs
INFO[09-05|08:27:38.818] [2/6 PolygonSync] forward                progress=11588733
INFO[09-05|08:27:38.818] [2/6 PolygonSync] new fork - processing cached fork choice after unwind cachedTipNumber=11588734 cachedTipHash=0x04f1528479c5efae05ac05e38e1402c4e59155049ff44ce6bf5302acb2c25fdb cachedNewNodes=1
DBUG[09-05|08:27:39.083] [bor.heimdall] block producers tracker observing span id=1812
DBUG[09-05|08:27:43.532] Error while executing stage              err="[2/6 PolygonSync] stopped: parent's total difficulty not found with hash 04f1528479c5efae05ac05e38e1402c4e59155049ff44ce6bf5302acb2c25fdb and height 11588734: <nil>"
EROR[09-05|08:27:43.532] [2/6 PolygonSync] stopping node          err="parent's total difficulty not found with hash 04f1528479c5efae05ac05e38e1402c4e59155049ff44ce6bf5302acb2c25fdb and height 11588734: <nil>"
DBUG[09-05|08:27:43.534] rpcdaemon: the subscription to pending blocks channel was closed 
INFO[09-05|08:27:43.534] Exiting... 
INFO[09-05|08:27:43.535] HTTP endpoint closed                     url=127.0.0.1:8545
INFO[09-05|08:27:43.535] RPC server shutting down 
```
I used a simple "semaphore" to limit the number of goroutines to 4.
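
For context, the "semaphore" mentioned above is just the usual buffered-channel pattern; a minimal sketch (not the actual code from the PR):

```go
package sketch

import "sync"

// runLimited runs work on every item with at most 4 goroutines in flight,
// using a buffered channel as a counting semaphore.
func runLimited(items []int, work func(int)) {
	sem := make(chan struct{}, 4) // capacity 4 == max concurrent workers
	var wg sync.WaitGroup
	for _, item := range items {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks while 4 workers are running
		go func(it int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			work(it)
		}(item)
	}
	wg.Wait()
}
```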

Co-authored-by: shota.silagadze <shota.silagadze@taal.com>
…1906)

We don't need to process attestations from gossip if the committee index
associated with the attestation is not subscribed or doesn't require
aggregation.
We are trying to optimise `AggregateAndProofService`. After profiling
the service, I see that most of the CPU time is spent on signature
verification. From the graph, the function took 6.6% (74 seconds) of
all the time overall (not just execution time; the percentage would be
much higher if we counted only CPU time), see the screenshot:

<img width="1470" alt="Screenshot 2024-09-01 at 10 28 38"
src="https://github.com/user-attachments/assets/929ce103-2bf3-43d9-a0fa-ca504e4b58bb">


Now we aggregate all the signatures and verify them together with the
`bls.VerifyMultipleSignatures` function in an async way, running the
final callbacks only if verification succeeds. I basically removed all
the code where we verified those three signatures individually and
instead gathered them for later verification. Profiling that gives the
following output:

<img width="1468" alt="Screenshot 2024-09-01 at 10 44 31"
src="https://github.com/user-attachments/assets/abb842a3-0b4f-4640-8a88-791a2d0af62b">

Now most of the time is spent on public key aggregation when verifying
validator aggregated signatures, and I don't think there is a way to
optimise that. Since most of the time goes to `NewPublicKeyFromBytes`,
we could perhaps cache constructed keys, but I think bls already does
that. So, that's as good as it gets.
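
For reference, the shape of the batching change is roughly the sketch below; the types and the verification function signature are assumptions for illustration, not the real `bls` package API:

```go
package sketch

import "errors"

// sigJob is one deferred signature check plus the work to run if the whole
// batch verifies.
type sigJob struct {
	signature []byte
	publicKey []byte
	message   []byte
	onSuccess func()
}

// verifyBatch gathers all deferred checks and verifies them in one
// multi-signature call; only if the whole batch passes do the callbacks run.
func verifyBatch(jobs []sigJob, verifyMultiple func(sigs, pks, msgs [][]byte) (bool, error)) error {
	sigs := make([][]byte, 0, len(jobs))
	pks := make([][]byte, 0, len(jobs))
	msgs := make([][]byte, 0, len(jobs))
	for _, j := range jobs {
		sigs = append(sigs, j.signature)
		pks = append(pks, j.publicKey)
		msgs = append(msgs, j.message)
	}
	ok, err := verifyMultiple(sigs, pks, msgs)
	if err != nil {
		return err
	}
	if !ok {
		return errors.New("batch signature verification failed")
	}
	for _, j := range jobs {
		j.onSuccess()
	}
	return nil
}
```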

---------

Co-authored-by: shota.silagadze <shota.silagadze@taal.com>
Extended pprof read API to include: goroutine, threadcreate, heap,
allocs, block, mutex
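
These are the standard runtime profiles; a minimal sketch of serving them over HTTP with the standard library (Erigon's actual diagnostics wiring differs, this just shows which profiles are involved):

```go
package main

import (
	"log"
	"net/http"
	"net/http/pprof"
)

func main() {
	mux := http.NewServeMux()
	// Expose the same read-only profiles listed above.
	for _, name := range []string{"goroutine", "threadcreate", "heap", "allocs", "block", "mutex"} {
		mux.Handle("/debug/pprof/"+name, pprof.Handler(name))
	}
	log.Fatal(http.ListenAndServe("localhost:6060", mux))
}
```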
…r amoy network (erigontech#11902)

[Polygon] Bor: Added Ahmedabad HF related configs and block number for
amoy network
This is the Ahmedabad block number -
[11865856](https://amoy.polygonscan.com/block/countdown/11865856)

PR in bor - [bor#1324](maticnetwork/bor#1324)
`go1.23.1` has been released, which means it's time to drop `go1.21` support.
Hi there, I found that the old test case has been deleted from the repo. So
here is my advice:

1. Use the old link with a commit hash
2. Delete the link

I'm using the first solution now.
- move logic to `state/sqeeze.go`
- enable code.kv compression - values only
- increase MaxLimit of compress patterns - because it gives a better
compression ratio for code.kv with a smaller dictionary
- Create `heimdall.Reader` for future use in the `bor_*` API
- Make `AssembleReader` and `NewReader` leaner by not requiring the full
`BorConfig`
Support `bor_*` RPCs when using `polygon.sync` and running rpcdaemon with a
datadir.
…rigontech#11929)

Ran into an issue since we started pruning total difficulty.
```
EROR[09-09|10:58:03.057] [2/6 PolygonSync] stopping node          err="parent's total difficulty not found with hash 9334099de5d77c0d56afefde9985d44f8b4416db99dfe926908d5501fa8dbd9e and height 11736178: <nil>
```

It happened for checkpoint
[9703](https://heimdall-api-amoy.polygon.technology/checkpoints/9703).
Our start block was in the middle of the checkpoint range, which meant we
had to fetch all 8k blocks in this checkpoint to verify the checkpoint
root hash when receiving blocks from the peer.

The current logic will attempt to insert all these 8k blocks and will
fail with a missing parent td error, because we only keep the last 1000
parent td records.

This PR fixes this by enhancing the block downloader to not re-insert
blocks behind the `start` block. This solves the parent td error and
also saves some unnecessary inserts during the first waypoint processing
on startup.
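
Conceptually, the downloader change amounts to filtering out already-persisted blocks before inserting; a hypothetical sketch (names are illustrative, not the real downloader code):

```go
package sketch

// Block is a minimal stand-in for a downloaded block.
type Block struct {
	Number uint64
}

// filterForInsert drops fetched blocks behind the start block so they are not
// re-inserted; their parent td records may already have been pruned.
func filterForInsert(fetched []*Block, start uint64) []*Block {
	var toInsert []*Block
	for _, b := range fetched {
		if b.Number < start {
			continue // behind start: already persisted, skip re-insert
		}
		toInsert = append(toInsert, b)
	}
	return toInsert
}
```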
On an empty request, we see the error `can't find blockNumber by txnID=1235`.
…ontech#11873)

Comment out the docker-build-check job as mentioned in issue
[11872](erigontech#11872) -- it will
save us 5-6 minutes of waiting on the routine check for each workflow
run (faster PR checks, etc).

Get rid of "skip-build-cache", which has been removed since v5.
…ch#11938)

Switch to Go builder 1.23.1 and
introduce Docker provenance attestation and SBOM,
which should increase the Docker image score to A.
More issues surfaced on chain tip when testing astrid:
1. context deadline exceeded when requesting new block event at tip from
peer - can happen, safe to ignore event and continue instead of
crashing:
```
[EROR] [09-06|03:41:00.183] [2/6 PolygonSync] stopping node          err="await *eth.BlockHeadersPacket66 response interrupted: context deadline exceeded"
```
2. Noticed we do not penalise peers for penalisable errors when
calling `FetchBlocks` - added that in
3. We got another error that crashed the process,
`ErrNonSequentialHeaderNumbers` - it is safe to ignore the new block event
and continue if this happens:
```
EROR[09-05|20:26:35.141] [2/6 PolygonSync] stopping node          err="non sequential header numbers in fetch headers response: current=11608859, expected=11608860"
```
4. Added all other p2p errors that may happen and are safe to ignore
during tip event processing (see the sketch after this list)
5. Added debug logging for better visibility into chain tip events
6. Fixed a missing check for whether we have already processed a new block
event (i.e. if its hash is already contained in the canonical chain
builder)
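
A sketch of the handling described in points 1-4: transient p2p errors cause the current new-block event to be dropped instead of stopping the node. The error variables below are illustrative placeholders, not Erigon's exact error values:

```go
package sketch

import (
	"context"
	"errors"
	"log/slog"
)

// Illustrative placeholders for the errors that are safe to ignore at tip.
var (
	errNonSequentialHeaders = errors.New("non sequential header numbers in fetch headers response")
	errResponseInterrupted  = errors.New("await response interrupted")
)

// processTipEvent runs the handler for one new-block event. Transient p2p
// errors are logged and swallowed so the sync loop keeps running; anything
// else is propagated and stops the stage.
func processTipEvent(ctx context.Context, handle func(context.Context) error) error {
	err := handle(ctx)
	switch {
	case err == nil:
		return nil
	case errors.Is(err, context.DeadlineExceeded),
		errors.Is(err, errNonSequentialHeaders),
		errors.Is(err, errResponseInterrupted):
		slog.Debug("ignoring transient p2p error at tip", "err", err)
		return nil
	default:
		return err
	}
}
```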
Move Astrid bridge functions to their own gRPC server and client so they do
not rely on the existing block reader infrastructure.
This PR for erigontech#11417 includes:
1. splitting segments into dirtySegments and visibleSegments
2. dirtySegments are updated in the background and are not accessible to the app
3. dirtySegments are added to visibleSegments when:
    - there's no gap/overlap/garbage
    - all types of segments are created and indexed at that height
4. add unit test: `TestCalculateVisibleSegments` (a simplified sketch of the
visibility rule follows this list)
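
A simplified sketch of the visibility rule (types and the function below are illustrative, not the actual snapshot code):

```go
package sketch

import "sort"

// segment is a minimal stand-in for a snapshot file covering [from, to).
type segment struct {
	from, to uint64
	indexed  bool // all indices for this segment have been built
}

// calculateVisibleSegments returns the longest contiguous, fully indexed
// prefix of the dirty segments; only this prefix is exposed to the app, while
// the rest stays "dirty" and is updated in the background.
func calculateVisibleSegments(dirty []segment) []segment {
	if len(dirty) == 0 {
		return nil
	}
	sort.Slice(dirty, func(i, j int) bool { return dirty[i].from < dirty[j].from })
	visible := make([]segment, 0, len(dirty))
	next := dirty[0].from
	for _, s := range dirty {
		if s.from != next || !s.indexed {
			break // gap, overlap, or missing index: stop extending the visible range
		}
		visible = append(visible, s)
		next = s.to
	}
	return visible
}
```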

---------

Co-authored-by: lupin012 <58134934+lupin012@users.noreply.github.com>
Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
Co-authored-by: Ilya Mikheev <54912776+JkLondon@users.noreply.github.com>
Co-authored-by: JkLondon <ilya@mikheev.fun>
Co-authored-by: shashiy <shaashiiy@gmail.com>
Co-authored-by: Elias Rad <146735585+nnsW3@users.noreply.github.com>
Co-authored-by: awskii <awskii@users.noreply.github.com>
Co-authored-by: blxdyx <125243069+blxdyx@users.noreply.github.com>
Co-authored-by: Giulio rebuffo <giulio.rebuffo@gmail.com>
Co-authored-by: Shota <silagadzeshota@gmail.com>
Co-authored-by: shota.silagadze <shota.silagadze@taal.com>
Co-authored-by: Dmytro Vovk <vovk.dimon@gmail.com>
Co-authored-by: Massa <massarinoaa@gmail.com>
jsvisa and others added 21 commits September 25, 2024 13:44
 (erigontech#12066)

Align with go-ethereum's detailed out-of-gas (OOG) reason, ref:
https://github.com/ethereum/go-ethereum/blob/b018da9d02513ab13de50d63688c465798bd0e14/core/vm/interpreter.go#L273-L275

```go
dynamicCost, err = operation.dynamicGas(in.evm, contract, stack, mem, memorySize)
cost += dynamicCost // for tracing
if err != nil {
    return nil, fmt.Errorf("%w: %v", ErrOutOfGas, err)
}
if !contract.UseGas(dynamicCost, in.evm.Config.Tracer, tracing.GasChangeIgnored) {
    return nil, ErrOutOfGas
}
```

---------

Signed-off-by: jsvisa <delweng@gmail.com>
closes erigontech#11707

---------

Co-authored-by: JkLondon <ilya@mikheev.fun>
closes erigontech#11974

---------

Co-authored-by: JkLondon <ilya@mikheev.fun>
Co-authored-by: Dmytro Vovk <vovk.dimon@gmail.com>
Co-authored-by: JkLondon <ilya@mikheev.fun>
Made a formatting change in the interpreter so that the `make docker` task
runs (it doesn't run if the changes are only in the .github/ dir).
…age push workflow (erigontech#12115)

- revert changes to ci-cd-main-branch-docker-images.yml
- make those changes instead in ci-cd-main-branch-docker-images2.yml (a temp
copy) for quicker testing off PRs
@setunapo setunapo merged commit aba2217 into node-real:main Oct 15, 2024
7 of 8 checks passed