Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core/vm: speed up push and interpreter loop #30662

Merged
merged 5 commits into from
Oct 30, 2024

Conversation

holiman
Copy link
Contributor

@holiman holiman commented Oct 23, 2024

Looking at the cpu profile of a burntpix benchmark, I noticed that a lot of time was spent in gas-used, in the interpreter loop. It's an actual call (not inlined), which explicitly wants to be ignored by tracing ("tracing.GasChangeIgnored"), so it can be safely and simply inlined.

The other change is in pushX. These also do a call to common.RightPadBytes. I replaced that by a doing a corresponding Lsh on the u256 if needed. Note: it's needed only to make the stack output look right, for fuzzers. It technically doesn't matter what we put there: if code ends on a pushdata immediate, nothing will consume the stack element. We could just as well just ignore it, if we didn't care about fuzzers (which I do).

Seems quite a lot faster on burntpix, according to my runs.

This PR:

EVM gas used:    5642735088
execution time:  34.84609475s
allocations:     915683
allocated bytes: 175334088
EVM gas used:    5642735088
execution time:  36.671958278s
allocations:     915701
allocated bytes: 175340528

Master

EVM gas used:    5642735088
execution time:  49.349209526s
allocations:     915684
allocated bytes: 175333368
EVM gas used:    5642735088
execution time:  46.581006598s
allocations:     915681
allocated bytes: 175330728

@holiman
Copy link
Contributor Author

holiman commented Oct 23, 2024

Btw if we want this, we can also remove tracing.GasChangeIgnored. cc @s1na ?

@holiman holiman force-pushed the speedy_interpreter_loop branch from 63d2422 to 8c582ab Compare October 23, 2024 11:42
@holiman
Copy link
Contributor Author

holiman commented Oct 23, 2024

Added a benchmark

goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/core/vm/runtime
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
                        │ loop.master  │            loop.master.2             │
                        │    sec/op    │    sec/op     vs base                │
SimpleLoop/loop2-100M-8   846.1m ± 15%   645.2m ± 33%  -23.74% (p=0.020 n=20)

                        │  loop.master  │          loop.master.2          │
                        │     B/op      │     B/op       vs base          │
SimpleLoop/loop2-100M-8   6.365Ki ± 13%   6.570Ki ± 17%  ~ (p=0.640 n=20)

                        │ loop.master │        loop.master.2         │
                        │  allocs/op  │ allocs/op   vs base          │
SimpleLoop/loop2-100M-8    46.00 ± 4%   46.00 ± 4%  ~ (p=0.576 n=20)


// Missing bytes: pushByteSize - len(pushData)
if missing := pushByteSize - (end - start); missing > 0 {
Copy link
Contributor

@jwasinger jwasinger Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if we could remove this if from the hot-loop and adjust the tracers correspondingly, it's (understandably) a micro optimization that doesn't bring measurable performance benefits:

                                       │ bench_if_removed.txt │      bench_if_not_removed.txt      │                                                                                                                
                                       │        sec/op        │   sec/op     vs base               │                                                                                                                
SimpleLoop/staticcall-identity-100M-12            162.9m ± 0%   160.7m ± 2%  -1.35% (p=0.009 n=10)                                                                                                                  
SimpleLoop/call-identity-100M-12                  183.9m ± 2%   180.8m ± 1%  -1.64% (p=0.001 n=10)
SimpleLoop/loop-100M-12                           182.0m ± 1%   182.0m ± 0%       ~ (p=0.853 n=10)
SimpleLoop/loop2-100M-12                          200.7m ± 0%   208.0m ± 5%  +3.68% (p=0.000 n=10)
SimpleLoop/call-nonexist-100M-12                  496.8m ± 2%   508.8m ± 9%  +2.41% (p=0.011 n=10)
SimpleLoop/call-EOA-100M-12                       142.3m ± 0%   141.1m ± 5%       ~ (p=0.143 n=10)
SimpleLoop/call-reverting-100M-12                 323.7m ± 0%   324.9m ± 1%       ~ (p=0.123 n=10)
geomean                                           220.1m        220.9m       +0.37%

core/vm/interpreter.go Outdated Show resolved Hide resolved
core/vm/interpreter.go Outdated Show resolved Hide resolved
holiman and others added 3 commits October 24, 2024 09:03
Co-authored-by: Sina M <1591639+s1na@users.noreply.github.com>
@fjl fjl added this to the 1.14.12 milestone Oct 30, 2024
@fjl fjl merged commit 25bc077 into ethereum:master Oct 30, 2024
3 checks passed
holiman added a commit that referenced this pull request Nov 19, 2024
Looking at the cpu profile of a burntpix benchmark, I noticed that a lot
of time was spent in gas-used, in the interpreter loop. It's an actual
call (not inlined), which explicitly wants to be ignored by tracing
("tracing.GasChangeIgnored"), so it can be safely and simply inlined.

The other change is in `pushX`. These also do a call to
`common.RightPadBytes`. I replaced that by a doing a corresponding `Lsh`
on the `u256` if needed. Note: it's needed only to make the stack output
look right, for fuzzers. It technically doesn't matter what we put
there: if code ends on a pushdata immediate, nothing will consume the
stack element. We could just as well just ignore it, if we didn't care
about fuzzers (which I do).

Seems quite a lot faster on burntpix, according to my runs. 

This PR:
```
EVM gas used:    5642735088
execution time:  34.84609475s
allocations:     915683
allocated bytes: 175334088
```
```
EVM gas used:    5642735088
execution time:  36.671958278s
allocations:     915701
allocated bytes: 175340528
```

Master
```
EVM gas used:    5642735088
execution time:  49.349209526s
allocations:     915684
allocated bytes: 175333368
```
```
EVM gas used:    5642735088
execution time:  46.581006598s
allocations:     915681
allocated bytes: 175330728
```

---------

Co-authored-by: Sina M <1591639+s1na@users.noreply.github.com>
Co-authored-by: Felix Lange <fjl@twurst.com>
zfy0701 pushed a commit to sentioxyz/go-ethereum that referenced this pull request Dec 3, 2024
Looking at the cpu profile of a burntpix benchmark, I noticed that a lot
of time was spent in gas-used, in the interpreter loop. It's an actual
call (not inlined), which explicitly wants to be ignored by tracing
("tracing.GasChangeIgnored"), so it can be safely and simply inlined.

The other change is in `pushX`. These also do a call to
`common.RightPadBytes`. I replaced that by a doing a corresponding `Lsh`
on the `u256` if needed. Note: it's needed only to make the stack output
look right, for fuzzers. It technically doesn't matter what we put
there: if code ends on a pushdata immediate, nothing will consume the
stack element. We could just as well just ignore it, if we didn't care
about fuzzers (which I do).

Seems quite a lot faster on burntpix, according to my runs. 

This PR:
```
EVM gas used:    5642735088
execution time:  34.84609475s
allocations:     915683
allocated bytes: 175334088
```
```
EVM gas used:    5642735088
execution time:  36.671958278s
allocations:     915701
allocated bytes: 175340528
```

Master
```
EVM gas used:    5642735088
execution time:  49.349209526s
allocations:     915684
allocated bytes: 175333368
```
```
EVM gas used:    5642735088
execution time:  46.581006598s
allocations:     915681
allocated bytes: 175330728
```

---------

Co-authored-by: Sina M <1591639+s1na@users.noreply.github.com>
Co-authored-by: Felix Lange <fjl@twurst.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants