
register-based interpreter #485

Open
belm0 opened this issue Oct 27, 2022 · 65 comments
Labels
epic-generator (DSL-based code generator for the interpreter(s)), epic-registers (Ideas related to a register VM)

Comments

@belm0

belm0 commented Oct 27, 2022

this 3.10 fork is API and ABI compatible with CPython

I'm seeing 5% speedup on my app (long-running, large, heterogeneous, highly concurrent, async workload), which is in line with overall speedup they report across the benchmarks (closer to 9% is reported for ARM).

https://github.com/zq1997/RegCPython
https://dl.acm.org/doi/10.1145/3568973

I'd like to see how this composes with Python 3.11...

@belm0
Author

belm0 commented Oct 28, 2022

I had an exchange with the author. He believes that the RegCPython gains will be orthogonal to Python 3.11 gains, and will compose well. Rebasing to 3.11 would be a large undertaking, however, and he doesn't plan it soon due to finishing his PhD, etc.

@jneb

jneb commented Oct 29, 2022

I am convinced that a register-based interpreter is going to give a significant speedup, independent of the rest of the optimisations.
However, the rewrite itself is going to interfere a lot with the other activities if we are not careful. Here is a short list of what I saw when browsing through the design. I find it hard to see a good approach yet, but I hope this list helps in making a good plan.

  • Most of the interference results from the fact that opcodes get multiple parameters. Many parts of the interpreter implicitly assume opcodes have only one parameter:
  • dis: relatively easy
  • opcode pairs (e.g. LOAD_FAST__LOAD_FAST): looks doable
  • collecting stats: doable, maybe it already is independent
  • using the DSL to generate ceval.c: I think going to registers can only be done AFTER this step, to simplify the separation of work
  • specialization: should not be too hard
  • code generator: optimisations of the kind pop_and_jump versus jump_and_pop become unnecessary; otherwise these are unaffected in comparison to the expression compilation

I am pretty sure this list is not complete, so feel free to comment.

@iritkatriel
Collaborator

@vstinner experimented with this on 3.3: https://faster-cpython.readthedocs.io/registervm.html

@iritkatriel
Collaborator

My main question after reading the paper (but not looking at the code):

How does this not run out of registers? The instruction format is (opcode, oparg1, oparg2, oparg3), where each item is 8 bits. What if we need more than 256 registers? Since the co_consts are "loaded" into registers, an easy way to run out is to have > 256 constants in a function.

@jneb

jneb commented Nov 2, 2022

I can't stop thinking about this. I would like to have the speed of this without having to have so many dependencies.
So while having a nice bike ride through the dunes, I asked myself the question: WHY is a register based interpreter faster? Maybe thinking about this could lead to ideas giving speedup.
Here are my thoughts so far:

  • fewer opcodes per function: we achieve this effect now using combined opcodes. The price is parameter processing, of course
  • the most common words are LOAD_CONST and LOAD_FAST; both of these would be obsolete because the values are "already there"
  • no stack pointer increment/decrement: part of this is achieved by combining opcodes. The problem is the opcodes (conditional jump, etc.) that need to pop to do what they do
  • temp values and locals in the same space: you seldom need to store a result, because it is there already
  • compile time allocation: in a way, a stack is a run time allocator. That brings me to an idea: The compiler could give every opcode its stack level as a parameter! That way, each opcode can calculate the locations of input and output from that. No increment needed. The output location is a parameter too, which can be a stack level or a local variable.

That last idea could be called a "poor man's register allocator": just allocate enough locals to cover the highest stack level.
Honestly, I don't consider myself qualified to actually build this, but this last idea is so crazy that I feel tempted to try it!

@jneb

jneb commented Nov 2, 2022

How does this not run out of registers? The instruction format is (opcode, oparg1, oparg2, oparg3), where each item is 8 bits. What if we need more than 256 registers? Since the co_consts are "loaded" into registers, an easy way to run out is to have > 256 constants in a function.

You need some extension coding, a generalisation of EXTENDED_ARG. I bet that functions using so many values will not be fast anymore.
Another way is to have the compiler fix the problem (as "normal" compilers do: every processor has a limited set of registers, as far as I know :-) )
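For concreteness, such a generalisation of EXTENDED_ARG could look something like the following toy Python sketch. This is an assumption about how it might work, not RegCPython's actual encoding; the opcode numbers and helper names are invented for illustration (144 happens to be CPython's EXTENDED_ARG opcode number, reused here).

```python
# Toy sketch: widening an 8-bit register operand with an EXTENDED_ARG-style
# prefix, analogous to how CPython widens opargs today. Hypothetical encoding.
EXTENDED_ARG = 144  # CPython's opcode number, reused for illustration

def encode_reg(opcode, reg):
    """Encode (opcode, reg) as a list of (opcode, byte) pairs, emitting an
    EXTENDED_ARG prefix for every extra byte of the register index."""
    out = []
    for shift in (24, 16, 8):
        byte = (reg >> shift) & 0xFF
        if byte or out:          # emit prefix once any higher byte is non-zero
            out.append((EXTENDED_ARG, byte))
    out.append((opcode, reg & 0xFF))
    return out

def decode_reg(pairs):
    """Fold EXTENDED_ARG prefixes back into a full register index."""
    acc = 0
    for opcode, byte in pairs:
        acc = (acc << 8) | byte
        if opcode != EXTENDED_ARG:
            return opcode, acc

# Small register indices need no prefix; large ones grow the instruction,
# which is the slowdown jneb predicts for functions with many values.
assert encode_reg(100, 44) == [(100, 44)]
assert decode_reg(encode_reg(100, 300)) == (100, 300)
```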

@jneb

jneb commented Nov 2, 2022

I figured it out: the crazy idea above is possible. The following conceptual process could change an existing bytecode stream into a register based stream, leaving many parts of the interpreter as they are now.
This implies that many current optimisations (specialisation, for example) will still work:

  • associate each bytecode with an equivalent register based bytecode with extra parameters for each input and output. So ADD becomes ADD(inputreg, inputreg -> outputreg)
  • at the beginning of a function, fill the registers with the constants, parameters, locals, and stack levels in that order (some copy operations and some allocations)
  • replace each opcode by its register based equivalent: e.g. ADD becomes ADD(S-1, S -> S-1) if the stack level is S at that point
  • DUP_TOP becomes MOVE S -> S+1
  • DROP_TOP disappears
  • LOAD_CONST, LOAD_FAST, STORE_CONST (oops, see below), STORE_FAST all become register moves
  • after this is all done, a standard register optimization will eliminate a lot of MOVE opcodes

Here is a small example:
def square(x): return x ** 2
Is in bytecode (simplified):
LOAD_FAST x; LOAD_CONST 2; POW; RETURN
First transform the opcodes to register operations
Registers 0 ... 3 are filled with 2, x, <space for 2 stack levels>
Code becomes:
MOVE r1 -> r2; MOVE r0 -> r3; POW r2, r3 -> r2; RETURN r2
The optimisation trivially eliminates two registers:
POW r1, r0 -> r1; RETURN r1

There are of course many details, such as free variables, globals, default values, etc but I don't see why this wouldn't work.
This would make it easy to make a register based interpreter without having to rewrite big chunks of code. All you need is:

  • a compiler post processor that does the above operation
  • a piece of code to set up / allocate the register array at the beginning of the function
  • a transformed version of each byte code. With a bit of luck, that could be done partly automatically using the new DSL

(A day later I realized this could work with 3-address as well as 2-address instructions. The latter is probably going to be faster.)
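The conceptual process above can be sketched in a few lines of Python. This is a toy model only: the register layout (constants first, then locals, then stack slots) and the three-address opcode tuples are assumptions of the sketch, not CPython bytecode.

```python
# Toy version of the stack-to-register conversion sketched above, applied to
# the `square` example. Register layout (an assumption): r0 = const 2,
# r1 = local x, r2.. = stack slots. Each stack op becomes a MOVE or a
# three-address operation.

def stack_to_register(code, n_consts, n_locals):
    """Translate a tiny stack bytecode into three-address register code."""
    base = n_consts + n_locals   # first register used as a stack slot
    sp = base - 1                # stack pointer (empty stack)
    out = []
    for op, arg in code:
        if op == "LOAD_CONST":
            sp += 1; out.append(("MOVE", arg, sp))             # const reg -> stack slot
        elif op == "LOAD_FAST":
            sp += 1; out.append(("MOVE", n_consts + arg, sp))  # local reg -> stack slot
        elif op == "POW":
            out.append(("POW", sp - 1, sp, sp - 1)); sp -= 1   # in, in -> out
        elif op == "RETURN":
            out.append(("RETURN", sp)); sp -= 1
    return out

def eliminate_moves(code):
    """Forward-propagate MOVEs; safe here because consts/locals are
    read-only in this toy, so a copied source is never clobbered."""
    alias, out = {}, []
    for ins in code:
        if ins[0] == "MOVE":
            alias[ins[2]] = alias.get(ins[1], ins[1])
        else:
            out.append((ins[0],) + tuple(alias.get(r, r) for r in ins[1:]))
    return out

# def square(x): return x ** 2, as stack bytecode:
prog = [("LOAD_FAST", 0), ("LOAD_CONST", 0), ("POW", None), ("RETURN", None)]
print(eliminate_moves(stack_to_register(prog, n_consts=1, n_locals=1)))
# [('POW', 1, 0, 1), ('RETURN', 1)]  -- matches "POW r1, r0 -> r1; RETURN r1"
```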

@gvanrossum
Collaborator

I'd love to see that STORE_CONST instruction. :-)

@sweeneyde

How hard would it be to support CPython's current line-tracing behavior (e.g. PEP 626) with register-based bytecode? Looking at that RegCPython repo's test_sys_settrace.py, it looks like they disabled many of the tracing tests.

@gvanrossum
Collaborator

Do they support settrace at all? If not, that's cheating. It's what everybody looking for a quick win does -- drop tracing support, interpreter gets faster (Cinder does it too IIRC).

But it shouldn't be hard to support it the same way we currently do in 3.11 and main: the tracing flag is either 0 or 255, and or-ed in with the opcode; and case 255 has special handling code that does the tracing and then executes the original opcode.

@carljm

carljm commented Nov 3, 2022

I think with a register-based interpreter you'd also have fewer opcodes and so might end up needing somewhat more NOPs for locations in some cases in order to cover all the lines?

Cinder does it too IIRC

In the JIT we don't support tracing. Not worth it, if you want to trace just turn off the JIT. We haven't done anything to disturb tracing in the interpreter.

@gvanrossum
Collaborator

Ah, that makes sense, sorry. FWIW I feel that inserting NOPs just to trace all lines is being hyper-correct. But we're getting ahead of ourselves here.

If we're serious we should at least ask the paper author to sign the PSF CLA, so we can read the code with impunity (and maybe submit a pro-forma PR, even if we're never going to merge it).

@jneb

jneb commented Nov 3, 2022

I'd love to see that STORE_CONST instruction. :-)

I admit it. It was in the middle of the night that I typed that. I had to edit it to fix a lot of typos and bad grammar. But I'm happy to make a pull request for you...

@o11c

o11c commented Nov 3, 2022

Note that Java is an example of using stack-based bytecode but verifying/converting it to register-based before execution. But honestly it's better to just change the bytecode format entirely.

Further details I'll post on the N-address issue just linked, since that's the interesting question.

@kmod

kmod commented Nov 3, 2022

One wrinkle about a register-based interpreter is that it probably changes the semantics of when references are dropped. In today's stack interpreter references are mostly consumed immediately by bytecodes, so intermediates will be deallocated nearly-immediately. In a register-interpreter, I assume that the references stay alive until the register is reused, which can be arbitrarily later. To preserve existing semantics, maybe a bit of the register index could be used to signal clearing the register after the instruction? Or maybe something more clever, but it seems like there would be a tradeoff with performance. Or maybe it's not a big deal to change this particular behavior.

I wrote up a simple test and saw that RegCPython deallocates temporaries in a different order than CPython. I'm not claiming that this is a representative piece of code or anything, just a quick demonstration that this is a thing:

class I(int):
    def __add__(self, rhs):
        return I(int(self) + int(rhs))

    def __del__(self):
        print("del", self)

def main():
    (I(1) + I(2)) + (I(4) + I(8))

if __name__ == "__main__":
    main()
$ python3 t.py
del 1
del 2
del 4
del 8
del 3
del 12
del 15
$ RegCPython/python t.py 
del 1
del 2
del 4
del 3
del 8
del 12
del 15

(3 and 8 are switched)

I don't have any particular horse in this race, but I haven't seen this mentioned yet so I'm just throwing it out there.

@iritkatriel
Collaborator

They mention the GC issue in the paper, and it is indeed one of our concerns.

@kmod

kmod commented Nov 4, 2022

In the paper they only mention the possibility of extra memory usage, which I think misses the bigger issue.

What I'm talking about is the semantic change of when object finalizers are run. The author casually dismisses the issue as an implementation detail, but this is a very important implementation detail that people rely on. It's easy to be dismissive and say that it's the users' own fault for relying on this detail, but I think some humility is in order, because CPython itself includes code that depends on prompt finalization (WeakKeyDictionary comes to mind) that I have broken in the past by changing finalization timings.

Apologies if this is already on your radar, but as I said I haven't seen it mentioned.

@gvanrossum
Collaborator

Can’t hurt saying it again.

https://xkcd.com/1172/

@markshannon
Member

markshannon commented Nov 4, 2022

I think changing the order of finalization due to temporary references and local variables is something we can be reasonably relaxed about.

We already changed the order of finalization in 3.11 for some locals and temporaries.

global g = "goodbye"
def f1(a):
    global g
    a = None
    g = None
def f2(x, y):
    f1(x+y)
f2("he", "llo")

In 3.10 "hello" will be freed after "goodbye"
But in 3.11 "hello" will be freed before goodbye, as f2 doesn't hold a reference to "hello" during the call to f1.

@markshannon
Member

@iritkatriel
Have you given any thought to how generating the interpreter from instruction definitions with stack effects will impact this?

My thinking is that if we view "stack effects" as "input/output effects" then creating register machine instructions should be doable.
Taking the classic example of binary add:

inst(BINARY_ADD, (left right -- res)) {
     BODY
}

We can generate the register code easily:

TARGET(BINARY_ADD) {
    PyObject *left = locals[r1];
    PyObject *right = locals[r2];
    BODY
    SETLOCAL(r3, res);
}

But we need to account for the difference in reference counting. The simplest thing I can think of is to automatically insert increfs for all inputs, then manually remove redundant incref/decref pairs.
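A minimal sketch of that generation step, in Python: it is not the actual cases generator, and the GETLOCAL/SETLOCAL/TARGET boilerplate and register-operand naming (r1, r2, ...) are assumptions carried over from the example above, but it shows how a "left right -- res" stack effect plus automatic increfs could mechanically become register code.

```python
# Toy generator: turn a stack-effect signature like "left right -- res"
# into register-based case boilerplate, inserting the automatic increfs
# on inputs suggested above. Not CPython's real generator.

def gen_register_case(name, signature, body="BODY"):
    inputs, outputs = (part.split() for part in signature.split("--"))
    lines = [f"TARGET({name}) {{"]
    for i, var in enumerate(inputs, start=1):
        lines.append(f"    PyObject *{var} = GETLOCAL(r{i});")
        lines.append(f"    Py_INCREF({var});  /* BODY consumes its inputs */")
    lines.append(f"    {body}")
    for j, var in enumerate(outputs, start=len(inputs) + 1):
        lines.append(f"    SETLOCAL(r{j}, {var});")
    lines.append("}")
    return "\n".join(lines)

print(gen_register_case("BINARY_ADD", "left right -- res"))
```

A later pass (or the generator itself) would then remove redundant incref/decref pairs, as described above.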

@iritkatriel
Collaborator

iritkatriel commented Nov 4, 2022

Let's assume we go for a hybrid solution as we discussed in the meeting on Wednesday. So we still maintain the stack, but the compiler passes register ids to the opcodes that tell them where in the stack to look for inputs. We also add the consts and locals etc to the stack so that everything can be accessed via such references.

The stack can still control the lifetimes of temporaries as it does now if the compiler also tells each opcode how many items to pop (or how far to clear the stack). So if BINARY_ADD adds two consts, nothing is popped. But if it's a const and a temporary, then one thing is popped. Etc. The lifetimes of objects should be quite similar to what they are now if we can pull this off.

I wondered how many registers a function would typically need, so I updated count_opcodes to calculate an upper bound, based on number of locals, consts and stacksize, etc. Bottom line is that it will probably be rare to need more than 255 registers.
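That kind of upper bound is easy to reproduce for any function from public code-object attributes; the exact formula count_opcodes uses may differ, so treat this as an illustrative over-approximation.

```python
# Rough upper bound on registers a function could need under the scheme
# discussed here: locals + cell/free vars + consts + peak stack depth.
# This over-approximates; the point is that it rarely approaches 255.

def register_upper_bound(func):
    code = func.__code__
    return (code.co_nlocals
            + len(code.co_cellvars) + len(code.co_freevars)
            + len(code.co_consts)
            + code.co_stacksize)

def square(x):
    return x ** 2

print(register_upper_bound(square))  # a handful of registers for small functions
```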

@iritkatriel
Collaborator

the compiler also tells each opcode how many items to pop (or how far to clear the stack).

Alternatively, rather than each instruction cleaning up after itself, this can be implicit in the register allocation - when the next instruction overwrites something that the previous instruction was supposed to pop, it decrefs the previous value. Then we need to add a way for the compiler to explicitly shrink the stack when it needs to.
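The decref-on-overwrite idea can be modelled with a toy register file; here strings stand in for objects and a list of "released" values stands in for the decrefs, so this is an illustration of the lifetime behaviour, not real refcounting.

```python
# Toy model of "decref on overwrite": the register file releases the old
# owned reference only when a slot is reused, rather than each instruction
# cleaning up after itself.
released = []

class RegisterFile:
    def __init__(self, n):
        self.slots = [None] * n

    def write(self, index, value):
        old = self.slots[index]
        if old is not None:
            released.append(old)   # the deferred "decref" of the popped value
        self.slots[index] = value

    def clear_above(self, index):
        """Explicit shrink, for when the compiler must free the stack top."""
        for i in range(index, len(self.slots)):
            self.write(i, None)
```

The cost, as discussed above, is that a temporary stays alive until its slot is overwritten or explicitly cleared, which is exactly the lifetime change @kmod demonstrated.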

@iritkatriel
Collaborator

Another related point: the frame knows where the non-stack localsplus entries end (and the stack begins). So it can assume that any register below this index is holding a borrowed ref, and above that it owns the ref.

@zq1997

zq1997 commented Nov 4, 2022

Hi everyone, I'm the author of RegCPython.

I just learned today that RegCPython has been discussed here, and it looks like a stack-to-register architecture transition in real engineering faces a lot more issues than I expected.

@jneb The idea you mentioned is consistent with this article which proposes a binary extension that converts stack-based bytecode to register-based one. I finally decided to change the bytecode format entirely out of some considerations:

  • I want to compare the compilation speed of the two architectures.
  • Maybe a full register-based interpreter would be faster than a partial register-based one.
  • Generating the bytecode from the AST instead of the old stack-based bytecode, we have more information and more freedom. Perhaps I have a chance to make the new instruction set more elegant.
  • Uh, furthermore, implementing it in a different way would make my paper more likely to be accepted.

@sweeneyde When writing RegCPython, I put much effort into PEP 626 support (to be honest, it really gave me a headache). As in the following example, RegCPython can already emit NOP instructions at the appropriate locations to keep every line covered. @carljm This does result in RegCPython having more NOPs than CPython, but overall the cost is not that great. Also, RegCPython does not ignore checks for cframe.use_tracing, so these lines should be traced correctly. In short, RegCPython didn't cheat by dropping support for settrace.

Python 3.10.1 (heads/RegCPython-dirty:601a68f081, Jun 29 2022, 13:14:52) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dis import dis
>>> 
>>> def foo(a, b, c):
...     return a + b + c
... 
>>> dis(foo)
2        0 BINARY_ADD               0  1  3 (a  b  $0)
         4 BINARY_ADD               3  2  3 ($0  c  $0)
         8 RETURN_VALUE             3  0  0 ($0  .  .)
>>> 
>>> def bar(a, b, c):
...     return \
...             a \
...             + b + \
...             c
... 
>>> dis(bar)
3         0 NOP                      0  0  0 (.  .  .)

4         4 NOP                      0  0  0 (.  .  .)

3         8 BINARY_ADD               0  1  3 (a  b  $0)

5        12 NOP                      0  0  0 (.  .  .)

3        16 BINARY_ADD               3  2  3 ($0  c  $0)

2        20 RETURN_VALUE             3  0  0 ($0  .  .)

As for why RegCPython disabled 25 test cases out of the 228 test cases in test_sys_settrace.py, this is a bit of a frustrating story for me. I once tried my best to design a new and more elegant instruction set. In particular, I wanted to improve the opcode design for try-block setup and exception handling. Personally, I don't like CPython's design of duplicating the entire contents of the finally block.

Python code

from dis import dis

def foo():
    try:
        return 42
    finally:
        print('something')

dis(foo)

CPython v3.10.1 always duplicates the finally-body for cases with and without exceptions, respectively.

  4           0 SETUP_FINALLY            7 (to 16)

  5           2 POP_BLOCK

  7           4 LOAD_GLOBAL              0 (print)
              6 LOAD_CONST               1 ('something')
              8 CALL_FUNCTION            1
             10 POP_TOP
             12 LOAD_CONST               2 (42)
             14 RETURN_VALUE
        >>   16 LOAD_GLOBAL              0 (print)
             18 LOAD_CONST               1 ('something')
             20 CALL_FUNCTION            1
             22 POP_TOP
             24 RERAISE                  0

RegCPython tried to avoid this problem.

4         0 START_TRY                0  0  2 ($0*  .  <to 12>)

5         4 CALL_FINALLY             0  0  0 (.  .  <to 8>)
      >>  8 RETURN_VALUE             9  0  0 (42  .  .)

7     >> 12 LOAD_GLOBAL              0  0  6 (print  .  $6)
         16 MOVE_FAST               10  0  7 ('something'  .  $7)
         20 CALL_FUNCTION            1  6  6 (<0x1>  $6*  $6)
         24 END_FINALLY              0  0  0 (<0x0>  .  .)

However, I later found that this decision makes passing the following test case tricky, because RegCPython's trace result will be that it returns in the try-body (the line where return 2 is) instead of in the finally-body (the line where 4 is).

def test_return_through_finally(self):

    def func():
        try:
            return 2
        finally:
            4

    self.run_and_compare(func,
        [(0, 'call'),
         (1, 'line'),
         (2, 'line'),
         (4, 'line'),
         (4, 'return')])

If I don't skip this test case, it reports the error below.

AssertionError: events did not match expectation:
  (0, 'call')
  (1, 'line')
  (2, 'line')
  (4, 'line')
+ (2, 'line')
- (4, 'return')
?  ^

+ (2, 'return')
?  ^

When I found that I should modify frame_setlineno and related functions in Objects/frameobject.c accordingly, I realized that I had to face more problems. The relevant code is too complicated; to be honest, I still can't guarantee that I understand 100% of the original frame_setlineno. At one point I was going to give up support for frame_setlineno. In the end I chose to do my best to adapt, but I failed to achieve it perfectly. The adapted frame_setlineno still differs from CPython's in some situations related to finally-blocks, so several test cases in test_sys_settrace.py are skipped. Also, I used to think that trace-related features were implementation-defined behavior rather than part of Python's language standard, so RegCPython could show some differences.

All in all, maybe it would be better if I could go back in time and reverse my bad decision, or if I had more time to understand frame_setlineno and improve its register-based implementation. Unfortunately, I've been short of time and energy to deal with these engineering issues. For the same reason, RegCPython also lacks comments, so reading its source might be somewhat difficult.

@iritkatriel
Collaborator

@zq1997 - thank you for joining this discussion, and for the work you did on the paper.

I may have missed it, but I don't think you mention in the paper what the performance impact of eager decrefs was. Do you have that info? As @markshannon mentioned, changing the order of finalisation is not a big deal, but significantly increasing the lifetimes of objects is more of a problem.

Second - before we can look at your code we would need to know that you are happy to contribute it, or parts of it, to cpython. Typically people create a PR and sign the cpython CLA agreement. Are you able to do this? (We would not merge your PR into 3.10, but I believe that by creating it you would indicate that you are happy for us to do so).

@vstinner

vstinner commented Nov 4, 2022

@vstinner experimented with this on 3.3: https://faster-cpython.readthedocs.io/registervm.html

Hi,

Right, I did this experiment in 2012 and it was great :-) It made Python faster, by 10 or 20%, I don't recall exactly.

First, I didn't clear any register, ever. But the aggressive register allocator and some instructions writing the result into registers would implicitly clear the old values of registers. So sometimes, by luck, the Python semantics was respected.

When I started to communicate about my work, it was clear that I had to strictly respect the Python semantics of clearing objects the same way as the stack-based bytecode. I added a CLEAR_REG instruction. But then I got many bugs, since I didn't design my code with CLEAR_REG in mind. I had a lot of trouble with conditional code, loops (jumps), etc. I'm not a compiler designer, so I had a hard time trying to track register/variable lifetimes to decide if and when the code could be optimized. At some point, I gave up for lack of time and because I had too many bugs.

The nice thing with registers was that they sat alongside the constants in a frame, so constants would just become "read-only registers". LOAD_FAST just became redundant. I was too lazy to check that bytecode didn't write into "constants", but that's trivial to validate ;-)

Using registers, the number of instructions is lower. Executing a single instruction has a fixed cost. ceval.c is big; I'm not sure it fits well into the CPU L1 instruction cache. So having fewer instructions makes the code faster. It's that simple.

One complicated part of using only registers is that many instructions rely implicitly on the stack, especially for exception handling. It wasn't obvious to me how values magically land on the stack, nor how to re-design ceval.c differently.

My implementation was hack-ish: I kept all the stack-based bytecode and didn't touch the compiler. I only rewrote bytecode on demand to convert stack-based instructions to register-based instructions. Having the stack and registers at the same time made it more complicated to change the weird instructions pushing values on the stack (again, especially exception handling). But it was very pleasant to compare performance between stack and register bytecode! I was sure that Python was built the same way, since it was technically the exact same executable for both :-) Obviously, a final implementation would emit register-based bytecode directly from the compiler.

Another issue with my project is that I experimented with optimizations like moving "constants" or even expressions out of loops, but that also changed the Python semantics. I got pushback on these optimizations, even though there was a way to disable them. Maybe thanks to Python 3.11's adaptive bytecode, it might be possible to re-implement such interesting optimizations.

I tried these kinds of optimizations again in my "FAT Python" project, which emits faster code thanks to multiple assumptions; the assumptions were checked at function entry using "guards" implemented in C. If a single assumption was no longer true, the code was de-optimized. "FAT Python" is mostly based on my old "astoptimizer" project, with a small piece of C code for guards. This project introduced the first flavor of versioning to Python dicts: PEP 509.

I also proposed PEP 510, but I eventually rejected it myself, since I failed to prove any clear speedup using my design for guards. Another issue was that guards were only at the entry of a function; using unittest.mock in the middle of a function to override a built-in function was simply not supported. Adaptive bytecode works at the instruction level, so guards are put in just the right place: this way, the Python semantics is respected. Nice!

Well, tell me if I can help you to decipher my old fork of Python ;-)

@brandtbucher
Member

I understand your pain @zq1997... frame_setlineno causes us lots of headaches too.

If I could wave a magic wand and remove one feature from CPython...

@markshannon
Member

Thanks to @zq1997 for the insight into your work.

Personally, I don't like CPython's design of duplicating the entire contents of the finally block.

Is this related to the register VM, or some other reason?

@iritkatriel
Collaborator

The PEEK() and POKE() code you see is boilerplate generated by the cases generator. Only these lines are in the instruction definition, which the generator doesn't control:

            res = PyNumber_Positive(value);
            Py_DECREF(value);
            ERROR_IF(res == NULL, error);

I'm thinking of a change to the DSL where the decref of the inputs is also generated. You specify two blocks: the first can't have any ERROR_IF, the second can't have any DEOPT_IF, and the generator decrefs the inputs between them.

@gvanrossum
Collaborator

I'm thinking of a change to the DSL where the decref of the inputs is also generated. You specify two blocks: the first can't have any ERROR_IF, the second can't have any DEOPT_IF, and the generator decrefs the inputs between them.

Interesting. This would be easy to do in the generator, but it would be a lot of churn in bytecodes.c. Doable though.

@jneb

jneb commented Dec 8, 2022

@gvanrossum wrote:

So, I'm curious what you actually did in your semi-manual experiment.

My idea was to have the constants and arguments together in consecutive registers.
Then the binary instructions get three arguments:
two inputs and an output. If an input or output corresponds to a register, that number is given; otherwise a special value (e.g. -1) is used to indicate "from the stack".
That way, most LOAD_CONST, LOAD_FAST, and STORE_FAST instructions are replaced by a register parameter, with a corresponding reduction in INCREF and DECREF.
The conversion is easy to do at code generation time (I wrote that code, and it is fairly trivial). I tried a few programs and saw that almost all of the above bytecodes were replaced by register arguments. The exception is when complex expressions are compiled, where there are intermediate values on the stack.
Simple example:
a = b + c + d
would be, instead of

LOAD_FAST b
LOAD_FAST c
BINARY +
LOAD_FAST d
BINARY +
STORE_FAST a

the semi-register version:

BINARY+ b, c, <stack>
BINARY+ <stack>, d, a

@iritkatriel
Collaborator

My idea was to have the constants and arguments together in consecutive registers. Then the binary instructions get three arguments: two inputs and an output. If an input or output corresponds to a register, that number is given; otherwise a special value (e.g. -1) is used to indicate "from the stack". That way, most LOAD_CONST, LOAD_FAST, and STORE_FAST instructions are replaced by a register parameter, with a corresponding reduction in INCREF and DECREF. The conversion is easy to do at code generation time (I wrote that code, and it is fairly trivial). I tried a few programs and saw that almost all of the above bytecodes were replaced by register arguments. The exception is when complex expressions are compiled, where there are intermediate values on the stack.

Our plan is something like that, except that the stack is in the register space, so rather than -1 the compiler can just calculate the index of the top-of-stack register.

@jneb

jneb commented Dec 8, 2022

Our plan is something like that, except that the stack is in the register space, so rather than -1 the compiler can just calculate the index of the top-of-stack register.

That would save even more runtime computation. Cool!

@iritkatriel
Collaborator

This branch contains the change where each instruction is laid out with an extra codeword (for oparg2, oparg3, currently set to 0 and ignored by the eval loop). Benchmarking shows a 1% slowdown.

Part of this would be due to more EXTENDED_ARGs for jumps (we jump up to twice as far). We will get some of this back by using 16 bits for jump target offsets.

@jneb

jneb commented Dec 8, 2022

@iritkatriel wrote:

Our plan is something like that, except that the stack is in the register space, so rather than -1 the compiler can just calculate the index of the top-of-stack register.

One reason I proposed a separate stack mechanism is that refcounts in registers are treated differently than on the stack: otherwise some cleanup is needed at the end. But it is probably worth the trouble.

@gvanrossum
Collaborator

Our plan is something like that, except that the stack is in the register space, so rather than -1 the compiler can just calculate the index of the top-of-stack register.

This will save a branch (testing for -1) so it should be much better.

@jneb

jneb commented Dec 10, 2022

This will save a branch (testing for -1) so it should be much better.

Although I guess good branch predictors will not have a hard time predicting this case, I do agree: it is not only the branch that takes time, but also all the code around it, running in the inner loop and filling up cache space.
I was thinking of the subtle semantic change mentioned above, where the cleanup order is mixed up. In any case, even if some objects in the unused "virtual stack levels" hang around a while longer, the number of INCREF/DECREF pairs stays the same (at least in all cases I can imagine).

gvanrossum added a commit to gvanrossum/cpython that referenced this issue Dec 13, 2022
The presence of this macro indicates that a particular instruction
may be considered for conversion to a register-based format
(see faster-cpython/ideas#485).

An invariant (currently unchecked) is that `DEOPT_IF()` may only
occur *before* `DECREF_INPUTS()`, and `ERROR_IF()` may only occur
*after* it. One reason not to check this is that there are a few
places where we insert *two* `DECREF_INPUTS()` calls, in different
branches of the code. The invariant checking would have to be able
to do some flow control analysis to understand this.

Note that many instructions, especially specialized ones,
can't be converted to use this macro straightforwardly.
This is because the generator currently only generates plain
`Py_DECREF(variable)` statements, and cannot generate
things like `_Py_DECREF_SPECIALIZED()` let alone deal with
`_PyList_AppendTakeRef()`.
@gvanrossum
Copy link
Collaborator

In a discussion offline we concluded that it's better if the hybrid model doesn't try to reuse the stack as a bank of temporary registers. Instead, registers could be modeled as temporary local variables with "impossible" names (e.g. $4 could be the name of the unnamed local variable at offset 4). When a value on the stack is needed in a register as input for a register-based instruction, we could use STORE_FAST to pop it off the stack and store it in the register. When a register is needed as input to a stack-based instruction, we use LOAD_FAST to push it onto the stack.

We will need to do some design work on compiler architecture to make it easy to generate code for both register-based and stack-based instructions during the transition. For example, the expression "visit" function (compiler_visit_expr1()) will have to return information about whether the result of an operation is on top of the stack or in a given register, and it will need to use this info as input for further visits. (There are some design choices here, e.g. one could also imagine that the visit function receives a "hint" argument indicating a preferred destination for the instruction, so that a register instruction could write the result of e.g. a = b + c directly into the local variable a. Another design might be to just generate STORE_FAST here and have a separate peepholer that optimizes a register instruction followed by a STORE_FAST.)

There's also a recurring problem where we find that debugging new instructions is very difficult. It was mentioned that the (deep)frozen importlib is a big problem here because as long as that doesn't compile and run we won't be able to get to a REPL prompt. A suggestion was to initially have a register-based and a stack-based version of a new instruction, and to emit the register version only in certain cases, e.g. if the filename or function name equals "test". That way we can ensure that the deep-frozen importlib will still work even if the register-based instruction is hopelessly broken. Once the register instruction is debugged we can always generate it and get rid of the stack-based version.

@iritkatriel
Copy link
Collaborator

iritkatriel commented Dec 15, 2022

This is a branch where I'm trying to implement the unary_ops working with the fake locals as described above (it's still giving me some grief, WIP). I limited it to files that have "mytest" in the name.

At some point I had a version that was not restricted to certain files, and which could build and run most tests, but seemed to get stale bytecode whenever a test used test.support.import_helper to do things like importing a Python implementation of a library while excluding the C part.

@gvanrossum
Copy link
Collaborator

gvanrossum commented Dec 17, 2022

Thoughts on compiler architecture.

(None of this is original; I heard most of it from Mark during a recent meeting, and worked out some details in my head.)

When compiling an expression tree to a register VM, it's pretty clear that the visit() function (which is called recursively to emit the bytecode) needs to return an indication of where the result can be found. Possibilities would be:

  • on top of the stack
  • in local variable V
  • in temporary register N
  • in constant K

(A register-based instruction treats variables, registers and constants the same way, but the compiler might not.)

When generating stack-based instruction, if the argument is not on the stack, the compiler needs to emit a LOAD_FAST instruction. Conversely, when generating a register-based instruction, if the actual operand is on top of the stack, the compiler needs to allocate a temporary register and a STORE_FAST instruction.

Consider compiling a + b*c, in a (pretty silly) hypothetical hybrid VM where addition (ADD) is stack-based but multiplication (MUL_R) is register-based. The visit function for a should just return "it's in variable a". Then the visit function for b*c is called. This recurses into b, for which it returns "it's in b", and similarly for c. Then it can emit a MUL_R instruction with oparg1 and oparg2 referring to b and c. For the result (oparg3) it will have to allocate a temporary register, say $0. Before we can emit the ADD instruction, we first need to emit LOAD_FAST a and then LOAD_FAST $0. Finally we emit ADD, and return that the result is on top of the stack.

Now let's compile a * (b+c) in the same VM.

  • visit(a) returns "variable a"
  • visit(b+c) recurses down
    • visit(b) returns "variable b"; emit LOAD_FAST b
    • visit(c) returns "variable c"; emit LOAD_FAST c
    • emit ADD
    • return "on top of stack"
  • allocate a temporary register, $1
  • emit STORE_FAST $1
  • allocate another temp register, $2
  • emit MUL_R a, $1, $2
  • return "in $2"

In addition to allocating registers, the compiler also needs to keep track of which registers are still in use. In the above examples, once a register has been loaded on top of the stack using LOAD_FAST it can be considered no longer in use.
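The walkthrough above can be modeled with a toy code generator. This is a deliberately simplified sketch of the idea, not the CPython compiler; visit() returns where a result lives (a variable/register name, or "stack"), ADD is stack-based, MUL_R is register-based, and all names are hypothetical:

```python
# Toy hybrid code generator: visit() returns where the result lives.
class Codegen:
    def __init__(self):
        self.code = []
        self.next_tmp = 0

    def new_tmp(self):
        reg = f"${self.next_tmp}"
        self.next_tmp += 1
        return reg

    def as_register(self, loc):
        # Register instructions need a named operand: spill the top
        # of the stack into a fresh temporary if necessary.
        if loc == "stack":
            tmp = self.new_tmp()
            self.code.append(("STORE_FAST", tmp))
            return tmp
        return loc

    def push(self, loc):
        # Stack instructions need their operand on the stack.
        if loc != "stack":
            self.code.append(("LOAD_FAST", loc))

    def visit(self, node):
        if isinstance(node, str):            # variable reference
            return node
        op, left, right = node
        if op == "*":                        # register-based MUL_R
            l = self.as_register(self.visit(left))
            r = self.as_register(self.visit(right))
            out = self.new_tmp()
            self.code.append(("MUL_R", l, r, out))
            return out
        if op == "+":                        # stack-based ADD
            self.push(self.visit(left))
            self.push(self.visit(right))
            self.code.append(("ADD",))
            return "stack"
        raise ValueError(op)

# a * (b + c): b+c is computed on the stack, spilled into $0,
# and MUL_R writes the product into $1.
cg = Codegen()
result = cg.visit(("*", "a", ("+", "b", "c")))
```

Running this produces LOAD_FAST b; LOAD_FAST c; ADD; STORE_FAST $0; MUL_R a, $0, $1, with the result reported to be in $1, matching the hand-compiled sequence above.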

Now let's say addition also becomes register-based (ADD_R). Compiling a * (b+c) again:

  • visit(a) returns "variable a"
  • visit(b+c) recurses
    • visit(b) returns "variable b"
    • visit(c) returns "variable c"
    • allocate temporary register $3
    • emit ADD_R b, c, $3
    • return "in $3"
  • allocate temporary $4
  • emit MUL_R a, $3, $4

Some later optimization stage might observe that $3 and $4 can share the same physical register, reducing the number of temporary registers needed.

Now let's compile an assignment, x = a + b in the same VM. The RHS expression translates to ADD_R a, b, $5. How do we get the result into x? If all we have is LOAD_FAST and STORE_FAST, we'd have to emit LOAD_FAST $5; STORE_FAST x. If we add a MOVE_R instruction, we could shorten that to MOVE_R $5, x. But the best approach is to update the output register for ADD_R to x. How do we get there?

We could imagine another optimization stage that recognizes a register instruction with an output in a temporary register followed by a MOVE_R from that register, and eliminates the MOVE_R by adjusting the preceding instruction's output register.

A different approach might be to pass an extra argument to the visit() indicating the preferred destination. The visit() function may ignore this, and the caller will then emit the appropriate LOAD/STORE/MOVE based on visit()'s return; but if visit() generates a register instruction and the preferred destination is a variable, visit() can just emit an instruction with that variable as its target, avoiding the need to allocate a register. I like this approach.
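The MOVE_R-elimination pass described two paragraphs up can be sketched in a few lines. This is a toy over hypothetical instruction tuples, and it assumes each "$"-named temporary is dead after its first use (so rewriting the producing instruction is safe):

```python
def eliminate_moves(code):
    # Fold (XXX_R ..., $t) immediately followed by (MOVE_R, $t, dst)
    # into (XXX_R ..., dst).  Only fires for "$"-named temporaries,
    # which this sketch assumes are dead after their first use.
    out = []
    for instr in code:
        if (instr[0] == "MOVE_R"
                and out
                and out[-1][0].endswith("_R")
                and out[-1][-1] == instr[1]
                and instr[1].startswith("$")):
            prev = out.pop()
            out.append(prev[:-1] + (instr[2],))
        else:
            out.append(instr)
    return out

# ADD_R a, b, $5; MOVE_R $5, x  ==>  ADD_R a, b, x
optimized = eliminate_moves([("ADD_R", "a", "b", "$5"),
                             ("MOVE_R", "$5", "x")])
```

A real pass would need liveness information before dropping the temporary; the preferred-destination approach avoids the question entirely by never materializing the MOVE_R.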

@gvanrossum
Copy link
Collaborator

gvanrossum commented Dec 17, 2022

Generating code for conditionals in a register VM.

How should we generate code for a and b? Currently we do something like

  LOAD a
  JUMP_IF_FALSE_OR_POP L1
  LOAD b
L1:
  # Result is on top of stack

In a register VM, we could instead have an instruction JUMP_IF_FALSE_R which takes a register and a jump offset, and generate this:

  MOVE a, $0
  JUMP_IF_FALSE_R $0, L1
  MOVE b, $0
L1:
  # Result is in $0

Interestingly, if a and b: ... is compiled differently, because the compiler sees that the values of a and b are never needed (only the truthiness is). The current compiler uses POP_JUMP_IF_FALSE. But the register VM could use the same instruction! (Still in a different arrangement.)

  JUMP_IF_FALSE_R a, L1
  JUMP_IF_FALSE_R b, L1
  <code for ...>
L1:

This feels like a nice little win for the register VM.

@gvanrossum
Copy link
Collaborator

(Note that, sadly, we can't translate x = a and b into

  MOVE a, x
  JUMP_IF_FALSE_R x, L1
  MOVE b, x
L1:

because if bool(a) raises an exception, x should be unchanged.)
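This semantic constraint is observable from pure Python. A minimal check (the class here is purely illustrative):

```python
# A hypothetical object whose truthiness check always raises.
class Boom:
    def __bool__(self):
        raise RuntimeError("no truth value")

x = "unchanged"
try:
    x = Boom() and "new value"
except RuntimeError:
    pass
# bool() of the left operand raised, so the assignment never ran
# and x must keep its old value.
```

Any compilation of x = a and b that writes a into x before testing its truthiness would fail this check.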

gvanrossum added a commit to python/cpython that referenced this issue Dec 17, 2022
…#100205)

The presence of this macro indicates that a particular instruction
may be considered for conversion to a register-based format
(see faster-cpython/ideas#485).

An invariant (currently unchecked) is that `DEOPT_IF()` may only
occur *before* `DECREF_INPUTS()`, and `ERROR_IF()` may only occur
*after* it. One reason not to check this is that there are a few
places where we insert *two* `DECREF_INPUTS()` calls, in different
branches of the code. The invariant checking would have to be able
to do some flow control analysis to understand this.

Note that many instructions, especially specialized ones,
can't be converted to use this macro straightforwardly.
This is because the generator currently only generates plain
`Py_DECREF(variable)` statements, and cannot generate
things like `_Py_DECREF_SPECIALIZED()` let alone deal with
`_PyList_AppendTakeRef()`.
@gvanrossum gvanrossum added the epic-registers Ideas related to a register VM label Dec 17, 2022
gvanrossum added a commit to gvanrossum/cpython that referenced this issue Dec 17, 2022
… input (python#100205)

The presence of this macro indicates that a particular instruction
may be considered for conversion to a register-based format
(see faster-cpython/ideas#485).

An invariant (currently unchecked) is that `DEOPT_IF()` may only
occur *before* `DECREF_INPUTS()`, and `ERROR_IF()` may only occur
*after* it. One reason not to check this is that there are a few
places where we insert *two* `DECREF_INPUTS()` calls, in different
branches of the code. The invariant checking would have to be able
to do some flow control analysis to understand this.

Note that many instructions, especially specialized ones,
can't be converted to use this macro straightforwardly.
This is because the generator currently only generates plain
`Py_DECREF(variable)` statements, and cannot generate
things like `_Py_DECREF_SPECIALIZED()` let alone deal with
`_PyList_AppendTakeRef()`.
@gvanrossum
Copy link
Collaborator

Thoughts on compiler architecture.

The algorithm in [Zhang] is slightly more involved; it is explained on pp. 8-9 (starting at "Our approach", with a reference to Fig. 4 on p. 7). Basically their visitor returns where the result is located, which may be unresolved, and they require the caller to patch up unresolved results. I suspect this is similar to my "pass in where you would like the result to be" idea.

An interesting convention is that the result of an instruction, if it has one, is always designated by oparg3. (This helps with patching up unresolved results.)

Following this section is an explanation of their algorithm for allocating and freeing temporary variables (registers). Cleverly, registers are associated to a specific AST node, and freed automatically when node traversal ends.
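That node-scoped scheme can be modeled with a stack-discipline pool. This is my own simplification, not the paper's code: registers grabbed while visiting a node are released when the visit ends, so sibling subtrees reuse the same slots.

```python
import contextlib

class RegisterPool:
    def __init__(self):
        self.next = 0        # next fresh register index
        self.high_water = 0  # total distinct registers needed

    def allocate(self):
        reg = self.next
        self.next += 1
        self.high_water = max(self.high_water, self.next)
        return reg

    @contextlib.contextmanager
    def node_scope(self):
        mark = self.next
        try:
            yield
        finally:
            # Free everything allocated while visiting this node,
            # mirroring "freed automatically when traversal ends".
            self.next = mark

pool = RegisterPool()
with pool.node_scope():      # e.g. visiting one subexpression
    t0 = pool.allocate()
with pool.node_scope():      # a sibling gets the same slot back
    t1 = pool.allocate()
```

Both siblings end up using register 0, and high_water tells the compiler how many temporary slots the frame actually needs.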

Other thoughts after reading more of the paper

I wonder if we could extend the register allocation algorithm to automatically insert delete instructions for registers that are no longer needed. Though surely this requires more thought. See CLEAR_TMPS mentioned on p. 18, and Fig. 11(b) on p. 19. The authors reject it as impractical due to causing too much unnecessary overhead, but their discussion glosses over the effects on real-world code.

The paper mentions (on p. 0) a change in semantics for certain named expressions, e.g. y = x + (x := 2). The authors claim that it is acceptable to first set x to 2 and then do the addition, but this fails to take into account the requirement that the LHS of a binary operator should be evaluated before the RHS. (Arguably, "evaluating" here is poorly specified, or perhaps defined tautologically using an implied stack, but their interpretation doesn't follow the spirit of the rule.) Fortunately they admit it can be fixed with "more engineering effort".
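CPython's actual evaluation order is observable from pure Python, which is why the paper's reordering is a visible semantic change:

```python
x = 10
# The LHS of a binary operator is evaluated before the RHS, so the
# old value of x (10) is read before (x := 2) rebinds it.
y = x + (x := 2)
```

Under the paper's relaxed ordering, y would come out as 4 instead of 12, so a conforming register compiler has to preserve the left-to-right evaluation.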

@markshannon
Copy link
Member

If anyone wants to read lots of relatively old-school C code, the Lua VM is a register machine and the compiler works (I believe) in a single pass. So it might be worth a look.

@gvanrossum
Copy link
Collaborator

Googling for "Lua VM" gets lots of hits, for Lua VMs written in Go, C#, Lua... I think this is the one is C that you're referring to, right? https://github.com/lua/lua . If that's the one, I think the compiler is in lcode.c and the interpreter in lvm.c. It's indeed pretty clean old-style C.

@brandtbucher
Copy link
Member

If anyone wants to read lots of relatively old-school C code, the Lua VM is a register machine and the compiler works (I believe) in a single pass. So it might be worth a look.

In case anyone else is interested:

@brandtbucher
Copy link
Member

Ah, we crossed posts.

@brandtbucher
Copy link
Member

This paper has some interesting info, too:

https://www.lua.org/doc/jucs05.pdf

@markshannon
Copy link
Member

Some thoughts regarding register allocation/assignment.

There seem to be two viable approaches.

  1. Assign temporaries for all expressions during code generation using a very simple allocator, and fix up with a peephole optimizer.
  2. Assign virtual registers for all expressions during code generation, and use a full-blown register allocator to allocate variables to locals and temporaries.

A third approach, passing and returning the desired and actual registers, doesn't work because, as mentioned in the paper and above, the expression x := 2 clobbers (yes, "clobber" is the technical term) x, so that y = x + (x:=2) could mis-compile if visiting x returned the index of the local variable "x", as it will be clobbered by (x:=2).

Approach 2 will produce the best code, but is quite complex.
So let's carry on with approach 1. If it turns out that it doesn't produce good enough code, we can look into approach 2.

@markshannon
Copy link
Member

It might be worth renaming compiler_visit_expr to compiler_visit_expr_stack, and adding compiler_visit_expr_register, to ease moving instructions from stack to register form.

int compiler_visit_expr_stack(compiler *c, expr_ty e)
{
    switch (e->kind) {
            /* Stack instructions */
        default: {
            int reg = compiler_visit_expr_register(c, e);
            RETURN_ON_ERROR(reg);
            emit(LOAD_FAST, reg);
            return 0;
        }
    }
}

int compiler_visit_expr_register(compiler *c, expr_ty e)
{
    switch (e->kind) {
            /* Register instructions */
        default: {
            RETURN_ON_ERROR(compiler_visit_expr_stack(c, e));
            int reg = allocate_temp();
            emit(STORE_FAST, reg);
            return reg;
        }
    }
}

This way, statements and expressions can be converted from stack to register form independently.

@dolfinus
Copy link

dolfinus commented Jan 5, 2023

Maybe worth reading - the LuaJIT Remake project, which uses a similar approach, generating a register-based interpreter from a high-level DSL:
https://sillycross.github.io/2022/11/22/2022-11-22/
https://github.com/luajit-remake/luajit-remake
https://github.com/luajit-remake/luajit-remake/tree/master/annotated/bytecodes

@iritkatriel
Copy link
Collaborator

iritkatriel commented Jan 24, 2023

We have been experimenting with the register VM idea for a few weeks, here are some of our conclusions so far.

The migration can be incremental, transitioning one opcode at a time from stack-based to register-based. Temporary ops LOAD_REG/STORE_REG will move values between the stack and the registers to keep everything working.

We will add tmp and const registers to the frame's fast locals array. We ended up with a layout of [localsplus, tmp, stack (during transition), consts], where the consts are copied from co_consts (and are borrowed references), while all the other registers hold references to the objects (if not NULL). Putting the localsplus first means we don't need to change anything around their construction (with cell/free vars etc). Putting the tmps before the stack means that everything up to top-of-stack needs to be decreffed (like now) when a frame is destroyed, which also minimises changes.
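Under that [localsplus, tmp, stack, consts] layout, resolving a register reference to an index into the fast-locals array is just an offset computation. A sketch with hypothetical names, not CPython code:

```python
def register_index(kind, i, n_localsplus, n_tmps, stacksize):
    # Map a (kind, i) register reference onto the frame's
    # fast-locals array, laid out as
    #   [localsplus | tmp | stack | consts]
    base = {
        "local": 0,
        "tmp":   n_localsplus,
        "stack": n_localsplus + n_tmps,
        "const": n_localsplus + n_tmps + stacksize,
    }[kind]
    return base + i
```

For example, with 3 localsplus slots, 2 tmps and a stack of 4, const #0 lands at index 9. The compiler can resolve these offsets at assembly time, so the interpreter only ever sees flat indices.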

We decided that instructions should not all have the same size (because of the caches, instructions already do not have uniform size, so there's not much to gain from making the op part uniform). Instructions with up to one oparg will have size 1 word, those with 2 or 3 opargs will have size 2 words, and a few might have more opargs (beyond 3 they are explicit constants, not registers; an example is BINARY_OP, where oparg4 is the op). The cases_generator can emit the right JUMPBY(N) for each instruction, but updating the runtime to work correctly with variable-length instructions can get fiddly around instructions like RETURN_GENERATOR and YIELD_FROM, which need to ensure that prev_instr is set correctly on the frame in all cases (exceptions, etc).

The changes in the compiler's general infrastructure include:

  1. attaching a type to each oparg (const / tmp / explicit int / jump target label / unused).
  2. adding a mechanism to allocate tmp registers and keep track of which ones are used (a stack should work)
  3. adding more opargs to struct instr and teaching write_instr, instr_size etc to work with them.
  4. resolving the opargs to their actual values in assembly stage before emitting code
  5. updating the peephole optimisations (some will not be needed anymore, some need to change, there will be new ones)

Once we have that we can migrate the bytecodes one by one, updating the compiler and the bytecode implementation in tandem.

Areas of uncertainty:

It does not seem possible to assess the performance improvements without implementing the whole thing and tuning it (when the expected gain is 5-10%, cutting any corners can skew the numbers).

So we don't know whether we can achieve the results reported for 3.10. Reasons why our results may be different are:

  • 3.11 is very different from 3.10
  • We are not happy to hold references to tmp registers until the end of the function. We don't know what it would cost to avoid doing that.
  • We have concerns about the performance impact of adding more locals to the eval loop, creating pressure on the (hardware) registers there.
