register-based interpreter #485
I had an exchange with the author. He believes that the RegCPython gains will be orthogonal to Python 3.11 gains, and will compose well. Rebasing to 3.11 would be a large undertaking, however, and he doesn't plan it soon due to finishing his PhD, etc.
I am convinced that a register-based interpreter is going to give a significant speedup independent of the rest of the optimisations.
I am pretty sure this list is not complete, so feel free to comment.
@vstinner experimented with this on 3.3: https://faster-cpython.readthedocs.io/registervm.html
My main question after reading the paper (but not looking at the code): How does this not run out of registers? The instruction format is (opcode, oparg1, oparg2, oparg3) where each item is 8 bits. What if we need more than 256 registers? Since the co_consts are "loaded" into registers, an easy way to run out is to have > 256 constants in a function.
I can't stop thinking about this. I would like to have the speed of this without needing so many dependencies.
That last idea could be called a "poor man's register allocator": just create enough locals to cover the highest stack depth.
You need some extension coding, a generalisation of EXTENDED_ARG. I bet that functions using so many values will not be fast anymore.
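For illustration, here is a minimal sketch (my own, not from either codebase) of how an EXTENDED_ARG-style prefix could widen an 8-bit register operand; it mirrors how CPython already widens opargs today, and applying it to register indexes is the hypothetical part.

    def widen(prefix_bytes, arg_byte):
        # Each EXTENDED_ARG-style prefix contributes 8 higher-order bits.
        value = 0
        for b in prefix_bytes:
            value = (value << 8) | b
        return (value << 8) | arg_byte

    assert widen([], 0x2A) == 42        # plain 8-bit register index
    assert widen([0x01], 0x2A) == 298   # one prefix byte: indexes up to 65535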
I figured it out: the crazy idea above is possible. The following conceptual process could change an existing bytecode stream into a register based stream, leaving many parts of the interpreter as they are now.
Here is a small example. There are of course many details, such as free variables, globals, default values, etc., but I don't see why this wouldn't work.
(A day later I realized this could work with 3-address as well as 2-address instructions. The latter one is probably going to be faster.)
I'd love to see that.
How hard would it be to support CPython's current line-tracing behavior (e.g. PEP 626) with register-based bytecode? Looking at that RegCPython repo's test_sys_settrace.py, it looks like they disabled many of the tracing tests.
Do they support settrace at all? If not, that's cheating. It's what everybody looking for a quick win does -- drop tracing support, interpreter gets faster (Cinder does it too IIRC). But it shouldn't be hard to support it the same way we currently do in 3.11 and main: the tracing flag is either 0 or 255, and or-ed in with the opcode; and case 255 has special handling code that does the tracing and then executes the original opcode.
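A toy Python model of that dispatch trick may help (this is my own sketch, not CPython's ceval code; the handler table and opcode numbers are invented): the flag is 0 or 255, so with tracing on every instruction falls into case 255, which runs the trace hook and then the original opcode.

    HANDLERS = {
        1: lambda arg: print("LOAD", arg),
        2: lambda arg: print("ADD", arg),
    }
    TRACE_CASE = 255

    def run(instructions, tracing=False):
        flag = TRACE_CASE if tracing else 0
        for opcode, oparg in instructions:
            case = opcode | flag
            if case == TRACE_CASE:
                print("trace hook before opcode", opcode)  # tracing happens first
                case = opcode                              # then run the real opcode
            HANDLERS[case](oparg)

    run([(1, "x"), (2, 0)], tracing=True)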
I think with a register-based interpreter you'd also have fewer opcodes and so might end up needing somewhat more NOPs for locations in some cases in order to cover all the lines?
In the JIT we don't support tracing. Not worth it, if you want to trace just turn off the JIT. We haven't done anything to disturb tracing in the interpreter.
Ah, that makes sense, sorry. FWIW I feel that inserting NOPs just to trace all lines is being hyper-correct. But we're getting ahead of ourselves here. If we're serious we should at least ask the paper author to sign the PSF CLA, so we can read the code with impunity (and maybe submit a pro-forma PR, even if we're never going to merge it).
I admit it. It was in the middle of the night that I typed that. I had to edit it to fix a lot of typos and bad grammar. But I'm happy to make a pull request for you...
Note that Java is an example of using stack-based bytecode but verifying/converting it to register-based before execution. But honestly it's better to just change the bytecode format entirely. Further details I'll post on the N-address issue just linked, since that's the interesting question.
One wrinkle about a register-based interpreter is that it probably changes the semantics of when references are dropped. In today's stack interpreter references are mostly consumed immediately by bytecodes, so intermediates will be deallocated nearly-immediately. In a register interpreter, I assume that the references stay alive until the register is reused, which can be arbitrarily later. To preserve existing semantics, maybe a bit of the register index could be used to signal clearing the register after the instruction? Or maybe something more clever, but it seems like there would be a tradeoff with performance. Or maybe it's not a big deal to change this particular behavior. I wrote up a simple test and saw that RegCPython deallocates temporaries in a different order than CPython. I'm not claiming that this is a representative piece of code or anything, just a quick demonstration that this is a thing:

    class I(int):
        def __add__(self, rhs):
            return I(int(self) + int(rhs))
        def __del__(self):
            print("del", self)

    def main():
        (I(1) + I(2)) + (I(4) + I(8))

    if __name__ == "__main__":
        main()

(3 and 8 are switched.) I don't have any particular horse in this race, but I haven't seen this mentioned yet so I'm just throwing it out there.
They mention the GC issue in the paper, and it is indeed one of our concerns.
In the paper they only mention the possibility of extra memory usage, which I think misses the bigger issue. What I'm talking about is the semantic change of altering when object finalizers are run. The author casually dismisses the issue as an implementation detail, but this is a very important implementation detail that people rely on. It's easy to be dismissive and say that it's the users' own fault for relying on this detail, but I think some humility is in order because CPython itself includes code that is dependent on prompt finalization (WeakKeyDictionary comes to mind) that I have broken in the past by changing finalization timings. Apologies if this is already on your radar, but as I said I haven't seen it mentioned.
Can't hurt saying it again.
I think changing the order of finalization due to temporary references and local variables is something we can be reasonably relaxed about. We already changed the order of finalization in 3.11 for some locals and temporaries.

    g = "goodbye"

    def f1(a):
        global g
        a = None
        g = None

    def f2(x, y):
        f1(x+y)

    f2("he", "llo")

In 3.10 "hello" will be freed after "goodbye".
@iritkatriel My thinking is that if we view "stack effects" as "input/output effects" then creating register machine instructions should be doable.

    inst(BINARY_ADD, (left right -- res)) {
        BODY
    }

We can generate the register code easily:

    TARGET(BINARY_ADD) {
        PyObject *left = locals[r1];
        PyObject *right = locals[r2];
        BODY
        SETLOCAL(r3, res);
    }

But we need to account for the difference in reference counting. The simplest thing I can think of is to automatically insert increfs for all inputs, then manually remove redundant incref/decref pairs.
Let's assume we go for a hybrid solution as we discussed in the meeting on Wednesday. So we still maintain the stack, but the compiler passes register ids to the opcodes that tell them where in the stack to look for inputs. We also add the consts and locals etc to the stack so that everything can be accessed via such references. The stack can still control the lifetimes of temporaries as it does now if the compiler also tells each opcode how many items to pop (or how far to clear the stack). So if BINARY_ADD adds two consts, nothing is popped. But if it's a const and a temporary, then one thing is popped. Etc. The lifetimes of objects should be quite similar to what they are now if we can pull this off.

I wondered how many registers a function would typically need, so I updated count_opcodes to calculate an upper bound, based on the number of locals, consts, stacksize, etc. The bottom line is that it will probably be rare to need more than 255 registers.
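As a rough illustration of that kind of estimate, here is a small sketch (my own approximation, not the count_opcodes script mentioned above) that sums locals, cells, consts and stack depth per code object as an upper bound on the registers a function might need:

    def register_upper_bound(code):
        return (code.co_nlocals
                + len(code.co_cellvars) + len(code.co_freevars)
                + len(code.co_consts)
                + code.co_stacksize)

    def scan(code, out):
        out.append((code.co_name, register_upper_bound(code)))
        for const in code.co_consts:        # recurse into nested code objects
            if hasattr(const, "co_code"):
                scan(const, out)

    def example(a, b):
        total = 0
        for i in range(a, b):
            total += i * i
        return total

    results = []
    scan(example.__code__, results)
    print(results)   # typical functions stay far below 255 by this measure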
Alternatively, rather than each instruction cleaning up after itself, this can be implicit in the register allocation - when the next instruction overwrites something that the previous instruction was supposed to pop, it decrefs the previous value. Then we need to add a way for the compiler to explicitly shrink the stack when it needs to.
Another related point: the frame knows where the non-stack localsplus entries end (and the stack begins). So it can assume that any register below this index is holding a borrowed ref, and above that it owns the ref.
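A tiny model of that borrowed/owned split, assuming a hypothetical frame layout where locals and consts come first and temporaries follow (the names and layout here are illustrative only):

    class Frame:
        def __init__(self, num_locals_and_consts, num_temps):
            self.stack_base = num_locals_and_consts
            self.registers = [None] * (num_locals_and_consts + num_temps)

        def owns(self, reg_index):
            # Registers below stack_base alias locals/consts (borrowed refs);
            # registers at or above it are temporaries the frame owns.
            return reg_index >= self.stack_base

        def clear_temporaries(self):
            for i in range(self.stack_base, len(self.registers)):
                self.registers[i] = None   # drop owned references only

    f = Frame(num_locals_and_consts=4, num_temps=3)
    print([f.owns(i) for i in range(len(f.registers))])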
Hi everyone, I'm the author of RegCPython. I just learned today that RegCPython has been discussed here, and it looks like a stack-to-register architecture transition in real engineering faces a lot more issues than I expected. @jneb The idea you mentioned is consistent with this article, which proposes a binary extension that converts stack-based bytecode to register-based bytecode. I finally decided to change the bytecode format entirely out of some considerations:
@sweeneyde When writing RegCPython, I put much effort into PEP 626 support (to be honest, it really gave me a headache). As in the following example, RegCPython can already emit NOP instructions at the appropriate locations to keep every line covered. @carljm This does result in RegCPython having more NOPs than CPython, but overall the cost is not that great. Also, RegCPython does not ignore checks for
As for why RegCPython disabled 25 test cases out of the 228 test cases in test_sys_settrace.py, this is a bit of a frustrating story for me. I once tried my best to design a new and more elegant instruction set. In particular, I wanted to improve the opcode design for try-block setup and exception handling. Personally, I don't like CPython's design of duplicating the entire contents of the finally block.

    from dis import dis

    def foo():
        try:
            return 42
        finally:
            print('something')

    dis(foo)

CPython v3.10.1 always duplicates the finally-body for cases with and without exceptions, respectively.
RegCPython tried to avoid this problem.
However, I later found that the decision makes passing the following test case tricky, because RegCPython's trace result will be that it returns in the try-body (the line where the return statement is):

    def test_return_through_finally(self):
        def func():
            try:
                return 2
            finally:
                4
        self.run_and_compare(func,
            [(0, 'call'),
             (1, 'line'),
             (2, 'line'),
             (4, 'line'),
             (4, 'return')])

If I don't skip this test case, it will report the error below.
When I found that I would need to modify that as well, it felt like too much. All in all, maybe it would be better if I could go back in time and reverse my bad decision, or if I had more time to understand the problem better.
@zq1997 - thank you for joining this discussion, and for the work you did on the paper. I may have missed it, but I don't think you mention in the paper what the performance impact of eager decrefs was. Do you have that info? As @markshannon mentioned, changing the order of finalisation is not a big deal, but significantly increasing the lifetimes of objects is more of a problem. Second - before we can look at your code we would need to know that you are happy to contribute it, or parts of it, to cpython. Typically people create a PR and sign the cpython CLA agreement. Are you able to do this? (We would not merge your PR into 3.10, but I believe that by creating it you would indicate that you are happy for us to do so.)
Hi, Right, I made this experiment in 2012 and it was great :-) It made Python faster, like 10 or 20%, I don't recall.

First, I didn't clear any register, ever. But the aggressive register allocator and some instructions writing the result into registers would implicitly clear the old values of registers. So sometimes, by luck, the Python semantics were respected. When I started to communicate on my work, it was clear that I had to strictly respect the Python semantics of clearing objects the same way as stack-based bytecode. I added a CLEAR_REG instruction. But then I got many bugs since I didn't design my code with CLEAR_REG in mind. I had a lot of trouble with conditional code, loops (jumps), etc. I'm not a compiler designer, so I had a hard time trying to keep track of register/variable lifetimes to decide if and when the code could be optimized or not. At some point, I gave up for lack of time and because I had too many bugs.

The nice thing with registers was that they sat alongside constants in a frame, so constants would just become "read-only registers". LOAD_FAST just became redundant. I was too lazy to check that bytecode didn't write into "constants", but that's trivial to validate ;-)

Using registers, the number of instructions is lower. Executing a single instruction has a fixed cost. ceval.c is big, I'm not sure that it fits well into the CPU L1 instruction cache. So having fewer instructions... makes the code faster. It's that simple.

One complicated part of using only registers is that many instructions rely implicitly on the stack, especially exception handling. It wasn't obvious to me how values magically land on the stack, nor how to re-design ceval.c differently. My implementation was hack-ish: I kept all stack-based bytecode, and I didn't touch the compiler. I only rewrote bytecode on demand to convert stack-based instructions to register-based instructions. Having the stack and registers at the same time made things more complicated for changing the weird instructions pushing values on the stack (again, especially exception handling). But it was very pleasant to compare performance between stack and register bytecode! I was sure that Python was built the same way, since it was technically the exact same executable for both :-) Obviously, a final implementation would emit register-based bytecode directly from the compiler.

Another issue of my project is that I experimented with optimizations like moving "constants" or even expressions out of loops, but it also changed the Python semantics. I got pushback about these optimizations, even if there was a way to disable them. Maybe thanks to Python 3.11 adaptive bytecode, it might be possible to re-implement such interesting optimizations. I tried these kinds of optimizations again in my "FAT Python" project, which emits faster code thanks to multiple assumptions, and these assumptions were checked at the function entry using "guards" implemented in C. If a single "assumption" was no longer true, the code was de-optimized. "FAT Python" is mostly based on my old "astoptimizer" project, with a small piece of C code for "guards". This project introduced the first flavor of "versioning" for Python dicts: PEP 509. I also proposed PEP 510 but I rejected this PEP, since I failed to prove any clear speedup using my design for "guards". Also, an issue was that guards were only at the entry of a function. Using unittest.mock in the middle of a function to override a built-in function was simply not supported.

Adaptive bytecode works at the instruction level, so guards are just put at the right place: this way, the Python semantics are respected. Nice! Well, tell me if I can help you decipher my old fork of Python ;-)
I understand your pain @zq1997... If I could wave a magic wand and remove one feature from CPython...
Thanks to @zq1997 for the insight into your work.
Is this related to the register VM, or some other reason?
I'm thinking of a change to the DSL where the decref of inputs is also generated. You specify two blocks - the first can't have any ERROR_IF, the second can't have any DEOPT_IF, and the generator decrefs inputs between them.
Interesting. This would be easy to do in the generator, but it would be a lot of churn in bytecodes.c. Doable though.
@gvanrossum wrote:
My idea was to have the constants and arguments together in consecutive registers.
the semi-register version:
Our plan is something like that, except that the stack is in the register space, so rather than -1 the compiler can just calculate the index of the top-of-stack register.
That would save even more runtime computation. Cool!
This branch contains the change where each instruction is laid out with an extra codeword (for oparg2, oparg3, currently set to 0 and ignored by the eval loop). Benchmarking shows a 1% slowdown. Part of this would be due to more EXTENDED_ARGs for jumps (we jump up to twice as far). We will get some of this back by using 16 bits for jump target offsets.
@iritkatriel wrote:
One reason I proposed a separate stack mechanism is that the refcounts in registers are treated differently than on the stack: some cleanup is needed at the end otherwise. But it probably is worth the trouble.
This will save a branch (testing for -1) so it should be much better.
Although I guess good branch predictors will not have a hard time predicting this case, I do agree: it is not only the branch that takes time, but also all the code around it, running in the inner loop and filling up cache space.
The presence of this macro indicates that a particular instruction may be considered for conversion to a register-based format (see faster-cpython/ideas#485). An invariant (currently unchecked) is that `DEOPT_IF()` may only occur *before* `DECREF_INPUTS()`, and `ERROR_IF()` may only occur *after* it. One reason not to check this is that there are a few places where we insert *two* `DECREF_INPUTS()` calls, in different branches of the code. The invariant checking would have to be able to do some flow control analysis to understand this. Note that many instructions, especially specialized ones, can't be converted to use this macro straightforwardly. This is because the generator currently only generates plain `Py_DECREF(variable)` statements, and cannot generate things like `_Py_DECREF_SPECIALIZED()` let alone deal with `_PyList_AppendTakeRef()`.
In a discussion offline we concluded that it's better if the hybrid model doesn't try to reuse the stack as a bank of temporary registers. Instead, registers could be modeled as temporary local variables with "impossible" names (names that cannot collide with real identifiers). We will need to do some design work on compiler architecture to make it easy to generate code for both register-based and stack-based instructions during the transition (for example, in the expression "visit" function). There's also a recurring problem where we find that debugging new instructions is very difficult. It was mentioned that the (deep)frozen modules may be related to this.
This is a branch where I'm trying to implement the unary_ops working with the fake locals as described above (it's still giving me some grief, WIP). I limited it to files that have "mytest" in the name. At some point I had a version that was not restricted to certain files, and which could build and run most tests, but it seemed to get stale bytecode whenever a test used test.support.import_helper to do things like import a Python implementation of a library while excluding the C part.
Thoughts on compiler architecture. (None of this is original; I heard most of it from Mark during a recent meeting, and worked out some details in my head.) When compiling an expression tree to a register VM, it's pretty clear that the visit() function (which is called recursively to emit the bytecode) needs to return an indication of where the result can be found. Possibilities would be:
(A register-based instruction treats variables, registers and constants the same way, but the compiler might not.) When generating a stack-based instruction, if the argument is not on the stack, the compiler needs to emit a LOAD_FAST instruction. Conversely, when generating a register-based instruction, if the actual operand is on top of the stack, the compiler needs to allocate a temporary register and emit a STORE_FAST instruction.
In addition to allocating registers, the compiler also needs to keep track of which registers are still in use. Once a register's value has been loaded onto the stack with LOAD_FAST, that temporary register can be freed again. Now let's say addition also becomes register-based (ADD_R): compiling an addition then leaves its result in a temporary register.
Some later optimization stage might be able to improve on that. For instance, when compiling an assignment, the right-hand side lands in a temporary register and a MOVE_R copies it into the target variable; we could imagine an optimization stage that recognizes a register instruction with an output in a temporary register followed by a MOVE_R from that register, and eliminates the MOVE_R by adjusting the preceding instruction's output register. A different approach might be to pass an extra argument to visit() indicating the preferred destination. The visit() function may ignore this, and the caller will then emit the appropriate LOAD/STORE/MOVE based on visit()'s return; but if visit() generates a register instruction and the preferred destination is a variable, visit() can just emit an instruction with that variable as its target, avoiding the need to allocate a register. I like this approach.
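To make the "preferred destination" idea concrete, here is a toy compiler sketch of my own (the Loc model, ADD_R and the three-address layout are illustrative, not CPython's compiler): the inner expression gets a temporary register, while the outer instruction writes straight into the destination variable, so no MOVE_R is needed.

    from dataclasses import dataclass

    @dataclass
    class Loc:
        kind: str    # "var", "const" or "reg"
        index: int

    class Compiler:
        def __init__(self):
            self.code = []
            self.next_reg = 0

        def alloc_reg(self):
            reg = Loc("reg", self.next_reg)
            self.next_reg += 1
            return reg

        def visit(self, node, dest=None):
            # node is ("var", i), ("const", i) or ("add", left, right)
            if node[0] in ("var", "const"):
                return Loc(node[0], node[1])
            left = self.visit(node[1])
            right = self.visit(node[2])
            out = dest if dest is not None else self.alloc_reg()
            self.code.append(("ADD_R", left, right, out))   # three-address form
            return out

    c = Compiler()
    x = Loc("var", 0)
    # compile something shaped like "x = a + (b + c)"
    c.visit(("add", ("var", 1), ("add", ("var", 2), ("var", 3))), dest=x)
    for instr in c.code:
        print(instr)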
Generating code for conditionals in a register VM. How should we generate code for a conditional like if x: ...? In a stack VM the test value is loaded onto the stack and a conditional jump pops it. In a register VM, we could instead have an instruction JUMP_IF_FALSE_R which takes a register and a jump offset, so no separate load or pop is needed.
This feels like a nice little win for the register VM.
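For a feel of the difference, here are hypothetical instruction sequences for "if x: y = 1" (the register-form opcode names besides JUMP_IF_FALSE_R are invented for illustration):

    stack_form = [
        ("LOAD_FAST", "x"),             # push x
        ("POP_JUMP_IF_FALSE", "L1"),    # pop and test it
        ("LOAD_CONST", 1),
        ("STORE_FAST", "y"),
        # L1:
    ]

    register_form = [
        ("JUMP_IF_FALSE_R", "x", "L1"),   # test x in place, no push/pop
        ("LOAD_CONST_R", 1, "y"),         # write the constant straight into y
        # L1:
    ]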
(Note that, sadly, we can't translate every case this way.)
The algorithm in [Zhang] is slightly more involved; it is explained on pp. 8-9 (starting at "Our approach", with a reference to Fig. 4 on p. 7). Basically their visitor returns where the result is located, which may be unresolved, and they require the caller to patch up unresolved results. I suspect this is similar to my "pass in where you would like the result to be" idea. An interesting convention is that the result of an instruction, if it has one, is always designated by oparg3. (This helps with patching up unresolved results.) Following this section is an explanation of their algorithm for allocating and freeing temporary variables (registers). Cleverly, registers are associated with a specific AST node, and freed automatically when node traversal ends.
Other thoughts after reading more of the paper: I wonder if we could extend the register allocation algorithm to automatically insert delete instructions for registers that are no longer needed. Though surely this requires more thought. The paper also mentions a change in semantics for certain named expressions.
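Here is a toy model of that allocation scheme (my own sketch, not the paper's code; LOAD_CONST_R and ADD_R are made-up names): temporary registers acquired for a node's children are released as soon as the node's own instruction has been emitted.

    class RegisterPool:
        def __init__(self):
            self.free = []
            self.high_water = 0

        def acquire(self):
            if self.free:
                return self.free.pop()
            reg = self.high_water
            self.high_water += 1
            return reg

        def release(self, regs):
            self.free.extend(regs)

    def visit(node, pool):
        # node is either an int (a constant leaf) or a pair of sub-nodes to add
        if isinstance(node, int):
            reg = pool.acquire()
            print(f"LOAD_CONST_R r{reg}, {node}")
            return reg
        left = visit(node[0], pool)
        right = visit(node[1], pool)
        out = pool.acquire()
        print(f"ADD_R r{left}, r{right}, r{out}")
        pool.release([left, right])   # children's temporaries die with this node
        return out

    pool = RegisterPool()
    result = visit(((1, 2), (3, 4)), pool)
    print(f"result in r{result}; registers used: {pool.high_water}")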
If anyone wants to read lots of relatively old-school C code, the Lua VM is a register machine and the compiler works (I believe) in a single pass. So it might be worth a look.
Googling for "Lua VM" gets lots of hits, for Lua VMs written in Go, C#, Lua... I think this is the one in C that you're referring to, right? https://github.com/lua/lua . If that's the one, I think the compiler is in lcode.c and the interpreter in lvm.c. It's indeed pretty clean old-style C.
In case anyone else is interested:
Ah, we crossed posts.
This paper has some interesting info, too:
Some thoughts regarding register allocation/assignment. There seem to be two viable approaches.
A third approach, passing and returning the desired and actual registers, doesn't work because, as mentioned in the paper and above, some expressions break it. Approach 2 will produce the best code, but is quite complex.
It might be worth renaming the existing expression visitor and splitting it into stack and register variants:

    int compiler_visit_expr_stack(compiler *c, expr_ty e)
    {
        switch (e->kind) {
            /* Stack instructions */
            default: {
                int reg = compiler_visit_expr_register(c, e);
                RETURN_ON_ERROR(reg);
                emit(LOAD_FAST, reg);
                return 0;
            }
        }
    }

    int compiler_visit_expr_register(compiler *c, expr_ty e)
    {
        switch (e->kind) {
            /* Register instructions */
            default: {
                RETURN_ON_ERROR(compiler_visit_expr_stack(c, e));
                int reg = allocate_temp();
                emit(STORE_FAST, reg);
                return reg;
            }
        }
    }

This way, statements and expressions can be converted from stack to register form independently.
Maybe worth reading: the LuaJIT Remake project, which uses the same approach of generating a register-based interpreter from a high-level DSL:
We have been experimenting with the register VM idea for a few weeks; here are some of our conclusions so far. The migration can be incremental, transitioning one opcode at a time from stack-based to register-based. For temporary ops, we will add tmp and const registers to the frame's fast locals array, and we settled on a specific layout for these.
We decided that instructions should not all have the same size (because of the caches, instructions already do not have uniform size, so there's not much to gain from making the op part uniform). Instructions with up to one oparg will have size 1 word, those with 2 or 3 opargs will have size 2 words, and a few might have more opargs (but beyond 3 they are explicit consts, not registers; an example is BINARY_OP, where oparg4 is the op). The cases_generator can emit the right code for each instruction.
The changes in the compiler's general infrastructure include:
Once we have that we can migrate the bytecodes one by one, updating the compiler and the bytecode implementation in tandem.
Areas of uncertainty: It does not seem possible to assess the performance improvements without implementing the whole thing, and tuning it (when the expected gain is 5-10%, cutting any corners can skew the numbers). So we don't know whether we can achieve the results reported for 3.10. Reasons why our results may be different are:
This 3.10 fork is API- and ABI-compatible with CPython.
I'm seeing 5% speedup on my app (long-running, large, heterogeneous, highly concurrent, async workload), which is in line with overall speedup they report across the benchmarks (closer to 9% is reported for ARM).
https://github.com/zq1997/RegCPython
https://dl.acm.org/doi/10.1145/3568973
I'd like to see how this composes with Python 3.11...