Faster startup -- Experiment B -- streamline and tweak #153
Replies: 62 comments
-
FWIW, in the 100 most popular PyPI packages, these were the 50 most common constants:
-
Crazy idea: combine these two instructions. It could do something like:

```c
switch (oparg) {
    case 0: value = Py_None; break;
    case 1: value = Py_False; break;
    // Several more cases
    default:
        value = PyLong_FromLong(oparg - 50);
        PUSH(value);
        DISPATCH();
}
Py_INCREF(value);
PUSH(value);
DISPATCH();
```

The idea would be that oparg values in the range 10 to 255 translate to -40 to 205 -- that should cover the most common integers. (Or we could use a different bias, e.g. only go down to -5, since cached small ints cover the range -5 to 256.) Then again, a dedicated MAKE_INT instruction would be faster. (We could still support negative values by using a bias value.)
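For concreteness, a minimal sketch of what a dedicated MAKE_INT handler could look like, assuming a bias of 5 so oparg 0..255 covers -5..250; the TARGET/PUSH/DISPATCH macros follow ceval.c conventions, and the bias constant is illustrative, not from the original:

```c
/* Hypothetical MAKE_INT: materialize an int directly from the oparg.
   MAKE_INT_BIAS is an assumed constant; with a bias of 5, oparg
   0..255 covers the integers -5..250, matching the small-int cache
   on the negative side. */
#define MAKE_INT_BIAS 5

case TARGET(MAKE_INT): {
    /* PyLong_FromLong returns a new reference; for values in the
       small-int range it returns a cached object, so this is cheap. */
    PyObject *value = PyLong_FromLong((long)oparg - MAKE_INT_BIAS);
    if (value == NULL) {
        goto error;
    }
    PUSH(value);
    DISPATCH();
}
```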
-
I missed empty bytes (`b''`).
-
I started working on MAKE_INT here: iritkatriel/cpython#26. I think this opcode might be more appropriately named LOAD_SMALL_INT. Initial perf numbers show it to be slightly slower than LOAD_CONST (I'll post something here soon). It might help if I create a function that takes an index into the small-int cache and does the lookup directly. Right now we translate the oparg (which is the index) to the value, and then the lookup function transforms the value back into the index.
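To illustrate the round trip (a sketch with assumed names, not the actual branch): the oparg is already an index into the small-int cache, but the code first turns it into a value, and the lookup then turns the value straight back into an index:

```c
/* Sketch only; small_ints and NSMALLNEGINTS are stand-ins for the
   small-int cache internals in longobject.c. */
#define NSMALLNEGINTS 5
extern PyObject *small_ints[];  /* assumed cache array, -5 .. 256 */

long value = (long)oparg - NSMALLNEGINTS;            /* index -> value */
PyObject *obj = small_ints[value + NSMALLNEGINTS];   /* value -> index again */
```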
-
New:
Old:
Sanity check: New:
Old:
same dis as above.
-
I'm not getting very stable results from pyperformance, but here's a table anyway.
-
So I did the optimisation I mentioned above, and now I get results comparable to LOAD_CONST. The change is a helper that takes the index into the small-int cache and looks the value up directly, plus a matching change in ceval.c (a sketch of the shape follows).
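A hedged sketch of what that could look like; the helper name _PyLong_FromSmallIntIndex, the cache array, and the opcode name LOAD_SMALL_INT are stand-ins, not the actual diff:

```c
/* longobject.c -- hypothetical helper: return the cached small int
   at a given cache index, skipping the index -> value -> index
   round trip. */
PyObject *
_PyLong_FromSmallIntIndex(int index)
{
    PyObject *v = small_ints[index];   /* assumed cache array */
    Py_INCREF(v);
    return v;
}

/* ceval.c -- the handler then uses the oparg as the cache index. */
case TARGET(LOAD_SMALL_INT): {
    PyObject *value = _PyLong_FromSmallIntIndex(oparg);
    PUSH(value);
    DISPATCH();
}
```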
-
Cool! So the compiler applies the bias? That makes the bytecode dependent on its value -- it would be easy to miss updating the magic number if the bias changes. I also wonder whether it would be faster or slower if instead we just hard-coded a few popular values (e.g. -1 .. 3) in LOAD_COMMON_CONSTANT?
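A sketch of that alternative; the value set (None/False/True plus the ints -1..3) and their oparg layout are assumptions, mirroring the suggestion above:

```c
/* Hypothetical LOAD_COMMON_CONSTANT: oparg selects from a fixed,
   hard-coded set of popular constants instead of applying a bias. */
case TARGET(LOAD_COMMON_CONSTANT): {
    PyObject *value;
    switch (oparg) {
        case 0: value = Py_None;  break;
        case 1: value = Py_False; break;
        case 2: value = Py_True;  break;
        default:
            /* opargs 3..7 map to the ints -1..3 (an assumed layout).
               PyLong_FromLong returns a new reference, so push and
               dispatch without the extra INCREF below. */
            value = PyLong_FromLong((long)oparg - 4);
            PUSH(value);
            DISPATCH();
    }
    Py_INCREF(value);  /* the singletons above are borrowed */
    PUSH(value);
    DISPATCH();
}
```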
-
I think that might be tidier. There is too much Mark-style breaking of the rules in what I did.
-
Here's a rewrite with one common-consts array for ints and other types (a sketch of the idea follows). For now I only populate the ints; next I will uncomment the other types (and fix the test).
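A hedged sketch of what such a unified table could look like; the array name, sizes, and initialization order are assumptions, not the actual rewrite:

```c
/* Hypothetical unified table: the singletons first, then the cached
   small ints. The oparg of LOAD_COMMON_CONSTANT is an index here. */
#define NCOMMON_SINGLETONS 3              /* None, False, True */
#define NCOMMON_INTS 262                  /* -5 .. 256 */
static PyObject *common_consts[NCOMMON_SINGLETONS + NCOMMON_INTS];

static void
init_common_consts(void)
{
    common_consts[0] = Py_None;
    common_consts[1] = Py_False;
    common_consts[2] = Py_True;
    for (int i = 0; i < NCOMMON_INTS; i++) {
        /* PyLong_FromLong returns the cached small int with a new
           reference, which the table keeps. */
        common_consts[NCOMMON_SINGLETONS + i] = PyLong_FromLong(i - 5);
    }
    /* Other types (empty str, empty bytes, empty tuple, ...) would
       be appended here once uncommented. */
}
```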
-
I think we need to check the common consts again.
-
Uh, I found it. It is used for function defaults.
-
I think we should only use [...]. How many of the constants in #65 (comment) are in run-once code that would be better implemented as bytecode? The reason I ask is that creating objects in the interpreter is likely to be faster than in the marshal module. Small integer constants and single-character constants might be better implemented with their own instructions (a sketch of the idea follows).
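For the single-character case, a sketch of what a dedicated instruction could do; the name MAKE_CHAR is hypothetical, and PyUnicode_FromOrdinal is used because CPython caches 1-character Latin-1 strings, so this normally just bumps a refcount on a shared object:

```c
/* Hypothetical MAKE_CHAR: build a 1-character str from the oparg
   (the character's ordinal, 0..255). */
case TARGET(MAKE_CHAR): {
    PyObject *value = PyUnicode_FromOrdinal(oparg);
    if (value == NULL) {
        goto error;
    }
    PUSH(value);
    DISPATCH();
}
```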
-
The docstring would also be on the function object, so you could access it from there. Arguably it's a property of the function and not of the function's code object. Or am I missing something?
-
I am thinking of a situation (say, scanning PYC files) where you don't want to execute any code. Without execution, you have no function objects. So the docstring as specified in the source has to be somewhere in some code object. Function objects exist primarily to store all the dynamic state wrapping code objects (e.g. __defaults__, __kwdefaults__, __closure__, __globals__). There are also a bunch of "metadata" attributes that are set on the function from the code object (e.g. __name__, __doc__).
-
I added total size of co_consts to the script: faster-cpython/tools#5. Here are some results. Total co_consts size goes from 735,736 to 718,616, so about 2.3% smaller. It will get even better if/when we move None docstrings out of co_consts. Old:
With LOAD_NONE:
-
Here are the top 30 opcode pairs with LOAD_NONE. Note: LOAD_NONE --> RETURN_VALUE 7,439 2.52%
-
While I do think we should do RETURN_NONE, I wouldn't expect the world of it -- dynamically, there can be only one RETURN_NONE instruction executed per call (by definition :-).
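For reference, a sketch of RETURN_NONE as a fusion of LOAD_NONE + RETURN_VALUE; the `exit_returning` label stands in for whatever the eval loop's actual return path is, which is an assumption here:

```c
/* Hypothetical RETURN_NONE: return None without a stack push/pop. */
case TARGET(RETURN_NONE): {
    retval = Py_None;
    Py_INCREF(retval);
    goto exit_returning;  /* assumed return-path label */
}
```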
-
With this PR we got: Total co_consts size goes from 48,822 to 46,682, about 4.3% less. If I add this change: iritkatriel/cpython#30 |
-
With only the docstring change (but no LOAD_NONE), it's 48,302.
-
Either there's a typo or that last conclusion is wrong (48,302 is hardly less than 48,822)?
-
Sorry, I misread the numbers, you're right. So it's the other way around -- adding the docstring change to LOAD_NONE helps a lot, but the docstring change alone not so much.
-
So there are 520 code objects that have a None at position 0 that's only used to indicate there's no docstring (these code objects don't load None on the stack at all). And there are 2,140 code objects that have a None at another position used only to load None (we can deduce that these code objects have a docstring at position 0). But 7,944 code objects have a None at position 0 that is used both to load None and to indicate there's no docstring. So we should do both things if we care about the size of co_consts.

Counter to this, we don't measure a real speedup, we save only 64 kB if all those packages you counted are imported in one process, and the marshal encoding of None is a single byte ('N').

I say we give up. :-( It's too bad, I had high hopes for each of these until we did the work.
-
Yes, it doesn't seem to be worth another opcode. Is there a chance that a MAKE_INT opcode will do anything?
-
I expect it's the same story. The common small ints are already cached, so LOAD_CONST is more efficient than anything else. We'd save much less because there are just fewer of these than None loads (6.27% of the top 25 constants are ints), although the marshal size is larger (5 bytes, since there's no "short int" type in marshal), and it probably wouldn't matter.
-
One more thing I haven't tried yet is fusing LOAD_CONST(None) + STORE_FAST (a sketch follows).
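A sketch of what that fused instruction could look like; the name STORE_NONE_FAST is hypothetical, while SETLOCAL follows ceval.c conventions:

```c
/* Hypothetical STORE_NONE_FAST: store None directly into local slot
   `oparg`, skipping the push/pop through the value stack. */
case TARGET(STORE_NONE_FAST): {
    PyObject *value = Py_None;
    Py_INCREF(value);
    SETLOCAL(oparg, value);   /* decrefs the old value in the slot */
    DISPATCH();
}
```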
-
But if I read that correctly, it's only ~0.2% of opcode pairs that get improved. Maybe we should create a framework for measuring dynamic opcode pairs, using e.g. a benchmark suite like PyPerformance as a basis.
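A minimal sketch of such instrumentation, counting consecutive (previous, current) opcode pairs in the dispatch loop; this mirrors the idea behind CPython's old DYNAMIC_EXECUTION_PROFILE/DXPAIRS build option, but the names here are illustrative:

```c
/* Sketch: count dynamic opcode pairs as the eval loop dispatches.
   256*256 counters cover all (previous, current) combinations. */
static unsigned long long pair_counts[256][256];
static int prev_opcode = 0;

/* In the dispatch path, before executing each instruction: */
pair_counts[prev_opcode][opcode]++;
prev_opcode = opcode;

/* A report would then sort the nonzero counters and print the top
   pairs, much like the static pair table quoted above. */
```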
-
There are still some ideas here that we might do, so I'm moving it to Discussions.
-
This is a placeholder for a description of Experiment B (see #32 (comment)):

- try to speed up unmarshalling of code objects by low-level optimizations
- streamline code objects (fewer redundant, computed fields; fewer small objects)
- opcodes for immediate values (MAKE_INT, LOAD_NONE, LOAD_COMMON_CONSTANT)

The benefit of MAKE_INT and LOAD_COMMON_CONSTANT is that they reduce the size of co_consts (and hence of PYC files) without adding to the size of the instruction array. The downside is that the speed and simplicity of LOAD_CONST is hard to beat, so it will be hard to measure the difference in benchmarks. But None, False, True and small ints make up a lot of the constants (we can easily do research on this).