Inline dict/list/set comprehensions in the compiler for better performance #97933

Closed
carljm opened this issue Oct 5, 2022 · 14 comments · Fixed by #101441
Assignees: carljm
Labels: performance (Performance or resource usage), type-feature (A feature request or enhancement)

Comments

@carljm
Member

carljm commented Oct 5, 2022

Feature or enhancement

In Cinder we inline some list/dict/set comprehensions in the compiler for better performance. That is, instead of creating a function object every time and calling it, for some comprehensions we just emit bytecode directly in the outer function to implement the comprehension loop.
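A minimal Python-level sketch of what this means, using a hypothetical f_inlined to stand in for the compiler's output (the real transformation rewrites bytecode, not source):

def f(seq):
    return [x * 2 for x in seq]

def f_inlined(seq):
    result = []
    for x in seq:              # loop runs directly in f_inlined's own frame
        result.append(x * 2)   # no function object allocated, no extra call
    # Cinder also deletes names defined by the comprehension (here, x) right
    # after the inlined bytecode, so they stay invisible to the rest of the
    # function; see the compatibility notes below.
    return result

assert f(range(3)) == f_inlined(range(3)) == [0, 2, 4]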

Pitch

This change was a significant CPU efficiency improvement on the Instagram production web tier. Allocating a new Python function object on every execution of a comprehension is a real but not strictly necessary cost.

There are some backward-compatibility considerations. In the Cinder implementation, we refuse to inline comprehensions if there is a name collision between a name assigned within the comprehension and a name in the containing scope, and we delete any names defined in the comprehension immediately after the inlined bytecode, to preserve the same visibility of names defined in the comprehension. There can still be some visible changes, e.g. if locals() is called within the comprehension it will show names from the containing function too. If a traceback occurs within the comprehension (or sys._getframe is called), it will not show an additional last frame for the comprehension itself. In practice we've not observed either of these to be a problem.
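A small illustration of the locals() point, assuming current (non-inlined) compilation; the "with inlining" behavior in the comment is the expected change described above, not something this snippet demonstrates today:

def g(limit):
    snapshots = [sorted(locals()) for _ in range(1)]
    return snapshots[0]

# Without inlining, the comprehension is its own scope, so this prints only the
# comprehension's names: ['.0', '_'] (the synthetic iterator argument and the
# loop variable). With inlining, names from the containing function, such as
# 'limit', would appear as well.
print(g(10))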

I discussed this idea with @markshannon, and he suggested a couple possible improvements. Instead of refusing to inline comprehensions with name clashes, we could push the prior value of the colliding name(s) onto the stack before running the comprehension, and then pop it back into the name afterward. This would allow inlining ~all comprehensions. He also suggested that we could add PUSH_FRAME and POP_FRAME opcodes and wrap the inlined bytecode in these. Then we would still elide the cost of function object creation, but the comprehension bytecode (although part of the parent bytecode) would still run inside its own frame, eliminating the above-mentioned incompatibilities.

Linked PRs

carljm added the type-feature (A feature request or enhancement) label on Oct 5, 2022
@carljm
Member Author

carljm commented Oct 5, 2022

In Cinder 3.8 the win here was both from not doing a (slow) allocation of a new function object and not doing a (slow) function call. In 3.11+ function calls are now a lot faster, which may take away some of the value here. Also if we improve compatibility via PUSH_FRAME and POP_FRAME, we reintroduce some extra frame overhead (though this is also greatly reduced in 3.11+). I can write a PR and test it on pyperformance and microbenchmarks to see how much this helps performance, but before I do that I am interested in knowing if there are a priori objections to the idea, even if it is a perf win.
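A rough microbenchmark sketch along these lines (a hypothetical measurement setup, not the pyperformance suite):

import timeit

def comp(data):
    return [x + 1 for x in data]

# Times the per-call overhead that inlining targets: allocating the <listcomp>
# function object and calling into its code object on every invocation.
data = list(range(10))
print(timeit.timeit("comp(data)", globals={"comp": comp, "data": data}, number=1_000_000))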

AlexWaygood added the performance (Performance or resource usage) label on Oct 5, 2022
@markshannon
Member

The language definition is rather vague about the semantics. It mentions scopes and comprehensions, but doesn't seem to define either. So let's define what would be needed for full compatibility:

  • The iteration variable must not leak into the surrounding scope
  • The behavior of locals() should be as if the list comprehension were a lambda
  • The list comprehension should appear in tracebacks.

With that in mind, I think pushing a new frame is required.
However, there is no need to create a new function object.
The current implementation of frames needs a function, but all it uses the function for is to provide a strong reference to the globals and builtins. The caller's function can be reused, as it has the correct globals and builtins.
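Of the requirements above, the traceback one is the most visible; the current behavior looks like this (illustrative example, not from Cinder or CPython sources):

import traceback

def boom():
    return [1 // x for x in (1, 0)]

try:
    boom()
except ZeroDivisionError:
    # Today the traceback contains a separate <listcomp> frame beneath 'boom';
    # inlining without pushing a frame would drop that entry.
    traceback.print_exc()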

carljm self-assigned this on Jan 6, 2023
@carljm
Member Author

carljm commented Jan 6, 2023

@markshannon A couple further clarifications about compatibility requirements here.

  1. Is it acceptable if references to the comprehension iterator variable in the caller scope now raise UnboundLocalError instead of NameError?
  2. Is it acceptable if locals() within the comprehension no longer includes the synthetic .0 iterator argument that it includes today?

I'm assuming both of these are acceptable, since they don't violate the constraints you listed above, just making sure we're on the same page here.
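To pin down question 1, here is the current behavior; the UnboundLocalError outcome is the proposed post-inlining behavior being asked about, shown only as a hypothetical in the comment:

def h():
    [y for y in range(3)]
    return y  # the iteration variable does not leak out of the comprehension

try:
    h()
except NameError as exc:
    print(type(exc).__name__)   # today: NameError ('y' is not defined)
# Question 1 asks whether it would be acceptable for the same reference to raise
# UnboundLocalError (a NameError subclass) after inlining, since y would then be
# an unbound local of h.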

There would be a possible implementation that avoids even these changes, where we keep the comprehension's bytecode in its own code object and effectively do things just as they are done today, except we replace MAKE_FUNCTION...CALL with a new dedicated opcode that just "calls" the comprehension without creating a new function object. But this leaves some nice wins on the table; e.g. comprehensions that close over caller locals aren't uncommon, and I was hoping to go for the actually-inlined-bytecode, single-code-object approach that can convert these to shared locals and avoid cells for them.

@markshannon
Member

  1. Should be fine, but if we absolutely must maintain the existing behavior, we already do static analysis of local variables: if it is possible that the variable is undefined, then don't inline that comprehension, which should almost never happen.
  2. We're going to break this with the register VM anyway, so you're in the clear 🙂

@markshannon
Member

OOI, how do you plan to fully inline the comprehension code and have the comprehension appear in tracebacks?

@markshannon
Member

Personally, I think that the comprehension not appearing in the frame stack is fine, and may actually be an improvement, but will need wider discussion.

@markshannon
Member

One other question. Do you have a link to the Cinder code that does this?

@carljm
Member Author

carljm commented Jan 6, 2023

I had been thinking I would still push a new frame even with inlined bytecode, by wrapping the inlined bytecode in new PUSH_FRAME and POP_FRAME opcodes. But I realized that there is a problem with pushing a new frame and converting closed-over caller locals into shared locals: mutations to those formerly-cell variables should be visible in the caller, and if I push a new frame they wouldn't be (unless I manually copy them back out of the inner frame to the outer one in POP_FRAME, I guess, but that starts to feel ugly).

If we could get wider agreement that it's ok to change the behavior of locals() inside a comprehension, and not necessarily have comprehensions show up separately on the Python stack, then I could do something more similar to the current Cinder approach, and inline without a new frame. (Currently Cinder deletes the iteration var after the comprehension runs, and refuses to inline if there's a name clash, but instead we could push/pop the outer value if there is a clash.) I guess the place to seek that wider agreement would be a new Discourse thread; I'll start one.

Most recent version of Cinder implementation is in facebookincubator/cinder@ba2d0ac

@carljm
Member Author

carljm commented Jan 6, 2023

@carljm
Member Author

carljm commented Jan 25, 2023

Ok, I've worked on a couple possible implementations of this, and pushed one of them in #101310.

The other approach (full inlining) is still WIP; ideally I'd like to get that in a working state and compare the performance (particularly in cases where the comprehension has a closure and moves some outer locals into cells) and complexity in order to make a call between them.

@markshannon
Member

markshannon commented Jan 25, 2023

I'd push for the full inlining approach. The resulting code is so much shorter, and presumably faster.
Currently we have:

>>> def f(v, s):
...     return [v + i for i in s]
... 
>>> dis.dis(f)
  MAKE_CELL                0 (v)
  RESUME                   0
  LOAD_CLOSURE             0 (v)
  BUILD_TUPLE              1
  LOAD_CONST               1 (<code object <listcomp> at 0x7fad2a129450, file "<stdin>", line 2>)
  MAKE_FUNCTION            8 (closure)
  LOAD_FAST                1 (s)
  GET_ITER
  PRECALL                  0
  CALL                     0
  RETURN_VALUE

Disassembly of <code object <listcomp> at 0x7fad2a129450, file "<stdin>", line 2>:
  COPY_FREE_VARS           1
  RESUME                   0
  BUILD_LIST               0
  LOAD_FAST                0 (.0)
  FOR_ITER                 7 (to 24)
  STORE_FAST               1 (i)
  LOAD_DEREF               2 (v)
  LOAD_FAST                1 (i)
  BINARY_OP                0 (+)
  LIST_APPEND              2
  JUMP_BACKWARD            8 (to 8)
  END_FOR
  RETURN_VALUE

With full inlining we would get:

>>> def f(v, s):
...     return [v + i for i in s]
... 
>>> dis.dis(f)
  RESUME                   0
  BUILD_LIST               0
  LOAD_FAST                1 (s)
  FOR_ITER                 7 (to 24)
  STORE_FAST               2 (i)
  LOAD_FAST                0 (v)
  LOAD_FAST                2 (i)
  BINARY_OP                0 (+)
  LIST_APPEND              2
  JUMP_BACKWARD            8 (to 8)
  END_FOR
  RETURN_VALUE

Full inlining also enables handling "ad-hoc comprehensions" where a generator expression is wrapped in a class call or filter.
faster-cpython/ideas#545
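Illustrative examples of that "ad-hoc comprehension" shape (examples chosen here for illustration, not taken from the linked idea): a generator expression consumed immediately by a surrounding call.

unique_lengths = set(len(word) for word in ["spam", "ham", "eggs"])  # class call
any_long = any(len(word) > 3 for word in ["spam", "ham", "eggs"])    # builtin consumer
print(unique_lengths, any_long)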

@carljm
Member Author

carljm commented Jan 29, 2023

@markshannon it seems that we can't inline cases like this, where the comprehension itself has cells:

def f():
    items = [lambda: x for x in range(3)]
    x = -1
    return [item() for item in items]

Because unless the comprehension has a real frame of its own, x = -1 will stomp on the contents of the cell for x and we'll get [-1, -1, -1] instead of [2, 2, 2].

Such code is probably a bug to begin with, since the last value in the comprehension will be used for all the generated closures, but I'm still not sure we can get away with changing this? And I don't see a good way to inline and preserve the original behavior.

Do you see anything better here than just refusing to inline in this case?

Hmm, I could try "push previous cell to the stack, make new cell, run comprehension, pop previous cell from stack" for this case...

@markshannon
Member

Hmm, I could try "push previous cell to the stack, make new cell, run comprehension, pop previous cell from stack" for this case...

Exactly this 🙂

Whether the inner x in the lambda escapes or not shouldn't matter when compiling the enclosing function, just when initializing the inner x.
Renaming for clarity:

def f_escape():
    x0 = ...
    items = [lambda: x1 for x1 in range(3)]
    x0 = -1
    return [item() for item in items]

(Both x0 and x1 are called "x" and occupy the same local slot)
Compare with the non-escaping case:

def f_normal():
    x0 = ...
    items = [x1 for x1 in range(3)]
    x0 = -1
    return items

We compile f_normal to something like:

def f_normal():
    x0 = ...
    PUSH(x0)   # Save x0
    del x1  # Initialize x1
    items = [x1 for x1 in range(3)]
    x0 = POP()
    ...

For an escaping x1 we need to initialize the cell.

def f_escape():
    x0 = ...
    PUSH(x0)   # Save x0
    del x1 # Initialize x1 
    x1 = MAKE_CELL(x1)  # Initialize x1 
    items = [lambda: x1 for x1 in range(3)]
    x0 = POP()
    ...

@carljm
Member Author

carljm commented Jan 30, 2023

#101441 implements full comprehension inlining as discussed above.

carljm added a commit that referenced this issue May 9, 2023
Co-authored-by: Irit Katriel <1055913+iritkatriel@users.noreply.github.com>
Co-authored-by: Erlend E. Aasland <erlend.aasland@protonmail.com>
carljm added a commit to carljm/cpython that referenced this issue May 9, 2023
carljm added a commit to carljm/cpython that referenced this issue May 9, 2023
carljm added a commit to carljm/cpython that referenced this issue May 9, 2023
carljm added a commit to carljm/cpython that referenced this issue May 31, 2023