Inline dict/list/set comprehensions in the compiler for better performance #97933

Closed
carljm opened this issue Oct 5, 2022 · 14 comments · Fixed by #101441
Assignees: carljm
Labels: performance (Performance or resource usage), type-feature (A feature request or enhancement)

Comments

@carljm
Member

carljm commented Oct 5, 2022

Feature or enhancement

In Cinder we inline some list/dict/set comprehensions in the compiler for better performance. That is, instead of creating a function object every time and calling it, for some comprehensions we just emit bytecode directly in the outer function to implement the comprehension loop.
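A minimal Python-level sketch of what this means, using a hypothetical f_inlined to stand in for the compiler's output (the real transformation rewrites bytecode, not source):

def f(seq):
    return [x * 2 for x in seq]

def f_inlined(seq):
    result = []
    for x in seq:              # loop runs directly in f_inlined's own frame
        result.append(x * 2)   # no function object allocated, no extra call
    # Cinder also deletes names defined by the comprehension (here, x) right
    # after the inlined bytecode, so they stay invisible to the rest of the
    # function; see the compatibility notes below.
    return result

assert f(range(3)) == f_inlined(range(3)) == [0, 2, 4]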

Pitch

This change was a significant CPU efficiency improvement on the Instagram production web tier. Allocating a new Python function object on every execution of a comprehension is a real but not strictly necessary cost.

There are some backward-compatibility considerations. In the Cinder implementation, we refuse to inline comprehensions if there is a name collision between a name assigned within the comprehension and a name in the containing scope, and we delete any names defined in the comprehension immediately after the inlined bytecode, to preserve the same visibility of names defined in the comprehension. There can still be some visible changes, e.g. if locals() is called within the comprehension it will show names from the containing function too. If a traceback occurs within the comprehension (or sys._getframe is called), it will not show an additional last frame for the comprehension itself. In practice we've not observed either of these to be a problem.
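A small illustration of the locals() point, assuming current (non-inlined) compilation; the "with inlining" behavior in the comment is the expected change described above, not something this snippet demonstrates today:

def g(limit):
    snapshots = [sorted(locals()) for _ in range(1)]
    return snapshots[0]

# Without inlining, the comprehension is its own scope, so this prints only the
# comprehension's names: ['.0', '_'] (the synthetic iterator argument and the
# loop variable). With inlining, names from the containing function, such as
# 'limit', would appear as well.
print(g(10))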

I discussed this idea with @markshannon, and he suggested a couple possible improvements. Instead of refusing to inline comprehensions with name clashes, we could push the prior value of the colliding name(s) onto the stack before running the comprehension, and then pop it back into the name afterward. This would allow inlining ~all comprehensions. He also suggested that we could add PUSH_FRAME and POP_FRAME opcodes and wrap the inlined bytecode in these. Then we would still elide the cost of function object creation, but the comprehension bytecode (although part of the parent bytecode) would still run inside its own frame, eliminating the above-mentioned incompatibilities.

Linked PRs

carljm added the type-feature (A feature request or enhancement) label on Oct 5, 2022
@carljm
Member Author

carljm commented Oct 5, 2022

In Cinder 3.8 the win here was both from not doing a (slow) allocation of a new function object and not doing a (slow) function call. In 3.11+ function calls are now a lot faster, which may take away some of the value here. Also if we improve compatibility via PUSH_FRAME and POP_FRAME, we reintroduce some extra frame overhead (though this is also greatly reduced in 3.11+). I can write a PR and test it on pyperformance and microbenchmarks to see how much this helps performance, but before I do that I am interested in knowing if there are a priori objections to the idea, even if it is a perf win.
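A rough microbenchmark sketch along these lines (a hypothetical measurement setup, not the pyperformance suite):

import timeit

def comp(data):
    return [x + 1 for x in data]

# Times the per-call overhead that inlining targets: allocating the <listcomp>
# function object and calling into its code object on every invocation.
data = list(range(10))
print(timeit.timeit("comp(data)", globals={"comp": comp, "data": data}, number=1_000_000))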

AlexWaygood added the performance (Performance or resource usage) label on Oct 5, 2022
@markshannon
Member

The language definition is rather vague about the semantics. It mentions scopes and comprehensions, but doesn't seem to define either. So let's define what would be needed for full compatibility:

  • The iteration variable must not leak into the surrounding scope
  • The behavior of locals() should be as if the list comprehension were a lambda
  • The list comprehension should appear in tracebacks.

With that in mind, I think pushing a new frame is required.
However, there is no need to create a new function object.
The current implementation of frames needs a function, but all it uses the function for is to provide a strong reference to the globals and builtins. The caller's function can be reused, as it has the correct globals and builtins.
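Of the requirements above, the traceback one is the most visible; the current behavior looks like this (illustrative example, not from Cinder or CPython sources):

import traceback

def boom():
    return [1 // x for x in (1, 0)]

try:
    boom()
except ZeroDivisionError:
    # Today the traceback contains a separate <listcomp> frame beneath 'boom';
    # inlining without pushing a frame would drop that entry.
    traceback.print_exc()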

carljm self-assigned this on Jan 6, 2023
@carljm
Member Author

carljm commented Jan 6, 2023

@markshannon A couple further clarifications about compatibility requirements here.

  1. Is it acceptable if references to the comprehension iterator variable in the caller scope now raise UnboundLocalError instead of NameError?
  2. Is it acceptable if locals() within the comprehension no longer includes the synthetic .0 iterator argument that it includes today?

I'm assuming both of these are acceptable, since they don't violate the constraints you listed above, just making sure we're on the same page here.
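To pin down question 1, here is the current behavior; the UnboundLocalError outcome is the proposed post-inlining behavior being asked about, shown only as a hypothetical in the comment:

def h():
    [y for y in range(3)]
    return y  # the iteration variable does not leak out of the comprehension

try:
    h()
except NameError as exc:
    print(type(exc).__name__)   # today: NameError ('y' is not defined)
# Question 1 asks whether it would be acceptable for the same reference to raise
# UnboundLocalError (a NameError subclass) after inlining, since y would then be
# an unbound local of h.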

There would be a possible implementation that avoids even these changes, where we keep the comprehension's bytecode in its own code object and effectively do things just as they are done today, except we replace MAKE_FUNCTION...CALL with a new dedicated opcode that just "calls" the comprehension without creating a new function object. But this leaves some nice wins on the table; e.g. comprehensions that close over caller locals aren't uncommon, and I was hoping to go for the actually-inlined-bytecode, single-code-object approach that can convert these to shared locals and avoid cells for them.

@markshannon
Member

  1. Should be fine, but if we absolutely must maintain the existing behavior, we already do static analysis of local variables: if it is possible that the variable is undefined, then don't inline that comprehension, which should almost never happen.
  2. We're going to break this with the register VM anyway, so you're in the clear 🙂

@markshannon
Member

OOI, how do you plan to fully inline the comprehension code and have the comprehension appear in tracebacks?

@markshannon
Member

Personally, I think that the comprehension not appearing in the frame stack is fine, and may actually be an improvement, but will need wider discussion.

@markshannon
Member

One other question. Do you have a link to the Cinder code that does this?

@carljm
Member Author

carljm commented Jan 6, 2023

I had been thinking I would still push a new frame even with inlined bytecode, by wrapping the inlined bytecode in new PUSH_FRAME and POP_FRAME opcodes. But I realized that there is a problem with pushing a new frame and converting closed-over caller locals into shared locals: mutations to those formerly-cell variables should be visible in the caller, and if I push a new frame they wouldn't be (unless I manually copy them back out of the inner frame to the outer one in POP_FRAME, I guess, but that starts to feel ugly).

If we could get wider agreement that it's ok to change the behavior of locals() inside a comprehension, and not necessarily have comprehensions show up separately on the Python stack, then I could do something more similar to the current Cinder approach, and inline without a new frame. (Currently Cinder deletes the iteration var after the comprehension runs, and refuses to inline if there's a name clash, but instead we could push/pop the outer value if there is a clash.) I guess the place to seek that wider agreement would be a new Discourse thread; I'll start one.

Most recent version of Cinder implementation is in facebookincubator/cinder@ba2d0ac

@carljm
Member Author

carljm commented Jan 6, 2023

@carljm
Member Author

carljm commented Jan 25, 2023

Ok, I've worked on a couple possible implementations of this, and pushed one of them in #101310.

The other approach (full inlining) is still WIP; ideally I'd like to get that in a working state and compare the performance (particularly in cases where the comprehension has a closure and moves some outer locals into cells) and complexity in order to make a call between them.

@markshannon
Member

markshannon commented Jan 25, 2023

I'd push for the full inlining approach. The resulting code is so much shorter, and presumably faster.
Currently we have:

>>> def f(v, s):
...     return [v + i for i in s]
... 
>>> dis.dis(f)
  MAKE_CELL                0 (v)
  RESUME                   0
  LOAD_CLOSURE             0 (v)
  BUILD_TUPLE              1
  LOAD_CONST               1 (<code object <listcomp> at 0x7fad2a129450, file "<stdin>", line 2>)
  MAKE_FUNCTION            8 (closure)
  LOAD_FAST                1 (s)
  GET_ITER
  PRECALL                  0
  CALL                     0
  RETURN_VALUE

Disassembly of <code object <listcomp> at 0x7fad2a129450, file "<stdin>", line 2>:
  COPY_FREE_VARS           1
  RESUME                   0
  BUILD_LIST               0
  LOAD_FAST                0 (.0)
  FOR_ITER                 7 (to 24)
  STORE_FAST               1 (i)
  LOAD_DEREF               2 (v)
  LOAD_FAST                1 (i)
  BINARY_OP                0 (+)
  LIST_APPEND              2
  JUMP_BACKWARD            8 (to 8)
  END_FOR
  RETURN_VALUE

With full inlining we would get:

>>> def f(v, s):
...     return [v + i for i in s]
... 
>>> dis.dis(f)
  RESUME                   0
  BUILD_LIST               0
  LOAD_FAST                1 (s)
  FOR_ITER                 7 (to 24)
  STORE_FAST               2 (i)
  LOAD_FAST                0 (v)
  LOAD_FAST                2 (i)
  BINARY_OP                0 (+)
  LIST_APPEND              2
  JUMP_BACKWARD            8 (to 8)
  END_FOR
  RETURN_VALUE

Full inlining also enables handling "ad-hoc comprehensions" where a generator expression is wrapped in a class call or filter.
faster-cpython/ideas#545
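Illustrative examples of that "ad-hoc comprehension" shape (examples chosen here for illustration, not taken from the linked idea): a generator expression consumed immediately by a surrounding call.

unique_lengths = set(len(word) for word in ["spam", "ham", "eggs"])  # class call
any_long = any(len(word) > 3 for word in ["spam", "ham", "eggs"])    # builtin consumer
print(unique_lengths, any_long)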

@carljm
Member Author

carljm commented Jan 29, 2023

@markshannon it seems that we can't inline cases like this, where the comprehension itself has cells:

def f():
    items = [lambda: x for x in range(3)]
    x = -1
    return [item() for item in items]

Because unless the comprehension has a real frame of its own, x = -1 will stomp on the contents of the cell for x and we'll get [-1, -1, -1] instead of [2, 2, 2].

Such code is probably a bug to begin with, since the last value in the comprehension will be used for all the generated closures, but I'm still not sure we can get away with changing this? And I don't see a good way to inline and preserve the original behavior.

Do you see anything better here than just refusing to inline in this case?

Hmm, I could try "push previous cell to the stack, make new cell, run comprehension, pop previous cell from stack" for this case...

@markshannon
Member

Hmm, I could try "push previous cell to the stack, make new cell, run comprehension, pop previous cell from stack" for this case...

Exactly this 🙂

Whether the inner x in the lambda escapes or not shouldn't matter when compiling the enclosing function, just when initializing the inner x.
Renaming for clarity:

def f_escape():
    x0 = ...
    items = [lambda: x1 for x1 in range(3)]
    x0 = -1
    return [item() for item in items]

(Both x0 and x1 are called "x" and occupy the same local slot)
Compare with the non-escaping case:

def f_normal():
    x0 = ...
    items = [x1 for x1 in range(3)]
    x0 = -1
    return items

We compile f_normal to something like:

def f_normal():
    x0 = ...
    PUSH(x0)   # Save x0
    del x1  # Initialize x1
    items = [x1 for x1 in range(3)]
    x0 = POP()
    ...

For an escaping x1 we need to initialize the cell.

def f_escape():
    x0 = ...
    PUSH(x0)   # Save x0
    del x1 # Initialize x1 
    x1 = MAKE_CELL(x1)  # Initialize x1 
    items = [lambda: x1 for x1 in range(3)]
    x0 = POP()
    ...

@carljm
Member Author

carljm commented Jan 30, 2023

#101441 implements full comprehension inlining as discussed above.

carljm added a commit that referenced this issue May 9, 2023
Co-authored-by: Irit Katriel <1055913+iritkatriel@users.noreply.github.com>
Co-authored-by: Erlend E. Aasland <erlend.aasland@protonmail.com>
carljm added a commit to carljm/cpython that referenced this issue May 9, 2023
carljm added a commit to carljm/cpython that referenced this issue May 9, 2023
carljm added a commit to carljm/cpython that referenced this issue May 9, 2023
carljm added a commit to carljm/cpython that referenced this issue May 31, 2023