PEP 709 – Inlined comprehensions
Author: Carl Meyer <carl at oddbird.net>
Sponsor: Guido van Rossum <guido at python.org>
Discussions-To: Discourse thread
Status: Final
Type: Standards Track
Created: 24-Feb-2023
Python-Version: 3.12
Post-History: 25-Feb-2023
Resolution: Discourse message
Abstract
Comprehensions are currently compiled as nested functions, which provides isolation of the comprehension’s iteration variable, but is inefficient at runtime. This PEP proposes to inline list, dictionary, and set comprehensions into the code where they are defined, and provide the expected isolation by pushing/popping clashing locals on the stack. This change makes comprehensions much faster: up to 2x faster for a microbenchmark of a comprehension alone, translating to an 11% speedup for one sample benchmark derived from real-world code that makes heavy use of comprehensions in the context of doing actual work.
Motivation
Comprehensions are a popular and widely-used feature of the Python language. The nested-function compilation of comprehensions optimizes for compiler simplicity at the expense of performance of user code. It is possible to provide near-identical semantics (see Backwards Compatibility) with much better runtime performance for all users of comprehensions, with only a small increase in compiler complexity.
Rationale
Inlining is a common compiler optimization in many languages. Generalized inlining of function calls at compile time in Python is near-impossible, since call targets may be patched at runtime. Comprehensions are a special case, where we have a call target known statically in the compiler that can neither be patched (barring undocumented and unsupported fiddling with bytecode directly) nor escape.
Inlining also permits other compiler optimizations of bytecode to be more effective, because they can now “see through” the comprehension bytecode, instead of it being an opaque call.
Normally a performance improvement would not require a PEP. In this case, the simplest and most efficient implementation results in some user-visible effects, so this is not just a performance improvement, it is a (small) change to the language.
Specification
Given a simple comprehension:
def f(lst):
    return [x for x in lst]
The compiler currently emits the following bytecode for the function f:
1 0 RESUME 0
2 2 LOAD_CONST 1 (<code object <listcomp> at 0x...)
4 MAKE_FUNCTION 0
6 LOAD_FAST 0 (lst)
8 GET_ITER
10 CALL 0
20 RETURN_VALUE
Disassembly of <code object <listcomp> at 0x...>:
2 0 RESUME 0
2 BUILD_LIST 0
4 LOAD_FAST 0 (.0)
>> 6 FOR_ITER 4 (to 18)
10 STORE_FAST 1 (x)
12 LOAD_FAST 1 (x)
14 LIST_APPEND 2
16 JUMP_BACKWARD 6 (to 6)
>> 18 END_FOR
20 RETURN_VALUE
The bytecode for the comprehension is in a separate code object. Each time f() is called, a new single-use function object is allocated (by MAKE_FUNCTION), called (allocating and then destroying a new frame on the Python stack), and then immediately thrown away.
Under this PEP, the compiler will emit the following bytecode for f() instead:
1 0 RESUME 0
2 2 LOAD_FAST 0 (lst)
4 GET_ITER
6 LOAD_FAST_AND_CLEAR 1 (x)
8 SWAP 2
10 BUILD_LIST 0
12 SWAP 2
>> 14 FOR_ITER 4 (to 26)
18 STORE_FAST 1 (x)
20 LOAD_FAST 1 (x)
22 LIST_APPEND 2
24 JUMP_BACKWARD 6 (to 14)
>> 26 END_FOR
28 SWAP 2
30 STORE_FAST 1 (x)
32 RETURN_VALUE
There is no longer a separate code object, nor creation of a single-use function object, nor any need to create and destroy a Python frame.
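To see the difference on a given interpreter, dis can be used directly; this is an illustrative sketch, not part of the PEP:

import dis

def f(lst):
    return [x for x in lst]

# On Python 3.12+ the output shows LOAD_FAST_AND_CLEAR and the loop inlined
# directly into f; on 3.11 and earlier it shows MAKE_FUNCTION and a separate
# <listcomp> code object instead.
dis.dis(f)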
Isolation of the x iteration variable is achieved by the combination of the new LOAD_FAST_AND_CLEAR opcode at offset 6, which saves any outer value of x on the stack before running the comprehension, and the STORE_FAST at offset 30, which restores the outer value of x (if any) after running the comprehension.
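For example (an illustrative sketch), any outer binding of x is saved and restored around the comprehension:

def f(lst):
    x = "outer"
    items = [x for x in lst]
    # The comprehension's x was pushed to the stack and popped back, so the
    # outer binding is untouched.
    assert x == "outer"
    return items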
If the comprehension accesses variables from the outer scope, inlining avoids the need to place these variables in a cell, allowing the comprehension (and all other code in the outer function) to access them as normal fast locals instead. This provides further performance gains.
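As a sketch of this cell-avoidance case (names are illustrative), a variable the comprehension reads from the enclosing function no longer needs to become a cellvar:

def f(lst, n):
    # Before this PEP, n is a cellvar of f (and a freevar of the nested
    # <listcomp>), accessed via LOAD_DEREF. With inlining, both f and the
    # comprehension body access n as a plain fast local.
    return [x * n for x in lst]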
In some cases, the comprehension iteration variable may be a global or cellvar or freevar, rather than a simple function local, in the outer scope. In these cases, the compiler also internally pushes and pops the scope information for the variable when entering/leaving the comprehension, so that semantics are maintained. For example, if the variable is a global outside the comprehension, LOAD_GLOBAL will still be used where it is referenced outside the comprehension, but LOAD_FAST / STORE_FAST will be used within the comprehension. If it is a cellvar/freevar outside the comprehension, the LOAD_FAST_AND_CLEAR / STORE_FAST used to save/restore it do not change (there is no LOAD_DEREF_AND_CLEAR), meaning that the entire cell (not just the value within it) is saved/restored, so the comprehension does not write to the outer cell.
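A hedged sketch of the global case described above:

x = "module-level"

def f(lst):
    before = x                # LOAD_GLOBAL: reads the module-level x
    items = [x for x in lst]  # inside the inlined comprehension, x is a fast local
    after = x                 # LOAD_GLOBAL again; still "module-level"
    return before, items, after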
Comprehensions occurring in module or class scope are also inlined. In this case, the comprehension will introduce usage of fast-locals (LOAD_FAST / STORE_FAST) for the comprehension iteration variable within the comprehension only, in a scope where otherwise only LOAD_NAME / STORE_NAME would be used, maintaining isolation.
In effect, comprehensions introduce a sub-scope where local variables are fully isolated, but without the performance cost or stack frame entry of a call.
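A minimal module-scope sketch of that isolation:

x = "before"
squares = [x * x for x in range(5)]
# The surrounding scope uses LOAD_NAME/STORE_NAME, but the comprehension's x
# is a fast local within the comprehension only, so the module-level x is
# unchanged.
assert x == "before"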
Generator expressions are currently not inlined in the reference implementation of this PEP. In the future, some generator expressions may be inlined, where the returned generator object does not leak.
Asynchronous comprehensions are inlined the same as synchronous ones; no special handling is needed.
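For example (a minimal sketch), an asynchronous comprehension is compiled the same way:

import asyncio

async def aiter_items():
    for i in range(3):
        yield i

async def f():
    # Inlined exactly like a synchronous list comprehension.
    return [x async for x in aiter_items()]

print(asyncio.run(f()))  # [0, 1, 2]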
Backwards Compatibility
Comprehension inlining will cause the following visible behavior changes. No changes in the standard library or test suite were necessary to adapt to these changes in the implementation, suggesting the impact in user code is likely to be minimal.
Specialized tools depending on undocumented details of compiler bytecode output may of course be affected in ways beyond the below, but these tools already must adapt to bytecode changes in each Python version.
locals() includes outer variables
Calling locals() within a comprehension will include all locals of the function containing the comprehension. E.g. given the following function:

def f(lst):
    return [locals() for x in lst]
Calling f([1]) in current Python will return:

[{'.0': <list_iterator object at 0x7f8d37170460>, 'x': 1}]

where .0 is an internal implementation detail: the synthetic sole argument to the comprehension “function”.

Under this PEP, it will instead return:

[{'lst': [1], 'x': 1}]

This now includes the outer lst variable as a local, and eliminates the synthetic .0.
No comprehension frame in tracebacks
Under this PEP, a comprehension will no longer have its own dedicated frame in a stack trace. For example, given this function:
def g():
    raise RuntimeError("boom")

def f():
    return [g() for x in [1]]
Currently, calling f() results in the following traceback:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in f
  File "<stdin>", line 5, in <listcomp>
  File "<stdin>", line 2, in g
RuntimeError: boom
Note the dedicated frame for <listcomp>.
Under this PEP, the traceback looks like this instead:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in f
  File "<stdin>", line 2, in g
RuntimeError: boom
There is no longer an extra frame for the list comprehension. The frame for the f function has the correct line number for the comprehension, however, so this simply makes the traceback more compact without losing any useful information.
It is theoretically possible that code using warnings with the stacklevel
argument could observe a behavior change due to the frame stack change. In
practice, however, this seems unlikely. It would require a warning raised in
library code that is always called through a comprehension in that same
library, where the warning is using a stacklevel
of 3+ to bypass the
comprehension and its containing function and point to a calling frame outside
the library. In such a scenario it would usually be simpler and more reliable
to raise the warning closer to the calling code and bypass fewer frames.
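A hypothetical sketch of that scenario (the helper names are illustrative only):

import warnings

def _check(value):
    # Under the nested-function compilation, the frames above this warn()
    # call are: 1 = _check, 2 = the <listcomp> frame, 3 = api, 4 = the
    # caller of api(); stacklevel=4 points the warning at that caller.
    warnings.warn("suspicious value", stacklevel=4)
    return value

def api(lst):
    return [_check(v) for v in lst]

# With the comprehension inlined, the <listcomp> frame no longer exists, so
# the same stacklevel=4 would point one frame beyond the caller of api().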
Tracing/profiling will no longer show a call/return for the comprehension
Naturally, since list/dict/set comprehensions will no longer be implemented as a call to a nested function, tracing/profiling using sys.settrace or sys.setprofile will also no longer reflect that a call and return have occurred.
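An illustrative sketch using sys.settrace:

import sys

def f(lst):
    return [x for x in lst]

called = []

def tracer(frame, event, arg):
    if event == "call":
        called.append(frame.f_code.co_name)
    return tracer

sys.settrace(tracer)
f([1, 2, 3])
sys.settrace(None)

# Before this PEP: called == ['f', '<listcomp>'].
# With inlined comprehensions: called == ['f'].
print(called)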
Impact on other Python implementations
Per comments from representatives of GraalPython and PyPy, they would likely feel the need to adapt to the observable behavior changes here, given the likelihood that someone, at some point, will depend on them. Thus, all else equal, fewer observable changes would be less work. But these changes (at least in the case of GraalPython) should be manageable “without much headache”.
How to Teach This
It is not intuitively obvious that comprehension syntax will or should result in creation and call of a nested function. For new users not already accustomed to the prior behavior, I suspect the new behavior in this PEP will be more intuitive and require less explanation. (“Why is there a <listcomp> line in my traceback when I didn’t define any such function? What is this .0 variable I see in locals()?”)
Security Implications
None known.
Reference Implementation
This PEP has a reference implementation in the form of a PR against the CPython main branch which passes all tests.
The reference implementation performs the micro-benchmark ./python -m pyperf timeit -s 'l = [1]' '[x for x in l]' 1.96x faster than the main branch (in a build compiled with --enable-optimizations).
The reference implementation performs the comprehensions benchmark in the pyperformance benchmark suite (which is not a micro-benchmark of comprehensions alone, but tests real-world-derived code doing realistic work using comprehensions) 11% faster than the main branch (again in optimized builds). Other benchmarks in pyperformance (none of which use comprehensions heavily) don’t show any impact outside the noise.
The implementation has no impact on non-comprehension code.
Rejected Ideas
More efficient comprehension calling, without inlining
An alternate approach introduces a new opcode for “calling” a comprehension in streamlined fashion without the need to create a throwaway function object, but still creating a new Python frame. This avoids all of the visible effects listed under Backwards Compatibility, and provides roughly half of the performance benefit (1.5x improvement on the microbenchmark, 4% improvement on the comprehensions benchmark in pyperformance). It also requires adding a new pointer to the _PyInterpreterFrame struct and a new Py_INCREF on each frame construction, meaning (unlike this PEP) it has a (very small) performance cost for all code. It also provides less scope for future optimizations.
This PEP takes the position that full inlining offers sufficient additional performance to more than justify the behavior changes.
Copyright
This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.