PEP: 709 Title: Inlined comprehensions Author: Carl Meyer
<carl@oddbird.net> Sponsor: Guido van Rossum <guido@python.org>
Discussions-To:
https://discuss.python.org/t/pep-709-inlined-comprehensions/24240
Status: Final Type: Standards Track Created: 24-Feb-2023 Python-Version:
3.12 Post-History: 25-Feb-2023 Resolution:
https://discuss.python.org/t/pep-709-inlined-comprehensions/24240/36

Abstract

Comprehensions are currently compiled as nested functions, which
provides isolation of the comprehension's iteration variable, but is
inefficient at runtime. This PEP proposes to inline list, dictionary,
and set comprehensions into the code where they are defined, and provide
the expected isolation by pushing/popping clashing locals on the stack.
This change makes comprehensions much faster: up to 2x faster for a
microbenchmark of a comprehension alone, translating to an 11% speedup
for one sample benchmark derived from real-world code that makes heavy
use of comprehensions in the context of doing actual work.

Motivation

Comprehensions are a popular and widely-used feature of the Python
language. The nested-function compilation of comprehensions optimizes
for compiler simplicity at the expense of performance of user code. It
is possible to provide near-identical semantics (see Backwards
Compatibility) with much better runtime performance for all users of
comprehensions, with only a small increase in compiler complexity.

Rationale

Inlining is a common compiler optimization in many languages.
Generalized inlining of function calls at compile time in Python is
near-impossible, since call targets may be patched at runtime.
Comprehensions are a special case, where we have a call target known
statically in the compiler that can neither be patched (barring
undocumented and unsupported fiddling with bytecode directly) nor
escape.

Inlining also permits other compiler optimizations of bytecode to be
more effective, because they can now "see through" the comprehension
bytecode, instead of it being an opaque call.

Normally a performance improvement would not require a PEP. In this
case, the simplest and most efficient implementation results in some
user-visible effects, so this is not just a performance improvement, it
is a (small) change to the language.

Specification

Given a simple comprehension:

    def f(lst):
        return [x for x in lst]

The compiler currently emits the following bytecode for the function f:

    1           0 RESUME                   0

    2           2 LOAD_CONST               1 (<code object <listcomp> at 0x...)
                4 MAKE_FUNCTION            0
                6 LOAD_FAST                0 (lst)
                8 GET_ITER
               10 CALL                     0
               20 RETURN_VALUE

    Disassembly of <code object <listcomp> at 0x...>:
    2           0 RESUME                   0
                2 BUILD_LIST               0
                4 LOAD_FAST                0 (.0)
          >>    6 FOR_ITER                 4 (to 18)
               10 STORE_FAST               1 (x)
               12 LOAD_FAST                1 (x)
               14 LIST_APPEND              2
               16 JUMP_BACKWARD            6 (to 6)
          >>   18 END_FOR
               20 RETURN_VALUE

The bytecode for the comprehension is in a separate code object. Each
time f() is called, a new single-use function object is allocated (by
MAKE_FUNCTION), called (allocating and then destroying a new frame on
the Python stack), and then immediately thrown away.

Under this PEP, the compiler will emit the following bytecode for f()
instead:

    1           0 RESUME                   0

    2           2 LOAD_FAST                0 (lst)
                4 GET_ITER
                6 LOAD_FAST_AND_CLEAR      1 (x)
                8 SWAP                     2
               10 BUILD_LIST               0
               12 SWAP                     2
          >>   14 FOR_ITER                 4 (to 26)
               18 STORE_FAST               1 (x)
               20 LOAD_FAST                1 (x)
               22 LIST_APPEND              2
               24 JUMP_BACKWARD            6 (to 14)
          >>   26 END_FOR
               28 SWAP                     2
               30 STORE_FAST               1 (x)
               32 RETURN_VALUE

There is no longer a separate code object, nor creation of a single-use
function object, nor any need to create and destroy a Python frame.

Isolation of the x iteration variable is achieved by the combination of
the new LOAD_FAST_AND_CLEAR opcode at offset 6, which saves any outer
value of x on the stack before running the comprehension, and
30 STORE_FAST, which restores the outer value of x (if any) after
running the comprehension.

If the comprehension accesses variables from the outer scope, inlining
avoids the need to place these variables in a cell, allowing the
comprehension (and all other code in the outer function) to access them
as normal fast locals instead. This provides further performance gains.

In some cases, the comprehension iteration variable may be a global or
cellvar or freevar, rather than a simple function local, in the outer
scope. In these cases, the compiler also internally pushes and pops the
scope information for the variable when entering/leaving the
comprehension, so that semantics are maintained. For example, if the
variable is a global outside the comprehension, LOAD_GLOBAL will still
be used where it is referenced outside the comprehension, but LOAD_FAST
/ STORE_FAST will be used within the comprehension. If it is a
cellvar/freevar outside the comprehension, the LOAD_FAST_AND_CLEAR /
STORE_FAST used to save/restore it do not change (there is no
LOAD_DEREF_AND_CLEAR), meaning that the entire cell (not just the value
within it) is saved/restored, so the comprehension does not write to the
outer cell.

Comprehensions occurring in module or class scope are also inlined. In
this case, the comprehension will introduce usage of fast-locals
(LOAD_FAST / STORE_FAST) for the comprehension iteration variable within
the comprehension only, in a scope where otherwise only LOAD_NAME /
STORE_NAME would be used, maintaining isolation.

In effect, comprehensions introduce a sub-scope where local variables
are fully isolated, but without the performance cost or stack frame
entry of a call.

Generator expressions are currently not inlined in the reference
implementation of this PEP. In the future, some generator expressions
may be inlined, where the returned generator object does not leak.

Asynchronous comprehensions are inlined the same as synchronous ones; no
special handling is needed.

Backwards Compatibility

Comprehension inlining will cause the following visible behavior
changes. No changes in the standard library or test suite were necessary
to adapt to these changes in the implementation, suggesting the impact
in user code is likely to be minimal.

Specialized tools depending on undocumented details of compiler bytecode
output may of course be affected in ways beyond the below, but these
tools already must adapt to bytecode changes in each Python version.

locals() includes outer variables

Calling locals() within a comprehension will include all locals of the
function containing the comprehension. E.g. given the following
function:

    def f(lst):
        return [locals() for x in lst]

Calling f([1]) in current Python will return:

    [{'.0': <list_iterator object at 0x7f8d37170460>, 'x': 1}]

where .0 is an internal implementation detail: the synthetic sole
argument to the comprehension "function".

Under this PEP, it will instead return:

    [{'lst': [1], 'x': 1}]

This now includes the outer lst variable as a local, and eliminates the
synthetic .0.

No comprehension frame in tracebacks

Under this PEP, a comprehension will no longer have its own dedicated
frame in a stack trace. For example, given this function:

    def g():
        raise RuntimeError("boom")

    def f():
        return [g() for x in [1]]

Currently, calling f() results in the following traceback:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 5, in f
      File "<stdin>", line 5, in <listcomp>
      File "<stdin>", line 2, in g
    RuntimeError: boom

Note the dedicated frame for <listcomp>.

Under this PEP, the traceback looks like this instead:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 5, in f
      File "<stdin>", line 2, in g
    RuntimeError: boom

There is no longer an extra frame for the list comprehension. The frame
for the f function has the correct line number for the comprehension,
however, so this simply makes the traceback more compact without losing
any useful information.

It is theoretically possible that code using warnings with the
stacklevel argument could observe a behavior change due to the frame
stack change. In practice, however, this seems unlikely. It would
require a warning raised in library code that is always called through a
comprehension in that same library, where the warning is using a
stacklevel of 3+ to bypass the comprehension and its containing function
and point to a calling frame outside the library. In such a scenario it
would usually be simpler and more reliable to raise the warning closer
to the calling code and bypass fewer frames.

Tracing/profiling will no longer show a call/return for the comprehension

Naturally, since list/dict/set comprehensions will no longer be
implemented as a call to a nested function, tracing/profiling using
sys.settrace or sys.setprofile will also no longer reflect that a call
and return have occurred.

Impact on other Python implementations

Per comments from representatives of GraalPython and PyPy, they would
likely feel the need to adapt to the observable behavior changes here,
given the likelihood that someone, at some point, will depend on them.
Thus, all else equal, fewer observable changes would be less work. But
these changes (at least in the case of GraalPython) should be manageable
"without much headache".

How to Teach This

It is not intuitively obvious that comprehension syntax will or should
result in creation and call of a nested function. For new users not
already accustomed to the prior behavior, I suspect the new behavior in
this PEP will be more intuitive and require less explanation. ("Why is
there a <listcomp> line in my traceback when I didn't define any such
function? What is this .0 variable I see in locals()?")

Security Implications

None known.

Reference Implementation

This PEP has a reference implementation in the form of a PR against the
CPython main branch which passes all tests.

The reference implementation performs the micro-benchmark
./python -m pyperf timeit -s 'l = [1]' '[x for x in l]' 1.96x faster
than the main branch (in a build compiled with --enable-optimizations.)

The reference implementation performs the comprehensions benchmark in
the pyperformance benchmark suite (which is not a micro-benchmark of
comprehensions alone, but tests real-world-derived code doing realistic
work using comprehensions) 11% faster than main branch (again in
optimized builds). Other benchmarks in pyperformance (none of which use
comprehensions heavily) don't show any impact outside the noise.

The implementation has no impact on non-comprehension code.

Rejected Ideas

More efficient comprehension calling, without inlining

An alternate approach introduces a new opcode for "calling" a
comprehension in streamlined fashion without the need to create a
throwaway function object, but still creating a new Python frame. This
avoids all of the visible effects listed under Backwards Compatibility,
and provides roughly half of the performance benefit (1.5x improvement
on the microbenchmark, 4% improvement on comprehensions benchmark in
pyperformance.) It also requires adding a new pointer to the
_PyInterpreterFrame struct and a new Py_INCREF on each frame
construction, meaning (unlike this PEP) it has a (very small)
performance cost for all code. It also provides less scope for future
optimizations.

This PEP takes the position that full inlining offers sufficient
additional performance to more than justify the behavior changes.

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.