PEP: 659
Title: Specializing Adaptive Interpreter
Author: Mark Shannon <mark@hotpy.org>
Status: Draft
Type: Informational
Content-Type: text/x-rst
Created: 13-Apr-2021
Post-History: 11-May-2021

Abstract

In order to perform well, virtual machines for dynamic languages must
specialize the code that they execute to the types and values in the
program being run. This specialization is often associated with "JIT"
compilers, but is beneficial even without machine code generation.

A specializing, adaptive interpreter is one that speculatively
specializes on the types or values it is currently operating on, and
adapts to changes in those types and values.

Specialization gives us improved performance, and adaptation allows the
interpreter to rapidly change when the pattern of usage in a program
alters, limiting the amount of additional work caused by
mis-specialization.

This PEP proposes using a specializing, adaptive interpreter that
specializes code aggressively, but over a very small region, and is able
to adjust to mis-specialization rapidly and at low cost.

Adding a specializing, adaptive interpreter to CPython will bring
significant performance improvements. It is hard to come up with
meaningful numbers, as they depend very much on the benchmarks and on
work that has not yet happened. Extensive experimentation suggests
speedups of up to 50%. Even if the speedup were only 25%, this would
still be a worthwhile enhancement.

Motivation

Python is widely acknowledged as slow. Whilst Python will never attain
the performance of low-level languages like C, Fortran, or even Java, we
would like it to be competitive with fast implementations of scripting
languages, like V8 for JavaScript or LuaJIT for Lua. Specifically, we
want to achieve these performance goals with CPython to benefit all
users of Python including those unable to use PyPy or other alternative
virtual machines.

Achieving these performance goals is a long way off, and will require a
lot of engineering effort, but we can make a significant step towards
those goals by speeding up the interpreter. Both academic research and
practical implementations have shown that a fast interpreter is a key
part of a fast virtual machine.

Typical optimizations for virtual machines are expensive, so a long
"warm up" time is required to gain confidence that the cost of
optimization is justified. In order to get speed-ups rapidly, without
noticeable warmup times, the VM should speculate that specialization is
justified even after a few executions of a function. To do that
effectively, the interpreter must be able to optimize and de-optimize
continually and very cheaply.

By using adaptive and speculative specialization at the granularity of
individual virtual machine instructions, we get a faster interpreter
that also generates profiling information for more sophisticated
optimizations in the future.

Rationale

There are many practical ways to speed up a virtual machine for a
dynamic language. However, specialization is the most important, both in
itself and as an enabler of other optimizations. Therefore it makes
sense to focus our efforts on specialization first, if we want to
improve the performance of CPython.

Specialization is typically done in the context of a JIT compiler, but
research shows specialization in an interpreter can boost performance
significantly, even outperforming a naive compiler [1].

Several ways of doing this have been proposed in the academic
literature, but most attempt to optimize regions larger than a single
bytecode [1][2]. Using larger regions than a single instruction requires
code to handle de-optimization in the middle of a region. Specialization
at the level of individual bytecodes makes de-optimization trivial, as
it cannot occur in the middle of a region.

By speculatively specializing individual bytecodes, we can gain
significant performance improvements without anything but the most
local, and trivial to implement, de-optimizations.

The closest approach to this PEP in the literature is "Inline Caching
meets Quickening" [3]. This PEP has the advantages of inline caching,
but adds the ability to de-optimize quickly, making performance more
robust in cases where specialization fails or is not stable.

Performance

The speedup from specialization is hard to determine, as many
specializations depend on other optimizations. Speedups seem to be in
the range 10% - 60%.

-   Most of the speedup comes directly from specialization. The largest
    contributors are speedups to attribute lookup, global variables, and
    calls.
-   A small, but useful, fraction is from improved dispatch such as
    super-instructions and other optimizations enabled by quickening.

Implementation

Overview

Any instruction that would benefit from specialization will be replaced
by an "adaptive" form of that instruction. When executed, the adaptive
instructions will specialize themselves in response to the types and
values that they see. This process is known as "quickening".

Once an instruction in a code object has executed enough times, that
instruction will be "specialized" by replacing it with a new instruction
that is expected to execute faster for that operation.

Quickening

Quickening is the process of replacing slow instructions with faster
variants.

Quickened code has a number of advantages over immutable bytecode:

-   It can be changed at runtime.
-   It can use super-instructions that span lines and take multiple
    operands.
-   It does not need to handle tracing as it can fall back to the
    original bytecode for that.

So that tracing can be supported, the quickened instruction format
should match the immutable, user-visible bytecode format: 16-bit
instructions consisting of an 8-bit opcode followed by an 8-bit
operand.
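
For illustration only, a minimal sketch of that format in Python; the
function names are invented and the byte order is arbitrary:

    def pack_instruction(opcode: int, operand: int) -> int:
        # Pack an 8-bit opcode and an 8-bit operand into one 16-bit
        # code unit. The byte order here is illustrative only.
        assert 0 <= opcode < 256 and 0 <= operand < 256
        return (opcode << 8) | operand

    def unpack_instruction(unit: int) -> tuple[int, int]:
        # Split a 16-bit code unit back into (opcode, operand).
        return unit >> 8, unit & 0xFF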

Adaptive instructions

Each instruction that would benefit from specialization is replaced by
an adaptive version during quickening. For example, the LOAD_ATTR
instruction would be replaced with LOAD_ATTR_ADAPTIVE.

Each adaptive instruction periodically attempts to specialize itself.
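
As a minimal sketch, in Python rather than the C of the real
interpreter, an adaptive instruction's behavior might look like the
following; all names here (Instruction, ADAPTIVE_DELAY,
try_specialize) are invented for illustration:

    ADAPTIVE_DELAY = 53  # executions between attempts; value is illustrative

    class Instruction:
        def __init__(self, opcode, operand):
            self.opcode = opcode
            self.operand = operand
            self.counter = ADAPTIVE_DELAY

    def execute_adaptive(instr, try_specialize, execute_generic):
        # Count down; when the counter reaches zero, attempt to rewrite
        # this instruction in place with a specialized variant.
        if instr.counter == 0:
            try_specialize(instr)           # may overwrite instr.opcode
            instr.counter = ADAPTIVE_DELAY  # back off before retrying
        else:
            instr.counter -= 1
        execute_generic(instr)              # the operation itself still runs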

Specialization

CPython bytecode contains many instructions that represent high-level
operations, and would benefit from specialization. Examples include
CALL, LOAD_ATTR, LOAD_GLOBAL and BINARY_ADD.

Introducing a "family" of specialized instructions for each of these
instructions allows effective specialization, since each new
instruction is specialized to a single task. Each family will include
an "adaptive" instruction that maintains a counter and attempts to
specialize itself when that counter reaches zero.

Each family will also include one or more specialized instructions that
perform the equivalent of the generic operation much faster provided
their inputs are as expected. Each specialized instruction will maintain
a saturating counter which will be incremented whenever the inputs are
as expected. Should the inputs not be as expected, the counter will be
decremented and the generic operation will be performed. If the counter
reaches the minimum value, the instruction is de-optimized by simply
replacing its opcode with the adaptive version.
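
A sketch of this counter discipline, again in illustrative Python
(SATURATION_MAX, the counter field, and the opcode string are invented
names; CPython implements this in C):

    SATURATION_MAX = 7  # upper bound of the saturating counter; illustrative

    def execute_specialized(instr, inputs_as_expected, fast_path, generic):
        if inputs_as_expected(instr):
            # Hit: saturating increment, then take the fast path.
            instr.counter = min(instr.counter + 1, SATURATION_MAX)
            return fast_path(instr)
        # Miss: decrement and perform the generic operation instead.
        instr.counter -= 1
        if instr.counter <= 0:
            # De-optimize by simply restoring the adaptive opcode.
            instr.opcode = "LOAD_ATTR_ADAPTIVE"
        return generic(instr)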

Ancillary data

Most families of specialized instructions will require more
information than can fit in an 8-bit operand. To support this, a
number of 16-bit entries immediately following the instruction are
used to store this data. This is a form of inline cache, an "inline
data cache". Unspecialized, or adaptive, instructions will use the
first entry of this cache as a counter, and simply skip over the
others.
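
Schematically, for a hypothetical family whose instructions own two
cache entries (the layout and names below are illustrative, not
CPython's actual representation):

    # Each element stands for one 16-bit unit in the instruction stream.
    code = [
        ("LOAD_ATTR_ADAPTIVE", 0),  # the instruction itself
        ("CACHE", 53),              # entry 0: counter for the adaptive form
        ("CACHE", 0),               # entry 1: e.g. a type version, once specialized
        ("LOAD_FAST", 1),           # the next real instruction
    ]

    def next_instruction(offset, cache_entries=2):
        # The interpreter advances past the cache entries it owns.
        return offset + 1 + cache_entries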

Example families of instructions

LOAD_ATTR

The LOAD_ATTR instruction loads the named attribute of the object on top
of the stack, then replaces the object on top of the stack with the
attribute.

This is an obvious candidate for specialization. Attributes might belong
to a normal instance, a class, a module, or one of many other special
cases.

LOAD_ATTR would initially be quickened to LOAD_ATTR_ADAPTIVE which would
track how often it is executed, and call the _Py_Specialize_LoadAttr
internal function when executed enough times, or jump to the original
LOAD_ATTR instruction to perform the load. When optimizing, the kind
of attribute would be examined, and if a suitable specialized
instruction was found, it would replace LOAD_ATTR_ADAPTIVE in place.

Specialization for LOAD_ATTR might include:

-   LOAD_ATTR_INSTANCE_VALUE A common case where the attribute is stored
    in the object's value array, and not shadowed by an overriding
    descriptor.
-   LOAD_ATTR_MODULE Load an attribute from a module.
-   LOAD_ATTR_SLOT Load an attribute from an object whose class defines
    __slots__.

Note how this allows optimizations that complement other optimizations.
LOAD_ATTR_INSTANCE_VALUE works well with the "lazy dictionary" used
for many objects.
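
As a rough sketch of the LOAD_ATTR_INSTANCE_VALUE fast path, in
illustrative Python (AttrCache, value_array and the class_version
argument are invented stand-ins for CPython's C-level class version
tag and per-object value array):

    class AttrCache:
        def __init__(self, expected_version, index):
            self.expected_version = expected_version  # class version at specialization
            self.index = index                        # slot in the value array

    def load_attr_instance_value(obj, name, cache, class_version):
        if class_version != cache.expected_version:
            # The class has changed since specialization: de-optimize
            # and perform the generic attribute lookup instead.
            return getattr(obj, name)
        # Fast path: a direct indexed load, with no dictionary lookup.
        return obj.value_array[cache.index]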

LOAD_GLOBAL

The LOAD_GLOBAL instruction looks up a name in the global namespace
and then, if it is not present there, looks it up in the builtins
namespace. In 3.9 the C code for LOAD_GLOBAL includes code to check
whether the whole code object should be modified to add a cache, code
to check whether the global or builtins namespace has changed, code to
look up the value in a cache, and fallback code. This makes it
complicated and bulky. It also performs many redundant operations even
when supposedly optimized.

Using a family of instructions makes the code more maintainable and
faster, as each instruction only needs to handle one concern.

Specializations would include:

-   LOAD_GLOBAL_ADAPTIVE would operate like LOAD_ATTR_ADAPTIVE above.
-   LOAD_GLOBAL_MODULE can be specialized for the case where the value
    is in the globals namespace. After checking that the keys of the
    namespace have not changed, it can load the value from the stored
    index.
-   LOAD_GLOBAL_BUILTIN can be specialized for the case where the value
    is in the builtins namespace. It needs to check that no keys have
    been added to the global namespace, and that the builtins namespace
    has not changed. Note that we don't care if the values in the
    global namespace have changed, just the keys.
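
As a rough sketch of the LOAD_GLOBAL_BUILTIN fast path, in
illustrative Python (GlobalCache and keys_version are invented
stand-ins for CPython's internal dict keys-version tags):

    class GlobalCache:
        def __init__(self, globals_keys_version, builtins_keys_version):
            self.globals_keys_version = globals_keys_version
            self.builtins_keys_version = builtins_keys_version

    def keys_version(d):
        # Stand-in for an internal tag that changes whenever keys are
        # added to or removed from the dict.
        return getattr(d, "_keys_version", 0)

    def load_global_builtin(globals_dict, builtins_dict, name, cache):
        if (keys_version(globals_dict) != cache.globals_keys_version
                or keys_version(builtins_dict) != cache.builtins_keys_version):
            # A new global may now shadow the builtin: de-optimize.
            return globals_dict.get(name, builtins_dict.get(name))
        # Fast path: the name cannot be in globals, and the builtins
        # namespace is unchanged, so load the value directly.
        return builtins_dict[name]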

See [4] for a full implementation.

Note

This PEP outlines the mechanisms for managing specialization, and does
not specify the particular optimizations to be applied. It is likely
that details, or even the entire implementation, may change as the code
is further developed.

Compatibility

There will be no change to the language, library or API.

The only way that users will be able to detect the presence of the new
interpreter is through timing execution, the use of debugging tools, or
measuring memory use.

Costs

Memory use

An obvious concern with any scheme that performs any sort of caching is
"how much more memory does it use?". The short answer is "not that
much".

Comparing memory use to 3.10

CPython 3.10 uses 2 bytes per instruction until the execution count
reaches ~2000, at which point it allocates another byte per
instruction plus 32 bytes per instruction for those with a cache
(LOAD_GLOBAL and LOAD_ATTR).

The following table shows the additional bytes per instruction needed
to support the 3.10 opcache or the proposed adaptive interpreter, on a
64-bit machine.

+----------------+------------+------------+------------+
| Version        | 3.10 cold  | 3.10 hot   | 3.11       |
+================+============+============+============+
| Specialised    | 0%         | ~15%       | ~25%       |
+----------------+------------+------------+------------+
| code           | 2          | 2          | 2          |
+----------------+------------+------------+------------+
| opcache_map    | 0          | 1          | 0          |
+----------------+------------+------------+------------+
| opcache/data   | 0          | 4.8        | 4          |
+----------------+------------+------------+------------+
| Total          | 2          | 7.8        | 6          |
+----------------+------------+------------+------------+

3.10 cold is before the code has reached the ~2000 execution
threshold; 3.10 hot shows the cache use once the threshold is reached.

The relative memory use depends on how much code is "hot" enough to
trigger creation of the cache in 3.10. The break-even point, where
3.10 and 3.11 use the same amount of memory, is when ~70% of the code
is hot: 0.3 × 2 + 0.7 × 7.8 ≈ 6 bytes per instruction.

It is also worth noting that the actual bytecode is only part of a code
object. Code objects also include names, constants and quite a lot of
debugging information.

In summary, for most applications where many of the functions are
relatively unused, 3.11 will consume more memory than 3.10, but not by
much.

Security Implications

None

Rejected Ideas

By implementing a specializing adaptive interpreter with inline data
caches, we are implicitly rejecting many alternative ways to optimize
CPython. However, it is worth emphasizing that some ideas, such as
just-in-time compilation, have not been rejected, merely deferred.

Storing data caches before the bytecode

An earlier implementation of this PEP for 3.11 alpha used a different
caching scheme as described below:

  Quickened instructions will be stored in an array (it is neither
  necessary nor desirable to store them in a Python object) with the
  same format as the original bytecode. Ancillary data will be stored
  in a separate array.

  Each instruction will use 0 or more data entries. Each instruction
  within a family must have the same amount of data allocated, although
  some instructions may not use all of it. Instructions that cannot be
  specialized, e.g. POP_TOP, do not need any entries. Experiments show
  that 25% to 30% of instructions can be usefully specialized. Different
  families will need different amounts of data, but most need 2 entries
  (16 bytes on a 64-bit machine).

  In order to support larger functions than 256 instructions, we compute
  the offset of the first data entry for instructions as
  (instruction offset)//2 + (quickened operand).

  Compared to the opcache in Python 3.10, this design:

  -   is faster; it requires no memory reads to compute the offset.
      3.10 requires two reads, which are dependent.
  -   uses much less memory, as the data can be different sizes for
      different instruction families, and doesn't need an additional
      array of offsets.
  -   can support much larger functions, up to about 5000 instructions
      per function. 3.10 can support about 1000.
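
As a worked sketch of that scheme's offset computation (the function
name is invented for illustration):

    def first_data_entry(instruction_offset: int, quickened_operand: int) -> int:
        # Offset, in 16-bit entries, of an instruction's first data
        # entry under the rejected scheme. The 8-bit quickened operand
        # extends the reach of the halved instruction offset.
        return instruction_offset // 2 + quickened_operand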

We rejected this scheme as the inline cache approach is both faster and
simpler.

References

[1] The construction of high-performance virtual machines for dynamic
languages, Mark Shannon 2011.
https://theses.gla.ac.uk/2975/1/2011shannonphd.pdf

[2] Dynamic Interpretation for Dynamic Scripting Languages
https://www.scss.tcd.ie/publications/tech-reports/reports.09/TCD-CS-2009-37.pdf

[3] Inline Caching meets Quickening
https://www.unibw.de/ucsrl/pubs/ecoop10.pdf/view

[4] The adaptive and specialized instructions are implemented in
https://github.com/python/cpython/blob/main/Python/ceval.c

The optimizations are implemented in
https://github.com/python/cpython/blob/main/Python/specialize.c

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.