PEP: 744
Title: JIT Compilation
Author: Brandt Bucher <brandt@python.org>,
        Savannah Ostrowski <savannahostrowski@gmail.com>
Discussions-To: https://discuss.python.org/t/pep-744-jit-compilation/50756
Status: Draft
Type: Informational
Created: 11-Apr-2024
Python-Version: 3.13
Post-History: 11-Apr-2024

Abstract

Earlier this year, an experimental "just-in-time" compiler was merged
into CPython's main development branch. While recent CPython releases
have included other substantial internal changes, this addition
represents a particularly significant departure from the way CPython has
traditionally executed Python code. As such, it deserves wider
discussion.

This PEP aims to summarize the design decisions behind this addition,
the current state of the implementation, and future plans for making the
JIT a permanent, non-experimental part of CPython. It does not seek to
provide a comprehensive overview of how the JIT works, instead focusing
on the particular advantages and disadvantages of the chosen approach,
as well as answering many questions that have been asked about the JIT
since its introduction.

Readers interested in learning more about the new JIT are encouraged to
consult the following resources:

-   The presentation which first introduced the JIT at the 2023 CPython
    Core Developer Sprint. It includes relevant background, a light
    technical introduction to the "copy-and-patch" technique used, and
    an open discussion of its design amongst the core developers
    present. Slides for this talk can be found on GitHub.
-   The open access paper originally describing copy-and-patch.
-   The blog post by the paper's author detailing the implementation of
    a copy-and-patch JIT compiler for Lua. While this is a great
    low-level explanation of the approach, note that it also
    incorporates other techniques and makes implementation decisions
    that are not particularly relevant to CPython's JIT.
-   The implementation itself.

Motivation

Until this point, CPython has always executed Python code by compiling
it to bytecode, which is interpreted at runtime. This bytecode is a
more-or-less direct translation of the source code: it is untyped, and
largely unoptimized.

Since the Python 3.11 release, CPython has used a "specializing adaptive
interpreter" (PEP 659), which rewrites these bytecode instructions
in-place with type-specialized versions as they run. This new
interpreter delivers significant performance improvements, despite the
fact that its optimization potential is limited by the boundaries of
individual bytecode instructions. It also collects a wealth of new
profiling information: the types flowing through a program, the memory
layout of particular objects, and what paths through the program are
being executed the most. In other words, what to optimize, and how to
optimize it.

Since the Python 3.12 release, CPython has generated this interpreter
from a C-like domain-specific language (DSL). In addition to taming some
of the complexity of the new adaptive interpreter, the DSL also allows
CPython's maintainers to avoid hand-writing tedious boilerplate code in
many parts of the interpreter, compiler, and standard library that must
be kept in sync with the instruction definitions. This ability to
generate large amounts of runtime infrastructure from a single source of
truth is not only convenient for maintenance; it also unlocks many
possibilities for expanding CPython's execution in new ways. For
instance, it makes it feasible to automatically generate tables for
translating a sequence of instructions into an equivalent sequence of
smaller "micro-ops", generate an optimizer for sequences of these
micro-ops, and even generate an entire second interpreter for executing
them.

In fact, since early in the Python 3.13 release cycle, all CPython
builds have included this exact micro-op translation, optimization, and
execution machinery. However, it is disabled by default; the overhead of
interpreting even optimized traces of micro-ops is just too large for
most code. Heavier optimization probably won't improve the situation
much either, since any efficiency gains made by new optimizations will
likely be offset by the interpretive overhead of even smaller, more
complex micro-ops.

The most obvious strategy to overcome this new bottleneck is to
statically compile these optimized traces. This presents opportunities
to avoid several sources of indirection and overhead introduced by
interpretation. In particular, it allows the removal of dispatch
overhead between micro-ops (by replacing a generic interpreter with a
straight-line sequence of hot code), instruction decoding overhead for
individual micro-ops (by "burning" the values or addresses of arguments,
constants, and cached values directly into machine instructions), and
memory traffic (by moving data off of heap-allocated Python frames and
into physical hardware registers).
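
To make these savings concrete, the following sketch (illustrative C
only, not CPython code; the third saving, keeping data in registers,
isn't captured here) contrasts a generic dispatch loop with the same
two-step trace written as straight-line code whose operands have been
burned in:

    #include <stdio.h>

    enum uop { LOAD_CONST, ADD, DONE };
    struct instr { enum uop op; int arg; };

    /* Interpreted: every step pays a dispatch cost (the switch) and a
       decode cost (reading ip->arg) before doing any real work. */
    static int
    interpret(const struct instr *ip)
    {
        int accumulator = 0;
        for (;;) {
            switch (ip->op) {
                case LOAD_CONST: accumulator = ip->arg; ip++; break;
                case ADD:        accumulator += ip->arg; ip++; break;
                case DONE:       return accumulator;
            }
        }
    }

    /* "Compiled": the same trace as straight-line code, with the
       operands burned directly into the instructions. */
    static int
    compiled(void)
    {
        int accumulator = 3;
        accumulator += 4;
        return accumulator;
    }

    int
    main(void)
    {
        const struct instr trace[] = {{LOAD_CONST, 3}, {ADD, 4}, {DONE, 0}};
        printf("%d %d\n", interpret(trace), compiled());
        return 0;
    }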

Since much of this data varies even between identical runs of a program,
and since the existing optimization pipeline makes heavy use of runtime
profiling information, it doesn't make much sense to compile these
traces ahead of time; doing so would also require a substantial redesign
of the existing specialization and micro-op tracing infrastructure that
has already been implemented. As has been demonstrated for many other
dynamic languages (and even Python itself), the most promising approach
is to compile the optimized micro-ops "just in time" for execution.

Rationale

Despite their reputation, JIT compilers are not magic "go faster"
machines. Developing and maintaining any sort of optimizing compiler for
even a single platform, let alone all of CPython's most popular
supported platforms, is an incredibly complicated, expensive task. Using
an existing compiler framework like LLVM can make this task simpler, but
only at the cost of introducing heavy runtime dependencies and
significantly higher JIT compilation overhead.

It's clear that successfully compiling Python code at runtime requires
not only high-quality Python-specific optimizations for the code being
run, but also quick generation of efficient machine code for the
optimized program. The Python core development team has the necessary
skills and experience for the former (a middle-end tightly coupled to
the interpreter), and copy-and-patch compilation provides an attractive
solution for the latter.

In a nutshell, copy-and-patch allows a high-quality template JIT
compiler to be generated from the same DSL used to generate the rest of
the interpreter. For a widely-used, volunteer-driven project like
CPython, this benefit cannot be overstated: CPython's maintainers, by
merely editing the bytecode definitions, will also get the JIT backend
updated "for free", for all JIT-supported platforms, at once. This is
equally true whether instructions are being added, modified, or removed.
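
To illustrate the core mechanism (with made-up placeholder bytes; the
real stencils are machine code emitted by Clang at build time),
copy-and-patch boils down to copying a pre-compiled template and
patching its "holes" with values that are only known at run time:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical 16-byte template with an 8-byte hole at offset 4. */
    static const unsigned char TEMPLATE[16] = {
        0x90, 0x90, 0x90, 0x90,   /* ...instructions... */
        0, 0, 0, 0, 0, 0, 0, 0,   /* hole for a runtime value */
        0x90, 0x90, 0x90, 0x90,   /* ...instructions... */
    };
    static const size_t HOLE_OFFSET = 4;

    static void
    emit(unsigned char *out, uint64_t runtime_value)
    {
        /* "Copy": duplicate the template... */
        memcpy(out, TEMPLATE, sizeof(TEMPLATE));
        /* "...and patch": burn the runtime value (an argument, constant,
           cached pointer, or jump target) directly into the copied code. */
        memcpy(out + HOLE_OFFSET, &runtime_value, sizeof(runtime_value));
    }

    int
    main(void)
    {
        unsigned char buffer[sizeof(TEMPLATE)];
        emit(buffer, 0x1122334455667788);
        for (size_t i = 0; i < sizeof(buffer); i++) {
            printf("%02x ", buffer[i]);
        }
        printf("\n");
        return 0;
    }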

Like the rest of the interpreter, the JIT compiler is generated at build
time, and has no runtime dependencies. It supports a wide range of
platforms (see the Support section below), and has comparatively low
maintenance burden. In all, the current implementation is made up of
about 900 lines of build-time Python code and 500 lines of runtime C
code.

Specification

The JIT is currently not part of the default build configuration, and it
is likely to remain that way for the foreseeable future (though official
binaries may include it). That said, the JIT will become
non-experimental once all of the following conditions are met:

1.  It provides a meaningful performance improvement for at least one
    popular platform (realistically, on the order of 5%).
2.  It can be built, distributed, and deployed with minimal disruption.
3.  The Steering Council, upon request, has determined that it would
    provide more value to the community if enabled than if disabled
    (considering tradeoffs such as maintenance burden, memory usage, or
    the feasibility of alternate designs).

These criteria should be considered a starting point, and may be
expanded over time. For example, discussion of this PEP may reveal that
additional requirements (such as multiple committed maintainers, a
security audit, documentation in the devguide, support for
out-of-process debugging, or a runtime option to disable the JIT) should
be added to this list.

Until the JIT is non-experimental, it should not be used in production,
and may be broken or removed at any time without warning.

Once the JIT is no longer experimental, it should be treated in much the
same way as other build options such as --enable-optimizations or
--with-lto. It may be a recommended (or even default) option for some
platforms, and release managers may choose to enable it in official
releases.

Support

The JIT has been developed for all of PEP 11's current tier one
platforms, most of its tier two platforms, and one of its tier three
platforms. Specifically, CPython's main branch has CI building and
testing the JIT for both release and debug builds on:

-   aarch64-apple-darwin/clang
-   aarch64-pc-windows-msvc/msvc[1]
-   aarch64-unknown-linux-gnu/clang[2]
-   aarch64-unknown-linux-gnu/gcc[3]
-   i686-pc-windows-msvc/msvc
-   x86_64-apple-darwin/clang
-   x86_64-pc-windows-msvc/msvc
-   x86_64-unknown-linux-gnu/clang
-   x86_64-unknown-linux-gnu/gcc

It's worth noting that some platforms, even future tier one platforms,
may never gain JIT support. This can be for a variety of reasons,
including insufficient LLVM support (powerpc64le-unknown-linux-gnu/gcc),
inherent limitations of the platform (wasm32-unknown-wasi/clang), or
lack of developer interest (x86_64-unknown-freebsd/clang).

Once JIT support for a platform is added (meaning, the JIT builds
successfully without displaying warnings to the user), it should be
treated in much the same way as PEP 11 prescribes: it should have
reliable CI/buildbots, and JIT failures on tier one and tier two
platforms should block releases. Though it's not necessary to update PEP
11 to specify JIT support, it may be helpful to do so anyway. Otherwise,
a list of supported platforms should be maintained in the JIT's README.

Since it should always be possible to build CPython without the JIT,
removing JIT support for a platform should not be considered a
backwards-incompatible change. However, if it is reasonable to do so,
the normal deprecation process should be followed as outlined in PEP
387.

The JIT's build-time dependencies may be changed between releases,
within reason.

Backwards Compatibility

Because the current interpreter and the JIT backend are both generated
from the same specification, the behavior of Python code should be
completely unchanged. In practice, observable differences that
have been found and fixed during testing have tended to be bugs in the
existing micro-op translation and optimization stages, rather than bugs
in the copy-and-patch step.

Debugging

Tools that profile and debug Python code will continue to work fine.
This includes in-process tools that use Python-provided functionality
(like sys.monitoring, sys.settrace, or sys.setprofile), as well as
out-of-process tools that walk Python frames from the interpreter state.

However, it appears that profilers and debuggers for C code are
currently unable to trace back through JIT frames. Working with leaf
frames is possible (this is how the JIT itself is debugged), though it
is of limited utility due to the absence of proper debugging information
for JIT frames.

Since the code templates emitted by the JIT are compiled by Clang, it
may be possible to allow JIT frames to be traced through by simply
modifying the compiler flags to use frame pointers more carefully. It
may also be possible to harvest and emit the debugging information
produced by Clang. Neither of these ideas has been explored very
deeply.

While this is an issue that should be fixed, fixing it is not a
particularly high priority at this time. This is probably a problem best
explored by somebody with more domain expertise in collaboration with
those maintaining the JIT, who have little experience with the inner
workings of these tools.

Security Implications

This JIT, like any JIT, produces large amounts of executable data at
runtime. This introduces a potential new attack surface to CPython,
since a malicious actor capable of influencing the contents of this data
is therefore capable of executing arbitrary code. This is a well-known
vulnerability of JIT compilers.

In order to mitigate this risk, the JIT has been written with best
practices in mind. In particular, the data in question is not exposed by
the JIT compiler to other parts of the program while it remains
writable, and at no point is the data both writable and executable.
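
A minimal POSIX-only sketch of that discipline (the real allocator
lives in Python/jit.c and also needs to handle Windows, error
reporting, and details such as instruction cache flushing) might look
like this:

    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>

    static void *
    alloc_and_seal(const unsigned char *code, size_t size)
    {
        /* 1. Allocate readable and writable (but not executable) pages. */
        void *memory = mmap(NULL, size, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (memory == MAP_FAILED) {
            return NULL;
        }
        /* 2. Copy (and patch) the machine code while it is writable. */
        memcpy(memory, code, size);
        /* 3. Flip the pages to read/execute; they are never writable and
           executable at the same time, and never writable again. */
        if (mprotect(memory, size, PROT_READ | PROT_EXEC) != 0) {
            munmap(memory, size);
            return NULL;
        }
        return memory;
    }

    int
    main(void)
    {
        static const unsigned char code[] = {0xc3};  /* placeholder bytes */
        return alloc_and_seal(code, sizeof(code)) == NULL;
    }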

The nature of template-based JITs also seriously limits the kinds of
code that can be generated, further reducing the likelihood of a
successful exploit. As an additional precaution, the templates
themselves are stored in static, read-only memory.

However, it would be naive to assume that no possible vulnerabilities
exist in the JIT, especially at this early stage. The author is not a
security expert, but is available to join or work closely with the
Python Security Response Team to triage and fix security issues as they
arise.

Apple Silicon

Though difficult to test without actually signing and packaging a macOS
release, it appears that macOS releases should enable the JIT
Entitlement for the Hardened Runtime.

This shouldn't make installing Python any harder, but may add additional
steps for release managers to perform.

How to Teach This

Choose the sections that best describe you:

-   If you are a Python programmer or end user...
    -   ...nothing changes for you. Nobody should be distributing
        JIT-enabled CPython interpreters to you while it is still an
        experimental feature. Once it is non-experimental, you will
        probably notice slightly better performance and slightly higher
        memory usage. You shouldn't be able to observe any other
        changes.
-   If you maintain third-party packages...
    -   ...nothing changes for you. There are no API or ABI changes, and
        the JIT is not exposed to third-party code. You shouldn't need
        to change your CI matrix, and you shouldn't be able to observe
        differences in the way your packages work when the JIT is
        enabled.
-   If you profile or debug Python code...
    -   ...nothing changes for you. All Python profiling and tracing
        functionality remains.
-   If you profile or debug C code...
    -   ...currently, the ability to trace through JIT frames is
        limited. This may cause issues if you need to observe the entire
        C call stack, rather than just "leaf" frames. See the Debugging
        section above for more information.
-   If you compile your own Python interpreter...
    -   ...if you don't wish to build the JIT, you can simply ignore it.
        Otherwise, you will need to install a compatible version of
        LLVM, and pass the appropriate flag to the build scripts. Your
        build may take up to a minute longer. Note that the JIT should
        not be distributed to end users or used in production while it
        is still in the experimental phase.
-   If you're a maintainer of CPython (or a fork of CPython)...
    -   ...and you change the bytecode definitions or the main
        interpreter loop...
        -   ...in general, the JIT shouldn't be much of an inconvenience
            to you (depending on what you're trying to do). The micro-op
            interpreter isn't going anywhere, and still offers a
            debugging experience similar to what the main bytecode
            interpreter provides today. There is moderate likelihood
            that larger changes to the interpreter (such as adding new
            local variables, changing error handling and deoptimization
            logic, or changing the micro-op format) will require changes
            to the C template used to generate the JIT, which is meant
            to mimic the main interpreter loop. You may also
            occasionally just get unlucky and break JIT code generation,
            which will require you to either modify the Python build
            scripts yourself, or solicit the help of somebody more
            familiar with them (see below).
    -   ...and you work on the JIT itself...
        -   ...you hopefully already have a decent idea of what you're
            getting yourself into. You will be regularly modifying the
            Python build scripts, the C template used to generate the
            JIT, and the C code that actually makes up the runtime
            portion of the JIT. You will also be dealing with all sorts
            of crashes, stepping over machine code in a debugger,
            staring at COFF/ELF/Mach-O dumps, developing on a wide range
            of platforms, and generally being the point of contact for
            the people changing the bytecode when CI starts failing on
            their PRs (see above). Ideally, you're at least familiar
            with assembly, have taken a couple of courses with
            "compilers" in their name, and have read a blog post or two
            about linkers.
    -   ...and you maintain other parts of CPython...
        -   ...nothing changes for you. You shouldn't need to develop
            locally with JIT builds. If you choose to do so (for
            example, to help reproduce and triage JIT issues), your
            builds may take up to a minute longer each time the relevant
            files are modified.

Reference Implementation

Key parts of the implementation include:

-   Tools/jit/README.md: Instructions for how to build the JIT.
-   Python/jit.c: The entire runtime portion of the JIT compiler.
-   jit_stencils.h: An example of the JIT's generated templates.
-   Tools/jit/template.c: The code which is compiled to produce the
    JIT's templates.
-   Tools/jit/_targets.py: The code to compile and parse the templates
    at build time.

Rejected Ideas

Maintain it outside of CPython

While it is probably possible to maintain the JIT outside of CPython,
its implementation is tied tightly enough to the rest of the interpreter
that keeping it up-to-date would probably be more difficult than
actually developing the JIT itself. Additionally, contributors working
on the existing micro-op definitions and optimizations would need to
modify and build two separate projects to measure the effects of their
changes under the JIT (whereas today, infrastructure exists to do this
automatically for any proposed change).

Releases of the separate "JIT" project would probably also need to
correspond to specific CPython pre-releases and patch releases,
depending on exactly what changes are present. Individual CPython
commits between releases likely wouldn't have corresponding JIT releases
at all, further complicating debugging efforts (such as bisection to
find breaking changes upstream).

Since the JIT is already quite stable, and the ultimate goal is for it
to be a non-experimental part of CPython, keeping it in main seems to be
the best path forward. With that said, the relevant code is organized in
such a way that the JIT can be easily "deleted" if it does not end up
meeting its goals.

Turn it on by default

On the other hand, some have suggested that the JIT should be enabled by
default in its current form.

Again, it is important to remember that a JIT is not a magic "go faster"
machine; currently, the JIT is about as fast as the existing
specializing interpreter. This may sound underwhelming, but it is
actually a fairly significant achievement, and it's the main reason why
this approach was considered viable enough to be merged into main for
further development.

While the JIT provides significant gains over the existing micro-op
interpreter, it isn't yet a clear win when always enabled (especially
considering its increased memory consumption and additional build-time
dependencies). That's the purpose of this PEP: to clarify expectations
about the objective criteria that should be met in order to "flip the
switch".

At least for now, having this in main, but off by default, seems to be a
good compromise between always turning it on and not having it available
at all.

Support multiple compiler toolchains

Clang is specifically needed because it's the only C compiler with
support for guaranteed tail calls (musttail), which are required by
CPython's continuation-passing-style approach to JIT compilation.
Without it, the tail-recursive calls between templates could result in
unbounded C stack growth (and eventual overflow).
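
The pattern in question looks roughly like this (illustrative names,
not CPython's actual template code): each handler ends by tail-calling
the next one, and Clang's musttail attribute guarantees that the whole
chain reuses a single C stack frame:

    #include <stdio.h>

    static int
    op_done(int accumulator, int counter)
    {
        (void)counter;
        return accumulator;
    }

    static int
    op_increment(int accumulator, int counter)
    {
        if (counter == 0) {
            __attribute__((musttail)) return op_done(accumulator, counter);
        }
        /* Guaranteed tail call: no new C stack frame is created, no
           matter how long the chain of handlers gets. */
        __attribute__((musttail)) return op_increment(accumulator + 1,
                                                      counter - 1);
    }

    int
    main(void)
    {
        /* A million chained "handlers" without growing the C stack. */
        printf("%d\n", op_increment(0, 1000000));
        return 0;
    }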

Since LLVM also includes other functionalities required by the JIT build
process (namely, utilities for object file parsing and disassembly), and
additional toolchains introduce additional testing and maintenance
burden, it's convenient to only support one major version of one
toolchain at this time.

Compile the base interpreter's bytecode

Most of the prior art for copy-and-patch uses it as a fast baseline JIT,
whereas CPython's JIT is using the technique to compile optimized
micro-op traces.

In practice, the new JIT currently sits somewhere between the "baseline"
and "optimizing" compiler tiers of other dynamic language runtimes. This
is because CPython uses its specializing adaptive interpreter to collect
runtime profiling information, which is used to detect and optimize
"hot" paths through the code. This step is carried out using
self-modifying code, a technique which is much more difficult to
implement with a JIT compiler.

While it's possible to compile normal bytecode using copy-and-patch (in
fact, early prototypes predated the micro-op interpreter and did exactly
this), it just doesn't seem to provide as much optimization potential as
the more granular micro-op format.

Add GPU support

The JIT is currently CPU-only. It does not, for example, offload NumPy
array computations to CUDA GPUs, as JITs like Numba do.

There is already a rich ecosystem of tools for accelerating these sorts
of specialized tasks, and CPython's JIT is not intended to replace them.
Instead, it is meant to improve the performance of general-purpose
Python code, which is less likely to benefit from deeper GPU
integration.

Open Issues

Speed

Currently, the JIT is about as fast as the existing specializing
interpreter on most platforms. Improving this is obviously a top
priority at this point, since providing a significant performance gain
is the entire motivation for having a JIT at all. A number of proposed
improvements are already underway, and this ongoing work is being
tracked in GH-115802.

Memory

Because it allocates additional memory for executable machine code, the
JIT does use more memory than the existing interpreter at runtime.
According to the official benchmarks, the JIT currently uses about
10-20% more memory than the base interpreter. The upper end of this
range is due to aarch64-apple-darwin, which has larger page sizes (and
thus, a larger minimum allocation granularity).
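
As a rough illustration of why page size matters (a simplification
that assumes each piece of executable code is rounded up to a whole
number of pages, using 4 KiB and 16 KiB as typical x86-64 and Apple
Silicon page sizes):

    #include <stdio.h>

    /* Round a code size up to the next multiple of the page size
       (page_size must be a power of two). */
    static size_t
    round_up_to_page(size_t code_size, size_t page_size)
    {
        return (code_size + page_size - 1) & ~(page_size - 1);
    }

    int
    main(void)
    {
        size_t trace = 900;  /* machine code for a hypothetical small trace */
        printf("4 KiB pages:  %zu bytes reserved\n",
               round_up_to_page(trace, 4096));
        printf("16 KiB pages: %zu bytes reserved\n",
               round_up_to_page(trace, 16384));
        return 0;
    }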

However, these numbers should be taken with a grain of salt, as the
benchmarks themselves don't actually have a very high baseline of memory
usage. Since they have a higher ratio of code to data, the JIT's memory
overhead is more pronounced than it would be in a typical workload where
memory pressure is more likely to be a real concern.

Not much effort has been put into optimizing the JIT's memory usage yet,
so these numbers likely represent a maximum that will be reduced over
time. Improving this is a medium priority, and is being tracked in
GH-116017. We may consider exposing configurable parameters for limiting
memory consumption in the future, but no official APIs will be exposed
until the JIT meets the requirements to be considered non-experimental.

Earlier versions of the JIT had a more complicated memory allocation
scheme which imposed a number of fragile limitations on the size and
layout of the emitted code, and significantly bloated the memory
footprint of the Python executable. These issues are no longer present in
the current design.

Dependencies

At the time of writing, the JIT has a build-time dependency on LLVM.
LLVM is used to compile individual micro-op instructions into blobs of
machine code, which are then linked together to form the JIT's
templates. These templates are used to build CPython itself. The JIT has
no runtime dependency on LLVM and is therefore not at all exposed as a
dependency to end users.

Building the JIT adds between 3 and 60 seconds to the build process,
depending on platform. It is only rebuilt whenever the generated files
become out-of-date, so only those who are actively developing the main
interpreter loop will be rebuilding it with any frequency.

Unlike many other generated files in CPython, the JIT's generated files
are not tracked by Git. This is because they contain compiled binary
code templates specific to not only the host platform, but also the
current build configuration for that platform. As such, hosting them
would require a significant engineering effort in order to build and
host dozens of large binary files for each commit that changes the
generated code. While perhaps feasible, this is not a priority, since
installing the required tools is not prohibitively difficult for most
people building CPython, and the build step is not particularly
time-consuming.

Since there is still interest in this possibility, discussion is
being tracked in GH-115869.

Footnotes

[1] Due to lack of available hardware, the JIT is built, but not tested,
for this platform.

[2] Due to lack of available hardware, the JIT is built using
cross-compilation and tested using hardware emulation for this platform.
Some tests are skipped because emulation causes them to fail. However,
the JIT has been successfully built and tested for this platform on
non-emulated hardware.

[3] Due to lack of available hardware, the JIT is built using
cross-compilation and tested using hardware emulation for this platform.
Some tests are skipped because emulation causes them to fail. However,
the JIT has been successfully built and tested for this platform on
non-emulated hardware.

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.