PEP 744 – JIT Compilation

Author: Brandt Bucher <brandt at python.org>, Savannah Ostrowski <savannahostrowski at gmail.com>
Discussions-To: Discourse thread
Status: Draft
Type: Informational
Created: 11-Apr-2024
Python-Version: 3.13
Post-History: 11-Apr-2024

Abstract

Earlier this year, an experimental “just-in-time” compiler was merged into CPython’s main development branch. While recent CPython releases have included other substantial internal changes, this addition represents a particularly significant departure from the way CPython has traditionally executed Python code. As such, it deserves wider discussion.

This PEP aims to summarize the design decisions behind this addition, the current state of the implementation, and future plans for making the JIT a permanent, non-experimental part of CPython. It does not seek to provide a comprehensive overview of how the JIT works, instead focusing on the particular advantages and disadvantages of the chosen approach, as well as answering many questions that have been asked about the JIT since its introduction.

Readers interested in learning more about the new JIT are encouraged to consult the following resources:

- The presentation which first introduced the JIT at the 2023 CPython Core Developer Sprint. It includes relevant background, a light technical introduction to the “copy-and-patch” technique used, and an open discussion of its design amongst the core developers present. Slides for this talk can be found on GitHub.
- The open access paper originally describing copy-and-patch.
- The blog post by the paper’s author detailing the implementation of a copy-and-patch JIT compiler for Lua. While this is a great low-level explanation of the approach, note that it also incorporates other techniques and makes implementation decisions that are not particularly relevant to CPython’s JIT.
- The implementation itself.

Motivation

Until this point, CPython has always executed Python code by compiling it to bytecode, which is interpreted at runtime. This bytecode is a more-or-less direct translation of the source code: it is untyped, and largely unoptimized.

Since the Python 3.11 release, CPython has used a “specializing adaptive interpreter” (PEP 659), which rewrites these bytecode instructions in-place with type-specialized versions as they run. This new interpreter delivers significant performance improvements, despite the fact that its optimization potential is limited by the boundaries of individual bytecode instructions. It also collects a wealth of new profiling information: the types flowing through a program, the memory layout of particular objects, and what paths through the program are being executed the most. In other words, what to optimize, and how to optimize it.
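
The in-place rewriting described above can be illustrated with a toy model. The sketch below is not CPython’s implementation: the instruction names (ADD, ADD_INT), the per-instruction counter, and the WARMUP threshold are all hypothetical, chosen only to show an instruction replacing itself with a specialized form and falling back when its assumption fails.

```python
# Toy model of in-place specialization; not CPython's real instruction set.
WARMUP = 2  # hypothetical threshold: specialize after this many int-only runs


def run(code, stack):
    """Execute a list of [opname, counter] instructions, rewriting them in place."""
    for instr in code:
        op = instr[0]
        if op == "ADD":
            right, left = stack.pop(), stack.pop()
            if type(left) is int and type(right) is int:
                instr[1] += 1
                if instr[1] >= WARMUP:
                    instr[0] = "ADD_INT"  # rewrite with a specialized version
            else:
                instr[1] = 0
            stack.append(left + right)
        elif op == "ADD_INT":
            right, left = stack.pop(), stack.pop()
            if type(left) is int and type(right) is int:
                stack.append(int.__add__(left, right))  # skip generic dispatch
            else:
                instr[0], instr[1] = "ADD", 0  # fall back to the generic form
                stack.append(left + right)
    return stack


code = [["ADD", 0]]
for _ in range(3):
    run(code, [1, 2])
specialized = code[0][0]      # opname after repeated int additions
run(code, ["a", "b"])
despecialized = code[0][0]    # opname after a non-int operand pair
```

In this model the instruction becomes ADD_INT after repeated integer additions and reverts to ADD once it sees strings, which mirrors the “what to optimize, and how” information the real interpreter gathers as a side effect of running.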

Since the Python 3.12 release, CPython has generated this interpreter from a C-like domain-specific language (DSL). In addition to taming some of the complexity of the new adaptive interpreter, the DSL also allows CPython’s maintainers to avoid hand-writing tedious boilerplate code in many parts of the interpreter, compiler, and standard library that must be kept in sync with the instruction definitions. This ability to generate large amounts of runtime infrastructure from a single source of truth is not only convenient for maintenance; it also unlocks many possibilities for expanding CPython’s execution in new ways. For instance, it makes it feasible to automatically generate tables for translating a sequence of instructions into an equivalent sequence of smaller “micro-ops”, generate an optimizer for sequences of these micro-ops, and even generate an entire second interpreter for executing them.
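
As a rough sketch of the kind of table such a generator could produce, the following maps each instruction to a sequence of micro-ops and flattens a short instruction sequence into a trace. The instruction and micro-op names are illustrative assumptions and are not copied from CPython’s generated tables.

```python
# Hypothetical expansion table: these names are illustrative only and are not
# taken from CPython's generated files.
UOP_EXPANSIONS = {
    "LOAD_FAST": ["_LOAD_FAST"],
    "BINARY_ADD_INT": ["_GUARD_BOTH_INT", "_ADD_INT"],
    "RETURN_VALUE": ["_POP_FRAME"],
}


def expand(instructions):
    """Translate (name, oparg) instructions into a flat micro-op trace."""
    trace = []
    for name, oparg in instructions:
        for uop in UOP_EXPANSIONS[name]:
            trace.append((uop, oparg))
    return trace


trace = expand([
    ("LOAD_FAST", 0),
    ("LOAD_FAST", 1),
    ("BINARY_ADD_INT", 0),
    ("RETURN_VALUE", 0),
])
```

Because the table is plain data, the same definitions can feed several consumers (a translator, an optimizer, a second interpreter), which is the “single source of truth” property the paragraph above relies on.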

In fact, since early in the Python 3.13 release cycle, all CPython builds have included this exact micro-op translation, optimization, and execution machinery. However, it is disabled by default; the overhead of interpreting even optimized traces of micro-ops is just too large for most code. Heavier optimization probably won’t improve the situation much either, since any efficiency gains made by new optimizations will likely be offset by the interpretive overhead of even smaller, more complex micro-ops.

The most obvious strategy to overcome this new bottleneck is to statically compile these optimized traces. This presents opportunities to avoid several sources of indirection and overhead introduced by interpretation. In particular, it allows the removal of dispatch overhead between micro-ops (by replacing a generic interpreter with a straight-line sequence of hot code), instruction decoding overhead for individual micro-ops (by “burning” the values or addresses of arguments, constants, and cached values directly into machine instructions), and memory traffic (by moving data off of heap-allocated Python frames and into physical hardware registers).

Since much of this data varies even between identical runs of a program and the existing optimization pipeline makes heavy use of runtime profiling information, it doesn’t make much sense to compile these traces ahead of time. As has been demonstrated for many other dynamic languages (and even Python itself), the most promising approach is to compile the optimized micro-ops “just in time” for execution.
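
The idea of “burning” runtime values into pre-built code can be sketched as follows. This is a simplified model under stated assumptions: the stencil bytes are placeholder data rather than real machine code, and the STENCILS layout (a byte string plus a list of 64-bit hole offsets) is hypothetical.

```python
import struct

# A "stencil" is pre-compiled code with holes. The bytes here are placeholder
# data, not real machine code, and the layout is a hypothetical simplification.
HOLE = b"\x00" * 8

STENCILS = {
    # name: (template bytes, offsets of 64-bit holes to patch)
    "_LOAD_CONST": (b"\xAA\xAA" + HOLE + b"\xAA", [2]),
    "_ADD_INT": (b"\xBB\xBB\xBB", []),
}


def compile_trace(trace):
    """Copy each micro-op's stencil, then patch its holes with runtime values."""
    buffer = bytearray()
    for name, values in trace:
        body, holes = STENCILS[name]
        start = len(buffer)
        buffer += body  # copy
        for offset, value in zip(holes, values):
            # patch: write the value as a little-endian 64-bit integer
            struct.pack_into("<Q", buffer, start + offset, value)
    return bytes(buffer)


code = compile_trace([
    ("_LOAD_CONST", [0x1122334455667788]),
    ("_ADD_INT", []),
])
```

The result is one contiguous buffer with no per-micro-op decoding left to do at execution time: the constant already sits inside the copied bytes, and the next micro-op’s code follows directly.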

Rationale

Despite their reputation, JIT compilers are not magic “go faster” machines. Developing and maintaining any sort of optimizing compiler for even a single platform, let alone all of CPython’s most popular supported platforms, is an incredibly complicated, expensive task. Using an existing compiler framework like LLVM can make this task simpler, but only at the cost of introducing heavy runtime dependencies and significantly higher JIT compilation overhead.

It’s clear that successfully compiling Python code at runtime requires not only high-quality Python-specific optimizations for the code being run, but also quick generation of efficient machine code for the optimized program. The Python core development team has the necessary skills and experience for the former (a middle-end tightly coupled to the interpreter), and copy-and-patch compilation provides an attractive solution for the latter.

In a nutshell, copy-and-patch allows a high-quality template JIT compiler to be generated from the same DSL used to generate the rest of the interpreter. For a widely-used, volunteer-driven project like CPython, this benefit cannot be overstated: CPython’s maintainers, by merely editing the bytecode definitions, will also get the JIT backend updated “for free”, for all JIT-supported platforms, at once. This is equally true whether instructions are being added, modified, or removed.

Like the rest of the interpreter, the JIT compiler is generated at build time, and has no runtime dependencies. It supports a wide range of platforms (see the Support section below), and has comparatively low maintenance burden. In all, the current implementation is made up of about 900 lines of build-time Python code and 500 lines of runtime C code.

Specification

The JIT will become non-experimental once all of the following conditions are met:

- It provides a meaningful performance improvement for at least one popular platform (realistically, on the order of 5%).
- It can be built, distributed, and deployed with minimal disruption.
- The Steering Council, upon request, has determined that it would provide more value to the community if enabled than if disabled (considering tradeoffs such as maintenance burden, memory usage, or the feasibility of alternate designs).

These criteria should be considered a starting point, and may be expanded over time. For example, discussion of this PEP may reveal that additional requirements (such as multiple committed maintainers, a security audit, documentation in the devguide, support for out-of-process debugging, or a runtime option to disable the JIT) should be added to this list.

Until the JIT is non-experimental, it should not be used in production, and may be broken or removed at any time without warning.

Once the JIT is no longer experimental, it should be treated in much the same way as other build options such as --enable-optimizations or --with-lto. It may be a recommended (or even default) option for some platforms, and release managers may choose to enable it in official releases.

Support

The JIT has been developed for all of PEP 11’s current tier one platforms, most of its tier two platforms, and one of its tier three platforms. Specifically, CPython’s main branch has CI building and testing the JIT for both release and debug builds on:

- aarch64-apple-darwin/clang
- aarch64-pc-windows-msvc/msvc [1]
- aarch64-unknown-linux-gnu/clang [2]
- aarch64-unknown-linux-gnu/gcc [2]
- i686-pc-windows-msvc/msvc
- x86_64-apple-darwin/clang
- x86_64-pc-windows-msvc/msvc
- x86_64-unknown-linux-gnu/clang
- x86_64-unknown-linux-gnu/gcc

It’s worth noting that some platforms, even future tier one platforms, may never gain JIT support. This can be for a variety of reasons, including insufficient LLVM support (powerpc64le-unknown-linux-gnu/gcc), inherent limitations of the platform (wasm32-unknown-wasi/clang), or lack of developer interest (x86_64-unknown-freebsd/clang).

Once JIT support for a platform is added (meaning, the JIT builds successfully without displaying warnings to the user), it should be treated in much the same way as PEP 11 prescribes: it should have reliable CI/buildbots, and JIT failures on tier one and tier two platforms should block releases. Though it’s not necessary to update PEP 11 to specify JIT support, it may be helpful to do so anyway. Otherwise, a list of supported platforms should be maintained in the JIT’s README.

Since it should always be possible to build CPython without the JIT, removing JIT support for a platform should not be considered a backwards-incompatible change. However, if it is reasonable to do so, the normal deprecation process should be followed as outlined in PEP 387.

The JIT’s build-time dependencies may be changed between releases, within reason.

Backwards Compatibility

Because the current interpreter and the JIT backend are both generated from the same specification, the behavior of Python code should be completely unchanged. In practice, observable differences that have been found and fixed during testing have tended to be bugs in the existing micro-op translation and optimization stages, rather than bugs in the copy-and-patch step.

Debugging

Tools that profile and debug Python code will continue to work fine. This includes in-process tools that use Python-provided functionality (like sys.monitoring, sys.settrace, or sys.setprofile), as well as out-of-process tools that walk Python frames from the interpreter state.
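
For readers unfamiliar with the in-process hooks mentioned above, here is a minimal sys.settrace tracer. Nothing in it is JIT-specific; the function add and the events list are illustrative names, and the expectation that it behaves identically on a JIT-enabled build comes from the paragraph above rather than from anything the sketch itself verifies.

```python
import sys

events = []


def tracer(frame, event, arg):
    """Record trace events for Python frames running the function 'add'."""
    if frame.f_code.co_name == "add":
        events.append(event)
        return tracer  # keep receiving line/return events for this frame
    return None


def add(a, b):
    return a + b


previous = sys.gettrace()
sys.settrace(tracer)
try:
    result = add(1, 2)
finally:
    sys.settrace(previous)  # always restore the previous trace function
```

After this runs, events begins with a "call" entry and ends with a "return" entry for the traced frame.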

However, it appears that profilers and debuggers for C code are currently unable to trace back through JIT frames. Working with leaf frames is possible (this is how the JIT itself is debugged), though it is of limited utility due to the absence of proper debugging information for JIT frames.

Since the code templates emitted by the JIT are compiled by Clang, it may be possible to allow JIT frames to be traced through by simply modifying the compiler flags to use frame pointers more carefully. It may also be possible to harvest and emit the debugging information produced by Clang. Neither of these ideas has been explored very deeply.

While this is an issue that should be fixed, fixing it is not a particularly high priority at this time. This is probably a problem best explored by somebody with more domain expertise in collaboration with those maintaining the JIT, who have little experience with the inner workings of these tools.

Security Implications

This JIT, like any JIT, produces large amounts of executable data at runtime. This introduces a potential new attack surface to CPython, since a malicious actor capable of influencing the contents of this data is therefore capable of executing arbitrary code. This is a well-known vulnerability of JIT compilers.

In order to mitigate this risk, the JIT has been written with best practices in mind. In particular, the data in question is not exposed by the JIT compiler to other parts of the program while it remains writable, and at no point is the data both writable and executable.

The nature of template-based JITs also seriously limits the kinds of code that can be generated, further reducing the likelihood of a successful exploit. As an additional precaution, the templates themselves are stored in static, read-only memory.

However, it would be naive to assume that no possible vulnerabilities exist in the JIT, especially at this early stage. The author is not a security expert, but is available to join or work closely with the Python Security Response Team to triage and fix security issues as they arise.

Apple Silicon

Though difficult to test without actually signing and packaging a macOS release, it appears that macOS releases should enable the JIT Entitlement for the Hardened Runtime.

This shouldn’t make installing Python any harder, but may require additional steps from release managers.

How to Teach This

Choose the sections that best describe you:

If you are a Python programmer or end user…
- …nothing changes for you. Nobody should be distributing JIT-enabled CPython interpreters to you while it is still an experimental feature. Once it is non-experimental, you will probably notice slightly better performance and slightly higher memory usage. You shouldn’t be able to observe any other changes.
If you maintain third-party packages…
- …nothing changes for you. There are no API or ABI changes, and the JIT is not exposed to third-party code. You shouldn’t need to change your CI matrix, and you shouldn’t be able to observe differences in the way your packages work when the JIT is enabled.
If you profile or debug Python code…
- …nothing changes for you. All Python profiling and tracing functionality remains.
If you profile or debug C code…
- …currently, the ability to trace through JIT frames is limited. This may cause issues if you need to observe the entire C call stack, rather than just “leaf” frames. See the Debugging section above for more information.
If you compile your own Python interpreter…
- …if you don’t wish to build the JIT, you can simply ignore it. Otherwise, you will need to install a compatible version of LLVM, and pass the appropriate flag to the build scripts. Your build may take up to a minute longer. Note that the JIT should not be distributed to end users or used in production while it is still in the experimental phase.
If you’re a maintainer of CPython (or a fork of CPython)…
- …and you change the bytecode definitions or the main interpreter loop…
  - …in general, the JIT shouldn’t be much of an inconvenience to you (depending on what you’re trying to do). The micro-op interpreter isn’t going anywhere, and still offers a debugging experience similar to what the main bytecode interpreter provides today. There is moderate likelihood that larger changes to the interpreter (such as adding new local variables, changing error handling and deoptimization logic, or changing the micro-op format) will require changes to the C template used to generate the JIT, which is meant to mimic the main interpreter loop. You may also occasionally just get unlucky and break JIT code generation, which will require you to either modify the Python build scripts yourself, or solicit the help of somebody more familiar with them (see below).
- …and you work on the JIT itself…
  - …you hopefully already have a decent idea of what you’re getting yourself into. You will be regularly modifying the Python build scripts, the C template used to generate the JIT, and the C code that actually makes up the runtime portion of the JIT. You will also be dealing with all sorts of crashes, stepping over machine code in a debugger, staring at COFF/ELF/Mach-O dumps, developing on a wide range of platforms, and generally being the point of contact for the people changing the bytecode when CI starts failing on their PRs (see above). Ideally, you’re at least familiar with assembly, have taken a couple of courses with “compilers” in their name, and have read a blog post or two about linkers.
- …and you maintain other parts of CPython…
  - …nothing changes for you. You shouldn’t need to develop locally with JIT builds. If you choose to do so (for example, to help reproduce and triage JIT issues), your builds may take up to a minute longer each time the relevant files are modified.

Reference Implementation

Key parts of the implementation include:

- Tools/jit/README.md: Instructions for how to build the JIT.
- Python/jit.c: The entire runtime portion of the JIT compiler.
- jit_stencils.h: An example of the JIT’s generated templates.
- Tools/jit/template.c: The code which is compiled to produce the JIT’s templates.
- Tools/jit/_targets.py: The code to compile and parse the templates at build time.

Rejected Ideas

Maintain it outside of CPython

While it is probably possible to maintain the JIT outside of CPython, its implementation is tied tightly enough to the rest of the interpreter that keeping it up-to-date would probably be more difficult than actually developing the JIT itself. Additionally, contributors working on the existing micro-op definitions and optimizations would need to modify and build two separate projects to measure the effects of their changes under the JIT (whereas today, infrastructure exists to do this automatically for any proposed change).

Releases of the separate “JIT” project would probably also need to correspond to specific CPython pre-releases and patch releases, depending on exactly what changes are present. Individual CPython commits between releases likely wouldn’t have corresponding JIT releases at all, further complicating debugging efforts (such as bisection to find breaking changes upstream).

Since the JIT is already quite stable, and the ultimate goal is for it to be a non-experimental part of CPython, keeping it in main seems to be the best path forward. With that said, the relevant code is organized in such a way that the JIT can be easily “deleted” if it does not end up meeting its goals.

Turn it on by default

On the other hand, some have suggested that the JIT should be enabled by default in its current form.

Again, it is important to remember that a JIT is not a magic “go faster” machine; currently, the JIT is about as fast as the existing specializing interpreter. This may sound underwhelming, but it is actually a fairly significant achievement, and it’s the main reason why this approach was considered viable enough to be merged into main for further development.

While the JIT provides significant gains over the existing micro-op interpreter, it isn’t yet a clear win when always enabled (especially considering its increased memory consumption and additional build-time dependencies). That’s the purpose of this PEP: to clarify expectations about the objective criteria that should be met in order to “flip the switch”.

At least for now, having this in main, but off by default, seems to be a good compromise between always turning it on and not having it available at all.

Support multiple compiler toolchains

Clang is specifically needed because it’s the only C compiler with support for guaranteed tail calls (musttail), which are required by CPython’s continuation-passing-style approach to JIT compilation. Without it, the tail-recursive calls between templates could result in unbounded C stack growth (and eventual overflow).
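
The stack-growth concern can be demonstrated by analogy in Python, which (like C without guaranteed tail calls) keeps the caller’s frame alive across a call in tail position. This is only an analogy under stated assumptions: run_nested, step, and run_flat are hypothetical names, and the loop in run_flat merely stands in for the constant-stack behavior that musttail guarantees; it is not how the JIT’s templates are chained.

```python
import sys


def run_nested(uops, i=0, acc=0):
    """Each step ends by calling the next one; every caller's frame is kept."""
    if i == len(uops):
        return acc
    return run_nested(uops, i + 1, acc + uops[i])


def step(uops, i, acc):
    """One step that returns its continuation instead of calling it."""
    if i == len(uops):
        return None, acc
    return i + 1, acc + uops[i]


def run_flat(uops):
    """Drive the steps from a loop, so stack depth stays constant."""
    i, acc = 0, 0
    while i is not None:
        i, acc = step(uops, i, acc)
    return acc


# A trace longer than the recursion limit overflows the nested version.
trace = [1] * (sys.getrecursionlimit() * 2)
try:
    run_nested(trace)
    overflowed = False
except RecursionError:
    overflowed = True
total = run_flat(trace)
```

The nested version fails on a long enough trace while the flat version handles it, which is the difference between ordinary calls and guaranteed tail calls when one template hands control to the next.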

Since LLVM also includes other functionality required by the JIT build process (namely, utilities for object file parsing and disassembly), and additional toolchains introduce additional testing and maintenance burden, it’s convenient to only support one major version of one toolchain at this time.

Compile the base interpreter’s bytecode

Most of the prior art for copy-and-patch uses it as a fast baseline JIT, whereas CPython’s JIT is using the technique to compile optimized micro-op traces.

In practice, the new JIT currently sits somewhere between the “baseline” and “optimizing” compiler tiers of other dynamic language runtimes. This is because CPython uses its specializing adaptive interpreter to collect runtime profiling information, which is used to detect and optimize “hot” paths through the code. This step is carried out using self-modifying code, a technique which is much more difficult to implement with a JIT compiler.

While it’s possible to compile normal bytecode using copy-and-patch (in fact, early prototypes predated the micro-op interpreter and did exactly this), it just doesn’t seem to provide as much optimization potential as the more granular micro-op format.

Add GPU support

The JIT is currently CPU-only. It does not, for example, offload NumPy array computations to CUDA GPUs, as JITs like Numba do.

There is already a rich ecosystem of tools for accelerating these sorts of specialized tasks, and CPython’s JIT is not intended to replace them. Instead, it is meant to improve the performance of general-purpose Python code, which is less likely to benefit from deeper GPU integration.

Open Issues

Speed

Currently, the JIT is about as fast as the existing specializing interpreter on most platforms. Improving this is obviously a top priority at this point, since providing a significant performance gain is the entire motivation for having a JIT at all. A number of proposed improvements are already underway, and this ongoing work is being tracked in GH-115802.

Memory

Because it allocates additional memory for executable machine code, the JIT does use more memory than the existing interpreter at runtime. According to the official benchmarks, the JIT currently uses about 10-20% more memory than the base interpreter. The upper end of this range is due to aarch64-apple-darwin, which has larger page sizes (and thus, a larger minimum allocation granularity).
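
The effect of allocation granularity can be shown with a little arithmetic. The sketch assumes, purely for illustration, that each compiled trace receives its own page-aligned allocation and that a trace is 1,000 bytes long; neither figure comes from the benchmarks. The page sizes used are 4 KiB (common on x86-64) and 16 KiB (used on Apple Silicon).

```python
def allocated_bytes(code_size, page_size):
    """Round a code buffer up to a whole number of pages."""
    pages = -(-code_size // page_size)  # ceiling division
    return pages * page_size


trace_size = 1_000  # hypothetical size of one compiled trace, in bytes
small_pages = allocated_bytes(trace_size, 4 * 1024)
large_pages = allocated_bytes(trace_size, 16 * 1024)
```

Under these assumptions the same trace occupies 4,096 bytes with 4 KiB pages and 16,384 bytes with 16 KiB pages, so small traces waste proportionally more memory where pages are larger.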

However, these numbers should be taken with a grain of salt, as the benchmarks themselves don’t actually have a very high baseline of memory usage. Since they have a higher ratio of code to data, the JIT’s memory overhead is more pronounced than it would be in a typical workload where memory pressure is more likely to be a real concern.

Not much effort has been put into optimizing the JIT’s memory usage yet, so these numbers likely represent a maximum that will be reduced over time. Improving this is a medium priority, and is being tracked in GH-116017.

Earlier versions of the JIT had a more complicated memory allocation scheme which imposed a number of fragile limitations on the size and layout of the emitted code, and significantly bloated the memory footprint of the Python executable. These issues are no longer present in the current design.

Dependencies

At the time of writing, the JIT has a build-time dependency on LLVM. LLVM is used to compile individual micro-op instructions into blobs of machine code, which are then linked together to form the JIT’s templates. These templates are used to build CPython itself. The JIT has no runtime dependency on LLVM and is therefore not at all exposed as a dependency to end users.

Building the JIT adds between 3 and 60 seconds to the build process, depending on platform. It is only rebuilt whenever the generated files become out-of-date, so only those who are actively developing the main interpreter loop will be rebuilding it with any frequency.
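
One common way to decide whether generated files are out-of-date is to compare a digest of the inputs against a stored stamp. The sketch below shows that generic pattern only; it is not the logic used by the JIT’s build scripts, and the helper names (inputs_digest, is_stale) and file names are hypothetical.

```python
import hashlib
import pathlib
import tempfile


def inputs_digest(paths, extra=b""):
    """Hash the contents of every input file plus any extra build settings."""
    hasher = hashlib.sha256(extra)
    for path in sorted(paths):
        hasher.update(pathlib.Path(path).read_bytes())
    return hasher.hexdigest()


def is_stale(paths, stamp_file, extra=b""):
    """Return True if the stamp is missing or does not match the inputs."""
    stamp = pathlib.Path(stamp_file)
    return not stamp.exists() or stamp.read_text() != inputs_digest(paths, extra)


with tempfile.TemporaryDirectory() as tmp:
    source = pathlib.Path(tmp, "template.c")
    stamp = pathlib.Path(tmp, "stencils.stamp")
    source.write_text("int x;")
    first = is_stale([source], stamp)          # no stamp yet
    stamp.write_text(inputs_digest([source]))  # pretend a rebuild happened
    second = is_stale([source], stamp)         # inputs unchanged
    source.write_text("int y;")                # edit an input
    third = is_stale([source], stamp)
```

With a check of this kind, developers who never touch the inputs pay the rebuild cost once, while those editing the interpreter loop trigger a rebuild on each relevant change.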

Unlike many other generated files in CPython, the JIT’s generated files are not tracked by Git. This is because they contain compiled binary code templates specific to not only the host platform, but also the current build configuration for that platform. As such, hosting them would require a significant engineering effort in order to build and host dozens of large binary files for each commit that changes the generated code. While perhaps feasible, this is not a priority, since installing the required tools is not prohibitively difficult for most people building CPython, and the build step is not particularly time-consuming.

Since some still remain interested in this possibility, discussion is being tracked in GH-115869.

Footnotes

Copyright

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.

Source: https://github.com/python/peps/blob/main/peps/pep-0744.rst

Last modified: 2024-07-04 05:08:25 GMT