PEP 703 – Making the Global Interpreter Lock Optional in CPython
- Author:
- Sam Gross <colesbury at gmail.com>
- Sponsor:
- Łukasz Langa <lukasz at python.org>
- Discussions-To:
- Discourse thread
- Status:
- Accepted
- Type:
- Standards Track
- Created:
- 09-Jan-2023
- Python-Version:
- 3.13
- Post-History:
- 09-Jan-2023, 04-May-2023
- Resolution:
- 24-Oct-2023
Table of Contents
- Abstract
- Motivation
- Specification
- Rationale
- Backwards Compatibility
- Distribution
- Performance
- Build Bots
- How to Teach This
- Reference Implementation
- Alternatives
- Related Work
- Rejected Ideas
- Open Issues
- References
- Acknowledgments
- Copyright
Note
The Steering Council accepts PEP 703, but with clear proviso: that the rollout be gradual and break as little as possible, and that we can roll back any changes that turn out to be too disruptive – which includes potentially rolling back all of PEP 703 entirely if necessary (however unlikely or undesirable we expect that to be).
Abstract
CPython’s global interpreter lock (“GIL”) prevents multiple threads
from executing Python code at the same time. The GIL is an obstacle
to using multi-core CPUs from Python efficiently. This PEP proposes
adding a build configuration (--disable-gil
) to CPython to let it
run Python code without the global interpreter lock and with the
necessary changes needed to make the interpreter thread-safe.
Motivation
The GIL is a major obstacle to concurrency. For scientific computing tasks, this lack of concurrency is often a bigger issue than speed of executing Python code, since most of the processor cycles are spent in optimized CPU or GPU kernels. The GIL introduces a global bottleneck that can prevent other threads from making progress if they call any Python code. There are existing ways to enable parallelism in CPython today, but those techniques come with significant limitations (see Alternatives).
This section focuses on the GIL’s impact on scientific computing, particular AI/ML workloads because that is the area with which this author has the most experience, but the GIL also affects other users of Python.
The GIL Makes Many Types of Parallelism Difficult to Express
Neural network-based AI models expose multiple opportunities for parallelism. For example, individual operations may be parallelized internally (“intra-operator”), multiple operations may be executed simultaneously (“inter-operator”), and requests (spanning multiple operations) may also be parallelized. Efficient execution requires exploiting multiple types of parallelism [1].
The GIL makes it difficult to express inter-operator parallelism, as well as some forms of request parallelism, efficiently in Python. In other programming languages, a system might use threads to run different parts of a neural network on separate CPU cores, but this is inefficient in Python due to the GIL. Similarly, latency-sensitive inference workloads frequently use threads to parallelize across requests, but face the same scaling bottlenecks in Python.
The challenges the GIL poses to exploiting parallelism in Python frequently come up in reinforcement learning. Heinrich Kuttler, author of the NetHack Learning Environment and Member of Technical Staff at Inflection AI, writes:
Recent breakthroughs in reinforcement learning, such as on Dota 2, StarCraft, and NetHack rely on running multiple environments (simulated games) in parallel using asynchronous actor-critic methods. Straightforward multithreaded implementations in Python don’t scale beyond more than a few parallel environments due to GIL contention. Multiprocessing, with communication via shared memory or UNIX sockets, adds much complexity and in effect rules out interacting with CUDA from different workers, severely restricting the design space.
Manuel Kroiss, software engineer at DeepMind on the reinforcement learning team, describes how the bottlenecks posed by the GIL lead to rewriting Python codebases in C++, making the code less accessible:
We frequently battle issues with the Python GIL at DeepMind. In many of our applications, we would like to run on the order of 50-100 threads per process. However, we often see that even with fewer than 10 threads the GIL becomes the bottleneck. To work around this problem, we sometimes use subprocesses, but in many cases the inter-process communication becomes too big of an overhead. To deal with the GIL, we usually end up translating large parts of our Python codebase into C++. This is undesirable because it makes the code less accessible to researchers.
Projects that involve interfacing with multiple hardware devices face similar challenges: efficient communication requires use of multiple CPU cores. The Dose-3D project aims to improve cancer radiotherapy with precise dose planning. It uses medical phantoms (stand-ins for human tissue) together with custom hardware and a server application written in Python. Paweł Jurgielewicz, lead software architect for the data acquisition system on the Dose-3D project, describes the scaling challenges posed by the GIL and how using a fork of Python without the GIL simplified the project:
In the Dose-3D project, the key challenge was to maintain a stable, non-trivial concurrent communication link with hardware units while utilizing a 1 Gbit/s UDP/IP connection to the maximum. Naturally, we started with the multiprocessing package, but at some point, it became clear that most CPU time was consumed by the data transfers between the data processing stages, not by data processing itself. The CPython multithreading implementation based on GIL was a dead end too. When we found out about the “nogil” fork of Python it took a single person less than half a working day to adjust the codebase to use this fork and the results were astonishing. Now we can focus on data acquisition system development rather than fine-tuning data exchange algorithms.
Allen Goodman, author of CellProfiler and staff engineer at Prescient Design and Genentech, describes how the GIL makes biological methods research more difficult in Python:
Issues with Python’s global interpreter lock are a frequent source of frustration throughout biological methods research.I wanted to better understand the current multithreading situation so I reimplemented parts of HMMER, a standard method for multiple-sequence alignment. I chose this method because it stresses both single-thread performance (scoring) and multi-threaded performance (searching a database of sequences). The GIL became the bottleneck when using only eight threads. This is a method where the current popular implementations rely on 64 or even 128 threads per process. I tried moving to subprocesses but was blocked by the prohibitive IPC costs. HMMER is a relatively elementary bioinformatics method and newer methods have far bigger multi-threading demands.
Method researchers are begging to use Python (myself included), because of its ease of use, the Python ecosystem, and because “it’s what people know.” Many biologists only know a little bit of programming (and that’s almost always Python). Until Python’s multithreading situation is addressed, C and C++ will remain the lingua franca of the biological methods research community.
The GIL Affects Python Library Usability
The GIL is a CPython implementation detail that limits multithreaded parallelism, so it might seem unintuitive to think of it as a usability issue. However, library authors frequently care a great deal about performance and will design APIs that support working around the GIL. These workaround frequently lead to APIs that are more difficult to use. Consequently, users of these APIs may experience the GIL as a usability issue and not just a performance issue.
For example, PyTorch exposes a multiprocessing-based API called
DataLoader
for building data input pipelines. It uses fork()
on Linux because it is generally faster and uses less memory
than spawn()
, but this leads to additional challenges for users:
creating a DataLoader
after accessing a GPU can lead to confusing
CUDA errors. Accessing GPUs within a DataLoader
worker quickly
leads to out-of-memory errors because processes do not share CUDA
contexts (unlike threads within a process).
Olivier Grisel, scikit-learn developer and software engineer at Inria, describes how having to work around the GIL in scikit-learn related libraries leads to a more complex and confusing user experience:
Over the years, scikit-learn developers have maintained ancillary libraries such asjoblib
andloky
to try to work around some of the limitations of multiprocessing: extra memory usage partially mitigated via semi-automated memory mapping of large data buffers, slow worker startup by transparently reusing a pool of long running workers, fork-safety problems of third-party native runtime libraries such as GNU OpenMP by never using the fork-only start-method, ability to perform parallel calls of interactively defined functions in notebooks and REPLs in cross-platform manner via cloudpickle. Despite our efforts, this multiprocessing-based solution is still brittle, complex to maintain and confusing to datascientists with limited understanding of system-level constraints. Furthermore, there are still irreducible limitations such as the overhead caused by the pickle-based serialization/deserialization steps required for inter-process communication. A lot of this extra work and complexity would not be needed anymore if we could use threads without contention on multicore hosts (sometimes with 64 physical cores or more) to run data science pipelines that alternate between Python-level operations and calls to native libraries.
Ralf Gommers, co-director of Quansight Labs and NumPy and SciPy maintainer, describes how the GIL affects the user experience of NumPy and numeric Python libraries:
A key problem in NumPy and the stack of packages built around it is that NumPy is still (mostly) single-threaded — and that has shaped significant parts of the user experience and projects built around it. NumPy does release the GIL in its inner loops (which do the heavy lifting), but that is not nearly enough. NumPy doesn’t offer a solution to utilize all CPU cores of a single machine well, and instead leaves that to Dask and other multiprocessing solutions. Those aren’t very efficient and are also more clumsy to use. That clumsiness comes mainly in the extra abstractions and layers the users need to concern themselves with when using, e.g.,dask.array
which wrapsnumpy.ndarray
. It also shows up in oversubscription issues that the user must explicitly be aware of and manage via either environment variables or a third package,threadpoolctl
. The main reason is that NumPy calls into BLAS for linear algebra - and those calls it has no control over, they do use all cores by default via either pthreads or OpenMP.Coordinating on APIs and design decisions to control parallelism is still a major amount of work, and one of the harder challenges across the PyData ecosystem. It would have looked a lot different (better, easier) without a GIL.
GPU-Heavy Workloads Require Multi-Core Processing
Many high-performance computing (HPC) and AI workloads make heavy use of GPUs. These applications frequently require efficient multi-core CPU execution even though the bulk of the computation runs on a GPU.
Zachary DeVito, PyTorch core developer and researcher at FAIR (Meta AI), describes how the GIL makes multithreaded scaling inefficient even when the bulk of computation is performed outside of Python:
In PyTorch, Python is commonly used to orchestrate ~8 GPUs and ~64 CPU threads, growing to 4k GPUs and 32k CPU threads for big models. While the heavy lifting is done outside of Python, the speed of GPUs makes even just the orchestration in Python not scalable. We often end up with 72 processes in place of one because of the GIL. Logging, debugging, and performance tuning are orders-of-magnitude more difficult in this regime, continuously causing lower developer productivity.
The use of many processes (instead of threads) makes common tasks more difficult. Zachary DeVito continues:
On three separate occasions in the past couple of months (reducing redundant compute in data loaders, writing model checkpoints asynchronously, and parallelizing compiler optimizations), I spent an order-of-magnitude more time figuring out how to work around GIL limitations than actually solving the particular problem.
Even GPU-heavy workloads frequently have a CPU-intensive component. For example, computer vision tasks typically require multiple “pre-processing” steps in the data input pipeline, like image decoding, cropping, and resizing. These tasks are commonly performed on the CPU and may use Python libraries like Pillow or Pillow-SIMD. It is necessary to run the data input pipeline on multiple CPU cores in order to keep the GPU “fed” with data.
The increase in GPU performance compared to individual CPU cores makes multi-core performance more important. It is progressively more difficult to keep the GPUs fully occupied. To do so requires efficient use of multiple CPU cores, especially on multi-GPU systems. For example, NVIDIA’s DGX-A100 has 8 GPUs and two 64-core CPUs in order to keep the GPUs “fed” with data.
The GIL Makes Deploying Python AI Models Difficult
Python is widely used to develop neural network-based AI models. In PyTorch, models are frequently deployed as part of multi-threaded, mostly C++, environments. Python is often viewed skeptically because the GIL can be a global bottleneck, preventing efficient scaling even though the vast majority of the computations occur “outside” of Python with the GIL released. The torchdeploy paper [2] shows experimental evidence for these scaling bottlenecks in multiple model architectures.
PyTorch provides a number of mechanisms for deploying Python AI
models that avoid or work around the GIL, but they all come with
substantial limitations. For example, TorchScript captures a
representation of the model that can be executed from C++ without any
Python dependencies, but it only supports a limited subset of Python
and often requires rewriting some of the model’s code. The
torch::deploy API
allows multiple Python interpreters, each with its own GIL, in the
same process(similar to PEP 684). However, torch::deploy
has
limited support for Python modules that use C-API extensions.
Motivation Summary
Python’s global interpreter lock makes it difficult to use modern multi-core CPUs efficiently for many scientific and numeric computing applications. Heinrich Kuttler, Manuel Kroiss, and Paweł Jurgielewicz found that multi-threaded implementations in Python did not scale well for their tasks and that using multiple processes was not a suitable alternative.
The scaling bottlenecks are not solely in core numeric tasks. Both Zachary DeVito and Paweł Jurgielewicz described challenges with coordination and communication in Python.
Olivier Grisel, Ralf Gommers, and Zachary DeVito described how current workarounds for the GIL are “complex to maintain” and cause “lower developer productivity.” The GIL makes it more difficult to develop and maintain scientific and numeric computing libraries as well leading to library designs that are more difficult to use.
Specification
Build Configuration Changes
The global interpreter lock will remain the default for CPython builds
and python.org downloads. A new build configuration flag,
--disable-gil
will be added to the configure script that will build
CPython with support for running without the global interpreter lock.
When built with --disable-gil
, CPython will define the Py_GIL_DISABLED
macro in Python/patchlevel.h. The ABI tag will include the letter “t”
(for “threading”).
The --disable-gil
builds of CPython will still support optionally
running with the GIL enabled at runtime (see PYTHONGIL Environment
Variable and Py_mod_gil Slot).
Overview of CPython Changes
Removing the global interpreter lock requires substantial changes to CPython internals, but relatively few changes to the public Python and C APIs. This section describes the required changes to the CPython implementation followed by the proposed API changes.
The implementation changes can be grouped into the following four categories:
- Reference counting
- Memory management
- Container thread-safety
- Locking and atomic APIs
Reference Counting
Removing the GIL requires changes to CPython’s reference counting implementation to make it thread-safe. Furthermore, it needs to have low execution overhead and allow for efficient scaling with multiple threads. This PEP proposes a combination of three techniques to address these constraints. The first is a switch from plain non-atomic reference counting to biased reference counting, which is a thread-safe reference counting technique with lower execution overhead than plain atomic reference counting. The other two techniques are immortalization and a limited form of deferred reference counting; they address some of the multi-threaded scalability issues with reference counting by avoiding some reference count modifications.
Biased reference counting (BRC) is a technique first described in 2018 by Jiho Choi, Thomas Shull, and Josep Torrellas [3]. It is based on the observation that most objects are only accessed by a single thread, even in multi-threaded programs. Each object is associated with an owning thread (the thread that created it). Reference counting operations from the owning thread use non-atomic instructions to modify a “local” reference count. Other threads use atomic instructions to modify a “shared” reference count. This design avoids many atomic read-modify-write operations that are expensive on contemporary processors.
The implementation of BRC proposed in this PEP largely matches the original description of biased reference counting, but differs in details like the size of reference counting fields and special bits in those fields. BRC requires storing three pieces of information in each object’s header: the “local” reference count, the “shared” reference count, and the identifier of the owning thread. The BRC paper packs these three things into a single 64-bit field. This PEP proposes using three separate fields in each object’s header to avoid potential issues due to reference count overflow. Additionally, the PEP supports a faster deallocation path that avoids an atomic operation in the common case.
The proposed PyObject
struct (also called struct _object
) is
below:
struct _object {
_PyObject_HEAD_EXTRA
uintptr_t ob_tid; // owning thread id (4-8 bytes)
uint16_t __padding; // reserved for future use (2 bytes)
PyMutex ob_mutex; // per-object mutex (1 byte)
uint8_t ob_gc_bits; // GC fields (1 byte)
uint32_t ob_ref_local; // local reference count (4 bytes)
Py_ssize_t ob_ref_shared; // shared reference count and state bits (4-8 bytes)
PyTypeObject *ob_type;
};
The ob_tid
, ob_ref_local
, and ob_ref_shared
are used by
the biased reference counting implementation. The ob_gc_bits
field
is used store garbage collection flags that were previously stored in
PyGC_Head
(see Garbage Collection (Cycle Collection)). The
ob_mutex
field provides a per-object lock in a single byte.
Immortalization
Some objects, such as interned strings, small integers, statically
allocated PyTypeObjects, and the True
, False
, and None
objects stay alive for the lifetime of the program. These objects are
marked as immortal by setting the local reference count field
(ob_ref_local
) to UINT32_MAX
.
The Py_INCREF
and Py_DECREF
macros are no-ops for immortal
objects. This avoids contention on the reference count fields of
these objects when multiple threads access them concurrently.
This proposed immortalization scheme is very similar to PEP 683, adopted in Python 3.12, but with slightly different bit representation in the reference count fields for immortal objects in order to work with biased reference counting and deferred reference counting. See also Why Not Use PEP 683 Immortalization?.
Biased Reference Counting
Biased reference counting has a fast-path for objects “owned” by the
current thread and a slow-path for other objects. Ownership is
indicated by the ob_tid
field. Determining the thread id requires
platform specific code [5]. A value of 0
in ob_tid
indicates that the object is not owned by any thread.
The ob_ref_local
field stores the local reference count and two
flags. The two most significant bits are used to indicate the object
is immortal or uses deferred reference counting (see Deferred
reference counting).
The ob_ref_shared
field stores the shared reference count. The
two least significant bits are used to store the reference
counting state. The shared reference count is therefore shifted left by
two. The ob_ref_shared
field uses the least significant bits
because the shared reference count can be temporarily negative; increfs
and decrefs may not be balanced between threads.
The possible reference counting states are listed below:
0b00
- default0b01
- weakrefs0b10
- queued0b11
- merged
The states form a progression: during their lifecycle, objects may
transition to any numerically higher state. Objects can only be
deallocated from the “default” and “merged” states. Other states must
transition to the “merged” state before deallocation. Transitioning
states requires an atomic compare-and-swap on the ob_ref_shared
field.
Default (0b00
)
Objects are initially created in the default state. This is the only state that allows for the quick deallocation code path. Otherwise, the thread must merge the local and shared reference count fields, which requires an atomic compare-and-swap.
This quick deallocation code path would not be thread-safe with concurrent dereferencing of weakrefs, so the first time a weak reference is created, the object is transitioned to the “weakrefs” state if it is currently in the “default” state.
Similarly, the quick deallocation code path would not be thread-safe with the lockless list and dictionary accesses (see Optimistically Avoiding Locking), so the first time a non-owning thread thread attempts to retrieve an object in the “default” state it falls back to the slower locking code path and transitions the object to the “weakrefs” state.
Weakrefs (0b01
)
Objects in weakref and higher states support dereferencing weakrefs as well as the lockless list and dictionary access by non-owning threads. They require transitioning to the merged state before deallocation, which is more expensive than the quick deallocation code path supported by the “default” state.
Queued (0b10
)
The queued state indicates that the a non-owning thread has requested
that the reference count fields be merged. This can happen when the
shared reference count becomes negative (due to an imbalance between
increfs and decrefs between threads). The object is inserted into the
owning thread’s queue of objects to be merged. The owning thread is
notified via the eval_breaker
mechanism. In practice, this
operation is rare. Most objects are only accessed by a single thread
and those objects accessed by multiple threads rarely have negative
shared reference counts.
If the owning thread has terminated, the acting thread immediately merges the local and shared reference count fields and transitions to the merged state.
Merged (0b11
)
The merged state indicates that the object is not owned by any thread.
The ob_tid
field is zero in this state and ob_ref_local
is not
used. Once the shared reference count reaches zero, the object can
be deallocated from the merged state.
Reference counting pseudo-code
The proposed Py_INCREF
and Py_DECREF
operation should behave
as follows (using C-like pseudo-code):
// low two bits of "ob_ref_shared" are used for flags
#define _Py_SHARED_SHIFT 2
void Py_INCREF(PyObject *op)
{
uint32_t new_local = op->ob_ref_local + 1;
if (new_local == 0)
return; // object is immortal
if (op->ob_tid == _Py_ThreadId())
op->ob_ref_local = new_local;
else
atomic_add(&op->ob_ref_shared, 1 << _Py_SHARED_SHIFT);
}
void Py_DECREF(PyObject *op)
{
if (op->ob_ref_local == _Py_IMMORTAL_REFCNT) {
return; // object is immortal
}
if (op->ob_tid == _Py_ThreadId()) {
op->ob_ref_local -= 1;
if (op->ob_ref_local == 0) {
_Py_MergeZeroRefcount(); // merge refcount
}
}
else {
_Py_DecRefShared(); // slow path
}
}
void _Py_MergeZeroRefcount(PyObject *op)
{
if (op->ob_ref_shared == 0) {
// quick deallocation code path (common case)
op->ob_tid = 0;
_Py_Dealloc(op);
}
else {
// slower merging path not shown
}
}
The reference implementation [17] contains implementations of
_Py_MergeZeroRefcount
and _Py_DecRefShared
.
Note that the above is pseudocode: in practice, the implementation
should use “relaxed atomics” to access ob_tid
and
ob_ref_local
to avoid undefined behavior in C and C++.
Deferred Reference Counting
A few types of objects, such as top-level functions, code objects, modules, and methods, tend to be frequently accessed by many threads concurrently. These objects don’t necessarily live for the lifetime of the program, so immortalization is not a good fit. This PEP proposes a limited form of deferred reference counting to avoid contention on these objects’ reference count fields in multi-threaded programs.
Typically, the interpreter modifies objects’ reference counts as they are pushed to and popped from the interpreter’s stack. The interpreter skips these reference counting operations for objects that use deferred reference counting. Objects that support deferred reference counting are marked by setting the two most significant bits in the local reference count field to one.
Because some reference counting operations are skipped, the reference count fields no longer reflect the true number of references to these objects. The true reference count is the sum of the reference count fields plus any skipped references from each thread’s interpreter stack. The true reference count can only be safely computed when all threads are paused during cyclic garbage collection. Consequently, objects that use deferred reference counting can only be deallocated during garbage collection cycles.
Note that the objects that use deferred reference counting already naturally form reference cycles in CPython, so they would typically be deallocated by the garbage collector even without deferred reference counting. For example, top-level functions and modules form a reference cycle as do methods and type objects.
Garbage Collector Modifications for Deferred Reference Counting
The tracing garbage collector finds and deallocates unreferenced
objects. Currently, the tracing garbage collector only finds
unreferenced objects that are part of a reference cycle. With
deferred reference counting, the tracing garbage collector will also
find and collect some unreferenced objects that may not be part of
any reference cycle, but whose collection has been delayed due to
deferred reference counting. This requires that all objects that
support deferred reference counting also have a corresponding type
object that supports tracing garbage collection (through the
Py_TPFLAGS_HAVE_GC
flag). Additionally, the garbage collector
will need to traverse each thread’s stack to add references to the GC
reference count at the start of each collection.
Reference Counting Type Objects
Type objects (PyTypeObject
) use a mix of reference counting
techniques. Statically allocated type objects are immortalized because
the objects already live for the lifetime of the program. Heap type
objects use deferred reference counting in combination with per-thread
reference counting. Deferred reference counting is not sufficient to
address the multi-threaded scaling bottlenecks with heap types because
most references to heap types are from object instances, not references
on the interpreter stack.
To address this, heap type reference counts are partially stored in a
distributed manner in per-thread arrays. Every thread stores an
array of local reference counts for each heap type object. Heap type
objects are assigned a unique number that determines its position in
the local reference count arrays. A heap type’s true reference count
is the sum of its entries in the per-thread arrays, plus the reference
count on the PyTypeObject
, plus any deferred references in the
interpreter stack.
Threads may grow their own type reference count arrays as needed when incrementing or decrementing the local reference count of a type object.
Use of the per-thread reference count arrays is limited to a few places:
PyType_GenericAlloc(PyTypeObject *type, Py_ssize_t nitems)
: Increments the current thread’s local reference count fortype
, if it is a heap type.subtype_dealloc(PyObject *self)
: Decrements the current thread’s local reference count forself->ob_type
, if the type is a heap type.gcmodule.c
: Adds each thread’s local reference counts to thegc_refs
count for the corresponding heap type object.
Additionally, when a thread terminates, it adds any non-zero local reference counts to each type object’s own reference count field.
Memory Management
CPython currently uses an internal allocator, pymalloc, which is optimized for small object allocation. The pymalloc implementation is not thread-safe without the GIL. This PEP proposes replacing pymalloc with mimalloc, a general-purpose thread-safe allocator with good performance, including for small allocations.
Using mimalloc, with some modifications, also addresses two other issues related to removing the GIL. First, traversing the internal mimalloc structures allows the garbage collector to find all Python objects without maintaining a linked list. This is described in more detail in the garbage collection section. Second, mimalloc heaps and allocations based on size class enable collections like dict to generally avoid acquiring locks during read-only operations. This is described in more detail in the collection thread-safety section.
CPython already requires that objects that support garbage collection
use the GC allocator APIs (typically indirectly by calling
PyType_GenericAlloc
). This PEP would add additional requirements
to the use of the Python allocator APIs. First, Python objects must
be allocated through object allocation APIs, such as
PyType_GenericAlloc
, PyObject_Malloc
, or other Python APIs
that wrap those calls. Python objects should not be allocated through
other APIs, such as raw calls to C’s malloc or the C++ new operator.
Additionally, PyObject_Malloc
should be used only for allocating
Python objects; it should not be used for allocating buffers,
storages, or other data structures that are not PyObjects.
This PEP also imposes restrictions on the pluggable allocator API
(PyMem_SetAllocator
). When compiling without the GIL, allocators
set using this API must eventually delegate the allocation to the
corresponding underlying allocator, such as PyObject_Malloc
, for
Python object allocations. This allows for allocators that “wrap”
underlying allocators, such as Python’s tracemalloc and debug
allocator, but not for wholly replacing the allocator.
CPython Free Lists
CPython makes use of free lists to speed up the allocation of small,
frequently allocated objects like tuples and numbers. These free
lists are moved to PyThreadState
from per-interpreter state.
Garbage Collection (Cycle Collection)
The CPython garbage collector requires the following changes to work with this proposal:
- Use of “stop-the-world” to provide thread-safety guarantees that were previously provided by the GIL.
- Elimination of generational garbage collection in favor of non-generational collector.
- Integration with deferred reference counting and biased reference counting.
Additionally, the above changes enable removing the
_gc_prev
and _gc_next
fields from GC objects. The GC bits
that stored the tracked, finalized, and unreachable states are moved
to the ob_gc_bits
field in the PyObject header.
Stop-the-World
The CPython cycle garbage collector currently relies on the global
interpreter lock to prevent other threads from accessing Python
objects while the collector finds cycles. The GIL is never released
during the cycle-finding routine, so the collector can rely on
stable (i.e., unchanging) reference counts and references for the
duration of that routine. However, following cycle detection, the GIL
may be temporarily released while calling objects’ finalizers and
clear (tp_clear
) functions, allowing other threads to run in an
interleaved fashion.
When running without the GIL, the implementation needs a way to ensure that reference counts remain stable during cycle detection. Threads running Python code must be paused to ensure that references and reference counts remain stable. Once the cycles are identified, other threads are resumed.
The current CPython cyclic garbage collector involves two
cycle-detection passes during each garbage collection cycle.
Consequently, this requires two stop-the-world pauses when running the
garbage collector without the GIL. The first cycle-detection pass
identifies cyclic trash. The second pass runs after finalizers to
identify which objects still remain unreachable. Note that other
threads are resumed before finalizers and tp_clear
functions are
called to avoid introducing potential deadlocks that are not present in
the current CPython behavior.
Thread States
To support pausing threads for garbage collection, the PyThreadState gets a new “status” field. Like the other fields in PyThreadState, the status field is not part of the public CPython API. The status field may be in one of three states:
ATTACHED
DETACHED
GC
The ATTACHED
and DETACHED
states correspond closely to
acquiring and releasing the global interpreter lock. When compiling
without the GIL, functions that previously acquired the GIL instead
transition the thread state to ATTACHED
, and functions that
previously released the GIL transition the thread state
to DETACHED
. Just as threads previously needed to acquire the
GIL before accessing or modifying Python objects, they now must be in
the ATTACHED
state before accessing or modifying Python
objects. Since the same public C-API functions “attach” the thread as
previously acquired the GIL (e.g., PyEval_RestoreThread
), the
requirements for thread initialization in extensions remain the same.
The substantial difference is that multiple threads can be in the
attached state simultaneously, while previously only one thread could
acquire the GIL at a time.
During stop-the-world pauses, the thread performing garbage collection
needs to ensure that no other thread is accessing or modifying Python
objects. All other threads must be in the “GC” state. The garbage
collection thread can transition other threads from the DETACHED
state to the GC state using an atomic compare-and-swap operation on
the status field. Threads in the ATTACHED
state are requested to
pause themselves and set their status to “GC”, using the
existing “eval breaker” mechanism. At the end of the stop-the-world
pause, all threads in the “GC” state are set to DETACHED
and
woken up if they are paused. Threads that were previously attached
(i.e., executing Python bytecode) can re-attach (set their thread
states to ATTACHED
) and resume executing Python code. Threads
that were previously DETACHED
ignore the notification.
Generations
The existing Python garbage collector uses three generations. When compiling without the GIL, the garbage collector will only use a single generation (i.e., it will be non-generational). The primary reason for this change is to reduce the impact of the stop-the-world pauses in multithreaded applications. Frequent stop-the-world pauses for collecting the young generation would have more of an impact on multi-threaded applications than less frequent collections.
Integration With Deferred and Biased Reference Counting
To find unreferenced objects, the cyclic garbage collector computes
the difference between the number of incoming references and the
object’s reference count. This difference is called gc_refs
and
is stored in the _gc_prev
field. If gc_refs
is greater than
zero, then the object is guaranteed to be alive (i.e., not cyclic
trash). If gc_refs
is zero, then the object is only alive if it
is transitively referenced by another live object. When computing
this difference, the collector should traverse each thread’s stack,
and for every deferred reference, increment the gc_refs
for the
referred object. Since generator objects also have stacks with
deferred references, the same procedure is applied to each
generator’s stack.
Python unit tests commonly use gc.collect()
to ensure that any
unreferenced objects are destructed and their finalizers run. Since
biased reference counting can delay the destruction of some objects
that are referenced by multiple threads, it’s convenient to ensure
that those objects are destructed during garbage collection, even
though they may not be part of any reference cycles. While other
threads are paused, the garbage collector thread should merge the
reference counts for any queued objects, but not call any destructors
even if the combined reference count is zero. (Calling destructors
while other threads are paused risks introducing deadlocks.) Once
other threads are resumed, the GC thread should call _Py_Dealloc
on those objects with a zero merged reference count.
Container Thread-Safety
In CPython, the global interpreter lock protects against corruption of
internal interpreter states when multiple threads concurrently access
or modify Python objects. For example, if multiple threads
concurrently modify the same list, the GIL ensures that the length of
the list (ob_size
) accurately matches the number of elements, and
that the reference counts of each element accurately reflect the
number of references to those elements. Without the GIL — and
absent other changes — concurrent modifications would corrupt those
fields and likely lead to program crashes.
The GIL does not necessarily ensure that operations are atomic or
remain correct when multiple operations occur concurrently. For
example, list.extend(iterable)
may not appear atomic if the
iterable has an iterator implemented in Python (or releases the GIL
internally). Similarly, list.remove(x)
can remove the wrong
object if it overlaps with another operation that modifies the list,
depending on the implementation of the equality operator. Still, the
GIL ensures that some operations are effectively atomic. For example,
the constructor list(set)
atomically copies the items of the set
to a new list, and some code relies on that copy being atomic
(i.e., having a snapshot of the items in the set). This PEP preserves
that property.
This PEP proposes using per-object locks to provide many of the same protections that the GIL provides. For example, every list, dictionary, and set will have an associated lightweight lock. All operations that modify the object must hold the object’s lock. Most operations that read from the object should acquire the object’s lock as well; the few read operations that can proceed without holding a lock are described below.
Per-object locks with critical sections provide weaker protections than the GIL. Because the GIL doesn’t necessarily ensure that concurrent operations are atomic or correct, the per-object locking scheme also cannot ensure that concurrent operations are atomic or correct. Instead, per-object locking aims for similar protections as the GIL, but with mutual exclusion limited to individual objects.
Most operations on an instance of a container type require locking that object. For example:
list.append
,list.insert
,list.repeat
,PyList_SetItem
dict.__setitem__
,PyDict_SetItem
list.clear
,dict.clear
list.__repr__
,dict.__repr__
, etc.list.extend(iterable)
setiter_iternext
Some operations operate directly on two container objects, with
knowledge about both containers’ internal structure. For example,
there are internal specializations of list.extend(iterable)
for
specific iterable types, like set
. These operations need to lock
both container objects because they access the internals of both
objects simultaneously. Note that the generic implementation of
list.extend
only needs to lock one object (the list) because the
other object is accessed indirectly through the thread-safe iterator
API. Operations that lock two containers are:
list.extend(list)
,list.extend(set)
,list.extend (dictitems)
, and other specializations where the implementation is specialized for argument type.list.concat(list)
list.__eq__(list)
,dict.__eq__(dict)
Some simple operations can be implemented directly with atomic accesses and do not need locks because they only access a single field. These operations include:
len(list)
i.e.,list_length(PyListObject *a)
len(dict)
len(set)
A select few operations optimistically avoid locking to improve performance. These require special implementations and cooperation from the memory allocator:
list[idx]
(list_subscript
)dict[key]
(dict_subscript
)listiter_next
,dictiter_iternextkey/value/item
list.contains
Borrowed References
Per-object locking provides many of the important protections that the GIL provides, but there are a few cases where it’s not sufficient. For example, code that relies on upgrading a borrowed reference to an “owned” reference may be unsafe in certain circumstances:
PyObject *item = PyList_GetItem(list, idx);
Py_INCREF(item);
The GIL ensures that no other thread can modify the list in between
the access and the Py_INCREF
call. Without the GIL – even with
per-object locking – another thread might modify the list leading to
item
being freed between the access and the Py_INCREF
call.
The problematic borrowed reference APIs are supplemented with functions that return “new references” but are otherwise equivalent:
PyList_FetchItem(list, idx)
forPyList_GetItem
PyDict_FetchItem(dict, key)
forPyDict_GetItem
PyWeakref_FetchObject
forPyWeakref_GetObject
Note that some APIs that return borrowed references, such as
PyTuple_GetItem
, are not problematic because tuples are
immutable. Similarly, not all uses of the above APIs are problematic.
For example, PyDict_GetItem
is often used for parsing keyword
argument dictionaries in function calls; those keyword argument
dictionaries are effectively private (not accessible by other
threads).
Python Critical Sections
Straightforward per-object locking could introduce deadlocks that were not present when running with the GIL. Threads may hold locks for multiple objects simultaneously because Python operations can nest. Operations on objects can invoke operations on other objects, acquiring multiple per-object locks. If threads try to acquire the same locks in different orders, they will deadlock.
This PEP proposes a scheme called “Python critical sections” to implicitly release per-object locks to avoid deadlocks. To understand the scheme, we first introduce a general approach to avoid deadlocks, and then propose a refinement of that approach with better performance.
One way to avoid deadlocks is to allow threads to hold only the lock (or locks) for a single operation at a time (typically a single lock, but some operations involve two locks as described above). When a thread begins a nested operation it should suspend the locks for any outer operation: before beginning the nested operation, the locks for the outer operation are released and when the nested operation completes, the locks for the outer operation are reacquired.
Additionally, the locks for any active operation should be suspended around potentially blocking operations, such as I/O (i.e., operations that would have released the GIL). This is because the interaction between locks and blocking operations can lead to deadlocks in the same way as the interaction between multiple locks.
To improve performance, this PEP proposes a variation of the above scheme that still avoids deadlocks. Instead of immediately suspending locks any time a nested operation begins, locks are only suspended if the thread would block (i.e., would have released the GIL). This reduces the number of lock acquisitions and releases for nested operations, while avoiding deadlocks.
The proposed API for Python critical sections are the following four macros. These are intended to be public (usable by C-API extensions), but not part of the limited API:
Py_BEGIN_CRITICAL_SECTION(PyObject *op);
: Begins a critical section by acquiring the mutex for the referenced object. If the object is already locked, then locks for any outstanding critical sections are released before this thread waits for referenced object to be unlocked.Py_END_CRITICAL_SECTION;
: Ends the most recent operation, unlocking the mutex. The next most recent previous critical section (if any) is resumed if it is currently suspended.Py_BEGIN_CRITICAL_SECTION2(PyObject *a, PyObject *b);
: Begins a critical section by acquiring the mutexes for two objects. To ensure consistent lock ordering, the order of acquisition is determined by memory address (i.e., the mutex with lower memory address is acquired first). If either mutex is already locked, then locks for any outstanding critical sections are released before this thread waits for the referenced objects to be unlocked.Py_END_CRITICAL_SECTION2;
: Behaves the same asPy_END_CRITICAL_SECTION
but unlocks two objects.
Additionally, when a thread transitions from the ATTACHED
state to
the DETACHED
state, it should suspend any active critical
sections. When transitioning from DETACHED
to ATTACHED
, the
most recent suspended critical section, if any, should be resumed.
Note that operations that lock two containers simultaneously need to use
the Py_BEGIN_CRITICAL_SECTION2
macro. It is not sufficient to nest
two calls to Py_BEGIN_CRITICAL_SECTION
because the inner critical
section may release the locks from the outer critical section.
Optimistically Avoiding Locking
A few operations on dict
and list
optimistically avoid
acquiring the per-object locks. They have a fast path operation that
does not acquire locks, but may fall back to a slower operation that
acquires the dictionary’s or list’s lock when another thread is
concurrently modifying that container.
The operations with an optimistic fast path are:
PyDict_FetchItem/GetItem
anddict.__getitem__
PyList_FetchItem/GetItem
andlist.__getitem__
Additionally, iterators for dict
and list
use the above
functions so they also optimistically avoid locking when returning
the next item.
There are two motivations for avoiding lock acquisitions in these functions. The primary reason is that it is necessary for scalable multi-threaded performance even for simple applications. Dictionaries hold top-level functions in modules and methods for classes. These dictionaries are inherently highly shared by many threads in multi-threaded programs. Contention on these locks in multi-threaded programs for loading methods and functions would inhibit efficient scaling in many basic programs.
The secondary motivation for avoiding locking is to reduce overhead and improve single-threaded performance. Although lock acquisition has low overhead compared to most operations, accessing individual elements of lists and dictionaries are fast operations (so the locking overhead is comparatively larger) and frequent (so the overhead has more impact).
This section describes the challenges with implementing dictionary and list accesses without locking followed by a description of this PEP’s changes to the Python interpreter required to address those challenges.
The main challenge is that retrieving an item from a list or dictionary and incrementing the reference count of that item is not an atomic operation. In between the time the item is retrieved and the reference count is incremented, another thread may modify the list or dictionary, possibly freeing the memory for the previously retrieved item.
A partial attempt at addressing this issue would be to convert the reference count increment to a conditional increment, only incrementing the reference count if it’s not zero. This change is not sufficient because when a Python object’s reference count reaches zero, the object’s destructor is called and the memory storing the object may be re-used for other data structures or returned to the operating system. Instead, this PEP proposes a technique to ensure that the reference count fields remain valid for the duration of the access, so that the conditional reference count increment is safe. This technique requires cooperation from the memory allocator (mimalloc) as well as changes to the list and dictionary objects. The proposed technique is similar to read-copy update (RCU) [6], a synchronization mechanism widely used in the Linux kernel.
The current implementation of list_item
(the C function
implementing list.__getitem__
) is the following:
Py_INCREF(a->ob_item[i]);
return a->ob_item[i];
The proposed implementation uses the conditional increment
(_Py_TRY_INCREF
) and has additional checks:
PyObject **ob_item = atomic_load(&a->ob_item);
PyObject *item = atomic_load(&ob_item[i]);
if (!item || !_Py_TRY_INCREF(item)) goto retry;
if (item != atomic_load(&ob_item[i])) {
Py_DECREF(item);
goto retry;
}
if (ob_item != atomic_load(&a->ob_item)) {
Py_DECREF(item);
goto retry;
}
return item;
The “retry” subroutine implements the locked fallback path when concurrent modifications to the list cause the above fast, non-locking path to fail:
retry:
PyObject *item;
Py_BEGIN_CRITICAL_SECTION(a->ob_mutex);
item = a->ob_item[i];
Py_INCREF(item);
Py_END_CRITICAL_SECTION(a->ob_mutex);
return item;
The modifications to the dict
implementation are similar, because
the relevant parts of both list and dictionary retrieval involve
loading an item/value from an array at a known index.
The additional checks following the conditional increment are
necessary because the scheme allows immediate re-use of memory,
including the memory that previously held a PyObject
structure or
list
or dict
array. Without these extra checks, the function
might return a Python object that was never in the list, if the
memory occupied by the Python object previously held a different
PyObject
whose memory previously stored an item in the list.
Mimalloc Changes for Optimistic list
and dict
Access
The implementation requires additional constraints to the memory allocator, including some changes to the mimalloc code. Some background on mimalloc’s implementation is helpful to understand the required changes. Individual allocations from mimalloc are called “blocks.” Mimalloc “pages” contain consecutive blocks that are all the same size. A mimalloc “page” is similar to a “superblock” in other allocators; it is NOT an operating system page. A mimalloc “heap” contains pages of various size classes; each page belongs to a single heap. If none of the blocks of a page are allocated, then mimalloc may re-use the page for a different size class or different heap (i.e., it might reinitialize the page).
The list and dictionary access scheme works by partially restricting re-use of mimalloc pages so that reference count fields remains valid for the duration of the access. The restricted re-use of mimalloc pages is enforced by having separate heaps for Python objects [7]. This ensures that even if an item is freed during access and the memory reused for a new object, the new object’s reference count field is placed at the same location in memory. The reference count field remains valid (or zero) across allocations.
Python objects that support Py_TPFLAGS_MANAGED_DICT
have their
dictionary and weak reference fields preceding the PyObject
header, so their reference count fields are at a different offset from
the start of their allocations. They are stored in a separate mimalloc
heap. Additionally, non-GC objects are stored in their own heap so
that the GC only has to look at GC objects. There are therefore three
mimalloc heaps for Python objects, one for non-GC objects, one for GC
objects with managed dictionaries, and one for GC objects without
managed dictionaries.
Mimalloc Page Reuse
It is beneficial to keep the restrictions on mimalloc page reuse to a short period of time to avoid increasing overall memory usage. Precisely limiting the restrictions to list and dictionary accesses would minimize memory usage, but would require expensive synchronizations. At the other extreme, keeping the restrictions until the next GC cycle would avoid introducing any extra synchronizations, but would potentially increase memory usage.
This PEP proposes a system that lies between those two extremes based on FreeBSD’s “GUS” [8]. It uses a combination of global and per-thread counters (or “sequence numbers”) to coordinate the determination of when it is safe to reuse an empty mimalloc page for a different heap or for a different size class, or to return it to the operating system:
- There is a global write sequence number that monotonically increases.
- When a mimalloc page is empty, it’s tagged with the current write sequence number. The thread may also atomically increment the global write sequence number.
- Each thread has a local read sequence number that records the most recent write sequence number it has observed.
- Threads may observe the write sequence number whenever they are not in a list or dictionary access. The reference implementation does this in mimalloc’s slow-path allocation function. This is called regularly enough to be useful, but not so frequently as to introduce significant overhead.
- There is a global read sequence number that stores the minimum of all active threads’ read sequence numbers. A thread may update the global read sequence number by scanning each threads’ local read sequence number. The reference implementation does this before allocating a fresh mimalloc page if there are restricted pages that could possibly be reused.
- An empty mimalloc page may be reused for a different heap or size class when the global read sequence number is larger than the page’s tag number.
The condition that the global read sequence number is larger than the page’s tag is sufficient because it ensures that any thread that had a concurrent optimistic list or dictionary access is finished with that access. In other words, there are no threads accessing the empty blocks in the freed page, so the page can be used for any other purpose or even returned to the operating system.
Optimistic dict
and list
Access Summary
This PEP proposes a technique for thread-safe list and dictionary accesses that typically avoids acquiring locks. This reduces execution overhead and avoids some multi-threaded scaling bottlenecks in common operations, like calling functions and methods. The scheme works by placing temporary restrictions on mimalloc page reuse to ensure that objects’ reference count fields remain valid after objects are freed so that conditional reference count increment operations are safe. The restrictions are placed on mimalloc pages instead of on individual objects to improve opportunities for memory reuse. The restrictions are lifted as soon as the system can determine that there are no outstanding accesses involving the empty mimalloc page. To determine this, the system uses a combination of lightweight per-thread sequence counters and also tags pages when they are empty. Once each thread’s local counter is larger than the page’s tag, it can be reused for any purpose or returned to the operating system. The restrictions are also lifted whenever the cyclic garbage collector runs because the stop-the-world pause ensures that threads do not have any outstanding references to empty mimalloc pages.
Specializing Interpreter
The specializing interpreter requires some changes to be thread-safe when running without the GIL:
- Concurrent specializations are prevented by using a mutex. This prevents multiple threads writing to the same inline cache.
- In multi-threaded programs running without the GIL, each bytecode is only specialized once. This prevents a thread from reading a partially written inline cache.
- Locking also ensures that cached values of
tp_version_tag
andkeys_version
are consistent with the cached descriptors and other values. - Modifications to inline counters use “relaxed atomics”. In other words, some counter decrements may be missed or overwritten, but that does not affect correctness.
Py_mod_gil
Slot
In --disable-gil
builds, when loading an extension, CPython will
check for a new PEP 489-style Py_mod_gil
slot. If the slot is
set to Py_mod_gil_not_used
, then extension loading proceeds as
normal. If the slot is not set, the interpreter pauses all threads and
enables the GIL before continuing. Additionally, the interpreter will
issue a visible warning naming the extension, that the GIL was enabled
(and why) and the steps the user can take to override it.
PYTHONGIL
Environment Variable
In --disable-gil
builds, the user can also override the behavior at
runtime by setting the PYTHONGIL
environment variable. Setting
PYTHONGIL=0
, forces the GIL to be disabled, overriding the module
slot logic. Setting PYTHONGIL=1
, forces the GIL to be enabled.
The PYTHONGIL=0
override is important because extensions that are
not thread-safe can still be useful in multi-threaded applications. For
example, one may want to use the extension from only a single thread or
guard access by locks. For context, there are already some extensions
that are not thread-safe even with the GIL, and users already have to
take these sorts of steps.
The PYTHONGIL=1
override is sometimes useful for debugging.
Rationale
Non-Generational Garbage Collection
This PEP proposes switching from a generational cyclic garbage collector to a non-generational collector (when CPython is built without the GIL). That is equivalent to only having one generation (the “old” generation). There are two reasons for this proposed change.
Cyclic garbage collection, even for just the young generation, requires pausing other threads in the program. The author is concerned that frequent collections of the young generation would inhibit efficient scaling in multi-threaded programs. This is a concern for young generations (but not the old generation) because the young generations are collected after a fixed number of allocations, while the collections for the older generation are scheduled in proportion to the number of live objects in the heap. Additionally, it is difficult to efficiently keep track of objects in each generation without the GIL. For example, CPython currently uses a linked list of objects in each generation. If CPython were to keep that design, those lists would need to be made thread-safe, and it’s not clear how to do that efficiently.
Generational garbage collection is used to good effect in many other language runtimes. For example, many of the Java HotSpot garbage collector implementations use multiple generations [11]. In these runtimes, a young generation is frequently a throughput win: since a large percentage of the young generation is typically “dead,” the GC is able to reclaim a large amount memory relative to the amount of work performed. For example, several Java benchmarks show over 90% of “young” objects are typically collected [12] [13]. This is commonly referred to as the “weak generational hypothesis;” the observation is that most objects die young. This pattern is reversed in CPython due to the use of reference counting. Although most objects still die young, they are collected when their reference counts reach zero. Objects that survive to a garbage collection cycle are most likely to remain alive [14]. This difference means that generational collection is much less effective in CPython than in many other language runtimes [15].
Optimistic Avoiding Locking in dict
and list
Accesses
This proposal relies on a scheme that mostly avoids acquiring locks when accessing individual elements in lists and dictionaries. Note that this is not “lock free” in the sense of “lock-free” and “wait-free” algorithms that guarantee forward progress. It simply avoids acquiring locks (mutexes) in the common case to improve parallelism and reduce overhead.
A much simpler alternative would be to use reader-writer locks to
protect dictionary and list accesses. Reader-writer locks allow
concurrent reads, but not updates, which might seem ideal for list
and dictionaries. The problem is that reader-writer locks have
substantial overhead and poor scalability, particularly when the
critical sections are small, as they are for single-element
dictionary and list accesses [9]. The poor reader
scalability stems from the fact that readers must all update the same
data structure, such as the number of readers in
pthread_rwlocks
.
The technique described in this PEP is related to RCU (“read-copy-update”) [6] and, to a lesser extent, hazard pointers, two well-known schemes for optimizing concurrent, read-mostly data structures. RCU is widely used in the Linux kernel to protect shared data structures in a scalable manner. Both the technique in this PEP and RCU work by deferring reclamation while readers may be accessing the concurrent data structure. RCU is most commonly used to protect individual objects (like hash tables or linked lists), while this PEP proposes a scheme to protect larger blocks of memory (mimalloc “pages”) [10].
The need for this scheme is largely due to the use of reference counting in CPython. If CPython only relied on a tracing garbage collector, then this scheme would probably not be necessary because tracing garbage collectors already defer reclamation in the required manner. This would not “solve” scaling issues, but would shift many of the challenges to the garbage collector implementation.
Backwards Compatibility
This PEP poses a number of backwards compatibility issues when
building CPython with the --disable-gil
flag, but those issues do
not occur when using the default build configuration. Nearly all the
backwards compatibility concerns involve the C-API:
- CPython builds without the GIL will not be ABI compatible with the standard CPython build or with the stable ABI due to changes to the Python object header needed to support biased reference counting. C-API extensions will need to be rebuilt specifically for this version.
- C-API extensions that rely on the GIL to protect global state or object state in C code will need additional explicit locking to remain thread-safe when run without the GIL.
- C-API extensions that use borrowed references in ways that are not safe without the GIL will need to use the equivalent new APIs that return non-borrowed references. Note that only some uses of borrowed references are a concern; only references to objects that might be freed by other threads pose an issue.
- Custom memory allocators (
PyMem_SetAllocator
) are required to delegate the actual allocation to the previously set allocator. For example, the Python debug allocator and tracing allocators will continue to work because they delegate the allocation to the underlying allocator. On the other hand, wholesale replacing of the allocator (e.g., with jemalloc or tcmalloc) will not work correctly. - Python objects must be allocated through the standard APIs, such as
PyType_GenericNew
orPyObject_Malloc
. Non-Python objects must not be allocated through those APIs. For example, it is currently acceptable to allocate buffers(non-Python objects) throughPyObject_Malloc
; that will no longer be allowed and buffers should instead be allocated throughPyMem_Malloc
,PyMem_RawMalloc
, ormalloc
.
There are fewer potential backwards compatibility issues for Python code:
- Destructors and weak reference callbacks for code objects and top-level function objects are delayed until the next cyclic garbage collection due to the use of deferred reference counting.
- Destructors for some objects accessed by multiple threads may be
delayed slightly due to biased reference counting. This is rare:
most objects, even those accessed by multiple threads, are
destroyed immediately as soon as their reference counts are zero.
Two places in the Python standard library tests required
gc.collect()
calls to continue to pass.
Distribution
This PEP poses new challenges for distributing Python. At least for
some time, there will be two versions of Python requiring separately
compiled C-API extensions. It may take some time for C-API extension
authors to build --disable-gil
compatible packages and upload
them to PyPI. Additionally, some authors may be hesitant to support
the --disable-gil
mode until it has wide adoption, but adoption
will likely depend on the availability of Python’s rich set of
extensions.
To mitigate this, the author will work with Anaconda to distribute
a --disable-gil
version of Python together with compatible
packages from conda channels. This centralizes the challenges of
building extensions, and the author believes this will enable more
people to use Python without the GIL sooner than they would otherwise
be able to.
Performance
The changes to make CPython thread-safe without the GIL increase
execution overhead for --disable-gil
builds. The performance
impact is different for programs that use only a single thread compared
to programs that use multiple threads, so the table below reports
execution overhead separately for these types of programs separately.
Intel Skylake | AMD Zen 3 | |
---|---|---|
One thread | 6% | 5% |
Multiple threads | 8% | 7% |
The baseline used to measure overhead is 018be4c
from PR 19474,
which implements immortal objects for Python 3.12. The largest
contribution to execution overhead is biased reference counting
followed by per-object locking. For thread-safety reasons, an
application running with multiple threads will only specialize a given
bytecode once; this is why the overhead for programs that use multiple
threads is larger compared to programs that only use one thread.
However, with the GIL disabled, programs that use multiple threads
should also be able to more effectively use multiple CPU cores.
Note that this PEP would not affect the performance of the default
(non --disable-gil
) builds of CPython.
Build Bots
The stable build bots will also include --disable-gil
builds.
How to Teach This
As part of implementing the --disable-gil
mode, the author will
write a “HOWTO” guide [18] for making packages compatible when
running Python without the GIL.
Reference Implementation
There are two GitHub repositories implementing versions of CPython without the GIL:
The nogil-3.12
is based on Python 3.12.0a4. It is useful for
evaluating single-threaded execution overhead and as a reference
implementation for this PEP. It is less useful for evaluating C-API
extension compatibility because many extensions are not currently
compatible with Python 3.12. Due to limited time for the 3.12 port,
the nogil-3.12
implementation does not skip all deferred reference
counts. As a temporary work around, the implementation immortalizes
objects that use deferred reference counting in programs that spawn
multiple threads.
The nogil
repository is based on Python 3.9.10. It is useful for
evaluating multi-threading scaling in real world applications and
extension compatibility. It is more stable and well tested than the
nogil-3.12
repository.
Alternatives
Python currently supports a number of ways to enable parallelism, but the existing techniques come with significant limitations.
Multiprocessing
The multiprocessing library allows Python programs to start and communicate with Python subprocesses. This allows for parallelism because each subprocess has its own Python interpreter (i.e., there’s one GIL per process). Multiprocessing has a few substantial limitations. Communication between processes is limited: objects generally need to be serialized or copied to shared memory. This introduces overhead (due to serialization) and complicates building APIs on top of multiprocessing. Starting a subprocess is also more expensive than starting a thread, especially with the “spawn” implementation. Starting a thread takes ~100 µs, while spawning a subprocess takes ~50 ms (50,000 µs) due to Python re-initialization.
Finally, many C and C++ libraries support access from multiple threads but do not support access or use across multiple processes.
Releasing the GIL in C-API Extensions
C-API extensions can release the GIL around long running functions. This allows for some degree of parallelism, since multiple threads can run concurrently when the GIL is released, but the overhead of acquiring and releasing the GIL typically prevents this from scaling efficiently beyond a few threads. Many scientific computing libraries release the GIL in computational heavy functions, and the CPython standard library releases the GIL around blocking I/O.
Internal Parallelization
Functions implemented in C may use multiple threads internally. For example, Intel’s NumPy distribution, PyTorch, and TensorFlow all use this technique to internally parallelize individual operations. This works well when the basic operations are large enough to be parallelized efficiently, but not when there are many small operations or when the operations depend on some Python code. Calling into Python from C requires acquiring the GIL – even short snippets of Python code can inhibit scaling.
Rejected Ideas
Why Not Use a Concurrent Garbage Collector?
Many recent garbage collectors are mostly concurrent – they avoid long stop-the-world pauses by allowing the garbage collector to run concurrently with the application. So why not use a concurrent collector?
Concurrent collection requires write barriers (or read barriers). The author is not aware of a way to add write barriers to CPython without substantially breaking the C-API.
Why Not Deprecate PyDict_GetItem
in Favor of PyDict_FetchItem
?
This PEP proposes a new API PyDict_FetchItem
which behaves like
PyDict_GetItem
, but returns a new reference instead of a borrowed
reference. As described in Borrowed References, some uses of
borrowed references that were safe when running with the GIL are
unsafe when running without the GIL and need to be replaced by
functions like PyDict_FetchItem
that return new references.
This PEP does not propose deprecating PyDict_GetItem
and similar
functions that return borrowed references for a few reasons:
- Many of the uses of borrowed references are safe, even when running
without the GIL. For example, C API functions often use
PyDict_GetItem
to retrieve items from the keyword argument dictionary. These calls are safe because the keyword argument dictionary is only visible to a single thread. - I tried this approach early on and found that wholesale replacing of
PyDict_GetItem
withPyDict_FetchItem
frequently introduced new reference counting bugs. In my opinion, the risk of introducing new reference counting bugs generally outweighs the risks of missing aPyDict_GetItem
call that is unsafe without the GIL.
Why Not Use PEP 683 Immortalization?
Like PEP 683, this PEP proposes an immortalization scheme for Python objects, but the PEPs use different bit representations to mark immortal objects. The schemes cannot be identical because this PEP depends on biased reference counting, which has two reference count fields instead of one.
Open Issues
Improved Specialization
The Python 3.11 release introduced quickening and specialization as part of the faster CPython project, substantially improving performance. Specialization replaces slow bytecode instructions with faster variants [19]. To maintain thread-safety, applications that use multiple threads (and run without the GIL) will only specialize each bytecode once, which can lower performance on some programs. It is possible to support specializing multiple times, but that requires more investigation and is not part of this PEP.
Python Build Modes
This PEP introduces a new build mode (--disable-gil
) that is not
ABI compatible with the standard build mode. The additional build
mode adds complexity for both Python core developers and extension
developers. The author believes a worthwhile goal is to combine
these build modes and have the global interpreter lock controlled at
runtime, possibly disabled by default. The path to this goal remains
an open issue, but a possible path might look like the following:
- In 2024, CPython 3.13 is released with support for a
--disable-gil
build time flag. There are two ABIs for CPython, one with the GIL and one without. Extension authors target both ABIs. - After 2–3 releases, (i.e., in 2026–2027), CPython is released with the GIL controlled by a runtime environment variable or flag. The GIL is enabled by default. There is only a single ABI.
- After another 2–3 release (i.e., 2028–2030), CPython switches to the GIL being disabled by default. The GIL can still be enabled at runtime via an environment variable or command line flag.
This PEP covers the first step, with the remaining steps left as open issues. In this scenario, there would be a two to three year period where extension authors would target an extra CPython build per supported CPU architecture and OS.
Integration
The reference implementation changes approximately 15,000 lines of code
in CPython and includes mimalloc, which is also approximately 15,000
lines of code. Most changes are not performance sensitive and can be
included in both --disable-gil
and the default builds. Some
macros, like Py_BEGIN_CRITICAL_SECTION
will be no-ops in the
default build. Thee author does not expect a huge number of #ifdef
statements to support the --disable-gil
builds.
Mitigations for Single-Threaded Performance
The changes proposed in the PEP will increase execution overhead for
--disable-gil
builds compared to Python builds with the GIL. In
other words, it will have slower single-threaded performance. There
are some possible optimizations to reduce execution overhead,
especially for --disable-gil
builds that only use a single
thread. These may be worthwhile if a longer term goal is to have a
single build mode, but the choice of optimizations and their
trade-offs remain an open issue.
References
Acknowledgments
Thanks to Hugh Leather, Łukasz Langa, and Eric Snow for providing feedback on drafts of this PEP.
Copyright
This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.
Source: https://github.com/python/peps/blob/main/peps/pep-0703.rst
Last modified: 2024-10-17 12:49:39 GMT