PEP: 703 Title: Making the Global Interpreter Lock Optional in CPython
Author: Sam Gross <colesbury at gmail.com> Sponsor: Łukasz Langa <lukasz
at python.org> Discussions-To: https://discuss.python.org/t/22606
Status: Accepted Type: Standards Track Content-Type: text/x-rst Created:
09-Jan-2023 Python-Version: 3.13 Post-History: 09-Jan-2023, 04-May-2023
Resolution: 24-Oct-2023

Note

The Steering Council accepts PEP 703, but with clear proviso: that the
rollout be gradual and break as little as possible, and that we can roll
back any changes that turn out to be too disruptive – which includes
potentially rolling back all of PEP 703 entirely if necessary (however
unlikely or undesirable we expect that to be).

Abstract

CPython's global interpreter lock ("GIL") prevents multiple threads from
executing Python code at the same time. The GIL is an obstacle to using
multi-core CPUs from Python efficiently. This PEP proposes adding a
build configuration (--disable-gil) to CPython to let it run Python code
without the global interpreter lock and with the necessary changes
needed to make the interpreter thread-safe.

Motivation

The GIL is a major obstacle to concurrency. For scientific computing
tasks, this lack of concurrency is often a bigger issue than speed of
executing Python code, since most of the processor cycles are spent in
optimized CPU or GPU kernels. The GIL introduces a global bottleneck
that can prevent other threads from making progress if they call any
Python code. There are existing ways to enable parallelism in CPython
today, but those techniques come with significant limitations (see
Alternatives).

This section focuses on the GIL's impact on scientific computing,
particular AI/ML workloads because that is the area with which this
author has the most experience, but the GIL also affects other users of
Python.

The GIL Makes Many Types of Parallelism Difficult to Express

Neural network-based AI models expose multiple opportunities for
parallelism. For example, individual operations may be parallelized
internally ("intra-operator"), multiple operations may be executed
simultaneously ("inter-operator"), and requests (spanning multiple
operations) may also be parallelized. Efficient execution requires
exploiting multiple types of parallelism[1].

The GIL makes it difficult to express inter-operator parallelism, as
well as some forms of request parallelism, efficiently in Python. In
other programming languages, a system might use threads to run different
parts of a neural network on separate CPU cores, but this is inefficient
in Python due to the GIL. Similarly, latency-sensitive inference
workloads frequently use threads to parallelize across requests, but
face the same scaling bottlenecks in Python.

The challenges the GIL poses to exploiting parallelism in Python
frequently come up in reinforcement learning. Heinrich Kuttler, author
of the NetHack Learning Environment and Member of Technical Staff at
Inflection AI, writes:

  Recent breakthroughs in reinforcement learning, such as on Dota 2,
  StarCraft, and NetHack rely on running multiple environments
  (simulated games) in parallel using asynchronous actor-critic methods.
  Straightforward multithreaded implementations in Python don't scale
  beyond more than a few parallel environments due to GIL contention.
  Multiprocessing, with communication via shared memory or UNIX sockets,
  adds much complexity and in effect rules out interacting with CUDA
  from different workers, severely restricting the design space.

Manuel Kroiss, software engineer at DeepMind on the reinforcement
learning team, describes how the bottlenecks posed by the GIL lead to
rewriting Python codebases in C++, making the code less accessible:

  We frequently battle issues with the Python GIL at DeepMind. In many
  of our applications, we would like to run on the order of 50-100
  threads per process. However, we often see that even with fewer than
  10 threads the GIL becomes the bottleneck. To work around this
  problem, we sometimes use subprocesses, but in many cases the
  inter-process communication becomes too big of an overhead. To deal
  with the GIL, we usually end up translating large parts of our Python
  codebase into C++. This is undesirable because it makes the code less
  accessible to researchers.

Projects that involve interfacing with multiple hardware devices face
similar challenges: efficient communication requires use of multiple CPU
cores. The Dose-3D project aims to improve cancer radiotherapy with
precise dose planning. It uses medical phantoms (stand-ins for human
tissue) together with custom hardware and a server application written
in Python. Paweł Jurgielewicz, lead software architect for the data
acquisition system on the Dose-3D project, describes the scaling
challenges posed by the GIL and how using a fork of Python without the
GIL simplified the project:

  In the Dose-3D project, the key challenge was to maintain a stable,
  non-trivial concurrent communication link with hardware units while
  utilizing a 1 Gbit/s UDP/IP connection to the maximum. Naturally, we
  started with the multiprocessing package, but at some point, it became
  clear that most CPU time was consumed by the data transfers between
  the data processing stages, not by data processing itself. The CPython
  multithreading implementation based on GIL was a dead end too. When we
  found out about the "nogil" fork of Python it took a single person
  less than half a working day to adjust the codebase to use this fork
  and the results were astonishing. Now we can focus on data acquisition
  system development rather than fine-tuning data exchange algorithms.

Allen Goodman, author of CellProfiler and staff engineer at Prescient
Design and Genentech, describes how the GIL makes biological methods
research more difficult in Python:

  Issues with Python's global interpreter lock are a frequent source of
  frustration throughout biological methods research.

  I wanted to better understand the current multithreading situation so
  I reimplemented parts of HMMER, a standard method for
  multiple-sequence alignment. I chose this method because it stresses
  both single-thread performance (scoring) and multi-threaded
  performance (searching a database of sequences). The GIL became the
  bottleneck when using only eight threads. This is a method where the
  current popular implementations rely on 64 or even 128 threads per
  process. I tried moving to subprocesses but was blocked by the
  prohibitive IPC costs. HMMER is a relatively elementary bioinformatics
  method and newer methods have far bigger multi-threading demands.

  Method researchers are begging to use Python (myself included),
  because of its ease of use, the Python ecosystem, and because "it's
  what people know." Many biologists only know a little bit of
  programming (and that's almost always Python). Until Python's
  multithreading situation is addressed, C and C++ will remain the
  lingua franca of the biological methods research community.

The GIL Affects Python Library Usability

The GIL is a CPython implementation detail that limits multithreaded
parallelism, so it might seem unintuitive to think of it as a usability
issue. However, library authors frequently care a great deal about
performance and will design APIs that support working around the GIL.
These workaround frequently lead to APIs that are more difficult to use.
Consequently, users of these APIs may experience the GIL as a usability
issue and not just a performance issue.

For example, PyTorch exposes a multiprocessing-based API called
DataLoader for building data input pipelines. It uses fork() on Linux
because it is generally faster and uses less memory than spawn(), but
this leads to additional challenges for users: creating a DataLoader
after accessing a GPU can lead to confusing CUDA errors. Accessing GPUs
within a DataLoader worker quickly leads to out-of-memory errors because
processes do not share CUDA contexts (unlike threads within a process).

Olivier Grisel, scikit-learn developer and software engineer at Inria,
describes how having to work around the GIL in scikit-learn related
libraries leads to a more complex and confusing user experience:

  Over the years, scikit-learn developers have maintained ancillary
  libraries such as joblib and loky to try to work around some of the
  limitations of multiprocessing: extra memory usage partially mitigated
  via semi-automated memory mapping of large data buffers, slow worker
  startup by transparently reusing a pool of long running workers,
  fork-safety problems of third-party native runtime libraries such as
  GNU OpenMP by never using the fork-only start-method, ability to
  perform parallel calls of interactively defined functions in notebooks
  and REPLs in cross-platform manner via cloudpickle. Despite our
  efforts, this multiprocessing-based solution is still brittle, complex
  to maintain and confusing to datascientists with limited understanding
  of system-level constraints. Furthermore, there are still irreducible
  limitations such as the overhead caused by the pickle-based
  serialization/deserialization steps required for inter-process
  communication. A lot of this extra work and complexity would not be
  needed anymore if we could use threads without contention on multicore
  hosts (sometimes with 64 physical cores or more) to run data science
  pipelines that alternate between Python-level operations and calls to
  native libraries.

Ralf Gommers, co-director of Quansight Labs and NumPy and SciPy
maintainer, describes how the GIL affects the user experience of NumPy
and numeric Python libraries:

  A key problem in NumPy and the stack of packages built around it is
  that NumPy is still (mostly) single-threaded --- and that has shaped
  significant parts of the user experience and projects built around it.
  NumPy does release the GIL in its inner loops (which do the heavy
  lifting), but that is not nearly enough. NumPy doesn't offer a
  solution to utilize all CPU cores of a single machine well, and
  instead leaves that to Dask and other multiprocessing solutions. Those
  aren't very efficient and are also more clumsy to use. That clumsiness
  comes mainly in the extra abstractions and layers the users need to
  concern themselves with when using, e.g., dask.array which wraps
  numpy.ndarray. It also shows up in oversubscription issues that the
  user must explicitly be aware of and manage via either environment
  variables or a third package, threadpoolctl. The main reason is that
  NumPy calls into BLAS for linear algebra - and those calls it has no
  control over, they do use all cores by default via either pthreads or
  OpenMP.

  Coordinating on APIs and design decisions to control parallelism is
  still a major amount of work, and one of the harder challenges across
  the PyData ecosystem. It would have looked a lot different (better,
  easier) without a GIL.

GPU-Heavy Workloads Require Multi-Core Processing

Many high-performance computing (HPC) and AI workloads make heavy use of
GPUs. These applications frequently require efficient multi-core CPU
execution even though the bulk of the computation runs on a GPU.

Zachary DeVito, PyTorch core developer and researcher at FAIR (Meta AI),
describes how the GIL makes multithreaded scaling inefficient even when
the bulk of computation is performed outside of Python:

  In PyTorch, Python is commonly used to orchestrate ~8 GPUs and ~64 CPU
  threads, growing to 4k GPUs and 32k CPU threads for big models. While
  the heavy lifting is done outside of Python, the speed of GPUs makes
  even just the orchestration in Python not scalable. We often end up
  with 72 processes in place of one because of the GIL. Logging,
  debugging, and performance tuning are orders-of-magnitude more
  difficult in this regime, continuously causing lower developer
  productivity.

The use of many processes (instead of threads) makes common tasks more
difficult. Zachary DeVito continues:

  On three separate occasions in the past couple of months (reducing
  redundant compute in data loaders, writing model checkpoints
  asynchronously, and parallelizing compiler optimizations), I spent an
  order-of-magnitude more time figuring out how to work around GIL
  limitations than actually solving the particular problem.

Even GPU-heavy workloads frequently have a CPU-intensive component. For
example, computer vision tasks typically require multiple
"pre-processing" steps in the data input pipeline, like image decoding,
cropping, and resizing. These tasks are commonly performed on the CPU
and may use Python libraries like Pillow or Pillow-SIMD. It is necessary
to run the data input pipeline on multiple CPU cores in order to keep
the GPU "fed" with data.

The increase in GPU performance compared to individual CPU cores makes
multi-core performance more important. It is progressively more
difficult to keep the GPUs fully occupied. To do so requires efficient
use of multiple CPU cores, especially on multi-GPU systems. For example,
NVIDIA's DGX-A100 has 8 GPUs and two 64-core CPUs in order to keep the
GPUs "fed" with data.

The GIL Makes Deploying Python AI Models Difficult

Python is widely used to develop neural network-based AI models. In
PyTorch, models are frequently deployed as part of multi-threaded,
mostly C++, environments. Python is often viewed skeptically because the
GIL can be a global bottleneck, preventing efficient scaling even though
the vast majority of the computations occur "outside" of Python with the
GIL released. The torchdeploy paper[2] shows experimental evidence for
these scaling bottlenecks in multiple model architectures.

PyTorch provides a number of mechanisms for deploying Python AI models
that avoid or work around the GIL, but they all come with substantial
limitations. For example, TorchScript captures a representation of the
model that can be executed from C++ without any Python dependencies, but
it only supports a limited subset of Python and often requires rewriting
some of the model's code. The torch::deploy API allows multiple Python
interpreters, each with its own GIL, in the same process(similar to PEP
684). However, torch::deploy has limited support for Python modules that
use C-API extensions.

Motivation Summary

Python's global interpreter lock makes it difficult to use modern
multi-core CPUs efficiently for many scientific and numeric computing
applications. Heinrich Kuttler, Manuel Kroiss, and Paweł Jurgielewicz
found that multi-threaded implementations in Python did not scale well
for their tasks and that using multiple processes was not a suitable
alternative.

The scaling bottlenecks are not solely in core numeric tasks. Both
Zachary DeVito and Paweł Jurgielewicz described challenges with
coordination and communication in Python.

Olivier Grisel, Ralf Gommers, and Zachary DeVito described how current
workarounds for the GIL are "complex to maintain" and cause "lower
developer productivity." The GIL makes it more difficult to develop and
maintain scientific and numeric computing libraries as well leading to
library designs that are more difficult to use.

Specification

Build Configuration Changes

The global interpreter lock will remain the default for CPython builds
and python.org downloads. A new build configuration flag, --disable-gil
will be added to the configure script that will build CPython with
support for running without the global interpreter lock.

When built with --disable-gil, CPython will define the Py_GIL_DISABLED
macro in Python/patchlevel.h. The ABI tag will include the letter "t"
(for "threading").

The --disable-gil builds of CPython will still support optionally
running with the GIL enabled at runtime (see PYTHONGIL Environment
Variable and Py_mod_gil Slot).

Overview of CPython Changes

Removing the global interpreter lock requires substantial changes to
CPython internals, but relatively few changes to the public Python and C
APIs. This section describes the required changes to the CPython
implementation followed by the proposed API changes.

The implementation changes can be grouped into the following four
categories:

-   Reference counting
-   Memory management
-   Container thread-safety
-   Locking and atomic APIs

Reference Counting

Removing the GIL requires changes to CPython's reference counting
implementation to make it thread-safe. Furthermore, it needs to have low
execution overhead and allow for efficient scaling with multiple
threads. This PEP proposes a combination of three techniques to address
these constraints. The first is a switch from plain non-atomic reference
counting to biased reference counting, which is a thread-safe reference
counting technique with lower execution overhead than plain atomic
reference counting. The other two techniques are immortalization and a
limited form of deferred reference counting; they address some of the
multi-threaded scalability issues with reference counting by avoiding
some reference count modifications.

Biased reference counting (BRC) is a technique first described in 2018
by Jiho Choi, Thomas Shull, and Josep Torrellas[3]. It is based on the
observation that most objects are only accessed by a single thread, even
in multi-threaded programs. Each object is associated with an owning
thread (the thread that created it). Reference counting operations from
the owning thread use non-atomic instructions to modify a "local"
reference count. Other threads use atomic instructions to modify a
"shared" reference count. This design avoids many atomic
read-modify-write operations that are expensive on contemporary
processors.

The implementation of BRC proposed in this PEP largely matches the
original description of biased reference counting, but differs in
details like the size of reference counting fields and special bits in
those fields. BRC requires storing three pieces of information in each
object's header: the "local" reference count, the "shared" reference
count, and the identifier of the owning thread. The BRC paper packs
these three things into a single 64-bit field. This PEP proposes using
three separate fields in each object's header to avoid potential issues
due to reference count overflow. Additionally, the PEP supports a faster
deallocation path that avoids an atomic operation in the common case.

The proposed PyObject struct (also called struct _object) is below:

    struct _object {
      _PyObject_HEAD_EXTRA
      uintptr_t ob_tid;         // owning thread id (4-8 bytes)
      uint16_t __padding;       // reserved for future use (2 bytes)
      PyMutex ob_mutex;         // per-object mutex (1 byte)
      uint8_t ob_gc_bits;       // GC fields (1 byte)
      uint32_t ob_ref_local;    // local reference count (4 bytes)
      Py_ssize_t ob_ref_shared; // shared reference count and state bits (4-8 bytes)
      PyTypeObject *ob_type;
    };

The ob_tid, ob_ref_local, and ob_ref_shared are used by the biased
reference counting implementation. The ob_gc_bits field is used store
garbage collection flags that were previously stored in PyGC_Head (see
Garbage Collection (Cycle Collection)). The ob_mutex field provides a
per-object lock in a single byte.

Immortalization

Some objects, such as interned strings, small integers, statically
allocated PyTypeObjects, and the True, False, and None objects stay
alive for the lifetime of the program. These objects are marked as
immortal by setting the local reference count field (ob_ref_local) to
UINT32_MAX.

The Py_INCREF and Py_DECREF macros are no-ops for immortal objects. This
avoids contention on the reference count fields of these objects when
multiple threads access them concurrently.

This proposed immortalization scheme is very similar to PEP 683, adopted
in Python 3.12, but with slightly different bit representation in the
reference count fields for immortal objects in order to work with biased
reference counting and deferred reference counting. See also Why Not Use
PEP 683 Immortalization?.

Biased Reference Counting

Biased reference counting has a fast-path for objects "owned" by the
current thread and a slow-path for other objects. Ownership is indicated
by the ob_tid field. Determining the thread id requires platform
specific code[4]. A value of 0 in ob_tid indicates that the object is
not owned by any thread.

The ob_ref_local field stores the local reference count and two flags.
The two most significant bits are used to indicate the object is
immortal or uses deferred reference counting (see Deferred reference
counting).

The ob_ref_shared field stores the shared reference count. The two least
significant bits are used to store the reference counting state. The
shared reference count is therefore shifted left by two. The
ob_ref_shared field uses the least significant bits because the shared
reference count can be temporarily negative; increfs and decrefs may not
be balanced between threads.

The possible reference counting states are listed below:

-   0b00 - default
-   0b01 - weakrefs
-   0b10 - queued
-   0b11 - merged

The states form a progression: during their lifecycle, objects may
transition to any numerically higher state. Objects can only be
deallocated from the "default" and "merged" states. Other states must
transition to the "merged" state before deallocation. Transitioning
states requires an atomic compare-and-swap on the ob_ref_shared field.

Default (0b00)

Objects are initially created in the default state. This is the only
state that allows for the quick deallocation code path. Otherwise, the
thread must merge the local and shared reference count fields, which
requires an atomic compare-and-swap.

This quick deallocation code path would not be thread-safe with
concurrent dereferencing of weakrefs, so the first time a weak reference
is created, the object is transitioned to the "weakrefs" state if it is
currently in the "default" state.

Similarly, the quick deallocation code path would not be thread-safe
with the lockless list and dictionary accesses (see Optimistically
Avoiding Locking), so the first time a non-owning thread thread attempts
to retrieve an object in the "default" state it falls back to the slower
locking code path and transitions the object to the "weakrefs" state.

Weakrefs (0b01)

Objects in weakref and higher states support dereferencing weakrefs as
well as the lockless list and dictionary access by non-owning threads.
They require transitioning to the merged state before deallocation,
which is more expensive than the quick deallocation code path supported
by the "default" state.

Queued (0b10)

The queued state indicates that the a non-owning thread has requested
that the reference count fields be merged. This can happen when the
shared reference count becomes negative (due to an imbalance between
increfs and decrefs between threads). The object is inserted into the
owning thread's queue of objects to be merged. The owning thread is
notified via the eval_breaker mechanism. In practice, this operation is
rare. Most objects are only accessed by a single thread and those
objects accessed by multiple threads rarely have negative shared
reference counts.

If the owning thread has terminated, the acting thread immediately
merges the local and shared reference count fields and transitions to
the merged state.

Merged (0b11)

The merged state indicates that the object is not owned by any thread.
The ob_tid field is zero in this state and ob_ref_local is not used.
Once the shared reference count reaches zero, the object can be
deallocated from the merged state.

Reference counting pseudo-code

The proposed Py_INCREF and Py_DECREF operation should behave as follows
(using C-like pseudo-code):

    // low two bits of "ob_ref_shared" are used for flags
    #define _Py_SHARED_SHIFT 2

    void Py_INCREF(PyObject *op)
    {
      uint32_t new_local = op->ob_ref_local + 1;
      if (new_local == 0)
        return;  // object is immortal
      if (op->ob_tid == _Py_ThreadId())
        op->ob_ref_local = new_local;
      else
        atomic_add(&op->ob_ref_shared, 1 << _Py_SHARED_SHIFT);
    }

    void Py_DECREF(PyObject *op)
    {
      if (op->ob_ref_local == _Py_IMMORTAL_REFCNT) {
        return;  // object is immortal
      }
      if (op->ob_tid == _Py_ThreadId()) {
        op->ob_ref_local -= 1;
        if (op->ob_ref_local == 0) {
          _Py_MergeZeroRefcount(); // merge refcount
        }
      }
      else {
        _Py_DecRefShared(); // slow path
      }
    }

    void _Py_MergeZeroRefcount(PyObject *op)
    {
      if (op->ob_ref_shared == 0) {
        // quick deallocation code path (common case)
        op->ob_tid = 0;
        _Py_Dealloc(op);
      }
      else {
        // slower merging path not shown
      }
    }

The reference implementation[5] contains implementations of
_Py_MergeZeroRefcount and _Py_DecRefShared.

Note that the above is pseudocode: in practice, the implementation
should use "relaxed atomics" to access ob_tid and ob_ref_local to avoid
undefined behavior in C and C++.

Deferred Reference Counting

A few types of objects, such as top-level functions, code objects,
modules, and methods, tend to be frequently accessed by many threads
concurrently. These objects don't necessarily live for the lifetime of
the program, so immortalization is not a good fit. This PEP proposes a
limited form of deferred reference counting to avoid contention on these
objects' reference count fields in multi-threaded programs.

Typically, the interpreter modifies objects' reference counts as they
are pushed to and popped from the interpreter's stack. The interpreter
skips these reference counting operations for objects that use deferred
reference counting. Objects that support deferred reference counting are
marked by setting the two most significant bits in the local reference
count field to one.

Because some reference counting operations are skipped, the reference
count fields no longer reflect the true number of references to these
objects. The true reference count is the sum of the reference count
fields plus any skipped references from each thread's interpreter stack.
The true reference count can only be safely computed when all threads
are paused during cyclic garbage collection. Consequently, objects that
use deferred reference counting can only be deallocated during garbage
collection cycles.

Note that the objects that use deferred reference counting already
naturally form reference cycles in CPython, so they would typically be
deallocated by the garbage collector even without deferred reference
counting. For example, top-level functions and modules form a reference
cycle as do methods and type objects.

Garbage Collector Modifications for Deferred Reference Counting

The tracing garbage collector finds and deallocates unreferenced
objects. Currently, the tracing garbage collector only finds
unreferenced objects that are part of a reference cycle. With deferred
reference counting, the tracing garbage collector will also find and
collect some unreferenced objects that may not be part of any reference
cycle, but whose collection has been delayed due to deferred reference
counting. This requires that all objects that support deferred reference
counting also have a corresponding type object that supports tracing
garbage collection (through the Py_TPFLAGS_HAVE_GC flag). Additionally,
the garbage collector will need to traverse each thread's stack to add
references to the GC reference count at the start of each collection.

Reference Counting Type Objects

Type objects (PyTypeObject) use a mix of reference counting techniques.
Statically allocated type objects are immortalized because the objects
already live for the lifetime of the program. Heap type objects use
deferred reference counting in combination with per-thread reference
counting. Deferred reference counting is not sufficient to address the
multi-threaded scaling bottlenecks with heap types because most
references to heap types are from object instances, not references on
the interpreter stack.

To address this, heap type reference counts are partially stored in a
distributed manner in per-thread arrays. Every thread stores an array of
local reference counts for each heap type object. Heap type objects are
assigned a unique number that determines its position in the local
reference count arrays. A heap type's true reference count is the sum of
its entries in the per-thread arrays, plus the reference count on the
PyTypeObject, plus any deferred references in the interpreter stack.

Threads may grow their own type reference count arrays as needed when
incrementing or decrementing the local reference count of a type object.

Use of the per-thread reference count arrays is limited to a few places:

-   PyType_GenericAlloc(PyTypeObject *type, Py_ssize_t nitems):
    Increments the current thread's local reference count for type, if
    it is a heap type.
-   subtype_dealloc(PyObject *self): Decrements the current thread's
    local reference count for self->ob_type, if the type is a heap type.
-   gcmodule.c: Adds each thread's local reference counts to the gc_refs
    count for the corresponding heap type object.

Additionally, when a thread terminates, it adds any non-zero local
reference counts to each type object's own reference count field.

Memory Management

CPython currently uses an internal allocator, pymalloc, which is
optimized for small object allocation. The pymalloc implementation is
not thread-safe without the GIL. This PEP proposes replacing pymalloc
with mimalloc, a general-purpose thread-safe allocator with good
performance, including for small allocations.

Using mimalloc, with some modifications, also addresses two other issues
related to removing the GIL. First, traversing the internal mimalloc
structures allows the garbage collector to find all Python objects
without maintaining a linked list. This is described in more detail in
the garbage collection section. Second, mimalloc heaps and allocations
based on size class enable collections like dict to generally avoid
acquiring locks during read-only operations. This is described in more
detail in the collection thread-safety section.

CPython already requires that objects that support garbage collection
use the GC allocator APIs (typically indirectly by calling
PyType_GenericAlloc). This PEP would add additional requirements to the
use of the Python allocator APIs. First, Python objects must be
allocated through object allocation APIs, such as PyType_GenericAlloc,
PyObject_Malloc, or other Python APIs that wrap those calls. Python
objects should not be allocated through other APIs, such as raw calls to
C's malloc or the C++ new operator. Additionally, PyObject_Malloc should
be used only for allocating Python objects; it should not be used for
allocating buffers, storages, or other data structures that are not
PyObjects.

This PEP also imposes restrictions on the pluggable allocator API
(PyMem_SetAllocator). When compiling without the GIL, allocators set
using this API must eventually delegate the allocation to the
corresponding underlying allocator, such as PyObject_Malloc, for Python
object allocations. This allows for allocators that "wrap" underlying
allocators, such as Python's tracemalloc and debug allocator, but not
for wholly replacing the allocator.

CPython Free Lists

CPython makes use of free lists to speed up the allocation of small,
frequently allocated objects like tuples and numbers. These free lists
are moved to PyThreadState from per-interpreter state.

Garbage Collection (Cycle Collection)

The CPython garbage collector requires the following changes to work
with this proposal:

-   Use of "stop-the-world" to provide thread-safety guarantees that
    were previously provided by the GIL.
-   Elimination of generational garbage collection in favor of
    non-generational collector.
-   Integration with deferred reference counting and biased reference
    counting.

Additionally, the above changes enable removing the _gc_prev and
_gc_next fields from GC objects. The GC bits that stored the tracked,
finalized, and unreachable states are moved to the ob_gc_bits field in
the PyObject header.

Stop-the-World

The CPython cycle garbage collector currently relies on the global
interpreter lock to prevent other threads from accessing Python objects
while the collector finds cycles. The GIL is never released during the
cycle-finding routine, so the collector can rely on stable (i.e.,
unchanging) reference counts and references for the duration of that
routine. However, following cycle detection, the GIL may be temporarily
released while calling objects' finalizers and clear (tp_clear)
functions, allowing other threads to run in an interleaved fashion.

When running without the GIL, the implementation needs a way to ensure
that reference counts remain stable during cycle detection. Threads
running Python code must be paused to ensure that references and
reference counts remain stable. Once the cycles are identified, other
threads are resumed.

The current CPython cyclic garbage collector involves two
cycle-detection passes during each garbage collection cycle.
Consequently, this requires two stop-the-world pauses when running the
garbage collector without the GIL. The first cycle-detection pass
identifies cyclic trash. The second pass runs after finalizers to
identify which objects still remain unreachable. Note that other threads
are resumed before finalizers and tp_clear functions are called to avoid
introducing potential deadlocks that are not present in the current
CPython behavior.

Thread States

To support pausing threads for garbage collection, the PyThreadState
gets a new "status" field. Like the other fields in PyThreadState, the
status field is not part of the public CPython API. The status field may
be in one of three states:

-   ATTACHED
-   DETACHED
-   GC

The ATTACHED and DETACHED states correspond closely to acquiring and
releasing the global interpreter lock. When compiling without the GIL,
functions that previously acquired the GIL instead transition the thread
state to ATTACHED, and functions that previously released the GIL
transition the thread state to DETACHED. Just as threads previously
needed to acquire the GIL before accessing or modifying Python objects,
they now must be in the ATTACHED state before accessing or modifying
Python objects. Since the same public C-API functions "attach" the
thread as previously acquired the GIL (e.g., PyEval_RestoreThread), the
requirements for thread initialization in extensions remain the same.
The substantial difference is that multiple threads can be in the
attached state simultaneously, while previously only one thread could
acquire the GIL at a time.

During stop-the-world pauses, the thread performing garbage collection
needs to ensure that no other thread is accessing or modifying Python
objects. All other threads must be in the "GC" state. The garbage
collection thread can transition other threads from the DETACHED state
to the GC state using an atomic compare-and-swap operation on the status
field. Threads in the ATTACHED state are requested to pause themselves
and set their status to "GC", using the existing "eval breaker"
mechanism. At the end of the stop-the-world pause, all threads in the
"GC" state are set to DETACHED and woken up if they are paused. Threads
that were previously attached (i.e., executing Python bytecode) can
re-attach (set their thread states to ATTACHED) and resume executing
Python code. Threads that were previously DETACHED ignore the
notification.

Generations

The existing Python garbage collector uses three generations. When
compiling without the GIL, the garbage collector will only use a single
generation (i.e., it will be non-generational). The primary reason for
this change is to reduce the impact of the stop-the-world pauses in
multithreaded applications. Frequent stop-the-world pauses for
collecting the young generation would have more of an impact on
multi-threaded applications than less frequent collections.

Integration With Deferred and Biased Reference Counting

To find unreferenced objects, the cyclic garbage collector computes the
difference between the number of incoming references and the object's
reference count. This difference is called gc_refs and is stored in the
_gc_prev field. If gc_refs is greater than zero, then the object is
guaranteed to be alive (i.e., not cyclic trash). If gc_refs is zero,
then the object is only alive if it is transitively referenced by
another live object. When computing this difference, the collector
should traverse each thread's stack, and for every deferred reference,
increment the gc_refs for the referred object. Since generator objects
also have stacks with deferred references, the same procedure is applied
to each generator's stack.

Python unit tests commonly use gc.collect() to ensure that any
unreferenced objects are destructed and their finalizers run. Since
biased reference counting can delay the destruction of some objects that
are referenced by multiple threads, it's convenient to ensure that those
objects are destructed during garbage collection, even though they may
not be part of any reference cycles. While other threads are paused, the
garbage collector thread should merge the reference counts for any
queued objects, but not call any destructors even if the combined
reference count is zero. (Calling destructors while other threads are
paused risks introducing deadlocks.) Once other threads are resumed, the
GC thread should call _Py_Dealloc on those objects with a zero merged
reference count.

Container Thread-Safety

In CPython, the global interpreter lock protects against corruption of
internal interpreter states when multiple threads concurrently access or
modify Python objects. For example, if multiple threads concurrently
modify the same list, the GIL ensures that the length of the list
(ob_size) accurately matches the number of elements, and that the
reference counts of each element accurately reflect the number of
references to those elements. Without the GIL --- and absent other
changes --- concurrent modifications would corrupt those fields and
likely lead to program crashes.

The GIL does not necessarily ensure that operations are atomic or remain
correct when multiple operations occur concurrently. For example,
list.extend(iterable) may not appear atomic if the iterable has an
iterator implemented in Python (or releases the GIL internally).
Similarly, list.remove(x) can remove the wrong object if it overlaps
with another operation that modifies the list, depending on the
implementation of the equality operator. Still, the GIL ensures that
some operations are effectively atomic. For example, the constructor
list(set) atomically copies the items of the set to a new list, and some
code relies on that copy being atomic (i.e., having a snapshot of the
items in the set). This PEP preserves that property.

This PEP proposes using per-object locks to provide many of the same
protections that the GIL provides. For example, every list, dictionary,
and set will have an associated lightweight lock. All operations that
modify the object must hold the object's lock. Most operations that read
from the object should acquire the object's lock as well; the few read
operations that can proceed without holding a lock are described below.

Per-object locks with critical sections provide weaker protections than
the GIL. Because the GIL doesn't necessarily ensure that concurrent
operations are atomic or correct, the per-object locking scheme also
cannot ensure that concurrent operations are atomic or correct. Instead,
per-object locking aims for similar protections as the GIL, but with
mutual exclusion limited to individual objects.

Most operations on an instance of a container type require locking that
object. For example:

-   list.append, list.insert, list.repeat, PyList_SetItem
-   dict.__setitem__, PyDict_SetItem
-   list.clear, dict.clear
-   list.__repr__, dict.__repr__, etc.
-   list.extend(iterable)
-   setiter_iternext

Some operations operate directly on two container objects, with
knowledge about both containers' internal structure. For example, there
are internal specializations of list.extend(iterable) for specific
iterable types, like set. These operations need to lock both container
objects because they access the internals of both objects
simultaneously. Note that the generic implementation of list.extend only
needs to lock one object (the list) because the other object is accessed
indirectly through the thread-safe iterator API. Operations that lock
two containers are:

-   list.extend(list), list.extend(set), list.extend (dictitems), and
    other specializations where the implementation is specialized for
    argument type.
-   list.concat(list)
-   list.__eq__(list), dict.__eq__(dict)

Some simple operations can be implemented directly with atomic accesses
and do not need locks because they only access a single field. These
operations include:

-   len(list) i.e., list_length(PyListObject *a)
-   len(dict)
-   len(set)

A select few operations optimistically avoid locking to improve
performance. These require special implementations and cooperation from
the memory allocator:

-   list[idx] (list_subscript)
-   dict[key] (dict_subscript)
-   listiter_next, dictiter_iternextkey/value/item
-   list.contains

Borrowed References

Per-object locking provides many of the important protections that the
GIL provides, but there are a few cases where it's not sufficient. For
example, code that relies on upgrading a borrowed reference to an
"owned" reference may be unsafe in certain circumstances:

    PyObject *item = PyList_GetItem(list, idx);
    Py_INCREF(item);

The GIL ensures that no other thread can modify the list in between the
access and the Py_INCREF call. Without the GIL -- even with per-object
locking -- another thread might modify the list leading to item being
freed between the access and the Py_INCREF call.

The problematic borrowed reference APIs are supplemented with functions
that return "new references" but are otherwise equivalent:

-   PyList_FetchItem(list, idx) for PyList_GetItem
-   PyDict_FetchItem(dict, key) for PyDict_GetItem
-   PyWeakref_FetchObject for PyWeakref_GetObject

Note that some APIs that return borrowed references, such as
PyTuple_GetItem, are not problematic because tuples are immutable.
Similarly, not all uses of the above APIs are problematic. For example,
PyDict_GetItem is often used for parsing keyword argument dictionaries
in function calls; those keyword argument dictionaries are effectively
private (not accessible by other threads).

Python Critical Sections

Straightforward per-object locking could introduce deadlocks that were
not present when running with the GIL. Threads may hold locks for
multiple objects simultaneously because Python operations can nest.
Operations on objects can invoke operations on other objects, acquiring
multiple per-object locks. If threads try to acquire the same locks in
different orders, they will deadlock.

This PEP proposes a scheme called "Python critical sections" to
implicitly release per-object locks to avoid deadlocks. To understand
the scheme, we first introduce a general approach to avoid deadlocks,
and then propose a refinement of that approach with better performance.

One way to avoid deadlocks is to allow threads to hold only the lock (or
locks) for a single operation at a time (typically a single lock, but
some operations involve two locks as described above). When a thread
begins a nested operation it should suspend the locks for any outer
operation: before beginning the nested operation, the locks for the
outer operation are released and when the nested operation completes,
the locks for the outer operation are reacquired.

Additionally, the locks for any active operation should be suspended
around potentially blocking operations, such as I/O (i.e., operations
that would have released the GIL). This is because the interaction
between locks and blocking operations can lead to deadlocks in the same
way as the interaction between multiple locks.

To improve performance, this PEP proposes a variation of the above
scheme that still avoids deadlocks. Instead of immediately suspending
locks any time a nested operation begins, locks are only suspended if
the thread would block (i.e., would have released the GIL). This reduces
the number of lock acquisitions and releases for nested operations,
while avoiding deadlocks.

The proposed API for Python critical sections are the following four
macros. These are intended to be public (usable by C-API extensions),
but not part of the limited API:

-   Py_BEGIN_CRITICAL_SECTION(PyObject *op);: Begins a critical section
    by acquiring the mutex for the referenced object. If the object is
    already locked, then locks for any outstanding critical sections are
    released before this thread waits for referenced object to be
    unlocked.
-   Py_END_CRITICAL_SECTION;: Ends the most recent operation, unlocking
    the mutex. The next most recent previous critical section (if any)
    is resumed if it is currently suspended.
-   Py_BEGIN_CRITICAL_SECTION2(PyObject *a, PyObject *b);: Begins a
    critical section by acquiring the mutexes for two objects. To ensure
    consistent lock ordering, the order of acquisition is determined by
    memory address (i.e., the mutex with lower memory address is
    acquired first). If either mutex is already locked, then locks for
    any outstanding critical sections are released before this thread
    waits for the referenced objects to be unlocked.
-   Py_END_CRITICAL_SECTION2;: Behaves the same as
    Py_END_CRITICAL_SECTION but unlocks two objects.

Additionally, when a thread transitions from the ATTACHED state to the
DETACHED state, it should suspend any active critical sections. When
transitioning from DETACHED to ATTACHED, the most recent suspended
critical section, if any, should be resumed.

Note that operations that lock two containers simultaneously need to use
the Py_BEGIN_CRITICAL_SECTION2 macro. It is not sufficient to nest two
calls to Py_BEGIN_CRITICAL_SECTION because the inner critical section
may release the locks from the outer critical section.

Optimistically Avoiding Locking

A few operations on dict and list optimistically avoid acquiring the
per-object locks. They have a fast path operation that does not acquire
locks, but may fall back to a slower operation that acquires the
dictionary's or list's lock when another thread is concurrently
modifying that container.

The operations with an optimistic fast path are:

-   PyDict_FetchItem/GetItem and dict.__getitem__
-   PyList_FetchItem/GetItem and list.__getitem__

Additionally, iterators for dict and list use the above functions so
they also optimistically avoid locking when returning the next item.

There are two motivations for avoiding lock acquisitions in these
functions. The primary reason is that it is necessary for scalable
multi-threaded performance even for simple applications. Dictionaries
hold top-level functions in modules and methods for classes. These
dictionaries are inherently highly shared by many threads in
multi-threaded programs. Contention on these locks in multi-threaded
programs for loading methods and functions would inhibit efficient
scaling in many basic programs.

The secondary motivation for avoiding locking is to reduce overhead and
improve single-threaded performance. Although lock acquisition has low
overhead compared to most operations, accessing individual elements of
lists and dictionaries are fast operations (so the locking overhead is
comparatively larger) and frequent (so the overhead has more impact).

This section describes the challenges with implementing dictionary and
list accesses without locking followed by a description of this PEP's
changes to the Python interpreter required to address those challenges.

The main challenge is that retrieving an item from a list or dictionary
and incrementing the reference count of that item is not an atomic
operation. In between the time the item is retrieved and the reference
count is incremented, another thread may modify the list or dictionary,
possibly freeing the memory for the previously retrieved item.

A partial attempt at addressing this issue would be to convert the
reference count increment to a conditional increment, only incrementing
the reference count if it's not zero. This change is not sufficient
because when a Python object's reference count reaches zero, the
object's destructor is called and the memory storing the object may be
re-used for other data structures or returned to the operating system.
Instead, this PEP proposes a technique to ensure that the reference
count fields remain valid for the duration of the access, so that the
conditional reference count increment is safe. This technique requires
cooperation from the memory allocator (mimalloc) as well as changes to
the list and dictionary objects. The proposed technique is similar to
read-copy update (RCU)[6], a synchronization mechanism widely used in
the Linux kernel.

The current implementation of list_item (the C function implementing
list.__getitem__) is the following:

    Py_INCREF(a->ob_item[i]);
    return a->ob_item[i];

The proposed implementation uses the conditional increment
(_Py_TRY_INCREF) and has additional checks:

    PyObject **ob_item = atomic_load(&a->ob_item);
    PyObject *item = atomic_load(&ob_item[i]);
    if (!item || !_Py_TRY_INCREF(item)) goto retry;
    if (item != atomic_load(&ob_item[i])) {
      Py_DECREF(item);
      goto retry;
    }
    if (ob_item != atomic_load(&a->ob_item)) {
      Py_DECREF(item);
      goto retry;
    }
    return item;

The "retry" subroutine implements the locked fallback path when
concurrent modifications to the list cause the above fast, non-locking
path to fail:

    retry:
      PyObject *item;
      Py_BEGIN_CRITICAL_SECTION(a->ob_mutex);
      item = a->ob_item[i];
      Py_INCREF(item);
      Py_END_CRITICAL_SECTION(a->ob_mutex);
      return item;

The modifications to the dict implementation are similar, because the
relevant parts of both list and dictionary retrieval involve loading an
item/value from an array at a known index.

The additional checks following the conditional increment are necessary
because the scheme allows immediate re-use of memory, including the
memory that previously held a PyObject structure or list or dict array.
Without these extra checks, the function might return a Python object
that was never in the list, if the memory occupied by the Python object
previously held a different PyObject whose memory previously stored an
item in the list.

Mimalloc Changes for Optimistic list and dict Access

The implementation requires additional constraints to the memory
allocator, including some changes to the mimalloc code. Some background
on mimalloc's implementation is helpful to understand the required
changes. Individual allocations from mimalloc are called "blocks."
Mimalloc "pages" contain consecutive blocks that are all the same size.
A mimalloc "page" is similar to a "superblock" in other allocators; it
is NOT an operating system page. A mimalloc "heap" contains pages of
various size classes; each page belongs to a single heap. If none of the
blocks of a page are allocated, then mimalloc may re-use the page for a
different size class or different heap (i.e., it might reinitialize the
page).

The list and dictionary access scheme works by partially restricting
re-use of mimalloc pages so that reference count fields remains valid
for the duration of the access. The restricted re-use of mimalloc pages
is enforced by having separate heaps for Python objects [7]. This
ensures that even if an item is freed during access and the memory
reused for a new object, the new object's reference count field is
placed at the same location in memory. The reference count field remains
valid (or zero) across allocations.

Python objects that support Py_TPFLAGS_MANAGED_DICT have their
dictionary and weak reference fields preceding the PyObject header, so
their reference count fields are at a different offset from the start of
their allocations. They are stored in a separate mimalloc heap.
Additionally, non-GC objects are stored in their own heap so that the GC
only has to look at GC objects. There are therefore three mimalloc heaps
for Python objects, one for non-GC objects, one for GC objects with
managed dictionaries, and one for GC objects without managed
dictionaries.

Mimalloc Page Reuse

It is beneficial to keep the restrictions on mimalloc page reuse to a
short period of time to avoid increasing overall memory usage. Precisely
limiting the restrictions to list and dictionary accesses would minimize
memory usage, but would require expensive synchronizations. At the other
extreme, keeping the restrictions until the next GC cycle would avoid
introducing any extra synchronizations, but would potentially increase
memory usage.

This PEP proposes a system that lies between those two extremes based on
FreeBSD's "GUS"[8]. It uses a combination of global and per-thread
counters (or "sequence numbers") to coordinate the determination of when
it is safe to reuse an empty mimalloc page for a different heap or for a
different size class, or to return it to the operating system:

-   There is a global write sequence number that monotonically
    increases.
-   When a mimalloc page is empty, it's tagged with the current write
    sequence number. The thread may also atomically increment the global
    write sequence number.
-   Each thread has a local read sequence number that records the most
    recent write sequence number it has observed.
-   Threads may observe the write sequence number whenever they are not
    in a list or dictionary access. The reference implementation does
    this in mimalloc's slow-path allocation function. This is called
    regularly enough to be useful, but not so frequently as to introduce
    significant overhead.
-   There is a global read sequence number that stores the minimum of
    all active threads' read sequence numbers. A thread may update the
    global read sequence number by scanning each threads' local read
    sequence number. The reference implementation does this before
    allocating a fresh mimalloc page if there are restricted pages that
    could possibly be reused.
-   An empty mimalloc page may be reused for a different heap or size
    class when the global read sequence number is larger than the page's
    tag number.

The condition that the global read sequence number is larger than the
page's tag is sufficient because it ensures that any thread that had a
concurrent optimistic list or dictionary access is finished with that
access. In other words, there are no threads accessing the empty blocks
in the freed page, so the page can be used for any other purpose or even
returned to the operating system.

Optimistic dict and list Access Summary

This PEP proposes a technique for thread-safe list and dictionary
accesses that typically avoids acquiring locks. This reduces execution
overhead and avoids some multi-threaded scaling bottlenecks in common
operations, like calling functions and methods. The scheme works by
placing temporary restrictions on mimalloc page reuse to ensure that
objects' reference count fields remain valid after objects are freed so
that conditional reference count increment operations are safe. The
restrictions are placed on mimalloc pages instead of on individual
objects to improve opportunities for memory reuse. The restrictions are
lifted as soon as the system can determine that there are no outstanding
accesses involving the empty mimalloc page. To determine this, the
system uses a combination of lightweight per-thread sequence counters
and also tags pages when they are empty. Once each thread's local
counter is larger than the page's tag, it can be reused for any purpose
or returned to the operating system. The restrictions are also lifted
whenever the cyclic garbage collector runs because the stop-the-world
pause ensures that threads do not have any outstanding references to
empty mimalloc pages.

Specializing Interpreter

The specializing interpreter requires some changes to be thread-safe
when running without the GIL:

-   Concurrent specializations are prevented by using a mutex. This
    prevents multiple threads writing to the same inline cache.
-   In multi-threaded programs running without the GIL, each bytecode is
    only specialized once. This prevents a thread from reading a
    partially written inline cache.
-   Locking also ensures that cached values of tp_version_tag and
    keys_version are consistent with the cached descriptors and other
    values.
-   Modifications to inline counters use "relaxed atomics". In other
    words, some counter decrements may be missed or overwritten, but
    that does not affect correctness.

Py_mod_gil Slot

In --disable-gil builds, when loading an extension, CPython will check
for a new PEP 489-style Py_mod_gil slot. If the slot is set to
Py_mod_gil_not_used, then extension loading proceeds as normal. If the
slot is not set, the interpreter pauses all threads and enables the GIL
before continuing. Additionally, the interpreter will issue a visible
warning naming the extension, that the GIL was enabled (and why) and the
steps the user can take to override it.

PYTHONGIL Environment Variable

In --disable-gil builds, the user can also override the behavior at
runtime by setting the PYTHONGIL environment variable. Setting
PYTHONGIL=0, forces the GIL to be disabled, overriding the module slot
logic. Setting PYTHONGIL=1, forces the GIL to be enabled.

The PYTHONGIL=0 override is important because extensions that are not
thread-safe can still be useful in multi-threaded applications. For
example, one may want to use the extension from only a single thread or
guard access by locks. For context, there are already some extensions
that are not thread-safe even with the GIL, and users already have to
take these sorts of steps.

The PYTHONGIL=1 override is sometimes useful for debugging.

Rationale

Non-Generational Garbage Collection

This PEP proposes switching from a generational cyclic garbage collector
to a non-generational collector (when CPython is built without the GIL).
That is equivalent to only having one generation (the "old" generation).
There are two reasons for this proposed change.

Cyclic garbage collection, even for just the young generation, requires
pausing other threads in the program. The author is concerned that
frequent collections of the young generation would inhibit efficient
scaling in multi-threaded programs. This is a concern for young
generations (but not the old generation) because the young generations
are collected after a fixed number of allocations, while the collections
for the older generation are scheduled in proportion to the number of
live objects in the heap. Additionally, it is difficult to efficiently
keep track of objects in each generation without the GIL. For example,
CPython currently uses a linked list of objects in each generation. If
CPython were to keep that design, those lists would need to be made
thread-safe, and it's not clear how to do that efficiently.

Generational garbage collection is used to good effect in many other
language runtimes. For example, many of the Java HotSpot garbage
collector implementations use multiple generations[9]. In these
runtimes, a young generation is frequently a throughput win: since a
large percentage of the young generation is typically "dead," the GC is
able to reclaim a large amount memory relative to the amount of work
performed. For example, several Java benchmarks show over 90% of "young"
objects are typically collected[10] [11]. This is commonly referred to
as the "weak generational hypothesis;" the observation is that most
objects die young. This pattern is reversed in CPython due to the use of
reference counting. Although most objects still die young, they are
collected when their reference counts reach zero. Objects that survive
to a garbage collection cycle are most likely to remain alive[12]. This
difference means that generational collection is much less effective in
CPython than in many other language runtimes[13].

Optimistic Avoiding Locking in dict and list Accesses

This proposal relies on a scheme that mostly avoids acquiring locks when
accessing individual elements in lists and dictionaries. Note that this
is not "lock free" in the sense of "lock-free" and "wait-free"
algorithms that guarantee forward progress. It simply avoids acquiring
locks (mutexes) in the common case to improve parallelism and reduce
overhead.

A much simpler alternative would be to use reader-writer locks to
protect dictionary and list accesses. Reader-writer locks allow
concurrent reads, but not updates, which might seem ideal for list and
dictionaries. The problem is that reader-writer locks have substantial
overhead and poor scalability, particularly when the critical sections
are small, as they are for single-element dictionary and list
accesses[14]. The poor reader scalability stems from the fact that
readers must all update the same data structure, such as the number of
readers in pthread_rwlocks.

The technique described in this PEP is related to RCU
("read-copy-update")[15] and, to a lesser extent, hazard pointers, two
well-known schemes for optimizing concurrent, read-mostly data
structures. RCU is widely used in the Linux kernel to protect shared
data structures in a scalable manner. Both the technique in this PEP and
RCU work by deferring reclamation while readers may be accessing the
concurrent data structure. RCU is most commonly used to protect
individual objects (like hash tables or linked lists), while this PEP
proposes a scheme to protect larger blocks of memory (mimalloc
"pages")[16].

The need for this scheme is largely due to the use of reference counting
in CPython. If CPython only relied on a tracing garbage collector, then
this scheme would probably not be necessary because tracing garbage
collectors already defer reclamation in the required manner. This would
not "solve" scaling issues, but would shift many of the challenges to
the garbage collector implementation.

Backwards Compatibility

This PEP poses a number of backwards compatibility issues when building
CPython with the --disable-gil flag, but those issues do not occur when
using the default build configuration. Nearly all the backwards
compatibility concerns involve the C-API:

-   CPython builds without the GIL will not be ABI compatible with the
    standard CPython build or with the stable ABI due to changes to the
    Python object header needed to support biased reference counting.
    C-API extensions will need to be rebuilt specifically for this
    version.
-   C-API extensions that rely on the GIL to protect global state or
    object state in C code will need additional explicit locking to
    remain thread-safe when run without the GIL.
-   C-API extensions that use borrowed references in ways that are not
    safe without the GIL will need to use the equivalent new APIs that
    return non-borrowed references. Note that only some uses of borrowed
    references are a concern; only references to objects that might be
    freed by other threads pose an issue.
-   Custom memory allocators (PyMem_SetAllocator) are required to
    delegate the actual allocation to the previously set allocator. For
    example, the Python debug allocator and tracing allocators will
    continue to work because they delegate the allocation to the
    underlying allocator. On the other hand, wholesale replacing of the
    allocator (e.g., with jemalloc or tcmalloc) will not work correctly.
-   Python objects must be allocated through the standard APIs, such as
    PyType_GenericNew or PyObject_Malloc. Non-Python objects must not be
    allocated through those APIs. For example, it is currently
    acceptable to allocate buffers(non-Python objects) through
    PyObject_Malloc; that will no longer be allowed and buffers should
    instead be allocated through PyMem_Malloc, PyMem_RawMalloc, or
    malloc.

There are fewer potential backwards compatibility issues for Python
code:

-   Destructors and weak reference callbacks for code objects and
    top-level function objects are delayed until the next cyclic garbage
    collection due to the use of deferred reference counting.
-   Destructors for some objects accessed by multiple threads may be
    delayed slightly due to biased reference counting. This is rare:
    most objects, even those accessed by multiple threads, are destroyed
    immediately as soon as their reference counts are zero. Two places
    in the Python standard library tests required gc.collect() calls to
    continue to pass.

Distribution

This PEP poses new challenges for distributing Python. At least for some
time, there will be two versions of Python requiring separately compiled
C-API extensions. It may take some time for C-API extension authors to
build --disable-gil compatible packages and upload them to PyPI.
Additionally, some authors may be hesitant to support the --disable-gil
mode until it has wide adoption, but adoption will likely depend on the
availability of Python's rich set of extensions.

To mitigate this, the author will work with Anaconda to distribute a
--disable-gil version of Python together with compatible packages from
conda channels. This centralizes the challenges of building extensions,
and the author believes this will enable more people to use Python
without the GIL sooner than they would otherwise be able to.

Performance

The changes to make CPython thread-safe without the GIL increase
execution overhead for --disable-gil builds. The performance impact is
different for programs that use only a single thread compared to
programs that use multiple threads, so the table below reports execution
overhead separately for these types of programs separately.

                     Intel Skylake   AMD Zen 3
  ------------------ --------------- -----------
  One thread         6%              5%
  Multiple threads   8%              7%

  : Execution Overhead on pyperformance 1.0.6

The baseline used to measure overhead is 018be4c from PR 19474, which
implements immortal objects for Python 3.12. The largest contribution to
execution overhead is biased reference counting followed by per-object
locking. For thread-safety reasons, an application running with multiple
threads will only specialize a given bytecode once; this is why the
overhead for programs that use multiple threads is larger compared to
programs that only use one thread. However, with the GIL disabled,
programs that use multiple threads should also be able to more
effectively use multiple CPU cores.

Note that this PEP would not affect the performance of the default (non
--disable-gil) builds of CPython.

Build Bots

The stable build bots will also include --disable-gil builds.

How to Teach This

As part of implementing the --disable-gil mode, the author will write a
"HOWTO" guide[17] for making packages compatible when running Python
without the GIL.

Reference Implementation

There are two GitHub repositories implementing versions of CPython
without the GIL:

-   https://github.com/colesbury/nogil-3.12
-   https://github.com/colesbury/nogil

The nogil-3.12 is based on Python 3.12.0a4. It is useful for evaluating
single-threaded execution overhead and as a reference implementation for
this PEP. It is less useful for evaluating C-API extension compatibility
because many extensions are not currently compatible with Python 3.12.
Due to limited time for the 3.12 port, the nogil-3.12 implementation
does not skip all deferred reference counts. As a temporary work around,
the implementation immortalizes objects that use deferred reference
counting in programs that spawn multiple threads.

The nogil repository is based on Python 3.9.10. It is useful for
evaluating multi-threading scaling in real world applications and
extension compatibility. It is more stable and well tested than the
nogil-3.12 repository.

Alternatives

Python currently supports a number of ways to enable parallelism, but
the existing techniques come with significant limitations.

Multiprocessing

The multiprocessing library allows Python programs to start and
communicate with Python subprocesses. This allows for parallelism
because each subprocess has its own Python interpreter (i.e., there's
one GIL per process). Multiprocessing has a few substantial limitations.
Communication between processes is limited: objects generally need to be
serialized or copied to shared memory. This introduces overhead (due to
serialization) and complicates building APIs on top of multiprocessing.
Starting a subprocess is also more expensive than starting a thread,
especially with the "spawn" implementation. Starting a thread takes ~100
µs, while spawning a subprocess takes ~50 ms (50,000 µs) due to Python
re-initialization.

Finally, many C and C++ libraries support access from multiple threads
but do not support access or use across multiple processes.

Releasing the GIL in C-API Extensions

C-API extensions can release the GIL around long running functions. This
allows for some degree of parallelism, since multiple threads can run
concurrently when the GIL is released, but the overhead of acquiring and
releasing the GIL typically prevents this from scaling efficiently
beyond a few threads. Many scientific computing libraries release the
GIL in computational heavy functions, and the CPython standard library
releases the GIL around blocking I/O.

Internal Parallelization

Functions implemented in C may use multiple threads internally. For
example, Intel's NumPy distribution, PyTorch, and TensorFlow all use
this technique to internally parallelize individual operations. This
works well when the basic operations are large enough to be parallelized
efficiently, but not when there are many small operations or when the
operations depend on some Python code. Calling into Python from C
requires acquiring the GIL -- even short snippets of Python code can
inhibit scaling.

Related Work

Per-Interpreter GIL

The recently accepted PEP 684 proposes a per-interpreter GIL to address
multi-core parallelism. This would allow parallelism between
interpreters in the same process, but places substantial restrictions on
sharing Python data between interpreters. Both this PEP and PEP 684
address the multi-core parallelism, but with different tradeoffs and
techniques. It is feasible to implement both PEPs in CPython at the same
time.

Gilectomy

Gilectomy[18] was a project by Larry Hastings to remove the GIL in
CPython. Like the design proposed by this PEP, the Gilectomy supported
multiple threads running in parallel within the same interpreter (i.e.,
"free-threading") and made use of fine-grained locking. The reference
implementation in this PEP improves on single-threaded performance and
scalability compared to the Gilectomy.

PyParallel

PyParallel[19] was a proof-of-concept fork of Python 3.3 by Trent Nelson
that supported multiple threads running simultaneously in a single
Python process. The fork introduced the concept of "parallel threads" --
threads that can run simultaneously while the main Python thread is
suspended. Parallel threads had read-only access to objects created by
the main thread. Objects created within parallel threads lived for the
lifetime of the creating thread. For HTTP servers, this might correspond
to the lifetime of a request.

python-safethread

The python-safethread[20] project was a patch to Python 3.0 by Adam
Olsen to remove the GIL. Some aspects of the project are similar to the
design proposed by this PEP. Both use fine-grained locking and optimize
reference counting for cases where the object is created and accessed by
the same thread.

Greg Stein's Free-Threading Patch

In 1996, Greg Stein published a patch against Python 1.4 that removed
the GIL[21]. The patch used atomic reference counting on Windows and a
global reference count lock on Linux. List and dictionary accesses were
protected by mutexes. Parts of the patch were adopted in CPython. In
particular, the patch introduced a PyThreadState structure and correct
per-thread exception handling.

Dave Beazley revisited the patch in a 2011 blog post[22].

Jython and IronPython

Some alternative Python implementations like Jython[23] and
IronPython[24] do not have a global interpreter lock. However, they do
not support CPython extensions. (The implementations can interface with
code written in Java or C#).

PyPy-STM

The pypy-stm[25] interpreter is a variant of PyPy that uses software
transactional memory. The authors report single-threaded performance
overhead in the 20%-50% range compared to PyPy. It is not compatible
with CPython extensions.

Rejected Ideas

Why Not Use a Concurrent Garbage Collector?

Many recent garbage collectors are mostly concurrent -- they avoid long
stop-the-world pauses by allowing the garbage collector to run
concurrently with the application. So why not use a concurrent
collector?

Concurrent collection requires write barriers (or read barriers). The
author is not aware of a way to add write barriers to CPython without
substantially breaking the C-API.

Why Not Deprecate PyDict_GetItem in Favor of PyDict_FetchItem?

This PEP proposes a new API PyDict_FetchItem which behaves like
PyDict_GetItem, but returns a new reference instead of a borrowed
reference. As described in Borrowed References, some uses of borrowed
references that were safe when running with the GIL are unsafe when
running without the GIL and need to be replaced by functions like
PyDict_FetchItem that return new references.

This PEP does not propose deprecating PyDict_GetItem and similar
functions that return borrowed references for a few reasons:

-   Many of the uses of borrowed references are safe, even when running
    without the GIL. For example, C API functions often use
    PyDict_GetItem to retrieve items from the keyword argument
    dictionary. These calls are safe because the keyword argument
    dictionary is only visible to a single thread.
-   I tried this approach early on and found that wholesale replacing of
    PyDict_GetItem with PyDict_FetchItem frequently introduced new
    reference counting bugs. In my opinion, the risk of introducing new
    reference counting bugs generally outweighs the risks of missing a
    PyDict_GetItem call that is unsafe without the GIL.

Why Not Use PEP 683 Immortalization?

Like PEP 683, this PEP proposes an immortalization scheme for Python
objects, but the PEPs use different bit representations to mark immortal
objects. The schemes cannot be identical because this PEP depends on
biased reference counting, which has two reference count fields instead
of one.

Open Issues

Improved Specialization

The Python 3.11 release introduced quickening and specialization as part
of the faster CPython project, substantially improving performance.
Specialization replaces slow bytecode instructions with faster
variants[26]. To maintain thread-safety, applications that use multiple
threads (and run without the GIL) will only specialize each bytecode
once, which can lower performance on some programs. It is possible to
support specializing multiple times, but that requires more
investigation and is not part of this PEP.

Python Build Modes

This PEP introduces a new build mode (--disable-gil) that is not ABI
compatible with the standard build mode. The additional build mode adds
complexity for both Python core developers and extension developers. The
author believes a worthwhile goal is to combine these build modes and
have the global interpreter lock controlled at runtime, possibly
disabled by default. The path to this goal remains an open issue, but a
possible path might look like the following:

1.  In 2024, CPython 3.13 is released with support for a --disable-gil
    build time flag. There are two ABIs for CPython, one with the GIL
    and one without. Extension authors target both ABIs.
2.  After 2--3 releases, (i.e., in 2026--2027), CPython is released with
    the GIL controlled by a runtime environment variable or flag. The
    GIL is enabled by default. There is only a single ABI.
3.  After another 2--3 release (i.e., 2028--2030), CPython switches to
    the GIL being disabled by default. The GIL can still be enabled at
    runtime via an environment variable or command line flag.

This PEP covers the first step, with the remaining steps left as open
issues. In this scenario, there would be a two to three year period
where extension authors would target an extra CPython build per
supported CPU architecture and OS.

Integration

The reference implementation changes approximately 15,000 lines of code
in CPython and includes mimalloc, which is also approximately 15,000
lines of code. Most changes are not performance sensitive and can be
included in both --disable-gil and the default builds. Some macros, like
Py_BEGIN_CRITICAL_SECTION will be no-ops in the default build. Thee
author does not expect a huge number of #ifdef statements to support the
--disable-gil builds.

Mitigations for Single-Threaded Performance

The changes proposed in the PEP will increase execution overhead for
--disable-gil builds compared to Python builds with the GIL. In other
words, it will have slower single-threaded performance. There are some
possible optimizations to reduce execution overhead, especially for
--disable-gil builds that only use a single thread. These may be
worthwhile if a longer term goal is to have a single build mode, but the
choice of optimizations and their trade-offs remain an open issue.

References

Acknowledgments

Thanks to Hugh Leather, Łukasz Langa, and Eric Snow for providing
feedback on drafts of this PEP.

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.

[1] "Exploiting Parallelism Opportunities with Deep Learning
Frameworks." Yu Emma Wang, Carole-Jean Wu, Xiaodong Wang, Kim Hazelwood,
David Brooks. 2019. https://arxiv.org/abs/1908.04705.

[2] "Using Python for Model Inference in Deep Learning." Zachary DeVito,
Jason Ansel, Will Constable, Michael Suo, Ailing Zhang, Kim Hazelwood.
2021. https://arxiv.org/abs/2104.00254. See Figure 5.

[3] "Biased reference counting: minimizing atomic operations in garbage
collection". Jiho Choi, Thomas Shull, and Josep Torrellas. PACT 2018.
https://dl.acm.org/doi/abs/10.1145/3243176.3243195.

[4] https://github.com/colesbury/nogil/blob/f7e45d6bfbbd48c8d5cf851c116b73b85add9fc6/Include/object.h#L428-L455.

[5] https://github.com/colesbury/nogil-3.12.

[6] "What is RCU, Fundamentally?" Paul E. McKenney, Jonathan Walpole.
2017. https://lwn.net/Articles/262464/

[7] There are two heaps for Python objects because PyObjects that
support cyclic garbage collection have extra fields preceding the
PyObject struct.

[8] "Global Unbounded Sequences (GUS)"
https://github.com/freebsd/freebsd-src/blob/9408f36627b74a472dc82f7a43320235c0c9055a/sys/kern/subr_smr.c#L44.
See also https://people.kernel.org/joelfernandes/gus-vs-rcu.

[9] "HotSpot Virtual Machine Garbage Collection Tuning Guide."
https://docs.oracle.com/en/java/javase/12/gctuning/hotspot-virtual-machine-garbage-collection-tuning-guide.pdf.
Most of the hotspot garbage collectors are generational, with the
notable exception of ZGC, although there is ongoing work to make that
generational.

[10] The DaCapo Benchmarks: Java Benchmarking Development and Analysis.
See column "Nursery Survival" in Table 4.

[11] "Exploiting memory usage patterns to improve garbage collections in
Java." https://dl.acm.org/doi/abs/10.1145/1852761.1852768.

[12] "most things usually turn out to be reachable"
https://github.com/python/cpython/blob/cd6655a8589e99ae4088b3bed4a692a19ed48779/Modules/gcmodule.c#L1106.

[13] The Go team observed something similar in Go, but due to escape
analysis and pass-by-value instead of reference counting. Recent
versions of Go use a non-generational garbage collector.
https://go.dev/blog/ismmkeynote.

[14] "Is Parallel Programming Hard, And, If So, What Can You Do About
It?" Paul E. McKenney. 2022.
https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html.

[15] "What is RCU, Fundamentally?" Paul E. McKenney, Jonathan Walpole.
2017. https://lwn.net/Articles/262464/

[16] SLAB_TYPESAFE_BY_RCU is an example in which RCU protects blocks of
memory and not any individual object. See
https://www.kernel.org/doc/html/latest/RCU/whatisRCU.html#analogy-with-reference-counting.

[17] Python HOWTOs. https://docs.python.org/3/howto/index.html.

[18] Gilectomy. Larry Hastings. 2016.
https://github.com/larryhastings/gilectomy/tree/gilectomy.

[19] PyParallel. Trent Nelson. 2016. http://pyparallel.org/.

[20] python-safethread. Adam Olsen. 2008.
https://launchpad.net/python-safethread

[21] https://www.python.org/ftp/python/contrib-09-Dec-1999/System/threading.tar.gz.

[22] An Inside Look at the GIL Removal Patch of Lore. David Beazley.
2011.
https://dabeaz.blogspot.com/2011/08/inside-look-at-gil-removal-patch-of.html.

[23] Jython. https://www.jython.org/

[24] IronPython. https://ironpython.net/

[25] PyPy: Software Transactional Memory.
https://doc.pypy.org/en/latest/stm.html

[26] PEP 659 -- Specializing Adaptive Interpreter.