PEP 583 – A Concurrency Memory Model for Python
- Jeffrey Yasskin <jyasskin at google.com>
Table of Contents
- A couple definitions
- Two simple memory models
- Surprising behaviors with races
- The rules for Python
- Implementation Details
This PEP describes how Python programs may behave in the presence of concurrent reads and writes to shared variables from multiple threads. We use a happens before relation to define when variable accesses are ordered or concurrent. Nearly all programs should simply use locks to guard their shared variables, and this PEP highlights some of the strange things that can happen when they don’t, but programmers often assume that it’s ok to do “simple” things without locking, and it’s somewhat unpythonic to let the language surprise them. Unfortunately, avoiding surprise often conflicts with making Python run quickly, so this PEP tries to find a good tradeoff between the two.
So far, we have 4 major Python implementations – CPython, Jython, IronPython, and PyPy – as well as lots of minor ones. Some of these already run on platforms that do aggressive optimizations. In general, these optimizations are invisible within a single thread of execution, but they can be visible to other threads executing concurrently. CPython currently uses a GIL to ensure that other threads see the results they expect, but this limits it to a single processor. Jython and IronPython run on Java’s or .NET’s threading system respectively, which allows them to take advantage of more cores but can also show surprising values to other threads.
So that threaded Python programs continue to be portable between implementations, implementers and library authors need to agree on some ground rules.
A couple definitions
- Variable
- A name that refers to an object. Variables are generally introduced by assigning to them, and may be destroyed by passing them to del. Variables are fundamentally mutable, while objects may not be. There are several varieties of variables: module variables (often called “globals” when accessed from within the module), class variables, instance variables (also known as fields), and local variables. All of these can be shared between threads (the local variables if they’re saved into a closure). The object in which the variables are scoped notionally has a dict whose keys are the variables’ names.
- Object
- A collection of instance variables (a.k.a. fields) and methods. At least, that’ll do for this PEP.
- Program Order
- The order that actions (reads and writes) happen within a thread, which is very similar to the order they appear in the text.
- Conflicting actions
- Two actions on the same variable, at least one of which is a write.
- Data race
- A situation in which two conflicting actions happen at the same time. “The same time” is defined by the memory model.
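To make the data-race definition concrete: two threads incrementing one shared variable perform conflicting actions, and guarding the variable with a lock is the standard way to keep the conflicting actions from happening "at the same time". A minimal sketch (modern Python 3 syntax, not code from this PEP):

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    """Each increment is a read and a write of `counter`; two threads
    running this conflict unless the lock orders their accesses."""
    global counter
    for _ in range(n):
        with lock:  # the release/acquire pair creates happens-before edges
            counter += 1

threads = [threading.Thread(target=increment, args=(10000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 20000 -- the lock makes the conflicting actions ordered
```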
Two simple memory models
Before talking about the details of data races and the surprising behaviors they produce, I’ll present two simple memory models. The first is probably too strong for Python, and the second is probably too weak.
Sequential Consistency

In a sequentially-consistent concurrent execution, actions appear to happen in a global total order with each read of a particular variable seeing the value written by the last write that affected that variable. The total order for actions must be consistent with the program order. A program has a data race on a given input when one of its sequentially consistent executions puts two conflicting actions next to each other.
This is the easiest memory model for humans to understand, although it doesn’t eliminate all confusion, since operations can be split in odd places.
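For instance, even under sequential consistency, a statement like i = i + 1 is two actions, a read and a write, and another thread's write can land between them. The lost-update interleaving can be simulated deterministically by driving the two steps of each "thread" by hand (a sketch using generators as a stand-in for a scheduler; illustrative code, not from the PEP):

```python
# Simulate two "threads" each executing `i = i + 1`, with the worst-case
# interleaving of their read and write steps chosen deterministically.
i = 0

def thread_steps():
    """The two actions making up `i = i + 1`: a read, then a write."""
    global i
    r = i          # read the shared variable
    yield 'read'
    i = r + 1      # write the incremented value back
    yield 'write'

t1, t2 = thread_steps(), thread_steps()
next(t1)   # thread 1 reads i == 0
next(t2)   # thread 2 reads i == 0, between thread 1's read and write
next(t1)   # thread 1 writes 1
next(t2)   # thread 2 also writes 1 -- thread 1's update is lost
print(i)   # 1, not 2
```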
Happens-before consistency

The program contains a collection of synchronization actions, which in Python currently include lock acquires and releases and thread starts and joins. Synchronization actions happen in a global total order that is consistent with the program order (they don’t have to happen in a total order, but it simplifies the description of the model). A lock release synchronizes with all later acquires of the same lock. Similarly, given t = threading.Thread(target=worker):

- A call to t.start() synchronizes with the first statement in worker().
- The return from worker() synchronizes with the return from t.join().
- If the return from t.start() happens before (see below) a call to t.isAlive() that returns False, the return from worker() synchronizes with that call.
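These edges are what make ordinary start/join code safe without extra locking: a write made before t.start() is visible inside worker(), and worker()'s writes are visible after t.join() returns. A small sketch of the guarantee (Python 3 syntax):

```python
import threading

data = []
result = []

def worker():
    # The edge from t.start() guarantees this thread sees data == [1].
    result.append(data[0] + 1)

data.append(1)                       # happens before t.start()
t = threading.Thread(target=worker)
t.start()
t.join()                             # worker()'s return synchronizes with this
print(result)                        # [2] -- guaranteed visible after join()
```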
We call the source of the synchronizes-with edge a release operation on the relevant variable, and we call the target an acquire operation.
The happens before order is the transitive closure of the program order with the synchronizes-with edges. That is, action A happens before action B if:
- A falls before B in the program order (which means they run in the same thread)
- A synchronizes with B
- You can get to B by following happens-before edges from A.
An execution of a program is happens-before consistent if each read R sees the value of a write W to the same variable such that:
- R does not happen before W, and
- There is no other write V that overwrote W before R got a chance to see it. (That is, it can’t be the case that W happens before V happens before R.)
You have a data race if two conflicting actions aren’t related by happens-before.
Let’s use the rules from the happens-before model to prove that the following program prints “[1]”:

class Queue:
    def __init__(self):
        self.l = []
        self.cond = threading.Condition()

    def get(self):
        with self.cond:
            while not self.l:
                self.cond.wait()
            ret = self.l[0]
            self.l = self.l[1:]
            return ret

    def put(self, x):
        with self.cond:
            self.l.append(x)
            self.cond.notify()

myqueue = Queue()

def worker1():
    x = [1]
    myqueue.put(x)

def worker2():
    y = myqueue.get()
    print y

thread1 = threading.Thread(target=worker1)
thread2 = threading.Thread(target=worker2)

thread2.start()
thread1.start()
- Because myqueue is initialized in the main thread before thread1 and thread2 are started, that initialization happens before worker1 and worker2 begin running, so there’s no way for either to raise a NameError, and both myqueue.l and myqueue.cond are set to their final objects.
- The initialization of x in worker1 happens before it calls myqueue.put(), which happens before it calls myqueue.l.append(x), which happens before the call to myqueue.cond.release(), all because they run in the same thread.
- In worker2, myqueue.cond will be released and re-acquired until myqueue.l contains a value (x). The call to myqueue.cond.release() in worker1 happens before that last call to myqueue.cond.acquire() in worker2.
- That last call to myqueue.cond.acquire() happens before myqueue.get() reads myqueue.l, which happens before myqueue.get() returns, which happens before print y, again all because they run in the same thread.
- Because happens-before is transitive, the list initially stored in x in thread1 is initialized before it is printed in thread2.
Usually, we wouldn’t need to look all the way into a thread-safe queue’s implementation in order to prove that uses were safe. Its interface would specify that puts happen before gets, and we’d reason directly from that.
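The standard library's queue.Queue makes essentially that interface-level promise: an item's put() happens before the get() that returns it, so the consumer sees a fully initialized object without reading the queue's implementation. A sketch of reasoning from the interface (Python 3 syntax):

```python
import queue
import threading

q = queue.Queue()
received = []

def producer():
    item = [1, 2, 3]       # fully built before the put()
    q.put(item)            # put() is the release operation

def consumer():
    item = q.get()         # get() is the matching acquire; blocks until put()
    received.append(item)  # guaranteed to see the initialized [1, 2, 3]

threads = [threading.Thread(target=producer),
           threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(received)  # [[1, 2, 3]]
```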
Surprising behaviors with races
Lots of strange things can happen when code has data races. It’s easy to avoid all of these problems by just protecting shared variables with locks. This is not a complete list of race hazards; it’s just a collection of examples that seem relevant to Python.
In all of these examples, variables starting with
r are local
variables, and other variables are shared between threads.
Zombie values

This example comes from the Java memory model:

Initially, p is q and p.x == 0.

Thread 1     Thread 2
r1 = p       r6 = p
r2 = r1.x    r6.x = 3
r3 = q
r4 = r3.x
r5 = r1.x

Can produce r2 == r5 == 0 but r4 == 3, proving that p.x went from 0 to 3 and back to 0.
A good compiler would like to optimize out the redundant load of p.x in initializing r5 by just re-using the value already in r2. We get the strange result if thread 1 sees memory in this order:

Evaluation    Computes    Why
r1 = p
r2 = r1.x     r2 == 0
r3 = q        r3 is p
p.x = 3                   Side-effect of thread 2
r4 = r3.x     r4 == 3
r5 = r2       r5 == 0     Optimized from r5 = r1.x because r2 == r1.x.
Inconsistent Orderings

From N2177: Sequential Consistency for Atomics, and also known as Independent Reads of Independent Writes (IRIW).

Initially, a == b == 0.

Thread 1    Thread 2    Thread 3    Thread 4
r1 = a      r3 = b      a = 1       b = 1
r2 = b      r4 = a

We may get r1 == r3 == 1 and r2 == r4 == 0, proving both that a was written before b (thread 1’s data), and that b was written before a (thread 2’s data). See Special Relativity for a real-world example.
This can happen if thread 1 and thread 3 are running on processors that are close to each other, but far away from the processors that threads 2 and 4 are running on and the writes are not being transmitted all the way across the machine before becoming visible to nearby threads.
Neither acquire/release semantics nor explicit memory barriers can help with this. Making the orders consistent without locking requires detailed knowledge of the architecture’s memory model, but Java requires it for volatiles so we could use documentation aimed at its implementers.
A happens-before race that’s not a sequentially-consistent race
From the POPL paper about the Java memory model [#JMM-popl].
Initially, x == y == 0.

Thread 1       Thread 2
r1 = x         r2 = y
if r1 != 0:    if r2 != 0:
  y = 42         x = 42

Can r1 == r2 == 42???

In a sequentially-consistent execution, there’s no way to get an adjacent read and write to the same variable, so the program should be considered correctly synchronized (albeit fragile), and should only produce r1 == r2 == 0. However, the following execution is happens-before consistent:

Statement      Value    Thread
r1 = x         42       1
if r1 != 0:    true     1
  y = 42                1
r2 = y         42       2
if r2 != 0:    true     2
  x = 42                2
WTF, you are asking yourself. Because there were no inter-thread happens-before edges in the original program, the read of x in thread 1 can see any of the writes from thread 2, even if they only happened because the read saw them. There are data races in the happens-before model.
We don’t want to allow this, so the happens-before model isn’t enough for Python. One rule we could add to happens-before that would prevent this execution is:
If there are no data races in any sequentially-consistent execution of a program, the program should have sequentially consistent semantics.
Java gets this rule as a theorem, but Python may not want all of the machinery you need to prove it.
Self-justifying values

Also from the POPL paper about the Java memory model [#JMM-popl].

Initially, x == y == 0.

Thread 1    Thread 2
r1 = x      r2 = y
y = r1      x = r2

Can x == y == 42???

In a sequentially consistent execution, no. In a happens-before consistent execution, yes: the read of x in thread 1 is allowed to see the value written in thread 2 because there are no happens-before relations between the threads. This could happen if the compiler or processor transforms the code into:

Thread 1        Thread 2
y = 42          r2 = y
r1 = x          x = r2
if r1 != 42:
  y = r1
It can produce a security hole if the speculated value is a secret object, or points to the memory that an object used to occupy. Java cares a lot about such security holes, but Python may not.
Uninitialized values (direct)
From several classic double-checked locking examples.
Initially, d == None.

Thread 1             Thread 2
while not d: pass    d = [3, 4]
assert d[1] == 4
This could raise an IndexError, fail the assertion, or, without some care in the implementation, cause a crash or other undefined behavior.
Thread 2 may actually be implemented as:
r1 = list()
r1.append(3)
r1.append(4)
d = r1
Because the assignment to d and the item assignments are independent, the compiler and processor may optimize that to:
r1 = list()
d = r1
r1.append(3)
r1.append(4)
Which is obviously incorrect and explains the IndexError. If we then look deeper into the implementation of r1.append(3), we may find that it and thread 1’s read of d[1] cannot run concurrently without causing their own race conditions. In CPython (without the GIL), those race conditions would produce undefined behavior.
There’s also a subtle issue on the reading side that can cause the
value of d to be out of date. Somewhere in the implementation of
list, it stores its contents as an array in memory. This array may
happen to be in thread 1’s cache. If thread 1’s processor reloads
d from main memory without reloading the memory that ought to
contain the values 3 and 4, it could see stale values instead. As far
as I know, this can only actually happen on Alphas and maybe Itaniums,
and we probably have to prevent it anyway to avoid crashes.
Uninitialized values (flag)
From several more double-checked locking examples.
Initially, d == dict() and initialized == False.

Thread 1                       Thread 2
while not initialized: pass    d['a'] = 3
r1 = d['a']                    initialized = True
r2 = (r1 == 3)
assert r2

This could raise a KeyError, fail the assertion, or, without some care in the implementation, cause a crash or other undefined behavior.

Because the assignment to d['a'] and the setting of initialized are independent (except in the programmer’s mind), the compiler and processor can rearrange these almost arbitrarily, except that thread 1’s assertion has to stay after its read of d['a'].
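A portable repair is to make the flag itself a synchronization action. threading.Event's set() acts as a release, and a wait() that returns acts as the matching acquire, so every write made before set() is visible after wait(). A sketch of the repaired pattern (not text from the PEP):

```python
import threading

d = {}
initialized = threading.Event()
results = []

def writer():
    d['a'] = 3            # all writes before set() ...
    initialized.set()     # ... are published by this release operation

def reader():
    initialized.wait()    # the matching acquire: a happens-before edge
    results.append(d['a'])

threads = [threading.Thread(target=writer),
           threading.Thread(target=reader)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [3]
```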
Inconsistent guarantees from relying on data dependencies
This is a problem with Java final variables and the proposed data-dependency ordering in C++0x.

First execute:

g = []
def Init():
    g.extend([1,2,3])
    return [1,2,3]
h = None
Then in two threads:
Thread 1               Thread 2
while not h: pass      r1 = Init()
assert h == [1,2,3]    freeze(r1)
assert h == g          h = r1
If h has semantics similar to a Java final variable (except for being write-once), then even though the first assertion is guaranteed to succeed, the second could fail. Data-dependent guarantees like those final provides only work if the access is through the final variable. It’s not even safe to access the same object through a different route. Unfortunately, because of how processors work, final’s guarantees are only cheap when they’re restricted to data-dependent accesses.
The rules for Python
The first rule is that Python interpreters can’t crash due to race conditions in user code. For CPython, this means that race conditions can’t make it down into C. For Jython, it means that NullPointerExceptions can’t escape the interpreter.
Presumably we also want a model at least as strong as happens-before consistency because it lets us write a simple description of how concurrent queues and thread launching and joining work.
Other rules are more debatable, so I’ll present each one with pros and cons.
Data-race-free programs are sequentially consistent
We’d like programmers to be able to reason about their programs as if they were sequentially consistent. Since it’s hard to tell whether you’ve written a happens-before race, we only want to require programmers to prevent sequential races. The Java model does this through a complicated definition of causality, but if we don’t want to include that, we can just assert this property directly.
No security holes from out-of-thin-air reads
If the program produces a self-justifying value, it could expose access to an object that the user would rather the program not see. Again, Java’s model handles this with the causality definition. We might be able to prevent these security problems by banning speculative writes to shared variables, but I don’t have a proof of that, and Python may not need those security guarantees anyway.
Restrict reorderings instead of defining happens-before
The .NET [#CLR-msdn] and x86 [#x86-model] memory models are based on defining which reorderings compilers may allow. I think that it’s easier to program to a happens-before model than to reason about all of the possible reorderings of a program, and it’s easier to insert enough happens-before edges to make a program correct, than to insert enough memory fences to do the same thing. So, although we could layer some reordering restrictions on top of the happens-before base, I don’t think Python’s memory model should be entirely reordering restrictions.
Atomic, unordered assignments
Assignments of primitive types are already atomic. If you assign
3<<72 + 5 to a variable, no thread can see only part of the value.
Jeremy Manson suggested that we extend this to all objects. This
allows compilers to reorder operations to optimize them, without
allowing some of the more confusing uninitialized values. The
basic idea here is that when you assign a shared variable, readers
can’t see any changes made to the new value before the assignment, or
to the old value after the assignment. So, if we have a program like:
Initially, (d.a, d.b) == (1, 2), and (e.c, e.d) == (3, 4). We also have class Obj(object): pass.
Thread 1      Thread 2
r1 = Obj()    r3 = d
r1.a = 3      r4, r5 = r3.a, r3.b
r1.b = 4      r6 = e
d = r1        r7, r8 = r6.c, r6.d
r2 = Obj()
r2.c = 6
r2.d = 7
e = r2
Then (r4, r5) can be (1, 2) or (3, 4) but nothing else, and (r7, r8) can be (3, 4) or (6, 7) but nothing else. Unlike if writes were releases and reads were acquires, it’s legal for thread 2 to see (e.c, e.d) == (6, 7) and (d.a, d.b) == (1, 2) (out of order).
This allows the compiler a lot of flexibility to optimize without allowing users to see some strange values. However, because it relies on data dependencies, it introduces some surprises of its own. For example, the compiler could freely optimize the above example to:
Thread 1      Thread 2
r1 = Obj()    r3 = d
r2 = Obj()    r6 = e
r1.a = 3      r4, r7 = r3.a, r6.c
r2.c = 6      r5, r8 = r3.b, r6.d
r2.d = 7
e = r2
r1.b = 4
d = r1
As long as it didn’t let the initialization of e move above any of the initializations of members of r2, and similarly for d and r1.
This also helps to ground happens-before consistency. To see the problem, imagine that the user unsafely publishes a reference to an object as soon as she gets it. The model needs to constrain what values can be read through that reference. Java says that every field is initialized to 0 before anyone sees the object for the first time, but Python would have trouble defining “every field”. If instead we say that assignments to shared variables have to see a value at least as up to date as when the assignment happened, then we don’t run into any trouble with early publication.
Two tiers of guarantees
Most other languages with any guarantees for unlocked variables distinguish between ordinary variables and volatile/atomic variables. They provide many more guarantees for the volatile ones. Python can’t easily do this because we don’t declare variables. This may or may not matter, since python locks aren’t significantly more expensive than ordinary python code. If we want to get those tiers back, we could:
- Introduce a set of atomic types similar to Java’s java.util.concurrent.atomic package or C++’s atomic<> types. Unfortunately, we couldn’t assign to them with =.
- Without requiring variable declarations, we could also specify that all of the fields on a given object are atomic.
- Extend the __slots__ mechanism with a parallel __volatiles__ list, and maybe a __finals__ list.
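To make the first option concrete, an atomic cell could be spelled as a class rather than a declaration. This hypothetical AtomicRef (name and API invented for illustration; no such type is proposed by the PEP) is just a lock around a value; note that plain = would rebind the variable instead of calling set(), which is the assignment problem mentioned above:

```python
import threading

class AtomicRef:
    """Hypothetical atomic cell: get() and set() act as acquire and
    release operations because each takes the same internal lock."""
    def __init__(self, value=None):
        self._lock = threading.Lock()
        self._value = value

    def get(self):
        with self._lock:
            return self._value

    def set(self, value):
        with self._lock:
            self._value = value

ref = AtomicRef(0)
ref.set([1, 2])      # instead of `ref = [1, 2]`, which would just rebind
print(ref.get())     # [1, 2]
```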
Sequential Consistency

We could just adopt sequential consistency for Python. This avoids all of the hazards mentioned above, but it prohibits lots of optimizations too. As far as I know, this is the current model of CPython, but if CPython learned to optimize out some variable reads, it would lose this property.
If we adopt this, Jython’s dict implementation may no longer be able to use ConcurrentHashMap because that only promises to create appropriate happens-before edges, not to be sequentially consistent (although maybe the fact that Java volatiles are totally ordered carries over). Both Jython and IronPython would probably need to use AtomicReferenceArray or the equivalent for any __slots__ arrays.
Adapt the x86 model
The x86 model is:
- Loads are not reordered with other loads.
- Stores are not reordered with other stores.
- Stores are not reordered with older loads.
- Loads may be reordered with older stores to different locations but not with older stores to the same location.
- In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility).
- In a multiprocessor system, stores to the same location have a total order.
- In a multiprocessor system, locked instructions have a total order.
- Loads and stores are not reordered with locked instructions.
In acquire/release terminology, this appears to say that every store is a release and every load is an acquire. This is slightly weaker than sequential consistency, in that it allows inconsistent orderings, but it disallows zombie values and the compiler optimizations that produce them. We would probably want to weaken the model somehow to explicitly allow compilers to eliminate redundant variable reads. The x86 model may also be expensive to implement on other platforms, although because x86 is so common, that may not matter much.
Upgrading or downgrading to an alternate model
We can adopt an initial memory model without totally restricting
future implementations. If we start with a weak model and want to get
stronger later, we would only have to change the implementations, not
programs. Individual implementations could also guarantee a stronger
memory model than the language demands, although that could hurt
interoperability. On the other hand, if we start with a strong model
and want to weaken it later, we can add a
from __future__ import
weak_memory statement to declare that some modules are safe.
Implementation Details

The required model is weaker than any particular implementation. This section tries to document the actual guarantees each implementation provides, and should be updated as the implementations change.
CPython

Uses the GIL to guarantee that other threads don’t see funny reorderings, and does few enough optimizations that I believe it’s actually sequentially consistent at the bytecode level. Threads can switch between any two bytecodes (instead of only between statements), so two threads that concurrently execute:

i = i + 1

with i initially 0 could easily end up with i==1 instead of the expected i==2. If they execute:

i += 1

instead, CPython 2.6 will always give the right answer, but it’s easy to imagine another implementation in which this statement won’t be atomic.
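The mid-statement switch is possible because one statement compiles to several bytecodes, and a thread switch can occur between any two of them; the dis module makes the split visible. A sketch (opcode names vary across CPython versions, so only the instruction count matters here):

```python
import dis

def bump(i):
    i = i + 1   # one statement, several bytecode instructions
    return i

# Each instruction boundary is a point where CPython may switch threads.
ops = [ins.opname for ins in dis.get_instructions(bump)]
print(len(ops))   # several instructions: load, add, store, ...
```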
PyPy

Also uses a GIL, but probably does enough optimization to violate sequential consistency. I know very little about this implementation.
Jython

Provides true concurrency under the Java memory model and stores all object fields (except for those in __slots__?) in a ConcurrentHashMap, which provides fairly strong ordering guarantees. Local variables in a function may have fewer guarantees, which would become visible if they were captured into a closure that was then passed to another thread.
IronPython

Provides true concurrency under the CLR memory model, which probably protects it from uninitialized values. IronPython uses a locked map to store object fields, providing at least as many guarantees as Jython.
References

- The Java Memory Model, by Jeremy Manson, Bill Pugh, and Sarita Adve (http://www.cs.umd.edu/users/jmanson/java/journal.pdf). This paper is an excellent introduction to memory models in general and has lots of examples of compiler/processor optimizations and the strange program behaviors they can produce.
- N2480: A Less Formal Explanation of the Proposed C++ Concurrency Memory Model, Hans Boehm (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2480.html)
- Memory Models: Understand the Impact of Low-Lock Techniques in Multithreaded Apps, Vance Morrison (http://msdn2.microsoft.com/en-us/magazine/cc163715.aspx)
- Intel(R) 64 Architecture Memory Ordering White Paper (http://www.intel.com/products/processor/manuals/318147.pdf)
- Package java.util.concurrent.atomic (http://java.sun.com/javase/6/docs/api/java/util/concurrent/atomic/package-summary.html)
- C++ Atomic Types and Operations, Hans Boehm and Lawrence Crowl (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2427.html)
- __slots__ (http://docs.python.org/ref/slots.html)
- Alternatives to SC, a thread on the cpp-threads mailing list, which includes lots of good examples. (http://www.decadentplace.org.uk/pipermail/cpp-threads/2007-January/001287.html)
- python-safethread, a patch by Adam Olsen for CPython that removes the GIL and statically guarantees that all objects shared between threads are consistently locked. (http://code.google.com/p/python-safethread/)
Thanks to Jeremy Manson and Alex Martelli for detailed discussions on what this PEP should look like.
This document has been placed in the public domain.
Last modified: 2019-03-01 19:39:16 GMT