PEP: 583
Title: A Concurrency Memory Model for Python
Version: $Revision: 56116 $
Last-Modified: $Date: 2007-06-28 12:53:41 -0700 (Thu, 28 Jun 2007) $
Author: Jeffrey Yasskin <jyasskin@google.com>
Status: Withdrawn
Type: Informational
Content-Type: text/x-rst
Created: 22-Mar-2008
Post-History:

Abstract

This PEP describes how Python programs may behave in the presence of
concurrent reads and writes to shared variables from multiple threads.
We use a happens-before relation to define when variable accesses are
ordered or concurrent. Nearly all programs should simply use locks to
guard their shared variables, and this PEP highlights some of the
strange things that can happen when they don't, but programmers often
assume that it's ok to do "simple" things without locking, and it's
somewhat unpythonic to let the language surprise them. Unfortunately,
avoiding surprise often conflicts with making Python run quickly, so
this PEP tries to find a good tradeoff between the two.

Rationale

So far, we have 4 major Python implementations -- CPython, Jython,
IronPython, and PyPy -- as well as lots of minor ones. Some of these
already run on platforms that do aggressive optimizations. In general,
these optimizations are invisible within a single thread of execution,
but they can be visible to other threads executing concurrently. CPython
currently uses a GIL to ensure that other threads see the results they
expect, but this limits it to a single processor. Jython and IronPython
run on Java's or .NET's threading system respectively, which allows them
to take advantage of more cores but can also show surprising values to
other threads.

So that threaded Python programs continue to be portable between
implementations, implementers and library authors need to agree on some
ground rules.

A couple definitions

Variable

    A name that refers to an object. Variables are generally introduced
    by assigning to them, and may be destroyed by passing them to del.
    Variables are fundamentally mutable, while objects may not be. There
    are several varieties of variables: module variables (often called
    "globals" when accessed from within the module), class variables,
    instance variables (also known as fields), and local variables. All
    of these can be shared between threads (the local variables if
    they're saved into a closure). The object in which the variables are
    scoped notionally has a dict whose keys are the variables' names.

Object

    A collection of instance variables (a.k.a. fields) and methods. At
    least, that'll do for this PEP.

Program Order

    The order that actions (reads and writes) happen within a thread,
    which is very similar to the order they appear in the text.

Conflicting actions

    Two actions on the same variable, at least one of which is a write.

Data race

    A situation in which two conflicting actions happen at the same
    time. "The same time" is defined by the memory model.

Two simple memory models

Before talking about the details of data races and the surprising
behaviors they produce, I'll present two simple memory models. The first
is probably too strong for Python, and the second is probably too weak.

Sequential Consistency

In a sequentially-consistent concurrent execution, actions appear to
happen in a global total order with each read of a particular variable
seeing the value written by the last write that affected that variable.
The total order for actions must be consistent with the program order. A
program has a data race on a given input when one of its sequentially
consistent executions puts two conflicting actions next to each other.

This is the easiest memory model for humans to understand, although it
doesn't eliminate all confusion, since operations can be split in odd
places.
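
As a rough illustration of this definition, the sketch below enumerates every sequentially consistent order of two short action lists and reports a race when two conflicting actions land next to each other. The action encoding, the sample program in THREADS, and the function names are assumptions made for this example only. The sketch also assumes that conflicting actions come from different threads.

```python
from itertools import combinations

# An action is (thread, kind, variable); kind is "r" (read) or "w" (write).
# Hypothetical program: thread 1 writes x then reads y, thread 2 reads x.
THREADS = {
    1: [(1, "w", "x"), (1, "r", "y")],
    2: [(2, "r", "x")],
}

def interleavings(a, b):
    """Yield every merge of lists a and b that preserves each list's order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def conflicting(p, q):
    """Same variable, different threads, at least one write."""
    return p[2] == q[2] and p[0] != q[0] and "w" in (p[1], q[1])

def has_sequential_race(a, b):
    """True if some interleaving puts two conflicting actions side by side."""
    return any(
        conflicting(order[i], order[i + 1])
        for order in interleavings(a, b)
        for i in range(len(order) - 1)
    )

print(has_sequential_race(THREADS[1], THREADS[2]))  # True
```

Brute-force enumeration only scales to toy programs, but it makes the "next to each other" wording concrete.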

Happens-before consistency

The program contains a collection of synchronization actions, which in
Python currently include lock acquires and releases and thread starts
and joins. Synchronization actions happen in a global total order that
is consistent with the program order (they don't have to happen in a
total order, but it simplifies the description of the model). A lock
release synchronizes with all later acquires of the same lock.
Similarly, given t = threading.Thread(target=worker):

-   A call to t.start() synchronizes with the first statement in
    worker().
-   The return from worker() synchronizes with the return from t.join().
-   If the return from t.start() happens before (see below) a call to
    t.isAlive() that returns False, the return from worker()
    synchronizes with that call.

We call the source of the synchronizes-with edge a release operation on
the relevant variable, and we call the target an acquire operation.

The happens-before order is the transitive closure of the program order
with the synchronizes-with edges. That is, action A happens before
action B if:

-   A falls before B in the program order (which means they run in the
    same thread),
-   A synchronizes with B, or
-   You can get to B by following happens-before edges from A.

An execution of a program is happens-before consistent if each read R
sees the value of a write W to the same variable such that:

-   R does not happen before W, and
-   There is no other write V that overwrote W before R got a chance to
    see it. (That is, it can't be the case that W happens before V
    happens before R.)

You have a data race if two conflicting actions aren't related by
happens-before.
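
The closure and the race test can be sketched in a few lines. Everything named here (the action labels, ACTIONS, PROGRAM_ORDER, SYNCHRONIZES_WITH, and both functions) is hypothetical and exists only to illustrate the definitions above. The caller is assumed to pass is_race() two actions that already conflict.

```python
from itertools import product

def happens_before(actions, program_order, synchronizes_with):
    """Transitive closure of program order plus synchronizes-with edges."""
    hb = set(program_order) | set(synchronizes_with)
    # Warshall-style closure: the intermediate action k is the outermost loop.
    for k, i, j in product(actions, repeat=3):
        if (i, k) in hb and (k, j) in hb:
            hb.add((i, j))
    return hb

def is_race(a, b, hb):
    """Two conflicting actions race if neither happens before the other."""
    return (a, b) not in hb and (b, a) not in hb

# Thread 1 writes x and releases a lock; thread 2 acquires it and reads x.
ACTIONS = ["write_x", "release", "acquire", "read_x"]
PROGRAM_ORDER = [("write_x", "release"), ("acquire", "read_x")]
SYNCHRONIZES_WITH = [("release", "acquire")]

hb = happens_before(ACTIONS, PROGRAM_ORDER, SYNCHRONIZES_WITH)
print(is_race("write_x", "read_x", hb))  # False
```

Dropping the synchronizes-with edge leaves the write and the read unordered, which is exactly the data race case.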

An example

Let's use the rules from the happens-before model to prove that the
following program prints "[7]":

    import threading

    class Queue:
        def __init__(self):
            self.l = []
            self.cond = threading.Condition()

        def get(self):
            with self.cond:
                while not self.l:
                    self.cond.wait()
                ret = self.l[0]
                self.l = self.l[1:]
                return ret

        def put(self, x):
            with self.cond:
                self.l.append(x)
                self.cond.notify()

    myqueue = Queue()

    def worker1():
        x = [7]
        myqueue.put(x)

    def worker2():
        y = myqueue.get()
        print(y)

    thread1 = threading.Thread(target=worker1)
    thread2 = threading.Thread(target=worker2)
    thread2.start()
    thread1.start()

1.  Because myqueue is initialized in the main thread before thread1 or
    thread2 is started, that initialization happens before worker1 and
    worker2 begin running, so there's no way for either to raise a
    NameError, and both myqueue.l and myqueue.cond are set to their
    final objects.
2.  The initialization of x in worker1 happens before it calls
    myqueue.put(), which happens before it calls myqueue.l.append(x),
    which happens before the call to myqueue.cond.release(), all because
    they run in the same thread.
3.  In worker2, myqueue.cond will be released and re-acquired until
    myqueue.l contains a value (x). The call to myqueue.cond.release()
    in worker1 happens before that last call to myqueue.cond.acquire()
    in worker2.
4.  That last call to myqueue.cond.acquire() happens before
    myqueue.get() reads myqueue.l, which happens before myqueue.get()
    returns, which happens before print y, again all because they run in
    the same thread.
5.  Because happens-before is transitive, the list initially stored in x
    in thread1 is initialized before it is printed in thread2.

Usually, we wouldn't need to look all the way into a thread-safe queue's
implementation in order to prove that uses were safe. Its interface
would specify that puts happen before gets, and we'd reason directly
from that.
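
For instance, the same program can be written against the standard library's queue.Queue (the module is named Queue in Python 2). This is a sketch under the assumption that its put() happens before the matching get(), which its internal locking should provide under this model. The results list is a hypothetical addition so the outcome can be inspected after both threads finish.

```python
import queue
import threading

myqueue = queue.Queue()
results = []  # hypothetical helper so the main thread can inspect the value

def worker1():
    x = [7]
    myqueue.put(x)

def worker2():
    y = myqueue.get()  # blocks until an item is available
    results.append(y)

thread1 = threading.Thread(target=worker1)
thread2 = threading.Thread(target=worker2)
thread2.start()
thread1.start()
thread1.join()
thread2.join()  # join() orders worker2's append before the read below
print(results)  # [[7]]
```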

Surprising behaviors with races

Lots of strange things can happen when code has data races. It's easy to
avoid all of these problems by just protecting shared variables with
locks. This is not a complete list of race hazards; it's just a
collection of those that seem relevant to Python.

In all of these examples, variables starting with r are local variables,
and other variables are shared between threads.

Zombie values

This example comes from the Java memory model:

  Initially p is q and p.x == 0.

    Thread 1    Thread 2
    ----------- ----------
    r1 = p      r6 = p
    r2 = r1.x   r6.x = 3
    r3 = q      
    r4 = r3.x   
    r5 = r1.x   

  Can produce r2 == r5 == 0 but r4 == 3, proving that p.x went from 0 to
  3 and back to 0.

A good compiler would like to optimize out the redundant load of p.x in
initializing r5 by just re-using the value already loaded into r2. We
get the strange result if thread 1 sees memory in this order:

  
    Evaluation   Computes   Why
    ------------ ---------- ----------------------------------------------
    r1 = p                  
    r2 = r1.x    r2 == 0    
    r3 = q       r3 is p    
    p.x = 3                 Side-effect of thread 2
    r4 = r3.x    r4 == 3    
    r5 = r2      r5 == 0    Optimized from r5 = r1.x because r2 == r1.x.
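
The table can be replayed in a single thread to check the values it claims. This is only a sketch: Obj is a stand-in class that does not appear in the original example, and thread 2's write is inlined at the point where thread 1 observes it.

```python
class Obj(object):
    """Stand-in for whatever object p and q refer to (an assumption)."""
    pass

p = q = Obj()  # initially p is q
p.x = 0        # and p.x == 0

# Thread 1's actions, in the order it observes memory.
r1 = p
r2 = r1.x      # r2 == 0
r3 = q         # r3 is p
p.x = 3        # side effect of thread 2 (r6 = p; r6.x = 3)
r4 = r3.x      # r4 == 3
r5 = r2        # the compiler reused r2 instead of reloading r1.x

print(r2, r4, r5)  # 0 3 0
```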

Inconsistent Orderings

From N2177: Sequential Consistency for Atomics, and also known as
Independent Read of Independent Write (IRIW).

  Initially, a == b == 0.

    Thread 1   Thread 2   Thread 3   Thread 4
    ---------- ---------- ---------- ----------
    r1 = a     r3 = b     a = 1      b = 1
    r2 = b     r4 = a                

  We may get r1 == r3 == 1 and r2 == r4 == 0, proving both that a was
  written before b (thread 1's data), and that b was written before a
  (thread 2's data). See Special Relativity for a real-world example.

This can happen if thread 1 and thread 3 are running on processors that
are close to each other, but far away from the processors that threads 2
and 4 are running on and the writes are not being transmitted all the
way across the machine before becoming visible to nearby threads.

Neither acquire/release semantics nor explicit memory barriers can help
with this. Making the orders consistent without locking requires
detailed knowledge of the architecture's memory model, but Java requires
it for volatiles so we could use documentation aimed at its
implementers.
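
To see that sequential consistency rules this outcome out, one can enumerate every interleaving of the four threads and collect the register values. The encoding of actions as tuples and the names IRIW, interleavings, and run are assumptions of this sketch, not part of N2177.

```python
def interleavings(threads):
    """Yield every merge of the action lists that keeps each list's order."""
    if not any(threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + [t[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [t[0]] + tail

# ("read", register, variable) or ("write", variable, value)
IRIW = [
    [("read", "r1", "a"), ("read", "r2", "b")],  # thread 1
    [("read", "r3", "b"), ("read", "r4", "a")],  # thread 2
    [("write", "a", 1)],                         # thread 3
    [("write", "b", 1)],                         # thread 4
]

def run(order):
    """Execute one total order against a single shared memory."""
    memory = {"a": 0, "b": 0}
    regs = {}
    for kind, target, arg in order:
        if kind == "read":
            regs[target] = memory[arg]
        else:
            memory[target] = arg
    return regs["r1"], regs["r2"], regs["r3"], regs["r4"]

outcomes = {run(order) for order in interleavings(IRIW)}
print((1, 0, 1, 0) in outcomes)  # False
```

The tuple (1, 0, 1, 0) stands for r1 == r3 == 1 and r2 == r4 == 0. It never shows up, because any single total order has to put one of the two writes first.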

A happens-before race that's not a sequentially-consistent race

From the POPL paper about the Java memory model [#JMM-popl].

  Initially, x == y == 0.

  +-------------+-------------+
  | Thread 1    | Thread 2    |
  +=============+=============+
  | r1 = x      | r2 = y      |
  +-------------+-------------+
  | if r1 != 0: | if r2 != 0: |
  +-------------+-------------+
  |   y = 42    |   x = 42    |
  +-------------+-------------+

  Can r1 == r2 == 42???

In a sequentially-consistent execution, there's no way to get an
adjacent read and write to the same variable, so the program should be
considered correctly synchronized (albeit fragile), and should only
produce r1 == r2 == 0. However, the following execution is
happens-before consistent:

  +-------------+-------+--------+
  | Statement   | Value | Thread |
  +=============+=======+========+
  | r1 = x      | 42    | 1      |
  +-------------+-------+--------+
  | if r1 != 0: | true  | 1      |
  +-------------+-------+--------+
  |   y = 42    |       | 1      |
  +-------------+-------+--------+
  | r2 = y      | 42    | 2      |
  +-------------+-------+--------+
  | if r2 != 0: | true  | 2      |
  +-------------+-------+--------+
  |   x = 42    |       | 2      |
  +-------------+-------+--------+

WTF, you are asking yourself. Because there were no inter-thread
happens-before edges in the original program, the read of x in thread 1
can see any of the writes from thread 2, even if they only happened
because the read saw them. Under the happens-before model, then, this
program does contain data races.

We don't want to allow this, so the happens-before model isn't enough
for Python. One rule we could add to happens-before that would prevent
this execution is:

  If there are no data races in any sequentially-consistent execution of
  a program, the program should have sequentially consistent semantics.

Java gets this rule as a theorem, but Python may not want all of the
machinery you need to prove it.
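
The sequentially consistent half of that claim can be checked by brute force. In this sketch each thread is a generator that yields wherever the other thread might run. The generator encoding and the names thread1, thread2, and run are assumptions made for illustration.

```python
from itertools import permutations

def thread1(mem, regs):
    regs["r1"] = mem["x"]
    yield  # a point where the other thread may run
    if regs["r1"] != 0:
        mem["y"] = 42

def thread2(mem, regs):
    regs["r2"] = mem["y"]
    yield
    if regs["r2"] != 0:
        mem["x"] = 42

def run(schedule):
    """Run one interleaving; each entry names the thread that steps next."""
    mem, regs = {"x": 0, "y": 0}, {}
    gens = {1: thread1(mem, regs), 2: thread2(mem, regs)}
    for tid in schedule:
        next(gens[tid], None)  # advance to the next yield, or finish
    return regs["r1"], regs["r2"]

# Each thread takes two steps: its read, then its conditional write.
schedules = set(permutations([1, 1, 2, 2]))
outcomes = {run(s) for s in schedules}
print(outcomes)  # {(0, 0)}
```

No schedule ever executes a write, so no sequentially consistent execution contains a pair of conflicting actions at all.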

Self-justifying values

Also from the POPL paper about the Java memory model [#JMM-popl].

  Initially, x == y == 0.

    Thread 1   Thread 2
    ---------- ----------
    r1 = x     r2 = y
    y = r1     x = r2

  Can x == y == 42???

In a sequentially consistent execution, no. In a happens-before
consistent execution, yes: The read of x in thread 1 is allowed to see
the value written in thread 2 because there are no happens-before
relations between the threads. This could happen if the compiler or
processor transforms the code into:

  +--------------+----------+
  | Thread 1     | Thread 2 |
  +==============+==========+
  | y = 42       | r2 = y   |
  +--------------+----------+
  | r1 = x       | x = r2   |
  +--------------+----------+
  | if r1 != 42: |          |
  +--------------+----------+
  |   y = r1     |          |
  +--------------+----------+

It can produce a security hole if the speculated value is a secret
object, or points to the memory that an object used to occupy. Java
cares a lot about such security holes, but Python may not.
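
Replaying the transformed program in one particular interleaving shows where the 42 comes from. This is a single-threaded sketch. The statement order below is one assumed schedule, chosen so that thread 2 runs between thread 1's speculative write and its read.

```python
x = y = 0

y = 42        # thread 1: speculative write
r2 = y        # thread 2: reads the speculated value
x = r2        # thread 2: copies it into x
r1 = x        # thread 1: reads 42, which "justifies" the speculation
if r1 != 42:  # thread 1: would repair y only if the guess had been wrong
    y = r1

print(x, y)   # 42 42
```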

Uninitialized values (direct)

From several classic double-checked locking examples.

  Initially, d == None.

    Thread 1            Thread 2
    ------------------- ------------
    while not d: pass   d = [3, 4]
    assert d[1] == 4    

  This could raise an IndexError, fail the assertion, or, without some
  care in the implementation, cause a crash or other undefined behavior.

Thread 2 may actually be implemented as:

    r1 = list()
    r1.append(3)
    r1.append(4)
    d = r1

Because the assignment to d and the item assignments are independent,
the compiler and processor may optimize that to:

    r1 = list()
    d = r1
    r1.append(3)
    r1.append(4)

Which is obviously incorrect and explains the IndexError. If we then
look deeper into the implementation of r1.append(3), we may find that it
and d[1] cannot run concurrently without causing their own race
conditions. In CPython (without the GIL), those race conditions would
produce undefined behavior.

There's also a subtle issue on the reading side that can cause the value
of d[1] to be out of date. Somewhere in the implementation of list, it
stores its contents as an array in memory. This array may happen to be
in thread 1's cache. If thread 1's processor reloads d from main memory
without reloading the memory that ought to contain the values 3 and 4,
it could see stale values instead. As far as I know, this can only
actually happen on Alphas and maybe Itaniums, and we probably have to
prevent it anyway to avoid crashes.
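
The reordered version of thread 2 can be replayed step by step to show where the IndexError comes from. This is a single-threaded sketch. The function thread1_after_loop and the outcome variable are hypothetical names, and the call is placed at the first moment thread 1's loop could exit.

```python
d = None

def thread1_after_loop():
    # `while not d: pass` has just exited because d became truthy.
    assert d[1] == 4

r1 = list()
d = r1        # published early; [] is falsy, so thread 1 is still spinning
r1.append(3)  # d == [3] is truthy now: thread 1 can leave its loop here
try:
    thread1_after_loop()
    outcome = "ok"
except IndexError:
    outcome = "IndexError"
r1.append(4)  # too late for thread 1

print(outcome)  # IndexError
```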

Uninitialized values (flag)

From several more double-checked locking examples.

  Initially, d == dict() and initialized == False.

    Thread 1                      Thread 2
    ----------------------------- --------------------
    while not initialized: pass   d['a'] = 3
    r1 = d['a']                   initialized = True
    r2 = r1 == 3                  
    assert r2                     

  This could raise a KeyError, fail the assertion, or, without some care
  in the implementation, cause a crash or other undefined behavior.

Because d and initialized are independent (except in the programmer's
mind), the compiler and processor can rearrange these almost
arbitrarily, except that thread 1's assertion has to stay after the
loop.
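
One way to make the programmer's intent visible to the implementation is to replace the plain flag with a synchronization object. The sketch below uses threading.Event and assumes that set() synchronizes with the return of wait(). This PEP does not list events among its synchronization actions, but CPython builds them on a lock and condition. The result list is a hypothetical helper for inspecting the outcome.

```python
import threading

d = dict()
initialized = threading.Event()
result = []  # hypothetical helper so the main thread can inspect r2

def thread1():
    initialized.wait()  # blocks until thread 2 calls set()
    r1 = d['a']
    r2 = r1 == 3
    result.append(r2)

def thread2():
    d['a'] = 3
    initialized.set()   # publishes the write to d

t1 = threading.Thread(target=thread1)
t2 = threading.Thread(target=thread2)
t1.start()
t2.start()
t1.join()
t2.join()
print(result)  # [True]
```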

Inconsistent guarantees from relying on data dependencies

This is a problem with Java final variables and the proposed
data-dependency ordering in C++0x.

  First execute:

      g = []
      def Init():
          g.extend([1,2,3])
          return [1,2,3]
      h = None

  Then in two threads:

    Thread 1              Thread 2
    --------------------- -------------
    while not h: pass     r1 = Init()
    assert h == [1,2,3]   freeze(r1)
    assert h == g         h = r1

  If h has semantics similar to a Java final variable (except for being
  write-once), then even though the first assertion is guaranteed to
  succeed, the second could fail.

Data-dependent guarantees like those final provides only work if the
access is through the final variable. It's not even safe to access the
same object through a different route. Unfortunately, because of how
processors work, final's guarantees are only cheap when they're weak.

The rules for Python

The first rule is that Python interpreters can't crash due to race
conditions in user code. For CPython, this means that race conditions
can't make it down into C. For Jython, it means that
NullPointerExceptions can't escape the interpreter.

Presumably we also want a model at least as strong as happens-before
consistency because it lets us write a simple description of how
concurrent queues and thread launching and joining work.

Other rules are more debatable, so I'll present each one with pros and
cons.

Data-race-free programs are sequentially consistent

We'd like programmers to be able to reason about their programs as if
they were sequentially consistent. Since it's hard to tell whether
you've written a happens-before race, we only want to require
programmers to prevent sequential races. The Java model does this
through a complicated definition of causality, but if we don't want to
include that, we can just assert this property directly.

No security holes from out-of-thin-air reads

If the program produces a self-justifying value, it could expose access
to an object that the user would rather the program not see. Again,
Java's model handles this with the causality definition. We might be
able to prevent these security problems by banning speculative writes to
shared variables, but I don't have a proof of that, and Python may not
need those security guarantees anyway.

Restrict reorderings instead of defining happens-before

The .NET [#CLR-msdn] and x86 [#x86-model] memory models are based on
defining which reorderings compilers may allow. I think that it's easier
to program to a happens-before model than to reason about all of the
possible reorderings of a program, and it's easier to insert enough
happens-before edges to make a program correct, than to insert enough
memory fences to do the same thing. So, although we could layer some
reordering restrictions on top of the happens-before base, I don't think
Python's memory model should be entirely reordering restrictions.

Atomic, unordered assignments

Assignments of primitive types are already atomic. If you assign
(3<<72) + 5 to a variable, no thread can see only part of the value.
Jeremy Manson suggested that we extend this to all objects. This allows
compilers to reorder operations to optimize them, without allowing some
of the more confusing uninitialized values. The basic idea here is that
when you assign a shared variable, readers can't see any changes made to
the new value before the assignment, or to the old value after the
assignment. So, if we have a program like:

  Initially, (d.a, d.b) == (1, 2), and (e.c, e.d) == (3, 4). We also
  have class Obj(object): pass.

    Thread 1     Thread 2
    ------------ ---------------------
    r1 = Obj()   r3 = d
    r1.a = 3     r4, r5 = r3.a, r3.b
    r1.b = 4     r6 = e
    d = r1       r7, r8 = r6.c, r6.d
    r2 = Obj()   
    r2.c = 6     
    r2.d = 7     
    e = r2       

  (r4, r5) can be (1, 2) or (3, 4) but nothing else, and (r7, r8) can be
  either (3, 4) or (6, 7) but nothing else. Unlike a model in which
  writes are releases and reads are acquires, this one makes it legal for
  thread 2 to see (e.c, e.d) == (6, 7) and (d.a, d.b) == (1, 2) (out of
  order).

This allows the compiler a lot of flexibility to optimize without
allowing users to see some strange values. However, because it relies on
data dependencies, it introduces some surprises of its own. For example,
the compiler could freely optimize the above example to:

  
    Thread 1     Thread 2
    ------------ ---------------------
    r1 = Obj()   r3 = d
    r2 = Obj()   r6 = e
    r1.a = 3     r4, r7 = r3.a, r6.c
    r2.c = 6     r5, r8 = r3.b, r6.d
    r2.d = 7     
    e = r2       
    r1.b = 4     
    d = r1       

This is legal as long as the compiler doesn't let the assignment to e
move above any of the initializations of members of r2, and similarly
for d and r1.

This also helps to ground happens-before consistency. To see the
problem, imagine that the user unsafely publishes a reference to an
object as soon as she gets it. The model needs to constrain what values
can be read through that reference. Java says that every field is
initialized to 0 before anyone sees the object for the first time, but
Python would have trouble defining "every field". If instead we say that
assignments to shared variables have to see a value at least as up to
date as when the assignment happened, then we don't run into any trouble
with early publication.

Two tiers of guarantees

Most other languages with any guarantees for unlocked variables
distinguish between ordinary variables and volatile/atomic variables.
They provide many more guarantees for the volatile ones. Python can't
easily do this because we don't declare variables. This may or may not
matter, since Python locks aren't significantly more expensive than
ordinary Python code. If we want to get those tiers back, we could:

1.  Introduce a set of atomic types similar to Java's[1] or C++'s[2].
    Unfortunately, we couldn't assign to them with =.
2.  Without requiring variable declarations, we could also specify that
    all of the fields on a given object are atomic.
3.  Extend the __slots__ mechanism[3] with a parallel __volatiles__
    list, and maybe a __finals__ list.
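
As a rough picture of the first option, here is a minimal lock-based sketch of an atomic reference cell. The class name AtomicReference and its methods are modeled loosely on Java's and are hypothetical. Nothing like this is proposed verbatim in this PEP, and a real implementation would presumably avoid the lock.

```python
import threading

class AtomicReference(object):
    """Hypothetical lock-based atomic cell; not a standard library type."""

    def __init__(self, value=None):
        self._lock = threading.Lock()
        self._value = value

    def get(self):
        with self._lock:
            return self._value

    def set(self, value):
        with self._lock:
            self._value = value

    def compare_and_set(self, expected, new):
        """Store `new` only if the current value is `expected` (by identity)."""
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False

old, new = object(), object()
ref = AtomicReference(old)
print(ref.compare_and_set(old, new))  # True
print(ref.compare_and_set(old, new))  # False
```

Note that updates go through ref.set(value) rather than ref = value, which is the assignment limitation mentioned in the first item.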

Sequential Consistency

We could just adopt sequential consistency for Python. This avoids all
of the hazards mentioned above, but it prohibits lots of optimizations
too. As far as I know, this is the current model of CPython, but if
CPython learned to optimize out some variable reads, it would lose this
property.

If we adopt this, Jython's dict implementation may no longer be able to
use ConcurrentHashMap because that only promises to create appropriate
happens-before edges, not to be sequentially consistent (although maybe
the fact that Java volatiles are totally ordered carries over). Both
Jython and IronPython would probably need to use AtomicReferenceArray or
the equivalent for any __slots__ arrays.

Adapt the x86 model

The x86 model is:

1.  Loads are not reordered with other loads.
2.  Stores are not reordered with other stores.
3.  Stores are not reordered with older loads.
4.  Loads may be reordered with older stores to different locations but
    not with older stores to the same location.
5.  In a multiprocessor system, memory ordering obeys causality (memory
    ordering respects transitive visibility).
6.  In a multiprocessor system, stores to the same location have a total
    order.
7.  In a multiprocessor system, locked instructions have a total order.
8.  Loads and stores are not reordered with locked instructions.

In acquire/release terminology, this appears to say that every store is
a release and every load is an acquire. This is slightly weaker than
sequential consistency, in that it allows inconsistent orderings, but it
disallows zombie values and the compiler optimizations that produce
them. We would probably want to weaken the model somehow to explicitly
allow compilers to eliminate redundant variable reads. The x86 model may
also be expensive to implement on other platforms, although because x86
is so common, that may not matter much.
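
Rule 4 is the one that most visibly departs from sequential consistency. The sketch below explores a deliberately simplified machine in which each thread queues its stores in a private FIFO buffer that drains to memory at arbitrary times. That machine, the two-thread PROGRAM, and the helper names are all assumptions of this example, not a faithful rendering of the x86 specification.

```python
PROGRAM = (
    (("store", "x", 1), ("load", "r1", "y")),  # thread 0
    (("store", "y", 1), ("load", "r2", "x")),  # thread 1
)

def replace(seq, index, item):
    """Return a copy of tuple `seq` with position `index` set to `item`."""
    return seq[:index] + (item,) + seq[index + 1:]

def explore(pcs=(0, 0), buffers=((), ()),
            memory=(("x", 0), ("y", 0)), regs=()):
    """Return the set of final register assignments over all executions."""
    mem = dict(memory)
    finals = set()
    for t, program in enumerate(PROGRAM):
        if pcs[t] < len(program):  # thread t runs one instruction
            op, dst, src = program[pcs[t]]
            next_pcs = replace(pcs, t, pcs[t] + 1)
            if op == "store":      # dst is the variable, src the value
                queued = replace(buffers, t, buffers[t] + ((dst, src),))
                finals |= explore(next_pcs, queued, memory, regs)
            else:                  # load: own buffer first, then memory
                own = [val for var, val in buffers[t] if var == src]
                value = own[-1] if own else mem[src]
                finals |= explore(next_pcs, buffers, memory,
                                  regs + ((dst, value),))
        if buffers[t]:             # oldest buffered store reaches memory
            (var, val), rest = buffers[t][0], buffers[t][1:]
            new_memory = tuple(sorted(dict(mem, **{var: val}).items()))
            finals |= explore(pcs, replace(buffers, t, rest), new_memory, regs)
    if not finals:                 # no moves left: record this result
        finals.add(tuple(sorted(regs)))
    return finals

outcomes = {tuple(value for _, value in final) for final in explore()}
print((0, 0) in outcomes)  # True
```

The pair (0, 0) means r1 == r2 == 0: both loads ran while both stores were still buffered. A sequentially consistent machine cannot produce that result for this program.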

Upgrading or downgrading to an alternate model

We can adopt an initial memory model without totally restricting future
implementations. If we start with a weak model and want to get stronger
later, we would only have to change the implementations, not programs.
Individual implementations could also guarantee a stronger memory model
than the language demands, although that could hurt interoperability. On
the other hand, if we start with a strong model and want to weaken it
later, we can add a from __future__ import weak_memory statement to
declare that some modules are safe.

Implementation Details

The required model is weaker than any particular implementation. This
section tries to document the actual guarantees each implementation
provides, and should be updated as the implementations change.

CPython

Uses the GIL to guarantee that other threads don't see funny
reorderings, and does few enough optimizations that I believe it's
actually sequentially consistent at the bytecode level. Threads can
switch between any two bytecodes (instead of only between statements),
so two threads that concurrently execute:

    i = i + 1

with i initially 0 could easily end up with i==1 instead of the expected
i==2. If they execute:

    i += 1

instead, CPython 2.6 still compiles the statement to separate load, add,
and store bytecodes, so an update can be lost in the same way, and no
implementation is required to make this statement atomic.
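
Whatever a given implementation does for either spelling, guarding the update with a lock makes the result independent of where threads switch. A minimal sketch follows. The names i_lock and increment and the iteration count are chosen only for this example.

```python
import threading

i = 0
i_lock = threading.Lock()

def increment(times):
    global i
    for _ in range(times):
        with i_lock:  # the read, add, and store form one critical section
            i = i + 1

threads = [threading.Thread(target=increment, args=(10000,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(i)  # 20000
```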

PyPy

Also uses a GIL, but probably does enough optimization to violate
sequential consistency. I know very little about this implementation.

Jython

Provides true concurrency under the Java memory model and stores all
object fields (except for those in __slots__?) in a ConcurrentHashMap,
which provides fairly strong ordering guarantees. Local variables in a
function may have fewer guarantees, which would become visible if they
were captured into a closure that was then passed to another thread.

IronPython

Provides true concurrency under the CLR memory model, which probably
protects it from uninitialized values. IronPython uses a locked map to
store object fields, providing at least as many guarantees as Jython.

References

Acknowledgements

Thanks to Jeremy Manson and Alex Martelli for detailed discussions on
what this PEP should look like.

Copyright

This document has been placed in the public domain.

[1] Package java.util.concurrent.atomic
(http://java.sun.com/javase/6/docs/api/java/util/concurrent/atomic/package-summary.html)

[2] C++ Atomic Types and Operations, Hans Boehm and Lawrence Crowl
(http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2427.html)

[3] __slots__ (http://docs.python.org/ref/slots.html)