PEP: 255 Title: Simple Generators Author: Neil Schemenauer
<nas@arctrix.com>, Tim Peters <tim.peters@gmail.com>, Magnus Lie Hetland
<magnus@hetland.org> Status: Final Type: Standards Track Content-Type:
text/x-rst Requires: 234 Created: 18-May-2001 Python-Version: 2.2
Post-History: 14-Jun-2001, 23-Jun-2001

Abstract

This PEP introduces the concept of generators to Python, as well as a
new statement used in conjunction with them, the yield statement.

Motivation

When a producer function has a hard enough job that it requires
maintaining state between values produced, most programming languages
offer no pleasant and efficient solution beyond adding a callback
function to the producer's argument list, to be called with each value
produced.

For example, tokenize.py in the standard library takes this approach:
the caller must pass a tokeneater function to tokenize(), called
whenever tokenize() finds the next token. This allows tokenize to be
coded in a natural way, but programs calling tokenize are typically
convoluted by the need to remember between callbacks which token(s) were
seen last. The tokeneater function in tabnanny.py is a good example of
that, maintaining a state machine in global variables, to remember
across callbacks what it has already seen and what it hopes to see next.
This was difficult to get working correctly, and is still difficult for
people to understand. Unfortunately, that's typical of this approach.

An alternative would have been for tokenize to produce an entire parse
of the Python program at once, in a large list. Then tokenize clients
could be written in a natural way, using local variables and local
control flow (such as loops and nested if statements) to keep track of
their state. But this isn't practical: programs can be very large, so no
a priori bound can be placed on the memory needed to materialize the
whole parse; and some tokenize clients only want to see whether
something specific appears early in the program (e.g., a future
statement, or, as is done in IDLE, just the first indented statement),
and then parsing the whole program first is a severe waste of time.

Another alternative would be to make tokenize an iterator <234>,
delivering the next token whenever its .next() method is invoked. This
is pleasant for the caller in the same way a large list of results would
be, but without the memory and "what if I want to get out early?"
drawbacks. However, this shifts the burden on tokenize to remember its
state between .next() invocations, and the reader need only glance at
tokenize.tokenize_loop() to realize what a horrid chore that would be.
Or picture a recursive algorithm for producing the nodes of a general
tree structure: to cast that into an iterator framework requires
removing the recursion manually and maintaining the state of the
traversal by hand.

A fourth option is to run the producer and consumer in separate threads.
This allows both to maintain their states in natural ways, and so is
pleasant for both. Indeed, Demo/threads/Generator.py in the Python
source distribution provides a usable synchronized-communication class
for doing that in a general way. This doesn't work on platforms without
threads, though, and is very slow on platforms that do (compared to what
is achievable without threads).

A final option is to use the Stackless[1] (PEP 219) variant
implementation of Python instead, which supports lightweight coroutines.
This has much the same programmatic benefits as the thread option, but
is much more efficient. However, Stackless is a controversial rethinking
of the Python core, and it may not be possible for Jython to implement
the same semantics. This PEP isn't the place to debate that, so suffice
it to say here that generators provide a useful subset of Stackless
functionality in a way that fits easily into the current CPython
implementation, and is believed to be relatively straightforward for
other Python implementations.

That exhausts the current alternatives. Some other high-level languages
provide pleasant solutions, notably iterators in Sather[2], which were
inspired by iterators in CLU; and generators in Icon[3], a novel
language where every expression is a generator. There are differences
among these, but the basic idea is the same: provide a kind of function
that can return an intermediate result ("the next value") to its caller,
but maintaining the function's local state so that the function can be
resumed again right where it left off. A very simple example:

    def fib():
        a, b = 0, 1
        while 1:
           yield b
           a, b = b, a+b

When fib() is first invoked, it sets a to 0 and b to 1, then yields b
back to its caller. The caller sees 1. When fib is resumed, from its
point of view the yield statement is really the same as, say, a print
statement: fib continues after the yield with all local state intact. a
and b then become 1 and 1, and fib loops back to the yield, yielding 1
to its invoker. And so on. From fib's point of view it's just delivering
a sequence of results, as if via callback. But from its caller's point
of view, the fib invocation is an iterable object that can be resumed at
will. As in the thread approach, this allows both sides to be coded in
the most natural ways; but unlike the thread approach, this can be done
efficiently and on all platforms. Indeed, resuming a generator should be
no more expensive than a function call.

The same kind of approach applies to many producer/consumer functions.
For example, tokenize.py could yield the next token instead of invoking
a callback function with it as argument, and tokenize clients could
iterate over the tokens in a natural way: a Python generator is a kind
of Python iterator <234>, but of an especially powerful kind.

Specification: Yield

A new statement is introduced:

    yield_stmt:    "yield" expression_list

yield is a new keyword, so a future statement (PEP 236) is needed to
phase this in: in the initial release, a module desiring to use
generators must include the line:

    from __future__ import generators

near the top (see PEP 236) for details). Modules using the identifier
yield without a future statement will trigger warnings. In the following
release, yield will be a language keyword and the future statement will
no longer be needed.

The yield statement may only be used inside functions. A function that
contains a yield statement is called a generator function. A generator
function is an ordinary function object in all respects, but has the new
CO_GENERATOR flag set in the code object's co_flags member.

When a generator function is called, the actual arguments are bound to
function-local formal argument names in the usual way, but no code in
the body of the function is executed. Instead a generator-iterator
object is returned; this conforms to the iterator protocol <234>, so in
particular can be used in for-loops in a natural way. Note that when the
intent is clear from context, the unqualified name "generator" may be
used to refer either to a generator-function or a generator-iterator.

Each time the .next() method of a generator-iterator is invoked, the
code in the body of the generator-function is executed until a yield or
return statement (see below) is encountered, or until the end of the
body is reached.

If a yield statement is encountered, the state of the function is
frozen, and the value of expression_list is returned to .next()'s
caller. By "frozen" we mean that all local state is retained, including
the current bindings of local variables, the instruction pointer, and
the internal evaluation stack: enough information is saved so that the
next time .next() is invoked, the function can proceed exactly as if the
yield statement were just another external call.

Restriction: A yield statement is not allowed in the try clause of a
try/finally construct. The difficulty is that there's no guarantee the
generator will ever be resumed, hence no guarantee that the finally
block will ever get executed; that's too much a violation of finally's
purpose to bear.

Restriction: A generator cannot be resumed while it is actively running:

    >>> def g():
    ...     i = me.next()
    ...     yield i
    >>> me = g()
    >>> me.next()
    Traceback (most recent call last):
     ...
     File "<string>", line 2, in g
    ValueError: generator already executing

Specification: Return

A generator function can also contain return statements of the form:

    return

Note that an expression_list is not allowed on return statements in the
body of a generator (although, of course, they may appear in the bodies
of non-generator functions nested within the generator).

When a return statement is encountered, control proceeds as in any
function return, executing the appropriate finally clauses (if any
exist). Then a StopIteration exception is raised, signalling that the
iterator is exhausted. A StopIteration exception is also raised if
control flows off the end of the generator without an explicit return.

Note that return means "I'm done, and have nothing interesting to
return", for both generator functions and non-generator functions.

Note that return isn't always equivalent to raising StopIteration: the
difference lies in how enclosing try/except constructs are treated. For
example,:

    >>> def f1():
    ...     try:
    ...         return
    ...     except:
    ...        yield 1
    >>> print list(f1())
    []

because, as in any function, return simply exits, but:

    >>> def f2():
    ...     try:
    ...         raise StopIteration
    ...     except:
    ...         yield 42
    >>> print list(f2())
    [42]

because StopIteration is captured by a bare except, as is any exception.

Specification: Generators and Exception Propagation

If an unhandled exception-- including, but not limited to, StopIteration
--is raised by, or passes through, a generator function, then the
exception is passed on to the caller in the usual way, and subsequent
attempts to resume the generator function raise StopIteration. In other
words, an unhandled exception terminates a generator's useful life.

Example (not idiomatic but to illustrate the point):

    >>> def f():
    ...     return 1/0
    >>> def g():
    ...     yield f()  # the zero division exception propagates
    ...     yield 42   # and we'll never get here
    >>> k = g()
    >>> k.next()
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "<stdin>", line 2, in g
      File "<stdin>", line 2, in f
    ZeroDivisionError: integer division or modulo by zero
    >>> k.next()  # and the generator cannot be resumed
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    StopIteration
    >>>

Specification: Try/Except/Finally

As noted earlier, yield is not allowed in the try clause of a
try/finally construct. A consequence is that generators should allocate
critical resources with great care. There is no restriction on yield
otherwise appearing in finally clauses, except clauses, or in the try
clause of a try/except construct:

    >>> def f():
    ...     try:
    ...         yield 1
    ...         try:
    ...             yield 2
    ...             1/0
    ...             yield 3  # never get here
    ...         except ZeroDivisionError:
    ...             yield 4
    ...             yield 5
    ...             raise
    ...         except:
    ...             yield 6
    ...         yield 7     # the "raise" above stops this
    ...     except:
    ...         yield 8
    ...     yield 9
    ...     try:
    ...         x = 12
    ...     finally:
    ...        yield 10
    ...     yield 11
    >>> print list(f())
    [1, 2, 4, 5, 8, 9, 10, 11]
    >>>

Example

    # A binary tree class.
    class Tree:

        def __init__(self, label, left=None, right=None):
            self.label = label
            self.left = left
            self.right = right

        def __repr__(self, level=0, indent="    "):
            s = level*indent + `self.label`
            if self.left:
                s = s + "\n" + self.left.__repr__(level+1, indent)
            if self.right:
                s = s + "\n" + self.right.__repr__(level+1, indent)
            return s

        def __iter__(self):
            return inorder(self)

    # Create a Tree from a list.
    def tree(list):
        n = len(list)
        if n == 0:
            return []
        i = n / 2
        return Tree(list[i], tree(list[:i]), tree(list[i+1:]))

    # A recursive generator that generates Tree labels in in-order.
    def inorder(t):
        if t:
            for x in inorder(t.left):
                yield x
            yield t.label
            for x in inorder(t.right):
                yield x

    # Show it off: create a tree.
    t = tree("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    # Print the nodes of the tree in in-order.
    for x in t:
        print x,
    print

    # A non-recursive generator.
    def inorder(node):
        stack = []
        while node:
            while node.left:
                stack.append(node)
                node = node.left
            yield node.label
            while not node.right:
                try:
                    node = stack.pop()
                except IndexError:
                    return
                yield node.label
            node = node.right

    # Exercise the non-recursive generator.
    for x in t:
        print x,
    print

Both output blocks display:

    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Q & A

Why not a new keyword instead of reusing def?

See BDFL Pronouncements section below.

Why a new keyword for yield? Why not a builtin function instead?

Control flow is much better expressed via keyword in Python, and yield
is a control construct. It's also believed that efficient implementation
in Jython requires that the compiler be able to determine potential
suspension points at compile-time, and a new keyword makes that easy.
The CPython reference implementation also exploits it heavily, to detect
which functions are generator-functions (although a new keyword in place
of def would solve that for CPython -- but people asking the "why a new
keyword?" question don't want any new keyword).

Then why not some other special syntax without a new keyword?

For example, one of these instead of yield 3:

    return 3 and continue
    return and continue 3
    return generating 3
    continue return 3
    return >> , 3
    from generator return 3
    return >> 3
    return << 3
    >> 3
    << 3
    * 3

Did I miss one <wink>? Out of hundreds of messages, I counted three
suggesting such an alternative, and extracted the above from them. It
would be nice not to need a new keyword, but nicer to make yield very
clear -- I don't want to have to deduce that a yield is occurring from
making sense of a previously senseless sequence of keywords or
operators. Still, if this attracts enough interest, proponents should
settle on a single consensus suggestion, and Guido will Pronounce on it.

Why allow return at all? Why not force termination to be spelled raise StopIteration?

The mechanics of StopIteration are low-level details, much like the
mechanics of IndexError in Python 2.1: the implementation needs to do
something well-defined under the covers, and Python exposes these
mechanisms for advanced users. That's not an argument for forcing
everyone to work at that level, though. return means "I'm done" in any
kind of function, and that's easy to explain and to use. Note that
return isn't always equivalent to raise StopIteration in try/except
construct, either (see the "Specification: Return" section).

Then why not allow an expression on return too?

Perhaps we will someday. In Icon, return expr means both "I'm done", and
"but I have one final useful value to return too, and this is it". At
the start, and in the absence of compelling uses for return expr, it's
simply cleaner to use yield exclusively for delivering values.

BDFL Pronouncements

Issue

Introduce another new keyword (say, gen or generator) in place of def,
or otherwise alter the syntax, to distinguish generator-functions from
non-generator functions.

Con

In practice (how you think about them), generators are functions, but
with the twist that they're resumable. The mechanics of how they're set
up is a comparatively minor technical issue, and introducing a new
keyword would unhelpfully overemphasize the mechanics of how generators
get started (a vital but tiny part of a generator's life).

Pro

In reality (how you think about them), generator-functions are actually
factory functions that produce generator-iterators as if by magic. In
this respect they're radically different from non-generator functions,
acting more like a constructor than a function, so reusing def is at
best confusing. A yield statement buried in the body is not enough
warning that the semantics are so different.

BDFL

def it stays. No argument on either side is totally convincing, so I
have consulted my language designer's intuition. It tells me that the
syntax proposed in the PEP is exactly right - not too hot, not too cold.
But, like the Oracle at Delphi in Greek mythology, it doesn't tell me
why, so I don't have a rebuttal for the arguments against the PEP
syntax. The best I can come up with (apart from agreeing with the
rebuttals ... already made) is "FUD". If this had been part of the
language from day one, I very much doubt it would have made Andrew
Kuchling's "Python Warts" page.

Reference Implementation

The current implementation, in a preliminary state (no docs, but well
tested and solid), is part of Python's CVS development tree[4]. Using
this requires that you build Python from source.

This was derived from an earlier patch by Neil Schemenauer[5].

Footnotes and References

Copyright

This document has been placed in the public domain.

[1] http://www.stackless.com/

[2] "Iteration Abstraction in Sather" Murer, Omohundro, Stoutamire and
Szyperski http://www.icsi.berkeley.edu/~sather/Publications/toplas.html

[3] http://www.cs.arizona.edu/icon/

[4] To experiment with this implementation, check out Python from CVS
according to the instructions at http://sf.net/cvs/?group_id=5470 Note
that the std test Lib/test/test_generators.py contains many examples,
including all those in this PEP.

[5] http://python.ca/nas/python/generator.diff