PEP: 289 Title: Generator Expressions Author: Raymond Hettinger
<python@rcn.com> Status: Final Type: Standards Track Content-Type:
text/x-rst Created: 30-Jan-2002 Python-Version: 2.4 Post-History:
22-Oct-2003

Abstract

This PEP introduces generator expressions as a high performance, memory
efficient generalization of list comprehensions PEP 202 and generators
PEP 255.

Rationale

Experience with list comprehensions has shown their widespread utility
throughout Python. However, many of the use cases do not need to have a
full list created in memory. Instead, they only need to iterate over the
elements one at a time.

For instance, the following summation code will build a full list of
squares in memory, iterate over those values, and, when the reference is
no longer needed, delete the list:

    sum([x*x for x in range(10)])

Memory is conserved by using a generator expression instead:

    sum(x*x for x in range(10))

Similar benefits are conferred on constructors for container objects:

    s = set(word  for line in page  for word in line.split())
    d = dict( (k, func(k)) for k in keylist)

Generator expressions are especially useful with functions like sum(),
min(), and max() that reduce an iterable input to a single value:

    max(len(line)  for line in file  if line.strip())

Generator expressions also address some examples of functionals coded
with lambda:

    reduce(lambda s, a: s + a.myattr, data, 0)
    reduce(lambda s, a: s + a[3], data, 0)

These simplify to:

    sum(a.myattr for a in data)
    sum(a[3] for a in data)

List comprehensions greatly reduced the need for filter() and map().
Likewise, generator expressions are expected to minimize the need for
itertools.ifilter() and itertools.imap(). In contrast, the utility of
other itertools will be enhanced by generator expressions:

    dotproduct = sum(x*y for x,y in itertools.izip(x_vector, y_vector))

Having a syntax similar to list comprehensions also makes it easy to
convert existing code into a generator expression when scaling up
application.

Early timings showed that generators had a significant performance
advantage over list comprehensions. However, the latter were highly
optimized for Py2.4 and now the performance is roughly comparable for
small to mid-sized data sets. As the data volumes grow larger, generator
expressions tend to perform better because they do not exhaust cache
memory and they allow Python to re-use objects between iterations.

BDFL Pronouncements

This PEP is ACCEPTED for Py2.4.

The Details

(None of this is exact enough in the eye of a reader from Mars, but I
hope the examples convey the intention well enough for a discussion in
c.l.py. The Python Reference Manual should contain a 100% exact semantic
and syntactic specification.)

1.  The semantics of a generator expression are equivalent to creating
    an anonymous generator function and calling it. For example:

        g = (x**2 for x in range(10))
        print g.next()

    is equivalent to:

        def __gen(exp):
            for x in exp:
                yield x**2
        g = __gen(iter(range(10)))
        print g.next()

    Only the outermost for-expression is evaluated immediately, the
    other expressions are deferred until the generator is run:

        g = (tgtexp  for var1 in exp1 if exp2 for var2 in exp3 if exp4)

    is equivalent to:

        def __gen(bound_exp):
            for var1 in bound_exp:
                if exp2:
                    for var2 in exp3:
                        if exp4:
                            yield tgtexp
        g = __gen(iter(exp1))
        del __gen

2.  The syntax requires that a generator expression always needs to be
    directly inside a set of parentheses and cannot have a comma on
    either side. With reference to the file Grammar/Grammar in CVS, two
    rules change:

    a)  The rule:

            atom: '(' [testlist] ')'

        changes to:

            atom: '(' [testlist_gexp] ')'

        where testlist_gexp is almost the same as listmaker, but only
        allows a single test after 'for' ... 'in':

            testlist_gexp: test ( gen_for | (',' test)* [','] )

    b)  The rule for arglist needs similar changes.

    This means that you can write:

        sum(x**2 for x in range(10))

    but you would have to write:

        reduce(operator.add, (x**2 for x in range(10)))

    and also:

        g = (x**2 for x in range(10))

    i.e. if a function call has a single positional argument, it can be
    a generator expression without extra parentheses, but in all other
    cases you have to parenthesize it.

    The exact details were checked in to Grammar/Grammar version 1.49.

3.  The loop variable (if it is a simple variable or a tuple of simple
    variables) is not exposed to the surrounding function. This
    facilitates the implementation and makes typical use cases more
    reliable. In some future version of Python, list comprehensions will
    also hide the induction variable from the surrounding code (and, in
    Py2.4, warnings will be issued for code accessing the induction
    variable).

    For example:

        x = "hello"
        y = list(x for x in "abc")
        print x    # prints "hello", not "c"

4.  List comprehensions will remain unchanged. For example:

        [x for x in S]    # This is a list comprehension.
        [(x for x in S)]  # This is a list containing one generator
                          # expression.

    Unfortunately, there is currently a slight syntactic difference. The
    expression:

        [x for x in 1, 2, 3]

    is legal, meaning:

        [x for x in (1, 2, 3)]

    But generator expressions will not allow the former version:

        (x for x in 1, 2, 3)

    is illegal.

    The former list comprehension syntax will become illegal in Python
    3.0, and should be deprecated in Python 2.4 and beyond.

    List comprehensions also "leak" their loop variable into the
    surrounding scope. This will also change in Python 3.0, so that the
    semantic definition of a list comprehension in Python 3.0 will be
    equivalent to list(<generator expression>). Python 2.4 and beyond
    should issue a deprecation warning if a list comprehension's loop
    variable has the same name as a variable used in the immediately
    surrounding scope.

Early Binding versus Late Binding

After much discussion, it was decided that the first (outermost)
for-expression should be evaluated immediately and that the remaining
expressions be evaluated when the generator is executed.

Asked to summarize the reasoning for binding the first expression, Guido
offered[1]:

    Consider sum(x for x in foo()). Now suppose there's a bug in foo()
    that raises an exception, and a bug in sum() that raises an
    exception before it starts iterating over its argument. Which
    exception would you expect to see? I'd be surprised if the one in
    sum() was raised rather the one in foo(), since the call to foo()
    is part of the argument to sum(), and I expect arguments to be
    processed before the function is called.

    OTOH, in sum(bar(x) for x in foo()), where sum() and foo()
    are bugfree, but bar() raises an exception, we have no choice but
    to delay the call to bar() until sum() starts iterating -- that's
    part of the contract of generators. (They do nothing until their
    next() method is first called.)

Various use cases were proposed for binding all free variables when the
generator is defined. And some proponents felt that the resulting
expressions would be easier to understand and debug if bound
immediately.

However, Python takes a late binding approach to lambda expressions and
has no precedent for automatic, early binding. It was felt that
introducing a new paradigm would unnecessarily introduce complexity.

After exploring many possibilities, a consensus emerged that binding
issues were hard to understand and that users should be strongly
encouraged to use generator expressions inside functions that consume
their arguments immediately. For more complex applications, full
generator definitions are always superior in terms of being obvious
about scope, lifetime, and binding[2].

Reduction Functions

The utility of generator expressions is greatly enhanced when combined
with reduction functions like sum(), min(), and max(). The heapq module
in Python 2.4 includes two new reduction functions: nlargest() and
nsmallest(). Both work well with generator expressions and keep no more
than n items in memory at one time.

Acknowledgements

-   Raymond Hettinger first proposed the idea of "generator
    comprehensions" in January 2002.
-   Peter Norvig resurrected the discussion in his proposal for
    Accumulation Displays.
-   Alex Martelli provided critical measurements that proved the
    performance benefits of generator expressions. He also provided
    strong arguments that they were a desirable thing to have.
-   Phillip Eby suggested "iterator expressions" as the name.
-   Subsequently, Tim Peters suggested the name "generator expressions".
-   Armin Rigo, Tim Peters, Guido van Rossum, Samuele Pedroni, Hye-Shik
    Chang and Raymond Hettinger teased out the issues surrounding early
    versus late binding[3].
-   Jiwon Seo single-handedly implemented various versions of the
    proposal including the final version loaded into CVS. Along the way,
    there were periodic code reviews by Hye-Shik Chang and Raymond
    Hettinger. Guido van Rossum made the key design decisions after
    comments from Armin Rigo and newsgroup discussions. Raymond
    Hettinger provided the test suite, documentation, tutorial, and
    examples[4].

References

Copyright

This document has been placed in the public domain.

[1] Discussion over the relative merits of early versus late binding
https://mail.python.org/pipermail/python-dev/2004-April/044555.html

[2] Patch discussion and alternative patches on Source Forge
https://bugs.python.org/issue872326

[3] Discussion over the relative merits of early versus late binding
https://mail.python.org/pipermail/python-dev/2004-April/044555.html

[4] Patch discussion and alternative patches on Source Forge
https://bugs.python.org/issue872326