PEP: 234 Title: Iterators Author: Ka-Ping Yee <ping@zesty.ca>, Guido van
Rossum <guido@python.org> Status: Final Type: Standards Track
Content-Type: text/x-rst Created: 30-Jan-2001 Python-Version: 2.1
Post-History: 30-Apr-2001

Abstract

This document proposes an iteration interface that objects can provide
to control the behaviour of for loops. Looping is customized by
providing a method that produces an iterator object. The iterator
provides a get next value operation that produces the next item in the
sequence each time it is called, raising an exception when no more items
are available.

In addition, specific iterators over the keys of a dictionary and over
the lines of a file are proposed, and a proposal is made to allow
spelling dict.has_key(key) as key in dict.

Note: this is an almost complete rewrite of this PEP by the second
author, describing the actual implementation checked into the trunk of
the Python 2.2 CVS tree. It is still open for discussion. Some of the
more esoteric proposals in the original version of this PEP have been
withdrawn for now; these may be the subject of a separate PEP in the
future.

C API Specification

A new exception is defined, StopIteration, which can be used to signal
the end of an iteration.

A new slot named tp_iter for requesting an iterator is added to the type
object structure. This should be a function of one PyObject * argument
returning a PyObject *, or NULL. To use this slot, a new C API function
PyObject_GetIter() is added, with the same signature as the tp_iter slot
function.

Another new slot, named tp_iternext, is added to the type structure, for
obtaining the next value in the iteration. To use this slot, a new C API
function PyIter_Next() is added. The signature for both the slot and the
API function is as follows, although the NULL return conditions differ:
the argument is a PyObject * and so is the return value. When the return
value is non-NULL, it is the next value in the iteration. When it is
NULL, then for the tp_iternext slot there are three possibilities:

-   No exception is set; this implies the end of the iteration.
-   The StopIteration exception (or a derived exception class) is set;
    this implies the end of the iteration.
-   Some other exception is set; this means that an error occurred that
    should be propagated normally.

The higher-level PyIter_Next() function clears the StopIteration
exception (or derived exception) when it occurs, so its NULL return
conditions are simpler:

-   No exception is set; this means iteration has ended.
-   Some exception is set; this means an error occurred, and should be
    propagated normally.

Iterators implemented in C should not implement a next() method with
similar semantics as the tp_iternext slot! When the type's dictionary is
initialized (by PyType_Ready()), the presence of a tp_iternext slot
causes a method next() wrapping that slot to be added to the type's
tp_dict. (Exception: if the type doesn't use PyObject_GenericGetAttr()
to access instance attributes, the next() method in the type's tp_dict
may not be seen.) (Due to a misunderstanding in the original text of
this PEP, in Python 2.2, all iterator types implemented a next() method
that was overridden by the wrapper; this has been fixed in Python 2.3.)

To ensure binary backwards compatibility, a new flag
Py_TPFLAGS_HAVE_ITER is added to the set of flags in the tp_flags field,
and to the default flags macro. This flag must be tested before
accessing the tp_iter or tp_iternext slots. The macro PyIter_Check()
tests whether an object has the appropriate flag set and has a non-NULL
tp_iternext slot. There is no such macro for the tp_iter slot (since the
only place where this slot is referenced should be PyObject_GetIter(),
and this can check for the Py_TPFLAGS_HAVE_ITER flag directly).

(Note: the tp_iter slot can be present on any object; the tp_iternext
slot should only be present on objects that act as iterators.)

For backwards compatibility, the PyObject_GetIter() function implements
fallback semantics when its argument is a sequence that does not
implement a tp_iter function: a lightweight sequence iterator object is
constructed in that case which iterates over the items of the sequence
in the natural order.

The Python bytecode generated for for loops is changed to use new
opcodes, GET_ITER and FOR_ITER, that use the iterator protocol rather
than the sequence protocol to get the next value for the loop variable.
This makes it possible to use a for loop to loop over non-sequence
objects that support the tp_iter slot. Other places where the
interpreter loops over the values of a sequence should also be changed
to use iterators.

Iterators ought to implement the tp_iter slot as returning a reference
to themselves; this is needed to make it possible to use an iterator (as
opposed to a sequence) in a for loop.

Iterator implementations (in C or in Python) should guarantee that once
the iterator has signalled its exhaustion, subsequent calls to
tp_iternext or to the next() method will continue to do so. It is not
specified whether an iterator should enter the exhausted state when an
exception (other than StopIteration) is raised. Note that Python cannot
guarantee that user-defined or 3rd party iterators implement this
requirement correctly.

Python API Specification

The StopIteration exception is made visible as one of the standard
exceptions. It is derived from Exception.

A new built-in function is defined, iter(), which can be called in two
ways:

-   iter(obj) calls PyObject_GetIter(obj).
-   iter(callable, sentinel) returns a special kind of iterator that
    calls the callable to produce a new value, and compares the return
    value to the sentinel value. If the return value equals the
    sentinel, this signals the end of the iteration and StopIteration is
    raised rather than returning normal; if the return value does not
    equal the sentinel, it is returned as the next value from the
    iterator. If the callable raises an exception, this is propagated
    normally; in particular, the function is allowed to raise
    StopIteration as an alternative way to end the iteration. (This
    functionality is available from the C API as
    PyCallIter_New(callable, sentinel).)

Iterator objects returned by either form of iter() have a next() method.
This method either returns the next value in the iteration, or raises
StopIteration (or a derived exception class) to signal the end of the
iteration. Any other exception should be considered to signify an error
and should be propagated normally, not taken to mean the end of the
iteration.

Classes can define how they are iterated over by defining an __iter__()
method; this should take no additional arguments and return a valid
iterator object. A class that wants to be an iterator should implement
two methods: a next() method that behaves as described above, and an
__iter__() method that returns self.

The two methods correspond to two distinct protocols:

1.  An object can be iterated over with for if it implements __iter__()
    or __getitem__().
2.  An object can function as an iterator if it implements next().

Container-like objects usually support protocol 1. Iterators are
currently required to support both protocols. The semantics of iteration
come only from protocol 2; protocol 1 is present to make iterators
behave like sequences; in particular so that code receiving an iterator
can use a for-loop over the iterator.

Dictionary Iterators

-   Dictionaries implement a sq_contains slot that implements the same
    test as the has_key() method. This means that we can write

        if k in dict: ...

    which is equivalent to

        if dict.has_key(k): ...

-   Dictionaries implement a tp_iter slot that returns an efficient
    iterator that iterates over the keys of the dictionary. During such
    an iteration, the dictionary should not be modified, except that
    setting the value for an existing key is allowed (deletions or
    additions are not, nor is the update() method). This means that we
    can write

        for k in dict: ...

    which is equivalent to, but much faster than

        for k in dict.keys(): ...

    as long as the restriction on modifications to the dictionary
    (either by the loop or by another thread) are not violated.

-   Add methods to dictionaries that return different kinds of iterators
    explicitly:

        for key in dict.iterkeys(): ...

        for value in dict.itervalues(): ...

        for key, value in dict.iteritems(): ...

    This means that for x in dict is shorthand for
    for x in dict.iterkeys().

Other mappings, if they support iterators at all, should also iterate
over the keys. However, this should not be taken as an absolute rule;
specific applications may have different requirements.

File Iterators

The following proposal is useful because it provides us with a good
answer to the complaint that the common idiom to iterate over the lines
of a file is ugly and slow.

-   Files implement a tp_iter slot that is equivalent to
    iter(f.readline, ""). This means that we can write

        for line in file:
            ...

    as a shorthand for

        for line in iter(file.readline, ""):
            ...

    which is equivalent to, but faster than

        while 1:
            line = file.readline()
            if not line:
                break
            ...

This also shows that some iterators are destructive: they consume all
the values and a second iterator cannot easily be created that iterates
independently over the same values. You could open the file for a second
time, or seek() to the beginning, but these solutions don't work for all
file types, e.g. they don't work when the open file object really
represents a pipe or a stream socket.

Because the file iterator uses an internal buffer, mixing this with
other file operations (e.g. file.readline()) doesn't work right. Also,
the following code:

    for line in file:
        if line == "\n":
            break
    for line in file:
       print line,

doesn't work as you might expect, because the iterator created by the
second for-loop doesn't take the buffer read-ahead by the first for-loop
into account. A correct way to write this is:

    it = iter(file)
    for line in it:
        if line == "\n":
            break
    for line in it:
        print line,

(The rationale for these restrictions are that for line in file ought to
become the recommended, standard way to iterate over the lines of a
file, and this should be as fast as can be. The iterator version is
considerable faster than calling readline(), due to the internal buffer
in the iterator.)

Rationale

If all the parts of the proposal are included, this addresses many
concerns in a consistent and flexible fashion. Among its chief virtues
are the following four -- no, five -- no, six -- points:

1.  It provides an extensible iterator interface.
2.  It allows performance enhancements to list iteration.
3.  It allows big performance enhancements to dictionary iteration.
4.  It allows one to provide an interface for just iteration without
    pretending to provide random access to elements.
5.  It is backward-compatible with all existing user-defined classes and
    extension objects that emulate sequences and mappings, even mappings
    that only implement a subset of {__getitem__, keys, values, items}.
6.  It makes code iterating over non-sequence collections more concise
    and readable.

Resolved Issues

The following topics have been decided by consensus or BDFL
pronouncement.

-   Two alternative spellings for next() have been proposed but
    rejected: __next__(), because it corresponds to a type object slot
    (tp_iternext); and __call__(), because this is the only operation.

    Arguments against __next__(): while many iterators are used in for
    loops, it is expected that user code will also call next() directly,
    so having to write __next__() is ugly; also, a possible extension of
    the protocol would be to allow for prev(), current() and reset()
    operations; surely we don't want to use __prev__(), __current__(),
    __reset__().

    Arguments against __call__() (the original proposal): taken out of
    context, x() is not very readable, while x.next() is clear; there's
    a danger that every special-purpose object wants to use __call__()
    for its most common operation, causing more confusion than clarity.

    (In retrospect, it might have been better to go for __next__() and
    have a new built-in, next(it), which calls it.__next__(). But alas,
    it's too late; this has been deployed in Python 2.2 since December
    2001.)

-   Some folks have requested the ability to restart an iterator. This
    should be dealt with by calling iter() on a sequence repeatedly, not
    by the iterator protocol itself. (See also requested extensions
    below.)

-   It has been questioned whether an exception to signal the end of the
    iteration isn't too expensive. Several alternatives for the
    StopIteration exception have been proposed: a special value End to
    signal the end, a function end() to test whether the iterator is
    finished, even reusing the IndexError exception.

    -   A special value has the problem that if a sequence ever contains
        that special value, a loop over that sequence will end
        prematurely without any warning. If the experience with
        null-terminated C strings hasn't taught us the problems this can
        cause, imagine the trouble a Python introspection tool would
        have iterating over a list of all built-in names, assuming that
        the special End value was a built-in name!
    -   Calling an end() function would require two calls per iteration.
        Two calls is much more expensive than one call plus a test for
        an exception. Especially the time-critical for loop can test
        very cheaply for an exception.
    -   Reusing IndexError can cause confusion because it can be a
        genuine error, which would be masked by ending the loop
        prematurely.

-   Some have asked for a standard iterator type. Presumably all
    iterators would have to be derived from this type. But this is not
    the Python way: dictionaries are mappings because they support
    __getitem__() and a handful other operations, not because they are
    derived from an abstract mapping type.

-   Regarding if key in dict: there is no doubt that the dict.has_key(x)
    interpretation of x in dict is by far the most useful
    interpretation, probably the only useful one. There has been
    resistance against this because x in list checks whether x is
    present among the values, while the proposal makes x in dict check
    whether x is present among the keys. Given that the symmetry between
    lists and dictionaries is very weak, this argument does not have
    much weight.

-   The name iter() is an abbreviation. Alternatives proposed include
    iterate(), traverse(), but these appear too long. Python has a
    history of using abbrs for common builtins, e.g. repr(), str(),
    len().

    Resolution: iter() it is.

-   Using the same name for two different operations (getting an
    iterator from an object and making an iterator for a function with a
    sentinel value) is somewhat ugly. I haven't seen a better name for
    the second operation though, and since they both return an iterator,
    it's easy to remember.

    Resolution: the builtin iter() takes an optional argument, which is
    the sentinel to look for.

-   Once a particular iterator object has raised StopIteration, will it
    also raise StopIteration on all subsequent next() calls? Some say
    that it would be useful to require this, others say that it is
    useful to leave this open to individual iterators. Note that this
    may require an additional state bit for some iterator
    implementations (e.g. function-wrapping iterators).

    Resolution: once StopIteration is raised, calling it.next()
    continues to raise StopIteration.

    Note: this was in fact not implemented in Python 2.2; there are many
    cases where an iterator's next() method can raise StopIteration on
    one call but not on the next. This has been remedied in Python 2.3.

-   It has been proposed that a file object should be its own iterator,
    with a next() method returning the next line. This has certain
    advantages, and makes it even clearer that this iterator is
    destructive. The disadvantage is that this would make it even more
    painful to implement the "sticky StopIteration" feature proposed in
    the previous bullet.

    Resolution: tentatively rejected (though there are still people
    arguing for this).

-   Some folks have requested extensions of the iterator protocol, e.g.
    prev() to get the previous item, current() to get the current item
    again, finished() to test whether the iterator is finished, and
    maybe even others, like rewind(), __len__(), position().

    While some of these are useful, many of these cannot easily be
    implemented for all iterator types without adding arbitrary
    buffering, and sometimes they can't be implemented at all (or not
    reasonably). E.g. anything to do with reversing directions can't be
    done when iterating over a file or function. Maybe a separate PEP
    can be drafted to standardize the names for such operations when
    they are implementable.

    Resolution: rejected.

-   There has been a long discussion about whether

        for x in dict: ...

    should assign x the successive keys, values, or items of the
    dictionary. The symmetry between if x in y and for x in y suggests
    that it should iterate over keys. This symmetry has been observed by
    many independently and has even been used to "explain" one using the
    other. This is because for sequences, if x in y iterates over y
    comparing the iterated values to x. If we adopt both of the above
    proposals, this will also hold for dictionaries.

    The argument against making for x in dict iterate over the keys
    comes mostly from a practicality point of view: scans of the
    standard library show that there are about as many uses of
    for x in dict.items() as there are of for x in dict.keys(), with the
    items() version having a small majority. Presumably many of the
    loops using keys() use the corresponding value anyway, by writing
    dict[x], so (the argument goes) by making both the key and value
    available, we could support the largest number of cases. While this
    is true, I (Guido) find the correspondence between for x in dict and
    if x in dict too compelling to break, and there's not much overhead
    in having to write dict[x] to explicitly get the value.

    For fast iteration over items, use
    for key, value in dict.iteritems(). I've timed the difference
    between

        for key in dict: dict[key]

    and

        for key, value in dict.iteritems(): pass

    and found that the latter is only about 7% faster.

    Resolution: By BDFL pronouncement, for x in dict iterates over the
    keys, and dictionaries have iteritems(), iterkeys(), and
    itervalues() to return the different flavors of dictionary
    iterators.

Mailing Lists

The iterator protocol has been discussed extensively in a mailing list
on SourceForge:

  http://lists.sourceforge.net/lists/listinfo/python-iterators

Initially, some of the discussion was carried out at Yahoo; archives are
still accessible:

  http://groups.yahoo.com/group/python-iter

Copyright

This document is in the public domain.