PEP: 323 Title: Copyable Iterators Version: $Revision$ Last-Modified:
$Date$ Author: Alex Martelli <aleaxit@gmail.com> Status: Deferred Type:
Standards Track Content-Type: text/x-rst Created: 25-Oct-2003
Python-Version: 2.5 Post-History: 29-Oct-2003

Deferral

This PEP has been deferred. Copyable iterators are a nice idea, but
after four years, no implementation or widespread interest has emerged.

Abstract

This PEP suggests that some iterator types should support shallow copies
of their instances by exposing a __copy__ method which meets some
specific requirements, and indicates how code using an iterator might
exploit such a __copy__ method when present.

Update and Comments

Support for __copy__ was included in Py2.4's itertools.tee().

Adding __copy__ methods to existing iterators will change the behavior
under tee(). Currently, the copied iterators remain tied to the original
iterator. If the original advances, then so do all of the copies. Good
practice is to overwrite the original so that anomalies don't result:
a,b=tee(a). Code that doesn't follow that practice may observe a
semantic change if a __copy__ method is added to an iterator.

Motivation

In Python up to 2.3, most built-in iterator types don't let the user
copy their instances. User-coded iterators that do let their clients
call copy.copy on their instances may, or may not, happen to return, as
a result of the copy, a separate iterator object that may be iterated
upon independently from the original.

Currently, "support" for copy.copy in a user-coded iterator type is
almost invariably "accidental" -- i.e., the standard machinery of the
copy method in Python's standard library's copy module does build and
return a copy. However, the copy will be independently iterable with
respect to the original only if calling .next() on an instance of that
class happens to change instance state solely by rebinding some
attributes to new values, and not by mutating some attributes' existing
values.

For example, an iterator whose "index" state is held as an integer
attribute will probably give usable copies, since (integers being
immutable) .next() presumably just rebinds that attribute. On the other
hand, another iterator whose "index" state is held as a list attribute
will probably mutate the same list object when .next() executes, and
therefore copies of such an iterator will not be iterable separately and
independently from the original.

Given this existing situation, copy.copy(it) on some iterator object
isn't very useful, nor, therefore, is it at all widely used. However,
there are many cases in which being able to get a "snapshot" of an
iterator, as a "bookmark", so as to be able to keep iterating along the
sequence but later iterate again on the same sequence from the bookmark
onwards, is useful. To support such "bookmarking", module itertools, in
2.4, has grown a 'tee' function, to be used as:

    it, bookmark = itertools.tee(it)

The previous value of 'it' must not be used again, which is why this
typical usage idiom rebinds the name. After this call, 'it' and
'bookmark' are independently-iterable iterators on the same underlying
sequence as the original value of 'it': this satisfies application needs
for "iterator copying".

However, when itertools.tee can make no hypotheses about the nature of
the iterator it is passed as an argument, it must save in memory all
items through which one of the two 'teed' iterators, but not yet both,
have stepped. This can be quite costly in terms of memory, if the two
iterators get very far from each other in their stepping; indeed, in
some cases it may be preferable to make a list from the iterator so as
to be able to step repeatedly through the subsequence, or, if that is
too costy in terms of memory, save items to disk, again in order to be
able to iterate through them repeatedly.

This PEP proposes another idea that will, in some important cases, allow
itertools.tee to do its job with minimal cost in terms of memory; user
code may also occasionally be able to exploit the idea in order to
decide whether to copy an iterator, make a list from it, or use an
auxiliary disk file.

The key consideration is that some important iterators, such as those
which built-in function iter builds over sequences, would be
intrinsically easy to copy: just get another reference to the same
sequence, and a copy of the integer index. However, in Python 2.3, those
iterators don't expose the state, and don't support copy.copy.

The purpose of this PEP, therefore, is to have those iterator types
expose a suitable __copy__ method. Similarly, user-coded iterator types
that can provide copies of their instances, suitable for separate and
independent iteration, with limited costs in time and space, should also
expose a suitable __copy__ method. While copy.copy also supports other
ways to let a type control the way its instances are copied, it is
suggested, for simplicity, that iterator types that support copying
always do so by exposing a __copy__ method, and not in the other ways
copy.copy supports.

Having iterators expose a suitable __copy__ when feasible will afford
easy optimization of itertools.tee and similar user code, as in:

    def tee(it):
        it = iter(it)
        try: copier = it.__copy__
        except AttributeError:
            # non-copyable iterator, do all the needed hard work
            # [snipped!]
        else:
            return it, copier()

Note that this function does NOT call "copy.copy(it)", which (even after
this PEP is implemented) might well still "just happen to succeed". for
some iterator type that is implemented as a user-coded class. without
really supplying an adequate "independently iterable" copy object as its
result.

Specification

Any iterator type X may expose a method __copy__ that is callable
without arguments on any instance x of X. The method should be exposed
if and only if the iterator type can provide copyability with reasonably
little computational and memory effort. Furthermore, the new object y
returned by method __copy__ should be a new instance of X that is
iterable independently and separately from x, stepping along the same
"underlying sequence" of items.

For example, suppose a class Iter essentially duplicated the
functionality of the iter builtin for iterating on a sequence:

    class Iter(object):

        def __init__(self, sequence):
            self.sequence = sequence
            self.index = 0

        def __iter__(self):
            return self

        def next(self):
            try: result = self.sequence[self.index]
            except IndexError: raise StopIteration
            self.index += 1
            return result

To make this Iter class compliant with this PEP, the following addition
to the body of class Iter would suffice:

    def __copy__(self):
        result = self.__class__(self.sequence)
        result.index = self.index
        return result

Note that __copy__, in this case, does not even try to copy the
sequence; if the sequence is altered while either or both of the
original and copied iterators are still stepping on it, the iteration
behavior is quite likely to go awry anyway -- it is not __copy__'s
responsibility to change this normal Python behavior for iterators which
iterate on mutable sequences (that might, perhaps, be the specification
for a __deepcopy__ method of iterators, which, however, this PEP does
not deal with).

Consider also a "random iterator", which provides a nonterminating
sequence of results from some method of a random instance, called with
given arguments:

    class RandomIterator(object):

        def __init__(self, bound_method, *args):
            self.call = bound_method
            self.args = args

        def __iter__(self):
            return self

        def next(self):
            return self.call(*self.args)

        def __copy__(self):
            import copy, new
            im_self = copy.copy(self.call.im_self)
            method = new.instancemethod(self.call.im_func, im_self)
            return self.__class__(method, *self.args)

This iterator type is slightly more general than its name implies, as it
supports calls to any bound method (or other callable, but if the
callable is not a bound method, then method __copy__ will fail). But the
use case is for the purpose of generating random streams, as in:

    import random

    def show5(it):
        for i, result in enumerate(it):
            print '%6.3f'%result,
            if i==4: break
        print

    normit = RandomIterator(random.Random().gauss, 0, 1)
    show5(normit)
    copit = normit.__copy__()
    show5(normit)
    show5(copit)

which will display some output such as:

    -0.536  1.936 -1.182 -1.690 -1.184
     0.666 -0.701  1.214  0.348  1.373
     0.666 -0.701  1.214  0.348  1.373

the key point being that the second and third lines are equal, because
the normit and copit iterators will step along the same "underlying
sequence". (As an aside, note that to get a copy of self.call.im_self we
must use copy.copy, NOT try getting at a __copy__ method directly,
because for example instances of random.Random support copying via
__getstate__ and __setstate__, NOT via __copy__; indeed, using copy.copy
is the normal way to get a shallow copy of any object --copyable
iterators are different because of the already-mentioned uncertainty
about the result of copy.copy supporting these "copyable iterator"
specs).

Details

Besides adding to the Python docs a recommendation that user-coded
iterator types support a __copy__ method (if and only if it can be
implemented with small costs in memory and runtime, and produce an
independently-iterable copy of an iterator object), this PEP's
implementation will specifically include the addition of copyability to
the iterators over sequences that built-in iter returns, and also to the
iterators over a dictionary returned by the methods __iter__, iterkeys,
itervalues, and iteritems of built-in type dict.

Iterators produced by generator functions will not be copyable. However,
iterators produced by the new "generator expressions" of Python 2.4 (PEP
289) should be copyable if their underlying iterator[s] are; the strict
limitations on what is possible in a generator expression, compared to
the much vaster generality of a generator, should make that feasible.
Similarly, the iterators produced by the built-in function enumerate,
and certain functions suppiled by module itertools, should be copyable
if the underlying iterators are.

The implementation of this PEP will also include the optimization of the
new itertools.tee function mentioned in the Motivation section.

Rationale

The main use case for (shallow) copying of an iterator is the same as
for the function itertools.tee (new in 2.4). User code will not directly
attempt to copy an iterator, because it would have to deal separately
with uncopyable cases; calling itertools.tee will internally perform the
copy when appropriate, and implicitly fallback to a maximally efficient
non-copying strategy for iterators that are not copyable. (Occasionally,
user code may want more direct control, specifically in order to deal
with non-copyable iterators by other strategies, such as making a list
or saving the sequence to disk).

A tee'd iterator may serve as a "reference point", allowing processing
of a sequence to continue or resume from a known point, while the other
independent iterator can be freely advanced to "explore" a further part
of the sequence as needed. A simple example: a generator function which,
given an iterator of numbers (assumed to be positive), returns a
corresponding iterator, each of whose items is the fraction of the total
corresponding to each corresponding item of the input iterator. The
caller may pass the total as a value, if known in advance; otherwise,
the iterator returned by calling this generator function will first
compute the total.

    def fractions(numbers, total=None):
        if total is None:
            numbers, aux = itertools.tee(numbers)
            total = sum(aux)
        total = float(total)
        for item in numbers:
            yield item / total

The ability to tee the numbers iterator allows this generator to
precompute the total, if needed, without necessarily requiring O(N)
auxiliary memory if the numbers iterator is copyable.

As another example of "iterator bookmarking", consider a stream of
numbers with an occasional string as a "postfix operator" now and then.
By far most frequent such operator is a '+', whereupon we must sum all
previous numbers (since the last previous operator if any, or else since
the start) and yield the result. Sometimes we find a '*' instead, which
is the same except that the previous numbers must instead be multiplied,
not summed.

    def filter_weird_stream(stream):
        it = iter(stream)
        while True:
            it, bookmark = itertools.tee(it)
            total = 0
            for item in it:
                if item=='+':
                    yield total
                    break
                elif item=='*':
                    product = 1
                    for item in bookmark:
                        if item=='*':
                            yield product
                            break
                        else:
                            product *= item
               else:
                   total += item

Similar use cases of itertools.tee can support such tasks as "undo" on a
stream of commands represented by an iterator, "backtracking" on the
parse of a stream of tokens, and so on. (Of course, in each case, one
should also consider simpler possibilities such as saving relevant
portions of the sequence into lists while stepping on the sequence with
just one iterator, depending on the details of one's task).

Here is an example, in pure Python, of how the 'enumerate' built-in
could be extended to support __copy__ if its underlying iterator also
supported __copy__:

    class enumerate(object):

        def __init__(self, it):
            self.it = iter(it)
            self.i = -1

        def __iter__(self):
            return self

        def next(self):
            self.i += 1
            return self.i, self.it.next()

        def __copy__(self):
            result = self.__class__.__new__()
            result.it = self.it.__copy__()
            result.i = self.i
            return result

Here is an example of the kind of "fragility" produced by "accidental
copyability" of an iterator -- the reason why one must NOT use copy.copy
expecting, if it succeeds, to receive as a result an iterator which is
iterable-on independently from the original. Here is an iterator class
that iterates (in preorder) on "trees" which, for simplicity, are just
nested lists -- any item that's a list is treated as a subtree, any
other item as a leaf.

    class ListreeIter(object):

        def __init__(self, tree):
            self.tree = [tree]
            self.indx = [-1]

        def __iter__(self):
            return self

        def next(self):
            if not self.indx:
                raise StopIteration
            self.indx[-1] += 1
            try:
                result = self.tree[-1][self.indx[-1]]
            except IndexError:
                self.tree.pop()
                self.indx.pop()
                return self.next()
            if type(result) is not list:
                return result
            self.tree.append(result)
            self.indx.append(-1)
            return self.next()

Now, for example, the following code:

    import copy
    x = [ [1,2,3], [4, 5, [6, 7, 8], 9], 10, 11, [12] ]

    print 'showing all items:',
    it = ListreeIter(x)
    for i in it:
        print i,
        if i==6: cop = copy.copy(it)
    print

    print 'showing items >6 again:'
    for i in cop: print i,
    print

does NOT work as intended -- the "cop" iterator gets consumed, and
exhausted, step by step as the original "it" iterator is, because the
accidental (rather than deliberate) copying performed by copy.copy
shares, rather than duplicating the "index" list, which is the mutable
attribute it.indx (a list of numerical indices). Thus, this "client
code" of the iterator, which attempts to iterate twice over a portion of
the sequence via a copy.copy on the iterator, is NOT correct.

Some correct solutions include using itertools.tee, i.e., changing the
first for loop into:

    for i in it:
        print i,
        if i==6:
            it, cop = itertools.tee(it)
            break
    for i in it: print i,

(note that we MUST break the loop in two, otherwise we'd still be
looping on the ORIGINAL value of it, which must NOT be used further
after the call to tee!!!); or making a list, i.e.

    for i in it:
        print i,
        if i==6:
            cop = lit = list(it)
            break
    for i in lit: print i,

(again, the loop must be broken in two, since iterator 'it' gets
exhausted by the call list(it)).

Finally, all of these solutions would work if Listiter supplied a
suitable __copy__ method, as this PEP recommends:

    def __copy__(self):
        result = self.__class__.new()
        result.tree = copy.copy(self.tree)
        result.indx = copy.copy(self.indx)
        return result

There is no need to get any "deeper" in the copy, but the two mutable
"index state" attributes must indeed be copied in order to achieve a
"proper" (independently iterable) iterator-copy.

The recommended solution is to have class Listiter supply this __copy__
method AND have client code use itertools.tee (with the
split-in-two-parts loop as shown above). This will make client code
maximally tolerant of different iterator types it might be using AND
achieve good performance for tee'ing of this specific iterator type at
the same time.

References

[1] Discussion on python-dev starting at post:
https://mail.python.org/pipermail/python-dev/2003-October/038969.html

[2] Online documentation for the copy module of the standard library:
https://docs.python.org/release/2.6/library/copy.html

Copyright

This document has been placed in the public domain.