PEP: 3106 Title: Revamping dict.keys(), .values() and .items() Version:
$Revision$ Last-Modified: $Date$ Author: Guido van Rossum Status: Final
Type: Standards Track Content-Type: text/x-rst Created: 19-Dec-2006
Python-Version: 3.0 Post-History:

Abstract

This PEP proposes to change the .keys(), .values() and .items() methods
of the built-in dict type to return a set-like or unordered container
object whose contents are derived from the underlying dictionary rather
than a list which is a copy of the keys, etc.; and to remove the
.iterkeys(), .itervalues() and .iteritems() methods.

The approach is inspired by that taken in the Java Collections
Framework[1].

Introduction

It has long been the plan to change the .keys(), .values() and .items()
methods of the built-in dict type to return a more lightweight object
than a list, and to get rid of .iterkeys(), .itervalues() and
.iteritems(). The idea is that code that currently (in 2.x) reads:

    for k, v in d.iteritems(): ...

should be rewritten as:

    for k, v in d.items(): ...

(and similar for .itervalues() and .iterkeys(), except the latter is
redundant since we can write that loop as for k in d.)

Code that currently reads:

    a = d.keys()    # assume we really want a list here

(etc.) should be rewritten as

  a = list(d.keys())

There are (at least) two ways to accomplish this. The original plan was
to simply let .keys(), .values() and .items() return an iterator, i.e.
exactly what iterkeys(), itervalues() and iteritems() return in Python
2.x. However, the Java Collections Framework[2] suggests that a better
solution is possible: the methods return objects with set behavior (for
.keys() and .items()) or multiset (== bag) behavior (for .values()) that
do not contain copies of the keys, values or items, but rather reference
the underlying dict and pull their values out of the dict as needed.

The advantage of this approach is that one can still write code like
this:

    a = d.items()
    for k, v in a: ...
    # And later, again:
    for k, v in a: ...

Effectively, iter(d.keys()) (etc.) in Python 3.0 will do what
d.iterkeys() (etc.) does in Python 2.x; but in most contexts we don't
have to write the iter() call because it is implied by a for-loop.

The objects returned by the .keys() and .items() methods behave like
sets. The object returned by the values() method behaves like a much
simpler unordered collection -- it cannot be a set because duplicate
values are possible.

Because of the set behavior, it will be possible to check whether two
dicts have the same keys by simply testing:

    if a.keys() == b.keys(): ...

and similarly for .items().

These operations are thread-safe only to the extent that using them in a
thread-unsafe way may cause an exception but will not cause corruption
of the internal representation.

As in Python 2.x, mutating a dict while iterating over it using an
iterator has an undefined effect and will in most cases raise a
RuntimeError exception. (This is similar to the guarantees made by the
Java Collections Framework.)

The objects returned by .keys() and .items() are fully interoperable
with instances of the built-in set and frozenset types; for example:

    set(d.keys()) == d.keys()

is guaranteed to be True (except when d is being modified simultaneously
by another thread).

Specification

I'm using pseudo-code to specify the semantics:

    class dict:

        # Omitting all other dict methods for brevity.
        # The .iterkeys(), .itervalues() and .iteritems() methods
        # will be removed.

        def keys(self):
            return d_keys(self)

        def items(self):
            return d_items(self)

        def values(self):
            return d_values(self)

    class d_keys:

        def __init__(self, d):
            self.__d = d

        def __len__(self):
            return len(self.__d)

        def __contains__(self, key):
            return key in self.__d

        def __iter__(self):
            for key in self.__d:
                yield key

        # The following operations should be implemented to be
        # compatible with sets; this can be done by exploiting
        # the above primitive operations:
        #
        #   <, <=, ==, !=, >=, > (returning a bool)
        #   &, |, ^, - (returning a new, real set object)
        #
        # as well as their method counterparts (.union(), etc.).
        #
        # To specify the semantics, we can specify x == y as:
        #
        #   set(x) == set(y)   if both x and y are d_keys instances
        #   set(x) == y        if x is a d_keys instance
        #   x == set(y)        if y is a d_keys instance
        #
        # and so on for all other operations.

    class d_items:

        def __init__(self, d):
            self.__d = d

        def __len__(self):
            return len(self.__d)

        def __contains__(self, (key, value)):
            return key in self.__d and self.__d[key] == value

        def __iter__(self):
            for key in self.__d:
                yield key, self.__d[key]

        # As well as the set operations mentioned for d_keys above.
        # However the specifications suggested there will not work if
        # the values aren't hashable.  Fortunately, the operations can
        # still be implemented efficiently.  For example, this is how
        # intersection can be specified:

        def __and__(self, other):
            if isinstance(other, (set, frozenset, d_keys)):
                result = set()
                for item in other:
                    if item in self:
                        result.add(item)
                return result
            if not isinstance(other, d_items):
                return NotImplemented
            d = {}
            if len(other) < len(self):
                self, other = other, self
            for item in self:
                if item in other:
                    key, value = item
                    d[key] = value
            return d.items()

        # And here is equality:

        def __eq__(self, other):
            if isinstance(other, (set, frozenset, d_keys)):
                if len(self) != len(other):
                    return False
                for item in other:
                    if item not in self:
                        return False
                return True
            if not isinstance(other, d_items):
                return NotImplemented
            # XXX We could also just compare the underlying dicts...
            if len(self) != len(other):
                return False
            for item in self:
                if item not in other:
                    return False
            return True

        def __ne__(self, other):
            # XXX Perhaps object.__ne__() should be defined this way.
            result = self.__eq__(other)
            if result is not NotImplemented:
                result = not result
            return result

    class d_values:

        def __init__(self, d):
            self.__d = d

        def __len__(self):
            return len(self.__d)

        def __contains__(self, value):
            # This is slow, and it's what "x in y" uses as a fallback
            # if __contains__ is not defined; but I'd rather make it
            # explicit that it is supported.
            for v in self:
                 if v == value:
                     return True
            return False

        def __iter__(self):
            for key in self.__d:
                yield self.__d[key]

        def __eq__(self, other):
            if not isinstance(other, d_values):
                return NotImplemented
            if len(self) != len(other):
                return False
            # XXX Sometimes this could be optimized, but these are the
            # semantics: we can't depend on the values to be hashable
            # or comparable.
            olist = list(other)
            for x in self:
                try:
                    olist.remove(x)
                except ValueError:
                    return False
            assert olist == []
            return True

        def __ne__(self, other):
            result = self.__eq__(other)
            if result is not NotImplemented:
                result = not result
            return result

Notes:

The view objects are not directly mutable, but don't implement
__hash__(); their value can change if the underlying dict is mutated.

The only requirements on the underlying dict are that it implements
__getitem__(), __contains__(), __iter__(), and __len__().

We don't implement .copy() -- the presence of a .copy() method suggests
that the copy has the same type as the original, but that's not feasible
without copying the underlying dict. If you want a copy of a specific
type, like list or set, you can just pass one of the above to the list()
or set() constructor.

The specification implies that the order in which items are returned by
.keys(), .values() and .items() is the same (just as it was in Python
2.x), because the order is all derived from the dict iterator (which is
presumably arbitrary but stable as long as a dict isn't modified). This
can be expressed by the following invariant:

    list(d.items()) == list(zip(d.keys(), d.values()))

Open Issues

Do we need more of a motivation? I would think that being able to do set
operations on keys and items without having to copy them should speak
for itself.

I've left out the implementation of various set operations. These could
still present small surprises.

It would be okay if multiple calls to d.keys() (etc.) returned the same
object, since the object's only state is the dict to which it refers. Is
this worth having extra slots in the dict object for? Should that be a
weak reference or should the d_keys (etc.) object live forever once
created? Strawman: probably not worth the extra slots in every dict.

Should d_keys, d_values and d_items have a public instance variable or
method through which one can retrieve the underlying dict? Strawman: yes
(but what should it be called?).

I'm soliciting better names than d_keys, d_values and d_items. These
classes could be public so that their implementations could be reused by
the .keys(), .values() and .items() methods of other mappings. Or should
they?

Should the d_keys, d_values and d_items classes be reusable? Strawman:
yes.

Should they be subclassable? Strawman: yes (but see below).

A particularly nasty issue is whether operations that are specified in
terms of other operations (e.g. .discard()) must really be implemented
in terms of those other operations; this may appear irrelevant but it
becomes relevant if these classes are ever subclassed. Historically,
Python has a really poor track record of specifying the semantics of
highly optimized built-in types clearly in such cases; my strawman is to
continue that trend. Subclassing may still be useful to add new methods,
for example.

I'll leave the decisions (especially about naming) up to whoever submits
a working implementation.

References

[1] Java Collections Framework
http://java.sun.com/docs/books/tutorial/collections/index.html

[2] Java Collections Framework
http://java.sun.com/docs/books/tutorial/collections/index.html