PEP: 618 Title: Add Optional Length-Checking To zip Version: $Revision$
Last-Modified: $Date$ Author: Brandt Bucher <brandt@python.org> Sponsor:
Antoine Pitrou <antoine@python.org> BDFL-Delegate: Guido van Rossum
<guido@python.org> Status: Final Type: Standards Track Content-Type:
text/x-rst Created: 01-May-2020 Python-Version: 3.10 Post-History:
01-May-2020, 10-May-2020, 16-Jun-2020 Resolution:
https://mail.python.org/archives/list/python-dev@python.org/message/NLWB7FVJGMBBMCF4P3ZKUIE53JPDOWJ3

Abstract

This PEP proposes adding an optional strict boolean keyword parameter to
the built-in zip. When enabled, a ValueError is raised if one of the
arguments is exhausted before the others.

Motivation

It is clear from the author's personal experience and a survey of the
standard library that much (if not most) zip usage involves iterables
that must be of equal length. Sometimes this invariant is proven true
from the context of the surrounding code, but often the data being
zipped is passed from the caller, sourced separately, or generated in
some fashion. In any of these cases, the default behavior of zip means
that faulty refactoring or logic errors could easily result in silently
losing data. These bugs are not only difficult to diagnose, but
difficult to even detect at all.

It is easy to come up with simple cases where this could be a problem.
For example, the following code may work fine when items is a sequence,
but silently start producing shortened, mismatched results if items is
refactored by the caller to be a consumable iterator:

    def apply_calculations(items):
        transformed = transform(items)
        for i, t in zip(items, transformed):
            yield calculate(i, t)

There are several other ways in which zip is commonly used. Idiomatic
tricks are especially susceptible, because they are often employed by
users who lack a complete understanding of how the code works. One
example is unpacking into zip to lazily "unzip" or "transpose" nested
iterables:

    >>> x = [[1, 2, 3], ["one" "two" "three"]]
    >>> xt = list(zip(*x))

Another is "chunking" data into equal-sized groups:

    >>> n = 3
    >>> x = range(n ** 2),
    >>> xn = list(zip(*[iter(x)] * n))

In the first case, non-rectangular data is usually a logic error. In the
second case, data with a length that is not a multiple of n is often an
error as well. However, both of these idioms will silently omit the
tail-end items of malformed input.

Perhaps most convincingly, the use of zip in the standard-library ast
module created a bug in literal_eval which silently dropped parts of
malformed nodes:

    >>> from ast import Constant, Dict, literal_eval
    >>> nasty_dict = Dict(keys=[Constant(None)], values=[])
    >>> literal_eval(nasty_dict)  # Like eval("{None: }")
    {}

In fact, the author has counted dozens of other call sites in Python's
standard library and tooling where it would be appropriate to enable
this new feature immediately.

Rationale

Some critics assert that constant boolean switches are a "code-smell",
or go against Python's design philosophy. However, Python currently
contains several examples of boolean keyword parameters on built-in
functions which are typically called with compile-time constants:

-   compile(..., dont_inherit=True)
-   open(..., closefd=False)
-   print(..., flush=True)
-   sorted(..., reverse=True)

Many more exist in the standard library.

The idea and name for this new parameter were originally proposed by Ram
Rachum. The thread received over 100 replies, with the alternative
"equal" receiving a similar amount of support.

The author does not have a strong preference between the two choices,
though "equal equals" is a bit awkward in prose. It may also (wrongly)
imply some notion of "equality" between the zipped items:

    >>> z = zip([2.0, 4.0, 6.0], [2, 4, 8], equal=True)

Specification

When the built-in zip is called with the keyword-only argument
strict=True, the resulting iterator will raise a ValueError if the
arguments are exhausted at differing lengths. This error will occur at
the point when iteration would normally stop today.

Backward Compatibility

This change is fully backward-compatible. zip currently takes no keyword
arguments, and the "non-strict" default behavior when strict is omitted
remains unchanged.

Reference Implementation

The author has drafted a C implementation.

An approximate Python translation is:

    def zip(*iterables, strict=False):
        if not iterables:
            return
        iterators = tuple(iter(iterable) for iterable in iterables)
        try:
            while True:
                items = []
                for iterator in iterators:
                    items.append(next(iterator))
                yield tuple(items)
        except StopIteration:
            if not strict:
                return
        if items:
            i = len(items)
            plural = " " if i == 1 else "s 1-"
            msg = f"zip() argument {i+1} is shorter than argument{plural}{i}"
            raise ValueError(msg)
        sentinel = object()
        for i, iterator in enumerate(iterators[1:], 1):
            if next(iterator, sentinel) is not sentinel:
                plural = " " if i == 1 else "s 1-"
                msg = f"zip() argument {i+1} is longer than argument{plural}{i}"
                raise ValueError(msg)

Rejected Ideas

Add itertools.zip_strict

This is the alternative with the most support on the Python-Ideas
mailing list, so it deserves to be discussed in some detail here. It
does not have any disqualifying flaws, and could work well enough as a
substitute if this PEP is rejected.

With that in mind, this section aims to outline why adding an optional
parameter to zip is a smaller change that ultimately does a better job
of solving the problems motivating this PEP.

Precedent

It seems that a great deal of the motivation driving this alternative is
that zip_longest already exists in itertools. However, zip_longest is in
many ways a much more complicated, specialized utility: it takes on the
responsibility of filling in missing values, a job neither of the other
variants needs to concern themselves with.

If both zip and zip_longest lived alongside each other in itertools or
as builtins, then adding zip_strict in the same location would indeed be
a much stronger argument. However, the new "strict" variant is
conceptually much closer to zip in interface and behavior than
zip_longest, while still not meeting the high bar of being its own
builtin. Given this situation, it seems most natural for zip to grow
this new option in-place.

Usability

If zip is capable of preventing this class of bug, it becomes much
simpler for users to enable the check at call sites with this property.
Compare this with importing a drop-in replacement for a built-in
utility, which feels somewhat heavy just to check a tricky condition
that should "always" be true.

Some have also argued that a new function buried in the standard library
is somehow more "discoverable" than a keyword parameter on the built-in
itself. The author does not agree with this assessment.

Maintenance Cost

While implementation should only be a secondary concern when making
usability improvements, it is important to recognize that adding a new
utility is significantly more complicated than modifying an existing
one. The CPython implementation accompanying this PEP is simple and has
no measurable performance impact on default zip behavior, while adding
an entirely new utility to itertools would require either:

-   Duplicating much of the existing zip logic, as zip_longest already
    does.
-   Significantly refactoring either zip, zip_longest, or both to share
    a common or inherited implementation (which may impact performance).

Add Several "Modes" To Switch Between

This option only makes more sense than a binary flag if we anticipate
having three or more modes. The "obvious" three choices for these
enumerated or constant modes would be "shortest" (the current zip
behavior), "strict" (the proposed behavior), and "longest" (the
itertools.zip_longest behavior).

However, it doesn't seem like adding behaviors other than the current
default and the proposed "strict" mode is worth the additional
complexity. The clearest candidate, "longest", would require a new
fillvalue parameter (which is meaningless for both other modes). This
mode is also already handled perfectly by itertools.zip_longest, and
adding it would create two ways of doing the same thing. It's not clear
which would be the "obvious" choice: the mode parameter on the built-in
zip, or the long-lived namesake utility in itertools.

Add A Method Or Alternate Constructor To The zip Type

Consider the following two options, which have both been proposed:

    >>> zm = zip(*iters).strict()
    >>> zd = zip.strict(*iters)

It's not obvious which one will succeed, or how the other will fail. If
zip.strict is implemented as a method, zm will succeed, but zd will fail
in one of several confusing ways:

-   Yield results that aren't wrapped in a tuple (if iters contains just
    one item, a zip iterator).
-   Raise a TypeError for an incorrect argument type (if iters contains
    just one item, not a zip iterator).
-   Raise a TypeError for an incorrect number of arguments (otherwise).

If zip.strict is implemented as a classmethod or staticmethod, zd will
succeed, and zm will silently yield nothing (which is the problem we are
trying to avoid in the first place).

This proposal is further complicated by the fact that CPython's actual
zip type is currently an undocumented implementation detail. This means
that choosing one of the above behaviors will effectively "lock in" the
current implementation (or at least require it to be emulated) going
forward.

Change The Default Behavior Of zip

There is nothing "wrong" with the default behavior of zip, since there
are many cases where it is indeed the correct way to handle
unequally-sized inputs. It's extremely useful, for example, when dealing
with infinite iterators.

itertools.zip_longest already exists to service those cases where the
"extra" tail-end data is still needed.

Accept A Callback To Handle Remaining Items

While able to do basically anything a user could need, this solution
makes handling the more common cases (like rejecting mismatched lengths)
unnecessarily complicated and non-obvious.

Raise An AssertionError

There are no built-in functions or types that raise an AssertionError as
part of their API. Further, the official documentation simply reads (in
its entirety):

  Raised when an assert statement fails.

Since this feature has nothing to do with Python's assert statement,
raising an AssertionError here would be inappropriate. Users desiring a
check that is disabled in optimized mode (like an assert statement) can
use strict=__debug__ instead.

Add A Similar Feature to map

This PEP does not propose any changes to map, since the use of map with
multiple iterable arguments is quite rare. However, this PEP's ruling
shall serve as precedent such a future discussion (should it occur).

If rejected, the feature is realistically not worth pursuing. If
accepted, such a change to map should not require its own PEP (though,
like all enhancements, its usefulness should be carefully considered).
For consistency, it should follow same API and semantics debated here
for zip.

Do Nothing

This option is perhaps the least attractive.

Silently truncated data is a particularly nasty class of bug, and
hand-writing a robust solution that gets this right isn't trivial. The
real-world motivating examples from Python's own standard library are
evidence that it's very easy to fall into the sort of trap that this
feature aims to avoid.

References

Examples

Note

This listing is not exhaustive.

-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/_pydecimal.py#L3394
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/_pydecimal.py#L3418
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/_pydecimal.py#L3435
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L94-L95
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L1184
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L1275
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L1363
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L1391
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/copy.py#L217
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/csv.py#L142
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/dis.py#L462
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/filecmp.py#L142
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/filecmp.py#L143
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/inspect.py#L1440
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/inspect.py#L2095
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/os.py#L510
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/plistlib.py#L577
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/tarfile.py#L1317
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/tarfile.py#L1323
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/tarfile.py#L1339
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/turtle.py#L3015
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/turtle.py#L3071
-   https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/turtle.py#L3901

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.