PEP 618 – Add Optional Length-Checking To zip
- Author:
- Brandt Bucher <brandt at python.org>
- Sponsor:
- Antoine Pitrou <antoine at python.org>
- BDFL-Delegate:
- Guido van Rossum <guido at python.org>
- Status:
- Final
- Type:
- Standards Track
- Created:
- 01-May-2020
- Python-Version:
- 3.10
- Post-History:
- 01-May-2020, 10-May-2020, 16-Jun-2020
- Resolution:
- Python-Dev message
Abstract
This PEP proposes adding an optional strict
boolean keyword
parameter to the built-in zip
. When enabled, a ValueError
is
raised if one of the arguments is exhausted before the others.
Motivation
It is clear from the author’s personal experience and a survey of the
standard library that much (if not most) zip
usage
involves iterables that must be of equal length. Sometimes this
invariant is proven true from the context of the surrounding code, but
often the data being zipped is passed from the caller, sourced
separately, or generated in some fashion. In any of these cases, the
default behavior of zip
means that faulty refactoring or logic
errors could easily result in silently losing data. These bugs are
not only difficult to diagnose, but difficult to even detect at all.
It is easy to come up with simple cases where this could be a problem.
For example, the following code may work fine when items
is a
sequence, but silently start producing shortened, mismatched results
if items
is refactored by the caller to be a consumable iterator:
def apply_calculations(items):
transformed = transform(items)
for i, t in zip(items, transformed):
yield calculate(i, t)
There are several other ways in which zip
is commonly used.
Idiomatic tricks are especially susceptible, because they are often
employed by users who lack a complete understanding of how the code
works. One example is unpacking into zip
to lazily “unzip” or
“transpose” nested iterables:
>>> x = [[1, 2, 3], ["one" "two" "three"]]
>>> xt = list(zip(*x))
Another is “chunking” data into equal-sized groups:
>>> n = 3
>>> x = range(n ** 2),
>>> xn = list(zip(*[iter(x)] * n))
In the first case, non-rectangular data is usually a logic error. In
the second case, data with a length that is not a multiple of n
is
often an error as well. However, both of these idioms will silently
omit the tail-end items of malformed input.
Perhaps most convincingly, the use of zip
in the standard-library
ast
module created a bug in literal_eval
which silently
dropped parts of malformed nodes:
>>> from ast import Constant, Dict, literal_eval
>>> nasty_dict = Dict(keys=[Constant(None)], values=[])
>>> literal_eval(nasty_dict) # Like eval("{None: }")
{}
In fact, the author has counted dozens of other call sites in Python’s standard library and tooling where it would be appropriate to enable this new feature immediately.
Rationale
Some critics assert that constant boolean switches are a “code-smell”, or go against Python’s design philosophy. However, Python currently contains several examples of boolean keyword parameters on built-in functions which are typically called with compile-time constants:
compile(..., dont_inherit=True)
open(..., closefd=False)
print(..., flush=True)
sorted(..., reverse=True)
Many more exist in the standard library.
The idea and name for this new parameter were originally proposed by Ram Rachum. The thread received over 100 replies, with the alternative “equal” receiving a similar amount of support.
The author does not have a strong preference between the two choices, though “equal equals” is a bit awkward in prose. It may also (wrongly) imply some notion of “equality” between the zipped items:
>>> z = zip([2.0, 4.0, 6.0], [2, 4, 8], equal=True)
Specification
When the built-in zip
is called with the keyword-only argument
strict=True
, the resulting iterator will raise a ValueError
if
the arguments are exhausted at differing lengths. This error will
occur at the point when iteration would normally stop today.
Backward Compatibility
This change is fully backward-compatible. zip
currently takes no
keyword arguments, and the “non-strict” default behavior when
strict
is omitted remains unchanged.
Reference Implementation
The author has drafted a C implementation.
An approximate Python translation is:
def zip(*iterables, strict=False):
if not iterables:
return
iterators = tuple(iter(iterable) for iterable in iterables)
try:
while True:
items = []
for iterator in iterators:
items.append(next(iterator))
yield tuple(items)
except StopIteration:
if not strict:
return
if items:
i = len(items)
plural = " " if i == 1 else "s 1-"
msg = f"zip() argument {i+1} is shorter than argument{plural}{i}"
raise ValueError(msg)
sentinel = object()
for i, iterator in enumerate(iterators[1:], 1):
if next(iterator, sentinel) is not sentinel:
plural = " " if i == 1 else "s 1-"
msg = f"zip() argument {i+1} is longer than argument{plural}{i}"
raise ValueError(msg)
Rejected Ideas
Add itertools.zip_strict
This is the alternative with the most support on the Python-Ideas mailing list, so it deserves to be discussed in some detail here. It does not have any disqualifying flaws, and could work well enough as a substitute if this PEP is rejected.
With that in mind, this section aims to outline why adding an optional
parameter to zip
is a smaller change that ultimately does a better
job of solving the problems motivating this PEP.
Precedent
It seems that a great deal of the motivation driving this alternative
is that zip_longest
already exists in itertools
. However,
zip_longest
is in many ways a much more complicated, specialized
utility: it takes on the responsibility of filling in missing values,
a job neither of the other variants needs to concern themselves with.
If both zip
and zip_longest
lived alongside each other in
itertools
or as builtins, then adding zip_strict
in the same
location would indeed be a much stronger argument. However, the new
“strict” variant is conceptually much closer to zip
in interface
and behavior than zip_longest
, while still not meeting the high
bar of being its own builtin. Given this situation, it seems most
natural for zip
to grow this new option in-place.
Usability
If zip
is capable of preventing this class of bug, it becomes much
simpler for users to enable the check at call sites with this
property. Compare this with importing a drop-in replacement for a
built-in utility, which feels somewhat heavy just to check a tricky
condition that should “always” be true.
Some have also argued that a new function buried in the standard library is somehow more “discoverable” than a keyword parameter on the built-in itself. The author does not agree with this assessment.
Maintenance Cost
While implementation should only be a secondary concern when making
usability improvements, it is important to recognize that adding a new
utility is significantly more complicated than modifying an existing
one. The CPython implementation accompanying this PEP is simple and
has no measurable performance impact on default zip
behavior,
while adding an entirely new utility to itertools
would require
either:
- Duplicating much of the existing
zip
logic, aszip_longest
already does. - Significantly refactoring either
zip
,zip_longest
, or both to share a common or inherited implementation (which may impact performance).
Add Several “Modes” To Switch Between
This option only makes more sense than a binary flag if we anticipate
having three or more modes. The “obvious” three choices for these
enumerated or constant modes would be “shortest” (the current zip
behavior), “strict” (the proposed behavior), and “longest”
(the itertools.zip_longest
behavior).
However, it doesn’t seem like adding behaviors other than the current
default and the proposed “strict” mode is worth the additional
complexity. The clearest candidate, “longest”, would require a new
fillvalue
parameter (which is meaningless for both other modes).
This mode is also already handled perfectly by
itertools.zip_longest
, and adding it would create two ways of
doing the same thing. It’s not clear which would be the “obvious”
choice: the mode
parameter on the built-in zip
, or the
long-lived namesake utility in itertools
.
Add A Method Or Alternate Constructor To The zip
Type
Consider the following two options, which have both been proposed:
>>> zm = zip(*iters).strict()
>>> zd = zip.strict(*iters)
It’s not obvious which one will succeed, or how the other will fail.
If zip.strict
is implemented as a method, zm
will succeed, but
zd
will fail in one of several confusing ways:
- Yield results that aren’t wrapped in a tuple (if
iters
contains just one item, azip
iterator). - Raise a
TypeError
for an incorrect argument type (ifiters
contains just one item, not azip
iterator). - Raise a
TypeError
for an incorrect number of arguments (otherwise).
If zip.strict
is implemented as a classmethod
or
staticmethod
, zd
will succeed, and zm
will silently yield
nothing (which is the problem we are trying to avoid in the first
place).
This proposal is further complicated by the fact that CPython’s actual
zip
type is currently an undocumented implementation detail. This
means that choosing one of the above behaviors will effectively “lock
in” the current implementation (or at least require it to be emulated)
going forward.
Change The Default Behavior Of zip
There is nothing “wrong” with the default behavior of zip
, since
there are many cases where it is indeed the correct way to handle
unequally-sized inputs. It’s extremely useful, for example, when
dealing with infinite iterators.
itertools.zip_longest
already exists to service those cases where
the “extra” tail-end data is still needed.
Accept A Callback To Handle Remaining Items
While able to do basically anything a user could need, this solution makes handling the more common cases (like rejecting mismatched lengths) unnecessarily complicated and non-obvious.
Raise An AssertionError
There are no built-in functions or types that raise an
AssertionError
as part of their API. Further, the official
documentation
simply reads (in its entirety):
Raised when anassert
statement fails.
Since this feature has nothing to do with Python’s assert
statement, raising an AssertionError
here would be inappropriate.
Users desiring a check that is disabled in optimized mode (like an
assert
statement) can use strict=__debug__
instead.
Add A Similar Feature to map
This PEP does not propose any changes to map
, since the use of
map
with multiple iterable arguments is quite rare. However, this
PEP’s ruling shall serve as precedent such a future discussion (should
it occur).
If rejected, the feature is realistically not worth pursuing. If
accepted, such a change to map
should not require its own PEP
(though, like all enhancements, its usefulness should be carefully
considered). For consistency, it should follow same API and semantics
debated here for zip
.
Do Nothing
This option is perhaps the least attractive.
Silently truncated data is a particularly nasty class of bug, and hand-writing a robust solution that gets this right isn’t trivial. The real-world motivating examples from Python’s own standard library are evidence that it’s very easy to fall into the sort of trap that this feature aims to avoid.
References
Examples
Note
This listing is not exhaustive.
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/_pydecimal.py#L3394
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/_pydecimal.py#L3418
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/_pydecimal.py#L3435
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L94-L95
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L1184
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L1275
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L1363
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L1391
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/copy.py#L217
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/csv.py#L142
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/dis.py#L462
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/filecmp.py#L142
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/filecmp.py#L143
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/inspect.py#L1440
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/inspect.py#L2095
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/os.py#L510
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/plistlib.py#L577
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/tarfile.py#L1317
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/tarfile.py#L1323
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/tarfile.py#L1339
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/turtle.py#L3015
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/turtle.py#L3071
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/turtle.py#L3901
Copyright
This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.
Source: https://github.com/python/peps/blob/main/peps/pep-0618.rst
Last modified: 2023-09-09 17:39:29 GMT