PEP: 472 Title: Support for indexing with keyword arguments Version:
$Revision$ Last-Modified: $Date$ Author: Stefano Borini, Joseph
Martinot-Lagarde Discussions-To: python-ideas@python.org Status:
Rejected Type: Standards Track Content-Type: text/x-rst Created:
24-Jun-2014 Python-Version: 3.6 Post-History: 02-Jul-2014 Resolution:
https://mail.python.org/pipermail/python-dev/2019-March/156693.html

Abstract

This PEP proposes an extension of the indexing operation to support
keyword arguments. Notations in the form a[K=3,R=2] would become legal
syntax. For future-proofing considerations, a[1:2, K=3, R=4] are
considered and may be allowed as well, depending on the choice for
implementation. In addition to a change in the parser, the index
protocol (__getitem__, __setitem__ and __delitem__) will also
potentially require adaptation.

Motivation

The indexing syntax carries a strong semantic content, differentiating
it from a method call: it implies referring to a subset of data. We
believe this semantic association to be important, and wish to expand
the strategies allowed to refer to this data.

As a general observation, the number of indices needed by an indexing
operation depends on the dimensionality of the data: one-dimensional
data (e.g. a list) requires one index (e.g. a[3]), two-dimensional data
(e.g. a matrix) requires two indices (e.g. a[2,3]) and so on. Each index
is a selector along one of the axes of the dimensionality, and the
position in the index tuple is the metainformation needed to associate
each index to the corresponding axis.

The current python syntax focuses exclusively on position to express the
association to the axes, and also contains syntactic sugar to refer to
non-punctiform selection (slices)

    >>> a[3]       # returns the fourth element of a
    >>> a[1:10:2]  # slice notation (extract a non-trivial data subset)
    >>> a[3,2]     # multiple indexes (for multidimensional arrays)

The additional notation proposed in this PEP would allow notations
involving keyword arguments in the indexing operation, e.g.

    >>> a[K=3, R=2]

which would allow to refer to axes by conventional names.

One must additionally consider the extended form that allows both
positional and keyword specification

    >>> a[3,R=3,K=4]

This PEP will explore different strategies to enable the use of these
notations.

Use cases

The following practical use cases present two broad categories of usage
of a keyworded specification: Indexing and contextual option. For
indexing:

1.  To provide a more communicative meaning to the index, preventing
    e.g. accidental inversion of indexes

        >>> gridValues[x=3, y=5, z=8]
        >>> rain[time=0:12, location=location]

2.  In some domain, such as computational physics and chemistry, the use
    of a notation such as Basis[Z=5] is a Domain Specific Language
    notation to represent a level of accuracy

        >>> low_accuracy_energy = computeEnergy(molecule, BasisSet[Z=3])

    In this case, the index operation would return a basis set at the
    chosen level of accuracy (represented by the parameter Z). The
    reason behind an indexing is that the BasisSet object could be
    internally represented as a numeric table, where rows (the
    "coefficient" axis, hidden to the user in this example) are
    associated to individual elements (e.g. row 0:5 contains
    coefficients for element 1, row 5:8 coefficients for element 2) and
    each column is associated to a given degree of accuracy ("accuracy"
    or "Z" axis) so that first column is low accuracy, second column is
    medium accuracy and so on. With that indexing, the user would obtain
    another object representing the contents of the column of the
    internal table for accuracy level 3.

Additionally, the keyword specification can be used as an option
contextual to the indexing. Specifically:

1.  A "default" option allows to specify a default return value when the
    index is not present

        >>> lst = [1, 2, 3]
        >>> value = lst[5, default=0]  # value is 0

2.  For a sparse dataset, to specify an interpolation strategy to infer
    a missing point from e.g. its surrounding data.

        >>> value = array[1, 3, interpolate=spline_interpolator]

3.  A unit could be specified with the same mechanism

        >>> value = array[1, 3, unit="degrees"]

How the notation is interpreted is up to the implementing class.

Current implementation

Currently, the indexing operation is handled by methods __getitem__,
__setitem__ and __delitem__. These methods' signature accept one
argument for the index (with __setitem__ accepting an additional
argument for the set value). In the following, we will analyze
__getitem__(self, idx) exclusively, with the same considerations implied
for the remaining two methods.

When an indexing operation is performed, __getitem__(self, idx) is
called. Traditionally, the full content between square brackets is
turned into a single object passed to argument idx:

-   When a single element is passed, e.g. a[2], idx will be 2.
-   When multiple elements are passed, they must be separated by commas:
    a[2, 3]. In this case, idx will be a tuple (2, 3). With
    a[2, 3, "hello", {}] idx will be (2, 3, "hello", {}).
-   A slicing notation e.g. a[2:10] will produce a slice object, or a
    tuple containing slice objects if multiple values were passed.

Except for its unique ability to handle slice notation, the indexing
operation has similarities to a plain method call: it acts like one when
invoked with only one element; If the number of elements is greater than
one, the idx argument behaves like a *args. However, as stated in the
Motivation section, an indexing operation has the strong semantic
implication of extraction of a subset out of a larger set, which is not
automatically associated to a regular method call unless appropriate
naming is chosen. Moreover, its different visual style is important for
readability.

Specifications

The implementation should try to preserve the current signature for
__getitem__, or modify it in a backward-compatible way. We will present
different alternatives, taking into account the possible cases that need
to be addressed

    C0. a[1]; a[1,2]         # Traditional indexing
    C1. a[Z=3]
    C2. a[Z=3, R=4]
    C3. a[1, Z=3]
    C4. a[1, Z=3, R=4]
    C5. a[1, 2, Z=3]
    C6. a[1, 2, Z=3, R=4]
    C7. a[1, Z=3, 2, R=4]    # Interposed ordering

Strategy "Strict dictionary"

This strategy acknowledges that __getitem__ is special in accepting only
one object, and the nature of that object must be non-ambiguous in its
specification of the axes: it can be either by order, or by name. As a
result of this assumption, in presence of keyword arguments, the passed
entity is a dictionary and all labels must be specified.

    C0. a[1]; a[1,2]      -> idx = 1; idx = (1, 2)
    C1. a[Z=3]            -> idx = {"Z": 3}
    C2. a[Z=3, R=4]       -> idx = {"Z": 3, "R": 4}
    C3. a[1, Z=3]         -> raise SyntaxError
    C4. a[1, Z=3, R=4]    -> raise SyntaxError
    C5. a[1, 2, Z=3]      -> raise SyntaxError
    C6. a[1, 2, Z=3, R=4] -> raise SyntaxError
    C7. a[1, Z=3, 2, R=4] -> raise SyntaxError

Pros

-   Strong conceptual similarity between the tuple case and the
    dictionary case. In the first case, we are specifying a tuple, so we
    are naturally defining a plain set of values separated by commas. In
    the second, we are specifying a dictionary, so we are specifying a
    homogeneous set of key/value pairs, as in dict(Z=3, R=4);
-   Simple and easy to parse on the __getitem__ side: if it gets a
    tuple, determine the axes using positioning. If it gets a
    dictionary, use the keywords.
-   C interface does not need changes.

Neutral

-   Degeneracy of a[{"Z": 3, "R": 4}] with a[Z=3, R=4] means the
    notation is syntactic sugar.

Cons

-   Very strict.
-   Destroys ordering of the passed arguments. Preserving the order
    would be possible with an OrderedDict as drafted by PEP 468.
-   Does not allow use cases with mixed positional/keyword arguments
    such as a[1, 2, default=5].

Strategy "mixed dictionary"

This strategy relaxes the above constraint to return a dictionary
containing both numbers and strings as keys.

    C0. a[1]; a[1,2]      -> idx = 1; idx = (1, 2)
    C1. a[Z=3]            -> idx = {"Z": 3}
    C2. a[Z=3, R=4]       -> idx = {"Z": 3, "R": 4}
    C3. a[1, Z=3]         -> idx = { 0: 1, "Z": 3}
    C4. a[1, Z=3, R=4]    -> idx = { 0: 1, "Z": 3, "R": 4}
    C5. a[1, 2, Z=3]      -> idx = { 0: 1, 1: 2, "Z": 3}
    C6. a[1, 2, Z=3, R=4] -> idx = { 0: 1, 1: 2, "Z": 3, "R": 4}
    C7. a[1, Z=3, 2, R=4] -> idx = { 0: 1, "Z": 3, 2: 2, "R": 4}

Pros

-   Opens for mixed cases.

Cons

-   Destroys ordering information for string keys. We have no way of
    saying if "Z" in C7 was in position 1 or 3.
-   Implies switching from a tuple to a dict as soon as one specified
    index has a keyword argument. May be confusing to parse.

Strategy "named tuple"

Return a named tuple for idx instead of a tuple. Keyword arguments would
obviously have their stated name as key, and positional argument would
have an underscore followed by their order:

    C0. a[1]; a[1,2]      -> idx = 1; idx = (_0=1, _1=2)
    C1. a[Z=3]            -> idx = (Z=3)
    C2. a[Z=3, R=2]       -> idx = (Z=3, R=2)
    C3. a[1, Z=3]         -> idx = (_0=1, Z=3)
    C4. a[1, Z=3, R=2]    -> idx = (_0=1, Z=3, R=2)
    C5. a[1, 2, Z=3]      -> idx = (_0=1, _2=2, Z=3)
    C6. a[1, 2, Z=3, R=4] -> (_0=1, _1=2, Z=3, R=4)
    C7. a[1, Z=3, 2, R=4] -> (_0=1, Z=3, _1=2, R=4)
                          or (_0=1, Z=3, _2=2, R=4)
                          or raise SyntaxError

The required typename of the namedtuple could be Index or the name of
the argument in the function definition, it keeps the ordering and is
easy to analyse by using the _fields attribute. It is backward
compatible, provided that C0 with more than one entry now passes a
namedtuple instead of a plain tuple.

Pros

-   Looks nice. namedtuple transparently replaces tuple and gracefully
    degrades to the old behavior.
-   Does not require a change in the C interface

Cons

-   According to some sources[1] namedtuple is not well developed. To
    include it as such important object would probably require rework
    and improvement;
-   The namedtuple fields, and thus the type, will have to change
    according to the passed arguments. This can be a performance
    bottleneck, and makes it impossible to guarantee that two subsequent
    index accesses get the same Index class;
-   the _n "magic" fields are a bit unusual, but ipython already uses
    them for result history.
-   Python currently has no builtin namedtuple. The current one is
    available in the "collections" module in the standard library.
-   Differently from a function, the two notations
    gridValues[x=3, y=5, z=8] and gridValues[3,5,8] would not gracefully
    match if the order is modified at call time (e.g. we ask for
    gridValues[y=5, z=8, x=3]). In a function, we can pre-define
    argument names so that keyword arguments are properly matched. Not
    so in __getitem__, leaving the task for interpreting and matching to
    __getitem__ itself.

Strategy "New argument contents"

In the current implementation, when many arguments are passed to
__getitem__, they are grouped in a tuple and this tuple is passed to
__getitem__ as the single argument idx. This strategy keeps the current
signature, but expands the range of variability in type and contents of
idx to more complex representations.

We identify four possible ways to implement this strategy:

-   P1: uses a single dictionary for the keyword arguments.
-   P2: uses individual single-item dictionaries.
-   P3: similar to P2, but replaces single-item dictionaries with a
    (key, value) tuple.
-   P4: similar to P2, but uses a special and additional new object:
    keyword()

Some of these possibilities lead to degenerate notations, i.e.
indistinguishable from an already possible representation. Once again,
the proposed notation becomes syntactic sugar for these representations.

Under this strategy, the old behavior for C0 is unchanged.

    C0: a[1]        -> idx = 1                    # integer
        a[1,2]      -> idx = (1,2)                # tuple

In C1, we can use either a dictionary or a tuple to represent key and
value pair for the specific indexing entry. We need to have a tuple with
a tuple in C1 because otherwise we cannot differentiate a["Z", 3] from
a[Z=3].

    C1: a[Z=3]      -> idx = {"Z": 3}             # P1/P2 dictionary with single key
                    or idx = (("Z", 3),)          # P3 tuple of tuples
                    or idx = keyword("Z", 3)      # P4 keyword object

As you can see, notation P1/P2 implies that a[Z=3] and a[{"Z": 3}] will
call __getitem__ passing the exact same value, and is therefore
syntactic sugar for the latter. Same situation occurs, although with
different index, for P3. Using a keyword object as in P4 would remove
this degeneracy.

For the C2 case:

    C2. a[Z=3, R=4] -> idx = {"Z": 3, "R": 4}     # P1 dictionary/ordereddict
                    or idx = ({"Z": 3}, {"R": 4}) # P2 tuple of two single-key dict
                    or idx = (("Z", 3), ("R", 4)) # P3 tuple of tuples
                    or idx = (keyword("Z", 3),
                              keyword("R", 4) )   # P4 keyword objects

P1 naturally maps to the traditional **kwargs behavior, however it
breaks the convention that two or more entries for the index produce a
tuple. P2 preserves this behavior, and additionally preserves the order.
Preserving the order would also be possible with an OrderedDict as
drafted by PEP 468.

The remaining cases are here shown:

    C3. a[1, Z=3]   -> idx = (1, {"Z": 3})                     # P1/P2
                    or idx = (1, ("Z", 3))                     # P3
                    or idx = (1, keyword("Z", 3))              # P4

    C4. a[1, Z=3, R=4] -> idx = (1, {"Z": 3, "R": 4})          # P1
                       or idx = (1, {"Z": 3}, {"R": 4})        # P2
                       or idx = (1, ("Z", 3), ("R", 4))        # P3
                       or idx = (1, keyword("Z", 3),
                                    keyword("R", 4))           # P4

    C5. a[1, 2, Z=3]   -> idx = (1, 2, {"Z": 3})               # P1/P2
                       or idx = (1, 2, ("Z", 3))               # P3
                       or idx = (1, 2, keyword("Z", 3))        # P4

    C6. a[1, 2, Z=3, R=4] -> idx = (1, 2, {"Z":3, "R": 4})     # P1
                          or idx = (1, 2, {"Z": 3}, {"R": 4})  # P2
                          or idx = (1, 2, ("Z", 3), ("R", 4))  # P3
                          or idx = (1, 2, keyword("Z", 3),
                                          keyword("R", 4))     # P4

    C7. a[1, Z=3, 2, R=4] -> idx = (1, 2, {"Z": 3, "R": 4})    # P1. Pack the keyword arguments. Ugly.
                          or raise SyntaxError                 # P1. Same behavior as in function calls.
                          or idx = (1, {"Z": 3}, 2, {"R": 4})  # P2
                          or idx =  (1, ("Z", 3), 2, ("R", 4)) # P3
                          or idx =  (1, keyword("Z", 3),
                                     2, keyword("R", 4))       # P4

Pros

-   Signature is unchanged;
-   P2/P3 can preserve ordering of keyword arguments as specified at
    indexing,
-   P1 needs an OrderedDict, but would destroy interposed ordering if
    allowed: all keyword indexes would be dumped into the dictionary;
-   Stays within traditional types: tuples and dicts. Evt. OrderedDict;
-   Some proposed strategies are similar in behavior to a traditional
    function call;
-   The C interface for PyObject_GetItem and family would remain
    unchanged.

Cons

-   Apparently complex and wasteful;
-   Degeneracy in notation (e.g. a[Z=3] and a[{"Z":3}] are equivalent
    and indistinguishable notations at the __[get|set|del]item__ level).
    This behavior may or may not be acceptable.
-   for P4, an additional object similar in nature to slice() is needed,
    but only to disambiguate the above degeneracy.
-   idx type and layout seems to change depending on the whims of the
    caller;
-   May be complex to parse what is passed, especially in the case of
    tuple of tuples;
-   P2 Creates a lot of single keys dictionary as members of a tuple.
    Looks ugly. P3 would be lighter and easier to use than the tuple of
    dicts, and still preserves order (unlike the regular dict), but
    would result in clumsy extraction of keywords.

Strategy "kwargs argument"

__getitem__ accepts an optional **kwargs argument which should be
keyword only. idx also becomes optional to support a case where no
non-keyword arguments are allowed. The signature would then be either

    __getitem__(self, idx)
    __getitem__(self, idx, **kwargs)
    __getitem__(self, **kwargs)

Applied to our cases would produce:

    C0. a[1,2]            -> idx=(1,2);  kwargs={}
    C1. a[Z=3]            -> idx=None ;  kwargs={"Z":3}
    C2. a[Z=3, R=4]       -> idx=None ;  kwargs={"Z":3, "R":4}
    C3. a[1, Z=3]         -> idx=1    ;  kwargs={"Z":3}
    C4. a[1, Z=3, R=4]    -> idx=1    ;  kwargs={"Z":3, "R":4}
    C5. a[1, 2, Z=3]      -> idx=(1,2);  kwargs={"Z":3}
    C6. a[1, 2, Z=3, R=4] -> idx=(1,2);  kwargs={"Z":3, "R":4}
    C7. a[1, Z=3, 2, R=4] -> raise SyntaxError # in agreement to function behavior

Empty indexing a[] of course remains invalid syntax.

Pros

-   Similar to function call, evolves naturally from it;
-   Use of keyword indexing with an object whose __getitem__ doesn't
    have a kwargs will fail in an obvious way. That's not the case for
    the other strategies.

Cons

-   It doesn't preserve order, unless an OrderedDict is used;
-   Forbids C7, but is it really needed?
-   Requires a change in the C interface to pass an additional PyObject
    for the keyword arguments.

C interface

As briefly introduced in the previous analysis, the C interface would
potentially have to change to allow the new feature. Specifically,
PyObject_GetItem and related routines would have to accept an additional
PyObject *kw argument for Strategy "kwargs argument". The remaining
strategies would not require a change in the C function signatures, but
the different nature of the passed object would potentially require
adaptation.

Strategy "named tuple" would behave correctly without any change: the
class returned by the factory method in collections returns a subclass
of tuple, meaning that PyTuple_* functions can handle the resulting
object.

Alternative Solutions

In this section, we present alternative solutions that would workaround
the missing feature and make the proposed enhancement not worth of
implementation.

Use a method

One could keep the indexing as is, and use a traditional get() method
for those cases where basic indexing is not enough. This is a good
point, but as already reported in the introduction, methods have a
different semantic weight from indexing, and you can't use slices
directly in methods. Compare e.g. a[1:3, Z=2] with
a.get(slice(1,3), Z=2).

The authors however recognize this argument as compelling, and the
advantage in semantic expressivity of a keyword-based indexing may be
offset by a rarely used feature that does not bring enough benefit and
may have limited adoption.

Emulate requested behavior by abusing the slice object

This extremely creative method exploits the slice objects' behavior,
provided that one accepts to use strings (or instantiate properly named
placeholder objects for the keys), and accept to use ":" instead of "=".

    >>> a["K":3]
    slice('K', 3, None)
    >>> a["K":3, "R":4]
    (slice('K', 3, None), slice('R', 4, None))
    >>>

While clearly smart, this approach does not allow easy inquire of the
key/value pair, it's too clever and esotheric, and does not allow to
pass a slice as in a[K=1:10:2].

However, Tim Delaney comments

  "I really do think that a[b=c, d=e] should just be syntax sugar for
  a['b':c, 'd':e]. It's simple to explain, and gives the greatest
  backwards compatibility. In particular, libraries that already abused
  slices in this way will just continue to work with the new syntax."

We think this behavior would produce inconvenient results. The library
Pandas uses strings as labels, allowing notation such as

    >>> a[:, "A":"F"]

to extract data from column "A" to column "F". Under the above comment,
this notation would be equally obtained with

    >>> a[:, A="F"]

which is weird and collides with the intended meaning of keyword in
indexing, that is, specifying the axis through conventional names rather
than positioning.

Pass a dictionary as an additional index

    >>> a[1, 2, {"K": 3}]

this notation, although less elegant, can already be used and achieves
similar results. It's evident that the proposed Strategy "New argument
contents" can be interpreted as syntactic sugar for this notation.

Additional Comments

Commenters also expressed the following relevant points:

Relevance of ordering of keyword arguments

As part of the discussion of this PEP, it's important to decide if the
ordering information of the keyword arguments is important, and if
indexes and keys can be ordered in an arbitrary way (e.g.
a[1,Z=3,2,R=4]). PEP 468 tries to address the first point by proposing
the use of an ordereddict, however one would be inclined to accept that
keyword arguments in indexing are equivalent to kwargs in function
calls, and therefore as of today equally unordered, and with the same
restrictions.

Need for homogeneity of behavior

Relative to Strategy "New argument contents", a comment from Ian
Cordasco points out that

  "it would be unreasonable for just one method to behave totally
  differently from the standard behaviour in Python. It would be
  confusing for only __getitem__ (and ostensibly, __setitem__) to take
  keyword arguments but instead of turning them into a dictionary, turn
  them into individual single-item dictionaries." We agree with his
  point, however it must be pointed out that __getitem__ is already
  special in some regards when it comes to passed arguments.

Chris Angelico also states:

  "it seems very odd to start out by saying "here, let's give indexing
  the option to carry keyword args, just like with function calls", and
  then come back and say "oh, but unlike function calls, they're
  inherently ordered and carried very differently"." Again, we agree on
  this point. The most straightforward strategy to keep homogeneity
  would be Strategy "kwargs argument", opening to a **kwargs argument on
  __getitem__.

One of the authors (Stefano Borini) thinks that only the "strict
dictionary" strategy is worth of implementation. It is non-ambiguous,
simple, does not force complex parsing, and addresses the problem of
referring to axes either by position or by name. The "options" use case
is probably best handled with a different approach, and may be
irrelevant for this PEP. The alternative "named tuple" is another valid
choice.

Having .get() become obsolete for indexing with default fallback

Introducing a "default" keyword could make dict.get() obsolete, which
would be replaced by d["key", default=3]. Chris Angelico however states:

  "Currently, you need to write __getitem__ (which raises an exception
  on finding a problem) plus something else, e.g. get(), which returns a
  default instead. By your proposal, both branches would go inside
  __getitem__, which means they could share code; but there still need
  to be two branches."

Additionally, Chris continues:

  "There'll be an ad-hoc and fairly arbitrary puddle of names (some will
  go default=, others will say that's way too long and go def=, except
  that that's a keyword so they'll use dflt= or something...), unless
  there's a strong force pushing people to one consistent name.".

This argument is valid but it's equally valid for any function call, and
is generally fixed by established convention and documentation.

On degeneracy of notation

User Drekin commented: "The case of a[Z=3] and a[{"Z": 3}] is similar to
current a[1, 2] and a[(1, 2)]. Even though one may argue that the
parentheses are actually not part of tuple notation but are just needed
because of syntax, it may look as degeneracy of notation when compared
to function call: f(1, 2) is not the same thing as f((1, 2)).".

References

Copyright

This document has been placed in the public domain.



  Local Variables: mode: indented-text indent-tabs-mode: nil
  sentence-end-double-space: t fill-column: 70 End:

[1] "namedtuple is not as good as it should be"
(https://mail.python.org/pipermail/python-ideas/2013-June/021257.html)