PEP: 461
Title: Adding % formatting to bytes and bytearray
Author: Ethan Furman <ethan@stoneleaf.us>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 13-Jan-2014
Python-Version: 3.5
Post-History: 14-Jan-2014, 15-Jan-2014, 17-Jan-2014, 22-Feb-2014,
              25-Mar-2014, 27-Mar-2014
Resolution: https://mail.python.org/pipermail/python-dev/2014-March/133621.html

Abstract

This PEP proposes adding to bytes and bytearray % formatting operations
similar to those of Python 2's str type[1][2].

Rationale

While interpolation is usually thought of as a string operation, there
are cases where interpolation on bytes or bytearrays makes sense, and the
work needed to make up for this missing functionality detracts from the
overall readability of the code.

Motivation

With Python 3 and the split between str and bytes, one small but
important area of programming became slightly more difficult, and much
more painful -- wire format protocols[3].

This area of programming is characterized by a mixture of binary data
and ASCII compatible segments of text (aka ASCII-encoded text). Bringing
back a restricted %-interpolation for bytes and bytearray will aid both
in writing new wire format code, and in porting Python 2 wire format
code.

Common use-cases include dbf and pdf file formats, email formats, and
FTP and HTTP communications, among many others.
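As an illustration of the mix of ASCII text and binary data described
above, the following sketch assembles a minimal HTTP response. It assumes
Python 3.5 or later; the helper name build_response and the sample
payload are hypothetical and not taken from this PEP.

```python
def build_response(status, reason, body):
    """Assemble a minimal HTTP/1.1 response as bytes.

    Hypothetical helper for illustration only.
    """
    # ASCII-compatible header text interpolated directly into bytes.
    head = b"HTTP/1.1 %d %b\r\nContent-Length: %d\r\n\r\n" % (
        status, reason, len(body))
    # The body may be arbitrary binary data.
    return head + body


response = build_response(200, b"OK", b"\x00\x01binary payload")
```

The header and the binary body stay in a single bytes value, with no
intermediate str objects or explicit encode() calls.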

Proposed semantics for bytes and bytearray formatting

%-interpolation

All the numeric formatting codes (d, i, o, u, x, X, e, E, f, F, g, G,
and any that are subsequently added to Python 3) will be supported, and
will work as they do for str, including the padding, justification and
other related modifiers (currently #, 0, -, space, and + (plus any added
to Python 3)). The only non-numeric codes allowed are c, b, a, and s
(which is a synonym for b).

For the numeric codes, the only difference between str and bytes (or
bytearray) interpolation is that the results from these codes will be
ASCII-encoded text, not unicode. In other words, for any numeric
formatting code %x:

    b"%x" % val

is equivalent to:

    ("%x" % val).encode("ascii")

Examples:

    >>> b'%4x' % 10
    b'   a'

    >>> b'%#4x' % 10
    b' 0xa'

    >>> b'%04X' % 10
    b'000A'
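The equivalence between bytes and str formatting for numeric codes can
be checked mechanically. The sketch below assumes Python 3.5 or later;
the helper name same_as_str and the sample values are illustrative
assumptions, not part of this PEP.

```python
def same_as_str(fmt, value):
    """Return True if bytes formatting matches ASCII-encoded str formatting."""
    return fmt.encode("ascii") % value == (fmt % value).encode("ascii")


# A few numeric codes combined with flags, widths, and precisions.
cases = [
    ("%d", 42),
    ("%+5d", 42),
    ("%-6i|", 7),
    ("%#o", 8),
    ("%04X", 10),
    ("%.3e", 1234.5),
    ("%8.2f", 3.14159),
    ("%g", 0.0001),
]
results = [same_as_str(fmt, value) for fmt, value in cases]

# bytearray accepts the same codes; the result is a bytearray.
ba_result = bytearray(b"%04X") % 10
```

Every entry in results is expected to be True, and ba_result is a
bytearray rather than a bytes object.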

%c will insert a single byte, either from an int in range(256), or from
a bytes argument of length 1, not from a str.

Examples:

    >>> b'%c' % 48
    b'0'

    >>> b'%c' % b'a'
    b'a'
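The accepted and rejected inputs for %c can be sketched as follows,
assuming Python 3.5 or later. The helper try_c is hypothetical. The text
above does not say which exception is raised for each rejected input, so
the helper catches both TypeError and OverflowError and reports whichever
occurs.

```python
def try_c(arg):
    """Format arg with b'%c'; return the result or the exception type name."""
    try:
        return b"%c" % arg
    except (TypeError, OverflowError) as exc:
        return type(exc).__name__


from_int = try_c(65)        # int in range(256)
from_bytes = try_c(b"A")    # bytes of length 1
from_str = try_c("A")       # str is rejected
too_long = try_c(b"AB")     # bytes of length other than 1 is rejected
out_of_range = try_c(256)   # int outside range(256) is rejected
```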

%b will insert a series of bytes. These bytes are collected in one of
two ways:

-   if the input type supports Py_buffer[4], use it to collect the
    necessary bytes
-   otherwise, use the input's __bytes__ method[5]; if there isn't one,
    raise a TypeError

In particular, %b will accept neither numbers nor str. str is rejected
because the string-to-bytes conversion requires an encoding, and we are
refusing to guess; numbers are rejected because:

-   what makes a number is fuzzy (float? Decimal? Fraction? some user
    type?)
-   allowing numbers would lead to ambiguity between numbers and textual
    representations of numbers (3.14 vs '3.14')
-   given the nature of wire formats, explicit is definitely better than
    implicit

%s is included as a synonym for %b for the sole purpose of making 2/3
code bases easier to maintain. Python 3 only code should use %b.

Examples:

    >>> b'%b' % b'abc'
    b'abc'

    >>> b'%b' % 'some string'.encode('utf8')
    b'some string'

    >>> b'%b' % 3.14
    Traceback (most recent call last):
    ...
    TypeError: b'%b' does not accept 'float'

    >>> b'%b' % 'hello world!'
    Traceback (most recent call last):
    ...
    TypeError: b'%b' does not accept 'str'
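The two ways %b collects bytes can be sketched as follows, assuming
Python 3.5 or later. The classes Packet and Plain are hypothetical types
introduced only for this illustration. Arguments are wrapped in a
one-element tuple so that each value is unambiguously treated as a single
argument.

```python
import array


class Packet:
    """Hypothetical type that defines __bytes__ (not part of this PEP)."""

    def __init__(self, payload):
        self.payload = payload

    def __bytes__(self):
        return b"<" + self.payload + b">"


class Plain:
    """Hypothetical type with neither buffer support nor __bytes__."""


# Objects that export a buffer.
from_memoryview = b"%b" % (memoryview(b"abc"),)
from_bytearray = b"%b" % (bytearray(b"abc"),)
from_array = b"%b" % (array.array("B", [1, 2, 3]),)

# An object that only provides __bytes__.
from_dunder = b"%b" % (Packet(b"abc"),)

# %s is accepted as a synonym for %b.
via_s = b"%s" % (b"abc",)

# Anything else is rejected with a TypeError.
try:
    b"%b" % (Plain(),)
    plain_rejected = False
except TypeError:
    plain_rejected = True
```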

%a will give the equivalent of
repr(some_obj).encode('ascii', 'backslashreplace') on the interpolated
value. Use cases include developing a new protocol and writing landmarks
into the stream; debugging data going into an existing protocol to see
if the problem is the protocol itself or bad data; a fall-back for a
serialization format; or any situation where defining __bytes__ would
not be appropriate but a readable/informative representation is
needed[6].

%r is included as a synonym for %a for the sole purpose of making 2/3
code bases easier to maintain. Python 3 only code should use %a[7].

Examples:

    >>> b'%a' % 3.14
    b'3.14'

    >>> b'%a' % b'abc'
    b"b'abc'"

    >>> b'%a' % 'def'
    b"'def'"
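The stated equivalence for %a can be checked against a few values,
including a str with a non-ASCII character. This is a sketch assuming
Python 3.5 or later; the helper name ascii_repr and the sample values are
illustrative assumptions.

```python
def ascii_repr(obj):
    """Reference expression quoted in the text above."""
    return repr(obj).encode("ascii", "backslashreplace")


samples = [3.14, b"abc", "def", "caf\u00e9", None, [1, "two"]]
matches = [b"%a" % (obj,) == ascii_repr(obj) for obj in samples]

# Non-ASCII characters come out as backslash escapes.
non_ascii = b"%a" % ("caf\u00e9",)

# %r is accepted as a synonym for %a.
via_r = b"%r" % ("def",)
```

Here non_ascii contains the four ASCII characters \xe9 in place of the
accented letter, so the result is always safe to write into an
ASCII-compatible stream.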

Compatibility with Python 2

As noted above, %s and %r are being included solely to help ease
migration from, and/or have a single code base with, Python 2. This is
important as there are modules both in the wild and behind closed doors
that currently use the Python 2 str type as a bytes container, and hence
are using %s as a bytes interpolator.

However, %b and %a should be used in new, Python 3 only code, so %s and
%r will immediately be deprecated, but not removed from the 3.x
series[8].

Proposed variations

It has been proposed to automatically use .encode('ascii','strict') for
str arguments to %b.

-   Rejected as this would lead to intermittent failures. Better to have
    the operation always fail so the trouble-spot can be correctly
    fixed.
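The intermittent failure mode can be illustrated by emulating the
rejected behaviour. The helper format_with_auto_encode and the sample
user names are hypothetical; the sketch assumes Python 3.5 or later.

```python
def format_with_auto_encode(template, value):
    """Emulate the rejected variant: encode str as strict ASCII for %b.

    Hypothetical helper, for illustration only.
    """
    if isinstance(value, str):
        value = value.encode("ascii", "strict")
    return template % (value,)


# Works while the data happens to be pure ASCII...
ok = format_with_auto_encode(b"USER %b\r\n", "alice")

# ...and fails only once non-ASCII data shows up.
try:
    format_with_auto_encode(b"USER %b\r\n", "andr\u00e9")
    failed = False
except UnicodeEncodeError:
    failed = True
```

Whether the call succeeds depends on the data rather than on the code,
which is why an unconditional TypeError for str is easier to diagnose.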

It has been proposed to have %b return the ascii-encoded repr when the
value is a str (b'%b' % 'abc' --> b"'abc'").

-   Rejected as this would lead to hard to debug failures far from the
    problem site. Better to have the operation always fail so the
    trouble-spot can be easily fixed.

Originally this PEP also proposed adding format-style formatting, but it
was decided that format and its related machinery were all strictly text
(aka str) based, and it was dropped.

Various new special methods were proposed, such as __ascii__,
__format_bytes__, etc.; such methods are not needed at this time, but
can be visited again later if real-world use shows deficiencies with
this solution.

A competing PEP, PEP 460 (Add binary interpolation and formatting), also
exists.

Objections

The objections raised against this PEP were mainly variations on two
themes:

-   the bytes and bytearray types are for pure binary data, with no
    assumptions about encodings
-   offering %-interpolation that assumes an ASCII encoding will be an
    attractive nuisance and lead us back to the problems of the Python 2
    str/unicode text model

As was seen during the discussion, bytes and bytearray are also used for
mixed binary data and ASCII-compatible segments: file formats such as
dbf and pdf, network protocols such as ftp and email, etc.

bytes and bytearray already have several methods which assume an
ASCII-compatible encoding: upper(), isalpha(), and expandtabs(), to name
just a few. %-interpolation, with its very restricted mini-language,
will not be any more of a nuisance than the already existing methods.
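The ASCII assumption in the existing methods can be seen in a short
sketch; the sample values are illustrative assumptions.

```python
# Only ASCII letters are affected by case conversion.
upper = b"content-length".upper()
untouched = b"caf\xe9".upper()      # byte 0xE9 is left as-is

# Only ASCII letters count as alphabetic.
alpha_ascii = b"abc".isalpha()
alpha_high = b"\xe9".isalpha()

# Tabs are expanded to ASCII spaces.
tabs = b"a\tb".expandtabs(4)
```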

Some have objected to allowing the full range of numeric formatting
codes with the claim that decimal alone would be sufficient. However, at
least two formats (dbf and pdf) make use of non-decimal numbers.

Footnotes

[1] http://docs.python.org/2/library/stdtypes.html#string-formatting

[2] neither string.Template, format, nor str.format are under
consideration

[3] https://mail.python.org/pipermail/python-dev/2014-January/131518.html

[4] http://docs.python.org/3/c-api/buffer.html (examples: memoryview,
array.array, bytearray, bytes)

[5] http://docs.python.org/3/reference/datamodel.html#object.__bytes__

[6] https://mail.python.org/pipermail/python-dev/2014-February/132750.html

[7] http://bugs.python.org/issue23467 -- originally %r was not allowed,
but was added for consistency during the 3.5 alpha stage.

[8] http://bugs.python.org/issue23467 -- originally %r was not allowed,
but was added for consistency during the 3.5 alpha stage.

Copyright

This document has been placed in the public domain.