PEP: 3137 Title: Immutable Bytes and Mutable Buffer Version: $Revision$
Last-Modified: $Date$ Author: Guido van Rossum <guido@python.org>
Status: Final Type: Standards Track Content-Type: text/x-rst Created:
26-Sep-2007 Python-Version: 3.0 Post-History: 26-Sep-2007, 30-Sep-2007

Introduction

After releasing Python 3.0a1 with a mutable bytes type, pressure mounted
to add a way to represent immutable bytes. Gregory P. Smith proposed a
patch that would allow making a bytes object temporarily immutable by
requesting that the data be locked using the new buffer API from PEP
3118. This did not seem the right approach to me.

Jeffrey Yasskin, with the help of Adam Hupp, then prepared a patch to
make the bytes type immutable (by crudely removing all mutating APIs)
and fix the fall-out in the test suite. This showed that there aren't
all that many places that depend on the mutability of bytes, with the
exception of code that builds up a return value from small pieces.

Thinking through the consequences, and noticing that using the array
module as an ersatz mutable bytes type is far from ideal, and recalling
a proposal put forward earlier by Talin, I floated the suggestion to
have both a mutable and an immutable bytes type. (This had been brought
up before, but until seeing the evidence of Jeffrey's patch I wasn't
open to the suggestion.)

Moreover, a possible implementation strategy became clear: use the old
PyString implementation, stripped down to remove locale support and
implicit conversions to/from Unicode, for the immutable bytes type, and
keep the new PyBytes implementation as the mutable bytes type.

The ensuing discussion made it clear that the idea is welcome but needs
to be specified more precisely. Hence this PEP.

Advantages

One advantage of having an immutable bytes type is that code objects can
use these. It also makes it possible to efficiently create hash tables
using bytes for keys; this may be useful when parsing protocols like
HTTP or SMTP which are based on bytes representing text.

Porting code that manipulates binary data (or encoded text) in Python
2.x will be easier using the new design than using the original 3.0
design with mutable bytes; simply replace str with bytes and change
'...' literals into b'...' literals.

Naming

I propose the following type names at the Python level:

-   bytes is an immutable array of bytes (PyString)
-   bytearray is a mutable array of bytes (PyBytes)
-   memoryview is a bytes view on another object (PyMemory)

The old type named buffer is so similar to the new type memoryview,
introduce by PEP 3118, that it is redundant. The rest of this PEP
doesn't discuss the functionality of memoryview; it is just mentioned
here to justify getting rid of the old buffer type. (An earlier version
of this PEP proposed buffer as the new name for PyBytes; in the end this
name was deemed to confusing given the many other uses of the word
buffer.)

While eventually it makes sense to change the C API names, this PEP
maintains the old C API names, which should be familiar to all.

Summary

Here's a simple ASCII-art table summarizing the type names in various
Python versions:

    +--------------+-------------+------------+--------------------------+
    | C name       | 2.x    repr | 3.0a1 repr | 3.0a2               repr |
    +--------------+-------------+------------+--------------------------+
    | PyUnicode    | unicode u'' | str     '' | str                   '' |
    | PyString     | str      '' | str8   s'' | bytes                b'' |
    | PyBytes      | N/A         | bytes  b'' | bytearray bytearray(b'') |
    | PyBuffer     | buffer      | buffer     | N/A                      |
    | PyMemoryView | N/A         | memoryview | memoryview         <...> |
    +--------------+-------------+------------+--------------------------+

Literal Notations

The b'...' notation introduced in Python 3.0a1 returns an immutable
bytes object, whatever variation is used. To create a mutable array of
bytes, use bytearray(b'...') or bytearray([...]). The latter form takes
a list of integers in range(256).

Functionality

PEP 3118 Buffer API

Both bytes and bytearray implement the PEP 3118 buffer API. The bytes
type only implements read-only requests; the bytearray type allows
writable and data-locked requests as well. The element data type is
always 'B' (i.e. unsigned byte).

Constructors

There are four forms of constructors, applicable to both bytes and
bytearray:

-   bytes(<bytes>), bytes(<bytearray>), bytearray(<bytes>),
    bytearray(<bytearray>): simple copying constructors, with the note
    that bytes(<bytes>) might return its (immutable) argument, but
    bytearray(<bytearray>) always makes a copy.
-   bytes(<str>, <encoding>[, <errors>]),
    bytearray(<str>, <encoding>[, <errors>]): encode a text string. Note
    that the str.encode() method returns an immutable bytes object. The
    <encoding> argument is mandatory; <errors> is optional. <encoding>
    and <errors>, if given, must be str instances.
-   bytes(<memory view>), bytearray(<memory view>): construct a bytes or
    bytearray object from anything that implements the PEP 3118 buffer
    API.
-   bytes(<iterable of ints>), bytearray(<iterable of ints>): construct
    a bytes or bytearray object from a stream of integers in range(256).
-   bytes(<int>), bytearray(<int>): construct a zero-initialized bytes
    or bytearray object of a given length.

Comparisons

The bytes and bytearray types are comparable with each other and
orderable, so that e.g. b'abc' == bytearray(b'abc') < b'abd'.

Comparing either type to a str object for equality returns False
regardless of the contents of either operand. Ordering comparisons with
str raise TypeError. This is all conformant to the standard rules for
comparison and ordering between objects of incompatible types.

(Note: in Python 3.0a1, comparing a bytes instance with a str instance
would raise TypeError, on the premise that this would catch the
occasional mistake quicker, especially in code ported from Python 2.x.
However, a long discussion on the python-3000 list pointed out so many
problems with this that it is clearly a bad idea, to be rolled back in
3.0a2 regardless of the fate of the rest of this PEP.)

Slicing

Slicing a bytes object returns a bytes object. Slicing a bytearray
object returns a bytearray object.

Slice assignment to a bytearray object accepts anything that implements
the PEP 3118 buffer API, or an iterable of integers in range(256).

Indexing

Indexing bytes and bytearray returns small ints (like the bytes type in
3.0a1, and like lists or array.array('B')).

Assignment to an item of a bytearray object accepts an int in
range(256). (To assign from a bytes sequence, use a slice assignment.)

Str() and Repr()

The str() and repr() functions return the same thing for these objects.
The repr() of a bytes object returns a b'...' style literal. The repr()
of a bytearray returns a string of the form "bytearray(b'...')".

Operators

The following operators are implemented by the bytes and bytearray
types, except where mentioned:

-   b1 + b2: concatenation. With mixed bytes/bytearray operands, the
    return type is that of the first argument (this seems arbitrary
    until you consider how += works).
-   b1 += b2: mutates b1 if it is a bytearray object.
-   b * n, n * b: repetition; n must be an integer.
-   b *= n: mutates b if it is a bytearray object.
-   b1 in b2, b1 not in b2: substring test; b1 can be any object
    implementing the PEP 3118 buffer API.
-   i in b, i not in b: single-byte membership test; i must be an
    integer (if it is a length-1 bytes array, it is considered to be a
    substring test, with the same outcome).
-   len(b): the number of bytes.
-   hash(b): the hash value; only implemented by the bytes type.

Note that the % operator is not implemented. It does not appear worth
the complexity.

Methods

The following methods are implemented by bytes as well as bytearray,
with similar semantics. They accept anything that implements the PEP
3118 buffer API for bytes arguments, and return the same type as the
object whose method is called ("self"):

    .capitalize(), .center(), .count(), .decode(), .endswith(),
    .expandtabs(), .find(), .index(), .isalnum(), .isalpha(), .isdigit(),
    .islower(), .isspace(), .istitle(), .isupper(), .join(), .ljust(),
    .lower(), .lstrip(), .partition(), .replace(), .rfind(), .rindex(),
    .rjust(), .rpartition(), .rsplit(), .rstrip(), .split(),
    .splitlines(), .startswith(), .strip(), .swapcase(), .title(),
    .translate(), .upper(), .zfill()

This is exactly the set of methods present on the str type in Python
2.x, with the exclusion of .encode(). The signatures and semantics are
the same too. However, whenever character classes like letter,
whitespace, lower case are used, the ASCII definitions of these classes
are used. (The Python 2.x str type uses the definitions from the current
locale, settable through the locale module.) The .encode() method is
left out because of the more strict definitions of encoding and decoding
in Python 3000: encoding always takes a Unicode string and returns a
bytes sequence, and decoding always takes a bytes sequence and returns a
Unicode string.

In addition, both types implement the class method .fromhex(), which
constructs an object from a string containing hexadecimal values (with
or without spaces between the bytes).

The bytearray type implements these additional methods from the
MutableSequence ABC (see PEP 3119):

  .extend(), .insert(), .append(), .reverse(), .pop(), .remove().

Bytes and the Str Type

Like the bytes type in Python 3.0a1, and unlike the relationship between
str and unicode in Python 2.x, attempts to mix bytes (or bytearray)
objects and str objects without specifying an encoding will raise a
TypeError exception. (However, comparing bytes/bytearray and str objects
for equality will simply return False; see the section on Comparisons
above.)

Conversions between bytes or bytearray objects and str objects must
always be explicit, using an encoding. There are two equivalent APIs:
str(b, <encoding>[, <errors>]) is equivalent to
b.decode(<encoding>[, <errors>]), and bytes(s, <encoding>[, <errors>])
is equivalent to s.encode(<encoding>[, <errors>]).

There is one exception: we can convert from bytes (or bytearray) to str
without specifying an encoding by writing str(b). This produces the same
result as repr(b). This exception is necessary because of the general
promise that any object can be printed, and printing is just a special
case of conversion to str. There is however no promise that printing a
bytes object interprets the individual bytes as characters (unlike in
Python 2.x).

The str type currently implements the PEP 3118 buffer API. While this is
perhaps occasionally convenient, it is also potentially confusing,
because the bytes accessed via the buffer API represent a
platform-depending encoding: depending on the platform byte order and a
compile-time configuration option, the encoding could be UTF-16-BE,
UTF-16-LE, UTF-32-BE, or UTF-32-LE. Worse, a different implementation of
the str type might completely change the bytes representation, e.g. to
UTF-8, or even make it impossible to access the data as a contiguous
array of bytes at all. Therefore, the PEP 3118 buffer API will be
removed from the str type.

The basestring Type

The basestring type will be removed from the language. Code that used to
say isinstance(x, basestring) should be changed to use
isinstance(x, str) instead.

Pickling

Left as an exercise for the reader.

Copyright

This document has been placed in the public domain.



  Local Variables: mode: indented-text indent-tabs-mode: nil
  sentence-end-double-space: t fill-column: 70 coding: utf-8 End: