PEP: 358 Title: The "bytes" Object Version: $Revision$ Last-Modified:
$Date$ Author: Neil Schemenauer <nas@arctrix.com>, Guido van Rossum
<guido@python.org> Status: Final Type: Standards Track Content-Type:
text/x-rst Created: 15-Feb-2006 Python-Version: 2.6, 3.0 Post-History:

Update

This PEP has partially been superseded by PEP 3137.

Abstract

This PEP outlines the introduction of a raw bytes sequence type. Adding
the bytes type is one step in the transition to Unicode-based str
objects which will be introduced in Python 3.0.

The PEP describes how the bytes type should work in Python 2.6, as well
as how it should work in Python 3.0. (Occasionally there are differences
because in Python 2.6, we have two string types, str and unicode, while
in Python 3.0 we will only have one string type, whose name will be str
but whose semantics will be like the 2.6 unicode type.)

Motivation

Python's current string objects are overloaded. They serve to hold both
sequences of characters and sequences of bytes. This overloading of
purpose leads to confusion and bugs. In future versions of Python,
string objects will be used for holding character data. The bytes object
will fulfil the role of a byte container. Eventually the unicode type
will be renamed to str and the old str type will be removed.

Specification

A bytes object stores a mutable sequence of integers that are in the
range 0 to 255. Unlike string objects, indexing a bytes object returns
an integer. Assigning or comparing an object that is not an integer to
an element causes a TypeError exception. Assigning an element to a value
outside the range 0 to 255 causes a ValueError exception. The .__len__()
method of bytes returns the number of integers stored in the sequence
(i.e. the number of bytes).

The constructor of the bytes object has the following signature:

    bytes([initializer[, encoding]])

If no arguments are provided then a bytes object containing zero
elements is created and returned. The initializer argument can be a
string (in 2.6, either str or unicode), an iterable of integers, or a
single integer. The pseudo-code for the constructor (optimized for clear
semantics, not for speed) is:

    def bytes(initializer=0, encoding=None):
        if isinstance(initializer, int): # In 2.6, int -> (int, long)
            initializer = [0]*initializer
        elif isinstance(initializer, basestring):
            if isinstance(initializer, unicode): # In 3.0, "if True"
                if encoding is None:
                    # In 3.0, raise TypeError("explicit encoding required")
                    encoding = sys.getdefaultencoding()
                initializer = initializer.encode(encoding)
            initializer = [ord(c) for c in initializer]
        else:
            if encoding is not None:
                raise TypeError("no encoding allowed for this initializer")
            tmp = []
            for c in initializer:
                if not isinstance(c, int):
                    raise TypeError("initializer must be iterable of ints")
                if not 0 <= c < 256:
                    raise ValueError("initializer element out of range")
                tmp.append(c)
            initializer = tmp
        new = <new bytes object of length len(initializer)>
        for i, c in enumerate(initializer):
            new[i] = c
        return new

The .__repr__() method returns a string that can be evaluated to
generate a new bytes object containing a bytes literal:

    >>> bytes([10, 20, 30])
    b'\n\x14\x1e'

The object has a .decode() method equivalent to the .decode() method of
the str object. The object has a classmethod .fromhex() that takes a
string of characters from the set [0-9a-fA-F ] and returns a bytes
object (similar to binascii.unhexlify). For example:

    >>> bytes.fromhex('5c5350ff')
    b'\\SP\xff'
    >>> bytes.fromhex('5c 53 50 ff')
    b'\\SP\xff'

The object has a .hex() method that does the reverse conversion (similar
to binascii.hexlify):

    >> bytes([92, 83, 80, 255]).hex()
    '5c5350ff'

The bytes object has some methods similar to list methods, and others
similar to str methods. Here is a complete list of methods, with their
approximate signatures:

    .__add__(bytes) -> bytes
    .__contains__(int | bytes) -> bool
    .__delitem__(int | slice) -> None
    .__delslice__(int, int) -> None
    .__eq__(bytes) -> bool
    .__ge__(bytes) -> bool
    .__getitem__(int | slice) -> int | bytes
    .__getslice__(int, int) -> bytes
    .__gt__(bytes) -> bool
    .__iadd__(bytes) -> bytes
    .__imul__(int) -> bytes
    .__iter__() -> iterator
    .__le__(bytes) -> bool
    .__len__() -> int
    .__lt__(bytes) -> bool
    .__mul__(int) -> bytes
    .__ne__(bytes) -> bool
    .__reduce__(...) -> ...
    .__reduce_ex__(...) -> ...
    .__repr__() -> str
    .__reversed__() -> bytes
    .__rmul__(int) -> bytes
    .__setitem__(int | slice, int | iterable[int]) -> None
    .__setslice__(int, int, iterable[int]) -> Bote
    .append(int) -> None
    .count(int) -> int
    .decode(str) -> str | unicode # in 3.0, only str
    .endswith(bytes) -> bool
    .extend(iterable[int]) -> None
    .find(bytes) -> int
    .index(bytes | int) -> int
    .insert(int, int) -> None
    .join(iterable[bytes]) -> bytes
    .partition(bytes) -> (bytes, bytes, bytes)
    .pop([int]) -> int
    .remove(int) -> None
    .replace(bytes, bytes) -> bytes
    .rindex(bytes | int) -> int
    .rpartition(bytes) -> (bytes, bytes, bytes)
    .split(bytes) -> list[bytes]
    .startswith(bytes) -> bool
    .reverse() -> None
    .rfind(bytes) -> int
    .rindex(bytes | int) -> int
    .rsplit(bytes) -> list[bytes]
    .translate(bytes, [bytes]) -> bytes

Note the conspicuous absence of .isupper(), .upper(), and friends. (But
see "Open Issues" below.) There is no .__hash__() because the object is
mutable. There is no use case for a .sort() method.

The bytes type also supports the buffer interface, supporting reading
and writing binary (but not character) data.

Out of Scope Issues

-   Python 3k will have a much different I/O subsystem. Deciding how
    that I/O subsystem will work and interact with the bytes object is
    out of the scope of this PEP. The expectation however is that binary
    I/O will read and write bytes, while text I/O will read strings.
    Since the bytes type supports the buffer interface, the existing
    binary I/O operations in Python 2.6 will support bytes objects.
-   It has been suggested that a special method named .__bytes__() be
    added to the language to allow objects to be converted into byte
    arrays. This decision is out of scope.
-   A bytes literal of the form b"..." is also proposed. This is the
    subject of PEP 3112.

Open Issues

-   The .decode() method is redundant since a bytes object b can also be
    decoded by calling unicode(b, <encoding>) (in 2.6) or
    str(b, <encoding>) (in 3.0). Do we need encode/decode methods at
    all? In a sense the spelling using a constructor is cleaner.
-   Need to specify the methods still more carefully.
-   Pickling and marshalling support need to be specified.
-   Should all those list methods really be implemented?
-   A case could be made for supporting .ljust(), .rjust(), .center()
    with a mandatory second argument.
-   A case could be made for supporting .split() with a mandatory
    argument.
-   A case could even be made for supporting .islower(), .isupper(),
    .isspace(), .isalpha(), .isalnum(), .isdigit() and the corresponding
    conversions (.lower() etc.), using the ASCII definitions for
    letters, digits and whitespace. If this is accepted, the cases for
    .ljust(), .rjust(), .center() and .split() become much stronger, and
    they should have default arguments as well, using an ASCII space or
    all ASCII whitespace (for .split()).

Frequently Asked Questions

Q: Why have the optional encoding argument when the encode method of
Unicode objects does the same thing?

A: In the current version of Python, the encode method returns a str
object and we cannot change that without breaking code. The construct
bytes(s.encode(...)) is expensive because it has to copy the byte
sequence multiple times. Also, Python generally provides two ways of
converting an object of type A into an object of type B: ask an A
instance to convert itself to a B, or ask the type B to create a new
instance from an A. Depending on what A and B are, both APIs make sense;
sometimes reasons of decoupling require that A can't know about B, in
which case you have to use the latter approach; sometimes B can't know
about A, in which case you have to use the former.

Q: Why does bytes ignore the encoding argument if the initializer is a
str? (This only applies to 2.6.)

A: There is no sane meaning that the encoding can have in that case. str
objects are byte arrays and they know nothing about the encoding of
character data they contain. We need to assume that the programmer has
provided a str object that already uses the desired encoding. If you
need something other than a pure copy of the bytes then you need to
first decode the string. For example:

    bytes(s.decode(encoding1), encoding2)

Q: Why not have the encoding argument default to Latin-1 (or some other
encoding that covers the entire byte range) rather than ASCII?

A: The system default encoding for Python is ASCII. It seems least
confusing to use that default. Also, in Py3k, using Latin-1 as the
default might not be what users expect. For example, they might prefer a
Unicode encoding. Any default will not always work as expected. At least
ASCII will complain loudly if you try to encode non-ASCII data.

Copyright

This document has been placed in the public domain.