PEP: 3112 Title: Bytes literals in Python 3000 Version: $Revision$
Last-Modified: $Date$ Author: Jason Orendorff
<jason.orendorff@gmail.com> Status: Final Type: Standards Track
Content-Type: text/x-rst Requires: 358 Created: 23-Feb-2007
Python-Version: 3.0 Post-History: 23-Feb-2007

Abstract

This PEP proposes a literal syntax for the bytes objects introduced in
PEP 358. The purpose is to provide a convenient way to spell ASCII
strings and arbitrary binary data.

Motivation

Existing spellings of an ASCII string in Python 3000 include:

    bytes('Hello world', 'ascii')
    'Hello world'.encode('ascii')

The proposed syntax is:

    b'Hello world'

Existing spellings of an 8-bit binary sequence in Python 3000 include:

    bytes([0x7f, 0x45, 0x4c, 0x46, 0x01, 0x01, 0x01, 0x00])
    bytes('\x7fELF\x01\x01\x01\0', 'latin-1')
    '7f454c4601010100'.decode('hex')

The proposed syntax is:

    b'\x7f\x45\x4c\x46\x01\x01\x01\x00'
    b'\x7fELF\x01\x01\x01\0'

In both cases, the advantages of the new syntax are brevity, some small
efficiency gain, and the detection of encoding errors at compile time
rather than at runtime. The brevity benefit is especially felt when
using the string-like methods of bytes objects:

    lines = bdata.split(bytes('\n', 'ascii'))  # existing syntax
    lines = bdata.split(b'\n')  # proposed syntax

And when converting code from Python 2.x to Python 3000:

    sok.send('EXIT\r\n')  # Python 2.x
    sok.send('EXIT\r\n'.encode('ascii'))  # Python 3000 existing
    sok.send(b'EXIT\r\n')  # proposed

Grammar Changes

The proposed syntax is an extension of the existing string syntax[1].

The new syntax for strings, including the new bytes literal, is:

    stringliteral: [stringprefix] (shortstring | longstring)
    stringprefix: "b" | "r" | "br" | "B" | "R" | "BR" | "Br" | "bR"
    shortstring: "'" shortstringitem* "'" | '"' shortstringitem* '"'
    longstring: "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
    shortstringitem: shortstringchar | escapeseq
    longstringitem: longstringchar | escapeseq
    shortstringchar:
      <any source character except "\" or newline or the quote>
    longstringchar: <any source character except "\">
    escapeseq: "\" NL
      | "\\" | "\'" | '\"'
      | "\a" | "\b" | "\f" | "\n" | "\r" | "\t" | "\v"
      | "\ooo" | "\xhh"
      | "\uxxxx" | "\Uxxxxxxxx" | "\N{name}"

The following additional restrictions apply only to bytes literals
(stringliteral tokens with b or B in the stringprefix):

-   Each shortstringchar or longstringchar must be a character between 1
    and 127 inclusive, regardless of any encoding declaration[2] in the
    source file.
-   The Unicode-specific escape sequences \u*xxxx*, \U*xxxxxxxx*, and
    \N{*name*} are unrecognized in Python 2.x and forbidden in Python
    3000.

Adjacent bytes literals are subject to the same concatenation rules as
adjacent string literals[3]. A bytes literal adjacent to a string
literal is an error.

Semantics

Each evaluation of a bytes literal produces a new bytes object. The
bytes in the new object are the bytes represented by the shortstringitem
or longstringitem parts of the literal, in the same order.

Rationale

The proposed syntax provides a cleaner migration path from Python 2.x to
Python 3000 for most code involving 8-bit strings. Preserving the old
8-bit meaning of a string literal is usually as simple as adding a b
prefix. The one exception is Python 2.x strings containing bytes >127,
which must be rewritten using escape sequences. Transcoding a source
file from one encoding to another, and fixing up the encoding
declaration, should preserve the meaning of the program. Python 2.x
non-Unicode strings violate this principle; Python 3000 bytes literals
shouldn't.

A string literal with a b in the prefix is always a syntax error in
Python 2.5, so this syntax can be introduced in Python 2.6, along with
the bytes type.

A bytes literal produces a new object each time it is evaluated, like
list displays and unlike string literals. This is necessary because
bytes literals, like lists and unlike strings, are mutable[4].

Reference Implementation

Thomas Wouters has checked an implementation into the Py3K branch,
r53872.

References

Copyright

This document has been placed in the public domain.

[1] http://docs.python.org/reference/lexical_analysis.html#string-literals

[2] http://docs.python.org/reference/lexical_analysis.html#encoding-declarations

[3] http://docs.python.org/reference/lexical_analysis.html#string-literal-concatenation

[4] https://mail.python.org/pipermail/python-3000/2007-February/005779.html