PEP: 383
Title: Non-decodable Bytes in System Character Interfaces
Author: Martin von Löwis <martin@v.loewis.de>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 22-Apr-2009
Python-Version: 3.1
Post-History:

Abstract

File names, environment variables, and command line arguments are
defined as being character data in POSIX; the C APIs, however, allow
passing arbitrary bytes, whether or not these conform to a certain
encoding. This PEP proposes a means of dealing with such irregularities
by embedding the bytes in character strings in a way that allows
recreation of the original byte string.

Rationale

The C char type is a data type that is commonly used to represent both
character data and bytes. Certain POSIX interfaces are specified and
widely understood as operating on character data; however, the system
call interfaces make no assumption about the encoding of these data, and
pass them on as-is. With Python 3, character strings use a Unicode-based
internal representation, making it difficult to ignore the encoding of
byte strings in the same way that the C interfaces can ignore the
encoding.

On the other hand, Microsoft Windows NT has corrected the original
design limitation of Unix, and made it explicit in its system interfaces
that these data (file names, environment variables, command line
arguments) are indeed character data, by providing a Unicode-based API
(keeping a C-char-based one for backwards compatibility).

For Python 3, one proposed solution is to provide two sets of APIs: a
byte-oriented one, and a character-oriented one, where the
character-oriented one would be limited to not being able to represent
all data accurately. Unfortunately, for Windows, the situation would be
exactly the opposite: the byte-oriented interface cannot represent all
data; only the character-oriented API can. As a consequence, libraries
and applications that want to support all user data in a cross-platform
manner have to accept a mish-mash of bytes and characters, exactly in
the way that caused endless troubles for Python 2.x.

With this PEP, a uniform treatment of these data as characters becomes
possible. The uniformity is achieved by using specific encoding
algorithms, meaning that the data can be converted back to bytes on
POSIX systems only if the same encoding is used.
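
The dependence on the encoding can be illustrated with a minimal sketch.
It assumes the "surrogateescape" error handler defined in the
Specification section; the sample bytes and the two codec names are
chosen only for illustration.

```python
# Sample file name bytes that are not valid UTF-8 (0xE9 is Latin-1 'é').
raw = b"caf\xe9"

# Decode with the error handler this PEP introduces.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same codec and handler restores the original bytes.
same_codec = text.encode("utf-8", "surrogateescape")

# A valid UTF-8 name that is decoded as UTF-8 but encoded as Latin-1
# does not round-trip: the encoding has to match in both directions.
utf8_name = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'
other_codec = utf8_name.decode("utf-8", "surrogateescape").encode(
    "latin-1", "surrogateescape"
)

print(same_codec == raw)         # True
print(other_codec == utf8_name)  # False
```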

Being able to treat such strings uniformly will allow application
writers to abstract from details specific to the operating system, and
reduce the risk of one API failing when the other API would have
worked.

Specification

On Windows, Python uses the wide character APIs to access
character-oriented APIs, allowing direct conversion of the environmental
data to Python str objects (PEP 277).

On POSIX systems, Python currently applies the locale's encoding to
convert the byte data to Unicode, failing for bytes that cannot be
decoded. With this PEP, non-decodable bytes >= 128 will be represented
as lone surrogate codes U+DC80..U+DCFF. Bytes below 128 will produce
exceptions; see the discussion below.
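
The mapping amounts to adding the byte value to 0xDC00. The sketch below
spells that arithmetic out; escape_byte is a hypothetical helper written
for this illustration, not part of any Python API, and the ASCII codec is
used only because every byte >= 128 is non-decodable in it.

```python
def escape_byte(value):
    """Return the lone surrogate standing for one non-decodable byte.

    Hypothetical helper for illustration only.
    """
    if not 128 <= value <= 255:
        raise ValueError("only bytes 128..255 are escaped")
    return chr(0xDC00 + value)


# The two ends of the range named in the text.
lowest = escape_byte(0x80)   # U+DC80
highest = escape_byte(0xFF)  # U+DCFF

# The built-in error handler produces the same code points.
decoded = b"\x80\xff".decode("ascii", "surrogateescape")
print(decoded == lowest + highest)  # True
```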

To convert non-decodable bytes, a new error handler (PEP 293)
"surrogateescape" is introduced, which produces these surrogates. On
encoding, the error handler converts the surrogate back to the
corresponding byte. This error handler will be used in any API that
receives or produces file names, command line arguments, or environment
variables.
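
As a rough illustration of the round trip, the following sketch decodes a
made-up file name that mixes valid UTF-8 with stray Latin-1 bytes. UTF-8
is assumed here as the locale encoding; a real system may use another
one.

```python
# Made-up name: two Latin-1 bytes (0xE9) plus a valid UTF-8 euro sign.
raw_name = b"r\xe9sum\xe9-\xe2\x82\xac.txt"

# Decoding: valid sequences become ordinary characters, each
# non-decodable byte becomes one lone surrogate.
name = raw_name.decode("utf-8", "surrogateescape")

# Encoding: the handler turns each surrogate back into its byte.
restored = name.encode("utf-8", "surrogateescape")

print(ascii(name))           # 'r\udce9sum\udce9-\u20ac.txt'
print(restored == raw_name)  # True
```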

The error handler interface is extended to allow the encode error
handler to return byte strings immediately, in addition to returning
Unicode strings which then get encoded again (also see the discussion
below).
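
The extended interface can be exercised from Python code with
codecs.register_error. The sketch below is a simplified re-creation of
the encoding half of the handler, registered under the made-up name
"demo_surrogateescape"; it is not the actual implementation.

```python
import codecs


def demo_surrogateescape(exc):
    """Encode-only handler that returns bytes instead of a str."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        code = ord(ch)
        if not 0xDC80 <= code <= 0xDCFF:
            # Not one of the escape surrogates: report the error.
            raise exc
        out.append(code - 0xDC00)
    # The bytes are copied to the output as-is; encoding resumes at
    # exc.end.
    return bytes(out), exc.end


codecs.register_error("demo_surrogateescape", demo_surrogateescape)

encoded = "caf\udce9".encode("utf-8", "demo_surrogateescape")
print(encoded)  # b'caf\xe9'
```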

Byte-oriented interfaces that already exist in Python 3.0 are not
affected by this specification. They are neither enhanced nor
deprecated.

External libraries that operate on file names (such as GUI file
choosers) should also encode them according to this PEP.

Discussion

This surrogateescape encoding is based on Markus Kuhn's idea, which he
called UTF-8b [1].

While providing a uniform API to non-decodable bytes, this interface has
the limitation that the chosen representation only "works" if the data
are also converted back to bytes with the surrogateescape error handler.
Encoding the data with the locale's encoding and the (default) strict
error handler will raise an exception; encoding them with UTF-8 will
produce nonsensical data.
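
The strict-handler case can be sketched as follows. ASCII stands in for
the locale's encoding here, which is an assumption made only to keep the
example independent of the machine it runs on.

```python
# A string carrying one escaped byte (0xE9).
text = b"caf\xe9".decode("ascii", "surrogateescape")

# The default (strict) error handler refuses the lone surrogate.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Only the matching error handler recovers the original bytes.
restored = text.encode("ascii", "surrogateescape")

print(strict_failed)  # True
print(restored)       # b'caf\xe9'
```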

Data obtained from other sources may conflict with data produced by this
PEP. Dealing with such conflicts is out of scope of the PEP.

This PEP allows the possibility of "smuggling" bytes in character
strings. This would be a security risk if the bytes are
security-critical when interpreted as characters on a target system,
such as path name separators. For this reason, the PEP rejects smuggling
bytes below 128. If the target system uses EBCDIC, such smuggled bytes
may still be a security risk, allowing smuggling of e.g. square brackets
or the backslash. Python currently does not support EBCDIC, so this
should not be a problem in practice. Anybody porting Python to an EBCDIC
system might want to adjust the error handlers, or come up with other
approaches to address the security risks.

Encodings that are not compatible with ASCII are not supported by this
specification; bytes in the ASCII range that fail to decode will cause
an exception. It is widely agreed that such encodings should not be used
as locale charsets.
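
A small sketch of this refusal, using UTF-16-LE as an example of an
encoding that is not ASCII-compatible. It assumes that the codec reports
a single trailing byte as a decoding error, as CPython's does; try_decode
is a hypothetical helper for this illustration.

```python
def try_decode(data, encoding):
    """Decode with surrogateescape; return None if decoding fails."""
    try:
        return data.decode(encoding, "surrogateescape")
    except UnicodeDecodeError:
        return None


# A byte >= 128 that fails to decode is escaped as a lone surrogate.
high = try_decode(b"\xff", "ascii")

# A lone byte 0x61 is incomplete in UTF-16-LE. Because it lies in the
# ASCII range, the handler declines to escape it and the error stands.
low = try_decode(b"a", "utf-16-le")

print(ascii(high))  # '\udcff'
print(low)          # None
```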

We assume that most applications eventually pass data received from a
system interface back into the same system interface. For example, an
application invoking os.listdir() will likely pass the result strings
back into APIs like os.stat() or open(), which then encode them back
into their original byte representation. Applications
that need to process the original byte strings can obtain them by
encoding the character strings with the file system encoding, passing
"surrogateescape" as the error handler name. For example, a function
that works like os.listdir, except for accepting and returning bytes,
would be written as:

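    import os
    import sys

    def listdir_b(dirname):
        # dirname is bytes; decode it the way the system interfaces do
        fse = sys.getfilesystemencoding()
        dirname = dirname.decode(fse, "surrogateescape")
        for fn in os.listdir(dirname):
            # fn is now a str object; encode it back to the original bytes
            yield fn.encode(fse, "surrogateescape")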

The extension to the encode error handler interface proposed by this PEP
is necessary to implement the 'surrogateescape' error handler, because
there are required byte sequences which cannot be generated from
replacement Unicode. However, the encode error handler interface
presently requires replacement Unicode to be provided in lieu of the
non-encodable Unicode from the source string. Then it promptly encodes
that replacement Unicode. In some error handlers, such as the
'surrogateescape' proposed here, it is also simpler and more efficient
for the error handler to provide a pre-encoded replacement byte string,
rather than forcing it to calculate Unicode from which the encoder
would create the desired bytes.

A few alternative approaches have been proposed:

-   create a new string subclass that supports embedded bytes
-   use different escape schemes, such as escaping with a NUL character,
    or mapping to infrequent characters.

Of these proposals, the approach of escaping each byte XX with the
sequence U+0000 U+00XX has the disadvantage that encoding to UTF-8 will
introduce a NUL byte in the UTF-8 sequence. As a consequence, C
libraries may interpret this as a string termination, even though the
string continues. In particular, the GTK libraries will truncate text in
this case; other libraries may show similar problems.
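
The truncation problem can be made concrete with a minimal sketch.
nul_escape is a hypothetical helper that emulates the rejected NUL
scheme; splitting at the first NUL byte stands in for how a C library
would read a NUL-terminated string.

```python
def nul_escape(raw, encoding="utf-8"):
    """Emulate the rejected scheme: byte XX becomes U+0000 U+00XX."""
    parts = []
    for ch in raw.decode(encoding, "surrogateescape"):
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Re-express the escaped byte in the NUL-based scheme.
            parts.append("\x00" + chr(code - 0xDC00))
        else:
            parts.append(ch)
    return "".join(parts)


escaped = nul_escape(b"caf\xe9.txt")
as_utf8 = escaped.encode("utf-8")

# What a C function that stops at the first NUL byte would see.
c_view = as_utf8.split(b"\x00", 1)[0]

print(as_utf8)  # b'caf\x00\xc3\xa9.txt'
print(c_view)   # b'caf'
```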

References

[1] UTF-8b
https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

Copyright

This document has been placed in the public domain.