PEP 383 – Non-decodable Bytes in System Character Interfaces
- Martin v. Löwis <martin at v.loewis.de>
- Standards Track
File names, environment variables, and command line arguments are defined as being character data in POSIX; the C APIs, however, allow passing arbitrary bytes, whether or not they conform to a particular encoding. This PEP proposes a means of dealing with such irregularities by embedding the bytes in character strings in a way that allows the original byte string to be recreated.
The C char type is a data type that is commonly used to represent both character data and bytes. Certain POSIX interfaces are specified and widely understood as operating on character data; however, the system call interfaces make no assumption about the encoding of these data, and pass them on as-is. With Python 3, character strings use a Unicode-based internal representation, making it difficult to ignore the encoding of byte strings in the same way that the C interfaces can ignore the encoding.
On the other hand, Microsoft Windows NT has corrected the original design limitation of Unix, and made it explicit in its system interfaces that these data (file names, environment variables, command line arguments) are indeed character data, by providing a Unicode-based API (keeping a C-char-based one for backwards compatibility).
For Python 3, one proposed solution is to provide two sets of APIs: a byte-oriented one, and a character-oriented one, where the character-oriented one would not be able to represent all data accurately. Unfortunately, for Windows, the situation would be exactly the opposite: the byte-oriented interface cannot represent all data; only the character-oriented API can. As a consequence, libraries and applications that want to support all user data in a cross-platform manner have to accept a mish-mash of bytes and characters, exactly in the way that caused endless trouble for Python 2.x.
With this PEP, a uniform treatment of these data as characters becomes possible. The uniformity is achieved by using specific encoding algorithms, meaning that the data can be converted back to bytes on POSIX systems only if the same encoding is used.
Being able to treat such strings uniformly will allow application writers to abstract from details specific to the operating system, and will reduce the risk of one API failing when the other API would have worked.
On Windows, Python uses the wide character APIs to access character-oriented APIs, allowing direct conversion of the environmental data to Python str objects (PEP 277).
On POSIX systems, Python currently applies the locale’s encoding to convert the byte data to Unicode, failing for characters that cannot be decoded. With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF. Bytes below 128 will produce exceptions; see the discussion below.
To convert non-decodable bytes, a new error handler (PEP 293) “surrogateescape” is introduced, which produces these surrogates. On encoding, the error handler converts the surrogate back to the corresponding byte. This error handler will be used in any API that receives or produces file names, command line arguments, or environment variables.
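The round trip described above can be illustrated with a minimal sketch. It assumes a UTF-8 locale encoding and an example byte string (a Latin-1 encoded name, chosen here purely for illustration) that is not valid UTF-8:

```python
# Example input (an assumption for illustration): 0xE9 is "é" in Latin-1,
# but it is not a valid byte sequence in UTF-8.
raw = b"caf\xe9.txt"

# Decoding: the non-decodable byte 0xE9 becomes the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# Encoding: the surrogate is converted back to the original byte.
restored = name.encode("utf-8", "surrogateescape")
```

The resulting str has the same length as the input, since each non-decodable byte maps to exactly one surrogate code point.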
The error handler interface is extended to allow the encode error handler to return byte strings immediately, in addition to returning Unicode strings which then get encoded again (also see the discussion below).
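As a sketch of what this extension permits, the following registers a hypothetical encode error handler that returns a byte string directly. The handler name "raw_ff_sketch" and its behavior are assumptions made for illustration; they are not part of this PEP:

```python
import codecs

def raw_ff_handler(exc):
    """Hypothetical handler: emit byte 0xFF for each unencodable character."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # Return bytes instead of str; the codec copies them to the output
    # without encoding them again.
    return b"\xff" * (exc.end - exc.start), exc.end

codecs.register_error("raw_ff_sketch", raw_ff_handler)

encoded = "na\u00efve".encode("ascii", "raw_ff_sketch")
```

No replacement str could make the ASCII codec emit the byte 0xFF, which is why returning bytes is needed for handlers of this kind.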
Byte-oriented interfaces that already exist in Python 3.0 are not affected by this specification. They are neither enhanced nor deprecated.
External libraries that operate on file names (such as GUI file choosers) should also encode them according to the PEP.
This surrogateescape encoding is based on Markus Kuhn’s idea that he called UTF-8b.
While providing a uniform API to non-decodable bytes, this interface has the limitation that the chosen representation only “works” if the data get converted back to bytes with the surrogateescape error handler also. Encoding the data with the locale’s encoding and the (default) strict error handler will raise an exception; encoding them with UTF-8 will produce nonsensical data.
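The limitation can be observed with a short sketch, again assuming UTF-8 as the encoding and an illustrative input. Note that in current CPython the UTF-8 codec itself refuses lone surrogates under the strict handler, so the strict case below raises rather than producing output:

```python
name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")

# Strict encoding refuses the lone surrogate.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# A lossy handler succeeds, but the original byte is gone.
lossy = name.encode("utf-8", "replace")

# Only surrogateescape recovers the original bytes.
restored = name.encode("utf-8", "surrogateescape")
```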
Data obtained from other sources may conflict with data produced by this PEP. Dealing with such conflicts is out of scope of the PEP.
This PEP allows the possibility of “smuggling” bytes in character strings. This would be a security risk if the bytes are security-critical when interpreted as characters on a target system, such as path name separators. For this reason, the PEP rejects smuggling bytes below 128. If the target system uses EBCDIC, such smuggled bytes may still be a security risk, allowing smuggling of e.g. square brackets or the backslash. Python currently does not support EBCDIC, so this should not be a problem in practice. Anybody porting Python to an EBCDIC system might want to adjust the error handlers, or come up with other approaches to address the security risks.
Encodings that are not compatible with ASCII are not supported by this specification; bytes in the ASCII range that fail to decode will cause an exception. It is widely agreed that such encodings should not be used as locale charsets.
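The refusal to escape bytes below 128 can be sketched as follows. UTF-16-LE is used here only as a convenient example of an encoding that is not ASCII-compatible; try_decode is a hypothetical helper, not an API proposed by this PEP:

```python
def try_decode(data, encoding):
    """Hypothetical helper: return the decoded str, or None on failure."""
    try:
        return data.decode(encoding, "surrogateescape")
    except UnicodeDecodeError:
        return None

# A non-decodable byte >= 128 is escaped as a lone surrogate.
high = try_decode(b"\xff", "ascii")

# A single byte is truncated UTF-16-LE data; the offending byte 0x61 is
# below 128, so the error handler declines to escape it.
low = try_decode(b"a", "utf-16-le")
```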
For most applications, we assume that they eventually pass data received from a system interface back into the same system interfaces. For example, an application invoking os.listdir() will likely pass the result strings back into APIs like os.stat() or open(), which then encode them back into their original byte representation. Applications that need to process the original byte strings can obtain them by encoding the character strings with the file system encoding, passing “surrogateescape” as the error handler name. For example, a function that works like os.listdir, except for accepting and returning bytes, would be written as:
    import os
    import sys

    def listdir_b(dirname):
        fse = sys.getfilesystemencoding()
        # Decode the bytes argument so it can be passed to the str-based API.
        dirname = dirname.decode(fse, "surrogateescape")
        for fn in os.listdir(dirname):
            # fn is now a str object
            yield fn.encode(fse, "surrogateescape")
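Later Python versions (3.2 and newer) provide os.fsencode() and os.fsdecode(), which wrap this conversion. The sketch below assumes a POSIX system, where these functions use the file system encoding with the surrogateescape handler; on Windows the error handler differs, so the round trip shown here is not guaranteed there. The example name is an assumption for illustration:

```python
import os

raw = b"caf\xe9.txt"              # illustrative name, not valid UTF-8

name = os.fsdecode(raw)           # bytes -> str, file system encoding
round_tripped = os.fsencode(name) # str -> bytes, same encoding and handler
```

No file is created or accessed; only the conversion between the two representations is exercised.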
The extension to the encode error handler interface proposed by this PEP is necessary to implement the ‘surrogateescape’ error handler, because there are required byte sequences which cannot be generated from replacement Unicode. However, the encode error handler interface presently requires replacement Unicode to be provided in lieu of the non-encodable Unicode from the source string. Then it promptly encodes that replacement Unicode. In some error handlers, such as the ‘surrogateescape’ proposed here, it is also simpler and more efficient for the error handler to provide a pre-encoded replacement byte string, rather than forcing it to calculate Unicode from which the encoder would create the desired bytes.
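To make the mechanism concrete, here is a simplified pure-Python sketch of such a handler. It is not the actual implementation: the name "surrogateescape_sketch" is hypothetical, and the real handler differs in details (for example, in how many bytes it consumes per call):

```python
import codecs

def surrogateescape_sketch(exc):
    """Simplified illustration of the surrogateescape idea."""
    bad = exc.object[exc.start:exc.end]
    if isinstance(exc, UnicodeDecodeError):
        chars = []
        for byte in bad:
            if byte < 128:
                raise exc  # never escape bytes in the ASCII range
            chars.append(chr(0xDC00 + byte))
        return "".join(chars), exc.end
    if isinstance(exc, UnicodeEncodeError):
        out = bytearray()
        for ch in bad:
            code = ord(ch)
            if not 0xDC80 <= code <= 0xDCFF:
                raise exc  # not one of the escape surrogates
            out.append(code - 0xDC00)
        # Return bytes directly, using the extended handler interface.
        return bytes(out), exc.end
    raise exc

codecs.register_error("surrogateescape_sketch", surrogateescape_sketch)

decoded = b"ab\xff\xfe".decode("utf-8", "surrogateescape_sketch")
encoded = decoded.encode("utf-8", "surrogateescape_sketch")
```

The encode branch relies on returning bytes: a byte such as 0xFF never appears in valid UTF-8 output, so no replacement str could yield it.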
A few alternative approaches have been proposed:
- create a new string subclass that supports embedded bytes
- use different escape schemes, such as escaping with a NUL character, or mapping to infrequent characters.
Of these proposals, the approach of escaping each byte XX with the sequence U+0000 U+00XX has the disadvantage that encoding to UTF-8 will introduce a NUL byte in the UTF-8 sequence. As a consequence, C libraries may interpret this as a string termination, even though the string continues. In particular, the gtk libraries will truncate text in this case; other libraries may show similar problems.
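The truncation problem with the NUL-based scheme can be sketched as follows. nul_escape is a hypothetical helper implementing the rejected U+0000 U+00XX escaping, and ctypes.c_char_p stands in for a C library that treats its input as a NUL-terminated string:

```python
import ctypes

def nul_escape(data):
    """Hypothetical helper: escape each byte XX >= 128 as U+0000 U+00XX."""
    return "".join(chr(b) if b < 128 else "\x00" + chr(b) for b in data)

escaped = nul_escape(b"caf\xe9.txt")
utf8 = escaped.encode("utf-8")

# Reading the buffer back as a C string stops at the first NUL byte.
seen_by_c = ctypes.c_char_p(utf8).value
```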
- UTF-8b http://permalink.gmane.org/gmane.comp.internationalization.linux/920
This document has been placed in the public domain.
Last modified: 2022-01-21 11:03:51 GMT