PEP 349 – Allow str() to return unicode strings
- Neil Schemenauer <nas at arctrix.com>
- Standards Track
- Python-Dev message
Table of Contents
This PEP proposes to change the
str() built-in function so that it
can return unicode strings. This change would make it easier to
write code that works with either string type and would also make
some existing code handle unicode strings. The C function
PyObject_Str() would remain unchanged and the function
PyString_New() would be added instead.
Python has had a Unicode string type for some time now but use of it is not yet widespread. There is a large amount of Python code that assumes that string data is represented as str instances. The long-term plan for Python is to phase out the str type and use unicode for all string data. Clearly, a smooth migration path must be provided.
We need to upgrade existing libraries, written for str instances, to be made capable of operating in an all-unicode string world. We can’t change to an all-unicode world until all essential libraries are made capable for it. Upgrading the libraries in one shot does not seem feasible. A more realistic strategy is to individually make the libraries capable of operating on unicode strings while preserving their current all-str environment behaviour.
First, we need to be able to write code that can accept unicode instances without attempting to coerce them to str instances. Let us label such code as Unicode-safe. Unicode-safe libraries can be used in an all-unicode world.
Second, we need to be able to write code that, when provided only str instances, will not create unicode results. Let us label such code as str-stable. Libraries that are str-stable can be used by libraries and applications that are not yet Unicode-safe.
Sometimes it is simple to write code that is both str-stable and Unicode-safe. For example, the following function just works:
def appendx(s): return s + 'x'
That’s not too surprising since the unicode type is designed to
make the task easier. The principle is that when str and unicode
instances meet, the result is a unicode instance. One notable
difficulty arises when code requires a string representation of an
object; an operation traditionally accomplished by using the
Using the current
str() function makes the code not Unicode-safe.
str() call with a
unicode() call makes the code not
str() so that it could return unicode
instances would solve this problem. As a further benefit, some code
that is currently not Unicode-safe because it uses
A Python implementation of the
str() built-in follows:
def str(s): """Return a nice string representation of the object. The return value is a str or unicode instance. """ if type(s) is str or type(s) is unicode: return s r = s.__str__() if not isinstance(r, (str, unicode)): raise TypeError('__str__ returned non-string') return r
The following function would be added to the C API and would be the
equivalent to the
str() built-in (ideally it be called
but changing that function could cause a massive number of
PyObject *PyString_New(PyObject *);
A reference implementation is available on Sourceforge  as a patch.
Some code may require that
str() returns a str instance. In the
standard library, only one such case has been found so far. The
email.header_decode() requires a str instance and the
email.Header.decode_header() function tries to ensure this by
str() on its argument. The code was fixed by changing
the line “header = str(header)” to:
if isinstance(header, unicode): header = header.encode('ascii')
Whether this is truly a bug is questionable since
really operates on byte strings, not character strings. Code that
passes it a unicode instance could itself be considered buggy.
A new built-in function could be added instead of changing
Doing so would introduce virtually no backwards compatibility
problems. However, since the compatibility problems are expected to
str() seems preferable to adding a new built-in.
The basestring type could be changed to have the proposed behaviour,
rather than changing
str(). However, that would be confusing
behaviour for an abstract base type.
This document has been placed in the public domain.
Last modified: 2021-02-03 14:06:23+00:00 GMT