PEP: 349
Title: Allow str() to return unicode strings
Version: $Revision$
Last-Modified: $Date$
Author: Neil Schemenauer <nas@arctrix.com>
Status: Rejected
Type: Standards Track
Content-Type: text/x-rst
Created: 02-Aug-2005
Python-Version: 2.5
Post-History: 06-Aug-2005
Resolution: https://mail.python.org/archives/list/python-dev@python.org/message/M2Y3PUFLAE23NPRJPVBYF6P5LW5LVN6F/

Abstract

This PEP proposes to change the str() built-in function so that it can
return unicode strings. This change would make it easier to write code
that works with either string type and would also make some existing
code handle unicode strings. The C function PyObject_Str() would remain
unchanged and the function PyString_New() would be added instead.

Rationale

Python has had a Unicode string type for some time now but use of it is
not yet widespread. There is a large amount of Python code that assumes
that string data is represented as str instances. The long-term plan for
Python is to phase out the str type and use unicode for all string data.
Clearly, a smooth migration path must be provided.

We need to upgrade existing libraries, written for str instances, so
that they can operate in an all-unicode string world. We can't change
to an all-unicode world until all essential libraries are capable of
it. Upgrading the libraries in one shot does not seem feasible. A more
realistic strategy is to make the libraries capable of operating on
unicode strings one at a time, while preserving their current behaviour
in an all-str environment.

First, we need to be able to write code that can accept unicode
instances without attempting to coerce them to str instances. Let us
label such code as Unicode-safe. Unicode-safe libraries can be used in
an all-unicode world.

Second, we need to be able to write code that, when provided only str
instances, will not create unicode results. Let us label such code as
str-stable. Libraries that are str-stable can be used by libraries and
applications that are not yet Unicode-safe.

Sometimes it is simple to write code that is both str-stable and
Unicode-safe. For example, the following function just works:

    def appendx(s):
        return s + 'x'

That's not too surprising since the unicode type is designed to make the
task easier. The principle is that when str and unicode instances meet,
the result is a unicode instance. One notable difficulty arises when
code requires a string representation of an object, an operation
traditionally accomplished by using the str() built-in function.

Using the current str() function makes the code not Unicode-safe.
Replacing a str() call with a unicode() call makes the code not
str-stable. Changing str() so that it could return unicode instances
would solve this problem. As a further benefit, some code that is
currently not Unicode-safe because it uses str() would become
Unicode-safe.
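
The dilemma can be sketched with two hypothetical helpers, wrap_str()
and wrap_unicode(), which are illustrations only and not part of this
proposal. The comments describe Python 2 behaviour. The text_type alias
is an assumption added so that the sketch also runs on interpreters
without a separate unicode type, where the distinction disappears.

```python
try:
    text_type = unicode
except NameError:
    # Assumption: no separate unicode type exists, so fall back to str.
    text_type = str


def wrap_str(obj):
    # str-stable, but not Unicode-safe: on Python 2, str(u'caf\xe9')
    # tries to encode with the default ASCII codec and raises
    # UnicodeEncodeError.
    return '<' + str(obj) + '>'


def wrap_unicode(obj):
    # Unicode-safe, but not str-stable: on Python 2, wrap_unicode('abc')
    # returns a unicode instance even though only a str was passed in.
    return '<' + text_type(obj) + '>'
```

With the change proposed here, wrap_str() alone would cover both cases,
because str() would hand back a unicode argument unchanged.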

Specification

A Python implementation of the str() built-in follows:

    def str(s):
        """Return a nice string representation of the object.  The
        return value is a str or unicode instance.
        """
        if type(s) is str or type(s) is unicode:
            return s
        r = s.__str__()
        if not isinstance(r, (str, unicode)):
            raise TypeError('__str__ returned non-string')
        return r
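
As an illustration of how the function above treats its inputs, the
following sketch repeats its logic under the hypothetical name
proposed_str(), so that the built-in is not shadowed, and applies it to
two hypothetical classes, Cafe and Broken. The unicode fallback at the
top is an assumption that lets the sketch run where no separate unicode
type exists.

```python
try:
    unicode
except NameError:
    # Assumption: no separate unicode type exists, so alias it to str.
    unicode = str


def proposed_str(s):
    """Same logic as the str() specified above, under another name."""
    if type(s) is str or type(s) is unicode:
        return s
    r = s.__str__()
    if not isinstance(r, (str, unicode)):
        raise TypeError('__str__ returned non-string')
    return r


class Cafe(object):
    # Hypothetical class whose __str__ returns non-ASCII text.
    def __str__(self):
        return u'caf\xe9'


class Broken(object):
    # Hypothetical class whose __str__ does not return a string.
    def __str__(self):
        return 42


# Exact str and unicode instances are passed through unchanged.
same = proposed_str('abc')

# The unicode result of __str__ is returned as is. Under Python 2 the
# current str() would instead try to encode it as ASCII and fail.
text = proposed_str(Cafe())

# A non-string result from __str__ is rejected with TypeError.
try:
    proposed_str(Broken())
    rejected = False
except TypeError:
    rejected = True
```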

The following function would be added to the C API and would be the
equivalent of the str() built-in (ideally it would be called
PyObject_Str, but changing that function could cause a massive number
of compatibility problems):

    PyObject *PyString_New(PyObject *);

A reference implementation is available on Sourceforge[1] as a patch.

Backwards Compatibility

Some code may require that str() returns a str instance. In the standard
library, only one such case has been found so far. The function
email.header_decode() requires a str instance and the
email.Header.decode_header() function tries to ensure this by calling
str() on its argument. The code was fixed by changing the line "header =
str(header)" to:

    if isinstance(header, unicode):
        header = header.encode('ascii')

Whether this is truly a bug is questionable since decode_header() really
operates on byte strings, not character strings. Code that passes it a
unicode instance could itself be considered buggy.
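
The effect of the replacement lines can be sketched by wrapping them in
a hypothetical helper, coerce_header(), which is not part of the email
package. The unicode fallback is an assumption for interpreters without
a separate unicode type; there, every text string takes the encoding
branch and comes back as a byte string.

```python
try:
    unicode
except NameError:
    # Assumption: no separate unicode type exists, so alias it to str.
    unicode = str


def coerce_header(header):
    # Hypothetical wrapper around the two replacement lines above.
    if isinstance(header, unicode):
        header = header.encode('ascii')
    return header


# An ASCII-only unicode header is converted to a byte string.
ascii_header = coerce_header(u'Subject')

# A non-ASCII unicode header cannot be encoded and fails loudly.
try:
    coerce_header(u'caf\xe9')
    failed = False
except UnicodeEncodeError:
    failed = True
```

The explicit encode keeps the failure for non-ASCII input visible at the
call site rather than hiding it inside a str() call.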

Alternative Solutions

A new built-in function could be added instead of changing str(). Doing
so would introduce virtually no backwards compatibility problems.
However, since the compatibility problems are expected to be rare,
changing str() seems preferable to adding a new built-in.

The basestring type could be changed to have the proposed behaviour,
rather than changing str(). However, that would be confusing behaviour
for an abstract base type.

References

[1] https://bugs.python.org/issue1266570

Copyright

This document has been placed in the public domain.