PEP: 414
Title: Explicit Unicode Literal for Python 3.3
Version: $Revision$
Last-Modified: $Date$
Author: Armin Ronacher <armin.ronacher@active-4.com>, Alyssa Coghlan <ncoghlan@gmail.com>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 15-Feb-2012
Python-Version: 3.3
Post-History: 28-Feb-2012, 04-Mar-2012
Resolution: https://mail.python.org/pipermail/python-dev/2012-February/116995.html

Abstract

This document proposes the reintegration of an explicit unicode literal
from Python 2.x to the Python 3.x language specification, in order to
reduce the volume of changes needed when porting Unicode-aware Python 2
applications to Python 3.

BDFL Pronouncement

This PEP has been formally accepted for Python 3.3:

  I'm accepting the PEP. It's about as harmless as they come. Make it
  so.

Proposal

This PEP proposes that Python 3.3 restore support for Python 2's Unicode
literal syntax, substantially increasing the number of lines of existing
Python 2 code in Unicode-aware applications that will run without
modification on Python 3.

Specifically, the Python 3 definition for string literal prefixes will
be expanded to allow:

    "u" | "U"

in addition to the currently supported:

    "r" | "R"

The following will all denote ordinary Python 3 strings:

    'text'
    "text"
    '''text'''
    """text"""
    u'text'
    u"text"
    u'''text'''
    u"""text"""
    U'text'
    U"text"
    U'''text'''
    U"""text"""

No changes are proposed to Python 3's actual Unicode handling, only to
the acceptable forms for string literals.
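As a quick check of the claim above, the following sketch compares
prefixed and unprefixed literals. It assumes an interpreter that
implements this PEP (Python 3.3 or later); the variable names are purely
illustrative.

```python
# Assumes Python 3.3+, where the u/U prefix is accepted again.
plain = 'text'
prefixed = [u'text', u"text", u'''text''', u"""text""", U'text', U"text"]

# Every prefixed form should be an ordinary str equal to the plain literal.
all_plain_str = all(type(s) is str and s == plain for s in prefixed)
print(all_plain_str)
```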

Exclusion of "Raw" Unicode Literals

Python 2 supports a concept of "raw" Unicode literals that don't meet
the conventional definition of a raw string: \uXXXX and \UXXXXXXXX
escape sequences are still processed by the compiler and converted to
the appropriate Unicode code points when creating the associated Unicode
objects.

Python 3 has no corresponding concept - the compiler performs no
preprocessing of the contents of raw string literals. This matches the
behaviour of 8-bit raw string literals in Python 2.

Since such strings are rarely used and would be interpreted differently
in Python 3 if permitted, it was decided that leaving them out entirely
was a better choice. Code which uses them will thus still fail
immediately on Python 3 (with a SyntaxError), rather than potentially
producing different output.
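The difference can be observed directly. The sketch below assumes
Python 3.3 or later; compiles() is a hypothetical helper introduced only
for this illustration.

```python
# Assumes Python 3.3+.
# A raw literal keeps the backslash sequence as six separate characters...
raw = r'\u20ac'
# ...whereas an ordinary (optionally u-prefixed) literal yields one code point.
cooked = u'\u20ac'


def compiles(source):
    """Return True if the running compiler accepts `source` as an expression."""
    try:
        compile(source, '<example>', 'eval')
    except SyntaxError:
        return False
    return True


# The Python 2 "raw Unicode" spelling is rejected at compile time.
raw_unicode_accepted = compiles("ur'\\u20ac'")
print(len(raw), len(cooked), raw_unicode_accepted)
```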

To get equivalent behaviour that will run on both Python 2 and Python 3,
either an ordinary Unicode literal can be used (with appropriate
additional escaping within the string), or else string concatenation or
string formatting can be used to combine the raw portions of the string
with those that require the use of Unicode escape sequences.
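The three options just described might look as follows for a Python 2
literal such as ur'\d+\u20ac'. This is an illustrative sketch; the
pattern itself is an arbitrary example rather than one taken from a real
project.

```python
# Python 2 only spelling being replaced:  ur'\d+\u20ac'

# Option 1: ordinary Unicode literal, with the regex backslash escaped.
escaped = u'\\d+\u20ac'

# Option 2: concatenate a raw portion with a portion that uses \u escapes.
concatenated = r'\d+' + u'\u20ac'

# Option 3: string formatting to splice the raw portion in.
formatted = u'%s\u20ac' % r'\d+'

print(escaped == concatenated == formatted)
```

All three spellings are accepted by Python 2 and by Python 3.3+, and each
yields a four character text string.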

Note that when using from __future__ import unicode_literals in Python
2, the nominally "raw" Unicode string literals will process \uXXXX and
\UXXXXXXXX escape sequences, just like Python 2 strings explicitly
marked with the "raw Unicode" prefix.

Author's Note

This PEP was originally written by Armin Ronacher, and Guido's approval
was given based on that version.

The currently published version has been rewritten by Alyssa Coghlan to
include additional historical details and rationale that were taken into
account when Guido made his decision, but were not explicitly documented
in Armin's version of the PEP.

Readers should be aware that many of the arguments in this PEP are not
technical ones. Instead, they relate heavily to the social and personal
aspects of software development.

Rationale

With the release of a Python 3 compatible version of the Web Services
Gateway Interface (WSGI) specification (PEP 3333) for Python 3.2, many
parts of the Python web ecosystem have been making a concerted effort to
support Python 3 without adversely affecting their existing developer
and user communities.

One major item of feedback from key developers in those communities,
including Chris McDonough (WebOb, Pyramid), Armin Ronacher (Flask,
Werkzeug), Jacob Kaplan-Moss (Django) and Kenneth Reitz (requests) is
that the requirement to change the spelling of every Unicode literal in
an application (regardless of how that is accomplished) is a key
stumbling block for porting efforts.

In particular, unlike many of the other Python 3 changes, it isn't one
that framework and library authors can easily handle on behalf of their
users. Most of those users couldn't care less about the "purity" of the
Python language specification; they just want their websites and
applications to work as well as possible.

While it is the Python web community that has been most vocal in
highlighting this concern, it is expected that other highly Unicode
aware domains (such as GUI development) may run into similar issues as
they (and their communities) start making concerted efforts to support
Python 3.

Common Objections

Complaint: This PEP may harm adoption of Python 3.2

This complaint is interesting, as it carries within it a tacit admission
that this PEP will make it easier to port Unicode aware Python 2
applications to Python 3.

There are many existing Python communities that are prepared to put up
with the constraints imposed by the existing suite of porting tools, or
to update their Python 2 code bases sufficiently that the problems are
minimised.

This PEP is not for those communities. Instead, it is designed
specifically to help people that don't want to put up with those
difficulties.

However, since the proposal is for a comparatively small tweak to the
language syntax with no semantic changes, it is feasible to support it
as a third party import hook. While such an import hook imposes some
import time overhead, and requires additional steps from each
application that needs it to get the hook in place, it allows
applications that target Python 3.2 to use libraries and frameworks that
would otherwise only run on Python 3.3+ due to their use of unicode
literal prefixes.

One such import hook project is Vinay Sajip's uprefix[1].
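To give a feel for the kind of source rewriting such a hook performs,
here is a minimal sketch of the core transformation only. It is not the
implementation used by uprefix or by the install hook: strip_u_prefix()
is a hypothetical name, and because it relies on the standard tokenize
module recognising the prefix, this version only runs on Python 3.3+. A
hook that actually targets Python 3.2 would need its own way of
identifying prefixed literals.

```python
import io
import tokenize


def strip_u_prefix(source):
    """Return `source` with any u/U prefix removed from string literals."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Only plain string tokens that start with u or U are rewritten;
        # comments, names and other prefixes are left untouched.
        if tok.type == tokenize.STRING and tok.string[0] in 'uU':
            tok = tok._replace(string=tok.string[1:])
        tokens.append(tok)
    return tokenize.untokenize(tokens)


converted = strip_u_prefix("greeting = u'hello' + U\" world\"\n")
print(converted)
```

Working on tokens rather than on raw text avoids accidentally rewriting
the letter u inside comments, identifiers or the bodies of other strings.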

For those that prefer to translate their code in advance rather than
converting on the fly at import time, Armin Ronacher is working on a
hook that runs at install time rather than during import[2].

Combining the two approaches is of course also possible. For example,
the import hook could be used for rapid edit-test cycles during local
development, but the install hook for continuous integration tasks and
deployment on Python 3.2.

The approaches described in this section may prove useful, for example,
for applications that wish to target Python 3 on the Ubuntu 12.04 LTS
release, which will ship with Python 2.7 and 3.2 as officially supported
Python versions.

Complaint: Python 3 shouldn't be made worse just to support porting from Python 2

This is indeed one of the key design principles of Python 3. However,
one of the key design principles of Python as a whole is that
"practicality beats purity". If we're going to impose a significant
burden on third party developers, we should have a solid rationale for
doing so.

In most cases, the rationale for backwards incompatible Python 3 changes
is either to improve code correctness (for example, stricter default
separation of binary and text data and integer division upgrading to
floats when necessary), to reduce typical memory usage (for example,
increased usage of iterators and views over concrete lists), or to
remove distracting nuisances that make Python code harder to read
without increasing its expressiveness (for example, the comma based
syntax for naming caught exceptions). Changes backed by such reasoning
are not going to be reverted, regardless of objections from Python 2
developers attempting to make the transition to Python 3.

In many cases, Python 2 offered two ways of doing things for historical
reasons. For example, inequality could be tested with both != and <> and
integer literals could be specified with an optional L suffix. Such
redundancies have been eliminated in Python 3, which reduces the overall
size of the language and improves consistency across developers.

In the original Python 3 design (up to and including Python 3.2), the
explicit prefix syntax for unicode literals was deemed to fall into this
category, as it is completely unnecessary in Python 3. However, the
difference between those other cases and unicode literals is that the
unicode literal prefix is not redundant in Python 2 code: it is a
programmatically significant distinction that needs to be preserved in
some fashion to avoid losing information.

While porting tools were created to help with the transition (see next
section), removing the prefix still creates an additional burden on
heavy users of unicode strings in Python 2, solely so that future
developers learning Python 3 don't need to be told "For historical
reasons, string literals may have an optional u or U prefix. Never use
this yourselves, it's just there to help with porting from an earlier
version of the language."

Plenty of students learning Python 2 received similar warnings regarding
string exceptions without being confused or irreparably stunted in their
growth as Python developers. It will be the same with this feature.

This point is further reinforced by the fact that Python 3 still allows
the uppercase variants of the B and R prefixes for bytes literals and
raw bytes and string literals. If the potential for confusion due to
string prefix variants is that significant, where was the outcry asking
that these redundant prefixes be removed along with all the other
redundancies that were eliminated in Python 3?

Just as support for string exceptions was eliminated from Python 2 using
the normal deprecation process, support for redundant string prefix
characters (specifically, B, R, u, U) may eventually be eliminated from
Python 3, regardless of the current acceptance of this PEP. However,
such a change will likely only occur once third party libraries
supporting Python 2.7 are about as common as libraries supporting Python
2.2 or 2.3 are today.

Complaint: The WSGI "native strings" concept is an ugly hack

One reason the removal of unicode literals has provoked such concern
amongst the web development community is that the updated WSGI
specification had to make a few compromises to minimise the disruption
for existing web servers that provide a WSGI-compatible interface (this
was deemed necessary in order to make the updated standard a viable
target for web application authors and web framework developers).

One of those compromises is the concept of a "native string". WSGI
defines three different kinds of string:

-   text strings: handled as unicode in Python 2 and str in Python 3
-   native strings: handled as str in both Python 2 and Python 3
-   binary data: handled as str in Python 2 and bytes in Python 3

Some developers consider WSGI's "native strings" to be an ugly hack, as
they are explicitly documented as being used solely for latin-1 decoded
"text", regardless of the actual encoding of the underlying data. Using
this approach bypasses many of the updates to Python 3's data model that
are designed to encourage correct handling of text encodings. However,
it generally works due to the specific details of the problem domain -
web server and web framework developers are some of the individuals most
aware of how blurry the line can get between binary data and text when
working with HTTP and related protocols, and how important it is to
understand the implications of the encodings in use when manipulating
encoded text data. At the application level most of these details are
hidden from the developer by the web frameworks and support libraries
(both in Python 2 and in Python 3).
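The latin-1 convention can be illustrated with a short round trip. This
is a sketch under the assumption that the bytes on the wire happen to be
UTF-8 encoded; the path value and variable names are invented for the
example.

```python
# Bytes as they might arrive from the network (assumed here to be UTF-8).
wire_bytes = u'/caf\u00e9'.encode('utf-8')

# On Python 3 the server exposes them as a latin-1 decoded native string,
# so every byte maps to exactly one character and nothing is lost...
native = wire_bytes.decode('latin-1')

# ...and a framework recovers the real text by reversing that step and
# then decoding with the encoding it knows (or assumes) to be in use.
text = native.encode('latin-1').decode('utf-8')

print(len(wire_bytes), len(native), len(text))
```

The native string is therefore a lossless carrier for the original bytes
rather than correctly decoded text, which is why some developers object
to it.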

In practice, native strings are a useful concept because there are some
APIs (both in the standard library and in third party frameworks and
packages) and some internal interpreter details that are designed
primarily to work with str. These components often don't support unicode
in Python 2 or bytes in Python 3, or, if they do, require additional
encoding details and/or impose constraints that don't apply to the str
variants.

Some examples of interfaces that are best handled by using actual str
instances are:

-   Python identifiers (as attributes, dict keys, class names, module
    names, import references, etc)
-   URLs for the most part as well as HTTP headers in urllib/http
    servers
-   WSGI environment keys and CGI-inherited values
-   Python source code for dynamic compilation and AST hacks
-   Exception messages
-   __repr__ return value
-   preferred filesystem paths
-   preferred OS environment

In Python 2.6 and 2.7, these distinctions are most naturally expressed
as follows:

-   u"": text string (unicode)
-   "": native string (str)
-   b"": binary data (str, also aliased as bytes)

In Python 3, the latin-1 decoded native strings are not distinguished
from any other text strings:

-   "": text string (str)
-   "": native string (str)
-   b"": binary data (bytes)

If from __future__ import unicode_literals is used to modify the
behaviour of Python 2, then, along with an appropriate definition of
n(), the distinction can be expressed as:

-   "": text string
-   n(""): native string
-   b"": binary data

(While n=str works for simple cases, it can sometimes have problems due
to non-ASCII source encodings)
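One possible definition of n() is sketched below. It is an assumption
made for illustration rather than a definition given by this PEP, and it
follows the latin-1 convention that WSGI uses for native strings.

```python
from __future__ import unicode_literals

import sys

if sys.version_info[0] >= 3:
    def n(text):
        """Native strings are already str on Python 3."""
        return text
else:
    def n(text):
        """Convert a unicode literal to an 8-bit str on Python 2."""
        return text.encode('latin-1')

header_name = n('Content-Type')
print(type(header_name) is str)
```

On either version the result is an instance of the built-in str type,
which is what the interfaces listed earlier expect.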

In the common subset of Python 2 and Python 3 (with appropriate
specification of a source encoding and definitions of the u() and b()
helper functions), they can be expressed as:

-   u(""): text string
-   "": native string
-   b(""): binary data

That last approach is the only variant that supports Python 2.5 and
earlier.
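The u() and b() helpers are not defined in this document. A minimal
sketch, loosely modelled on the helpers of the same name provided by
compatibility libraries such as six, could look like this:

```python
import sys

if sys.version_info[0] >= 3:
    def u(literal):
        """Text: native literals are already Unicode on Python 3."""
        return literal

    def b(literal):
        """Binary data: encode the native literal byte for byte."""
        return literal.encode('latin-1')
else:
    def u(literal):
        """Text: decode the 8-bit literal, honouring \\uXXXX escapes."""
        return literal.decode('unicode_escape')

    def b(literal):
        """Binary data: native literals are already 8-bit on Python 2."""
        return literal

title = u('caf\u00e9')
payload = b('\x00\x01')
print(len(title), len(payload))
```

With definitions along these lines, non-ASCII text is best written using
escape sequences inside u("") so that both versions see the same
characters.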

Of all the alternatives, the format currently supported in Python 2.6
and 2.7 is by far the cleanest approach that clearly distinguishes the
three desired kinds of behaviour. With this PEP, that format will also
be supported in Python 3.3+. It will also be supported in Python 3.1 and
3.2 through the use of import and install hooks. While it is
significantly less likely, it is also conceivable that the hooks could
be adapted to allow the use of the b prefix on Python 2.5.

Complaint: The existing tools should be good enough for everyone

A commonly expressed sentiment from developers that have already
successfully ported applications to Python 3 is along the lines of "if
you think it's hard, you're doing it wrong" or "it's not that hard, just
try it!". While it is no doubt unintentional, these responses all have
the effect of telling the people that are pointing out inadequacies in
the current porting toolset "there's nothing wrong with the porting
tools, you just suck and don't know how to use them properly".

These responses are a case of completely missing the point of what
people are complaining about. The feedback that resulted in this PEP
isn't due to people complaining that ports aren't possible. Instead, the
feedback is coming from people that have successfully completed ports
and are objecting that they found the experience thoroughly unpleasant
for the class of application that they needed to port (specifically,
Unicode aware web frameworks and support libraries).

This is a subjective appraisal, and it's the reason why the Python 3
porting tools ecosystem is a case where the "one obvious way to do it"
philosophy emphatically does not apply. While it was originally intended
that "develop in Python 2, convert with 2to3, test both" would be the
standard way to develop for both versions in parallel, in practice, the
needs of different projects and developer communities have proven to be
sufficiently diverse that a variety of approaches have been devised,
allowing each group to select an approach that best fits their needs.

Lennart Regebro has produced an excellent overview of the available
migration strategies[3], and a similar review is provided in the
official porting guide[4]. (Note that the official guidance has softened
to "it depends on your specific situation" since Lennart wrote his
overview).

However, both of those guides are written from the founding assumption
that all of the developers involved are already committed to the idea of
supporting Python 3. They make no allowance for the social aspects of
such a change when you're interacting with a user base that may not be
especially tolerant of disruptions without a clear benefit, or are
trying to persuade Python 2 focused upstream developers to accept
patches that are solely about improving Python 3 forward compatibility.

With the current porting toolset, every migration strategy will result
in changes to every Unicode literal in a project. No exceptions. They
will be converted to either an unprefixed string literal (if the project
decides to adopt the unicode_literals import) or else to a converter
call like u("text").

If the unicode_literals import approach is employed, but is not adopted
across the entire project at the same time, then the meaning of a bare
string literal may become annoyingly ambiguous. This problem can be
particularly pernicious for aggregated software, like a Django site - in
such a situation, some files may end up using the unicode_literals
import and others may not, creating definite potential for confusion.

While these problems are clearly solvable at a technical level, they're
a completely unnecessary distraction at the social level. Developer
energy should be reserved for addressing real technical difficulties
associated with the Python 3 transition (like distinguishing their 8-bit
text strings from their binary data). They shouldn't be punished with
additional code changes (even automated ones) solely due to the fact
that they have already explicitly identified their Unicode strings in
Python 2.

Armin Ronacher has created an experimental extension to 2to3 which only
modernizes Python code to the extent that it runs on Python 2.7 or later
with support from the cross-version compatibility six library. This tool
is available as python-modernize[5]. Currently, the deltas generated by
this tool will affect every Unicode literal in the converted source.
This will create legitimate concerns amongst upstream developers asked
to accept such changes, and amongst framework users being asked to
change their applications.

However, by eliminating the noise from changes to the Unicode literal
syntax, many projects could be cleanly and (comparatively)
non-controversially made forward compatible with Python 3.3+ just by
running python-modernize and applying the recommended changes.

References

Copyright

This document has been placed in the public domain.




[1] uprefix import hook project
(https://bitbucket.org/vinay.sajip/uprefix)

[2] install hook to remove unicode string prefix characters
(https://github.com/mitsuhiko/unicode-literals-pep/tree/master/install-hook)

[3] Porting to Python 3: Migration Strategies
(http://python3porting.com/strategies.html)

[4] Porting Python 2 Code to Python 3
(http://docs.python.org/howto/pyporting.html)

[5] Python-Modernize (http://github.com/mitsuhiko/python-modernize)