PEP: 261 Title: Support for "wide" Unicode characters Version:
$Revision$ Last-Modified: $Date$ Author: Paul Prescod <paul@prescod.net>
Status: Final Type: Standards Track Content-Type: text/x-rst Created:
27-Jun-2001 Python-Version: 2.2 Post-History: 27-Jun-2001

Abstract

Python 2.1 unicode characters can have ordinals only up to 2**16 - 1.
This range corresponds to a range in Unicode known as the Basic
Multilingual Plane. There are now characters in Unicode that live on
other "planes". The largest addressable character in Unicode has the
ordinal 17 * 2**16 - 1 (0x10ffff). For readability, we will call this
TOPCHAR and call characters in this range "wide characters".

Glossary

Character

    Used by itself, means the addressable units of a Python Unicode
    string.

Code point

    A code point is an integer between 0 and TOPCHAR. If you imagine
    Unicode as a mapping from integers to characters, each integer is a
    code point. But the integers between 0 and TOPCHAR that do not map
    to characters are also code points. Some will someday be used for
    characters. Some are guaranteed never to be used for characters.

Codec

    A set of functions for translating between physical encodings (e.g.
    on disk or coming in from a network) into logical Python objects.

Encoding

    Mechanism for representing abstract characters in terms of physical
    bits and bytes. Encodings allow us to store Unicode characters on
    disk and transmit them over networks in a manner that is compatible
    with other Unicode software.

Surrogate pair

    Two physical characters that represent a single logical character.
    Part of a convention for representing 32-bit code points in terms of
    two 16-bit code points.

Unicode string

    A Python type representing a sequence of code points with "string
    semantics" (e.g. case conversions, regular expression compatibility,
    etc.) Constructed with the unicode() function.

Proposed Solution

One solution would be to merely increase the maximum ordinal to a larger
value. Unfortunately the only straightforward implementation of this
idea is to use 4 bytes per character. This has the effect of doubling
the size of most Unicode strings. In order to avoid imposing this cost
on every user, Python 2.2 will allow the 4-byte implementation as a
build-time option. Users can choose whether they care about wide
characters or prefer to preserve memory.

The 4-byte option is called "wide Py_UNICODE". The 2-byte option is
called "narrow Py_UNICODE".

Most things will behave identically in the wide and narrow worlds.

-   unichr(i) for 0 <= i < 2**16 (0x10000) always returns a length-one
    string.

-   unichr(i) for 2**16 <= i <= TOPCHAR will return a length-one string
    on wide Python builds. On narrow builds it will raise ValueError.

    ISSUE

      Python currently allows \U literals that cannot be represented as
      a single Python character. It generates two Python characters
      known as a "surrogate pair". Should this be disallowed on future
      narrow Python builds?

    Pro:

      Python already the construction of a surrogate pair for a large
      unicode literal character escape sequence. This is basically
      designed as a simple way to construct "wide characters" even in a
      narrow Python build. It is also somewhat logical considering that
      the Unicode-literal syntax is basically a short-form way of
      invoking the unicode-escape codec.

    Con:

      Surrogates could be easily created this way but the user still
      needs to be careful about slicing, indexing, printing etc.
      Therefore, some have suggested that Unicode literals should not
      support surrogates.

    ISSUE

      Should Python allow the construction of characters that do not
      correspond to Unicode code points? Unassigned Unicode code points
      should obviously be legal (because they could be assigned at any
      time). But code points above TOPCHAR are guaranteed never to be
      used by Unicode. Should we allow access to them anyhow?

    Pro:

      If a Python user thinks they know what they're doing why should we
      try to prevent them from violating the Unicode spec? After all, we
      don't stop 8-bit strings from containing non-ASCII characters.

    Con:

      Codecs and other Unicode-consuming code will have to be careful of
      these characters which are disallowed by the Unicode
      specification.

-   ord() is always the inverse of unichr()

-   There is an integer value in the sys module that describes the
    largest ordinal for a character in a Unicode string on the current
    interpreter. sys.maxunicode is 2**16-1 (0xffff) on narrow builds of
    Python and TOPCHAR on wide builds.

    ISSUE:

      Should there be distinct constants for accessing TOPCHAR and the
      real upper bound for the domain of unichr (if they differ)? There
      has also been a suggestion of sys.unicodewidth which can take the
      values 'wide' and 'narrow'.

-   every Python Unicode character represents exactly one Unicode code
    point (i.e. Python Unicode Character = Abstract Unicode character).

-   codecs will be upgraded to support "wide characters" (represented
    directly in UCS-4, and as variable-length sequences in UTF-8 and
    UTF-16). This is the main part of the implementation left to be
    done.

-   There is a convention in the Unicode world for encoding a 32-bit
    code point in terms of two 16-bit code points. These are known as
    "surrogate pairs". Python's codecs will adopt this convention and
    encode 32-bit code points as surrogate pairs on narrow Python
    builds.

    ISSUE

      Should there be a way to tell codecs not to generate surrogates
      and instead treat wide characters as errors?

    Pro:

      I might want to write code that works only with fixed-width
      characters and does not have to worry about surrogates.

    Con:

      No clear proposal of how to communicate this to codecs.

-   there are no restrictions on constructing strings that use code
    points "reserved for surrogates" improperly. These are called
    "isolated surrogates". The codecs should disallow reading these from
    files, but you could construct them using string literals or
    unichr().

Implementation

There is a new define:

    #define Py_UNICODE_SIZE 2

To test whether UCS2 or UCS4 is in use, the derived macro
Py_UNICODE_WIDE should be used, which is defined when UCS-4 is in use.

There is a new configure option:

  ----------------------- -------------------------------------------------------------
  --enable-unicode=ucs2   configures a narrow Py_UNICODE, and uses wchar_t if it fits
  --enable-unicode=ucs4   configures a wide Py_UNICODE, and uses wchar_t if it fits
  --enable-unicode        same as "=ucs2"
  --disable-unicode       entirely remove the Unicode functionality.
  ----------------------- -------------------------------------------------------------

It is also proposed that one day --enable-unicode will just default to
the width of your platforms wchar_t.

Windows builds will be narrow for a while based on the fact that there
have been few requests for wide characters, those requests are mostly
from hard-core programmers with the ability to buy their own Python and
Windows itself is strongly biased towards 16-bit characters.

Notes

This PEP does NOT imply that people using Unicode need to use a 4-byte
encoding for their files on disk or sent over the network. It only
allows them to do so. For example, ASCII is still a legitimate (7-bit)
Unicode-encoding.

It has been proposed that there should be a module that handles
surrogates in narrow Python builds for programmers. If someone wants to
implement that, it will be another PEP. It might also be combined with
features that allow other kinds of character-, word- and line- based
indexing.

Rejected Suggestions

More or less the status-quo

  We could officially say that Python characters are 16-bit and require
  programmers to implement wide characters in their application logic by
  combining surrogate pairs. This is a heavy burden because emulating
  32-bit characters is likely to be very inefficient if it is coded
  entirely in Python. Plus these abstracted pseudo-strings would not be
  legal as input to the regular expression engine.

"Space-efficient Unicode" type

  Another class of solution is to use some efficient storage internally
  but present an abstraction of wide characters to the programmer. Any
  of these would require a much more complex implementation than the
  accepted solution. For instance consider the impact on the regular
  expression engine. In theory, we could move to this implementation in
  the future without breaking Python code. A future Python could
  "emulate" wide Python semantics on narrow Python. Guido is not willing
  to undertake the implementation right now.

Two types

  We could introduce a 32-bit Unicode type alongside the 16-bit type.
  There is a lot of code that expects there to be only a single Unicode
  type.

This PEP represents the least-effort solution. Over the next several
years, 32-bit Unicode characters will become more common and that may
either convince us that we need a more sophisticated solution or (on the
other hand) convince us that simply mandating wide Unicode characters is
an appropriate solution. Right now the two options on the table are do
nothing or do this.

References

Unicode Glossary: http://www.unicode.org/glossary/

Copyright

This document has been placed in the public domain.