PEP: 540 Title: Add a new UTF-8 Mode Version: $Revision$ Last-Modified:
$Date$ Author: Victor Stinner <vstinner@python.org> BDFL-Delegate: INADA
Naoki Status: Final Type: Standards Track Content-Type: text/x-rst
Created: 05-Jan-2016 Python-Version: 3.7 Resolution:
https://mail.python.org/pipermail/python-dev/2017-December/151173.html

Abstract

Add a new "UTF-8 Mode" to enhance Python's use of UTF-8. When UTF-8 Mode
is active, Python will:

-   use the utf-8 encoding, regardless of the locale currently set by
    the current platform, and
-   change the stdin and stdout error handlers to surrogateescape.

This mode is off by default, but is automatically activated when using
the "POSIX" locale.

Add the -X utf8 command line option and PYTHONUTF8 environment variable
to control UTF-8 Mode.

Rationale

Locale encoding and UTF-8

Python 3.6 uses the locale encoding for filenames, environment
variables, standard streams, etc. The locale encoding is inherited from
the locale; the encoding and the locale are tightly coupled.

Many users inherit the ASCII encoding from the POSIX locale, aka the "C"
locale, but are unable change the locale for various reasons. This
encoding is very limited in term of Unicode support: any non-ASCII
character is likely to cause trouble.

It isn't always easy to get an accurate locale. Locales don't get the
exact same name on different Linux distributions, FreeBSD, macOS, etc.
And some locales, like the recent C.UTF-8 locale, are only supported by
a few platforms. The current locale can even vary on the same platform
depending on context; for example, a SSH connection can use a different
encoding than the filesystem or local terminal encoding on the same
machine.

On the flip side, Python 3.6 is already using UTF-8 by default on macOS,
Android and Windows (PEP 529) for most functions -- although open() is a
notable exception here. UTF-8 is also the default encoding of Python
scripts, XML and JSON file formats. The Go programming language uses
UTF-8 for all strings.

UTF-8 support is nearly ubiquitous for data read and written by modern
platforms. It also has excellent support in Python. The problem is
simply that the locale is frequently misconfigured. An obvious solution
suggests itself: ignore the locale encoding and use UTF-8.

Passthrough for undecodable bytes: surrogateescape

When decoding bytes from UTF-8 using the default strict error handler,
Python 3 raises a UnicodeDecodeError on the first undecodable byte.

Unix command line tools like cat or grep and most Python 2 applications
simply do not have this class of bugs: they don't decode data, but
process data as a raw bytes sequence.

Python 3 already has a solution to behave like Unix tools and Python 2:
the surrogateescape error handler (PEP 383). It allows processing data
as if it were bytes, but uses Unicode in practice; undecodable bytes are
stored as surrogate characters.

UTF-8 Mode sets the surrogateescape error handler for stdin and stdout,
since these streams as commonly associated to Unix command line tools.

However, users have a different expectation on files. Files are expected
to be properly encoded, and Python is expected to fail early when open()
is called with the wrong options, like opening a JPEG picture in text
mode. The open() default error handler remains strict for these reasons.

No change by default for best backward compatibility

While UTF-8 is perfect in most cases, sometimes the locale encoding is
actually the best encoding.

This PEP changes the behaviour for the POSIX locale since this locale is
usually equivalent to the ASCII encoding, whereas UTF-8 is a much better
choice. It does not change the behaviour for other locales to prevent
any risk or regression.

As users are responsible to enable explicitly the new UTF-8 Mode for
these other locales, they are responsible for any potential mojibake
issues caused by UTF-8 Mode.

Proposal

Add a new UTF-8 Mode to use the UTF-8 encoding, ignore the locale
encoding, and change stdin and stdout error handlers to surrogateescape.

Add the new -X utf8 command line option and PYTHONUTF8 environment
variable. Users can explicitly activate UTF-8 Mode with the command-line
option -X utf8 or by setting the environment variable PYTHONUTF8=1.

This mode is disabled by default and enabled by the POSIX locale. Users
can explicitly disable UTF-8 Mode with the command-line option -X utf8=0
or by setting the environment variable PYTHONUTF8=0.

For standard streams, the PYTHONIOENCODING environment variable has
priority over UTF-8 Mode.

On Windows, the PYTHONLEGACYWINDOWSFSENCODING environment variable (PEP
529) has the priority over UTF-8 Mode.

Effects of UTF-8 Mode:

-   sys.getfilesystemencoding() returns 'UTF-8'.
-   locale.getpreferredencoding() returns UTF-8; its do_setlocale
    argument, and the locale encoding, are ignored.
-   sys.stdin and sys.stdout error handler is set to surrogateescape.

Side effects:

-   open() uses the UTF-8 encoding by default. However, it still uses
    the strict error handler by default.
-   os.fsdecode() and os.fsencode() use the UTF-8 encoding.
-   Command line arguments, environment variables and filenames use the
    UTF-8 encoding.

Relationship with the locale coercion (PEP 538)

The POSIX locale enables the locale coercion (PEP 538) and the UTF-8
mode (PEP 540). When the locale coercion is enabled, enabling the UTF-8
mode has no additional effect.

The UTF-8 Mode has the same effect as locale coercion:

-   sys.getfilesystemencoding() returns 'UTF-8',
-   locale.getpreferredencoding() returns UTF-8, and
-   the sys.stdin and sys.stdout error handlers are set to
    surrogateescape.

These changes only affect Python code. But the locale coercion has
additional effects: the LC_CTYPE environment variable and the LC_CTYPE
locale are set to a UTF-8 locale like C.UTF-8. One side effect is that
non-Python code is also impacted by the locale coercion. The two PEPs
are complementary.

On platforms like Centos 7 where locale coercion is not supported, the
POSIX locale only enables UTF-8 Mode. In this case, Python code uses the
UTF-8 encoding and ignores the locale encoding, whereas non-Python code
uses the locale encoding, which is usually ASCII for the POSIX locale.

While the UTF-8 Mode is supported on all platforms and can be enabled
with any locale, the locale coercion is not supported by all platforms
and is restricted to the POSIX locale.

The UTF-8 Mode has only an impact on Python child processes when the
PYTHONUTF8 environment variable is set to 1, whereas the locale coercion
sets the LC_CTYPE environment variables which impacts all child
processes.

The benefit of the locale coercion approach is that it helps ensure that
encoding handling in binary extension modules and child processes is
consistent with Python's encoding handling. The upside of the UTF-8 Mode
approach is that it allows an embedding application to change the
interpreter's behaviour without having to change the process global
locale settings.

Backward Compatibility

The only backward incompatible change is that the POSIX locale now
enables the UTF-8 Mode by default: it will now use the UTF-8 encoding,
ignore the locale encoding, and change stdin and stdout error handlers
to surrogateescape.

Annex: Encodings And Error Handlers

UTF-8 Mode changes the default encoding and error handler used by
open(), os.fsdecode(), os.fsencode(), sys.stdin, sys.stdout and
sys.stderr.

Encoding and error handler

  Function                       Default                   UTF-8 Mode or POSIX locale
  ------------------------------ ------------------------- ----------------------------
  open()                         locale/strict             UTF-8/strict
  os.fsdecode(), os.fsencode()   locale/surrogateescape    UTF-8/surrogateescape
  sys.stdin, sys.stdout          locale/strict             UTF-8/surrogateescape
  sys.stderr                     locale/backslashreplace   UTF-8/backslashreplace

By comparison, Python 3.6 uses:

  Function                       Default                   POSIX locale
  ------------------------------ ------------------------- ----------------------------
  open()                         locale/strict             locale/strict
  os.fsdecode(), os.fsencode()   locale/surrogateescape    locale/surrogateescape
  sys.stdin, sys.stdout          locale/strict             locale/**surrogateescape**
  sys.stderr                     locale/backslashreplace   locale/backslashreplace

Encoding and error handler on Windows

On Windows, the encodings and error handlers are different:

  Function                       Default                  Legacy Windows FS encoding   UTF-8 Mode
  ------------------------------ ------------------------ ---------------------------- ------------------------
  open()                         mbcs/strict              mbcs/strict                  UTF-8/strict
  os.fsdecode(), os.fsencode()   UTF-8/surrogatepass      mbcs/replace                 UTF-8/surrogatepass
  sys.stdin, sys.stdout          UTF-8/surrogateescape    UTF-8/surrogateescape        UTF-8/surrogateescape
  sys.stderr                     UTF-8/backslashreplace   UTF-8/backslashreplace       UTF-8/backslashreplace

By comparison, Python 3.6 uses:

  Function                       Default                  Legacy Windows FS encoding
  ------------------------------ ------------------------ ----------------------------
  open()                         mbcs/strict              mbcs/strict
  os.fsdecode(), os.fsencode()   UTF-8/surrogatepass      mbcs/replace
  sys.stdin, sys.stdout          UTF-8/surrogateescape    UTF-8/surrogateescape
  sys.stderr                     UTF-8/backslashreplace   UTF-8/backslashreplace

The "Legacy Windows FS encoding" is enabled by the
PYTHONLEGACYWINDOWSFSENCODING environment variable.

If stdin and/or stdout is redirected to a pipe, sys.stdin and/or
sys.output uses mbcs encoding by default rather than UTF-8. But in UTF-8
Mode, sys.stdin and sys.stdout always use the UTF-8 encoding.

Note

There is no POSIX locale on Windows. The ANSI code page is used as the
locale encoding, and this code page never uses the ASCII encoding.

Links

-   bpo-29240: Implementation of the PEP 540: Add a new UTF-8 Mode
-   PEP 538: "Coercing the legacy C locale to C.UTF-8"
-   PEP 529: "Change Windows filesystem encoding to UTF-8"
-   PEP 528: "Change Windows console encoding to UTF-8"
-   PEP 383: "Non-decodable Bytes in System Character Interfaces"

Post History

-   2017-12: [Python-Dev] PEP 540: Add a new UTF-8 Mode
-   2017-04: [Python-Dev] Proposed BDFL Delegate update for PEPs 538 &
    540 (assuming UTF-8 for *nix system boundaries)
-   2017-01: [Python-ideas] PEP 540: Add a new UTF-8 Mode
-   2017-01: bpo-28180: Implementation of the PEP 538: coerce C locale
    to C.utf-8 (msg284764)
-   2016-08-17: bpo-27781: Change sys.getfilesystemencoding() on Windows
    to UTF-8 (msg272916) -- Victor proposed -X utf8 for the PEP 529
    (Change Windows filesystem encoding to UTF-8)

Version History

-   Version 4: locale.getpreferredencoding() now returns 'UTF-8' in the
    UTF-8 Mode.
-   Version 3: The UTF-8 Mode does not change the open() default error
    handler (strict) anymore, and the Strict UTF-8 Mode has been
    removed.
-   Version 2: Rewrite the PEP from scratch to make it much shorter and
    easier to understand.
-   Version 1: First version posted to python-dev.

Copyright

This document has been placed in the public domain.