PEP: 538
Title: Coercing the legacy C locale to a UTF-8 based locale
Version: $Revision$
Last-Modified: $Date$
Author: Alyssa Coghlan <ncoghlan@gmail.com>
BDFL-Delegate: INADA Naoki
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 28-Dec-2016
Python-Version: 3.7
Post-History: 03-Jan-2017, 07-Jan-2017, 05-Mar-2017, 09-May-2017
Resolution: https://mail.python.org/pipermail/python-dev/2017-May/148035.html

Abstract

An ongoing challenge with Python 3 on *nix systems is the conflict
between needing to use the configured locale encoding by default for
consistency with other locale-aware components in the same process or
subprocesses, and the fact that the standard C locale (as defined in
POSIX:2001) typically implies a default text encoding of ASCII, which is
entirely inadequate for the development of networked services and client
applications in a multilingual world.

PEP 540 proposes a change to CPython's handling of the legacy C locale
such that CPython will assume the use of UTF-8 in such environments,
rather than persisting with the demonstrably problematic assumption of
ASCII as an appropriate encoding for communicating with operating system
interfaces. This is a good approach for cases where network encoding
interoperability is a more important concern than local encoding
interoperability.

However, it comes at the cost of making CPython's encoding assumptions
diverge from those of other locale-aware components in the same process,
as well as those of components running in subprocesses that share the
same environment.

This can cause interoperability problems with some extension modules
(such as GNU readline's command line history editing), as well as with
components running in subprocesses (such as older Python runtimes).

It also requires non-trivial changes to the internals of how CPython
itself works, rather than relying primarily on existing configuration
settings that are supported by Python versions prior to Python 3.7.

Accordingly, this PEP proposes that independently of the UTF-8 mode
proposed in PEP 540, the way the CPython implementation handles the
default C locale be changed to be roughly equivalent to the following
existing configuration settings (supported since Python 3.1):

    LC_CTYPE=C.UTF-8
    PYTHONIOENCODING=utf-8:surrogateescape

The exact target locale for coercion will be chosen from a predefined
list at runtime based on the actually available locales.

The reinterpreted locale settings will be written back to the
environment so they're visible to other components in the same process
and in subprocesses, but the changed PYTHONIOENCODING default will be
made implicit in order to avoid causing compatibility problems with
Python 2 subprocesses that don't provide the surrogateescape error
handler.

The new legacy locale coercion behavior can be disabled either by
setting LC_ALL (which may still lead to a Unicode compatibility warning)
or by setting the new PYTHONCOERCECLOCALE environment variable to 0.

With this change, any *nix platform that does not offer at least one of
the C.UTF-8, C.utf8 or UTF-8 locales as part of its standard
configuration would only be considered a fully supported platform for
CPython 3.7+ deployments when a suitable locale other than the default C
locale is configured explicitly (e.g. en_AU.UTF-8, zh_CN.gb18030). If
PEP 540 is accepted in addition to this PEP, then pure Python modules
would also be supported when using the proposed PYTHONUTF8 mode, but
expectations for full Unicode compatibility in extension modules would
continue to be limited to the platforms covered by this PEP.

As it only reflects a change in default settings rather than a
fundamentally new capability, redistributors (such as Linux
distributions) with a narrower target audience than the upstream CPython
development team may also choose to opt in to this locale coercion
behaviour for the Python 3.6.x series by applying the necessary changes
as a downstream patch.

Implementation Notes

Attempting to implement the PEP as originally accepted showed that the
proposal to emit locale coercion and compatibility warnings by default
simply wasn't practical (there were too many cases where previously
working code failed because of the warnings, rather than because of
latent locale handling defects in the affected code).

As a result, the PY_WARN_ON_C_LOCALE config flag was removed, and
replaced with a runtime PYTHONCOERCECLOCALE=warn environment variable
setting that allows developers and system integrators to opt-in to
receiving locale coercion and compatibility warnings, without emitting
them by default.

The output examples in the PEP itself have also been updated to remove
the warnings and make them easier to read.

Background

While the CPython interpreter is starting up, it may need to convert
from the char * format to the wchar_t * format, or from one of those
formats to PyUnicodeObject *, in a way that's consistent with the locale
settings of the overall system. It handles these cases by relying on the
operating system to do the conversion and then ensuring that the text
encoding name reported by sys.getfilesystemencoding() matches the
encoding used during this early bootstrapping process.
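
As a rough illustration of that contract, the following Python sketch reports the encoding the interpreter settled on and checks that operating system supplied bytes survive a round trip through os.fsdecode() and os.fsencode(). The sample byte string is purely illustrative, and the reported encoding depends on the platform and locale in effect.

```python
import os
import sys

# Encoding chosen during interpreter startup for OS-facing text
# (file names, command line arguments, environment variables).
fs_encoding = sys.getfilesystemencoding()

# Illustrative OS-supplied bytes: the UTF-8 encoding of a non-ASCII name.
raw_name = "ℙƴ☂ℌøἤ".encode("utf-8")

# os.fsdecode() and os.fsencode() apply the filesystem encoding together
# with its error handler, so on POSIX systems bytes that cannot be decoded
# are preserved as escapes and recovered on the way back out.
decoded_name = os.fsdecode(raw_name)
round_tripped = os.fsencode(decoded_name)

print(fs_encoding, round_tripped == raw_name)
```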

On Windows, the limitations of the mbcs format used by default in these
conversions proved sufficiently problematic that PEP 528 and PEP 529
were implemented to bypass the operating system supplied interfaces for
binary data handling and force the use of UTF-8 instead.

On Mac OS X, iOS, and Android, many components, including CPython,
already assume the use of UTF-8 as the system encoding, regardless of
the locale setting. However, this isn't the case for all components, and
the discrepancy can cause problems in some situations (for example, when
using the GNU readline module [16]).

On non-Apple and non-Android *nix systems, these operations are handled
using the C locale system in glibc, which has the following
characteristics[1]:

-   by default, all processes start in the C locale, which uses ASCII
    for these conversions. This is almost never what anyone doing
    multilingual text processing actually wants (including CPython and
    C/C++ GUI frameworks).
-   calling setlocale(LC_ALL, "") reconfigures the active locale based
    on the locale categories configured in the current process
    environment
-   if the locale requested by the current environment is unknown, or no
    specific locale is configured, then the default C locale will remain
    active
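
The lookup that setlocale(LC_ALL, "") performs for the LC_CTYPE category can be sketched in Python as follows. The helper name effective_ctype_name is hypothetical, and the sketch only models the POSIX precedence of the environment variables (LC_ALL, then LC_CTYPE, then LANG); it does not check whether the named locale is actually installed.

```python
def effective_ctype_name(env):
    """Return the locale name requested for LC_CTYPE by an environment mapping.

    POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    Empty values are treated the same as unset ones.
    """
    for variable in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(variable, "")
        if value:
            return value
    # Nothing configured: the default C locale stays active.
    return "C"


print(effective_ctype_name({"LANG": "en_AU.UTF-8"}))
print(effective_ctype_name({}))
```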

The specific locale category that covers the APIs that CPython depends
on is LC_CTYPE, which applies to "classification and conversion of
characters, and to multibyte and wide characters"[2]. Accordingly,
CPython includes the following key calls to setlocale:

-   in the main python binary, CPython calls setlocale(LC_ALL, "") to
    configure the entire C locale subsystem according to the process
    environment. It does this prior to making any calls into the shared
    CPython library
-   in Py_Initialize, CPython calls setlocale(LC_CTYPE, ""), such that
    the configured locale settings for that category always match those
    set in the environment. It does this unconditionally, and it doesn't
    revert the process state change in Py_Finalize

(This summary of the locale handling omits several technical details
related to exactly where and when the text encoding declared as part of
the locale settings is used - see PEP 540 for further discussion, as
these particular details matter more when decoupling CPython from the
declared C locale than they do when overriding the locale with one based
on UTF-8)

These calls are usually sufficient to provide sensible behaviour, but
they can still fail in the following cases:

-   SSH environment forwarding means that SSH clients may sometimes
    forward client locale settings to servers that don't have that
    locale installed. This leads to CPython running in the default
    ASCII-based C locale
-   some process environments (such as Linux containers) may not have
    any explicit locale configured at all. As with unknown locales, this
    leads to CPython running in the default ASCII-based C locale
-   on Android, rather than configuring the locale based on environment
    variables, the empty locale "" is treated as specifically requesting
    the "C" locale

The simplest way to deal with this problem for currently released
versions of CPython is to explicitly set a more sensible locale when
launching the application. For example:

    LC_CTYPE=C.UTF-8 python3 ...

The C.UTF-8 locale is a full locale definition that uses UTF-8 for the
LC_CTYPE category, and the same settings as the C locale for all other
categories (including LC_COLLATE). It is offered by a number of Linux
distributions (including Debian, Ubuntu, Fedora, Alpine and Android) as
an alternative to the ASCII-based C locale. Some other platforms (such
as HP-UX) offer an equivalent locale definition under the name C.utf8.

Mac OS X and other *BSD systems have taken a different approach: instead
of offering a C.UTF-8 locale, they offer a partial UTF-8 locale that
only defines the LC_CTYPE category. On such systems, the preferred
environmental locale adjustment is to set LC_CTYPE=UTF-8 rather than to
set LC_ALL or LANG.[3]

In the specific case of Docker containers and similar technologies, the
appropriate locale setting can be specified directly in the container
image definition.

Another common failure case is developers specifying LANG=C in order to
see otherwise translated user interface messages in English, rather than
the more narrowly scoped LC_MESSAGES=C or LANGUAGE=en.

Relationship with other PEPs

This PEP shares a common problem statement with PEP 540 (improving
Python 3's behaviour in the default C locale), but diverges markedly in
the proposed solution:

-   PEP 540 proposes to entirely decouple CPython's default text
    encoding from the C locale system in that case, allowing text
    handling inconsistencies to arise between CPython and other
    locale-aware components running in the same process and in
    subprocesses. This approach aims to make CPython behave less like a
    locale-aware application, and more like locale-independent language
    runtimes like those for Go, Node.js (V8), and Rust
-   this PEP proposes to override the legacy C locale with a more
    recently defined locale that uses UTF-8 as its default text
    encoding. This means that the text encoding override will apply not
    only to CPython, but also to any locale-aware extension modules
    loaded into the current process, as well as to locale-aware
    applications invoked in subprocesses that inherit their environment
    from the parent process. This approach aims to retain CPython's
    traditional strong support for integration with other locale-aware
    components while also actively helping to push forward the adoption
    and standardisation of the C.UTF-8 locale as a Unicode-aware
    replacement for the legacy C locale in the wider C/C++ ecosystem

After reviewing both PEPs, it became clear that they didn't actually
conflict at a technical level, and the proposal in PEP 540 offered a
superior option in cases where no suitable locale was available, as well
as offering a better reference behaviour for platforms where the notion
of a "locale encoding" doesn't make sense (for example, embedded systems
running MicroPython rather than the CPython reference interpreter).

Meanwhile, this PEP offered improved compatibility with other
locale-aware components, and an approach more amenable to being
backported to Python 3.6 by downstream redistributors.

As a result, this PEP was amended to refer to PEP 540 as a complementary
solution that offered improved behaviour when none of the standard UTF-8
based locales were available, as well as extending the changes in the
default settings to APIs that aren't currently independently
configurable (such as the default encoding and error handler for
open()).

The availability of PEP 540 also meant that the LC_CTYPE=en_US.UTF-8
legacy fallback was removed from the list of UTF-8 locales tried as a
coercion target, with the expectation being that CPython will instead
rely solely on the proposed PYTHONUTF8 mode in such cases.

Motivation

While Linux container technologies like Docker, Kubernetes, and
OpenShift are best known for their use in web service development, the
related container formats and execution models are also being adopted
for Linux command line application development. Technologies like Gnome
Flatpak[4] and Ubuntu Snappy[5] further aim to bring these same
techniques to Linux GUI application development.

When using Python 3 for application development in these contexts, it
isn't uncommon to see text encoding related errors akin to the
following:

    $ docker run --rm fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
    Unable to decode the command from the command line:
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 7: surrogates not allowed
    $ docker run --rm ncoghlan/debian-python python3 -c 'print("ℙƴ☂ℌøἤ")'
    Unable to decode the command from the command line:
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 7: surrogates not allowed

Even though the same command is likely to work fine when run locally:

    $ python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ

The source of the problem can be seen by instead running the locale
command in the three environments:

    $ locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=en_AU.UTF-8
    LC_CTYPE="en_AU.UTF-8"
    LC_ALL=
    $ docker run --rm fedora:25 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=
    LC_CTYPE="POSIX"
    LC_ALL=
    $ docker run --rm ncoghlan/debian-python locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=
    LANGUAGE=
    LC_CTYPE="POSIX"
    LC_ALL=

In this particular example, we can see that the host system locale is
set to "en_AU.UTF-8", so CPython uses UTF-8 as the default text
encoding. By contrast, the base Docker images for Fedora and Debian
don't have any specific locale set, so they use the POSIX locale by
default, which is an alias for the ASCII-based default C locale.
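
The failure can be reproduced without a container. The sketch below is a simplification that assumes the command line bytes are decoded as ASCII with the surrogateescape error handler in the C locale, and that the resulting text is later encoded as strict UTF-8.

```python
# The command line as the raw bytes handed over by the operating system.
command_bytes = 'print("ℙƴ☂ℌøἤ")'.encode("utf-8")

# Assumption: in the C locale the bytes are decoded as ASCII, with
# surrogateescape turning each non-ASCII byte into a lone surrogate.
decoded = command_bytes.decode("ascii", "surrogateescape")

try:
    # Strict UTF-8 refuses to encode lone surrogates.
    decoded.encode("utf-8")
    failure = None
except UnicodeEncodeError as exc:
    failure = exc

if failure is not None:
    print(repr(decoded[7]), failure.start, failure.reason)
```

The first escaped byte sits at index 7, directly after the seven ASCII characters of `print("`, which matches the position reported in the error messages above.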

The simplest way to get Python 3 (regardless of the exact version) to
behave sensibly in Fedora and Debian based containers is to run it in
the C.UTF-8 locale that both distros provide:

    $ docker run --rm -e LC_CTYPE=C.UTF-8 fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ
    $ docker run --rm -e LC_CTYPE=C.UTF-8 ncoghlan/debian-python python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ

    $ docker run --rm -e LC_CTYPE=C.UTF-8 fedora:25 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=
    LC_CTYPE=C.UTF-8
    LC_ALL=
    $ docker run --rm -e LC_CTYPE=C.UTF-8 ncoghlan/debian-python locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=
    LANGUAGE=
    LC_CTYPE=C.UTF-8
    LC_ALL=

The Alpine Linux based Python images provided by Docker, Inc. already
use the C.UTF-8 locale by default:

    $ docker run --rm python:3 python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ
    $ docker run --rm python:3 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=C.UTF-8
    LANGUAGE=
    LC_CTYPE="C.UTF-8"
    LC_ALL=

Similarly, for custom container images (i.e. those adding additional
content on top of a base distro image), a more suitable locale can be
set in the image definition so everything just works by default.
However, it would provide a much nicer and more consistent user
experience if CPython were able to just deal with this problem
automatically rather than relying on redistributors or end users to
handle it through system configuration changes.

While the glibc developers are working towards making the C.UTF-8 locale
universally available for use by glibc based applications like
CPython[6], this unfortunately doesn't help on platforms that ship older
versions of glibc without that feature, and also don't provide C.UTF-8
(or an equivalent) as an on-disk locale the way Debian and Fedora do.
These platforms are considered out of scope for this PEP - see PEP 540
for further discussion of possible options for improving CPython's
default behaviour in such environments.

Design Principles

The above motivation leads to the following core design principles for
the proposed solution:

-   if a locale other than the default C locale is explicitly
    configured, we'll continue to respect it
-   as far as is feasible, any changes made will use existing
    configuration options
-   Python's runtime behaviour in potential coercion target locales
    should be identical regardless of whether the locale was set
    explicitly in the environment or implicitly as a locale coercion
    target
-   for Python 3.7, if we're changing the locale setting without an
    explicit config option, we'll emit a warning on stderr that we're
    doing so rather than silently changing the process configuration.
    This will alert application and system integrators to the change,
    even if they don't closely follow the PEP process or Python release
    announcements. However, to minimize the chance of introducing new
    problems for end users, we'll do this without using the warnings
    system, so even running with -Werror won't turn it into a runtime
    exception. (Note: these warnings ended up being silenced by default.
    See the Implementation Note above for more details)
-   for Python 3.7, any changed defaults will offer some form of
    explicit "off" switch at build time, runtime, or both

Minimizing the negative impact on systems currently correctly configured
to use GB-18030 or another partially ASCII compatible universal encoding
leads to the following design principle:

-   if a UTF-8 based Linux container is run on a host that is explicitly
    configured to use a non-UTF-8 encoding, and tries to exchange
    locally encoded data with that host rather than exchanging
    explicitly UTF-8 encoded data, CPython will endeavour to correctly
    round-trip host provided data that is concatenated or split solely
    at common ASCII compatible code points, but may otherwise emit
    nonsensical results.

Minimizing the negative impact on systems and programs correctly
configured to use an explicit locale category like LC_TIME, LC_MONETARY
or LC_NUMERIC while otherwise running in the legacy C locale gives the
following design principle:

-   don't make any environmental changes that would alter any existing
    settings for locale categories other than LC_CTYPE (most notably:
    don't set LC_ALL or LANG)

Finally, maintaining compatibility with running arbitrary subprocesses
in orchestration use cases leads to the following design principle:

-   don't make any Python-specific environmental changes that might be
    incompatible with any still supported version of CPython (including
    CPython 2.7)

Specification

To better handle the cases where CPython would otherwise end up
attempting to operate in the C locale, this PEP proposes that CPython
automatically attempt to coerce the legacy C locale to a UTF-8 based
locale for the LC_CTYPE category when it is run as a standalone command
line application.

It further proposes to emit a warning on stderr if the legacy C locale
is in effect for the LC_CTYPE category at the point where the language
runtime itself is initialized, and the explicit environmental flag to
disable locale coercion is not set, in order to warn system and
application integrators that they're running CPython in an unsupported
configuration.

In addition to these general changes, some additional Android-specific
changes are proposed to handle the differences in the behaviour of
setlocale on that platform.

Legacy C locale coercion in the standalone Python interpreter binary

When run as a standalone application, CPython has the opportunity to
reconfigure the C locale before any locale dependent operations are
executed in the process.

This means that it can change the locale settings not only for the
CPython runtime, but also for any other locale-aware components running
in the current process (e.g. as part of extension modules), as well as
in subprocesses that inherit their environment from the current process.

After calling setlocale(LC_ALL, "") to initialize the locale settings in
the current process, the main interpreter binary will be updated to
include the following call:

    const char *ctype_loc = setlocale(LC_CTYPE, NULL);

This cryptic invocation is the API that C provides to query the current
locale setting without changing it. Given that query, it is possible to
check for exactly the C locale with strcmp:

    ctype_loc != NULL && strcmp(ctype_loc, "C") == 0  /* true only in the C locale */

This call also returns "C" when either no particular locale is set, or
the nominal locale is set to an alias for the C locale (such as POSIX).
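
The same query is available from Python through the locale module, which may help when experimenting with this check. The helper name below is hypothetical. Note that the interpreter has already run its own startup locale configuration by the time this code executes, so the result reflects the state after that configuration rather than the raw process default.

```python
import locale


def ctype_is_legacy_c_locale():
    """Report whether LC_CTYPE is currently the legacy C locale.

    Passing None as the locale queries the current setting without
    changing it, mirroring setlocale(LC_CTYPE, NULL) in C.
    """
    current = locale.setlocale(locale.LC_CTYPE, None)
    return current == "C"


print(locale.setlocale(locale.LC_CTYPE, None), ctype_is_legacy_c_locale())
```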

Given this information, CPython can then attempt to coerce the locale to
one that uses UTF-8 rather than ASCII as the default encoding.

Three such locales will be tried:

-   C.UTF-8 (available at least in Debian, Ubuntu, Alpine, and Fedora
    25+, and expected to be available by default in a future version of
    glibc)
-   C.utf8 (available at least in HP-UX)
-   UTF-8 (available in at least some *BSD variants, including Mac OS X)

The coercion will be implemented by setting the LC_CTYPE environment
variable to the candidate locale name, such that future calls to
setlocale() will see it, as will other components looking for those
settings (such as GUI development frameworks and Python's own locale
module).
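
The selection and environment update can be sketched in Python as follows. This is an illustration of the approach rather than the C implementation: the function names are hypothetical, and probing availability by temporarily activating each candidate is an assumption of this sketch.

```python
import locale

# Candidate names listed above, in the order they are tried.
COERCION_TARGETS = ("C.UTF-8", "C.utf8", "UTF-8")


def locale_is_available(name):
    """Probe a locale by activating it for LC_CTYPE, then restoring the old one."""
    previous = locale.setlocale(locale.LC_CTYPE, None)
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        return False
    locale.setlocale(locale.LC_CTYPE, previous)
    return True


def pick_coercion_target(is_available=locale_is_available):
    """Return the first usable candidate, or None if none can be configured."""
    for name in COERCION_TARGETS:
        if is_available(name):
            return name
    return None


def coerced_environment(env, target):
    """Return a copy of env in which only LC_CTYPE has been overridden."""
    updated = dict(env)
    updated["LC_CTYPE"] = target
    return updated


print(pick_coercion_target())
```

Only LC_CTYPE is touched, so settings such as LANG or LC_TIME in the original environment are carried over unchanged.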

To allow for better cross-platform binary portability and to adjust
automatically to future changes in locale availability, these checks
will be implemented at runtime on all platforms other than Windows,
rather than attempting to determine which locales to try at compile
time.

When this locale coercion is activated, the following warning will be
printed on stderr, with the warning containing whichever locale was
successfully configured:

    Python detected LC_CTYPE=C: LC_CTYPE coerced to C.UTF-8 (set another
    locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).

(Note: this warning ended up being silenced by default. See the
Implementation Note above for more details)

As long as the current platform provides at least one of the candidate
UTF-8 based environments, this locale coercion will mean that the
standard Python binary and locale-aware extensions should once again
"just work" in the three main failure cases we're aware of (missing
locale settings, SSH forwarding of unknown locales via LANG or LC_CTYPE,
and developers explicitly requesting LANG=C).

The one case where failures may still occur is when stderr is
specifically being checked for no output, which can be resolved either
by configuring a locale other than the C locale, or else by using a
mechanism other than "there was no output on stderr" to check for
subprocess errors (e.g. checking process return codes).
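
As a small illustration of the second option, the following Python sketch runs a hypothetical child process that succeeds but still writes a notice to stderr, and compares the two ways of detecting failure:

```python
import subprocess
import sys

# Hypothetical child process: it succeeds, but still writes a notice to
# stderr, much as a locale coercion warning would.
child_code = "import sys; sys.stderr.write('notice: not an error\\n')"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

# Fragile check: treats any stderr output as a failure.
failed_by_stderr = result.stderr != b""

# More robust check: rely on the exit status instead.
failed_by_status = result.returncode != 0

print(failed_by_stderr, failed_by_status)
```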

If none of the candidate locales are successfully configured, or the
LC_ALL locale override is defined in the current process environment,
then initialization will continue in the C locale and the Unicode
compatibility warning described in the next section will be emitted just
as it would for any other application.

If PYTHONCOERCECLOCALE=0 is explicitly set, initialization will continue
in the C locale and the Unicode compatibility warning described in the
next section will be automatically suppressed.

The interpreter will always check for the PYTHONCOERCECLOCALE
environment variable at startup (even when running under the -E or -I
switches), as the locale coercion check necessarily takes place before
any command line argument processing. For consistency, the runtime check
to determine whether or not to suppress the locale compatibility warning
will be similarly independent of these settings.
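
The decision rules described in this section can be summarised with a short Python sketch. The function names are hypothetical, treating an empty LC_ALL as unset is an assumption of this sketch, and the warning rule follows the opt-in PYTHONCOERCECLOCALE=warn behaviour described in the Implementation Notes.

```python
def should_coerce_c_locale(ctype_locale, env):
    """Decide whether locale coercion should be attempted.

    ctype_locale is the result of querying LC_CTYPE after the initial
    setlocale(LC_ALL, "") call; env is the process environment mapping.
    """
    if ctype_locale != "C":
        return False  # an explicitly configured locale is respected
    if env.get("LC_ALL"):
        return False  # LC_ALL override: stay in the C locale
    if env.get("PYTHONCOERCECLOCALE") == "0":
        return False  # coercion explicitly disabled
    return True


def should_emit_locale_warnings(ctype_locale, env):
    """Warnings are opt-in and only relevant when starting in the C locale."""
    return ctype_locale == "C" and env.get("PYTHONCOERCECLOCALE") == "warn"


print(should_coerce_c_locale("C", {}))
print(should_coerce_c_locale("C", {"PYTHONCOERCECLOCALE": "0"}))
```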

Legacy C locale warning during runtime initialization

By the time that Py_Initialize is called, arbitrary locale-dependent
operations may have taken place in the current process. This means that
by the time it is called, it is too late to reliably switch to a
different locale - doing so would introduce inconsistencies in decoded
text, even in the context of the standalone Python interpreter binary.

Accordingly, when Py_Initialize is called and CPython detects that the
configured locale is still the default C locale and
PYTHONCOERCECLOCALE=0 is not set, the following warning will be issued:

    Python runtime initialized with LC_CTYPE=C (a locale with default ASCII
    encoding), which may cause Unicode compatibility problems. Using C.UTF-8,
    C.utf8, or UTF-8 (if available) as alternative Unicode-compatible
    locales is recommended.

(Note: this warning ended up being silenced by default. See the
Implementation Note above for more details)

In this case, no actual change will be made to the locale settings.

Instead, the warning informs both system and application integrators
that they're running Python 3 in a configuration that we don't expect to
work properly.

The second sentence providing recommendations may eventually be
conditionally compiled based on the operating system (e.g. recommending
LC_CTYPE=UTF-8 on *BSD systems), but the initial implementation will
just use the common generic message shown above.

New build-time configuration options

While both of the above behaviours would be enabled by default, they
would also have new associated configuration options and preprocessor
definitions for the benefit of redistributors that want to override
those default settings.

The locale coercion behaviour would be controlled by the flag
--with[out]-c-locale-coercion, which would set the PY_COERCE_C_LOCALE
preprocessor definition.

The locale warning behaviour would be controlled by the flag
--with[out]-c-locale-warning, which would set the PY_WARN_ON_C_LOCALE
preprocessor definition.

(Note: this compile time warning option ended up being replaced by a
runtime PYTHONCOERCECLOCALE=warn option. See the Implementation Note
above for more details)

On platforms which don't use the autotools based build system (i.e.
Windows) these preprocessor variables would always be undefined.

Changes to the default error handling on the standard streams

Since Python 3.5, CPython has defaulted to using surrogateescape on the
standard streams (sys.stdin, sys.stdout) when it detects that the
current locale is C and no specific error handler has been set using
either the PYTHONIOENCODING environment variable or the
Py_SetStandardStreamEncoding API. For other locales, the default error
handler for the standard streams is strict.

In order to preserve this behaviour without introducing any behavioural
discrepancies between locale coercion and explicitly configuring a
locale, the coercion target locales (C.UTF-8, C.utf8, and UTF-8) will be
added to the list of locales that use surrogateescape as their default
error handler for the standard streams.

No changes are proposed to the default error handler for sys.stderr:
that will continue to be backslashreplace.
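
The resulting defaults can be summarised with the following Python sketch. The names are hypothetical, and the sketch deliberately ignores explicit overrides such as PYTHONIOENCODING.

```python
# Locales whose stdin and stdout default to surrogateescape under this PEP:
# the legacy C locale plus the three coercion targets.
SURROGATEESCAPE_LOCALES = frozenset({"C", "C.UTF-8", "C.utf8", "UTF-8"})


def default_stream_errors(stream_name, ctype_locale):
    """Return the default error handler name for a standard stream."""
    if stream_name == "stderr":
        return "backslashreplace"  # unchanged by this PEP
    if ctype_locale in SURROGATEESCAPE_LOCALES:
        return "surrogateescape"
    return "strict"


print(default_stream_errors("stdout", "C.UTF-8"))
print(default_stream_errors("stdout", "en_AU.UTF-8"))
```

On a running interpreter, the handlers actually in effect can be inspected through sys.stdin.errors, sys.stdout.errors and sys.stderr.errors.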

Changes to locale settings on Android

Independently of the other changes in this PEP, CPython on Android
systems will be updated to call setlocale(LC_ALL, "C.UTF-8") where it
currently calls setlocale(LC_ALL, "") and setlocale(LC_CTYPE, "C.UTF-8")
where it currently calls setlocale(LC_CTYPE, "").

This Android-specific behaviour is being introduced due to the following
Android-specific details:

-   on Android, passing "" to setlocale is equivalent to passing "C"
-   the C.UTF-8 locale is always available

Platform Support Changes

A new "Legacy C Locale" section will be added to PEP 11 that states:

-   as of CPython 3.7, *nix platforms are expected to provide at least
    one of C.UTF-8 (full locale), C.utf8 (full locale) or UTF-8
    (LC_CTYPE-only locale) as an alternative to the legacy C locale. Any
    Unicode related integration problems that occur only in the legacy C
    locale and cannot be reproduced in an appropriately configured
    non-ASCII locale will be closed as "won't fix".

Rationale

Improving the handling of the C locale

It has been clear for some time that the C locale's default encoding of
ASCII is entirely the wrong choice for development of modern networked
services. Newer languages like Rust and Go have eschewed that default
entirely, and instead made it a deployment requirement that systems be
configured to use UTF-8 as the text encoding for operating system
interfaces. Similarly, Node.js assumes UTF-8 by default (a behaviour
inherited from the V8 JavaScript engine) and requires custom build
settings to indicate it should use the system locale settings for
locale-aware operations. Both the JVM and the .NET CLR use UTF-16-LE as
their primary encoding for passing text between applications and the
application runtime (i.e. the JVM/CLR, not the host operating system).

The challenge for CPython has been the fact that in addition to being
used for network service development, it is also extensively used as an
embedded scripting language in larger applications, and as a desktop
application development language, where it is more important to be
consistent with other locale-aware components sharing the same process,
as well as with the user's desktop locale settings, than it is with the
emergent conventions of modern network service development.

The core premise of this PEP is that for all of these use cases, the
assumption of ASCII implied by the default "C" locale is the wrong
choice, and furthermore that the following assumptions are valid:

-   in desktop application use cases, the process locale will already be
    configured appropriately, and if it isn't, then that is an operating
    system or embedding application level problem that needs to be
    reported to and resolved by the operating system provider or
    application developer
-   in network service development use cases (especially those based on
    Linux containers), the process locale may not be configured at all,
    and if it isn't, then the expectation is that components will impose
    their own default encoding the way Rust, Go and Node.js do, rather
    than trusting the legacy C default encoding of ASCII the way CPython
    currently does

Defaulting to "surrogateescape" error handling on the standard IO streams

By coercing the locale away from the legacy C default and its assumption
of ASCII as the preferred text encoding, this PEP also disables the
implicit use of the "surrogateescape" error handler on the standard IO
streams that was introduced in Python 3.5 ([7]), as well as the
automatic use of surrogateescape when operating in PEP 540's proposed
UTF-8 mode.

Rather than introducing yet another configuration option to adjust that
behaviour, this PEP instead proposes to extend the "surrogateescape"
default for stdin and stdout error handling to also apply to the three
potential coercion target locales.

The aim of this behaviour is to attempt to ensure that operating system
provided text values are typically able to be transparently passed
through a Python 3 application even if it is incorrect in assuming that
that text has been encoded as UTF-8.

In particular, GB 18030[8] is a Chinese national text encoding standard
that handles all Unicode code points. It is formally incompatible with
both ASCII and UTF-8, but will nevertheless often tolerate processing as
surrogate escaped data - the points where GB 18030 reuses ASCII byte
values in an incompatible way are likely to be invalid in UTF-8, and
will therefore be escaped and opaque to string processing operations
that split on or search for the relevant ASCII code points. Operations
that don't involve splitting on or searching for particular ASCII or
Unicode code point values are almost certain to work correctly.

Similarly, Shift-JIS[9] and ISO-2022-JP[10] remain in widespread use in
Japan, and are incompatible with both ASCII and UTF-8, but will tolerate
text processing operations that don't involve splitting on or searching
for particular ASCII or Unicode code point values.
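
The pass-through behaviour described above can be checked directly with
Python's codec machinery. The following is a minimal sketch rather than
part of the reference implementation: it decodes GB 18030 bytes as if they
were UTF-8, first with the strict handler and then with surrogateescape,
and confirms that only the latter re-encodes to the original bytes. The
variable names are purely illustrative.

```python
# Sample text taken from the examples in this PEP, encoded as GB 18030.
original = "ℙƴ☂ℌøἤ\n".encode("gb18030")

# Strict decoding rejects the data outright, since it is not valid UTF-8.
try:
    original.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# surrogateescape maps each undecodable byte to a lone surrogate code point
# (U+DC80 to U+DCFF) instead of raising, while ASCII bytes decode as usual.
escaped = original.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogates back into the
# original bytes, so the data survives the round trip unchanged.
round_tripped = escaped.encode("utf-8", errors="surrogateescape")

print("strict decoding failed:", strict_failed)
print("round trip preserved bytes:", round_tripped == original)
```

Note that the escaped string is only safe to pass along: operations that
inspect or split on the escaped code points would be working with opaque
placeholders rather than the intended characters.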

As an example, consider two files, one encoded with UTF-8 (the default
encoding for en_AU.UTF-8), and one encoded with GB-18030 (the default
encoding for zh_CN.gb18030):

    $ python3 -c 'open("utf8.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("utf-8"))'
    $ python3 -c 'open("gb18030.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("gb18030"))'

On disk, we can see that these are two very different files:

    $ python3 -c 'print("UTF-8:  ", open("utf8.txt", "rb").read().strip()); \
                  print("GB18030:", open("gb18030.txt", "rb").read().strip())'
    UTF-8:   b'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
    GB18030: b'\x816\xbd6\x810\x9d0\x817\xa29\x816\xbc4\x810\x8b3\x816\x8d6'

Nevertheless, both can be rendered correctly to the terminal as long as
they're decoded prior to printing:

    $ python3 -c 'print("UTF-8:  ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
                  print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())'
    UTF-8:   ℙƴ☂ℌøἤ
    GB18030: ℙƴ☂ℌøἤ

By contrast, if we just pass along the raw bytes, as cat and similar
C/C++ utilities will tend to do:

    $ LANG=en_AU.UTF-8 cat utf8.txt gb18030.txt
    ℙƴ☂ℌøἤ
    �6�6�0�0�7�9�6�4�0�3�6�6

Even setting a specifically Chinese locale won't help in getting the
GB-18030 encoded file rendered correctly:

    $ LANG=zh_CN.gb18030 cat utf8.txt gb18030.txt
    ℙƴ☂ℌøἤ
    �6�6�0�0�7�9�6�4�0�3�6�6

The problem is that the terminal encoding setting remains UTF-8,
regardless of the nominal locale. A GB18030 terminal can be emulated
using the iconv utility:

    $ cat utf8.txt gb18030.txt | iconv -f GB18030 -t UTF-8
    鈩櫰粹槀鈩屆羔激
    ℙƴ☂ℌøἤ

This reverses the problem, such that the GB18030 file is rendered
correctly, but the UTF-8 file has been converted to unrelated hanzi
characters, rather than the expected rendering of "Python" as non-ASCII
characters.

With the emulated GB18030 terminal encoding, assuming UTF-8 in Python
results in both files being displayed incorrectly:

    $ python3 -c 'print("UTF-8:  ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
                  print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())' \
      | iconv -f GB18030 -t UTF-8
    UTF-8:   鈩櫰粹槀鈩屆羔激
    GB18030: 鈩櫰粹槀鈩屆羔激

However, setting the locale correctly means that the emulated GB18030
terminal now displays both files as originally intended:

    $ LANG=zh_CN.gb18030 \
      python3 -c 'print("UTF-8:  ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
                  print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())' \
      | iconv -f GB18030 -t UTF-8
    UTF-8:   ℙƴ☂ℌøἤ
    GB18030: ℙƴ☂ℌøἤ

The rationale for retaining surrogateescape as the default error handler
on the standard IO streams is that it will preserve the following helpful
behaviour in the C locale:

    $ cat gb18030.txt \
      | LANG=C python3 -c "import sys; print(sys.stdin.read())" \
      | iconv -f GB18030 -t UTF-8
    ℙƴ☂ℌøἤ

Rather than reverting to the exception currently seen when a UTF-8 based
locale is explicitly configured:

    $ cat gb18030.txt \
      | python3 -c "import sys; print(sys.stdin.read())" \
      | iconv -f GB18030 -t UTF-8
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/usr/lib64/python3.5/codecs.py", line 321, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte

As an added benefit, environments explicitly configured to use one of
the coercion target locales will implicitly gain the encoding
transparency behaviour currently enabled by default in the C locale.
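
The difference between the two pipelines above can be reproduced without
touching the real standard streams. The sketch below is an illustration
only: it wraps in-memory byte buffers in io.TextIOWrapper objects that
stand in for stdin and stdout, and the helper name pass_through is a
hypothetical one introduced for this example.

```python
import io


def pass_through(data: bytes, errors: str) -> bytes:
    """Read `data` through a UTF-8 text stream and write it back out."""
    # newline="" disables newline translation so the bytes can be compared
    # exactly on every platform.
    reader = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8",
                              errors=errors, newline="")
    sink = io.BytesIO()
    writer = io.TextIOWrapper(sink, encoding="utf-8",
                              errors=errors, newline="")
    writer.write(reader.read())
    writer.flush()
    return sink.getvalue()


gb18030_bytes = "ℙƴ☂ℌøἤ\n".encode("gb18030")

# With surrogateescape on both streams, the bytes pass through untouched.
assert pass_through(gb18030_bytes, "surrogateescape") == gb18030_bytes

# With the strict handler the read fails, as in the traceback above.
try:
    pass_through(gb18030_bytes, "strict")
except UnicodeDecodeError as exc:
    print("strict handler failed:", exc.reason)
```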

Avoiding setting PYTHONIOENCODING during UTF-8 locale coercion

Rather than changing the default handling of the standard streams during
interpreter initialization, earlier versions of this PEP proposed
setting PYTHONIOENCODING to utf-8:surrogateescape. This turned out to
create a significant compatibility problem: since the surrogateescape
handler only exists in Python 3.1+, running Python 2.7 processes in
subprocesses could potentially break in a confusing way with that
configuration.

The current design means that earlier Python versions will instead
retain their default strict error handling on the standard streams,
while Python 3.7+ will consistently use the more permissive
surrogateescape handler even when these locales are explicitly
configured (rather than being reached through locale coercion).
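
One way to observe which error handlers a given interpreter actually
selects is to ask a child process to report them. The following sketch
assumes Python 3.7 or later for the parent process; the helper name
stream_settings and the particular environment combinations are
illustrative choices, and the reported values will vary with the Python
version, the platform, and the locales installed on the system.

```python
import os
import subprocess
import sys

# Code run in the child: print "encoding:errors" for each standard stream.
REPORT = (
    "import sys; "
    "print(*(f'{s.encoding}:{s.errors}' "
    "for s in (sys.stdin, sys.stdout, sys.stderr)))"
)

# Variables cleared before applying overrides, so each run starts clean.
LOCALE_RELATED = ("LC_ALL", "LC_CTYPE", "LANG", "PYTHONIOENCODING",
                  "PYTHONCOERCECLOCALE", "PYTHONUTF8")


def stream_settings(**overrides):
    """Return the stdin, stdout and stderr settings of a child interpreter."""
    env = {k: v for k, v in os.environ.items() if k not in LOCALE_RELATED}
    env.update(overrides)
    result = subprocess.run([sys.executable, "-c", REPORT], env=env,
                            stdin=subprocess.DEVNULL, capture_output=True,
                            text=True, check=True)
    return result.stdout.split()


for overrides in ({"LANG": "C"},
                  {"LC_ALL": "C", "PYTHONCOERCECLOCALE": "0"},
                  {"LANG": "C", "PYTHONIOENCODING": "utf-8:surrogateescape"}):
    print(overrides, stream_settings(**overrides))
```

Replacing sys.executable with the path to an older interpreter makes it
possible to compare how different versions respond to the same
environment.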

Dropping official support for ASCII based text handling in the legacy C locale

We've been trying to get strict bytes/text separation to work reliably
in the legacy C locale for over a decade at this point. Not only haven't
we been able to get it to work, neither has anyone else - the only
viable alternatives identified have been to pass the bytes along
verbatim without eagerly decoding them to text (C/C++, Python 2.x, Ruby,
etc), or else to largely ignore the nominal C/C++ locale encoding and
assume the use of either UTF-8 (PEP 540, Rust, Go, Node.js, etc) or
UTF-16-LE (JVM, .NET CLR).

While this PEP ensures that developers that genuinely need to do so can
still opt-in to running their Python code in the legacy C locale (by
setting LC_ALL=C, PYTHONCOERCECLOCALE=0, or running a custom build that
sets --without-c-locale-coercion), it also makes it clear that we don't
expect Python 3's Unicode handling to be completely reliable in that
configuration, and the recommended alternative is to use a more
appropriate locale setting (potentially in combination with PEP 540's
UTF-8 mode, if that is available).

Providing implicit locale coercion only when running standalone

The major downside of the proposed design in this PEP is that it
introduces a potential discrepancy between the behaviour of the CPython
runtime when it is run as a standalone application and when it is run as
an embedded component inside a larger system (e.g. mod_wsgi running
inside Apache httpd).

Over the course of Python 3.x development, multiple attempts have been
made to improve the handling of incorrect locale settings at the point
where the Python interpreter is initialised. The problem that emerged is
that this is ultimately too late in the interpreter startup process -
data such as command line arguments and the contents of environment
variables may have already been retrieved from the operating system and
processed under the incorrect ASCII text encoding assumption well before
Py_Initialize is called.
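
The kind of inconsistency this creates can be illustrated with a single
value. In the sketch below, the byte string is an illustrative stand-in
for a command line argument supplied by the operating system; it is
decoded once under the ASCII assumption and once under UTF-8, using the
surrogateescape handler in both cases.

```python
# Bytes as the operating system might supply them for a command line
# argument: the UTF-8 encoding of "café" (an illustrative value).
raw_argument = "café".encode("utf-8")

# Decoded early, under the ASCII assumption implied by the C locale. The
# non-ASCII bytes become lone surrogates rather than raising an error.
decoded_as_ascii = raw_argument.decode("ascii", errors="surrogateescape")

# Decoded later, after the encoding assumption has been corrected.
decoded_as_utf8 = raw_argument.decode("utf-8", errors="surrogateescape")

# The two strings are different objects as far as text processing goes,
# so values captured before and after the change do not compare equal.
print(ascii(decoded_as_ascii), ascii(decoded_as_utf8))
print("equal as text:", decoded_as_ascii == decoded_as_utf8)

# Both still encode back to the same bytes with their own codec.
assert decoded_as_ascii.encode("ascii", "surrogateescape") == raw_argument
assert decoded_as_utf8.encode("utf-8", "surrogateescape") == raw_argument
```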

The problems created by those inconsistencies were then even harder to
diagnose and debug than those created by believing the operating
system's claim that ASCII was a suitable encoding to use for operating
system interfaces. This was the case even for the default CPython
binary, let alone larger C/C++ applications that embed CPython as a
scripting engine.

The approach proposed in this PEP handles that problem by moving the
locale coercion as early as possible in the interpreter startup sequence
when running standalone: it takes place directly in the C-level main()
function, even before calling in to the Py_Main() library function that
implements the features of the CPython interpreter CLI.

The Py_Initialize API then only gains an explicit warning (emitted on
stderr) when it detects use of the C locale, and relies on the embedding
application to specify something more reasonable.

That said, the reference implementation for this PEP adds most of the
functionality to the shared library, with the CLI being updated to
unconditionally call two new private APIs:

    if (_Py_LegacyLocaleDetected()) {
        _Py_CoerceLegacyLocale();
    }

These are similar to other "pre-configuration" APIs intended for
embedding applications: they're designed to be called before
Py_Initialize, and hence change the way the interpreter gets
initialized.

If these were made public (either as part of this PEP or in a subsequent
RFE), then it would be straightforward for other embedding applications
to recreate the same behaviour as is proposed for the CPython CLI.

Allowing restoration of the legacy behaviour

The CPython command line interpreter is often used to investigate faults
that occur in other applications that embed CPython, and those
applications may still be using the C locale even after this PEP is
implemented.

Providing a simple on/off switch for the locale coercion behaviour makes
it much easier to reproduce the behaviour of such applications for
debugging purposes, as well as making it easier to reproduce the
behaviour of older 3.x runtimes even when running a version with this
change applied.

Querying LC_CTYPE for C locale detection

LC_CTYPE is the actual locale category that CPython relies on to drive
the implicit decoding of environment variables, command line arguments,
and other text values received from the operating system.

As such, it makes sense to check it specifically when attempting to
determine whether or not the current locale configuration is likely to
cause Unicode handling problems.
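
A rough Python-level analogue of this check is sketched below. It is not
the reference implementation, which performs the query in C before the
interpreter is initialised; the function names are hypothetical, and the
assumption that the legacy locale reports itself as "C" (or "POSIX" on
some platforms) should be verified on the target system.

```python
import locale


def current_ctype_locale() -> str:
    """Return the name of the active LC_CTYPE locale without changing it."""
    # Passing None queries the category instead of setting it.
    return locale.setlocale(locale.LC_CTYPE, None)


def looks_like_legacy_c_locale() -> bool:
    """Report whether LC_CTYPE appears to be the legacy C locale."""
    # Assumption: the legacy locale is reported as "C" or "POSIX".
    return current_ctype_locale() in ("C", "POSIX")


print("LC_CTYPE:", current_ctype_locale())
print("legacy C locale:", looks_like_legacy_c_locale())
print("preferred encoding:", locale.getpreferredencoding(False))
```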

Explicitly setting LC_CTYPE for UTF-8 locale coercion

Python is often used as a glue language, integrating other C/C++ ABI
compatible components in the current process, and components written in
arbitrary languages in subprocesses.

Setting LC_CTYPE to C.UTF-8 is important to handle cases where the
problem has arisen from a setting like LC_CTYPE=UTF-8 being provided on
a system where no UTF-8 locale is defined (e.g. when a Mac OS X ssh
client is configured to forward locale settings, and the user logs into
a Linux server).

This should be sufficient to ensure that when the locale coercion is
activated, the switch to the UTF-8 based locale will be applied
consistently across the current process and any subprocesses that
inherit the current environment.

Avoiding setting LANG for UTF-8 locale coercion

Earlier versions of this PEP proposed setting the LANG category
independent default locale, in addition to setting LC_CTYPE.

This was later removed on the grounds that setting only LC_CTYPE is
sufficient to handle all of the problematic scenarios that the PEP aimed
to resolve, while setting LANG as well would break cases where LANG was
set correctly, and the locale problems were solely due to an incorrect
LC_CTYPE setting ([11]).

For example, consider a Python application that called the Linux date
utility in a subprocess rather than doing its own date formatting:

    $ LANG=ja_JP.UTF-8 LC_CTYPE=C date
    2017年  5月 23日 火曜日 17:31:03 JST

    $ LANG=ja_JP.UTF-8 LC_CTYPE=C.UTF-8 date  # Coercing only LC_CTYPE
    2017年  5月 23日 火曜日 17:32:58 JST

    $ LANG=C.UTF-8 LC_CTYPE=C.UTF-8 date  # Coercing both LC_CTYPE and LANG
    Tue May 23 17:31:10 JST 2017

With only LC_CTYPE updated in the Python process, the subprocess would
continue to behave as expected. However, if LANG was updated as well,
that would effectively override the LC_TIME setting and use the wrong
date formatting conventions.
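
Expressed in terms of the environment handed to a subprocess, the chosen
approach amounts to something like the following sketch. The helper name
coerce_ctype_only is hypothetical and the dictionaries stand in for a real
process environment.

```python
def coerce_ctype_only(environ: dict, target: str = "C.UTF-8") -> dict:
    """Return a copy of `environ` with only LC_CTYPE overridden."""
    coerced = dict(environ)
    coerced["LC_CTYPE"] = target
    return coerced


parent_env = {"LANG": "ja_JP.UTF-8", "LC_CTYPE": "C"}
child_env = coerce_ctype_only(parent_env)

# LANG is left alone, so categories such as LC_TIME that fall back to it
# keep the user's chosen conventions.
print(child_env)
```

Passing a dictionary built this way as the env argument of subprocess.run
corresponds to the second date invocation shown above.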

Avoiding setting LC_ALL for UTF-8 locale coercion

Earlier versions of this PEP proposed setting the LC_ALL locale
override, in addition to setting LC_CTYPE.

This was changed after it was determined that just setting LC_CTYPE (and,
at that point, LANG) should be sufficient to handle all the scenarios the
PEP aims to cover, as it avoids causing any problems in cases like the
following:

    $ LANG=C LC_MONETARY=ja_JP.utf8 ./python -c \
      "from locale import setlocale, LC_ALL, currency; setlocale(LC_ALL, ''); print(currency(1e6))"
    ￥1000000
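
The reason LC_ALL is problematic here follows from the POSIX lookup order
for locale categories: LC_ALL takes precedence over the category-specific
variable, which in turn takes precedence over LANG. The sketch below
models that lookup with a hypothetical helper, effective_locale; the
fallback to "C" when nothing is set is an assumption, since POSIX leaves
that default implementation-defined.

```python
def effective_locale(environ: dict, category: str) -> str:
    """Resolve the locale name used for `category`, e.g. "LC_MONETARY"."""
    # POSIX precedence: LC_ALL, then the category variable, then LANG.
    # Unset and empty variables are skipped.
    for name in ("LC_ALL", category, "LANG"):
        value = environ.get(name)
        if value:
            return value
    return "C"  # assumed default when no variable is set


env = {"LANG": "C", "LC_MONETARY": "ja_JP.utf8"}

# Overriding only LC_CTYPE leaves the monetary conventions untouched.
coerced = dict(env, LC_CTYPE="C.UTF-8")
print(effective_locale(coerced, "LC_CTYPE"),
      effective_locale(coerced, "LC_MONETARY"))

# Setting LC_ALL would replace the user's LC_MONETARY choice as well.
overridden = dict(env, LC_ALL="C.UTF-8")
print(effective_locale(overridden, "LC_MONETARY"))
```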

Skipping locale coercion if LC_ALL is set in the current environment

With locale coercion now only setting LC_CTYPE, it will have no
effect if LC_ALL is also set. To avoid emitting a spurious locale
coercion notice in that case, coercion is instead skipped entirely.
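
Taken together, the conditions under which coercion is attempted can be
summarised as a small decision function. This is a simplified model and
not the C code in the reference implementation: the function name, its
parameters, and the is_available callback are all hypothetical, while the
target locale names are the three candidates listed in this PEP's
specification.

```python
# Candidate target locales, tried in order.
COERCION_TARGETS = ("C.UTF-8", "C.utf8", "UTF-8")


def select_coercion_target(environ, current_ctype, is_available):
    """Return the locale to coerce LC_CTYPE to, or None to skip coercion."""
    if environ.get("PYTHONCOERCECLOCALE") == "0":
        return None  # coercion explicitly switched off
    if environ.get("LC_ALL"):
        return None  # LC_ALL would override LC_CTYPE anyway
    if current_ctype != "C":
        return None  # only the legacy C locale is coerced
    for candidate in COERCION_TARGETS:
        if is_available(candidate):
            return candidate
    return None  # no suitable target locale on this system


# Pretend only the "C.utf8" spelling is installed on this system.
installed = {"C.utf8"}.__contains__

print(select_coercion_target({"LANG": "C"}, "C", installed))
print(select_coercion_target({"LC_ALL": "C"}, "C", installed))
print(select_coercion_target({"PYTHONCOERCECLOCALE": "0"}, "C", installed))
```

Keeping the availability check behind a callback makes the decision easy
to exercise without depending on which locales the host actually provides.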

Considering locale coercion independently of "UTF-8 mode"

With both this PEP's locale coercion and PEP 540's UTF-8 mode under
consideration for Python 3.7, it makes sense to ask whether or not we
can limit ourselves to only doing one or the other, rather than making
both changes.

The UTF-8 mode proposed in PEP 540 has two major limitations that make
it a potential complement to this PEP rather than a potential
replacement.

First, unlike this PEP, PEP 540's UTF-8 mode makes it possible to change
default behaviours that are not currently configurable at all. While
that's exactly what makes the proposal interesting, it's also what makes
it an entirely unproven approach. By contrast, the approach proposed in
this PEP builds directly atop existing configuration settings for the C
locale system (LC_CTYPE, LANG) and Python's standard streams
(PYTHONIOENCODING) that have already been in use for years to handle the
kinds of compatibility problems discussed in this PEP.

Secondly, one of the things we know based on that experience is that the
proposed locale coercion can resolve problems not only in CPython
itself, but also in extension modules that interact with the standard
streams, like GNU readline. As an example, consider the following
interactive session from a PEP 538 enabled CPython build, where each
line after the first is executed by doing "up-arrow, left-arrow x4,
delete, enter":

    $ LANG=C ./python
    Python 3.7.0a0 (heads/pep538-coerce-c-locale:188e780, May  7 2017, 00:21:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌἤ")
    ℙƴ☂ℌἤ
    >>> print("ℙƴ☂ἤ")
    ℙƴ☂ἤ
    >>> print("ℙƴἤ")
    ℙƴἤ
    >>> print("ℙἤ")
    ℙἤ
    >>> print("ἤ")
    ἤ
    >>>

This is exactly what we'd expect from a well-behaved command history
editor.

By contrast, the following is what currently happens on an older release
if you only change the Python level stream encoding settings without
updating the locale settings:

    $ LANG=C PYTHONIOENCODING=utf-8:surrogateescape python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌ�")
     File "<stdin>", line 0

       ^
    SyntaxError: 'utf-8' codec can't decode bytes in position 20-21:
    invalid continuation byte

That particular misbehaviour is coming from GNU readline, not CPython -
because the command history editing wasn't UTF-8 aware, it corrupted
the history buffer and fed such nonsense to stdin that even the
surrogateescape error handler was bypassed. While PEP 540's UTF-8 mode
could technically be updated to also reconfigure readline, that's just
one extension module that might be interacting with the standard streams
without going through the CPython C API, and any change made by CPython
would only apply when readline is running directly as part of Python 3.7
rather than in a separate subprocess.

However, if we actually change the configured locale, GNU readline
starts behaving itself, without requiring any changes to the embedding
application:

    $ LANG=C.UTF-8 python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌἤ")
    ℙƴ☂ℌἤ
    >>> print("ℙƴ☂ἤ")
    ℙƴ☂ἤ
    >>> print("ℙƴἤ")
    ℙƴἤ
    >>> print("ℙἤ")
    ℙἤ
    >>> print("ἤ")
    ἤ
    >>>

    $ LC_CTYPE=C.UTF-8 python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌἤ")
    ℙƴ☂ℌἤ
    >>> print("ℙƴ☂ἤ")
    ℙƴ☂ἤ
    >>> print("ℙƴἤ")
    ℙƴἤ
    >>> print("ℙἤ")
    ℙἤ
    >>> print("ἤ")
    ἤ
    >>>

Enabling C locale coercion and warnings on Mac OS X, iOS and Android

On Mac OS X, iOS, and Android, CPython already assumes the use of UTF-8
for system interfaces, and we expect most other locale-aware components
to do the same.

Accordingly, this PEP originally proposed to disable locale coercion and
warnings at build time for these platforms, on the assumption that it
would be entirely redundant.

However, that assumption turned out to be incorrect, as subsequent
investigations showed that if you explicitly configure LANG=C on these
platforms, extension modules like GNU readline will misbehave in much
the same way as they do on other *nix systems.[12]

In addition, Mac OS X is also frequently used as a development and
testing platform for Python software intended for deployment to other
*nix environments (such as Linux or Android), and Linux is similarly
often used as a development and testing platform for mobile and Mac OS X
applications.

Accordingly, this PEP enables the locale coercion and warning features
by default on all platforms that use CPython's autotools based build
toolchain (i.e. everywhere other than Windows).

Implementation

The reference implementation is being developed in the
pep538-coerce-c-locale feature branch[13] in Alyssa Coghlan's fork of
the CPython repository on GitHub. A work-in-progress PR is available
at[14].

This reference implementation covers not only the enhancement request in
issue 28180[15], but also the Android compatibility fixes needed to
resolve issue 28997[16].

Backporting to earlier Python 3 releases

Backporting to Python 3.6.x

If this PEP is accepted for Python 3.7, redistributors backporting the
change specifically to their initial Python 3.6.x release will be both
allowed and encouraged. However, such backports should only be
undertaken either in conjunction with the changes needed to also provide
a suitable locale by default, or else specifically for platforms where
such a locale is already consistently available.

At least the Fedora project is planning to pursue this approach for the
upcoming Fedora 26 release[17].

Backporting to other 3.x releases

While the proposed behavioural change is seen primarily as a bug fix
addressing Python 3's current misbehaviour in the default ASCII-based C
locale, it still represents a reasonably significant change in the way
CPython interacts with the C locale system. As such, while some
redistributors may still choose to backport it to even earlier Python
3.x releases based on the needs and interests of their particular user
base, this wouldn't be encouraged as a general practice.

However, configuring Python 3 environments (such as base container
images) to use these configuration settings by default is both allowed
and recommended.

Acknowledgements

The locale coercion approach proposed in this PEP is inspired directly
by Armin Ronacher's handling of this problem in the click command line
utility development framework[18]:

    $ LANG=C python3 -c 'import click; cli = click.command()(lambda:None); cli()'
    Traceback (most recent call last):
      ...
    RuntimeError: Click will abort further execution because Python 3 was
    configured to use ASCII as encoding for the environment.  Either run this
    under Python 2 or consult http://click.pocoo.org/python3/ for mitigation
    steps.

    This system supports the C.UTF-8 locale which is recommended.
    You might be able to resolve your issue by exporting the
    following environment variables:

        export LC_ALL=C.UTF-8
        export LANG=C.UTF-8

The change was originally proposed as a downstream patch for Fedora's
system Python 3.6 package[19], and then reformulated as a PEP for Python
3.7 with a section allowing for backports to earlier versions by
redistributors. In parallel with the development of the upstream patch,
Charalampos Stratakis has been working on the Fedora 26 backport and
providing feedback on the practical viability of the proposed changes.

The initial draft was posted to the Python Linux SIG for discussion[20]
and then amended based on both that discussion and Victor Stinner's work
in PEP 540[21].

The "ℙƴ☂ℌøἤ" string used in the Unicode handling examples throughout
this PEP is taken from Ned Batchelder's excellent "Pragmatic Unicode"
presentation[22].

Stephen Turnbull has long provided valuable insight into the text
encoding handling challenges he regularly encounters at the University
of Tsukuba (筑波大学).

References

Copyright

This document has been placed in the public domain under the terms of
the CC0 1.0 license: https://creativecommons.org/publicdomain/zero/1.0/

[1] GNU C: How Programs Set the Locale
(https://www.gnu.org/software/libc/manual/html_node/Setting-the-Locale.html)

[2] GNU C: Locale Categories
(https://www.gnu.org/software/libc/manual/html_node/Locale-Categories.html)

[3] UTF-8 locale discussion on "locale.getdefaultlocale() fails on Mac
OS X with default language set to English"
(https://bugs.python.org/issue18378#msg215215)

[4] GNOME Flatpak (https://flatpak.org/)

[5] Ubuntu Snappy (https://www.ubuntu.com/desktop/snappy)

[6] glibc C.UTF-8 locale proposal
(https://sourceware.org/glibc/wiki/Proposals/C.UTF-8)

[7] Use "surrogateescape" error handler for sys.stdin and sys.stdout on
UNIX for the C locale (https://bugs.python.org/issue19977)

[8] GB 18030 (https://en.wikipedia.org/wiki/GB_18030)

[9] Shift-JIS (https://en.wikipedia.org/wiki/Shift_JIS)

[10] ISO-2022 (https://en.wikipedia.org/wiki/ISO/IEC_2022)

[11] Potential problems when setting LANG in addition to setting
LC_CTYPE
(https://mail.python.org/pipermail/python-dev/2017-May/147968.html)

[12] GNU readline misbehaviour on Mac OS X with LANG=C
(https://mail.python.org/pipermail/python-dev/2017-May/147897.html)

[13] GitHub branch diff for ncoghlan:pep538-coerce-c-locale
(https://github.com/python/cpython/compare/master...ncoghlan:pep538-coerce-c-locale)

[14] GitHub pull request for the reference implementation
(https://github.com/python/cpython/pull/659)

[15] CPython: sys.getfilesystemencoding() should default to utf-8
(https://bugs.python.org/issue28180)

[16] test_readline.test_nonascii fails on Android
(https://bugs.python.org/issue28997)

[17] Fedora 26 change proposal for locale coercion backport
(https://fedoraproject.org/wiki/Changes/python3_c.utf-8_locale)

[18] Locale configuration required for click applications under Python 3
(https://click.palletsprojects.com/en/5.x/python3/#python-3-surrogate-handling)

[19] Fedora: force C.UTF-8 when Python 3 is run under the C locale
(https://bugzilla.redhat.com/show_bug.cgi?id=1404918)

[20] linux-sig discussion of initial PEP draft
(https://mail.python.org/pipermail/linux-sig/2017-January/000014.html)

[21] Feedback notes from linux-sig discussion and PEP 540
(https://github.com/python/peps/issues/171)

[22] Pragmatic Unicode (https://nedbatchelder.com/text/unipain.html)