PEP: 597 Title: Add optional EncodingWarning Last-Modified: 07-Aug-2021
Author: Inada Naoki <songofacandy@gmail.com> Status: Final Type:
Standards Track Content-Type: text/x-rst Created: 05-Jun-2019
Python-Version: 3.10

Abstract

Add a new warning category EncodingWarning. It is emitted when the
encoding argument to open() is omitted and the default locale-specific
encoding is used.

The warning is disabled by default. A new -X warn_default_encoding
command-line option and a new PYTHONWARNDEFAULTENCODING environment
variable can be used to enable it.

A "locale" argument value for encoding is added too. It explicitly
specifies that the locale encoding should be used, silencing the
warning.

Motivation

Using the default encoding is a common mistake

Developers using macOS or Linux may forget that the default encoding is
not always UTF-8.

For example, using long_description = open("README.md").read() in
setup.py is a common mistake. Many Windows users cannot install such
packages if there is at least one non-ASCII character (e.g. emoji,
author names, copyright symbols, and the like) in their UTF-8-encoded
README.md file.

Of the 4000 most downloaded packages from PyPI, 489 use non-ASCII
characters in their README, and 82 fail to install from source on
non-UTF-8 locales due to not specifying an encoding for a non-ASCII
file.[1]

Another example is logging.basicConfig(filename="log.txt"). Some users
might expect it to use UTF-8 by default, but the locale encoding is
actually what is used.[2]

Even Python experts may assume that the default encoding is UTF-8. This
creates bugs that only happen on Windows; see[3],[4],[5], and[6] for
example.

Emitting a warning when the encoding argument is omitted will help find
such mistakes.

Explicit way to use locale-specific encoding

open(filename) isn't explicit about which encoding is expected:

-   If ASCII is assumed, this isn't a bug, but may result in decreased
    performance on Windows, particularly with non-Latin-1 locale
    encodings
-   If UTF-8 is assumed, this may be a bug or a platform-specific script
-   If the locale encoding is assumed, the behavior is as expected (but
    could change if future versions of Python modify the default)

From this point of view, open(filename) is not readable code.

encoding=locale.getpreferredencoding(False) can be used to specify the
locale encoding explicitly, but it is too long and easy to misuse (e.g.
one can forget to pass False as its argument).

This PEP provides an explicit way to specify the locale encoding.

Prepare to change the default encoding to UTF-8

Since UTF-8 has become the de-facto standard text encoding, we might
default to it for opening files in the future.

However, such a change will affect many applications and libraries. If
we start emitting DeprecationWarning everywhere the encoding argument is
omitted, it will be too noisy and painful.

Although this PEP doesn't propose changing the default encoding, it will
help enable that change by:

-   Reducing the number of omitted encoding arguments in libraries
    before we start emitting a DeprecationWarning by default.
-   Allowing users to pass encoding="locale" to suppress the current
    warning and any DeprecationWarning added in the future, as well as
    retaining consistent behavior if later Python versions change the
    default, ensuring support for any Python version >=3.10.

Specification

EncodingWarning

Add a new EncodingWarning warning class as a subclass of Warning. It is
emitted when the encoding argument is omitted and the default
locale-specific encoding is used.

Options to enable the warning

The -X warn_default_encoding option and the PYTHONWARNDEFAULTENCODING
environment variable are added. They are used to enable EncodingWarning.

sys.flags.warn_default_encoding is also added. The flag is true when
EncodingWarning is enabled.

When the flag is set, io.TextIOWrapper(), open() and other modules using
them will emit EncodingWarning when the encoding argument is omitted.

Since EncodingWarning is a subclass of Warning, they are shown by
default (if the warn_default_encoding flag is set), unlike
DeprecationWarning.

encoding="locale"

io.TextIOWrapper will accept "locale" as a valid argument to encoding.
It has the same meaning as the current encoding=None, except that
io.TextIOWrapper doesn't emit EncodingWarning when encoding="locale" is
specified.

io.text_encoding()

io.text_encoding() is a helper for functions with an encoding=None
parameter that pass it to io.TextIOWrapper() or open().

A pure Python implementation will look like this:

    def text_encoding(encoding, stacklevel=1):
        """A helper function to choose the text encoding.

        When *encoding* is not None, just return it.
        Otherwise, return the default text encoding (i.e. "locale").

        This function emits an EncodingWarning if *encoding* is None and
        sys.flags.warn_default_encoding is true.

        This function can be used in APIs with an encoding=None parameter
        that pass it to TextIOWrapper or open.
        However, please consider using encoding="utf-8" for new APIs.
        """
        if encoding is None:
            if sys.flags.warn_default_encoding:
                import warnings
                warnings.warn(
                    "'encoding' argument not specified.",
                    EncodingWarning, stacklevel + 2)
            encoding = "locale"
        return encoding

For example, pathlib.Path.read_text() can use it like this:

    def read_text(self, encoding=None, errors=None):
        encoding = io.text_encoding(encoding)
        with self.open(mode='r', encoding=encoding, errors=errors) as f:
            return f.read()

By using io.text_encoding(), EncodingWarning is emitted for the caller
of read_text() instead of read_text() itself.

Affected standard library modules

Many standard library modules will be affected by this change.

Most APIs accepting encoding=None will use io.text_encoding() as written
in the previous section.

Where using the locale encoding as the default encoding is reasonable,
encoding="locale" will be used instead. For example, the subprocess
module will use the locale encoding as the default for pipes.

Many tests use open() without encoding specified to read ASCII text
files. They should be rewritten with encoding="ascii".

Rationale

Opt-in warning

Although DeprecationWarning is suppressed by default, always emitting
DeprecationWarning when the encoding argument is omitted would be too
noisy.

Noisy warnings may lead developers to dismiss the DeprecationWarning.

"locale" is not a codec alias

We don't add "locale" as a codec alias because the locale can be changed
at runtime.

Additionally, TextIOWrapper checks os.device_encoding() when
encoding=None. This behavior cannot be implemented in a codec.

Backward Compatibility

The new warning is not emitted by default, so this PEP is 100%
backwards-compatible.

Forward Compatibility

Passing "locale" as the argument to encoding is not forward-compatible.
Code using it will not work on Python older than 3.10, and will instead
raise LookupError: unknown encoding: locale.

Until developers can drop Python 3.9 support, EncodingWarning can only
be used for finding missing encoding="utf-8" arguments.

How to Teach This

For new users

Since EncodingWarning is used to write cross-platform code, there is no
need to teach it to new users.

We can just recommend using UTF-8 for text files and using
encoding="utf-8" when opening them.

For experienced users

Using open(filename) to read text files encoded in UTF-8 is a common
mistake. It may not work on Windows because UTF-8 is not the default
encoding.

You can use -X warn_default_encoding or PYTHONWARNDEFAULTENCODING=1 to
find this type of mistake.

Omitting the encoding argument is not a bug when opening text files
encoded in the locale encoding, but encoding="locale" is recommended in
Python 3.10 and later because it is more explicit.

Reference Implementation

https://github.com/python/cpython/pull/19481

Discussions

The latest discussion thread is:
https://mail.python.org/archives/list/python-dev@python.org/thread/SFYUP2TWD5JZ5KDLVSTZ44GWKVY4YNCV/

-   Why not implement this in linters?
    -   encoding="locale" and io.text_encoding() must be implemented in
        Python.
    -   It is difficult to find all callers of functions wrapping open()
        or TextIOWrapper() (see the io.text_encoding() section).
-   Many developers will not use the option.
    -   Some will, and report the warnings to libraries they use, so the
        option is worth it even if many developers don't enable it.
    -   For example, I found[7] and[8] by running pip install -U pip,
        and[9] by running tox with the reference implementation. This
        demonstrates how this option can be used to find potential
        issues.

References

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.

[1] "Packages can't be installed when encoding is not UTF-8"
(https://github.com/methane/pep597-pypi-ascii)

[2] "Logging - Inconsistent behaviour when handling unicode"
(https://bugs.python.org/issue37111)

[3] Packaging tutorial in packaging.python.org didn't specify encoding
to read a README.md
(https://github.com/pypa/packaging.python.org/pull/682)

[4] json.tool had used locale encoding to read JSON files.
(https://bugs.python.org/issue33684)

[5] site: Potential UnicodeDecodeError when handling pth file
(https://bugs.python.org/issue33684)

[6] pypa/pip: "Installing packages fails if Python 3 installed into path
with non-ASCII characters" (https://github.com/pypa/pip/issues/9054)

[7] "site: Potential UnicodeDecodeError when handling pth file"
(https://bugs.python.org/issue43214)

[8] "[pypa/pip] Use encoding option or binary mode for open()"
(https://github.com/pypa/pip/pull/9608)

[9] "Possible UnicodeError caused by missing encoding="utf-8""
(https://github.com/tox-dev/tox/issues/1908)