PEP: 528 Title: Change Windows console encoding to UTF-8 Version:
$Revision$ Last-Modified: $Date$ Author: Steve Dower
<steve.dower@python.org> Status: Final Type: Standards Track
Content-Type: text/x-rst Created: 27-Aug-2016 Python-Version: 3.6
Post-History: 01-Sep-2016, 04-Sep-2016 Resolution:
https://mail.python.org/pipermail/python-dev/2016-September/146278.html

Abstract

Historically, Python uses the ANSI APIs for interacting with the Windows
operating system, often via C Runtime functions. However, these have
been long discouraged in favor of the UTF-16 APIs. Within the operating
system, all text is represented as UTF-16, and the ANSI APIs perform
encoding and decoding using the active code page.

This PEP proposes changing the default standard stream implementation on
Windows to use the Unicode APIs. This will allow users to print and
input the full range of Unicode characters at the default Windows
console. This also requires a subtle change to how the tokenizer parses
text from readline hooks.

Specific Changes

Add _io.WindowsConsoleIO

Currently an instance of _io.FileIO is used to wrap the file descriptors
representing standard input, output and error. We add a new class
(implemented in C) _io.WindowsConsoleIO that acts as a raw IO object
using the Windows console functions, specifically, ReadConsoleW and
WriteConsoleW.

This class will be used when the legacy-mode flag is not in effect, when
opening a standard stream by file descriptor and the stream is a console
buffer rather than a redirected file. Otherwise, _io.FileIO will be used
as it is today.

This is a raw (bytes) IO class that requires text to be passed encoded
with utf-8, which will be decoded to utf-16-le and passed to the Windows
APIs. Similarly, bytes read from the class will be provided by the
operating system as utf-16-le and converted into utf-8 when returned to
Python.

The use of an ASCII compatible encoding is required to maintain
compatibility with code that bypasses the TextIOWrapper and directly
writes ASCII bytes to the standard streams (for example, Twisted's
process_stdinreader.py). Code that assumes a particular encoding for the
standard streams other than ASCII will likely break.

Add _PyOS_WindowsConsoleReadline

To allow Unicode entry at the interactive prompt, a new readline hook is
required. The existing PyOS_StdioReadline function will delegate to the
new _PyOS_WindowsConsoleReadline function when reading from a file
descriptor that is a console buffer and the legacy-mode flag is not in
effect (the logic should be identical to above).

Since the readline interface is required to return an 8-bit encoded
string with no embedded nulls, the _PyOS_WindowsConsoleReadline function
transcodes from utf-16-le as read from the operating system into utf-8.

The function PyRun_InteractiveOneObject which currently obtains the
encoding from sys.stdin will select utf-8 unless the legacy-mode flag is
in effect. This may require readline hooks to change their encodings to
utf-8, or to require legacy-mode for correct behaviour.

Add legacy mode

Launching Python with the environment variable PYTHONLEGACYWINDOWSSTDIO
set will enable the legacy-mode flag, which completely restores the
previous behaviour.

Alternative Approaches

The win_unicode_console package is a pure-Python alternative to changing
the default behaviour of the console. It implements essentially the same
modifications as described here using pure Python code.

Code that may break

The following code patterns may break or see different behaviour as a
result of this change. All of these code samples require explicitly
choosing to use a raw file object in place of a more convenient wrapper
that would prevent any visible change.

Assuming stdin/stdout encoding

Code that assumes that the encoding required by sys.stdin.buffer or
sys.stdout.buffer is 'mbcs' or a more specific encoding may currently be
working by chance, but could encounter issues under this change. For
example:

    >>> sys.stdout.buffer.write(text.encode('mbcs'))
    >>> r = sys.stdin.buffer.read(16).decode('cp437')

To correct this code, the encoding specified on the TextIOWrapper should
be used, either implicitly or explicitly:

    >>> # Fix 1: Use wrapper correctly
    >>> sys.stdout.write(text)
    >>> r = sys.stdin.read(16)

    >>> # Fix 2: Use encoding explicitly
    >>> sys.stdout.buffer.write(text.encode(sys.stdout.encoding))
    >>> r = sys.stdin.buffer.read(16).decode(sys.stdin.encoding)

Incorrectly using the raw object

Code that uses the raw IO object and does not correctly handle partial
reads and writes may be affected. This is particularly important for
reads, where the number of characters read will never exceed one-fourth
of the number of bytes allowed, as there is no feasible way to prevent
input from encoding as much longer utf-8 strings:

    >>> raw_stdin = sys.stdin.buffer.raw
    >>> data = raw_stdin.read(15)
    abcdefghijklm
    b'abc'
    # data contains at most 3 characters, and never more than 12 bytes
    # error, as "defghijklm\r\n" is passed to the interactive prompt

To correct this code, the buffered reader/writer should be used, or the
caller should continue reading until its buffer is full:

    >>> # Fix 1: Use the buffered reader/writer
    >>> stdin = sys.stdin.buffer
    >>> data = stdin.read(15)
    abcedfghijklm
    b'abcdefghijklm\r\n'

    >>> # Fix 2: Loop until enough bytes have been read
    >>> raw_stdin = sys.stdin.buffer.raw
    >>> b = b''
    >>> while len(b) < 15:
    ...     b += raw_stdin.read(15)
    abcedfghijklm
    b'abcdefghijklm\r\n'

Using the raw object with small buffers

Code that uses the raw IO object and attempts to read less than four
characters will now receive an error. Because it's possible that any
single character may require up to four bytes when represented in utf-8,
requests must fail:

    >>> raw_stdin = sys.stdin.buffer.raw
    >>> data = raw_stdin.read(3)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: must read at least 4 bytes

The only workaround is to pass a larger buffer:

    >>> # Fix: Request at least four bytes
    >>> raw_stdin = sys.stdin.buffer.raw
    >>> data = raw_stdin.read(4)
    a
    b'a'
    >>> >>>

(The extra >>> is due to the newline remaining in the input buffer and
is expected in this situation.)

Copyright

This document has been placed in the public domain.

References