PEP: 529 Title: Change Windows filesystem encoding to UTF-8 Version:
$Revision$ Last-Modified: $Date$ Author: Steve Dower
<steve.dower@python.org> Status: Final Type: Standards Track
Content-Type: text/x-rst Created: 27-Aug-2016 Python-Version: 3.6
Post-History: 01-Sep-2016, 04-Sep-2016 Resolution:
https://mail.python.org/pipermail/python-dev/2016-September/146277.html

Abstract

Historically, Python uses the ANSI APIs for interacting with the Windows
operating system, often via C Runtime functions. However, these have
been long discouraged in favor of the UTF-16 APIs. Within the operating
system, all text is represented as UTF-16, and the ANSI APIs perform
encoding and decoding using the active code page. See Naming Files,
Paths, and Namespaces for more details.

This PEP proposes changing the default filesystem encoding on Windows to
utf-8, and changing all filesystem functions to use the Unicode APIs for
filesystem paths. This will not affect code that uses strings to
represent paths, however those that use bytes for paths will now be able
to correctly round-trip all valid paths in Windows filesystems.
Currently, the conversions between Unicode (in the OS) and bytes (in
Python) were lossy and would fail to round-trip characters outside of
the user's active code page.

Notably, this does not impact the encoding of the contents of files.
These will continue to default to locale.getpreferredencoding() (for
text files) or plain bytes (for binary files). This only affects the
encoding used when users pass a bytes object to Python where it is then
passed to the operating system as a path name.

Background

File system paths are almost universally represented as text with an
encoding determined by the file system. In Python, we expose these paths
via a number of interfaces, such as the os and io modules. Paths may be
passed either direction across these interfaces, that is, from the
filesystem to the application (for example, os.listdir()), or from the
application to the filesystem (for example, os.unlink()).

When paths are passed between the filesystem and the application, they
are either passed through as a bytes blob or converted to/from str using
os.fsencode() and os.fsdecode() or explicit encoding using
sys.getfilesystemencoding(). The result of encoding a string with
sys.getfilesystemencoding() is a blob of bytes in the native format for
the default file system.

On Windows, the native format for the filesystem is utf-16-le. The
recommended platform APIs for accessing the filesystem all accept and
return text encoded in this format. However, prior to Windows NT (and
possibly further back), the native format was a configurable machine
option and a separate set of APIs existed to accept this format. The
option (the "active code page") and these APIs (the "*A functions")
still exist in recent versions of Windows for backwards compatibility,
though new functionality often only has a utf-16-le API (the "*W
functions").

In Python, str is recommended because it can correctly round-trip all
characters used in paths (on POSIX with surrogateescape handling; on
Windows because str maps to the native representation). On Windows bytes
cannot round-trip all characters used in paths, as Python internally
uses the *A functions and hence the encoding is "whatever the active
code page is". Since the active code page cannot represent all Unicode
characters, the conversion of a path into bytes can lose information
without warning or any available indication.

As a demonstration of this:

    >>> open('test\uAB00.txt', 'wb').close()
    >>> import glob
    >>> glob.glob('test*')
    ['test\uab00.txt']
    >>> glob.glob(b'test*')
    [b'test?.txt']

The Unicode character in the second call to glob has been replaced by a
'?', which means passing the path back into the filesystem will result
in a FileNotFoundError. The same results may be observed with
os.listdir() or any function that matches the return type to the
parameter type.

While one user-accessible fix is to use str everywhere, POSIX systems
generally do not suffer from data loss when using bytes exclusively as
the bytes are the canonical representation. Even if the encoding is
"incorrect" by some standard, the file system will still map the bytes
back to the file. Making use of this avoids the cost of decoding and
reencoding, such that (theoretically, and only on POSIX), code such as
this may be faster because of the use of b'.' compared to using '.':

    >>> for f in os.listdir(b'.'):
    ...     os.stat(f)
    ...

As a result, POSIX-focused library authors prefer to use bytes to
represent paths. For some authors it is also a convenience, as their
code may receive bytes already known to be encoded correctly, while
others are attempting to simplify porting their code from Python 2.
However, the correctness assumptions do not carry over to Windows where
Unicode is the canonical representation, and errors may result. This
potential data loss is why the use of bytes paths on Windows was
deprecated in Python 3.3 - all of the above code snippets produce
deprecation warnings on Windows.

Proposal

Currently the default filesystem encoding is 'mbcs', which is a
meta-encoder that uses the active code page. However, when bytes are
passed to the filesystem they go through the *A APIs and the operating
system handles encoding. In this case, paths are always encoded using
the equivalent of 'mbcs:replace' with no opportunity for Python to
override or change this.

This proposal would remove all use of the *A APIs and only ever call the
*W APIs. When Windows returns paths to Python as str, they will be
decoded from utf-16-le and returned as text (in whatever the minimal
representation is). When Python code requests paths as bytes, the paths
will be transcoded from utf-16-le into utf-8 using surrogatepass
(Windows does not validate surrogate pairs, so it is possible to have
invalid surrogates in filenames). Equally, when paths are provided as
bytes, they are transcoded from utf-8 into utf-16-le and passed to the
*W APIs.

The use of utf-8 will not be configurable, except for the provision of a
"legacy mode" flag to revert to the previous behaviour.

The surrogateescape error mode does not apply here, as the concern is
not about retaining nonsensical bytes. Any path returned from the
operating system will be valid Unicode, while invalid paths created by
the user should raise a decoding error (currently these would raise
OSError or a subclass).

The choice of utf-8 bytes (as opposed to utf-16-le bytes) is to ensure
the ability to round-trip path names and allow basic manipulation (for
example, using the os.path module) when assuming an ASCII-compatible
encoding. Using utf-16-le as the encoding is more pure, but will cause
more issues than are resolved.

This change would also undeprecate the use of bytes paths on Windows. No
change to the semantics of using bytes as a path is required - as
before, they must be encoded with the encoding specified by
sys.getfilesystemencoding().

Specific Changes

Update sys.getfilesystemencoding

Remove the default value for Py_FileSystemDefaultEncoding and set it in
initfsencoding() to utf-8, or if the legacy-mode switch is enabled to
mbcs.

Update the implementations of PyUnicode_DecodeFSDefaultAndSize() and
PyUnicode_EncodeFSDefault() to use the utf-8 codec, or if the
legacy-mode switch is enabled the existing mbcs codec.

Add sys.getfilesystemencodeerrors

As the error mode may now change between surrogatepass and replace,
Python code that manually performs encoding also needs access to the
current error mode. This includes the implementation of os.fsencode()
and os.fsdecode(), which currently assume an error mode based on the
codec.

Add a public Py_FileSystemDefaultEncodeErrors, similar to the existing
Py_FileSystemDefaultEncoding. The default value on Windows will be
surrogatepass or in legacy mode, replace. The default value on all other
platforms will be surrogateescape.

Add a public sys.getfilesystemencodeerrors() function that returns the
current error mode.

Update the implementations of PyUnicode_DecodeFSDefaultAndSize() and
PyUnicode_EncodeFSDefault() to use the variable for error mode rather
than constant strings.

Update the implementations of os.fsencode() and os.fsdecode() to use
sys.getfilesystemencodeerrors() instead of assuming the mode.

Update path_converter

Update the path converter to always decode bytes or buffer objects into
text using PyUnicode_DecodeFSDefaultAndSize().

Change the narrow field from a char* string into a flag that indicates
whether the original object was bytes. This is required for functions
that need to return paths using the same type as was originally
provided.

Remove unused ANSI code

Remove all code paths using the narrow field, as these will no longer be
reachable by any caller. These are only used within posixmodule.c. Other
uses of paths should have use of bytes paths replaced with decoding and
use of the *W APIs.

Add legacy mode

Add a legacy mode flag, enabled by the environment variable
PYTHONLEGACYWINDOWSFSENCODING or by a function call to
sys._enablelegacywindowsfsencoding(). The function call can only be used
to enable the flag and should be used by programs as close to
initialization as possible. Legacy mode cannot be disabled while Python
is running.

When this flag is set, the default filesystem encoding is set to mbcs
rather than utf-8, and the error mode is set to replace rather than
surrogatepass. Paths will continue to decode to wide characters and only
*W APIs will be called, however, the bytes passed in and received from
Python will be encoded the same as prior to this change.

Undeprecate bytes paths on Windows

Using bytes as paths on Windows is currently deprecated. We would
announce that this is no longer the case, and that paths when encoded as
bytes should use whatever is returned from sys.getfilesystemencoding()
rather than the user's active code page.

Beta experiment

To assist with determining the impact of this change, we propose
applying it to 3.6.0b1 provisionally with the intent being to make a
final decision before 3.6.0b4.

During the experiment period, decoding and encoding exception messages
will be expanded to include a link to an active online discussion and
encourage reporting of problems.

If it is decided to revert the functionality for 3.6.0b4, the
implementation change would be to permanently enable the legacy mode
flag, change the environment variable to PYTHONWINDOWSUTF8FSENCODING and
function to sys._enablewindowsutf8fsencoding() to allow enabling the
functionality on a case-by-case basis, as opposed to disabling it.

It is expected that if we cannot feasibly make the change for 3.6 due to
compatibility concerns, it will not be possible to make the change at
any later time in Python 3.x.

Affected Modules

This PEP implicitly includes all modules within the Python that either
pass path names to the operating system, or otherwise use
sys.getfilesystemencoding().

As of 3.6.0a4, the following modules require modification:

-   os
-   _overlapped
-   _socket
-   subprocess
-   zipimport

The following modules use sys.getfilesystemencoding() but do not need
modification:

-   gc (already assumes bytes are utf-8)
-   grp (not compiled for Windows)
-   http.server (correctly includes codec name with transmitted data)
-   idlelib.editor (should not be needed; has fallback handling)
-   nis (not compiled for Windows)
-   pwd (not compiled for Windows)
-   spwd (not compiled for Windows)
-   _ssl (only used for ASCII constants)
-   tarfile (code unused on Windows)
-   _tkinter (already assumes bytes are utf-8)
-   wsgiref (assumed as the default encoding for unknown environments)
-   zipapp (code unused on Windows)

The following native code uses one of the encoding or decoding
functions, but do not require any modification:

-   Parser/parsetok.c (docs already specify sys.getfilesystemencoding())
-   Python/ast.c (docs already specify sys.getfilesystemencoding())
-   Python/compile.c (undocumented, but Python filesystem encoding
    implied)
-   Python/errors.c (docs already specify os.fsdecode())
-   Python/fileutils.c (code unused on Windows)
-   Python/future.c (undocumented, but Python filesystem encoding
    implied)
-   Python/import.c (docs already specify utf-8)
-   Python/importdl.c (code unused on Windows)
-   Python/pythonrun.c (docs already specify
    sys.getfilesystemencoding())
-   Python/symtable.c (undocumented, but Python filesystem encoding
    implied)
-   Python/thread.c (code unused on Windows)
-   Python/traceback.c (encodes correctly for comparing strings)
-   Python/_warnings.c (docs already specify os.fsdecode())

Rejected Alternatives

Use strict mbcs decoding

This is essentially the same as the proposed change, but instead of
changing sys.getfilesystemencoding() to utf-8 it is changed to mbcs
(which dynamically maps to the active code page).

This approach allows the use of new functionality that is only available
as *W APIs and also detection of encoding/decoding errors. For example,
rather than silently replacing Unicode characters with '?', it would be
possible to warn or fail the operation.

Compared to the proposed fix, this could enable some new functionality
but does not fix any of the problems described initially. New runtime
errors may cause some problems to be more obvious and lead to fixes,
provided library maintainers are interested in supporting Windows and
adding a separate code path to treat filesystem paths as strings.

Making the encoding mbcs without strict errors is equivalent to the
legacy-mode switch being enabled by default. This is a possible course
of action if there is significant breakage of actual code and a need to
extend the deprecation period, but still a desire to have the
simplifications to the CPython source.

Make bytes paths an error on Windows

By preventing the use of bytes paths on Windows completely we prevent
users from hitting encoding issues.

However, the motivation for this PEP is to increase the likelihood that
code written on POSIX will also work correctly on Windows. This
alternative would move the other direction and make such code completely
incompatible. As this does not benefit users in any way, we reject it.

Make bytes paths an error on all platforms

By deprecating and then disable the use of bytes paths on all platforms
we prevent users from hitting encoding issues regardless of where the
code was originally written. This would require a full deprecation
cycle, as there are currently no warnings on platforms other than
Windows.

This is likely to be seen as a hostile action against Python developers
in general, and as such is rejected at this time.

Code that may break

The following code patterns may break or see different behaviour as a
result of this change. Each of these examples would have been fragile in
code intended for cross-platform use. The suggested fixes demonstrate
the most compatible way to handle path encoding issues across all
platforms and across multiple Python versions.

Note that all of these examples produce deprecation warnings on Python
3.3 and later.

Not managing encodings across boundaries

Code that does not manage encodings when crossing protocol boundaries
may currently be working by chance, but could encounter issues when
either encoding changes. Note that the source of filename may be any
function that returns a bytes object, as illustrated in a second example
below:

    >>> filename = open('filename_in_mbcs.txt', 'rb').read()
    >>> text = open(filename, 'r').read()

To correct this code, the encoding of the bytes in filename should be
specified, either when reading from the file or before using the value:

    >>> # Fix 1: Open file as text (default encoding)
    >>> filename = open('filename_in_mbcs.txt', 'r').read()
    >>> text = open(filename, 'r').read()

    >>> # Fix 2: Open file as text (explicit encoding)
    >>> filename = open('filename_in_mbcs.txt', 'r', encoding='mbcs').read()
    >>> text = open(filename, 'r').read()

    >>> # Fix 3: Explicitly decode the path
    >>> filename = open('filename_in_mbcs.txt', 'rb').read()
    >>> text = open(filename.decode('mbcs'), 'r').read()

Where the creator of filename is separated from the user of filename,
the encoding is important information to include:

    >>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'.encode('mbcs')

    >>> filename = some_object.filename
    >>> type(filename)
    <class 'bytes'>
    >>> text = open(filename, 'r').read()

To fix this code for best compatibility across operating systems and
Python versions, the filename should be exposed as str:

    >>> # Fix 1: Expose as str
    >>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'

    >>> filename = some_object.filename
    >>> type(filename)
    <class 'str'>
    >>> text = open(filename, 'r').read()

Alternatively, the encoding used for the path needs to be made available
to the user. Specifying os.fsencode() (or sys.getfilesystemencoding())
is an acceptable choice, or a new attribute could be added with the
exact encoding:

    >>> # Fix 2: Use fsencode
    >>> some_object.filename = os.fsencode(r'C:\Users\Steve\Documents\my_file.txt')

    >>> filename = some_object.filename
    >>> type(filename)
    <class 'bytes'>
    >>> text = open(filename, 'r').read()


    >>> # Fix 3: Expose as explicit encoding
    >>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'.encode('cp437')
    >>> some_object.filename_encoding = 'cp437'

    >>> filename = some_object.filename
    >>> type(filename)
    <class 'bytes'>
    >>> filename = filename.decode(some_object.filename_encoding)
    >>> type(filename)
    <class 'str'>
    >>> text = open(filename, 'r').read()

Explicitly using 'mbcs'

Code that explicitly encodes text using 'mbcs' before passing to file
system APIs is now passing incorrectly encoded bytes. Note that the
source of filename in this example is not relevant, provided that it is
a str:

    >>> filename = open('files.txt', 'r').readline().rstrip()
    >>> text = open(filename.encode('mbcs'), 'r')

To correct this code, the string should be passed without explicit
encoding, or should use os.fsencode():

    >>> # Fix 1: Do not encode the string
    >>> filename = open('files.txt', 'r').readline().rstrip()
    >>> text = open(filename, 'r')

    >>> # Fix 2: Use correct encoding
    >>> filename = open('files.txt', 'r').readline().rstrip()
    >>> text = open(os.fsencode(filename), 'r')

References

Copyright

This document has been placed in the public domain.