PEP: 277 Title: Unicode file name support for Windows NT Author: Neil
Hodgson <neilh@scintilla.org> Status: Final Type: Standards Track
Content-Type: text/x-rst Created: 11-Jan-2002 Python-Version: 2.3
Post-History:

Abstract

This PEP discusses supporting access to all files possible on Windows NT
by passing Unicode file names directly to the system's wide-character
functions.

Rationale

Python 2.2 on Win32 platforms converts Unicode file names passed to open
and to functions in the os module into the 'mbcs' encoding before
passing the result to the operating system. This is often successful in
the common case where the script is operating with the locale set to the
same value as when the file was created. Most machines are set up as one
locale and rarely if ever changed from this locale. For some users,
locale is changed more often and on servers there are often files saved
by users using different locales.

On Windows NT and descendent operating systems, including Windows 2000
and Windows XP, wide-character APIs are available that provide direct
access to all file names, including those that are not representable
using the current locale. The purpose of this proposal is to provide
access to these wide-character APIs through the standard Python file
object and posix module and so provide access to all files on Windows
NT.

Specification

On Windows platforms which provide wide-character file APIs, when
Unicode arguments are provided to file APIs, wide-character calls are
made instead of the standard C library and posix calls.

The Python file object is extended to use a Unicode file name argument
directly rather than converting it. This affects the file object
constructor file(filename[, mode[, bufsize]]) and also the open function
which is an alias of this constructor. When a Unicode filename argument
is used here then the name attribute of the file object will be Unicode.
The representation of a file object, repr(f) will display Unicode file
names as an escaped string in a similar manner to the representation of
Unicode strings.

The posix module contains functions that take file or directory names:
chdir, listdir, mkdir, open, remove, rename, rmdir, stat, and
_getfullpathname. These will use Unicode arguments directly rather than
converting them. For the rename function, this behaviour is triggered
when either of the arguments is Unicode and the other argument converted
to Unicode using the default encoding.

The listdir function currently returns a list of strings. Under this
proposal, it will return a list of Unicode strings when its path
argument is Unicode.

Restrictions

On the consumer Windows operating systems, Windows 95, Windows 98, and
Windows ME, there are no wide-character file APIs so behaviour is
unchanged under this proposal. It may be possible in the future to
extend this proposal to cover these operating systems as the VFAT-32
file system used by them does support Unicode file names but access is
difficult and so implementing this would require much work. The
"Microsoft Layer for Unicode" could be a starting point for implementing
this.

Python can be compiled with the size of Unicode characters set to 4
bytes rather than 2 by defining PY_UNICODE_TYPE to be a 4 byte type and
Py_UNICODE_SIZE to be 4. As the Windows API does not accept 4 byte
characters, the features described in this proposal will not work in
this mode so the implementation falls back to the current 'mbcs'
encoding technique. This restriction could be lifted in the future by
performing extra conversions using PyUnicode_AsWideChar but for now that
would add too much complexity for a very rarely used feature.

Reference Implementation

The implementation is available at[1].

References

[1] Microsoft Windows APIs https://msdn.microsoft.com/

Copyright

This document has been placed in the public domain.

[1] https://github.com/python/cpython/issues/37017