PEP 277 – Unicode file name support for Windows NT
- Neil Hodgson <neilh at scintilla.org>
- Standards Track
This PEP discusses supporting access to all files possible on Windows NT by passing Unicode file names directly to the system’s wide-character functions.
Python 2.2 on Win32 platforms converts Unicode file names passed
to open and to functions in the
os module into the ‘mbcs’ encoding
before passing the result to the operating system. This is often
successful in the common case where the script is operating with
the locale set to the same value as when the file was created.
Most machines are set up with a single locale that is rarely, if
ever, changed. For some users, the locale is changed more often,
and on servers there are often files saved by users using
different locales.
On Windows NT and descendant operating systems, including Windows 2000 and Windows XP, wide-character APIs are available that provide direct access to all file names, including those that are not representable using the current locale. The purpose of this proposal is to expose these wide-character APIs through the standard Python file object and posix module, and so to provide access to all files on Windows NT.
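The failure mode described above can be illustrated without a Windows machine. The following is a minimal sketch written for current Python, where str is the Unicode type. It assumes cp1252 as a stand-in for whatever code page 'mbcs' resolves to on a given system, and the helper names representable and lossy_name are hypothetical, not part of this proposal.

```python
def representable(name, codepage="cp1252"):
    """Return True if `name` can be encoded to `codepage` without loss.

    `codepage` is an assumed stand-in for the locale-dependent 'mbcs'
    encoding, which only exists on Windows.
    """
    try:
        return name.encode(codepage).decode(codepage) == name
    except UnicodeEncodeError:
        return False


def lossy_name(name, codepage="cp1252"):
    """Show what a name looks like after a lossy conversion to `codepage`."""
    # The "replace" error handler substitutes "?" for unencodable characters.
    return name.encode(codepage, "replace").decode(codepage)


western = "r\u00e9sum\u00e9.txt"        # representable in cp1252
japanese = "\u65e5\u672c\u8a9e.txt"     # not representable in cp1252

print(representable(western))
print(representable(japanese))
print(lossy_name(japanese))
```

A name that fails this check cannot be reached through a narrow-character call under that code page, which is the gap the wide-character APIs close.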
On Windows platforms which provide wide-character file APIs, when Unicode arguments are provided to file APIs, wide-character calls are made instead of the standard C library and posix calls.
The Python file object is extended to use a Unicode file name
argument directly rather than converting it. This affects the
file object constructor
file(filename[, mode[, bufsize]]) and also the
open function, which is an alias of this constructor. When a
Unicode filename argument is used here then the
name attribute of
the file object will be Unicode. The representation of a file
object, repr(f), will display Unicode file names as an escaped
string in a similar manner to the representation of Unicode
strings.
The posix module contains functions that take file or directory
names, including rename, listdir, and _getfullpathname. These will use Unicode
arguments directly rather than converting them. For the
rename function, this
behaviour is triggered when either of the arguments is Unicode and
the other argument is converted to Unicode using the default
encoding.
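The rename rule can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation: it is written for current Python, where str and bytes play the roles that the Unicode and string types play in Python 2.2, and coerce_rename_args is a hypothetical helper name.

```python
import sys


def coerce_rename_args(src, dst, encoding=None):
    """If either name is Unicode (str), return both names as Unicode.

    A bytes name is decoded with the interpreter's default encoding,
    mirroring the rule stated for rename. Two bytes names are returned
    unchanged, so the narrow-character path would still be taken.
    """
    if encoding is None:
        encoding = sys.getdefaultencoding()
    if isinstance(src, str) or isinstance(dst, str):
        if isinstance(src, bytes):
            src = src.decode(encoding)
        if isinstance(dst, bytes):
            dst = dst.decode(encoding)
    return src, dst


print(coerce_rename_args("old.txt", b"new.txt"))
print(coerce_rename_args(b"old.txt", b"new.txt"))
```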
The listdir function currently returns a list of strings. Under
this proposal, it will return a list of Unicode strings when its
path argument is Unicode.
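The same "result type follows argument type" rule is visible in current Python, where os.listdir returns str entries for a str path and bytes entries for a bytes path. The sketch below assumes a modern Python 3 interpreter rather than Python 2.2, and listdir_entry_types is a hypothetical helper name used only for illustration.

```python
import os
import tempfile


def listdir_entry_types(filename="example.txt"):
    """Create one file in a temporary directory and list it two ways.

    Returns the entry type seen with a str path and with a bytes path.
    """
    with tempfile.TemporaryDirectory() as tmp:
        # Create an empty file so the directory has one entry.
        with open(os.path.join(tmp, filename), "w"):
            pass
        text_entries = os.listdir(tmp)                # str path
        byte_entries = os.listdir(os.fsencode(tmp))   # bytes path
        return type(text_entries[0]), type(byte_entries[0])


print(listdir_entry_types())
```

Passing the Unicode form of the path is what lets a script see names that the byte-oriented listing cannot represent in the current locale.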
On the consumer Windows operating systems, Windows 95, Windows 98, and Windows ME, there are no wide-character file APIs so behaviour is unchanged under this proposal. It may be possible in the future to extend this proposal to cover these operating systems as the VFAT-32 file system used by them does support Unicode file names but access is difficult and so implementing this would require much work. The “Microsoft Layer for Unicode” could be a starting point for implementing this.
Python can be compiled with the size of Unicode characters set to
4 bytes rather than 2 by defining
PY_UNICODE_TYPE to be a 4 byte type and
Py_UNICODE_SIZE to be 4. As the Windows API does not
accept 4 byte characters, the features described in this proposal
will not work in this mode so the implementation falls back to the
current ‘mbcs’ encoding technique. This restriction could be lifted
in the future by performing extra conversions using
PyUnicode_AsWideChar but for now that would add too much
complexity for a very rarely used feature.
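The size mismatch behind this restriction can be seen by counting 16-bit units. Windows wide characters are 16 bits wide, so a character outside the Basic Multilingual Plane occupies two of them, while a 4 byte build stores it in a single element. The following sketch only illustrates that difference; utf16_code_units is a hypothetical helper, and nothing here reflects how the implementation would perform the conversion.

```python
def utf16_code_units(text):
    """Count the 16-bit units needed to hold `text` as UTF-16."""
    # Each UTF-16 code unit is two bytes; "utf-16-le" adds no byte order mark.
    return len(text.encode("utf-16-le")) // 2


bmp_name = "abc"
non_bmp_name = "\U00010400"   # one code point outside the BMP

print(len(bmp_name), utf16_code_units(bmp_name))
print(len(non_bmp_name), utf16_code_units(non_bmp_name))
```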
The implementation is available at .
This document has been placed in the public domain.
Last modified: 2022-10-05 16:48:43+00:00 GMT