PEP: 471 Title: os.scandir() function -- a better and faster directory
iterator Version: $Revision$ Last-Modified: $Date$ Author: Ben Hoyt
<benhoyt@gmail.com> BDFL-Delegate: Victor Stinner <vstinner@python.org>
Status: Final Type: Standards Track Content-Type: text/x-rst Created:
30-May-2014 Python-Version: 3.5 Post-History: 27-Jun-2014, 08-Jul-2014,
14-Jul-2014

Abstract

This PEP proposes including a new directory iteration function,
os.scandir(), in the standard library. This new function adds useful
functionality and increases the speed of os.walk() by 2-20 times
(depending on the platform and file system) by avoiding calls to
os.stat() in most cases.

Rationale

Python's built-in os.walk() is significantly slower than it needs to be,
because -- in addition to calling os.listdir() on each directory -- it
executes the stat() system call or GetFileAttributes() on each file to
determine whether the entry is a directory or not.

But the underlying system calls -- FindFirstFile / FindNextFile on
Windows and readdir on POSIX systems --already tell you whether the
files returned are directories or not, so no further system calls are
needed. Further, the Windows system calls return all the information for
a stat_result object on the directory entry, such as file size and last
modification time.

In short, you can reduce the number of system calls required for a tree
function like os.walk() from approximately 2N to N, where N is the total
number of files and directories in the tree. (And because directory
trees are usually wider than they are deep, it's often much better than
this.)

In practice, removing all those extra system calls makes os.walk() about
8-9 times as fast on Windows, and about 2-3 times as fast on POSIX
systems. So we're not talking about micro-optimizations. See more
benchmarks here.

Somewhat relatedly, many people (see Python Issue 11406) are also keen
on a version of os.listdir() that yields filenames as it iterates
instead of returning them as one big list. This improves memory
efficiency for iterating very large directories.

So, as well as providing a scandir() iterator function for calling
directly, Python's existing os.walk() function can be sped up a huge
amount.

Implementation

The implementation of this proposal was written by Ben Hoyt (initial
version) and Tim Golden (who helped a lot with the C extension module).
It lives on GitHub at benhoyt/scandir. (The implementation may lag
behind the updates to this PEP a little.)

Note that this module has been used and tested (see "Use in the wild"
section in this PEP), so it's more than a proof-of-concept. However, it
is marked as beta software and is not extensively battle-tested. It will
need some cleanup and more thorough testing before going into the
standard library, as well as integration into posixmodule.c.

Specifics of proposal

os.scandir()

Specifically, this PEP proposes adding a single function to the os
module in the standard library, scandir, that takes a single, optional
string as its argument:

    scandir(path='.') -> generator of DirEntry objects

Like listdir, scandir calls the operating system's directory iteration
system calls to get the names of the files in the given path, but it's
different from listdir in two ways:

-   Instead of returning bare filename strings, it returns lightweight
    DirEntry objects that hold the filename string and provide simple
    methods that allow access to the additional data the operating
    system may have returned.
-   It returns a generator instead of a list, so that scandir acts as a
    true iterator instead of returning the full list immediately.

scandir() yields a DirEntry object for each file and sub-directory in
path. Just like listdir, the '.' and '..' pseudo-directories are
skipped, and the entries are yielded in system-dependent order. Each
DirEntry object has the following attributes and methods:

-   name: the entry's filename, relative to the scandir path argument
    (corresponds to the return values of os.listdir)
-   path: the entry's full path name (not necessarily an absolute path)
    -- the equivalent of os.path.join(scandir_path, entry.name)
-   inode(): return the inode number of the entry. The result is cached
    on the DirEntry object, use
    os.stat(entry.path, follow_symlinks=False).st_ino to fetch
    up-to-date information. On Unix, no system call is required.
-   is_dir(*, follow_symlinks=True): similar to pathlib.Path.is_dir(),
    but the return value is cached on the DirEntry object; doesn't
    require a system call in most cases; don't follow symbolic links if
    follow_symlinks is False
-   is_file(*, follow_symlinks=True): similar to pathlib.Path.is_file(),
    but the return value is cached on the DirEntry object; doesn't
    require a system call in most cases; don't follow symbolic links if
    follow_symlinks is False
-   is_symlink(): similar to pathlib.Path.is_symlink(), but the return
    value is cached on the DirEntry object; doesn't require a system
    call in most cases
-   stat(*, follow_symlinks=True): like os.stat(), but the return value
    is cached on the DirEntry object; does not require a system call on
    Windows (except for symlinks); don't follow symbolic links (like
    os.lstat()) if follow_symlinks is False

All methods may perform system calls in some cases and therefore
possibly raise OSError -- see the "Notes on exception handling" section
for more details.

The DirEntry attribute and method names were chosen to be the same as
those in the new pathlib module where possible, for consistency. The
only difference in functionality is that the DirEntry methods cache
their values on the entry object after the first call.

Like the other functions in the os module, scandir() accepts either a
bytes or str object for the path parameter, and returns the
DirEntry.name and DirEntry.path attributes with the same type as path.
However, it is strongly recommended to use the str type, as this ensures
cross-platform support for Unicode filenames. (On Windows, bytes
filenames have been deprecated since Python 3.3).

os.walk()

As part of this proposal, os.walk() will also be modified to use
scandir() rather than listdir() and os.path.isdir(). This will increase
the speed of os.walk() very significantly (as mentioned above, by 2-20
times, depending on the system).

Examples

First, a very simple example of scandir() showing use of the
DirEntry.name attribute and the DirEntry.is_dir() method:

    def subdirs(path):
        """Yield directory names not starting with '.' under given path."""
        for entry in os.scandir(path):
            if not entry.name.startswith('.') and entry.is_dir():
                yield entry.name

This subdirs() function will be significantly faster with scandir than
os.listdir() and os.path.isdir() on both Windows and POSIX systems,
especially on medium-sized or large directories.

Or, for getting the total size of files in a directory tree, showing use
of the DirEntry.stat() method and DirEntry.path attribute:

    def get_tree_size(path):
        """Return total size of files in given path and subdirs."""
        total = 0
        for entry in os.scandir(path):
            if entry.is_dir(follow_symlinks=False):
                total += get_tree_size(entry.path)
            else:
                total += entry.stat(follow_symlinks=False).st_size
        return total

This also shows the use of the follow_symlinks parameter to is_dir() --
in a recursive function like this, we probably don't want to follow
links. (To properly follow links in a recursive function like this we'd
want special handling for the case where following a symlink leads to a
recursive loop.)

Note that get_tree_size() will get a huge speed boost on Windows,
because no extra stat call are needed, but on POSIX systems the size
information is not returned by the directory iteration functions, so
this function won't gain anything there.

Notes on caching

The DirEntry objects are relatively dumb -- the name and path attributes
are obviously always cached, and the is_X and stat methods cache their
values (immediately on Windows via FindNextFile, and on first use on
POSIX systems via a stat system call) and never refetch from the system.

For this reason, DirEntry objects are intended to be used and thrown
away after iteration, not stored in long-lived data structured and the
methods called again and again.

If developers want "refresh" behaviour (for example, for watching a
file's size change), they can simply use pathlib.Path objects, or call
the regular os.stat() or os.path.getsize() functions which get fresh
data from the operating system every call.

Notes on exception handling

DirEntry.is_X() and DirEntry.stat() are explicitly methods rather than
attributes or properties, to make it clear that they may not be cheap
operations (although they often are), and they may do a system call. As
a result, these methods may raise OSError.

For example, DirEntry.stat() will always make a system call on
POSIX-based systems, and the DirEntry.is_X() methods will make a stat()
system call on such systems if readdir() does not support d_type or
returns a d_type with a value of DT_UNKNOWN, which can occur under
certain conditions or on certain file systems.

Often this does not matter -- for example, os.walk() as defined in the
standard library only catches errors around the listdir() calls.

Also, because the exception-raising behaviour of the DirEntry.is_X
methods matches that of pathlib -- which only raises OSError in the case
of permissions or other fatal errors, but returns False if the path
doesn't exist or is a broken symlink -- it's often not necessary to
catch errors around the is_X() calls.

However, when a user requires fine-grained error handling, it may be
desirable to catch OSError around all method calls and handle as
appropriate.

For example, below is a version of the get_tree_size() example shown
above, but with fine-grained error handling added:

    def get_tree_size(path):
        """Return total size of files in path and subdirs. If
        is_dir() or stat() fails, print an error message to stderr
        and assume zero size (for example, file has been deleted).
        """
        total = 0
        for entry in os.scandir(path):
            try:
                is_dir = entry.is_dir(follow_symlinks=False)
            except OSError as error:
                print('Error calling is_dir():', error, file=sys.stderr)
                continue
            if is_dir:
                total += get_tree_size(entry.path)
            else:
                try:
                    total += entry.stat(follow_symlinks=False).st_size
                except OSError as error:
                    print('Error calling stat():', error, file=sys.stderr)
        return total

Support

The scandir module on GitHub has been forked and used quite a bit (see
"Use in the wild" in this PEP), but there's also been a fair bit of
direct support for a scandir-like function from core developers and
others on the python-dev and python-ideas mailing lists. A sampling:

-   python-dev: a good number of +1's and very few negatives for scandir
    and PEP 471 on this June 2014 python-dev thread
-   Alyssa Coghlan, a core Python developer: "I've had the local Red Hat
    release engineering team express their displeasure at having to stat
    every file in a network mounted directory tree for info that is
    present in the dirent structure, so a definite +1 to os.scandir from
    me, so long as it makes that info available." [source1]
-   Tim Golden, a core Python developer, supports scandir enough to have
    spent time refactoring and significantly improving scandir's C
    extension module. [source2]
-   Christian Heimes, a core Python developer: "+1 for something like
    yielddir()" [source3] and "Indeed! I'd like to see the feature in
    3.4 so I can remove my own hack from our code base." [source4]
-   Gregory P. Smith, a core Python developer: "As 3.4beta1 happens
    tonight, this isn't going to make 3.4 so i'm bumping this to 3.5. I
    really like the proposed design outlined above." [source5]
-   Guido van Rossum on the possibility of adding scandir to Python 3.5
    (as it was too late for 3.4): "The ship has likewise sailed for
    adding scandir() (whether to os or pathlib). By all means experiment
    and get it ready for consideration for 3.5, but I don't want to add
    it to 3.4." [source6]

Support for this PEP itself (meta-support?) was given by Alyssa (Nick)
Coghlan on python-dev: "A PEP reviewing all this for 3.5 and proposing a
specific os.scandir API would be a good thing." [source7]

Use in the wild

To date, the scandir implementation is definitely useful, but has been
clearly marked "beta", so it's uncertain how much use of it there is in
the wild. Ben Hoyt has had several reports from people using it. For
example:

-   Chris F: "I am processing some pretty large directories and was half
    expecting to have to modify getdents. So thanks for saving me the
    effort." [via personal email]
-   bschollnick: "I wanted to let you know about this, since I am using
    Scandir as a building block for this code. Here's a good example of
    scandir making a radical performance improvement over os.listdir."
    [source8]
-   Avram L: "I'm testing our scandir for a project I'm working on.
    Seems pretty solid, so first thing, just want to say nice work!"
    [via personal email]
-   Matt Z: "I used scandir to dump the contents of a network dir in
    under 15 seconds. 13 root dirs, 60,000 files in the structure. This
    will replace some old VBA code embedded in a spreadsheet that was
    taking 15-20 minutes to do the exact same thing." [via personal
    email]

Others have requested a PyPI package for it, which has been created. See
PyPI package.

GitHub stats don't mean too much, but scandir does have several
watchers, issues, forks, etc. Here's the run-down as of the stats as of
July 7, 2014:

-   Watchers: 17
-   Stars: 57
-   Forks: 20
-   Issues: 4 open, 26 closed

Also, because this PEP will increase the speed of os.walk()
significantly, there are thousands of developers and scripts, and a lot
of production code, that would benefit from it. For example, on GitHub,
there are almost as many uses of os.walk (194,000) as there are of
os.mkdir (230,000).

Rejected ideas

Naming

The only other real contender for this function's name was iterdir().
However, iterX() functions in Python (mostly found in Python 2) tend to
be simple iterator equivalents of their non-iterator counterparts. For
example, dict.iterkeys() is just an iterator version of dict.keys(), but
the objects returned are identical. In scandir()'s case, however, the
return values are quite different objects (DirEntry objects vs filename
strings), so this should probably be reflected by a difference in name
-- hence scandir().

See some relevant discussion on python-dev.

Wildcard support

FindFirstFile/FindNextFile on Windows support passing a "wildcard" like
*.jpg, so at first folks (this PEP's author included) felt it would be a
good idea to include a windows_wildcard keyword argument to the scandir
function so users could pass this in.

However, on further thought and discussion it was decided that this
would be bad idea, unless it could be made cross-platform (a pattern
keyword argument or similar). This seems easy enough at first -- just
use the OS wildcard support on Windows, and something like fnmatch or re
afterwards on POSIX-based systems.

Unfortunately the exact Windows wildcard matching rules aren't really
documented anywhere by Microsoft, and they're quite quirky (see this
blog post), meaning it's very problematic to emulate using fnmatch or
regexes.

So the consensus was that Windows wildcard support was a bad idea. It
would be possible to add at a later date if there's a cross-platform way
to achieve it, but not for the initial version.

Read more on the this Nov 2012 python-ideas thread and this June 2014
python-dev thread on PEP 471.

Methods not following symlinks by default

There was much debate on python-dev (see messages in this thread) over
whether the DirEntry methods should follow symbolic links or not (when
the is_X() methods had no follow_symlinks parameter).

Initially they did not (see previous versions of this PEP and the
scandir.py module), but Victor Stinner made a pretty compelling case on
python-dev that following symlinks by default is a better idea, because:

-   following links is usually what you want (in 92% of cases in the
    standard library, functions using os.listdir() and os.path.isdir()
    do follow symlinks)
-   that's the precedent set by the similar functions os.path.isdir()
    and pathlib.Path.is_dir(), so to do otherwise would be confusing
-   with the non-link-following approach, if you wanted to follow links
    you'd have to say something like
    if (entry.is_symlink() and os.path.isdir(entry.path)) or entry.is_dir(),
    which is clumsy

As a case in point that shows the non-symlink-following version is error
prone, this PEP's author had a bug caused by getting this exact test
wrong in his initial implementation of scandir.walk() in scandir.py (see
Issue #4 here).

In the end there was not total agreement that the methods should follow
symlinks, but there was basic consensus among the most involved
participants, and this PEP's author believes that the above case is
strong enough to warrant following symlinks by default.

In addition, it's straightforward to call the relevant methods with
follow_symlinks=False if the other behaviour is desired.

DirEntry attributes being properties

In some ways it would be nicer for the DirEntry is_X() and stat() to be
properties instead of methods, to indicate they're very cheap or free.
However, this isn't quite the case, as stat() will require an OS call on
POSIX-based systems but not on Windows. Even is_dir() and friends may
perform an OS call on POSIX-based systems if the dirent.d_type value is
DT_UNKNOWN (on certain file systems).

Also, people would expect the attribute access entry.is_dir to only ever
raise AttributeError, not OSError in the case it makes a system call
under the covers. Calling code would have to have a try/except around
what looks like a simple attribute access, and so it's much better to
make them methods.

See this May 2013 python-dev thread where this PEP author makes this
case and there's agreement from a core developers.

DirEntry fields being "static" attribute-only objects

In this July 2014 python-dev message, Paul Moore suggested a solution
that was a "thin wrapper round the OS feature", where the DirEntry
object had only static attributes: name, path, and is_X, with the st_X
attributes only present on Windows. The idea was to use this simpler,
lower-level function as a building block for higher-level functions.

At first there was general agreement that simplifying in this way was a
good thing. However, there were two problems with this approach. First,
the assumption is the is_dir and similar attributes are always present
on POSIX, which isn't the case (if d_type is not present or is
DT_UNKNOWN). Second, it's a much harder-to-use API in practice, as even
the is_dir attributes aren't always present on POSIX, and would need to
be tested with hasattr() and then os.stat() called if they weren't
present.

See this July 2014 python-dev response from this PEP's author detailing
why this option is a non-ideal solution, and the subsequent reply from
Paul Moore voicing agreement.

DirEntry fields being static with an ensure_lstat option

Another seemingly simpler and attractive option was suggested by Alyssa
Coghlan in this June 2014 python-dev message: make DirEntry.is_X and
DirEntry.lstat_result properties, and populate DirEntry.lstat_result at
iteration time, but only if the new argument ensure_lstat=True was
specified on the scandir() call.

This does have the advantage over the above in that you can easily get
the stat result from scandir() if you need it. However, it has the
serious disadvantage that fine-grained error handling is messy, because
stat() will be called (and hence potentially raise OSError) during
iteration, leading to a rather ugly, hand-made iteration loop:

    it = os.scandir(path)
    while True:
        try:
            entry = next(it)
        except OSError as error:
            handle_error(path, error)
        except StopIteration:
            break

Or it means that scandir() would have to accept an onerror argument -- a
function to call when stat() errors occur during iteration. This seems
to this PEP's author neither as direct nor as Pythonic as try/except
around a DirEntry.stat() call.

Another drawback is that os.scandir() is written to make code faster.
Always calling os.lstat() on POSIX would not bring any speedup. In most
cases, you don't need the full stat_result object -- the is_X() methods
are enough and this information is already known.

See Ben Hoyt's July 2014 reply to the discussion summarizing this and
detailing why he thinks the original PEP 471 proposal is "the right one"
after all.

Return values being (name, stat_result) two-tuples

Initially this PEP's author proposed this concept as a function called
iterdir_stat() which yielded two-tuples of (name, stat_result). This
does have the advantage that there are no new types introduced. However,
the stat_result is only partially filled on POSIX-based systems (most
fields set to None and other quirks), so they're not really stat_result
objects at all, and this would have to be thoroughly documented as
different from os.stat().

Also, Python has good support for proper objects with attributes and
methods, which makes for a saner and simpler API than two-tuples. It
also makes the DirEntry objects more extensible and future-proof as
operating systems add functionality and we want to include this in
DirEntry.

See also some previous discussion:

-   May 2013 python-dev thread where Alyssa Coghlan makes the original
    case for a DirEntry-style object.
-   June 2014 python-dev thread where Alyssa Coghlan makes (another)
    good case against the two-tuple approach.

Return values being overloaded stat_result objects

Another alternative discussed was making the return values to be
overloaded stat_result objects with name and path attributes. However,
apart from this being a strange (and strained!) kind of overloading,
this has the same problems mentioned above --most of the stat_result
information is not fetched by readdir() on POSIX systems, only (part of)
the st_mode value.

Return values being pathlib.Path objects

With Antoine Pitrou's new standard library pathlib module, it at first
seems like a great idea for scandir() to return instances of
pathlib.Path. However, pathlib.Path's is_X() and stat() functions are
explicitly not cached, whereas scandir has to cache them by design,
because it's (often) returning values from the original directory
iteration system call.

And if the pathlib.Path instances returned by scandir cached stat
values, but the ordinary pathlib.Path objects explicitly don't, that
would be more than a little confusing.

Guido van Rossum explicitly rejected pathlib.Path caching stat in the
context of scandir here, making pathlib.Path objects a bad choice for
scandir return values.

Possible improvements

There are many possible improvements one could make to scandir, but here
is a short list of some this PEP's author has in mind:

-   scandir could potentially be further sped up by calling readdir /
    FindNextFile say 50 times per Py_BEGIN_ALLOW_THREADS block so that
    it stays in the C extension module for longer, and may be somewhat
    faster as a result. This approach hasn't been tested, but was
    suggested by on Issue 11406 by Antoine Pitrou. [source9]
-   scandir could use a free list to avoid the cost of memory allocation
    for each iteration -- a short free list of 10 or maybe even 1 may
    help. Suggested by Victor Stinner on a python-dev thread on June 27.

Previous discussion

-   Original November 2012 thread Ben Hoyt started on python-ideas about
    speeding up os.walk()
-   Python Issue 11406, which includes the original proposal for a
    scandir-like function
-   Further May 2013 thread Ben Hoyt started on python-dev that refined
    the scandir() API, including Alyssa Coghlan's suggestion of scandir
    yielding DirEntry-like objects
-   November 2013 thread Ben Hoyt started on python-dev to discuss the
    interaction between scandir and the new pathlib module
-   June 2014 thread Ben Hoyt started on python-dev to discuss the first
    version of this PEP, with extensive discussion about the API
-   First July 2014 thread Ben Hoyt started on python-dev to discuss his
    updates to PEP 471
-   Second July 2014 thread Ben Hoyt started on python-dev to discuss
    the remaining decisions needed to finalize PEP 471, specifically
    whether the DirEntry methods should follow symlinks by default
-   Question on StackOverflow about why os.walk() is slow and pointers
    on how to fix it (this inspired the author of this PEP early on)
-   BetterWalk, this PEP's author's previous attempt at this, on which
    the scandir code is based

Copyright

This document has been placed in the public domain.



  Local Variables: mode: indented-text indent-tabs-mode: nil
  sentence-end-double-space: t fill-column: 70 coding: utf-8 End: