PEP: 402 Title: Simplified Package Layout and Partitioning Author:
Phillip J. Eby Status: Rejected Type: Standards Track Topic: Packaging
Content-Type: text/x-rst Created: 12-Jul-2011 Python-Version: 3.3
Post-History: 20-Jul-2011 Replaces: 382

Rejection Notice

On the first day of sprints at US PyCon 2012 we had a long and fruitful
discussion about PEP 382 and PEP 402. We ended up rejecting both but a
new PEP will be written to carry on in the spirit of PEP 402. Martin von
Löwis wrote up a summary:[1].

Abstract

This PEP proposes an enhancement to Python's package importing to:

-   Surprise users of other languages less,
-   Make it easier to convert a module into a package, and
-   Support dividing packages into separately installed components (ala
    "namespace packages", as described in PEP 382)

The proposed enhancements do not change the semantics of any
currently-importable directory layouts, but make it possible for
packages to use a simplified directory layout (that is not importable
currently).

However, the proposed changes do NOT add any performance overhead to the
importing of existing modules or packages, and performance for the new
directory layout should be about the same as that of previous "namespace
package" solutions (such as pkgutil.extend_path()).

The Problem

  "Most packages are like modules. Their contents are highly
  interdependent and can't be pulled apart. [However,] some packages
  exist to provide a separate namespace. ... It should be possible to
  distribute sub-packages or submodules of these [namespace packages]
  independently."

  -- Jim Fulton, shortly before the release of Python 2.3[2]

When new users come to Python from other languages, they are often
confused by Python's package import semantics. At Google, for example,
Guido received complaints from "a large crowd with pitchforks"[3] that
the requirement for packages to contain an __init__ module was a
"misfeature", and should be dropped.

In addition, users coming from languages like Java or Perl are sometimes
confused by a difference in Python's import path searching.

In most other languages that have a similar path mechanism to Python's
sys.path, a package is merely a namespace that contains modules or
classes, and can thus be spread across multiple directories in the
language's path. In Perl, for instance, a Foo::Bar module will be
searched for in Foo/ subdirectories all along the module include path,
not just in the first such subdirectory found.

Worse, this is not just a problem for new users: it prevents anyone from
easily splitting a package into separately-installable components. In
Perl terms, it would be as if every possible Net:: module on CPAN had to
be bundled up and shipped in a single tarball!

For that reason, various workarounds for this latter limitation exist,
circulated under the term "namespace packages". The Python standard
library has provided one such workaround since Python 2.3 (via the
pkgutil.extend_path() function), and the "setuptools" package provides
another (via pkg_resources.declare_namespace()).

The workarounds themselves, however, fall prey to a third issue with
Python's way of laying out packages in the filesystem.

Because a package must contain an __init__ module, any attempt to
distribute modules for that package must necessarily include that
__init__ module, if those modules are to be importable.

However, the very fact that each distribution of modules for a package
must contain this (duplicated) __init__ module, means that OS vendors
who package up these module distributions must somehow handle the
conflict caused by several module distributions installing that __init__
module to the same location in the filesystem.

This led to the proposing of PEP 382 ("Namespace Packages") - a way to
signal to Python's import machinery that a directory was importable,
using unique filenames per module distribution.

However, there was more than one downside to this approach. Performance
for all import operations would be affected, and the process of
designating a package became even more complex. New terminology had to
be invented to explain the solution, and so on.

As terminology discussions continued on the Import-SIG, it soon became
apparent that the main reason it was so difficult to explain the
concepts related to "namespace packages" was because Python's current
way of handling packages is somewhat underpowered, when compared to
other languages.

That is, in other popular languages with package systems, no special
term is needed to describe "namespace packages", because all packages
generally behave in the desired fashion.

Rather than being an isolated single directory with a special marker
module (as in Python), packages in other languages are typically just
the union of appropriately-named directories across the entire import or
inclusion path.

In Perl, for example, the module Foo is always found in a Foo.pm file,
and a module Foo::Bar is always found in a Foo/Bar.pm file. (In other
words, there is One Obvious Way to find the location of a particular
module.)

This is because Perl considers a module to be different from a package:
the package is purely a namespace in which other modules may reside, and
is only coincidentally the name of a module as well.

In current versions of Python, however, the module and the package are
more tightly bound together. Foo is always a module -- whether it is
found in Foo.py or Foo/__init__.py -- and it is tightly linked to its
submodules (if any), which must reside in the exact same directory where
the __init__.py was found.

On the positive side, this design choice means that a package is quite
self-contained, and can be installed, copied, etc. as a unit just by
performing an operation on the package's root directory.

On the negative side, however, it is non-intuitive for beginners, and
requires a more complex step to turn a module into a package. If Foo
begins its life as Foo.py, then it must be moved and renamed to
Foo/__init__.py.

Conversely, if you intend to create a Foo.Bar module from the start, but
have no particular module contents to put in Foo itself, then you have
to create an empty and seemingly-irrelevant Foo/__init__.py file, just
so that Foo.Bar can be imported.

(And these issues don't just confuse newcomers to the language, either:
they annoy many experienced developers as well.)

So, after some discussion on the Import-SIG, this PEP was created as an
alternative to PEP 382, in an attempt to solve all of the above
problems, not just the "namespace package" use cases.

And, as a delightful side effect, the solution proposed in this PEP does
not affect the import performance of ordinary modules or self-contained
(i.e. __init__-based) packages.

The Solution

In the past, various proposals have been made to allow more intuitive
approaches to package directory layout. However, most of them failed
because of an apparent backward-compatibility problem.

That is, if the requirement for an __init__ module were simply dropped,
it would open up the possibility for a directory named, say, string on
sys.path, to block importing of the standard library string module.

Paradoxically, however, the failure of this approach does not arise from
the elimination of the __init__ requirement!

Rather, the failure arises because the underlying approach takes for
granted that a package is just ONE thing, instead of two.

In truth, a package comprises two separate, but related entities: a
module (with its own, optional contents), and a namespace where other
modules or packages can be found.

In current versions of Python, however, the module part (found in
__init__) and the namespace for submodule imports (represented by the
__path__ attribute) are both initialized at the same time, when the
package is first imported.

And, if you assume this is the only way to initialize these two things,
then there is no way to drop the need for an __init__ module, while
still being backwards-compatible with existing directory layouts.

After all, as soon as you encounter a directory on sys.path matching the
desired name, that means you've "found" the package, and must stop
searching, right?

Well, not quite.

A Thought Experiment

Let's hop into the time machine for a moment, and pretend we're back in
the early 1990s, shortly before Python packages and __init__.py have
been invented. But, imagine that we are familiar with Perl-like package
imports, and we want to implement a similar system in Python.

We'd still have Python's module imports to build on, so we could
certainly conceive of having Foo.py as a parent Foo module for a Foo
package. But how would we implement submodule and subpackage imports?

Well, if we didn't have the idea of __path__ attributes yet, we'd
probably just search sys.path looking for Foo/Bar.py.

But we'd only do it when someone actually tried to import Foo.Bar.

NOT when they imported Foo.

And that lets us get rid of the backwards-compatibility problem of
dropping the __init__ requirement, back here in 2011.

How?

Well, when we import Foo, we're not even looking for Foo/ directories on
sys.path, because we don't care yet. The only point at which we care, is
the point when somebody tries to actually import a submodule or
subpackage of Foo.

That means that if Foo is a standard library module (for example), and I
happen to have a Foo directory on sys.path (without an __init__.py, of
course), then nothing breaks. The Foo module is still just a module, and
it's still imported normally.

Self-Contained vs. "Virtual" Packages

Of course, in today's Python, trying to import Foo.Bar will fail if Foo
is just a Foo.py module (and thus lacks a __path__ attribute).

So, this PEP proposes to dynamically create a __path__, in the case
where one is missing.

That is, if I try to import Foo.Bar the proposed change to the import
machinery will notice that the Foo module lacks a __path__, and will
therefore try to build one before proceeding.

And it will do this by making a list of all the existing Foo/
subdirectories of the directories listed in sys.path.

If the list is empty, the import will fail with ImportError, just like
today. But if the list is not empty, then it is saved in a new
Foo.__path__ attribute, making the module a "virtual package".

That is, because it now has a valid __path__, we can proceed to import
submodules or subpackages in the normal way.

Now, notice that this change does not affect "classic", self-contained
packages that have an __init__ module in them. Such packages already
have a __path__ attribute (initialized at import time) so the import
machinery won't try to create another one later.

This means that (for example) the standard library email package will
not be affected in any way by you having a bunch of unrelated
directories named email on sys.path. (Even if they contain *.py files.)

But it does mean that if you want to turn your Foo module into a Foo
package, all you have to do is add a Foo/ directory somewhere on
sys.path, and start adding modules to it.

But what if you only want a "namespace package"? That is, a package that
is only a namespace for various separately-distributed submodules and
subpackages?

For example, if you're Zope Corporation, distributing dozens of separate
tools like zc.buildout, each in packages under the zc namespace, you
don't want to have to make and include an empty zc.py in every tool you
ship. (And, if you're a Linux or other OS vendor, you don't want to deal
with the package installation conflicts created by trying to install ten
copies of zc.py to the same location!)

No problem. All we have to do is make one more minor tweak to the import
process: if the "classic" import process fails to find a self-contained
module or package (e.g., if import zc fails to find a zc.py or
zc/__init__.py), then we once more try to build a __path__ by searching
for all the zc/ directories on sys.path, and putting them in a list.

If this list is empty, we raise ImportError. But if it's non-empty, we
create an empty zc module, and put the list in zc.__path__.
Congratulations: zc is now a namespace-only, "pure virtual" package! It
has no module contents, but you can still import submodules and
subpackages from it, regardless of where they're located on sys.path.

(By the way, both of these additions to the import protocol (i.e. the
dynamically-added __path__, and dynamically-created modules) apply
recursively to child packages, using the parent package's __path__ in
place of sys.path as a basis for generating a child __path__. This means
that self-contained and virtual packages can contain each other without
limitation, with the caveat that if you put a virtual package inside a
self-contained one, it's gonna have a really short __path__!)

Backwards Compatibility and Performance

Notice that these two changes only affect import operations that today
would result in ImportError. As a result, the performance of imports
that do not involve virtual packages is unaffected, and potential
backward compatibility issues are very restricted.

Today, if you try to import submodules or subpackages from a module with
no __path__, it's an immediate error. And of course, if you don't have a
zc.py or zc/__init__.py somewhere on sys.path today, import zc would
likewise fail.

Thus, the only potential backwards-compatibility issues are:

1.  Tools that expect package directories to have an __init__ module,
    that expect directories without an __init__ module to be
    unimportable, or that expect __path__ attributes to be static, will
    not recognize virtual packages as packages.

    (In practice, this just means that tools will need updating to
    support virtual packages, e.g. by using pkgutil.walk_modules()
    instead of using hardcoded filesystem searches.)

2.  Code that expects certain imports to fail may now do something
    unexpected. This should be fairly rare in practice, as most sane,
    non-test code does not import things that are expected not to exist!

The biggest likely exception to the above would be when a piece of code
tries to check whether some package is installed by importing it. If
this is done only by importing a top-level module (i.e., not checking
for a __version__ or some other attribute), and there is a directory of
the same name as the sought-for package on sys.path somewhere, and the
package is not actually installed, then such code could be fooled into
thinking a package is installed that really isn't.

For example, suppose someone writes a script (datagen.py) containing the
following code:

    try:
        import json
    except ImportError:
        import simplejson as json

And runs it in a directory laid out like this:

    datagen.py
    json/
        foo.js
        bar.js

If import json succeeded due to the mere presence of the json/
subdirectory, the code would incorrectly believe that the json module
was available, and proceed to fail with an error.

However, we can prevent corner cases like these from arising, simply by
making one small change to the algorithm presented so far. Instead of
allowing you to import a "pure virtual" package (like zc), we allow only
importing of the contents of virtual packages.

That is, a statement like import zc should raise ImportError if there is
no zc.py or zc/__init__.py on sys.path. But, doing import zc.buildout
should still succeed, as long as there's a zc/buildout.py or
zc/buildout/__init__.py on sys.path.

In other words, we don't allow pure virtual packages to be imported
directly, only modules and self-contained packages. (This is an
acceptable limitation, because there is no functional value to importing
such a package by itself. After all, the module object will have no
contents until you import at least one of its subpackages or
submodules!)

Once zc.buildout has been successfully imported, though, there will be a
zc module in sys.modules, and trying to import it will of course
succeed. We are only preventing an initial import from succeeding, in
order to prevent false-positive import successes when clashing
subdirectories are present on sys.path.

So, with this slight change, the datagen.py example above will work
correctly. When it does import json, the mere presence of a json/
directory will simply not affect the import process at all, even if it
contains .py files. The json/ directory will still only be searched in
the case where an import like import json.converter is attempted.

Meanwhile, tools that expect to locate packages and modules by walking a
directory tree can be updated to use the existing pkgutil.walk_modules()
API, and tools that need to inspect packages in memory should use the
other APIs described in the Standard Library Changes/Additions section
below.

Specification

A change is made to the existing import process, when importing names
containing at least one . -- that is, imports of modules that have a
parent package.

Specifically, if the parent package does not exist, or exists but lacks
a __path__ attribute, an attempt is first made to create a "virtual
path" for the parent package (following the algorithm described in the
section on virtual paths, below).

If the computed "virtual path" is empty, an ImportError results, just as
it would today. However, if a non-empty virtual path is obtained, the
normal import of the submodule or subpackage proceeds, using that
virtual path to find the submodule or subpackage. (Just as it would have
with the parent's __path__, if the parent package had existed and had a
__path__.)

When a submodule or subpackage is found (but not yet loaded), the parent
package is created and added to sys.modules (if it didn't exist before),
and its __path__ is set to the computed virtual path (if it wasn't
already set).

In this way, when the actual loading of the submodule or subpackage
occurs, it will see a parent package existing, and any relative imports
will work correctly. However, if no submodule or subpackage exists, then
the parent package will not be created, nor will a standalone module be
converted into a package (by the addition of a spurious __path__
attribute).

Note, by the way, that this change must be applied recursively: that is,
if foo and foo.bar are pure virtual packages, then import foo.bar.baz
must wait until foo.bar.baz is found before creating module objects for
both foo and foo.bar, and then create both of them together, properly
setting the foo module's .bar attribute to point to the foo.bar module.

In this way, pure virtual packages are never directly importable: an
import foo or import foo.bar by itself will fail, and the corresponding
modules will not appear in sys.modules until they are needed to point to
a successfully imported submodule or self-contained subpackage.

Virtual Paths

A virtual path is created by obtaining a PEP 302 "importer" object for
each of the path entries found in sys.path (for a top-level module) or
the parent __path__ (for a submodule).

(Note: because sys.meta_path importers are not associated with sys.path
or __path__ entry strings, such importers do not participate in this
process.)

Each importer is checked for a get_subpath() method, and if present, the
method is called with the full name of the module/package the path is
being constructed for. The return value is either a string representing
a subdirectory for the requested package, or None if no such
subdirectory exists.

The strings returned by the importers are added to the path list being
built, in the same order as they are found. (None values and missing
get_subpath() methods are simply skipped.)

The resulting list (whether empty or not) is then stored in a
sys.virtual_package_paths dictionary, keyed by module name.

This dictionary has two purposes. First, it serves as a cache, in the
event that more than one attempt is made to import a submodule of a
virtual package.

Second, and more importantly, the dictionary can be used by code that
extends sys.path at runtime to update imported packages' __path__
attributes accordingly. (See Standard Library Changes/Additions below
for more details.)

In Python code, the virtual path construction algorithm would look
something like this:

    def get_virtual_path(modulename, parent_path=None):

        if modulename in sys.virtual_package_paths:
            return sys.virtual_package_paths[modulename]

        if parent_path is None:
            parent_path = sys.path

        path = []

        for entry in parent_path:
            # Obtain a PEP 302 importer object - see pkgutil module
            importer = pkgutil.get_importer(entry)

            if hasattr(importer, 'get_subpath'):
                subpath = importer.get_subpath(modulename)
                if subpath is not None:
                    path.append(subpath)

        sys.virtual_package_paths[modulename] = path
        return path

And a function like this one should be exposed in the standard library
as e.g. imp.get_virtual_path(), so that people creating __import__
replacements or sys.meta_path hooks can reuse it.

Standard Library Changes/Additions

The pkgutil module should be updated to handle this specification
appropriately, including any necessary changes to extend_path(),
iter_modules(), etc.

Specifically the proposed changes and additions to pkgutil are:

-   A new extend_virtual_paths(path_entry) function, to extend existing,
    already-imported virtual packages' __path__ attributes to include
    any portions found in a new sys.path entry. This function should be
    called by applications extending sys.path at runtime, e.g. when
    adding a plugin directory or an egg to the path.

    The implementation of this function does a simple top-down traversal
    of sys.virtual_package_paths, and performs any necessary
    get_subpath() calls to identify what path entries need to be added
    to the virtual path for that package, given that path_entry has been
    added to sys.path. (Or, in the case of sub-packages, adding a
    derived subpath entry, based on their parent package's virtual
    path.)

    (Note: this function must update both the path values in
    sys.virtual_package_paths as well as the __path__ attributes of any
    corresponding modules in sys.modules, even though in the common case
    they will both be the same list object.)

-   A new iter_virtual_packages(parent='') function to allow top-down
    traversal of virtual packages from sys.virtual_package_paths, by
    yielding the child virtual packages of parent. For example, calling
    iter_virtual_packages("zope") might yield zope.app and zope.products
    (if they are virtual packages listed in sys.virtual_package_paths),
    but not zope.foo.bar. (This function is needed to implement
    extend_virtual_paths(), but is also potentially useful for other
    code that needs to inspect imported virtual packages.)

-   ImpImporter.iter_modules() should be changed to also detect and
    yield the names of modules found in virtual packages.

In addition to the above changes, the zipimport importer should have its
iter_modules() implementation similarly changed. (Note: current versions
of Python implement this via a shim in pkgutil, so technically this is
also a change to pkgutil.)

Last, but not least, the imp module (or importlib, if appropriate)
should expose the algorithm described in the virtual paths section
above, as a get_virtual_path(modulename, parent_path=None) function, so
that creators of __import__ replacements can use it.

Implementation Notes

For users, developers, and distributors of virtual packages:

-   While virtual packages are easy to set up and use, there is still a
    time and place for using self-contained packages. While it's not
    strictly necessary, adding an __init__ module to your self-contained
    packages lets users of the package (and Python itself) know that all
    of the package's code will be found in that single subdirectory. In
    addition, it lets you define __all__, expose a public API, provide a
    package-level docstring, and do other things that make more sense
    for a self-contained project than for a mere "namespace" package.

-   sys.virtual_package_paths is allowed to contain entries for
    non-existent or not-yet-imported package names; code that uses its
    contents should not assume that every key in this dictionary is also
    present in sys.modules or that importing the name will necessarily
    succeed.

-   If you are changing a currently self-contained package into a
    virtual one, it's important to note that you can no longer use its
    __file__ attribute to locate data files stored in a package
    directory. Instead, you must search __path__ or use the __file__ of
    a submodule adjacent to the desired files, or of a self-contained
    subpackage that contains the desired files.

    (Note: this caveat is already true for existing users of "namespace
    packages" today. That is, it is an inherent result of being able to
    partition a package, that you must know which partition the desired
    data file lives in. We mention it here simply so that new users
    converting from self-contained to virtual packages will also be
    aware of it.)

-   XXX what is the __file__ of a "pure virtual" package? None? Some
    arbitrary string? The path of the first directory with a trailing
    separator? No matter what we put, some code is going to break, but
    the last choice might allow some code to accidentally work. Is that
    good or bad?

For those implementing PEP 302 importer objects:

-   Importers that support the iter_modules() method (used by pkgutil to
    locate importable modules and packages) and want to add virtual
    package support should modify their iter_modules() method so that it
    discovers and lists virtual packages as well as standard modules and
    packages. To do this, the importer should simply list all immediate
    subdirectory names in its jurisdiction that are valid Python
    identifiers.

    XXX This might list a lot of not-really-packages. Should we require
    importable contents to exist? If so, how deep do we search, and how
    do we prevent e.g. link loops, or traversing onto different
    filesystems, etc.? Ick. Also, if virtual packages are listed, they
    still can't be imported, which is a problem for the way that
    pkgutil.walk_modules() is currently implemented.

-   "Meta" importers (i.e., importers placed on sys.meta_path) do not
    need to implement get_subpath(), because the method is only called
    on importers corresponding to sys.path entries and __path__ entries.
    If a meta importer wishes to support virtual packages, it must do so
    entirely within its own find_module() implementation.

    Unfortunately, it is unlikely that any such implementation will be
    able to merge its package subpaths with those of other meta
    importers or sys.path importers, so the meaning of "supporting
    virtual packages" for a meta importer is currently undefined!

    (However, since the intended use case for meta importers is to
    replace Python's normal import process entirely for some subset of
    modules, and the number of such importers currently implemented is
    quite small, this seems unlikely to be a big issue in practice.)

References

Copyright

This document has been placed in the public domain.

[1] Namespace Packages resolution
(https://mail.python.org/pipermail/import-sig/2012-March/000421.html)

[2] "namespace" vs "module" packages (mailing list thread)
(http://mail.zope.org/pipermail/zope3-dev/2002-December/004251.html)

[3] "Dropping __init__.py requirement for subpackages"
(https://mail.python.org/pipermail/python-dev/2006-April/064400.html)