PEP: 302 Title: New Import Hooks Version: $Revision$ Last-Modified:
$Date$ Author: Just van Rossum <just@letterror.com>, Paul Moore
<p.f.moore@gmail.com> Status: Final Type: Standards Track Content-Type:
text/x-rst Created: 19-Dec-2002 Python-Version: 2.3 Post-History:
19-Dec-2002

Warning

The language reference for import[1] and importlib documentation [2] now
supersede this PEP. This document is no longer updated and provided for
historical purposes only.

Abstract

This PEP proposes to add a new set of import hooks that offer better
customization of the Python import mechanism. Contrary to the current
__import__ hook, a new-style hook can be injected into the existing
scheme, allowing for a finer grained control of how modules are found
and how they are loaded.

Motivation

The only way to customize the import mechanism is currently to override
the built-in __import__ function. However, overriding __import__ has
many problems. To begin with:

-   An __import__ replacement needs to fully reimplement the entire
    import mechanism, or call the original __import__ before or after
    the custom code.
-   It has very complex semantics and responsibilities.
-   __import__ gets called even for modules that are already in
    sys.modules, which is almost never what you want, unless you're
    writing some sort of monitoring tool.

The situation gets worse when you need to extend the import mechanism
from C: it's currently impossible, apart from hacking Python's import.c
or reimplementing much of import.c from scratch.

There is a fairly long history of tools written in Python that allow
extending the import mechanism in various way, based on the __import__
hook. The Standard Library includes two such tools: ihooks.py (by GvR)
and imputil.py[3] (Greg Stein), but perhaps the most famous is iu.py by
Gordon McMillan, available as part of his Installer package. Their
usefulness is somewhat limited because they are written in Python;
bootstrapping issues need to worked around as you can't load the module
containing the hook with the hook itself. So if you want the entire
Standard Library to be loadable from an import hook, the hook must be
written in C.

Use cases

This section lists several existing applications that depend on import
hooks. Among these, a lot of duplicate work was done that could have
been saved if there had been a more flexible import hook at the time.
This PEP should make life a lot easier for similar projects in the
future.

Extending the import mechanism is needed when you want to load modules
that are stored in a non-standard way. Examples include modules that are
bundled together in an archive; byte code that is not stored in a pyc
formatted file; modules that are loaded from a database over a network.

The work on this PEP was partly triggered by the implementation of PEP
273, which adds imports from Zip archives as a built-in feature to
Python. While the PEP itself was widely accepted as a must-have feature,
the implementation left a few things to desire. For one thing it went
through great lengths to integrate itself with import.c, adding lots of
code that was either specific for Zip file imports or not specific to
Zip imports, yet was not generally useful (or even desirable) either.
Yet the PEP 273 implementation can hardly be blamed for this: it is
simply extremely hard to do, given the current state of import.c.

Packaging applications for end users is a typical use case for import
hooks, if not the typical use case. Distributing lots of source or pyc
files around is not always appropriate (let alone a separate Python
installation), so there is a frequent desire to package all needed
modules in a single file. So frequent in fact that multiple solutions
have been implemented over the years.

The oldest one is included with the Python source code: Freeze[4]. It
puts marshalled byte code into static objects in C source code. Freeze's
"import hook" is hard wired into import.c, and has a couple of issues.
Later solutions include Fredrik Lundh's Squeeze, Gordon McMillan's
Installer, and Thomas Heller's py2exe[5]. MacPython ships with a tool
called BuildApplication.

Squeeze, Installer and py2exe use an __import__ based scheme (py2exe
currently uses Installer's iu.py, Squeeze used ihooks.py), MacPython has
two Mac-specific import hooks hard wired into import.c, that are similar
to the Freeze hook. The hooks proposed in this PEP enables us (at least
in theory; it's not a short-term goal) to get rid of the hard coded
hooks in import.c, and would allow the __import__-based tools to get rid
of most of their import.c emulation code.

Before work on the design and implementation of this PEP was started, a
new BuildApplication-like tool for Mac OS X prompted one of the authors
of this PEP (JvR) to expose the table of frozen modules to Python, in
the imp module. The main reason was to be able to use the freeze import
hook (avoiding fancy __import__ support), yet to also be able to supply
a set of modules at runtime. This resulted in issue #642578[6], which
was mysteriously accepted (mostly because nobody seemed to care either
way ;-). Yet it is completely superfluous when this PEP gets accepted,
as it offers a much nicer and general way to do the same thing.

Rationale

While experimenting with alternative implementation ideas to get
built-in Zip import, it was discovered that achieving this is possible
with only a fairly small amount of changes to import.c. This allowed to
factor out the Zip-specific stuff into a new source file, while at the
same time creating a general new import hook scheme: the one you're
reading about now.

An earlier design allowed non-string objects on sys.path. Such an object
would have the necessary methods to handle an import. This has two
disadvantages: 1) it breaks code that assumes all items on sys.path are
strings; 2) it is not compatible with the PYTHONPATH environment
variable. The latter is directly needed for Zip imports. A compromise
came from Jython: allow string subclasses on sys.path, which would then
act as importer objects. This avoids some breakage, and seems to work
well for Jython (where it is used to load modules from .jar files), but
it was perceived as an "ugly hack".

This led to a more elaborate scheme, (mostly copied from McMillan's
iu.py) in which each in a list of candidates is asked whether it can
handle the sys.path item, until one is found that can. This list of
candidates is a new object in the sys module: sys.path_hooks.

Traversing sys.path_hooks for each path item for each new import can be
expensive, so the results are cached in another new object in the sys
module: sys.path_importer_cache. It maps sys.path entries to importer
objects.

To minimize the impact on import.c as well as to avoid adding extra
overhead, it was chosen to not add an explicit hook and importer object
for the existing file system import logic (as iu.py has), but to simply
fall back to the built-in logic if no hook on sys.path_hooks could
handle the path item. If this is the case, a None value is stored in
sys.path_importer_cache, again to avoid repeated lookups. (Later we can
go further and add a real importer object for the built-in mechanism,
for now, the None fallback scheme should suffice.)

A question was raised: what about importers that don't need any entry on
sys.path? (Built-in and frozen modules fall into that category.) Again,
Gordon McMillan to the rescue: iu.py contains a thing he calls the
metapath. In this PEP's implementation, it's a list of importer objects
that is traversed before sys.path. This list is yet another new object
in the sys module: sys.meta_path. Currently, this list is empty by
default, and frozen and built-in module imports are done after
traversing sys.meta_path, but still before sys.path.

Specification part 1: The Importer Protocol

This PEP introduces a new protocol: the "Importer Protocol". It is
important to understand the context in which the protocol operates, so
here is a brief overview of the outer shells of the import mechanism.

When an import statement is encountered, the interpreter looks up the
__import__ function in the built-in name space. __import__ is then
called with four arguments, amongst which are the name of the module
being imported (may be a dotted name) and a reference to the current
global namespace.

The built-in __import__ function (known as PyImport_ImportModuleEx() in
import.c) will then check to see whether the module doing the import is
a package or a submodule of a package. If it is indeed a (submodule of
a) package, it first tries to do the import relative to the package (the
parent package for a submodule). For example, if a package named "spam"
does "import eggs", it will first look for a module named "spam.eggs".
If that fails, the import continues as an absolute import: it will look
for a module named "eggs". Dotted name imports work pretty much the
same: if package "spam" does "import eggs.bacon" (and "spam.eggs" exists
and is itself a package), "spam.eggs.bacon" is tried. If that fails
"eggs.bacon" is tried. (There are more subtleties that are not described
here, but these are not relevant for implementers of the Importer
Protocol.)

Deeper down in the mechanism, a dotted name import is split up by its
components. For "import spam.ham", first an "import spam" is done, and
only when that succeeds is "ham" imported as a submodule of "spam".

The Importer Protocol operates at this level of individual imports. By
the time an importer gets a request for "spam.ham", module "spam" has
already been imported.

The protocol involves two objects: a finder and a loader. A finder
object has a single method:

    finder.find_module(fullname, path=None)

This method will be called with the fully qualified name of the module.
If the finder is installed on sys.meta_path, it will receive a second
argument, which is None for a top-level module, or package.__path__ for
submodules or subpackages[7]. It should return a loader object if the
module was found, or None if it wasn't. If find_module() raises an
exception, it will be propagated to the caller, aborting the import.

A loader object also has one method:

    loader.load_module(fullname)

This method returns the loaded module or raises an exception, preferably
ImportError if an existing exception is not being propagated. If
load_module() is asked to load a module that it cannot, ImportError is
to be raised.

In many cases the finder and loader can be one and the same object:
finder.find_module() would just return self.

The fullname argument of both methods is the fully qualified module
name, for example "spam.eggs.ham". As explained above, when
finder.find_module("spam.eggs.ham") is called, "spam.eggs" has already
been imported and added to sys.modules. However, the find_module()
method isn't necessarily always called during an actual import: meta
tools that analyze import dependencies (such as freeze, Installer or
py2exe) don't actually load modules, so a finder shouldn't depend on the
parent package being available in sys.modules.

The load_module() method has a few responsibilities that it must fulfill
before it runs any code:

-   If there is an existing module object named 'fullname' in
    sys.modules, the loader must use that existing module. (Otherwise,
    the reload() builtin will not work correctly.) If a module named
    'fullname' does not exist in sys.modules, the loader must create a
    new module object and add it to sys.modules.

    Note that the module object must be in sys.modules before the loader
    executes the module code. This is crucial because the module code
    may (directly or indirectly) import itself; adding it to sys.modules
    beforehand prevents unbounded recursion in the worst case and
    multiple loading in the best.

    If the load fails, the loader needs to remove any module it may have
    inserted into sys.modules. If the module was already in sys.modules
    then the loader should leave it alone.

-   The __file__ attribute must be set. This must be a string, but it
    may be a dummy value, for example "<frozen>". The privilege of not
    having a __file__ attribute at all is reserved for built-in modules.

-   The __name__ attribute must be set. If one uses imp.new_module()
    then the attribute is set automatically.

-   If it's a package, the __path__ variable must be set. This must be a
    list, but may be empty if __path__ has no further significance to
    the importer (more on this later).

-   The __loader__ attribute must be set to the loader object. This is
    mostly for introspection and reloading, but can be used for
    importer-specific extras, for example getting data associated with
    an importer.

-   The __package__ attribute must be set (PEP 366).

    If the module is a Python module (as opposed to a built-in module or
    a dynamically loaded extension), it should execute the module's code
    in the module's global name space (module.__dict__).

    Here is a minimal pattern for a load_module() method:

        # Consider using importlib.util.module_for_loader() to handle
        # most of these details for you.
        def load_module(self, fullname):
            code = self.get_code(fullname)
            ispkg = self.is_package(fullname)
            mod = sys.modules.setdefault(fullname, imp.new_module(fullname))
            mod.__file__ = "<%s>" % self.__class__.__name__
            mod.__loader__ = self
            if ispkg:
                mod.__path__ = []
                mod.__package__ = fullname
            else:
                mod.__package__ = fullname.rpartition('.')[0]
            exec(code, mod.__dict__)
            return mod

Specification part 2: Registering Hooks

There are two types of import hooks: Meta hooks and Path hooks. Meta
hooks are called at the start of import processing, before any other
import processing (so that meta hooks can override sys.path processing,
frozen modules, or even built-in modules). To register a meta hook,
simply add the finder object to sys.meta_path (the list of registered
meta hooks).

Path hooks are called as part of sys.path (or package.__path__)
processing, at the point where their associated path item is
encountered. A path hook is registered by adding an importer factory to
sys.path_hooks.

sys.path_hooks is a list of callables, which will be checked in sequence
to determine if they can handle a given path item. The callable is
called with one argument, the path item. The callable must raise
ImportError if it is unable to handle the path item, and return an
importer object if it can handle the path item. Note that if the
callable returns an importer object for a specific sys.path entry, the
builtin import machinery will not be invoked to handle that entry any
longer, even if the importer object later fails to find a specific
module. The callable is typically the class of the import hook, and
hence the class __init__() method is called. (This is also the reason
why it should raise ImportError: an __init__() method can't return
anything. This would be possible with a __new__() method in a new style
class, but we don't want to require anything about how a hook is
implemented.)

The results of path hook checks are cached in sys.path_importer_cache,
which is a dictionary mapping path entries to importer objects. The
cache is checked before sys.path_hooks is scanned. If it is necessary to
force a rescan of sys.path_hooks, it is possible to manually clear all
or part of sys.path_importer_cache.

Just like sys.path itself, the new sys variables must have specific
types:

-   sys.meta_path and sys.path_hooks must be Python lists.
-   sys.path_importer_cache must be a Python dict.

Modifying these variables in place is allowed, as is replacing them with
new objects.

Packages and the role of __path__

If a module has a __path__ attribute, the import mechanism will treat it
as a package. The __path__ variable is used instead of sys.path when
importing submodules of the package. The rules for sys.path therefore
also apply to pkg.__path__. So sys.path_hooks is also consulted when
pkg.__path__ is traversed. Meta importers don't necessarily use sys.path
at all to do their work and may therefore ignore the value of
pkg.__path__. In this case it is still advised to set it to list, which
can be empty.

Optional Extensions to the Importer Protocol

The Importer Protocol defines three optional extensions. One is to
retrieve data files, the second is to support module packaging tools
and/or tools that analyze module dependencies (for example Freeze),
while the last is to support execution of modules as scripts. The latter
two categories of tools usually don't actually load modules, they only
need to know if and where they are available. All three extensions are
highly recommended for general purpose importers, but may safely be left
out if those features aren't needed.

To retrieve the data for arbitrary "files" from the underlying storage
backend, loader objects may supply a method named get_data():

    loader.get_data(path)

This method returns the data as a string, or raise IOError if the "file"
wasn't found. The data is always returned as if "binary" mode was used
-there is no CRLF translation of text files, for example. It is meant
for importers that have some file-system-like properties. The 'path'
argument is a path that can be constructed by munging module.__file__
(or pkg.__path__ items) with the os.path.* functions, for example:

    d = os.path.dirname(__file__)
    data = __loader__.get_data(os.path.join(d, "logo.gif"))

The following set of methods may be implemented if support for (for
example) Freeze-like tools is desirable. It consists of three additional
methods which, to make it easier for the caller, each of which should be
implemented, or none at all:

    loader.is_package(fullname)
    loader.get_code(fullname)
    loader.get_source(fullname)

All three methods should raise ImportError if the module wasn't found.

The loader.is_package(fullname) method should return True if the module
specified by 'fullname' is a package and False if it isn't.

The loader.get_code(fullname) method should return the code object
associated with the module, or None if it's a built-in or extension
module. If the loader doesn't have the code object but it does have the
source code, it should return the compiled source code. (This is so that
our caller doesn't also need to check get_source() if all it needs is
the code object.)

The loader.get_source(fullname) method should return the source code for
the module as a string (using newline characters for line endings) or
None if the source is not available (yet it should still raise
ImportError if the module can't be found by the importer at all).

To support execution of modules as scripts (PEP 338), the above three
methods for finding the code associated with a module must be
implemented. In addition to those methods, the following method may be
provided in order to allow the runpy module to correctly set the
__file__ attribute:

    loader.get_filename(fullname)

This method should return the value that __file__ would be set to if the
named module was loaded. If the module is not found, then ImportError
should be raised.

Integration with the 'imp' module

The new import hooks are not easily integrated in the existing
imp.find_module() and imp.load_module() calls. It's questionable whether
it's possible at all without breaking code; it is better to simply add a
new function to the imp module. The meaning of the existing
imp.find_module() and imp.load_module() calls changes from: "they expose
the built-in import mechanism" to "they expose the basic unhooked
built-in import mechanism". They simply won't invoke any import hooks. A
new imp module function is proposed (but not yet implemented) under the
name get_loader(), which is used as in the following pattern:

    loader = imp.get_loader(fullname, path)
    if loader is not None:
        loader.load_module(fullname)

In the case of a "basic" import, one the imp.find_module() function
would handle, the loader object would be a wrapper for the current
output of imp.find_module(), and loader.load_module() would call
imp.load_module() with that output.

Note that this wrapper is currently not yet implemented, although a
Python prototype exists in the test_importhooks.py script (the
ImpWrapper class) included with the patch.

Forward Compatibility

Existing __import__ hooks will not invoke new-style hooks by magic,
unless they call the original __import__ function as a fallback. For
example, ihooks.py, iu.py and imputil.py are in this sense not forward
compatible with this PEP.

Open Issues

Modules often need supporting data files to do their job, particularly
in the case of complex packages or full applications. Current practice
is generally to locate such files via sys.path (or a package.__path__
attribute). This approach will not work, in general, for modules loaded
via an import hook.

There are a number of possible ways to address this problem:

-   "Don't do that". If a package needs to locate data files via its
    __path__, it is not suitable for loading via an import hook. The
    package can still be located on a directory in sys.path, as at
    present, so this should not be seen as a major issue.
-   Locate data files from a standard location, rather than relative to
    the module file. A relatively simple approach (which is supported by
    distutils) would be to locate data files based on sys.prefix (or
    sys.exec_prefix). For example, looking in
    os.path.join(sys.prefix, "data", package_name).
-   Import hooks could offer a standard way of getting at data files
    relative to the module file. The standard zipimport object provides
    a method get_data(name) which returns the content of the "file"
    called name, as a string. To allow modules to get at the importer
    object, zipimport also adds an attribute __loader__ to the module,
    containing the zipimport object used to load the module. If such an
    approach is used, it is important that client code takes care not to
    break if the get_data() method is not available, so it is not clear
    that this approach offers a general answer to the problem.

It was suggested on python-dev that it would be useful to be able to
receive a list of available modules from an importer and/or a list of
available data files for use with the get_data() method. The protocol
could grow two additional extensions, say list_modules() and
list_files(). The latter makes sense on loader objects with a get_data()
method. However, it's a bit unclear which object should implement
list_modules(): the importer or the loader or both?

This PEP is biased towards loading modules from alternative places: it
currently doesn't offer dedicated solutions for loading modules from
alternative file formats or with alternative compilers. In contrast, the
ihooks module from the standard library does have a fairly
straightforward way to do this. The Quixote project[8] uses this
technique to import PTL files as if they are ordinary Python modules. To
do the same with the new hooks would either mean to add a new module
implementing a subset of ihooks as a new-style importer, or add a
hookable built-in path importer object.

There is no specific support within this PEP for "stacking" hooks. For
example, it is not obvious how to write a hook to load modules from
tar.gz files by combining separate hooks to load modules from .tar and
.gz files. However, there is no support for such stacking in the
existing hook mechanisms (either the basic "replace __import__" method,
or any of the existing import hook modules) and so this functionality is
not an obvious requirement of the new mechanism. It may be worth
considering as a future enhancement, however.

It is possible (via sys.meta_path) to add hooks which run before
sys.path is processed. However, there is no equivalent way of adding
hooks to run after sys.path is processed. For now, if a hook is required
after sys.path has been processed, it can be simulated by adding an
arbitrary "cookie" string at the end of sys.path, and having the
required hook associated with this cookie, via the normal sys.path_hooks
processing. In the longer term, the path handling code will become a
"real" hook on sys.meta_path, and at that stage it will be possible to
insert user-defined hooks either before or after it.

Implementation

The PEP 302 implementation has been integrated with Python as of 2.3a1.
An earlier version is available as patch #652586[9], but more
interestingly, the issue contains a fairly detailed history of the
development and design.

PEP 273 has been implemented using PEP 302's import hooks.

References and Footnotes

Copyright

This document has been placed in the public domain.



  Local Variables: mode: indented-text indent-tabs-mode: nil
  sentence-end-double-space: t fill-column: 70 coding: utf-8 End:

[1] Language reference for imports
http://docs.python.org/3/reference/import.html

[2] importlib documentation
http://docs.python.org/3/library/importlib.html#module-importlib

[3] imputil module http://docs.python.org/library/imputil.html

[4] The Freeze tool. See also the Tools/freeze/ directory in a Python
source distribution

[5] py2exe by Thomas Heller http://www.py2exe.org/

[6] imp.set_frozenmodules() patch http://bugs.python.org/issue642578

[7] The path argument to finder.find_module() is there because the
pkg.__path__ variable may be needed at this point. It may either come
from the actual parent module or be supplied by imp.find_module() or the
proposed imp.get_loader() function.

[8] Quixote, a framework for developing Web applications
http://www.mems-exchange.org/software/quixote/

[9] New import hooks + Import from Zip files
http://bugs.python.org/issue652586