PEP: 262 Title: A Database of Installed Python Packages Author: A.M.
Kuchling <amk@amk.ca> Status: Rejected Type: Standards Track Topic:
Packaging Content-Type: text/x-rst Created: 08-Jul-2001 Post-History:
27-Mar-2002

Note

This PEP was superseded by PEP 345 and PEP 376, which were accepted.
Therefore, this PEP is (by implication) rejected.

Introduction

This PEP describes a format for a database of the Python software
installed on a system.

(In this document, the term "distribution" is used to mean a set of code
that's developed and distributed together. A "distribution" is the same
as a Red Hat or Debian package, but the term "package" already has a
meaning in Python terminology, meaning "a directory with an __init__.py
file in it.")

Requirements

We need a way to figure out what distributions, and what versions of
those distributions, are installed on a system. We want to provide
features similar to CPAN, APT, or RPM. Required use cases that should be
supported are:

-   Is distribution X on a system?
-   What version of distribution X is installed?
-   Where can the new version of distribution X be found? (This can be
    defined as either "a home page where the user can go and find a
    download link", or "a place where a program can find the newest
    version?" Both should probably be supported.)
-   What files did distribution X put on my system?
-   What distribution did the file x/y/z.py come from?
-   Has anyone modified x/y/z.py locally?
-   What other distributions does this software need?
-   What Python modules does this distribution provide?

Database Location

The database lives in a bunch of files under
<prefix>/lib/python<version>/install-db/. This location will be called
INSTALLDB through the remainder of this PEP.

The structure of the database is deliberately kept simple; each file in
this directory or its subdirectories (if any) describes a single
distribution. Binary packagings of Python software such as RPMs can then
update Python's database by just installing the corresponding file into
the INSTALLDB directory.

The rationale for scanning subdirectories is that we can move to a
directory-based indexing scheme if the database directory contains too
many entries. For example, this would let us transparently switch from
INSTALLDB/Numeric to INSTALLDB/N/Nu/Numeric or some similar hashing
scheme.

Database Contents

Each file in INSTALLDB or its subdirectories describes a single
distribution, and has the following contents:

An initial line listing the sections in this file, separated by
whitespace. Currently this will always be 'PKG-INFO FILES REQUIRES
PROVIDES'. This is for future-proofing; if we add a new section, for
example to list documentation files, then we'd add a DOCS section and
list it in the contents. Sections are always separated by blank lines.

A distribution that uses the Distutils for installation should
automatically update the database. Distributions that roll their own
installation will have to use the database's API to manually add or
update their own entry. System package managers such as RPM or pkgadd
can just create the new file in the INSTALLDB directory.

Each section of the file is used for a different purpose.

PKG-INFO section

An initial set of 822 headers containing the distribution information
for a file, as described in PEP 241, "Metadata for Python Software
Packages".

FILES section

An entry for each file installed by the distribution. Generated files
such as .pyc and .pyo files are on this list as well as the original .py
files installed by a distribution; their checksums won't be stored or
checked, though.

Each file's entry is a single tab-delimited line that contains the
following fields:

-   The file's full path, as installed on the system.
-   The file's size
-   The file's permissions. On Windows, this field will always be
    'unknown'
-   The owner and group of the file, separated by a tab. On Windows,
    these fields will both be 'unknown'.
-   A SHA1 digest of the file, encoded in hex. For generated files such
    as *.pyc files, this field must contain the string "-", which
    indicates that the file's checksum should not be verified.

REQUIRES section

This section is a list of strings giving the services required for this
module distribution to run properly. This list includes the distribution
name ("python-stdlib") and module names ("rfc822", "htmllib", "email",
"email.Charset"). It will be specified by an extra 'requires' argument
to the distutils.core.setup() function. For example:

    setup(..., requires=['xml.utils.iso8601',

Eventually there may be automated tools that look through all of the
code and produce a list of requirements, but it's unlikely that these
tools can handle all possible cases; a manual way to specify
requirements will always be necessary.

PROVIDES section

This section is a list of strings giving the services provided by an
installed distribution. This list includes the distribution name
("python-stdlib") and module names ("rfc822", "htmllib", "email",
"email.Charset").

XXX should files be listed? e.g. $PREFIX/lib/color-table.txt, to pick up
data files, required scripts, etc.

Eventually there may be an option to let module developers add their own
strings to this section. For example, you might add "XML parser" to this
section, and other module distributions could then list "XML parser" as
one of their dependencies to indicate that multiple different XML
parsers can be used. For now this ability isn't supported because it
raises too many issues: do we need a central registry of legal strings,
or just let people put whatever they like? Etc., etc...

API Description

There's a single fundamental class, InstallationDatabase. The code for
it lives in distutils/install_db.py. (XXX any suggestions for alternate
locations in the standard library, or an alternate module name?)

The InstallationDatabase returns instances of Distribution that contain
all the information about an installed distribution.

XXX Several of the fields in Distribution are duplicates of ones in
distutils.dist.Distribution. Probably they should be factored out into
the Distribution class proposed here, but can this be done in a
backward-compatible way?

InstallationDatabase has the following interface:

    class InstallationDatabase:
        def __init__ (self, path=None):
            """InstallationDatabase(path:string)
            Read the installation database rooted at the specified path.
            If path is None, INSTALLDB is used as the default.
            """

        def get_distribution (self, distribution_name):
            """get_distribution(distribution_name:string) : Distribution
            Get the object corresponding to a single distribution.
            """

        def list_distributions (self):
            """list_distributions() : [Distribution]
            Return a list of all distributions installed on the system,
            enumerated in no particular order.
            """

        def find_distribution (self, path):
            """find_file(path:string) : Distribution
            Search and return the distribution containing the file 'path'.
            Returns None if the file doesn't belong to any distribution
            that the InstallationDatabase knows about.
            XXX should this work for directories?
            """

    class Distribution:
        """Instance attributes:
        name : string
          Distribution name
        files : {string : (size:int, perms:int, owner:string, group:string,
                           digest:string)}
           Dictionary mapping the path of a file installed by this distribution
           to information about the file.

        The following fields all come from PEP 241.

        version : distutils.version.Version
          Version of this distribution
        platform : [string]
        summary : string
        description : string
        keywords : string
        home_page : string
        author : string
        author_email : string
        license : string
        """

        def add_file (self, path):
            """add_file(path:string):None
            Record the size, ownership, &c., information for an installed file.
            XXX as written, this would stat() the file.  Should the size/perms/
            checksum all be provided as parameters to this method instead?
            """

        def has_file (self, path):
            """has_file(path:string) : Boolean
            Returns true if the specified path belongs to a file in this
            distribution.
            """

         def check_file (self, path):
            """check_file(path:string) : Boolean
            Checks whether the file's size, checksum, and ownership match,
            returning true if they do.
            """

Deliverables

A description of the database API, to be added to this PEP.

Patches to the Distutils that 1) implement an InstallationDatabase
class, 2) Update the database when a new distribution is installed. 3)
add a simple package management tool, features to be added to this PEP.
(Or should that be a separate PEP?) See[1] for the current patch.

Open Issues

PJE suggests the installation database "be potentially present on every
directory in sys.path, with the contents merged in sys.path order. This
would allow home-directory or other alternate-location installs to work,
and ease the process of a distutils install command writing the file."
Nice feature: it does mean that package manager tools can take into
account Python packages that a user has privately installed.

AMK wonders: what does setup.py do if it's told to install packages to a
directory not on sys.path? Does it write an install-db directory to the
directory it's told to write to, or does it do nothing?

Should the package-database file itself be included in the files list?
(PJE would think yes, but of course it can't contain its own checksum.
AMK can't think of a use case where including the DB file matters.)

PJE wonders about writing the package DB file first, before installing
any other files, so that failed partial installations can both be backed
out, and recognized as broken. This PEP may have to specify some
algorithm for recognizing this situation.

Should we guarantee the format of installation databases remains
compatible across Python versions, or is it subject to arbitrary change?
Probably we need to guarantee compatibility.

Rejected Suggestions

Instead of using one text file per distribution, one large text file or
an anydbm file could be used. This has been rejected for a few reasons.
First, performance is probably not an extremely pressing concern as the
database is only used when installing or removing software, a relatively
infrequent task. Scalability also likely isn't a problem, as people may
have hundreds of Python packages installed, but thousands or tens of
thousands seems unlikely. Finally, individual text files are compatible
with installers such as RPM or DPKG because a binary packager can just
drop the new database file into the database directory. If one large
text file or a binary file were used, the Python database would then
have to be updated by running a postinstall script.

On Windows, the permissions and owner/group of a file aren't stored.
Windows does in fact support ownership and access permissions, but
reading and setting them requires the win32all extensions, and they
aren't present in the basic Python installer for Windows.

References

[1] Michael Muller's patch (posted to the Distutils-SIG around 28 Dec
1999) generates a list of installed files.

Acknowledgements

Ideas for this PEP originally came from postings by Greg Ward, Fred L.
Drake Jr., Thomas Heller, Mats Wichmann, Phillip J. Eby, and others.

Many changes and rewrites to this document were suggested by the readers
of the Distutils SIG.

Copyright

This document has been placed in the public domain.

[1] A patch to implement this PEP will be tracked as patch #562100 on
SourceForge. https://bugs.python.org/issue562100 . Code implementing the
installation database is currently in Python CVS in the
nondist/sandbox/pep262 directory.