PEP: 438
Title: Transitioning to release-file hosting on PyPI
Version: $Revision$
Last-Modified: $Date$
Author: Holger Krekel <holger@merlinux.eu>, Carl Meyer <carl@oddbird.net>
BDFL-Delegate: Richard Jones <richard@python.org>
Discussions-To: distutils-sig@python.org
Status: Superseded
Type: Process
Topic: Packaging
Content-Type: text/x-rst
Created: 15-Mar-2013
Post-History: 19-May-2013
Superseded-By: 470
Resolution: https://mail.python.org/pipermail/distutils-sig/2013-May/020773.html

Abstract

This PEP proposes a backward-compatible two-phase transition process to
make installing from the pypi.python.org (PyPI) package index faster,
simpler and more robust. To ease the transition and minimize client-side
friction, no changes to distutils or existing installation tools are
required in order to benefit from the first transition phase, which will
result in faster, more reliable installs for most existing packages.

The first transition phase implements easy and explicit means for a
package maintainer to control which release file links are served to
present-day installation tools. The first phase also includes the
implementation of analysis tools for present-day packages, to support
communication with package maintainers and the automated setting of
default modes for controlling release file links. The first phase will
also default newly-registered projects on PyPI to serving links only to
release files that were uploaded to PyPI.

The second transition phase concerns end-user installation tools, which
shall default to installing only release files that are hosted on PyPI
and tell the user if external release files exist, offering a choice to
automatically use those external files. External release files shall in
the future be registered together with a checksum hash so that
installation tools can verify the integrity of the eventual download
(PyPI-hosted release files always carry such a checksum).

Alternative PyPI server implementations should implement the new simple
index serving behaviour of transition phase 1 to avoid installation
tools treating their release links as external ones in phase 2.

Rationale

History and motivations for external hosting

When PyPI went online, it offered release registration but had no
facility to host release files itself. When hosting was added, no
automated downloading tool existed yet. When Phillip Eby implemented
automated downloading (through setuptools), he made the choice to allow
people to use download hosts of their choice. Discovery of
externally-hosted packages was implemented as follows:

1.  The PyPI simple/ index for a package contains all links found by
    scraping them from that package's long_description metadata for any
    release. Links in the "Download-URL" and "Home-page" metadata fields
    are given rel=download and rel=homepage attributes, respectively.
2.  Any of these links whose target is a file whose name appears to be
    in the form of an installable source or binary distribution, with
    name in the form "packagename-version.ARCHIVEEXT", is considered a
    potential installation candidate by installation tools.
3.  Similarly, any links suffixed with an "#egg=packagename-version"
    fragment are considered an installation candidate.
4.  Additionally, the rel=homepage and rel=download links are crawled by
    installation tools and, if HTML, are themselves scraped for
    release-file links in the above formats.

See the easy_install documentation for a complete description of this
behavior.[1]
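
As a rough illustration of rules 2 and 3 above, the following sketch
tests whether a single link looks like an installation candidate. It is
not the setuptools implementation: the function name is_candidate and
the list of archive extensions are assumptions made for this example,
and real tools recognise more formats and normalise project names.

```python
from urllib.parse import urlparse

# Assumed set of archive extensions; present-day tools accept more.
ARCHIVE_EXTS = (".tar.gz", ".tar.bz2", ".tgz", ".zip", ".egg")


def is_candidate(url, package):
    """Return True if `url` looks like a release file for `package`."""
    parsed = urlparse(url)
    # Rule 3: an "#egg=packagename-version" fragment nominates the link.
    if parsed.fragment.startswith("egg=" + package + "-"):
        return True
    # Rule 2: the file name has the form "packagename-version.ARCHIVEEXT".
    filename = parsed.path.rsplit("/", 1)[-1]
    if not filename.startswith(package + "-"):
        return False
    return filename.endswith(ARCHIVE_EXTS)
```

Under this sketch, a link such as
http://example.com/svn/trunk#egg=foo-dev counts as a candidate for
"foo" even though its target is not an archive file.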

Today, most packages indexed on PyPI host their release files on PyPI.
Out of 29,117 total projects on PyPI, only 2,581 (less than 10%) include
any links to installable files that are available only off-PyPI.[2]

There are many reasons[3] why people have chosen external hosting. To
cite just a few:

-   release processes and scripts have been developed already and upload
    to external sites
-   it takes too long to upload large files from some places in the
    world
-   export restrictions e.g. for crypto-related software
-   company policies which require offering open source packages through
    own sites
-   problems with integrating uploading to PyPI into one's release
    process (because of release policies)
-   desiring download statistics different from those maintained by PyPI
-   perceived bad reliability of PyPI
-   not aware that PyPI offers file-hosting

Irrespective of the present-day validity of these reasons, there is
clearly a history behind why people chose to host files externally, and
for some time it was even the only way to do things. This PEP takes the
position that there remain some valid reasons for external hosting even
today.

Problem

Today, Python package installers (pip, easy_install, buildout, and
others) often need to query many non-PyPI URLs even if there are no
externally hosted files. Apart from querying pypi.python.org's simple
index pages, an installer also crawls all homepages and download pages
ever specified with any release of a package. The need for
installers to crawl external sites slows down installation and makes for
a brittle and unreliable installation process. Those sites and packages
also don't take part in the PEP 381 mirroring infrastructure, further
decreasing reliability and speed of automated installation processes
around the world.

Most packages are hosted directly on pypi.python.org[4]. Even for these
packages, installers still crawl their homepage and download-url, if
specified. Many package uploaders are not aware that specifying the
"homepage" or "download-url" in their package metadata will needlessly
slow down the installation process for all users.

Relying on third party sites also opens up more attack vectors for
injecting malicious packages into sites using automated installs. A
simple attack might just involve getting hold of an old now-unused
homepage domain and placing malicious packages there. Moreover,
performing a Man-in-the-Middle (MITM) attack between an installation
site and any of the download sites can inject malicious packages on the
installation site. As many homepages and download locations are using
HTTP and not HTTPS, such attacks are not hard to launch. Such MITM
attacks can easily happen even for packages which never intended to host
files externally as their homepages are contacted by installers anyway.

There is currently no way for package maintainers to avoid external-link
crawling, other than removing all homepage/download url metadata for all
historic releases. While a script[5] has been written to perform this
action, it is not a good general solution because it removes useful
metadata from PyPI releases.

Even if the sites referenced by "Homepage" and "Download-URL" links were
not scraped for further links, there is no obvious way under the current
system for a package owner to link to an installable file from a
long_description metadata field (which is shown as package documentation
on /pypi/PKG) without installation tools automatically considering that
file a candidate for installation. Conversely, there is no way to
explicitly register multiple external release files without putting them
in metadata fields.

Goals

These are the goals to be achieved by implementation of this PEP:

-   Package owners should be able to explicitly control which files are
    presented by PyPI to installer tools as installation candidates.
    Installation should not be slowed and made less reliable by
    extensive and unnecessary crawling of links that package owners did
    not explicitly nominate as installation files.
-   It should remain possible for package owners to choose to host their
    release files on their own hosting, external to PyPI. It should be
    easy for a user to request the installation of such releases using
    automated installer tools, especially if the external release files
    were registered together with a checksum hash.
-   Automated installer tools should not install externally-hosted
    packages by default, but require explicit authorization to do so by
    the user. When tools refuse to install such a package by default,
    they should tell the user exactly which external link(s) the
    installer needs to follow, and what option(s) the user can provide
    to authorize the tool to follow those links. PyPI should provide all
    necessary metadata for installer tools to implement this easily and
    within a single request/reply interaction.
-   Migration from the status quo to the above points should be gradual
    and minimize breakage. This includes tooling that makes it easy for
    package owners with an existing release process that uploads to
    non-PyPI hosting to also upload those release files to PyPI.

Solution / two transition phases

The first transition phase introduces a "hosting-mode" field for each
project on PyPI, allowing package owners explicit control of which
release file links are served to present-day installation tools in the
machine-readable simple/ index. The first transition phase will, after
early adopters have successfully changed hosting modes by hand, set
a default hosting mode for existing packages, based on automated
analysis. Maintainers will be notified one month ahead of any such
automated change. At completion of the first transition phase, all
present-day existing release and installation processes and tools are
expected to continue working. Any remaining errors or problems are
expected to only relate to installation of individual packages and can
be easily corrected by package maintainers or PyPI admins if maintainers
are not reachable.

Also in the first phase, each link served in the simple/ index will be
explicitly marked as rel="internal" if it is hosted by the index itself
(even if on a separate domain, which may be the case if the index uses a
CDN for file-serving). Any link not so marked will be considered an
external link.

In the second transition phase, PyPI client installation tools shall be
updated to default to only install rel="internal" packages unless a user
specifies option(s) to permit installing from external links. See
"Second transition phase (installer tools)" below for details on how
installers should behave.

Maintainers of packages which currently host release files on non-PyPI
sites shall receive instructions and tools to ease "re-hosting" of their
historic and future package release files. This re-hosting tool MUST be
available before automated hosting-mode changes are announced to package
maintainers.

Implementation

Hosting modes

The foundation of the first transition phase is the introduction of
three "modes" of PyPI hosting for a package, affecting which links are
generated for the simple/ index. These modes require no changes to
installation tools; they are implemented solely by changing the
algorithm that generates the machine-readable simple/ index.

The modes are:

-   pypi-scrape-crawl: no change from the current situation of
    generating machine-readable links for installation tools, as
    outlined in the history.
-   pypi-scrape: for a package in this mode, links to be added to the
    simple/ index are still scraped from package metadata. However, the
    "Home-page" and "Download-url" links are given rel=ext-homepage and
    rel=ext-download attributes instead of rel=homepage and
    rel=download. The effect of this (with no change in installation
    tools necessary) is that these links will not be followed and
    scraped for further candidate links by present-day installation
    tools: only installable files directly hosted from PyPI or linked
    directly from PyPI metadata will be considered for installation.
    Installation tools MAY evolve to offer an option to use the new
    rel-attribution to crawl external pages but MUST NOT default to it.
-   pypi-explicit: for a package in this mode, only links to release
    files uploaded to PyPI, and external links to release files
    explicitly nominated by the package owner, will be added to the
    simple/ index. PyPI will provide a new interface for package owners
    to supply external release-file URLs. These URLs MUST include a URL
    fragment in the form "#hashtype=hashvalue" specifying a hash of the
    externally-linked file which installer tools MUST use to validate
    that they have downloaded the intended file.
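
The effect of the three modes on the simple/ index can be sketched as
follows. This is an illustration only, not the PyPI implementation: the
function simple_index_links and its parameters are hypothetical, and the
rel="internal" marking it applies to index-hosted files is the one this
PEP introduces in the first transition phase.

```python
MODES = ("pypi-scrape-crawl", "pypi-scrape", "pypi-explicit")


def simple_index_links(mode, hosted, scraped=(), explicit=(),
                       homepage=None, download_url=None):
    """Return (url, rel) pairs for one project's simple/ page."""
    if mode not in MODES:
        raise ValueError("unknown hosting mode: %r" % (mode,))
    # Files hosted by the index itself are served in every mode.
    links = [(url, "internal") for url in hosted]
    if mode == "pypi-explicit":
        # Only owner-nominated external URLs, each carrying a hash fragment.
        for url in explicit:
            fragment = url.partition("#")[2]
            if "=" not in fragment:
                raise ValueError("external URL lacks #hashtype=hashvalue: " + url)
            links.append((url, None))
        return links
    # Both scraping modes keep the links scraped from package metadata.
    links.extend((url, None) for url in scraped)
    # pypi-scrape renames the rel values so present-day tools stop crawling.
    prefix = "" if mode == "pypi-scrape-crawl" else "ext-"
    if homepage:
        links.append((homepage, prefix + "homepage"))
    if download_url:
        links.append((download_url, prefix + "download"))
    return links
```

Note that in pypi-explicit mode the homepage and download URL are not
served at all, so there is nothing left for an installer to crawl.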

Thus the hope is that eventually all projects on PyPI can be migrated to
the pypi-explicit mode, while preserving the ability to install release
files hosted externally via installer tools. Deprecation of hosting
modes to eventually only allow the pypi-explicit mode is NOT REGULATED
by this PEP but is expected to become feasible some time after
successful implementation of the transition phases described in this
PEP. Such deprecation is expected to require a new process for dealing
with still-popular packages that are abandoned because their
maintainers are unreachable.

First transition phase (PyPI)

The proposed solution consists of multiple implementation and
communication steps:

1.  Implement in PyPI the three modes described above, with an interface
    for package owners to select the mode for each package and register
    explicit external file URLs.
2.  For packages in all modes, label links in the simple/ index to
    index-hosted files with rel="internal", to make it easier for client
    tools to distinguish these links in the second phase.
3.  Add an HTML tag <meta name="api-version" value="2"> to all simple/
    index pages, to allow clients to distinguish between indexes
    providing the rel="internal" metadata and older ones that do not.
4.  Default all newly-registered packages to pypi-explicit mode (package
    owners can still switch to the other modes as desired).
5.  Determine (via automated analysis[6]) which packages have all
    installable files available on PyPI itself (group A), which have all
    installable files on PyPI or linked directly from PyPI metadata
    (group B), and which have installable versions available that are
    linked only from external homepage/download HTML pages (group C).
6.  Send mail to maintainers of projects in group A that their project
    will be automatically configured to pypi-explicit mode in one month,
    and similarly to maintainers of projects in group B that their
    project will be automatically configured to pypi-scrape mode. Inform
    them that this change is not expected to affect installability of
    their project at all, but will result in faster and safer installs
    for their users. Encourage them to set this mode themselves sooner
    to benefit their users.
7.  Send mail to maintainers of packages in group C that their package
    hosting mode is pypi-scrape-crawl, list the URLs which currently are
    crawled, and suggest that they either re-host their packages
    directly on PyPI and switch to pypi-explicit, or at least provide
    direct links to release files in PyPI metadata and switch to
    pypi-scrape. Provide instructions and tools to help with these
    transitions.
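
The grouping in step 5 and the default modes in steps 6 and 7 might be
expressed as below. This is a sketch under assumptions and not the
analysis tool referenced above: it assumes the analysis yields, for each
installable version, the set of places where a file was found, using the
hypothetical labels "pypi", "direct" (linked directly from PyPI
metadata) and "crawled" (found only on external HTML pages).

```python
# Default hosting mode announced to maintainers for each group.
DEFAULT_MODE = {
    "A": "pypi-explicit",
    "B": "pypi-scrape",
    "C": "pypi-scrape-crawl",
}


def hosting_group(sources_by_version):
    """Map {version: set of source labels} to group "A", "B" or "C"."""
    sources = list(sources_by_version.values())
    # Group A: every installable version is available on PyPI itself.
    if all("pypi" in found for found in sources):
        return "A"
    # Group B: every version is on PyPI or linked directly from metadata.
    if all(found & {"pypi", "direct"} for found in sources):
        return "B"
    # Group C: at least one version is reachable only by crawling.
    return "C"
```

A project with no installable files at all falls into group A in this
sketch, since switching it to pypi-explicit cannot remove any candidate.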

Second transition phase (installer tools)

For the second transition phase, maintainers of installation tools are
asked to release two updates.

The first update shall provide clear warnings if externally-hosted
release files (that is, files whose link does not include
rel="internal") are selected for download, state exactly for which
projects and URLs this happens, and warn that in future versions
externally-hosted downloads will be disabled by default.

The second update should change the default mode to allow only
installation of rel="internal" package files, and allow installation of
externally-hosted packages only when the user supplies an option.

The installer should distinguish between verifiable and non-verifiable
external links. A verifiable external link is a direct link to an
installable file from the PyPI simple/ index that includes a hash in the
URL fragment ("#hashtype=hashvalue") which can be used to verify the
integrity of the downloaded file. A non-verifiable external link is any
link (other than those explicitly supplied by the user of an installer
tool) without a hash, scraped from external HTML, or injected into the
search via some other non-PyPI source (e.g. setuptools' dependency_links
feature).
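
A minimal sketch of how an installer could recognise and check a
verifiable link is shown below. The helper names and the allow-list of
hash types are assumptions of this example; the PEP itself only fixes
the "#hashtype=hashvalue" fragment form.

```python
import hashlib
from urllib.parse import urlparse

# Assumed allow-list; the PEP does not enumerate permitted hash types.
HASH_TYPES = ("md5", "sha1", "sha224", "sha256", "sha384", "sha512")


def parse_hash_fragment(url):
    """Return (hashtype, hashvalue), or None if the link is not verifiable."""
    fragment = urlparse(url).fragment
    hashtype, sep, hashvalue = fragment.partition("=")
    if not sep or not hashvalue or hashtype not in HASH_TYPES:
        return None
    return hashtype, hashvalue.lower()


def verify_download(url, data):
    """Check downloaded bytes against the hash carried in the URL fragment."""
    parsed = parse_hash_fragment(url)
    if parsed is None:
        raise ValueError("link carries no usable hash fragment: " + url)
    hashtype, expected = parsed
    return hashlib.new(hashtype, data).hexdigest() == expected
```

A link ending in "#egg=foo-1.0" is not verifiable under this sketch,
because "egg" is not a hash type.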

Installers should provide a blanket option to allow installing any
verifiable external link. Non-verifiable external links should only be
installed if the user-provided option specifies exactly which external
domains can be used or for which specific package names external links
can be used.

When download of an externally-hosted package is disallowed by the
default configuration, the user should be notified, with instructions
for how to make the install succeed and warnings about the implication
(that a file will be downloaded from a site that is not part of the
package index). The warning given for non-verifiable links should
clearly state that the installer cannot verify the integrity of the
downloaded file. The warning given for verifiable external links should
simply note that the file will be downloaded from an external URL, but
that the file integrity can be verified by checksum.
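
The default policy and its opt-in options could be combined roughly as
follows. The option names allow_external, allowed_domains and
allowed_projects are hypothetical and are not flags of any existing
installer. The sketch also assumes that the narrower per-domain and
per-project options permit verifiable links as well, which the PEP does
not spell out.

```python
from urllib.parse import urlparse


def may_download(url, project, rel=None, verifiable=False,
                 allow_external=False, allowed_domains=(),
                 allowed_projects=()):
    """Decide whether a candidate link may be downloaded."""
    # Index-hosted files are always permitted.
    if rel == "internal":
        return True
    # Explicitly named domains or projects unlock external links.
    named = (urlparse(url).hostname in allowed_domains
             or project in allowed_projects)
    if verifiable:
        # The blanket option covers only links that carry a hash.
        return allow_external or named
    # Non-verifiable links require a specific domain or project opt-in.
    return named
```

When this function returns False, the installer would report the URL,
the project, and the option that would permit the download.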

Alternative PyPI-compatible index implementations should upgrade to
begin providing the rel="internal" metadata and the
<meta name="api-version" value="2"> tag as soon as possible. For
alternative indexes which do not yet provide the meta tag in their
simple/ pages, installation tools should provide backwards-compatible
fallback behavior (treat links as internal as in pre-PEP times and
provide a warning).
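
The fallback described above can be sketched with the standard library
HTML parser. The class and function names are assumptions of this
example, and the sketch treats any page without the meta tag as a
pre-PEP index rather than comparing version numbers.

```python
from html.parser import HTMLParser


class SimplePageParser(HTMLParser):
    """Collect the api-version meta tag and all links of a simple/ page."""

    def __init__(self):
        super().__init__()
        self.api_version = None
        self.links = []  # (href, rel) pairs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "api-version":
            self.api_version = attrs.get("value")
        elif tag == "a" and "href" in attrs:
            self.links.append((attrs["href"], attrs.get("rel")))


def classify_links(html):
    """Return (legacy, [(href, is_internal), ...]) for a simple/ page."""
    parser = SimplePageParser()
    parser.feed(html)
    # No meta tag: pre-PEP index, so every link is treated as internal.
    legacy = parser.api_version is None
    links = [(href, legacy or rel == "internal")
             for href, rel in parser.links]
    return legacy, links
```

An installer would emit its backwards-compatibility warning whenever the
first element of the result is True.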

API For Submitting External Distribution URLs

New distribution URLs may be submitted by performing an HTTP POST to the
URL:

  https://pypi.python.org/pypi

With the following form-encoded data:

  Name             Value
  ---------------- ---------------------------------
  :action          The string "urls"
  name             The package name as a string
  version          The release version as a string
  new-url          The new URL to store
  submit_new_url   The string "yes"

The POST must be accompanied by an HTTP Basic Auth header encoding the
username and password of the user authorized to maintain the package on
PyPI.

The HTTP response to this request will be one of:

  Code   Meaning        URL submission implications
  ------ -------------- -------------------------------------------------------------------------------------------
  200    OK             Everything worked just fine
  400    Bad request    Data provided for submission was malformed
  401    Unauthorised   The username or password supplied were incorrect
  403    Forbidden      User does not have permission to update the package information (not Owner or Maintainer)
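
As an illustration, the request described above could be assembled with
the standard library as follows. The function name build_submit_request
is hypothetical, and the sketch only constructs the request object;
passing it to urllib.request.urlopen would send it, after which the
status codes in the table above apply.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_submit_request(name, version, new_url, username, password):
    """Build (but do not send) the POST registering an external URL."""
    form = urlencode({
        ":action": "urls",
        "name": name,
        "version": version,
        "new-url": new_url,
        "submit_new_url": "yes",
    }).encode("ascii")
    # HTTP Basic Auth: base64 of "username:password".
    token = base64.b64encode(
        (username + ":" + password).encode("utf-8")).decode("ascii")
    return Request(
        "https://pypi.python.org/pypi",
        data=form,
        headers={
            "Authorization": "Basic " + token,
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

In pypi-explicit mode the new_url value would carry the
"#hashtype=hashvalue" fragment required for external release files.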

References

Acknowledgments

Phillip Eby for precise information and the basic ideas to implement the
transition via server-side changes only.

Donald Stufft for pushing away from external hosting and offering to
implement both a Pull Request for the necessary PyPI changes and the
analysis tool to drive the transition phase 1.

Marc-Andre Lemburg, Alyssa Coghlan and catalog-sig in general for
thinking through issues regarding getting rid of "external hosting".

Copyright

This document has been placed in the public domain.




[1] Phillip Eby, easy_install 'Package Index "API"' documentation,
http://peak.telecommunity.com/DevCenter/EasyInstall#package-index-api

[2] Donald Stufft, automated analysis of PyPI project links,
https://github.com/dstufft/pypi.linkcheck

[3] Marc-Andre Lemburg, reasons for external hosting,
https://mail.python.org/pipermail/catalog-sig/2013-March/005626.html

[4] Donald Stufft, automated analysis of PyPI project links,
https://github.com/dstufft/pypi.linkcheck

[5] Holger Krekel, script to remove homepage/download metadata for all
releases,
https://mail.python.org/pipermail/catalog-sig/2013-February/005423.html

[6] Donald Stufft, automated analysis of PyPI project links,
https://github.com/dstufft/pypi.linkcheck