PEP: 381
Title: Mirroring infrastructure for PyPI
Author: Tarek Ziadé <tarek@ziade.org>, Martin von Löwis <martin@v.loewis.de>
Status: Withdrawn
Type: Standards Track
Topic: Packaging
Content-Type: text/x-rst
Created: 21-Mar-2009
Post-History:

Abstract

This PEP describes a mirroring infrastructure for PyPI.

PEP Withdrawal

The main PyPI web service was moved behind the Fastly caching CDN in May
2013:
https://mail.python.org/pipermail/distutils-sig/2013-May/020848.html

Subsequently, this arrangement was formalised as an in-kind sponsorship
with the PSF, and the PSF has also taken on the task of risk management
should that sponsorship arrangement ever cease.

The download statistics that were previously provided directly on PyPI
are now published indirectly via Google BigQuery:
https://packaging.python.org/guides/analyzing-pypi-package-downloads/

Accordingly, the mirroring proposal described in this PEP is no longer
required, and has been marked as Withdrawn.

Rationale

PyPI hosts over 6000 projects and is used daily by people building
applications. Tools such as easy_install and zc.buildout rely on it
especially heavily.

For people making intensive use of PyPI, it can act as a single point of
failure. People have started to set up mirrors, both private and public.
Those mirrors are active mirrors, which means that they browse PyPI
themselves to stay in sync.

In order to make the system more reliable, this PEP describes:

-   the mirror listing and registering at PyPI
-   the pages a public mirror should maintain. These pages will be used
    by PyPI, in order to get hit counts and the last modified date.
-   how a mirror should synchronize with PyPI
-   how a client can implement a fail-over mechanism

Mirror listing and registering

People who want to mirror PyPI make a proposal on catalog-SIG. When a
mirror is proposed on the mailing list, it is checked for compliance
with the mirroring rules and then manually added to a mirror list in the
PyPI application.

The mirror list is provided as a list of host names of the form

  X.pypi.python.org

The values of X are the sequence a, b, c, ..., aa, ab, ... The host
a.pypi.python.org is the master server; the mirrors start with b. A
CNAME record last.pypi.python.org points to the last host name. Mirror
operators should use a static address, and report planned changes to
that address in advance to distutils-sig.
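
The naming scheme can be sketched in a few lines of Python. This is only
an illustration: the function names mirror_labels and hosts_up_to are
hypothetical, and the sketch assumes that every label between a and the
last one is in use.

```python
from itertools import count, product
from string import ascii_lowercase


def mirror_labels():
    """Yield a, b, ..., z, aa, ab, ... in the order described above."""
    for width in count(1):
        for letters in product(ascii_lowercase, repeat=width):
            yield "".join(letters)


def hosts_up_to(last_host, domain="pypi.python.org"):
    """List every host name from a.<domain> up to and including last_host."""
    last_label = last_host.split(".", 1)[0]
    # Reject labels the generator would never produce, to avoid looping forever.
    if not last_label or any(ch not in ascii_lowercase for ch in last_label):
        raise ValueError(f"unexpected mirror label: {last_label!r}")
    hosts = []
    for label in mirror_labels():
        hosts.append(f"{label}.{domain}")
        if label == last_label:
            return hosts
```

Given the host name that last.pypi.python.org resolves to, the first
entry of the returned list is the master server and the remaining
entries are the mirrors.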

The new mirror also appears at http://pypi.python.org/mirrors, a
human-readable page that lists the mirrors and explains how to register
a new one.

Statistics page

PyPI provides statistics on downloads at /stats. This page is calculated
daily by PyPI, by reading all mirrors' local stats and summing them.

The stats are presented in daily or monthly files, under /stats/days and
/stats/months. Each file is a bzip2 file with these formats:

-   YYYY-MM-DD.bz2 for daily files
-   YYYY-MM.bz2 for monthly files

Examples:

-   /stats/days/2008-11-06.bz2
-   /stats/days/2008-11-07.bz2
-   /stats/days/2008-11-08.bz2
-   /stats/months/2008-11.bz2
-   /stats/months/2008-10.bz2

Mirror Authenticity

With a distributed mirroring system, clients may want to verify that the
mirrored copies are authentic. There are multiple threats to consider:

1.  the central index may get compromised
2.  the central index is assumed to be trusted, but the mirrors might be
    tampered with.
3.  a man in the middle between the central index and the end user, or
    between a mirror and the end user might tamper with datagrams.

This specification only deals with the second threat. Some provisions
are made to detect man-in-the-middle attacks. To detect the first
attack, package authors need to sign their packages using PGP keys, so
that users can verify that the package comes from the author they trust.

The central index provides a DSA key at the URL /serverkey, in the PEM
format as generated by "openssl dsa -pubout" (i.e. RFC 3280
SubjectPublicKeyInfo, with the algorithm 1.3.14.3.2.12). This URL must
not be mirrored, and clients must fetch the official serverkey from PyPI
directly, or use the copy that came with the PyPI client software.
Mirrors should still download the key, to detect a key rollover.

For each package, a mirrored signature is provided at
/serversig/<package>. This is the DSA signature of the parallel URL
/simple/<package>, in DER form, using SHA-1 with DSA (i.e. as an RFC
3279 Dsa-Sig-Value, created by algorithm 1.2.840.10040.4.3).

Clients using a mirror need to perform the following steps to verify a
package:

1.  download the /simple page, and compute its SHA-1 hash
2.  compute the DSA signature of that hash
3.  download the corresponding /serversig, and compare it
    (byte-for-byte) with the value computed in step 2.
4.  compute and verify (against the /simple page) the MD-5 hashes of all
    files they download from the mirror.
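
The hashing parts of this procedure (steps 1 and 4) can be sketched with
the standard library alone. The sketch below assumes that each download
link on the /simple page carries its MD-5 digest as a "#md5=" URL
fragment; the function names are hypothetical. The signature check of
steps 2 and 3 is left out, since it needs a DSA implementation that the
standard library does not provide.

```python
import hashlib
import re

# Assumption: download links on the /simple page end in "#md5=<hex digest>".
_MD5_LINK = re.compile(r'href="([^"#]+)#md5=([0-9a-f]{32})"')


def simple_page_sha1(page_bytes):
    """Step 1: SHA-1 digest of the /simple page as downloaded."""
    return hashlib.sha1(page_bytes).digest()


def expected_md5s(page_text):
    """Map each linked file name to the MD-5 digest listed on the page."""
    return {
        url.rsplit("/", 1)[-1]: digest
        for url, digest in _MD5_LINK.findall(page_text)
    }


def file_matches(page_text, filename, file_bytes):
    """Step 4: check a downloaded file against the digest on the /simple page."""
    expected = expected_md5s(page_text).get(filename)
    return expected is not None and hashlib.md5(file_bytes).hexdigest() == expected
```

A file that is not listed on the /simple page is treated as a failed
check rather than silently accepted.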

An implementation of the verification algorithm is available from
https://svn.python.org/packages/trunk/pypi/tools/verify.py

Verification is not needed when downloading from the central index, and
should be avoided to reduce the computation overhead.

About once a year, the key will be replaced with a new one. Mirrors will
have to re-fetch all /serversig pages. Clients using mirrors need to
find a trusted copy of the new server key. One way to obtain one is to
download it from https://pypi.python.org/serverkey. To detect
man-in-the-middle attacks, clients need to verify the SSL server
certificate, which will be signed by the CACert authority.

Special pages a mirror needs to provide

A mirror is a subset copy of PyPI, so it provides the same structure by
copying it.

-   simple: REST version of the package index
-   packages: packages, stored by Python version, and letters
-   serversig: signatures for the simple pages

It also needs to provide two specific elements:

-   last-modified
-   local-stats

Last modified date

CPAN uses a freshness date system where the mirror's last
synchronisation date is made available.

For PyPI, each mirror needs to maintain a URL with simple text content
that represents the last synchronisation date the mirror maintains.

The date is provided in GMT, using the ISO 8601 format[1]. Each mirror
is responsible for maintaining its last modified date.

This page must be located at /last-modified and must be a text/plain
page.
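
A minimal sketch of producing and reading such a page follows. The PEP
only requires ISO 8601 in GMT; the exact layout used here
(YYYY-MM-DDTHH:MM:SSZ) and the function names are assumptions made for
illustration.

```python
from datetime import datetime, timezone

# Assumed layout; the PEP itself only says "ISO 8601 in GMT".
_LAYOUT = "%Y-%m-%dT%H:%M:%SZ"


def render_last_modified(moment):
    """Format a timezone-aware datetime as the body of /last-modified."""
    return moment.astimezone(timezone.utc).strftime(_LAYOUT)


def parse_last_modified(text):
    """Parse the body of /last-modified back into an aware UTC datetime."""
    return datetime.strptime(text.strip(), _LAYOUT).replace(tzinfo=timezone.utc)
```

A client could compare the parsed value with the current time to decide
whether a mirror is too stale to use.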

Local statistics

Each mirror is responsible for counting all the downloads that were made
through it. PyPI uses these counts to sum up all downloads and display
the grand total.

These statistics are in CSV-like form, with a header in the first line.
The file needs to obey PEP 305. Basically, it should be readable by
Python's csv module.

The fields in this file are:

-   package: the distutils id of the package.
-   filename: the filename that has been downloaded.
-   useragent: the User-Agent of the client that has downloaded the
    package.
-   count: the number of downloads.

The content will look like this:

    # package,filename,useragent,count
    zc.buildout,zc.buildout-1.6.0.tgz,MyAgent,142
    ...

The counting starts the day the mirror is launched, and there is one
file per day, compressed using the bzip2 format. Each file is named
after the day it covers. For example, 2008-11-06.bz2 is the file for the
6th of November 2008.

They are then provided in a folder called days. For example:

-   /local-stats/days/2008-11-06.bz2
-   /local-stats/days/2008-11-07.bz2
-   /local-stats/days/2008-11-08.bz2

This page must be located at /local-stats.
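
The file format can be sketched with the csv and bz2 modules. The
function names below are hypothetical, and the reader tolerates the
leading "# " on the header line because the sample above shows one;
whether that prefix is required is not stated in this PEP.

```python
import bz2
import csv
import io
from collections import Counter
from datetime import date

HEADER = ["# package", "filename", "useragent", "count"]  # as in the sample above


def daily_stats_path(day):
    """URL path of the statistics file for one day."""
    return f"/local-stats/days/{day.isoformat()}.bz2"


def dump_daily_stats(rows):
    """Serialise (package, filename, useragent, count) rows to bzip2 bytes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return bz2.compress(buffer.getvalue().encode("utf-8"))


def load_daily_stats(blob):
    """Read a daily file back into a list of dicts keyed by field name."""
    reader = csv.reader(io.StringIO(bz2.decompress(blob).decode("utf-8")))
    header = next(reader)
    header[0] = header[0].lstrip("# ")  # drop the comment-style prefix
    return [dict(zip(header, row)) for row in reader if row]


def downloads_per_package(records):
    """Sum the count column per package, as PyPI would across mirrors."""
    totals = Counter()
    for record in records:
        totals[record["package"]] += int(record["count"])
    return totals
```

Summing the records from every mirror's file for a given day with
downloads_per_package gives the per-package totals for that day.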

How a mirror should synchronize with PyPI

A mirroring protocol called Simple Index was described and implemented
by Martin v. Loewis and Jim Fulton, based on how easy_install works.
This section summarizes it and gives a few relevant links, plus a short
note about the User-Agent header.

The mirroring protocol

Mirrors must reduce the amount of data transferred between the central
server and the mirror. To achieve that, they MUST use the changelog()
PyPI XML-RPC call, and only refetch the packages that have been changed
since the last time. For each package P, they MUST copy documents
/simple/P/ and /serversig/P. If a package is deleted on the central
server, they MUST delete the package and all associated files. To detect
modification of package files, they MAY cache the file's ETag, and MAY
request skipping it using the If-None-Match header.

Each mirroring tool MUST identify itself using a descriptive User-Agent
header.

The pep381client package[2] provides an application that respects this
protocol to browse PyPI.
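
The core of one synchronisation pass might look like the sketch below.
It is not the pep381client implementation. The names are hypothetical,
the User-Agent string is a placeholder, the XML-RPC endpoint is passed
in by the caller, and changelog() is assumed to return
(name, version, timestamp, action) tuples.

```python
import urllib.request
import xmlrpc.client

USER_AGENT = "example-mirror/0.1 (admin@example.org)"  # placeholder identifier


def fetch_changelog(endpoint, since):
    """Ask the central server which packages changed since a UNIX timestamp."""
    return xmlrpc.client.ServerProxy(endpoint).changelog(since)


def changed_packages(changelog_entries):
    """Names of packages to refetch, one entry per package."""
    # Assumption: entries are (name, version, timestamp, action) tuples.
    return sorted({entry[0] for entry in changelog_entries})


def documents_for(package):
    """The two documents a mirror MUST copy for each changed package."""
    return [f"/simple/{package}/", f"/serversig/{package}"]


def conditional_request(url, etag=None):
    """Build a request that identifies the tool and can skip unchanged files."""
    headers = {"User-Agent": USER_AGENT}
    if etag:
        headers["If-None-Match"] = etag  # cached ETag from an earlier download
    return urllib.request.Request(url, headers=headers)
```

When a cached ETag still matches, the server is expected to answer with
HTTP status 304, which urllib.request.urlopen reports as an HTTPError;
the mirror can then keep its existing copy of the file.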

User-agent request header

In order to be able to differentiate actions taken by clients over PyPI,
a specific user agent name should be provided by all mirroring software.

This is also true for all clients like:

-   zc.buildout[3].
-   setuptools[4].
-   pip[5].

XXX user agent registering mechanism at PyPI ?

How a client can use PyPI and its mirrors

Clients that are browsing PyPI should be able to use alternative
mirrors, by getting the list of the mirrors using last.pypi.python.org.

Code example:

    >>> import socket
    >>> socket.gethostbyname_ex('last.pypi.python.org')[0]
    'h.pypi.python.org'

The clients so far that could use this mechanism:

-   setuptools
-   zc.buildout (through setuptools)
-   pip

Fail-over mechanism

Clients that are browsing PyPI should be able to use a fail-over
mechanism when PyPI or the used mirror is not responding.

It is up to the client to decide which mirror should be used, maybe by
looking at its geographical location and its responsiveness.

This PEP does not describe how this fail-over mechanism should work, but
it is strongly encouraged that the clients try to use the nearest
mirror.

The clients so far that could use this mechanism:

-   setuptools
-   zc.buildout (through setuptools)
-   pip
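
Since the mechanism itself is left to the client, the following is only
one possible shape for it: try an ordered list of hosts and return the
first answer. The function name and the fetch callable are hypothetical;
ordering the hosts by proximity or responsiveness is left to the caller.

```python
def fetch_with_failover(path, hosts, fetch):
    """Try each host in order; return (host, payload) from the first that answers.

    fetch is any callable taking a URL and returning the response body. It is
    expected to raise OSError (which covers urllib.error.URLError and socket
    timeouts) when a host does not respond.
    """
    errors = {}
    for host in hosts:
        try:
            return host, fetch(f"http://{host}{path}")
        except OSError as exc:
            errors[host] = exc  # remember the failure and move to the next host
    raise ConnectionError(f"no host could serve {path}: {errors}")
```

A real client could pass something like
lambda url: urllib.request.urlopen(url, timeout=5).read() as fetch.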

Extra package indexes

Some packages will never be uploaded to PyPI, either because they are
private or because the project maintainer runs their own server from
which people can get the project's packages. However, it is strongly
encouraged that a public package index follows PyPI and Distutils
protocols.

In other words, the register and upload commands should be compatible
with any package index server out there.

Software that is compatible with PyPI and Distutils so far:

-   PloneSoftwareCenter[6], which is used to run the plone.org products
    section.
-   EggBasket[7].

An extra package index is not a mirror of PyPI, but can have some
mirrors itself.

Merging several indexes

When a client needs to get some packages from several distinct indexes,
it should be able to use each one of them as a potential source of
packages. Different indexes should be defined as a sorted list for the
client to look for a package.

Each independent index can of course provide a list of its mirrors.

XXX define how to get the hostname for the mirrors of an arbitrary
index.

That permits all combinations at client level, for a reliable packaging
system with all levels of privacy.

It is up to the client to deal with the merging.
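
One simple reading of "a sorted list" is that the first index offering
a package wins. The sketch below illustrates that policy only; the
function name and the has_package callable are hypothetical, and a
client is free to merge results differently.

```python
def first_index_with(package, indexes, has_package):
    """Return the first index, in priority order, that offers the package.

    indexes is an ordered list of index identifiers (for example base URLs).
    has_package(index, package) is a caller-supplied lookup. None is returned
    when no index offers the package.
    """
    for index in indexes:
        if has_package(index, package):
            return index
    return None
```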

References

Acknowledgments

Georg Brandl.

Copyright

This document has been placed in the public domain.

[1] http://en.wikipedia.org/wiki/ISO_8601

[2] http://pypi.python.org/pypi/pep381client

[3] http://pypi.python.org/pypi/zc.buildout

[4] http://pypi.python.org/pypi/setuptools

[5] http://pypi.python.org/pypi/pip

[6] http://plone.org/products/plonesoftwarecenter

[7] http://www.chrisarndt.de/projects/eggbasket