PEP: 710 Title: Recording the provenance of installed packages Author:
Fridolín Pokorný <fridolin.pokorny at gmail.com> Sponsor: Donald Stufft
<donald@stufft.io> PEP-Delegate: Paul Moore <p.f.moore@gmail.com>
Discussions-To:
https://discuss.python.org/t/pep-710-recording-the-provenance-of-installed-packages/25428
Status: Draft Type: Standards Track Topic: Packaging Content-Type:
text/x-rst Created: 27-Mar-2023 Post-History: 03-Dec-2021, 30-Jan-2023,
14-Mar-2023, 03-Apr-2023,

Abstract

This PEP describes a way to record the provenance of installed Python
distributions. The record is created by an installer and is available to
users in the form of a JSON file provenance_url.json in the .dist-info
directory. The mentioned JSON file captures additional metadata to allow
recording a URL to a distribution package together with the installed
distribution hash. This proposal is built on top of PEP 610 following
its corresponding
canonical PyPA spec <packaging:direct-url> and complements
direct_url.json with provenance_url.json for when packages are
identified by a name, and optionally a version.

Motivation

Installing a Python Project involves downloading a Distribution Package
from a Package Index and extracting its content to an appropriate place.
After the installation process is done, information about the release
artifact used as well as its source is generally lost. However, there
are use cases for keeping records of distributions used for installing
packages and their provenance.

Python wheels can be built with different compiler flags or supporting
different wheel tags. In both cases, users might get into a situation in
which multiple wheels might be considered by installers (possibly from
different package indexes) and immediately finding out which wheel file
was actually used during the installation might be helpful. This way,
developers can use information about wheels to debug issues making sure
the desired wheel was actually installed. Another use case could be
tools reporting software installed, such as tools reporting a SBOM
(Software Bill of Materials), that might give more accurate reports. Yet
another use case could be reconstruction of the Python environment by
pinning each installed package to a specific distribution artifact
consumed from a Python package index.

Rationale

The motivation described in this PEP is an extension of Recording the
Direct URL Origin of installed distributions <packaging:direct-url>
specification. In addition to recording provenance information for
packages installed using a direct URL, installers should also do so for
packages installed by name (and optionally version) from Python package
indexes.

The idea described in this PEP originated in a tool called micropipenv
that is used to install distribution packages <Distribution Package> in
containerized environments (see the reported issue
thoth-station/micropipenv#206). Currently, the assembled containerized
application does not implicitly carry information about the provenance
of installed distribution packages (unless these are installed from full
URLs and recorded via direct_url.json). This requires container image
suppliers to link container images with the corresponding build process,
its configuration and the application source code for checking
requirements files in cases when software present in containerized
environments needs to be audited.

The subsequent discussion in the Discourse thread also brought up pip's
new --report option that can generate a detailed JSON report about the
installation process. This option could help with the provenance problem
this PEP approaches. Nevertheless, this option needs to be explicitly
passed to pip to obtain the provenance information, and includes
additional metadata that might not be necessary for checking the
provenance (such as Python version requirements of each distribution
package). Also, this option is specific to pip as of the writing of this
PEP.

Note the current spec for recording installed packages
<packaging:recording-installed-packages> defines a RECORD file that
records installed files, but not the distribution artifact from which
these files were obtained. Auditing installed artifacts can be performed
based on matching the entries listed in the RECORD file. However, this
technique requires a pre-computed database of files each artifact
provides or a comparison with the actual artifact content. Both
approaches are relatively expensive and time consuming operations which
could be eliminated with the proposed provenance_url.json file.

Recording provenance information for installed distribution packages,
both those obtained from direct URLs and by name/version from an index,
can simplify auditing Python environments in general, beyond just the
specific use case for containerized applications mentioned earlier. A
community project pip-audit raised their possible interest in
pypa/pip-audit#170.

Specification

The keywords “MUST”, “MUST NOT”, “REQUIRED”, “SHOULD”, “SHOULD NOT”,
“RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be
interpreted as described in 2119.

The provenance_url.json file SHOULD be created in the .dist-info
directory by installers when installing a Distribution Package specified
by name (and optionally by Version Specifier).

This file MUST NOT be created when installing a distribution package
from a requirement specifying a direct URL reference (including a VCS
URL).

Only one of the files provenance_url.json and direct_url.json (from
Recording the Direct URL Origin of installed distributions
<packaging:direct-url> specification and the corresponding specification
of the Direct URL Data Structure <packaging:direct-url-data-structure>),
may be present in a given .dist-info directory; installers MUST NOT add
both.

The provenance_url.json JSON file MUST be a dictionary, compliant with
8259 and UTF-8 encoded.

If present, it MUST contain exactly two keys. The first MUST be url,
with type string. The second key MUST be archive_info with a value
defined below.

The value of the url key MUST be the URL from which the distribution
package was downloaded. If a wheel is built from a source distribution,
the url value MUST be the URL from which the source distribution was
downloaded. If a wheel is downloaded and installed directly, the url
field MUST be the URL from which the wheel was downloaded. As in the
Direct URL
Data Structure <packaging:direct-url-data-structure> specification, the
url value MUST be stripped of any sensitive authentication information
for security reasons.

The user:password section of the URL MAY however be composed of
environment variables, matching the following regular expression:

    \$\{[A-Za-z0-9-_]+\}(:\$\{[A-Za-z0-9-_]+\})?

Additionally, the user:password section of the URL MAY be a well-known,
non-security sensitive string. A typical example is git in the case of
an URL such as ssh://git@gitlab.com.

The value of archive_info MUST be a dictionary with a single key hashes.
The value of hashes is a dictionary mapping hash function names to a
hex-encoded digest of the file referenced by the url value. At least one
hash MUST be recorded. Multiple hashes MAY be included, and it is up to
the consumer to decide what to do with multiple hashes (it may validate
all of them or a subset of them, or nothing at all).

Each hash MUST be one of the single argument hashes provided by
py3.11:hashlib.algorithms_guaranteed, excluding sha1 and md5 which MUST
NOT be used. As of Python 3.11, with shake_128 and shake_256 excluded
for being multi-argument, the allowed set of hashes is:

    >>> import hashlib
    >>> sorted(hashlib.algorithms_guaranteed - {"shake_128", "shake_256", "sha1", "md5"})
    ['blake2b', 'blake2s', 'sha224', 'sha256', 'sha384', 'sha3_224', 'sha3_256', 'sha3_384', 'sha3_512', 'sha512']

Each hash MUST be referenced by the canonical name of the hash, always
lower case.

Hashes sha1 and md5 MUST NOT be present, due to the security limitations
of these hash algorithms. Conversely, hash sha256 SHOULD be included.

Installers that cache distribution packages from an index SHOULD keep
information related to the cached distribution artifact, so that the
provenance_url.json file can be created even when installing
distribution packages from the installer's cache.

Backwards Compatibility

Following the packaging:recording-installed-packages specification,
installers may keep additional installer-specific files in the
.dist-info directory. To make sure this PEP does not cause any backwards
compatibility issues, a comprehensive survey of installers and libraries
found no current tools that are using a similarly-named file, or other
major feasibility concerns.

The Wheel specification <packaging:binary-distribution-format> lists
files that can be present in the .dist-info directory. None of these
file names collide with the proposed provenance_url.json file from this
PEP.

Presence of provenance_url.json in installers and libraries

A comprehensive survey of the existing installers, libraries, and
dependency managers in the Python ecosystem analyzed the implications of
adding support for provenance_url.json to each tool. In summary, no
major backwards compatibility issues, conflicts or feasibility blockers
were found as of the time of writing of this PEP. More details about the
survey can be found in the Appendix: Survey of installers and libraries
section.

Compatibility with direct_url.json

This proposal does not make any changes to the direct_url.json file
described in PEP 610 and its corresponding canonical PyPA spec
<packaging:direct-url>.

The content of provenance_url.json file was designed in a way to
eventually allow installers reuse some of the logic supporting
direct_url.json when a direct URL refers to a source archive or a wheel.

The main difference between the provenance_url.json and direct_url.json
files are the mandatory keys and their values in the provenance_url.json
file. This helps make sure consumers of the provenance_url.json file can
rely on its content, if the file is present in the .dist-info directory.

Security Implications

One of the main security features of the provenance_url.json file is the
ability to audit installed artifacts in Python environments. Tools can
check which Python package indexes were used to install Python
distribution
packages <Distribution Package> as well as the hash digests of their
release artifacts.

As an example, we can take the recent compromised dependency chain in
the PyTorch incident. The PyTorch index provided a package named
torchtriton. An attacker published torchtriton on PyPI, which ran a
malicious binary. By checking the URL of the installed Python
distribution stated in the provenance_url.json file, tools can
automatically check the source of the installed Python distribution. In
case of the PyTorch incident, the URL of torchtriton should point to the
PyTorch index, not PyPI. Tools can help identifying such malicious
Python distributions installed by checking the installed Python
distribution URL. A more exact check can include also the hash of the
installed Python distribution stated in the provenance_url.json file.
Such checks on hashes can be helpful for mirrored Python package indexes
where Python distributions are not distinguishable by their source URLs,
making sure only desired Python package distributions are installed.

A malicious actor can intentionally adjust the content of
provenance_url.json to possibly hide provenance information of the
installed Python distribution. A security check which would uncover such
malicious activity is beyond scope of this PEP as it would require
monitoring actions on the filesystem and eventually reviewing user or
file permissions.

How to Teach This

The provenance_url.json metadata file is intended for tools and is not
directly visible to end users.

Examples

Examples of a valid provenance_url.json

A valid provenance_url.json list multiple hashes:

    {
      "archive_info": {
        "hashes": {
          "blake2s": "fffeaf3d0bd71dc960ca2113af890a2f2198f2466f8cd58ce4b77c1fc54601ff",
          "sha256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f",
          "sha3_256": "c856930e0f707266d30e5b48c667a843d45e79bb30473c464e92dfa158285eab",
          "sha512": "6bad5536c30a0b2d5905318a1592948929fbac9baf3bcf2e7faeaf90f445f82bc2b656d0a89070d8a6a9395761f4793c83187bd640c64b2656a112b5be41f73d"
        }
      },
      "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
    }

A valid provenance_url.json listing a single hash entry:

    {
      "archive_info": {
        "hashes": {
          "sha256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f"
        }
      },
      "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
    }

A valid provenance_url.json listing a source distribution which was used
to build and install a wheel:

    {
      "archive_info": {
        "hashes": {
          "sha256": "8bfe29f17c10e2f2e619de8033a07a224058d96b3bfe2ed61777596f7ffd7fa9"
        }
      },
      "url": "https://files.pythonhosted.org/packages/1d/43/ad8ae671de795ec2eafd86515ef9842ab68455009d864c058d0c3dcf680d/micropipenv-0.0.1.tar.gz"
    }

Examples of an invalid provenance_url.json

The following example includes a hash key in the archive_info dictionary
as originally designed in the data structure documented in
packaging:direct-url. The hash key MUST NOT be present to prevent from
any possible confusion with hashes and additional checks that would be
required to keep hash values in sync.

    {
      "archive_info": {
        "hash": "sha256=236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f",
        "hashes": {
          "sha256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f"
        }
      },
      "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
    }

Another example demonstrates an invalid hash name. The referenced hash
name does not correspond to the canonical hash names described in this
PEP and in the Python docs under py3.11:hashlib.hash.name.

    {
      "archive_info": {
        "hashes": {
          "SHA-256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f"
        }
      },
      "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
    }

The last example demonstrates a provenance_url.json file with no hashes
available for the downloaded artifact:

    {
      "archive_info": {
        "hashes": {}
       }
      "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
    }

Example pip commands and their effect on provenance_url.json and direct_url.json

These commands generate a direct_url.json file but do not generate a
provenance_url.json file. These examples follow examples from Direct
URL Data Structure <packaging:direct-url-data-structure> specification:

-   pip install https://example.com/app-1.0.tgz
-   pip install https://example.com/app-1.0.whl
-   pip install "git+https://example.com/repo/app.git#egg=app&subdirectory=setup"
-   pip install ./app
-   pip install file:///home/user/app
-   pip install --editable "git+https://example.com/repo/app.git#egg=app&subdirectory=setup"
    (in which case, url will be the local directory where the git
    repository has been cloned to, and dir_info will be present with
    "editable": true and no vcs_info will be set)
-   pip install -e ./app

Commands that generate a provenance_url.json file but do not generate a
direct_url.json file:

-   pip install app
-   pip install app~=2.2.0
-   pip install app --no-index --find-links "https://example.com/"

This behaviour can be tested using changes to pip implemented in the PR
pypa/pip#11865.

Reference Implementation

A proof-of-concept for creating the provenance_url.json metadata file
when installing a Python Distribution Package is available in the PR to
pip pypa/pip#11865. It reuses the already available implementation for
the direct URL data structure <packaging:direct-url-data-structure> to
provide the provenance_url.json metadata file for cases when
direct_url.json is not created.

A reference implementation for supporting the provenance_url.json file
in PDM exists is available in pdm-project/pdm#3013.

A prototype called pip-preserve was developed to demonstrate creation of
requirements.txt files considering direct_url.json and
provenance_url.json metadata files. This tool mimics the pip freeze
functionality, but the listing of installed packages also includes the
hashes of the Python distribution artifacts.

To further support this proposal, pip-sbom demonstrates creation of SBOM
in the SPDX format. The tool uses information stored in the
provenance_url.json file.

Rejected Ideas

Naming the file direct_url.json instead of provenance_url.json

To preserve backwards compatibility with the
Recording the Direct URL Origin of installed distributions <packaging:direct-url>,
the file cannot be named direct_url.json, as per the text of that
specification:

  This file MUST NOT be created when installing a distribution from an
  other type of requirement (i.e. name plus version specifier).

Such a change might introduce backwards compatibility issues for
consumers of direct_url.json who rely on its presence only when
distributions are installed using a direct URL reference.

Deprecating direct_url.json and using only provenance_url.json

File direct_url.json is already well established by the Direct URL
Data Structure <packaging:direct-url-data-structure> specification and
is already used by installers. For example, pip uses direct_url.json to
report a direct URL reference on pip freeze. Deprecating direct_url.json
would require additional changes to the pip freeze implementation in pip
(see PR fridex/pip#2) and could introduce backwards compatibility issues
for already existing direct_url.json consumers.

Keeping the hash key in the archive_info dictionary

Direct URL Data Structure <packaging:direct-url-data-structure>
specification discusses the possibility to include the hash key
alongside the hashes key in the archive_info dictionary. This PEP
explicitly does not include the hash key in the provenance_url.json file
and allows only the hashes key to be present. By doing so we eliminate
possible redundancy in the file, possible confusion, and any additional
checks that would need to be done to make sure the hashes are in sync.

Allowing no hashes stated

For cases when a wheel file is installed from pip's cache and built
using an older version of pip, pip does not record hashes of the
downloaded source distributions. As we do not have hashes of these
downloaded source distributions, the hashes key in the
provenance_url.json file would not contain any entries. In such cases,
pip does not create any provenance_url.json file as the provenance
information is not complete. It is encouraged for consumers to rebuild
wheels with a newer version of pip in these cases.

Making the hashes key optional

PEP 610 and its corresponding canonical PyPA spec <packaging:direct-url>
recommend including the hashes key of the archive_info in the
direct_url.json file but it is not required (per the 2119 language):

  A hashes key SHOULD be present as a dictionary mapping a hash name to
  a hex encoded digest of the file.

This PEP requires the hashes key be included in archive_info in the
provenance_url.json file if that file is created; per this PEP:

  The value of archive_info MUST be a dictionary with a single key
  hashes.

By doing so, consumers of provenance_url.json can check artifact digests
when the provenance_url.json file is created by installers.

Storing index URL

A possibility was raised for storing the index URL as part of the file
content. This index URL would represent the index configured in pip's
configuration or specified using the --index-url or --extra-index-url
options. Storing this information was considered confusing, especially
when using other installation options like --find-links. Since the
actual index URL is not strictly bound to the location from which the
wheel file was downloaded, we decided not to store the index URL in the
provenance_url.json file.

Open Issues

Availability of the provenance_url.json file in Conda

We would like to get feedback on the provenance_url.json file from the
Conda maintainers. It is not clear whether Conda would like to adopt the
provenance_url.json file. Conda already stores provenance related
information (similar to the provenance information proposed in this PEP)
in JSON files located in the conda-meta directory following its actions
during installation.

Using provenance_url.json in downstream installers

The proposed provenance_url.json file was meant to be adopted primarily
by Python installers. Other installers, such as APT or DNF, might record
the provenance of the installed downstream Python distributions in their
own way specific to downstream package management. The proposed file is
not expected to be created by these downstream package installers and
thus they were intentionally left out of this PEP. However, any input by
developers or maintainers of these installers is valuable to possibly
enrich the provenance_url.json file with information that would help in
some way.

Appendix: Survey of installers and libraries

pip

The function from pip's internal API responsible for installing wheels,
named _install_wheel, does not store any provenance_url.json file in the
.dist-info directory. Additionally, a prototype introducing the
mentioned file to pip in pypa/pip#11865 demonstrates incorporating logic
for handling the provenance_url.json file in pip's source code.

As pip is used by some of the tools mentioned below to install Python
package distributions, findings for pip apply to these tools, as well as
pip does not allow parametrizing creation of files in the .dist-info
directory in its internal API. Most of the tools mentioned below that
use pip invoke pip as a subprocess which has no effect on the eventual
presence of the provenance_url.json file in the .dist-info directory.

distlib

distlib implements low-level functionality to manipulate the dist-info
directory. The database of installed distributions does not use any file
named provenance_url.json, based on the distlib's source code.

Pipenv

Pipenv uses pip to install Python package distributions. There wasn't
any additional identified logic that would cause backwards compatibility
issues when introducing the provenance_url.json file in the .dist-info
directory.

installer

installer does not create a provenance_url.json file explicitly.
Nevertheless, as per the
Recording Installed Projects <packaging:recording-installed-packages>
specification, installer allows passing the additional_metadata argument
to create a file in the .dist-info directory - see the source code. To
avoid any backwards compatibility issues, any library or tool using
installer must not request creating the provenance_url.json file using
the mentioned additional_metadata argument.

Poetry

The installation logic in Poetry depends on the
installer.modern-installer configuration option (see docs).

For cases when the installer.modern-installer configuration option is
set to false, Poetry uses pip for installing Python package
distributions.

On the other hand, when installer.modern-installer configuration option
is set to true, Poetry uses installer to install Python package
distributions. As can be seen from the linked sources, there isn't
passed any additional metadata file named provenance_url.json that would
cause compatibility issues with this PEP.

Conda

Conda does not create any provenance_url.json file when Python package
distributions are installed.

Hatch

Hatch uses pip to install project dependencies.

micropipenv

As micropipenv is a wrapper on top of pip, it uses pip to install Python
distributions, for both lock files as well as for requirements files.

Thamos

Thamos uses micropipenv to install Python package distributions, hence
any findings for micropipenv apply for Thamos.

PDM

PDM uses installer to install binary distributions. The only additional
metadata file it eventually creates in the .dist-info directory is the
REFER_TO file.

uv

uv is written in Rust and uses its own installation logic when
installing wheels. It does not create any additional files in the
.dist-info directory that would collide with the provenance_url.json
file naming.

References

Acknowledgements

Thanks to Dustin Ingram, Brett Cannon, and Paul Moore for the initial
discussion in which this idea originated.

Thanks to Donald Stufft, Ofek Lev, and Trishank Kuppusamy for early
feedback and support to work on this PEP.

Thanks to Gregory P. Smith, Stéphane Bidoul, and C.A.M. Gerlach for
reviewing this PEP and providing valuable suggestions.

Thanks to Seth Michael Larson for providing valuable suggestions and for
the proposed pip-sbom prototype.

Thanks to Stéphane Bidoul and Chris Jerdonek for PEP 610, and related
Recording the Direct URL Origin of installed distributions
<packaging:direct-url> and Direct URL Data Structure
<packaging:direct-url-data-structure> specifications.

Thanks to Frost Ming for raising possible concern around storing index
URL in the provenance_url.json file and initial PEP 710 support in PDM.

Last, but not least, thanks to Donald Stufft for sponsoring this PEP.

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.