PEP 710 – Recording the provenance of installed packages
Author: Fridolín Pokorný <fridolin.pokorny at gmail.com>
Sponsor: Donald Stufft <donald at stufft.io>
PEP-Delegate: Paul Moore <p.f.moore at gmail.com>
Discussions-To: Discourse thread
Status: Draft
Type: Standards Track
Topic: Packaging
Created: 27-Mar-2023
Post-History: 03-Dec-2021, 30-Jan-2023, 14-Mar-2023, 03-Apr-2023
Abstract
This PEP describes a way to record the provenance of installed Python distributions. The record is created by an installer and is available to users in the form of a JSON file provenance_url.json in the .dist-info directory. The mentioned JSON file captures additional metadata to allow recording a URL to a distribution package together with the installed distribution hash. This proposal is built on top of PEP 610 following its corresponding canonical PyPA spec and complements direct_url.json with provenance_url.json for when packages are identified by a name, and optionally a version.
Motivation
Installing a Python Project involves downloading a Distribution Package from a Package Index and extracting its content to an appropriate place. After the installation process is done, information about the release artifact used as well as its source is generally lost. However, there are use cases for keeping records of distributions used for installing packages and their provenance.
Python wheels can be built with different compiler flags or can support different wheel tags. In both cases, installers might consider multiple wheels (possibly from different package indexes), and it can be helpful to find out immediately which wheel file was actually used during installation. Developers can then use this information to debug issues and to make sure the desired wheel was actually installed. Another use case is tools that report installed software, such as tools producing an SBOM (Software Bill of Materials), which could give more accurate reports. Yet another use case is reconstructing a Python environment by pinning each installed package to the specific distribution artifact consumed from a Python package index.
Rationale
The motivation described in this PEP is an extension of the Recording the Direct URL Origin of installed distributions specification. In addition to recording provenance information for packages installed using a direct URL, installers should also do so for packages installed by name (and optionally version) from Python package indexes.
The idea described in this PEP originated in a tool called micropipenv that is used to install distribution packages in containerized environments (see the reported issue thoth-station/micropipenv#206). Currently, the assembled containerized application does not implicitly carry information about the provenance of installed distribution packages (unless these are installed from full URLs and recorded via direct_url.json). This requires container image suppliers to link container images with the corresponding build process, its configuration and the application source code for checking requirements files in cases when software present in containerized environments needs to be audited.
The subsequent discussion in the Discourse thread also brought up pip’s new --report option that can generate a detailed JSON report about the installation process. This option could help with the provenance problem this PEP approaches. Nevertheless, this option needs to be explicitly passed to pip to obtain the provenance information, and includes additional metadata that might not be necessary for checking the provenance (such as Python version requirements of each distribution package). Also, this option is specific to pip as of the writing of this PEP.
Note that the current spec for recording installed packages defines a RECORD file that records installed files, but not the distribution artifact from which these files were obtained. Auditing installed artifacts can be performed based on matching the entries listed in the RECORD file. However, this technique requires a pre-computed database of files each artifact provides or a comparison with the actual artifact content. Both approaches are relatively expensive and time-consuming operations that could be eliminated with the proposed provenance_url.json file.
Recording provenance information for installed distribution packages, both those obtained from direct URLs and by name/version from an index, can simplify auditing Python environments in general, beyond just the specific use case for containerized applications mentioned earlier. The community project pip-audit expressed possible interest in pypa/pip-audit#170.
Specification
The keywords “MUST”, “MUST NOT”, “REQUIRED”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.
The provenance_url.json file SHOULD be created in the .dist-info directory by installers when installing a Distribution Package specified by name (and optionally by Version Specifier).
This file MUST NOT be created when installing a distribution package from a requirement specifying a direct URL reference (including a VCS URL).
Only one of the files provenance_url.json and direct_url.json (from the Recording the Direct URL Origin of installed distributions specification and the corresponding specification of the Direct URL Data Structure) may be present in a given .dist-info directory; installers MUST NOT add both.
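The exclusivity rule above can be checked by a consumer in a few lines of Python. This is a minimal sketch and not part of the PEP; the helper name origin_files and its error handling are assumptions made for illustration.

```python
from pathlib import Path

# File names defined by this PEP and by the Direct URL specification.
PROVENANCE_FILE = "provenance_url.json"
DIRECT_URL_FILE = "direct_url.json"


def origin_files(dist_info: Path) -> list[str]:
    """Return which origin files exist in a .dist-info directory.

    Hypothetical helper: raises ValueError when both files are present,
    because installers MUST NOT add both.
    """
    present = [
        name
        for name in (PROVENANCE_FILE, DIRECT_URL_FILE)
        if (dist_info / name).is_file()
    ]
    if len(present) == 2:
        raise ValueError(f"{dist_info} contains both {present[0]} and {present[1]}")
    return present
```

An empty list means the installer recorded no origin information for that distribution.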
The provenance_url.json JSON file MUST be a dictionary, compliant with RFC 8259 and UTF-8 encoded.
If present, it MUST contain exactly two keys. The first MUST be url, with type string. The second key MUST be archive_info with a value defined below.
The value of the url key MUST be the URL from which the distribution package was downloaded. If a wheel is built from a source distribution, the url value MUST be the URL from which the source distribution was downloaded. If a wheel is downloaded and installed directly, the url field MUST be the URL from which the wheel was downloaded. As in the Direct URL Data Structure specification, the url value MUST be stripped of any sensitive authentication information for security reasons.
The user:password section of the URL MAY however be composed of environment variables, matching the following regular expression:
\$\{[A-Za-z0-9-_]+\}(:\$\{[A-Za-z0-9-_]+\})?
Additionally, the user:password section of the URL MAY be a well-known, non-security sensitive string. A typical example is git in the case of a URL such as ssh://git@gitlab.com.
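One way an installer might apply these URL rules is sketched below. It is only an illustration under stated assumptions: the function name redact_url is hypothetical, and treating git as the only well-known user name is a simplification, since the PEP does not enumerate such strings.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Regular expression quoted in the specification for environment-variable placeholders.
ENV_VAR_AUTH = re.compile(r"\$\{[A-Za-z0-9-_]+\}(:\$\{[A-Za-z0-9-_]+\})?")

# Assumption: "git" is the only well-known, non-sensitive user name accepted here.
WELL_KNOWN_AUTH = {"git"}


def redact_url(url: str) -> str:
    """Drop the user:password part of a URL unless it is explicitly allowed."""
    parts = urlsplit(url)
    auth, sep, host = parts.netloc.rpartition("@")
    if not sep:
        return url  # no credentials present
    if ENV_VAR_AUTH.fullmatch(auth) or auth in WELL_KNOWN_AUTH:
        return url  # placeholders and well-known names may stay
    return urlunsplit(parts._replace(netloc=host))
```

For example, redact_url("https://user:secret@example.com/app-1.0.whl") yields "https://example.com/app-1.0.whl", while a URL using ${USER}:${PASS} placeholders is returned unchanged.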
The value of archive_info MUST be a dictionary with a single key hashes. The value of hashes is a dictionary mapping hash function names to a hex-encoded digest of the file referenced by the url value. At least one hash MUST be recorded. Multiple hashes MAY be included, and it is up to the consumer to decide what to do with multiple hashes (it may validate all of them or a subset of them, or nothing at all).
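As an illustration of the structure described above, an installer could assemble the file content roughly as follows. This is a sketch, not the reference implementation; build_provenance and its parameters are hypothetical names, and the URL is assumed to have been stripped of credentials already.

```python
import hashlib
import json


def build_provenance(url: str, artifact: bytes, algorithms=("sha256",)) -> str:
    """Serialize a provenance_url.json document for a downloaded artifact.

    Hypothetical helper: ``artifact`` is the raw content of the file that
    was downloaded from ``url``.
    """
    # Map each hash function name to the hex-encoded digest of the artifact.
    hashes = {name: hashlib.new(name, artifact).hexdigest() for name in algorithms}
    return json.dumps({"archive_info": {"hashes": hashes}, "url": url}, sort_keys=True)
```

The returned string would then be written, UTF-8 encoded, to provenance_url.json in the .dist-info directory.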
Each hash MUST be one of the single argument hashes provided by hashlib.algorithms_guaranteed, excluding sha1 and md5 which MUST NOT be used. As of Python 3.11, with shake_128 and shake_256 excluded for being multi-argument, the allowed set of hashes is:
>>> import hashlib
>>> sorted(hashlib.algorithms_guaranteed - {"shake_128", "shake_256", "sha1", "md5"})
['blake2b', 'blake2s', 'sha224', 'sha256', 'sha384', 'sha3_224', 'sha3_256', 'sha3_384', 'sha3_512', 'sha512']
Each hash MUST be referenced by the canonical name of the hash, always lower case.
Hashes sha1 and md5 MUST NOT be present, due to the security limitations of these hash algorithms. Conversely, hash sha256 SHOULD be included.
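The structural rules of this section can be collected into a small consumer-side check. The sketch below is an assumption about how a tool might do this, not a normative validator; the name validation_errors is hypothetical, and the test that digests contain only hexadecimal characters is one interpretation of "hex-encoded".

```python
import hashlib
import re

# Allowed hash names: guaranteed single-argument algorithms minus sha1 and md5.
ALLOWED_HASHES = hashlib.algorithms_guaranteed - {"shake_128", "shake_256", "sha1", "md5"}


def validation_errors(data: object) -> list[str]:
    """Return a list of problems found in a parsed provenance_url.json."""
    if not isinstance(data, dict) or set(data) != {"url", "archive_info"}:
        return ["top level must be a dictionary with exactly 'url' and 'archive_info'"]
    errors = []
    if not isinstance(data["url"], str):
        errors.append("'url' must be a string")
    info = data["archive_info"]
    if not isinstance(info, dict) or set(info) != {"hashes"}:
        return errors + ["'archive_info' must be a dictionary with the single key 'hashes'"]
    hashes = info["hashes"]
    if not isinstance(hashes, dict) or not hashes:
        return errors + ["'hashes' must be a non-empty dictionary"]
    for name, digest in hashes.items():
        if name not in ALLOWED_HASHES:
            errors.append(f"hash name {name!r} is not allowed")
        elif not isinstance(digest, str) or not re.fullmatch(r"[0-9a-fA-F]+", digest):
            errors.append(f"digest for {name!r} is not hex-encoded")
    return errors
```

An empty list means the document passed these checks; the function does not verify that the digests actually match the artifact.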
Installers that cache distribution packages from an index SHOULD keep information related to the cached distribution artifact, so that the provenance_url.json file can be created even when installing distribution packages from the installer’s cache.
Backwards Compatibility
Following the Recording installed projects specification, installers may keep additional installer-specific files in the .dist-info directory. To check that this PEP does not cause any backwards compatibility issues, a comprehensive survey of installers and libraries was carried out. It found no current tools that are using a similarly named file, and no other major feasibility concerns.
The Wheel specification lists files that can be present in the .dist-info directory. None of these file names collide with the proposed provenance_url.json file from this PEP.
Presence of provenance_url.json in installers and libraries
A comprehensive survey of the existing installers, libraries, and dependency managers in the Python ecosystem analyzed the implications of adding support for provenance_url.json to each tool. In summary, no major backwards compatibility issues, conflicts or feasibility blockers were found as of the time of writing of this PEP. More details about the survey can be found in the Appendix: Survey of installers and libraries section.
Compatibility with direct_url.json
This proposal does not make any changes to the direct_url.json file described in PEP 610 and its corresponding canonical PyPA spec.
The content of the provenance_url.json file was designed to eventually allow installers to reuse some of the logic supporting direct_url.json when a direct URL refers to a source archive or a wheel.
The main difference between the provenance_url.json and direct_url.json files is the mandatory keys and their values in the provenance_url.json file. This helps make sure consumers of the provenance_url.json file can rely on its content, if the file is present in the .dist-info directory.
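Because the two files share a similar shape, a consumer can read whichever one is present through importlib.metadata. The following is a minimal sketch under stated assumptions; recorded_origin is a hypothetical helper, and the returned dictionary layout is chosen only for this example.

```python
import json
from importlib.metadata import Distribution
from typing import Optional


def recorded_origin(dist: Distribution) -> Optional[dict]:
    """Return origin data for an installed distribution, or None.

    Looks for provenance_url.json first, then direct_url.json; a
    distribution is expected to carry at most one of them.
    """
    for name in ("provenance_url.json", "direct_url.json"):
        text = dist.read_text(name)  # returns None when the file is missing
        if text is not None:
            return {"file": name, "data": json.loads(text)}
    return None
```

A tool could call it as recorded_origin(importlib.metadata.distribution("pip")) and branch on the "file" entry to decide which keys are guaranteed to be present.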
Security Implications
One of the main security features of the provenance_url.json file is the ability to audit installed artifacts in Python environments. Tools can check which Python package indexes were used to install Python distribution packages as well as the hash digests of their release artifacts.
As an example, we can take the recent compromised dependency chain in the PyTorch incident. The PyTorch index provided a package named torchtriton. An attacker published torchtriton on PyPI, which ran a malicious binary. By checking the URL of the installed Python distribution stated in the provenance_url.json file, tools can automatically check the source of the installed Python distribution. In the case of the PyTorch incident, the URL of torchtriton should point to the PyTorch index, not PyPI. Tools can help identify such malicious Python distributions by checking the installed Python distribution URL. A more exact check can also include the hash of the installed Python distribution stated in the provenance_url.json file. Such checks on hashes can be helpful for mirrored Python package indexes where Python distributions are not distinguishable by their source URLs, making sure only desired Python package distributions are installed.
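The URL check described above could look roughly like the sketch below. It is an illustration only: unexpected_sources is a hypothetical helper, the policy mapping is something a user would have to supply, and the host names in the usage note are placeholders rather than data taken from the incident.

```python
from urllib.parse import urlsplit


def unexpected_sources(
    origins: dict[str, str], expected_hosts: dict[str, set[str]]
) -> list[str]:
    """List distributions whose recorded URL host is not in the allowed set.

    ``origins`` maps a distribution name to the ``url`` value from its
    provenance_url.json; ``expected_hosts`` is a hypothetical, user-supplied
    policy mapping distribution names to allowed host names.
    """
    flagged = []
    for name, url in sorted(origins.items()):
        allowed = expected_hosts.get(name)
        # Distributions without a policy entry are not checked.
        if allowed is not None and urlsplit(url).hostname not in allowed:
            flagged.append(name)
    return flagged
```

With a policy such as {"torchtriton": {"index.example.org"}}, a torchtriton record whose url points at any other host would be reported.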
A malicious actor can intentionally adjust the content of provenance_url.json to possibly hide provenance information of the installed Python distribution. A security check which would uncover such malicious activity is beyond the scope of this PEP as it would require monitoring actions on the filesystem and eventually reviewing user or file permissions.
How to Teach This
The provenance_url.json metadata file is intended for tools and is not directly visible to end users.
Examples
Examples of a valid provenance_url.json
A valid provenance_url.json listing multiple hashes:
{
  "archive_info": {
    "hashes": {
      "blake2s": "fffeaf3d0bd71dc960ca2113af890a2f2198f2466f8cd58ce4b77c1fc54601ff",
      "sha256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f",
      "sha3_256": "c856930e0f707266d30e5b48c667a843d45e79bb30473c464e92dfa158285eab",
      "sha512": "6bad5536c30a0b2d5905318a1592948929fbac9baf3bcf2e7faeaf90f445f82bc2b656d0a89070d8a6a9395761f4793c83187bd640c64b2656a112b5be41f73d"
    }
  },
  "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
}
A valid provenance_url.json listing a single hash entry:
{
  "archive_info": {
    "hashes": {
      "sha256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f"
    }
  },
  "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
}
A valid provenance_url.json listing a source distribution which was used to build and install a wheel:
{
  "archive_info": {
    "hashes": {
      "sha256": "8bfe29f17c10e2f2e619de8033a07a224058d96b3bfe2ed61777596f7ffd7fa9"
    }
  },
  "url": "https://files.pythonhosted.org/packages/1d/43/ad8ae671de795ec2eafd86515ef9842ab68455009d864c058d0c3dcf680d/micropipenv-0.0.1.tar.gz"
}
Examples of an invalid provenance_url.json
The following example includes a hash key in the archive_info dictionary as originally designed in the data structure documented in Recording the Direct URL Origin of installed distributions. The hash key MUST NOT be present, to prevent any possible confusion with hashes and additional checks that would be required to keep hash values in sync.
{
  "archive_info": {
    "hash": "sha256=236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f",
    "hashes": {
      "sha256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f"
    }
  },
  "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
}
Another example demonstrates an invalid hash name. The referenced hash name does not correspond to the canonical hash names described in this PEP and in the Python docs under hashlib.hash.name.
{
  "archive_info": {
    "hashes": {
      "SHA-256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f"
    }
  },
  "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
}
The last example demonstrates a provenance_url.json file with no hashes available for the downloaded artifact:
{
  "archive_info": {
    "hashes": {}
  },
  "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
}
Example pip commands and their effect on provenance_url.json and direct_url.json
These commands generate a direct_url.json file but do not generate a provenance_url.json file. These examples follow the examples from the Direct URL Data Structure specification:
pip install https://example.com/app-1.0.tgz
pip install https://example.com/app-1.0.whl
pip install "git+https://example.com/repo/app.git#egg=app&subdirectory=setup"
pip install ./app
pip install file:///home/user/app
pip install --editable "git+https://example.com/repo/app.git#egg=app&subdirectory=setup" (in which case, url will be the local directory where the git repository has been cloned to, and dir_info will be present with "editable": true and no vcs_info will be set)
pip install -e ./app
Commands that generate a provenance_url.json file but do not generate a direct_url.json file:
pip install app
pip install app~=2.2.0
pip install app --no-index --find-links "https://example.com/"
This behaviour can be tested using changes to pip implemented in the PR pypa/pip#11865.
Reference Implementation
A proof-of-concept for creating the provenance_url.json metadata file when installing a Python Distribution Package is available in the PR to pip pypa/pip#11865. It reuses the already available implementation for the direct URL data structure to provide the provenance_url.json metadata file for cases when direct_url.json is not created.
A reference implementation for supporting the provenance_url.json file in PDM is available in pdm-project/pdm#3013.
A prototype called pip-preserve was developed to demonstrate creation of requirements.txt files considering direct_url.json and provenance_url.json metadata files. This tool mimics the pip freeze functionality, but the listing of installed packages also includes the hashes of the Python distribution artifacts.
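To make the idea concrete, the sketch below formats one pinned requirement line from parsed provenance data. It is not the pip-preserve implementation; requirement_line is a hypothetical helper, and restricting the output to sha256, sha384 and sha512 reflects an assumption about which hash names pip's --hash option accepts.

```python
# Assumption: hash names accepted by pip's --hash option.
PIP_HASHES = ("sha256", "sha384", "sha512")


def requirement_line(name: str, version: str, provenance: dict) -> str:
    """Format a pinned requirement with --hash options from provenance data."""
    hashes = provenance["archive_info"]["hashes"]
    # Keep only the digests that can be expressed as --hash options.
    options = [f"--hash={algo}:{hashes[algo]}" for algo in PIP_HASHES if algo in hashes]
    return " ".join([f"{name}=={version}", *options])
```

Applied to the single-hash pip 23.0.1 example in this PEP, it produces a line starting with pip==23.0.1 --hash=sha256:236bcb61. A direct_url.json entry would need different handling, for instance a "name @ url" requirement.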
To further support this proposal, pip-sbom demonstrates creation of an SBOM in the SPDX format. The tool uses information stored in the provenance_url.json file.
Rejected Ideas
Naming the file direct_url.json instead of provenance_url.json
To preserve backwards compatibility with the Recording the Direct URL Origin of installed distributions specification, the file cannot be named direct_url.json, as per the text of that specification:
This file MUST NOT be created when installing a distribution from an other type of requirement (i.e. name plus version specifier).
Such a change might introduce backwards compatibility issues for consumers of direct_url.json who rely on its presence only when distributions are installed using a direct URL reference.
Deprecating direct_url.json and using only provenance_url.json
The direct_url.json file is already well established by the Direct URL Data Structure specification and is already used by installers. For example, pip uses direct_url.json to report a direct URL reference on pip freeze. Deprecating direct_url.json would require additional changes to the pip freeze implementation in pip (see PR fridex/pip#2) and could introduce backwards compatibility issues for already existing direct_url.json consumers.
Keeping the hash key in the archive_info dictionary
The Direct URL Data Structure specification discusses the possibility of including the hash key alongside the hashes key in the archive_info dictionary. This PEP explicitly does not include the hash key in the provenance_url.json file and allows only the hashes key to be present. By doing so we eliminate possible redundancy in the file, possible confusion, and any additional checks that would need to be done to make sure the hashes are in sync.
Allowing no hashes stated
For cases when a wheel file is installed from pip’s cache and built using an older version of pip, pip does not record hashes of the downloaded source distributions. As we do not have hashes of these downloaded source distributions, the hashes key in the provenance_url.json file would not contain any entries. In such cases, pip does not create any provenance_url.json file as the provenance information is not complete. Consumers are encouraged to rebuild wheels with a newer version of pip in these cases.
Making the hashes key optional
PEP 610 and its corresponding canonical PyPA spec recommend including the hashes key of the archive_info in the direct_url.json file but it is not required (per the RFC 2119 language):
A hashes key SHOULD be present as a dictionary mapping a hash name to a hex encoded digest of the file.
This PEP requires the hashes key be included in archive_info in the provenance_url.json file if that file is created; per this PEP:
The value of archive_info MUST be a dictionary with a single key hashes.
By doing so, consumers of provenance_url.json can check artifact digests when the provenance_url.json file is created by installers.
Storing index URL
A possibility was raised for storing the index URL as part of the file content. This index URL would represent the index configured in pip’s configuration or specified using the --index-url or --extra-index-url options. Storing this information was considered confusing, especially when using other installation options like --find-links. Since the actual index URL is not strictly bound to the location from which the wheel file was downloaded, we decided not to store the index URL in the provenance_url.json file.
Open Issues
Availability of the provenance_url.json file in Conda
We would like to get feedback on the provenance_url.json file from the Conda maintainers. It is not clear whether Conda would like to adopt the provenance_url.json file. Conda already stores provenance related information (similar to the provenance information proposed in this PEP) in JSON files located in the conda-meta directory following its actions during installation.
Using provenance_url.json in downstream installers
The proposed provenance_url.json file was meant to be adopted primarily by Python installers. Other installers, such as APT or DNF, might record the provenance of the installed downstream Python distributions in their own way specific to downstream package management. The proposed file is not expected to be created by these downstream package installers and thus they were intentionally left out of this PEP. However, any input by developers or maintainers of these installers is valuable to possibly enrich the provenance_url.json file with information that would help in some way.
Appendix: Survey of installers and libraries
pip
The function from pip’s internal API responsible for installing wheels, named _install_wheel, does not store any provenance_url.json file in the .dist-info directory. Additionally, a prototype introducing the mentioned file to pip in pypa/pip#11865 demonstrates incorporating logic for handling the provenance_url.json file in pip’s source code.
As pip is used by some of the tools mentioned below to install Python package distributions, findings for pip apply to these tools as well; in addition, pip does not allow parametrizing creation of files in the .dist-info directory in its internal API. Most of the tools mentioned below that use pip invoke pip as a subprocess, which has no effect on the eventual presence of the provenance_url.json file in the .dist-info directory.
distlib
distlib implements low-level functionality to manipulate the dist-info directory. Based on distlib’s source code, the database of installed distributions does not use any file named provenance_url.json.
Pipenv
Pipenv uses pip to install Python package distributions. No additional logic was identified that would cause backwards compatibility issues when introducing the provenance_url.json file in the .dist-info directory.
installer
installer does not create a provenance_url.json file explicitly. Nevertheless, as per the Recording Installed Projects specification, installer allows passing the additional_metadata argument to create a file in the .dist-info directory - see the source code. To avoid any backwards compatibility issues, any library or tool using installer must not request creating the provenance_url.json file using the mentioned additional_metadata argument.
Poetry
The installation logic in Poetry depends on the installer.modern-installer configuration option (see docs).
For cases when the installer.modern-installer configuration option is set to false, Poetry uses pip for installing Python package distributions.
On the other hand, when the installer.modern-installer configuration option is set to true, Poetry uses installer to install Python package distributions. As can be seen from the linked sources, no additional metadata file named provenance_url.json is passed that would cause compatibility issues with this PEP.
Conda
Conda does not create any provenance_url.json file when Python package distributions are installed.
Hatch
Hatch uses pip to install project dependencies.
micropipenv
As micropipenv is a wrapper on top of pip, it uses pip to install Python distributions, for both lock files and requirements files.
Thamos
Thamos uses micropipenv to install Python package distributions, hence any findings for micropipenv apply to Thamos.
PDM
PDM uses installer to install binary distributions. The only additional metadata file it eventually creates in the .dist-info directory is the REFER_TO file.
uv
uv is written in Rust and uses its own installation logic when installing wheels. It does not create any additional files in the .dist-info directory that would collide with the provenance_url.json file naming.
Acknowledgements
Thanks to Dustin Ingram, Brett Cannon, and Paul Moore for the initial discussion in which this idea originated.
Thanks to Donald Stufft, Ofek Lev, and Trishank Kuppusamy for early feedback and support to work on this PEP.
Thanks to Gregory P. Smith, Stéphane Bidoul, and C.A.M. Gerlach for reviewing this PEP and providing valuable suggestions.
Thanks to Seth Michael Larson for providing valuable suggestions and for the proposed pip-sbom prototype.
Thanks to Stéphane Bidoul and Chris Jerdonek for PEP 610, and related Recording the Direct URL Origin of installed distributions and Direct URL Data Structure specifications.
Thanks to Frost Ming for raising a possible concern around storing the index URL in the provenance_url.json file and for the initial PEP 710 support in PDM.
Last, but not least, thanks to Donald Stufft for sponsoring this PEP.
Copyright
This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.
Source: https://github.com/python/peps/blob/main/peps/pep-0710.rst
Last modified: 2024-08-03 11:39:19 GMT