PEP 770 – Improving measurability of Python packages with Software Bill-of-Materials
- Author:
- Seth Larson <seth at python.org>
- Sponsor:
- Brett Cannon <brett at python.org>
- PEP-Delegate:
- Brett Cannon <brett at python.org>
- Discussions-To:
- Discourse thread
- Status:
- Draft
- Type:
- Standards Track
- Topic:
- Packaging
- Created:
- 02-Jan-2025
- Post-History:
- 05-Nov-2024, 06-Jan-2025
Abstract
Almost all Python packages today are accurately measurable by software composition analysis (SCA) tools. For projects that are not accurately measurable, there is no existing mechanism to annotate a Python package with composition data to improve measurability.
Software Bill-of-Materials (SBOM) is a technology-and-ecosystem-agnostic method for describing software composition, provenance, heritage, and more. SBOMs are used as inputs for SCA tools, such as scanners for vulnerabilities and licenses, and have been gaining traction in global software regulations and frameworks.
This PEP proposes using SBOM documents included in Python packages as a means to improve automated software measurability for Python packages.
Motivation
Measurability and Phantom Dependencies
Python packages are particularly affected by the “phantom dependency” problem, where software components that aren’t written in Python are included in Python packages for many reasons, such as ease of installation and compatibility with standards:
- Python serves scientific, data, web, and machine-learning use-cases which use compiled or non-Python languages like Rust, C, C++, Fortran, JavaScript, and others.
- The Python wheel format is preferred by users due to the ease-of-installation. No code is executed during the installation step, only extracting the archive.
- The Python wheel format requires bundling shared compiled libraries without a method to encode metadata about these libraries.
- Packages related to Python packaging sometimes need to solve the “bootstrapping” problem, so include pure Python projects inside their source code.
These software components can’t be described using Python package metadata and thus are likely to be missed by software composition analysis (SCA) software which can mean vulnerable software components aren’t reported accurately.
For example, the Python package Pillow includes 16 shared object libraries in the wheel that were bundled by auditwheel as a part of the build. None of those shared object libraries are detected when using common SCA tools like Syft and Grype. If an SBOM document is included annotating all the included shared libraries then SCA tools can identify the included software reliably.
Build Tools, Environment, and Reproducibility
Going beyond the runtime dependencies of a package: SBOMs can also record the tools and environments used to build a package. Recording the exact tools and versions used to build a package is often required to establish build reproducibility. Build reproducibility is a property of software that can be used to detect incorrectly or maliciously modified software components when compared to their upstream sources. Without a recorded list of build tools and versions it can become difficult to impossible for a third-party to verify build reproducibility.
Regulations
SBOMs are required by recent software security regulations, like the Secure Software Development Framework (SSDF) and the Cyber Resilience Act (CRA). Due to their inclusion in these regulations, the demand for SBOM documents of open source projects is expected to be high. One goal is to minimize the demands on open source project maintainers by enabling open source users that need SBOMs to self-serve using existing tooling.
Another goal is to enable contributions from users who need SBOMs to annotate projects they depend on with SBOM information. Today there is no mechanism to propagate the results of those contributions for a Python package so there is no incentive for users to contribute this type of work.
Rationale
Using SBOM standards instead of Core Metadata fields
Attempting to add every field offered by SBOM standards into Python package Core Metadata would result in an explosion of new Core Metadata fields, including the need to keep up-to-date as SBOM standards continue to evolve to suit new needs in that space.
Instead, this proposal delegates SBOM-specific metadata to SBOM documents that are included in Python packages and adds a new Core Metadata field for discoverability of included SBOM documents.
This standard also doesn’t aim to replace Core Metadata with SBOMs, instead focusing on the SBOM information being supplemental to Core Metadata. Included SBOMs only contain information about dependencies included in the package archive or information about the top-level software in the package that can’t be encoded into Core Metadata but is relevant for the SBOM use-case (“software identifiers”, “purpose”, “support level”, etc).
Zero-or-more opaque SBOM documents
Rather than requiring at most one included SBOM document per Python package, this PEP proposes that zero or more SBOM documents may be included in a Python package. This means that code attempting to annotate a Python package with SBOM data may do so without being concerned about corrupting data already contained within other SBOM documents.
Additionally, this PEP treats SBOM document data opaquely, relying instead on the final end-users of the SBOM data to process the contained SBOM data. This choice acknowledges that SBOM standards are an active area of development where there is not yet (and may never be) a single definitive SBOM standard and that SBOM standards can continue to evolve independent of Python packaging standards. Tools that consume SBOM documents already support a multitude of SBOM standards to handle this reality.
These decisions mean this PEP is capable of supporting any SBOM standard and does not favor one over the other, instead deferring the decision to producing projects and tools and consuming user tooling.
Adding data to Python packages without new metadata versions
The rollout of a new metadata version and field requires that many different projects and teams need to adopt the metadata version in sequence to avoid widespread breakage. This effect usually means a substantial delay in how quickly users and tools can start using new packaging features.
For example, a single metadata version bump requires updates to PyPI, various pyproject.toml parsing and schema projects, and the packaging library, followed by a wait for releases. Then pip and other installers need to bundle the changes to packaging and release. After that, build backends can begin emitting the new metadata version, followed by another wait for releases, and only then can projects begin using the new features. Even with this careful approach it’s not guaranteed that tools won’t break on new metadata versions and fields.
To avoid this delay, to simplify how SBOMs are included, and to give flexibility to build backends and tools, this PEP proposes a new top-level table in pyproject.toml, [dist-info.files], to safely add data to a Python package through a registry of reserved names that avoids the need for new metadata fields and versions. This mechanism allows build backends and tools to begin using the features described in this PEP immediately after acceptance without the head-of-line blocking on other projects adopting the PEP.
A new top-level table was chosen over using the [project] table because, as described in PEP 621, the [project] table is used for storing core metadata and this mechanism doesn’t use core metadata.
Storing files in the .dist-info or .data directory
There are two top-level directories in binary distributions where files beyond the software itself can be stored: .dist-info and .data. This specification chose to use the .dist-info directory for storing subdirectories and files from the new [dist-info.files] top-level table for two reasons:
Firstly, the .data directory has no corresponding location in the installed package, compared to .dist-info which does preserve the link between the binary distribution and the installed package in an environment. The .data directory instead has all its contents merged between all installed packages in an environment, which can lead to collisions between similarly named files.
Secondly, subdirectories under the .data directory require new definitions to the Python sysconfig module. This means defining additional directories requires waiting for a change to Python, and using the directory requires waiting for adoption of the new Python version by users. Subdirectories under .dist-info don’t have these requirements; they can be used by any user, build backend, and installer immediately after a new subdirectory name is registered, regardless of Python or metadata version.
What are the differences between PEP 770 and PEP 725?
PEP 725 (“Specifying external dependencies in pyproject.toml”) is a different PEP with some similarities to PEP 770, such as attempting to describe non-Python software within Python packaging metadata. This section aims to show how these two PEPs are tracking different information and serving different use-cases:
- PEP 725 describes abstract dependencies, such as requiring “a C compiler” as a build-time dependency (virtual:compiler/c) or needing to link “the OpenSSL library” at build time (pkg:generic/openssl). PEP 770 describes concrete dependencies, more akin to dependencies in a “lock file”, such as an exact name, version, architecture, and hash of a software library distributed through the AlmaLinux distribution (pkg:rpm/almalinux/libssl3@3.2.0). For cases like build dependencies this might result in a dependency being requested via PEP 725 and then recorded concretely in an SBOM post-build with PEP 770.
- PEP 725 is for describing external dependencies, provided by the system being used to either build or run the software. PEP 770 is for describing bundled software inside Python package archives; the SBOM documents don’t describe software on the system.
- PEP 725 is primarily about identification, using a list of software identifiers. PEP 770 provides the complete functionality of SBOM standards to describe various software attributes such as license, checksum, download location, etc.
- PEP 725 and PEP 770 have different users and use-cases. PEP 725 is primarily for humans writing dependencies in pyproject.toml by hand. The users of the information are build backends and users who want to build software from source. PEP 770 is primarily for tools which are capable of generating SBOM documents to be included in a Python package archive and SBOM/SCA tools which want to read SBOM documents about installed software to do some other task such as vulnerability scanning or software analysis.
Specification
The changes necessary to implement this PEP include:
- Explicitly reserving all subdirectory names in the .dist-info directory.
- A new registry of reserved subdirectory names in the .dist-info directory.
- An optional top-level table, [dist-info.files], added to project source metadata.
- An optional sboms key in the new [dist-info.files] table.
- Additions to the built distribution (wheel) and installed project specifications.
In addition to the above, an informational PEP will be created for tools consuming included SBOM documents and other Python package metadata to generate complete SBOM documents for Python packages.
Reserving all subdirectory names in .dist-info
This PEP explicitly reserves all subdirectory names in the .dist-info directory for future usage.
Build backends MUST NOT create subdirectories in the .dist-info directory beyond the names in the registry to avoid collisions with future reserved names.
Build frontends and publishing tools MAY warn users if any .dist-info subdirectories aren’t in the registry.
Registry of reserved .dist-info subdirectory names
This PEP introduces a new registry of reserved subdirectory names allowed in the .dist-info directory for both distribution archives and installed projects. Future additions to this registry will be made through the PEP process. The initial values in this registry are:
Subdirectory name | PEP / Standard |
---|---|
licenses | PEP 639 |
license_files | PEP 639 (draft-only) |
LICENSES | REUSE licensing framework |
sboms | PEP 770 |
See Backwards Compatibility for a complete methodology for creating this initial set of values to avoid backwards incompatibility issues.
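The optional warning for unregistered subdirectories can be sketched as a check of archive member names against the registry. This is a minimal illustration rather than part of the specification: the function name unregistered_dist_info_subdirs is hypothetical, and the registry set simply copies the initial values from the table above.

```python
from pathlib import PurePosixPath

# Initial registry values copied from the table in this section.
REGISTERED_SUBDIRS = frozenset({"licenses", "license_files", "LICENSES", "sboms"})


def unregistered_dist_info_subdirs(member_names):
    """Return sorted .dist-info subdirectory names that are not in the registry.

    `member_names` is any iterable of archive paths using "/" separators,
    for example the result of zipfile.ZipFile(wheel_path).namelist().
    """
    found = set()
    for name in member_names:
        parts = PurePosixPath(name).parts
        # Directory entries end with "/" and need one fewer path component.
        minimum_parts = 2 if name.endswith("/") else 3
        if len(parts) < minimum_parts or not parts[0].endswith(".dist-info"):
            continue
        if parts[1] not in REGISTERED_SUBDIRS:
            found.add(parts[1])
    return sorted(found)
```

A frontend could pass each returned name to warnings.warn instead of failing the build, which matches the MAY-level wording used for frontends and publishing tools.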
Project source metadata
This PEP specifies changes to the project’s source metadata in the pyproject.toml file:
Add new [dist-info.files] table
A new optional [dist-info.files] table is added for specifying paths in the project source tree relative to pyproject.toml to file(s) which should be included in the built project under a subdirectory of .dist-info.
This new table has only one defined optional key: sboms. The value of the sboms key MUST be an array of valid glob patterns, as specified below:
- Alphanumeric characters, underscores (_), hyphens (-) and dots (.) MUST be matched verbatim.
- Special glob characters: *, ?, ** and character ranges: [] containing only the verbatim matched characters MUST be supported. Within [...], the hyphen indicates a locale-agnostic range (e.g. a-z, order based on Unicode code points). Hyphens at the start or end are matched literally.
- Path delimiters MUST be the forward slash character (/). Patterns are relative to the directory containing pyproject.toml, therefore the leading slash character MUST NOT be used.
- Parent directory indicators (..) MUST NOT be used.
Any characters or character sequences not covered by this specification are invalid. Projects MUST NOT use such values. Tools consuming this field SHOULD reject invalid values with an error.
Literal paths (e.g. bom.cdx.json) are treated as valid globs which means they can also be defined.
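One way a tool might check a pattern against these rules is sketched below. It is an illustration under stated assumptions, not a reference validator: the name validate_sbom_glob is hypothetical, “alphanumeric” is read as including Unicode letters and digits (via \w), and empty path segments such as “a//b” are rejected even though the rules above do not mention them explicitly.

```python
import re

# One path segment: verbatim characters, "*", "?", or a "[...]" range that
# contains only verbatim characters. "\w" covers alphanumerics and "_".
_SEGMENT = re.compile(r"(?:[\w.\-]|[*?]|\[[\w.\-]+\])+")


def validate_sbom_glob(pattern: str) -> None:
    """Raise ValueError if `pattern` breaks one of the rules listed above."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    if pattern.startswith("/"):
        raise ValueError(f"{pattern!r}: leading slash is not allowed")
    for segment in pattern.split("/"):
        if segment == "..":
            raise ValueError(f"{pattern!r}: parent directory indicator is not allowed")
        # Assumption: empty segments (as in "a//b" or "a/") are rejected too.
        if not _SEGMENT.fullmatch(segment):
            raise ValueError(f"{pattern!r}: invalid glob segment {segment!r}")
```

Under this sketch, patterns such as sboms/* and sboms/**/[a-z]*.json pass, while patterns containing a backslash or a curly brace raise ValueError because those characters are not covered by the rules.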
Build tools:
- MUST treat each value in the array as a glob pattern, and MUST raise an error if the pattern contains invalid glob syntax.
- MUST include all files matched by a listed pattern in all distribution archives under the .dist-info/sboms directory.
- MUST raise an error if any individual user-specified pattern does not match at least one file.
If the sboms key is present and is set to a value of an empty array, then tools MUST NOT include any SBOM files and MUST NOT raise an error.
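The matching behavior required of build tools can be sketched as follows. This is a rough illustration that assumes pathlib.Path.glob is close enough to the glob rules above for demonstration purposes (its handling of hidden files and ** may differ in detail); collect_sbom_files is a hypothetical helper name, and how the selected files are laid out inside .dist-info/sboms is left to the build backend.

```python
from pathlib import Path


def collect_sbom_files(project_root: Path, patterns: list[str]) -> list[Path]:
    """Return the files selected by `patterns`, relative to `project_root`.

    An empty `patterns` list selects nothing and is not an error. A pattern
    that matches no file raises ValueError.
    """
    selected: set[Path] = set()
    for pattern in patterns:
        matches = [p for p in project_root.glob(pattern) if p.is_file()]
        if not matches:
            raise ValueError(f"sboms pattern {pattern!r} did not match any file")
        # A set keeps a file matched by several patterns from being listed twice.
        selected.update(p.relative_to(project_root) for p in matches)
    return sorted(selected)
```

A backend would call this with the directory containing pyproject.toml and the value of the sboms key, then copy each returned file into the archive.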
Examples of valid SBOM files declarations:
[dist-info.files]
sboms = ["bom.json"]
[dist-info.files]
sboms = ["sboms/openssl.cdx.json", "sboms/openssl.spdx.json"]
[dist-info.files]
sboms = ["sboms/*"]
[dist-info.files]
sboms = []
Examples of invalid SBOM files declarations:
[dist-info.files]
sboms = ["..\bom.json"]
Reason: .. must not be used. \\ is an invalid path delimiter, / must be used.
[dist-info.files]
sboms = ["bom{.json*"]
Reason: bom{.json* is not a valid glob.
SBOM files in project formats
A few additions will be made to the existing specifications.
- Project source trees
- Per the Project source metadata section, the Declaring Project Metadata specification will be updated to add the [dist-info.files] table and optional sboms key.
- Built distributions (wheels)
- The wheel specification will be updated to add the new registry of reserved directory names and to reflect that if the .dist-info/sboms subdirectory is specified, the directory contains SBOM files.
- Installed projects
- The Recording Installed Projects specification will be updated to reflect that if the .dist-info/sboms subdirectory is specified, the directory contains SBOM files and that any files in this directory MUST be copied from wheels by install tools.
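Because installers copy the sboms directory into the installed project, a consuming tool could locate SBOM files through the standard library’s importlib.metadata. The sketch below is one possible approach, not a prescribed API: installed_sbom_paths is a hypothetical helper, and it relies on the installer having written a RECORD file that lists the SBOM files.

```python
from importlib.metadata import Distribution
from pathlib import Path


def installed_sbom_paths(dist: Distribution) -> list[Path]:
    """Return on-disk paths of files recorded under `<name>.dist-info/sboms/`."""
    found = []
    # `dist.files` is None when the installer recorded no file list.
    for entry in dist.files or []:
        parts = entry.parts
        if len(parts) >= 3 and parts[0].endswith(".dist-info") and parts[1] == "sboms":
            found.append(Path(entry.locate()))
    return sorted(found)
```

For an installed package, the Distribution object can be obtained with importlib.metadata.distribution("some-package"); packages without included SBOM documents simply yield an empty list.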
SBOM data interoperability
This PEP treats data contained within SBOM documents as opaque, recognizing that SBOM standards are an active area of development. However, there are some considerations for SBOM data producers that when followed will improve the interoperability and usability of SBOM data made available in Python packages:
- SBOM documents SHOULD use a widely-accepted SBOM standard, such as CycloneDX or SPDX.
- SBOM documents SHOULD use UTF-8-encoded JSON (RFC 8259) when available for the SBOM standard in use.
- SBOM documents SHOULD include all required fields for the SBOM standard in use.
- SBOM documents SHOULD include a “time of creation” and “creating tool” field for the SBOM standard in use. This information is important for users attempting to reconstruct different stages for a Python package being built.
- The primary component described by the SBOM document SHOULD be the top-level software within the Python package (for example, “pkg:pypi/pillow” for the Pillow package).
- All non-primary components SHOULD have one or more paths in the relationship graph showing the relationship between components. If this information isn’t included, SCA tools might exclude components outside of the relationship graph.
- All software components SHOULD have a name, version, and one or more software identifiers (PURL, CPE, download URL).
PyPI and other indices MAY validate the contents of SBOM documents specified by this PEP, but MUST NOT validate or reject data for unknown SBOM standards, versions, or fields.
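To make the recommendations above concrete, the following sketch builds a small CycloneDX-style JSON document and checks it against a few of them. Everything here is illustrative: the package and library names are placeholders, the helper names are hypothetical, and the field names follow the CycloneDX JSON format as the author of this sketch understands it, so they should be verified against the version of the standard in use.

```python
import json
from datetime import datetime, timezone

# Placeholder identifiers: "example-pkg" and "libexample" are not real projects.
PRIMARY_REF = "pkg:pypi/example-pkg@1.0.0"
BUNDLED_REF = "pkg:generic/libexample@2.3.4"


def build_example_sbom() -> dict:
    """Build a small CycloneDX-style document for a wheel bundling one library."""
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.6",
        "version": 1,
        "metadata": {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tools": {"components": [
                {"type": "application", "name": "example-sbom-tool", "version": "0.1"}
            ]},
            "component": {"type": "library", "bom-ref": PRIMARY_REF,
                          "name": "example-pkg", "version": "1.0.0", "purl": PRIMARY_REF},
        },
        "components": [
            {"type": "library", "bom-ref": BUNDLED_REF,
             "name": "libexample", "version": "2.3.4", "purl": BUNDLED_REF},
        ],
        "dependencies": [{"ref": PRIMARY_REF, "dependsOn": [BUNDLED_REF]}],
    }


def unmet_recommendations(doc: dict) -> list[str]:
    """List which of the SHOULD-level recommendations the document misses."""
    problems = []
    metadata = doc.get("metadata", {})
    if not metadata.get("timestamp"):
        problems.append("missing time of creation")
    if not metadata.get("tools"):
        problems.append("missing creating tool")
    if not metadata.get("component"):
        problems.append("missing primary component")
    # Components reachable from some "dependsOn" edge of the relationship graph.
    linked = {ref for dep in doc.get("dependencies", []) for ref in dep.get("dependsOn", [])}
    for comp in doc.get("components", []):
        name = comp.get("name", "<unnamed>")
        if not (comp.get("name") and comp.get("version")):
            problems.append(f"{name}: missing name or version")
        if not (comp.get("purl") or comp.get("cpe")):
            problems.append(f"{name}: missing software identifier")
        if comp.get("bom-ref") not in linked:
            problems.append(f"{name}: not in relationship graph")
    return problems


def to_utf8_json(doc: dict) -> bytes:
    """Serialize the document as UTF-8-encoded JSON."""
    return json.dumps(doc, indent=2).encode("utf-8")
```

A producing tool could run a check like unmet_recommendations before writing the output of to_utf8_json into the sboms directory. Since these are SHOULD-level recommendations, a non-empty result is a hint for the producer rather than grounds for rejection.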
Backwards Compatibility
Reserved .dist-info subdirectories registry
The new registry of reserved .dist-info subdirectories represents a new reservation that wasn’t previously documented, thus has the potential to break assumptions being made by already existing tools.
To check what .dist-info subdirectory names are in use today a query across all files in package archives on PyPI was executed:
SELECT
  regexp_extract(archive_path, '.*\.dist-info/([^/]+)/', 1) AS dirname,
  COUNT(DISTINCT project_name) AS projects
FROM '*.parquet'
WHERE archive_path LIKE '%.dist-info/%/%'
GROUP BY dirname ORDER BY projects DESC;
Note that this only includes records for files and thus won’t return results for empty directories. Empty directories being pervasively used and somehow load-bearing is unlikely, so is an accepted risk of using this method. This query yielded the following results:
Subdirectory | Unique Projects |
---|---|
licenses | 22,026 |
license_files | 1,828 |
LICENSES | 170 |
.ipynb_checkpoints | 85 |
license | 18 |
.wex | 9 |
dist | 8 |
include | 6 |
build | 5 |
tmp | 4 |
src | 3 |
calmjs_artifacts | 3 |
.idea | 2 |
Not shown above are roughly 50 other subdirectory names that are each used in a single project. From these results we can see:
- Most subdirectories under .dist-info are to do with licensing, one of which (licenses) is specified by PEP 639 and others (license_files, LICENSES) are from draft implementations of PEP 639.
- The sboms subdirectory doesn’t collide with existing use.
- Other subdirectory names under .dist-info appear to be either not widespread or accidental.
As a result of this query we can see there are already some projects placing directories under .dist-info, so we can’t require that build frontends raise errors for unregistered subdirectories. Instead the recommendation is that build frontends MAY warn the user or raise an error in this scenario.
Security Implications
SBOM documents are only as useful as the information encoded in them. If an SBOM document contains incorrect information then this can result in incorrect downstream analysis by SCA tools. For this reason, it’s important for tools including SBOM data into Python packages to be confident in the information they are recording. SBOMs are capable of recording “known unknowns” in addition to known data. This practice is recommended when not certain about the data being recorded to allow for further analysis by users.
SBOM documents can encode information about the original system where a Python package is built (for example, the operating system name and version, less commonly the names of paths). This information has the potential to “leak” through the Python package to installers via SBOMs. If this information is sensitive, then that could represent a security risk.
How to Teach This
Most typical users of Python and Python packages won’t need to know the details of this standard. The details of this standard are most important to maintainers of Python packages and to developers of SCA tools such as SBOM generation tools and vulnerability scanners.
What do Python package maintainers need to know?
Python package metadata can already describe the top-level software included in a package archive, but what if a package archive contains other software components beyond the top-level software? For example, the Python wheel for “Pillow” contains a handful of other software libraries bundled inside, like libjpeg, libpng, libwebp, and so on. This scenario is where this PEP is most useful, for adding metadata about bundled software to a Python package.
Some build tools may be able to automatically annotate bundled dependencies. Typically tools can automatically annotate bundled dependencies when those dependencies come from a “packaging ecosystem” (such as PyPI, Linux distros, Crates.io, NPM, etc).
For packages which cannot be automatically annotated, if the package author wishes to provide an SBOM, the approach will be to generate or author SBOM files and then include those files using pyproject.toml:
[dist-info.files]
sboms = [
"sboms/bom.cdx.json"
]
For projects manually specifying an SBOM document the challenge will be keeping the document up-to-date. The CPython project has some customized tooling for this task, but it can likely be generalized into a tool reusable by other projects.
What do users of SBOM documents need to know?
Many users of this PEP won’t know of its existence, instead their software composition analysis tools, SBOM tools, or vulnerability scanners will simply begin giving more comprehensive information after an upgrade. For users that are interested in the sources of this new information, the “tool” field of SBOM metadata already provides linkages to the projects generating their SBOMs.
For users who need SBOM documents describing their open source dependencies the first step should always be “create them yourself”. Using the benchmarks above a list of tools that are known to be accurate for Python packages can be documented and recommended to users. For projects which require additional manual SBOM annotation: tips for contributing this data and tools for maintaining the data can be recommended.
Note that SBOM documents can vary across different Python package archives due to variance in dependencies, Python version, platform, architecture, etc. For this reason users SHOULD only use the SBOM documents contained within the actual downloaded and installed Python package archive and not assume that the SBOM documents are the same for all archives in a given package release.
Reference Implementation
Auditwheel fork which generates CycloneDX SBOM documents to include in wheels describing bundled shared library files. These SBOM documents worked as expected for the Syft and Grype SBOM and vulnerability scanners.
Rejected Ideas
Why not require a single SBOM standard?
Most discussion and development around SBOMs today focuses on two SBOM standards: CycloneDX and SPDX. There is no clear “winner” between these two standards; both are frequently used by projects and software ecosystems.
Because both standards are frequently used, tools for consuming and processing SBOM documents commonly need to support both standards. This means that this PEP is not constrained to select a single SBOM standard by its consumers and thus can allow tools creating SBOM documents for inclusion in Python packages to choose which SBOM standard works best for their use-case.
Selecting a single SBOM standard
There is no universally accepted SBOM standard and this area is still rapidly evolving (for example, SPDX released a new major version of their standard in April 2024). To avoid locking the Python ecosystem into a specific standard ahead of when a clear winner emerges this PEP treats SBOM documents as opaque and only makes recommendations to promote compatibility with downstream consumers of SBOM document data.
None of the decisions in this PEP restrict a future PEP from selecting a single SBOM standard. Tools that use SBOM data today already need to support multiple formats to handle this situation, so a future standard that requires only one SBOM standard would have no effect on downstream SBOM tools.
Using metadata fields to specify SBOM files in archives
A previous iteration of this specification used an Sbom-File metadata field to specify an SBOM file within a source or binary distribution archive. This would make the implementation similar to PEP 639 which uses the License-File field to enumerate license files in archives.
The primary issue with this approach is that SBOM files can originate from both static and dynamic sources: like versioned source code, the build backend, or from tools adding SBOM files after the build has completed (like auditwheel).
Metadata fields must either be static or dynamic, not both. This is in direct conflict with the best-case scenario for SBOM data: that SBOM files are added automatically by tools during the build of a Python package without user-involvement or knowledge. Compare this situation to license files which are almost always static.
The 639-style approach was ultimately dropped in favor of defining SBOMs simply by their presence in the .dist-info/sboms directory and using a new table in pyproject.toml called [dist-info.files] to define SBOMs in source distributions. This approach allows users to specify static SBOM files while still empowering build backends and tools to add their own SBOM data without the static/dynamic conflict.
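As an illustration of how a post-build tool might add its own SBOM data under this presence-based approach, the sketch below writes a file into the sboms directory of an unpacked wheel and appends a matching RECORD entry. It is a simplified example and does not describe how auditwheel or any other tool actually works: add_sbom_to_unpacked_wheel is a hypothetical helper, file names containing commas (which would need CSV quoting in RECORD) are assumed not to occur, and repacking or re-signing the wheel is out of scope.

```python
import base64
import hashlib
from pathlib import Path


def add_sbom_to_unpacked_wheel(wheel_dir: Path, sbom_name: str, sbom_bytes: bytes) -> Path:
    """Write an SBOM into `<name>.dist-info/sboms/` and record it in RECORD."""
    dist_infos = list(wheel_dir.glob("*.dist-info"))
    if len(dist_infos) != 1:
        raise ValueError("expected exactly one .dist-info directory")
    dist_info = dist_infos[0]

    target = dist_info / "sboms" / sbom_name
    target.parent.mkdir(exist_ok=True)
    target.write_bytes(sbom_bytes)

    # RECORD entries use an unpadded urlsafe-base64 sha256 digest and the size.
    digest = base64.urlsafe_b64encode(hashlib.sha256(sbom_bytes).digest())
    digest_text = digest.rstrip(b"=").decode("ascii")
    relative = target.relative_to(wheel_dir).as_posix()
    line = f"{relative},sha256={digest_text},{len(sbom_bytes)}\n"

    record = dist_info / "RECORD"
    existing = record.read_bytes().decode("utf-8") if record.exists() else ""
    if existing and not existing.endswith("\n"):
        existing += "\n"  # avoid fusing the new entry onto the last line
    record.write_bytes((existing + line).encode("utf-8"))
    return target
```

Because the SBOM is identified only by its location, no Core Metadata field has to change when a tool adds a document this way.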
Open Issues
Conditional project source SBOM files
How can a project specify an SBOM file that is conditional? Under what circumstances would an SBOM document be conditional?
References
- Visualizing the Python package SBOM data flow. This is a graphic that shows how this PEP fits into the bigger picture of Python packaging’s SBOM data story.
- Adding SBOMs to Python wheels with auditwheel. These were some early results from a fork of auditwheel to add SBOM data to a wheel and then use the SBOM generation tool Syft to detect the SBOM in the installed package.
- Querying every file in every release on PyPI. The dataset available on py-code.org from Tom Forbes was used to check subdirectory usage in .dist-info files.
Acknowledgements
Thanks to Karolina Surma for authoring and leading PEP 639 to acceptance.
This PEP’s initial design was heavily inspired by PEP 639 and the new “dist-info.files” mechanism generalizes 639’s approach of using a subdirectory under .dist-info.
Copyright
This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.
Source: https://github.com/python/peps/blob/main/peps/pep-0770.rst
Last modified: 2025-03-04 19:22:46 GMT