PEP 766 – Explicit Priority Choices Among Multiple Indexes
- Author: Michael Sarahan <msarahan at gmail.com>
- Sponsor: Barry Warsaw <barry at python.org>
- PEP-Delegate: Paul Moore <p.f.moore at gmail.com>
- Discussions-To: Discourse thread
- Status: Draft
- Type: Informational
- Topic: Packaging
- Created: 18-Nov-2024
- Post-History: 18-Nov-2024
Abstract
Package resolution is a key part of the Python user experience as the means of extending Python’s core functionality. The experience of package resolution is mostly taken for granted until someone encounters a situation where the package installer does something they don’t expect. The installer behavior with multiple indexes has been a common source of unexpected behavior. Through its ubiquity, pip has long defined the standard expected behavior across other tools in the ecosystem, but Python installers are diverging with respect to how they handle multiple indexes. At the core of this divergence is whether index contents are combined before resolving distributions, or each index is handled individually in order. pip merges all indexes before matching distributions, while uv matches distributions on one index before moving on to the next. Each approach has advantages and disadvantages. This PEP aims to describe each of these behaviors, which are referred to as “version priority” and “index priority” respectively, so that community discussions and troubleshooting can share a common vocabulary, and so that tools can implement predictable behavior based on these descriptions.
Motivation
Python package users frequently find themselves in need of specifying an index or package source other than PyPI. There are many reasons for external indexes to exist:
- File size/quota limitations on PyPI
- Implementation variants, such as different GPU library builds in PyTorch
- Local builds of packages shared internally at an organization
- Situations where a local package has remote dependencies, and the user wishes to prioritize local packages over remote dependencies, while still falling back to remote dependencies where needed
In most of these cases, it is not desirable to completely forgo PyPI. Instead, users generally want PyPI to still be a source of packages, but a lower priority source. Unfortunately, pip’s current design precludes this concept of priority. Some Python installer tools, such as uv and PDM, have developed alternative ways to handle multiple indexes that incorporate mechanisms to express index priority.
The innovation and the potential for customization are exciting, but they come at the risk of further fragmenting the Python packaging ecosystem, which is already perceived as one of Python’s weak points. The motivation of this PEP is to encourage installers to provide more insight into how they handle multiple indexes, and to provide a vocabulary that can be common to the broader community.
Specification
“Version priority”
This behavior is characterized by the installer always getting the “best” version of a package, regardless of the index that it comes from. “Best” is defined by the installer’s algorithm for optimizing the various traits of a package, also factoring in user input (such as preferring only binaries, or no binaries). While installers may differ in their optimization criteria and user options, the general trait that all version priority installers share is that the index contents are collated prior to candidate selection.
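As a minimal sketch of this collation step, the Python below pools candidates from every index and then picks a single winner. The `Candidate` type, the `sort_key` criteria (newest version, then wheel over sdist), and the simplified tuple versions are all assumptions made for illustration; real installers use richer criteria and this is not pip's implementation.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Candidate(NamedTuple):
    # Hypothetical, simplified view of one distribution file.
    name: str
    version: Tuple[int, ...]  # stand-in for a parsed version
    is_wheel: bool
    index_url: str


def sort_key(candidate: Candidate):
    # Assumed optimization criteria: newest version first, then wheel over sdist.
    return (candidate.version, candidate.is_wheel)


def select_version_priority(
    name: str, indexes: Dict[str, List[Candidate]]
) -> Optional[Candidate]:
    # Collate the contents of every index into one pool *before* selecting.
    pool = [c for candidates in indexes.values() for c in candidates if c.name == name]
    # The origin index plays no part in the comparison.
    return max(pool, key=sort_key, default=None)
```

With an internal index offering version 1.0 and PyPI offering 2.0 of the same name, this sketch selects the PyPI file, which is the outcome that surprises users who expected their internal index to win.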
Version priority is most useful when all configured indexes are equally trusted and well-behaved regarding the distribution interchangeability assumption, that is, the assumption that a distribution with a given filename is the same regardless of which index serves it. Mirrors are especially well-behaved in this regard. That interchangeability assumption is what makes comparing distributions of a given package meaningful. Without it, the installer is no longer comparing “apples to apples.” In practice, it is common for an index to host files whose contents differ from those on other indexes, such as builds for special hardware, or differing metadata for the same package. Version priority behavior can lead to undesirable, unexpected outcomes in these cases, and this is where users generally look for some kind of index priority. Additionally, when there is a difference in trust among indexes, version priority does not provide a way to prefer more trusted indexes over less trusted indexes. This has been exploited by dependency confusion attacks, and PEP 708 was proposed as a way of hard-coding a notion of trusted external indexes into the index.
The “version priority” name is new, and introduction of new terms should always be minimized. This PEP looks toward the uv project, which refers to its implementation of the version priority behavior as “unsafe-best-match.” Naming is really hard here. On one hand, it isn’t accurate to call pip’s default behavior intrinsically “unsafe.” The addition of possibly malicious indexes is what introduces concern with this behavior. PEP 708 added a way to restrict installers from drawing packages from unexpected, potentially insecure indexes. On the other hand, the term “best-match” is technically correct, but also misleading. The “best match” varies by user and by application. “Best” is technically correct in the sense that it is a global optimum according to the match criteria specified above, but that is not necessarily what is “best” in a user’s eyes. “Version priority” is a proposed term that avoids the concerns with the uv terminology, while approximating the behavior in the most user-identifiable way that packages are compared.
“Index priority”
In index priority, the resolver finds candidates for each index, one at a time. The resolver proceeds to subsequent indexes only if the current package request has no viable candidates. Index priority does not combine indexes into one global, flat namespace. Because indexes are searched in order, the package from an earlier index will be preferred over a package from a later index, regardless of whether the later index had a better match with the installer’s optimization criteria. For a given installer, the optimization criteria and selection algorithm should be the same for both index priority and version priority. It is only the treatment of multiple indexes that differs: all together for version priority, and individually for index priority.
The order of specification of indexes determines their priority in the finding process. As a result, the way that installers load the index configuration must be predictable and reproducible. This PEP does not prescribe any particular mechanism, other than to say that installers should provide a way of ordering their collection of sources. Installers should also ideally provide optional debugging output that provides insight into which index is being considered.
Each package’s finder should start at the beginning of the list of indexes, so each package starts over with the index list. In other words, if one package has no valid candidates on the first index, but finds a hit on the second index, subsequent packages should still start their search on the first index, rather than starting on the second.
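The search order described above can be sketched as follows. This is an illustrative outline, not uv's or any installer's actual code: the `Index` mapping, the `minimum` version filter, and the tuple versions are assumptions chosen to keep the example self-contained.

```python
from typing import Dict, List, Optional, Tuple

Version = Tuple[int, ...]          # simplified stand-in for a parsed version
Index = Dict[str, List[Version]]   # package name -> versions available on that index


def find_index_priority(
    name: str,
    ordered_indexes: List[Tuple[str, Index]],
    minimum: Version = (0,),
) -> Optional[Tuple[str, Version]]:
    # Every package request starts again at the first (highest-priority) index.
    for url, contents in ordered_indexes:
        # Apply the usual filtering to this one index only.
        viable = [v for v in contents.get(name, []) if v >= minimum]
        if viable:
            # Stop here: later indexes are never consulted for this request,
            # even if they hold a newer version.
            return url, max(viable)
    return None
```

Because the function takes the full ordered list on every call, a package that was found on the second index does not change where the next package begins its search.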
One desirable behavior that the index priority strategy implies is that there are no “surprise” updates, where a version bump on a lower-priority index wins out over a curated, approved higher-priority index. This is related to the security improvement of PEP 708, where packages can restrict the external indexes that distributions can come from, but index priority is more configurable by end users. The package installs are only expected to change when either the higher-priority index or the index priority configuration changes. This stability and predictability make it more viable to configure indexes as a more persistent property of an environment, rather than a one-off argument for one install command.
Cache keys
Because index priority acknowledges the possibility that different indexes may have different content for a given package, caching and lockfiles should now include the index from which distributions were downloaded. Without this aspect, it is possible that after changing the list of configured indexes, the cache or lockfile could provide a similarly named distribution from a lower-priority index. If every index follows the recommended behavior of providing identical files across indexes for a given filename, this is not an issue. However, that recommendation is not readily enforceable, and augmenting the cache key with the origin index would be a wise defensive change.
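One way to fold the origin index into a cache key is sketched below. The `cache_key` function and its normalization (stripping a trailing slash) are hypothetical; no existing installer's cache layout is implied.

```python
import hashlib


def cache_key(index_url: str, filename: str) -> str:
    # Hypothetical normalization: treat ".../simple" and ".../simple/" as one index.
    normalized = index_url.rstrip("/")
    # Hash the origin index together with the filename, so the same filename
    # served by two different indexes occupies two distinct cache entries.
    material = f"{normalized}\n{filename}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()
```

Keyed this way, reordering or removing indexes cannot cause a cached file from a lower-priority index to be served in place of the one the current configuration would select.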
Ways that a request falls through to a lower priority index
- Package name is not present at all in higher priority index
- All distributions from higher priority index filtered out due to version specifier, compatible Python version, platform tag, yanking or otherwise
- A denylist configuration for the installer specifies that a particular package name should be ignored on a given index
- A higher priority index is unreachable (e.g. blocked by firewall rules, temporarily unavailable due to maintenance, other miscellaneous and temporary networking issues). This is a less clear-cut detail that should be controllable by users. On one hand, this behavior would lead to less predictable, likely unreproducible results by unexpectedly falling through to lower priority indexes. On the other hand, graceful fallback may be more valuable to some users, especially if they can safely assume that all of their indexes are equally trusted. pip’s behavior today is graceful fallback: you see warnings if an index is having connection issues, but the installation will proceed with any other available indexes. Because index priority can convey different trust levels between indexes, installers that implement index priority should default to raising errors and aborting on network issues. Installers may choose to provide a flag to allow fall-through to lower-priority indexes in case of network error.
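The default-to-error behavior for unreachable indexes, with an opt-in fallback, might look like the sketch below. The `fetch` callable, the `IndexUnreachableError` class, and the `allow_network_fallthrough` parameter are all invented for illustration and do not correspond to any existing installer option.

```python
from typing import Callable, List, Optional, Tuple


class IndexUnreachableError(Exception):
    """Hypothetical error raised when a configured index cannot be contacted."""


def search_indexes(
    name: str,
    ordered_indexes: List[str],
    fetch: Callable[[str, str], List[str]],
    allow_network_fallthrough: bool = False,
) -> Optional[Tuple[str, List[str]]]:
    for url in ordered_indexes:
        try:
            candidates = fetch(url, name)
        except ConnectionError as exc:
            if not allow_network_fallthrough:
                # Default: abort rather than silently use a lower-priority index.
                raise IndexUnreachableError(f"cannot reach {url}") from exc
            continue  # user explicitly opted in to graceful fallback
        if candidates:
            return url, candidates
        # An empty result (name absent or everything filtered out) falls through.
    return None
```

The distinction the sketch draws is between an index that answered with nothing usable, which always falls through, and an index that did not answer at all, which falls through only on request.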
Treatment within a given index follows existing behavior, but stops at the bounds of one index and moves on to the next index only after all priority preferences within the one index are exhausted. This means that existing priorities among the unified collection of packages apply to each index individually before falling through to a lower priority index.
There are tradeoffs to make at every level of the optimization criteria:
- version: index priority will use an older version from a higher-priority index even if a newer version is available on another index.
- wheel vs sdist: Should the installer use an sdist from a higher-priority index before trying a wheel from a lower-priority index?
- more platform-specific wheels before less specific ones: Should the installer use less specific wheels from higher-priority indexes before using more specific wheels from lower priority indexes?
- flags such as pip’s --prefer-binary: Should the installer use an sdist from a higher priority index before considering wheels on a lower priority index?
Installers are free to implement these priorities in different ways for themselves, but they should document their optimization criteria and how they handle fall-through to lower-priority indexes. For example, an installer could say that --prefer-binary should not install an sdist unless it had iterated through all configured indexes and found no installable binary candidates.
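The example policy in the preceding paragraph can be sketched as below. This is one possible interpretation only: the function name is hypothetical, candidates are represented by bare filenames that are assumed to be already filtered for compatibility, and a wheel is recognized simply by its `.whl` suffix.

```python
from typing import List, Optional, Tuple


def select_prefer_binary(
    ordered_indexes: List[Tuple[str, List[str]]],
) -> Optional[Tuple[str, str]]:
    """Pick a file for one package; each entry is (index URL, viable filenames)."""
    first_sdist = None
    for url, filenames in ordered_indexes:
        wheels = [f for f in filenames if f.endswith(".whl")]
        if wheels:
            # A binary on any index beats an sdist on a higher-priority index.
            return url, wheels[0]
        if first_sdist is None and filenames:
            # Remember the highest-priority sdist as the last resort.
            first_sdist = (url, filenames[0])
    # Only after every index has been checked is an sdist accepted.
    return first_sdist
```

An installer that chose the opposite policy would return the first index with any viable file; either choice is defensible as long as it is documented.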
Mirroring
As described thus far, the index priority scheme breaks the use case of more than one index url serving the same content. Such mirrors may be used with the intent of ameliorating network issues or otherwise improving reliability. One approach that installers could take to preserve mirroring functionality while adding index priority would be to add a notion of user-definable index groups, where each index in the group is assumed to be equivalent. This is related to Poetry’s notion of package sources, except that this would allow arbitrary numbers of prioritizable groups, and that this would assume members of a group to be mirrors. Within each group, content could be combined, or each member could be fetched concurrently. The fastest responding index would then represent the group.
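A rough sketch of the index group idea follows, using the "combine content within a group" option. The group structure, the `fetch` callable, and the choice to raise when no member of a group responds are assumptions for illustration; concurrent fetching is omitted for brevity.

```python
from typing import Callable, List, Optional, Tuple

Version = Tuple[int, ...]  # simplified stand-in for a parsed version


def find_in_groups(
    name: str,
    groups: List[List[str]],
    fetch: Callable[[str, str], List[Version]],
) -> Optional[Tuple[int, Version]]:
    """Return (group position, chosen version); groups are in priority order."""
    for position, group in enumerate(groups):
        merged = set()
        reachable = 0
        for url in group:
            try:
                # Members of a group are assumed to be mirrors, so merging is safe.
                merged.update(fetch(url, name))
                reachable += 1
            except ConnectionError:
                continue  # another mirror in the same group may still answer
        if reachable == 0:
            # Assumed policy: a fully unreachable group is an error, not a fall-through.
            raise ConnectionError(f"no mirror in group {position} is reachable")
        if merged:
            return position, max(merged)
    return None
```

Priority applies between groups, while members inside a group are treated as interchangeable, which preserves the reliability benefit of mirrors.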
Backwards Compatibility
This PEP does not prescribe any changes as mandatory for any installer, so it only introduces compatibility concerns if tools choose to adopt an index behavior other than the behavior(s) they currently implement.
This PEP’s language does not quite align with existing tools, including pip and uv. Either this PEP’s language can change during review of this PEP, or if this PEP’s language is preferred, other projects could conform to it. The only goal of proposing these terms is to create a central, common vocabulary that makes it easier for users to learn about other installers.
As some tools rely on one or the other behavior, there are some possible issues that may emerge, where tailoring available resources/packages for a particular behavior may detract from the user experience for people who rely on the other behavior.
- Different indexes may have different metadata. For example, one cannot assume that the metadata for package “something” on index “A” has the same dependencies as “something” on index “B”. This breaks fundamental assumptions of version priority, but index priority can handle this. When an installer falls through to a lower-priority index in the search order, it implies refreshing the package metadata from the new index. This is both an improvement and a complication. It is a complication in the sense that a cached metadata entry must be keyed by both package name and index url, instead of just package name. It is a potential improvement in that different implementation variants of a package can differ in dependencies as long as their distributions are separated into different indexes.
- Users may not get updates as they expect when using index priority, because some higher priority index has not updated/synchronized with PyPI to get the latest packages. If the higher priority index has a valid candidate, newer packages will not be found. This will need to be communicated verbosely, because it is counter to pip’s well-established behavior.
- By adding index priority, an installer will improve the predictability of which index will be selected, and index hosts may abuse this as a way of having similarly named files that have different contents. With version priority, this violates the key package interchangeability assumption, and insanity will ensue. Index priority would be more workable, but the situation still has great potential for confusion. It would be helpful to develop tools that support installers in identifying these confusing issues. These tools could operate independently of the installer process, as a means of validating the sanity of a set of indexes. Depending on the time cost of these tools, the installers could run them as part of their process. Users could, of course, ignore the recommendations at their own risk.
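As a sketch of the kind of validation tool mentioned in the last item, the function below flags filenames that appear on more than one index with different hashes. The input shape (index URL mapped to filename and SHA-256 digest) and the function name are hypothetical; how a real tool would gather the listings is left open.

```python
from collections import defaultdict
from typing import Dict, List


def find_conflicting_files(
    index_listings: Dict[str, Dict[str, str]],
) -> Dict[str, Dict[str, List[str]]]:
    """Map each conflicting filename to {digest: [index URLs serving that digest]}."""
    seen: Dict[str, Dict[str, List[str]]] = defaultdict(dict)
    for url, files in index_listings.items():
        for filename, digest in files.items():
            seen[filename].setdefault(digest, []).append(url)
    # A filename with more than one distinct digest violates interchangeability.
    return {name: by_digest for name, by_digest in seen.items() if len(by_digest) > 1}
```

An empty result suggests that the configured indexes are interchangeable for the files they share; any entry identifies a file whose contents depend on which index is consulted.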
Security Implications
Index priority creates a mechanism for users to explicitly specify a trust hierarchy among their indexes. As such, it limits the potential for dependency confusion attacks. Index priority was rejected by PEP 708 as a solution for dependency confusion attacks. This PEP requests that the rejection be reconsidered, with index priority serving a different purpose. This PEP is primarily motivated by the desire to support implementation variants, which is the subject of another discussion that hopefully leads to a PEP. It is not mutually exclusive with PEP 708, nor does it suggest reverting or withdrawing PEP 708. It is an answer to how we could allow users to choose which index to use at a more fine grained level than “per install”.
For a more thorough discussion of the PEP 708 rejection of index priority, please see the discuss.python.org thread for this PEP.
How to Teach This
At the outset, the goal is not to convert pip or any other tool to change its default priority behavior. The best way to teach is perhaps to watch message boards, GitHub issue trackers and chat channels, keeping an eye out for problems that index priority could help solve. There are several long-standing discussions that would be good places to start advertising the concepts. The topics of the two officially supported behaviors need documentation, and we, the authors of this PEP, would develop these as part of the review period of this PEP. These docs would likely consist of additions across several indexes, cross-linking the concepts between installers. At a minimum, we expect to add to the PyPUG and to pip’s documentation.
It will be important for installers to advertise the active behavior, especially in error messaging, which also gives them an opportunity to point users to resources about these behaviors.
uv users are already experiencing index priority. uv documents this behavior well, but it is always possible to improve the discoverability of that documentation from the command line, where users will actually encounter the unexpected behavior.
Reference Implementation
The uv project demonstrates index priority with its default behavior. uv is implemented in Rust, though, so if a reference implementation for a Python-based tool is necessary, we, the authors of this PEP, will provide one. For pip in particular, we see the implementation plan as something like:
- For users who don’t use --extra-index-url or --find-links, there will be no change, and no migration is necessary.
- pip users would be able to opt in to the index priority behavior with a new config setting in the CLI and in pip.conf. This proposal does not recommend any strategy as the default for any installer. It only recommends documenting the strategies that a tool provides.
- Enable extra info-level output for any pip operation where more than one index is used. In this output, state the current strategy setting, and a terse summary of implied behavior, as well as a link to docs that describe the different options.
- Add debugging output that verbosely identifies the index being used at each step, including where the file is in the configuration hierarchy, and where it is being included (via config file, env var, or CLI flag).
- Plumb tracking of which index gets used for which package/distribution through the entire pip install process. Store this information so that it is available to tools like pip freeze.
- Supplement PEP 751 (lockfiles) with capture of index where a package/distribution came from.
Rejected Ideas
- Tell users to set up a proxy/mirror, such as devpi or Artifactory, that serves local files if present, and forwards to another server (PyPI) if no local files match
This matches the behavior of this proposal very closely, except that this method requires hosting some server, and may be inaccessible or not configurable to users in some environments. It is also important to consider that for an organization that operates its own index (for overcoming PyPI size restrictions, for example), this does not solve the need for --extra-index-url or proxy/mirror for end users. That is, organizations get no improvement from this approach unless they proxy/mirror PyPI as a whole, and get users to configure their proxy/mirror as their sole index.
- Are build tags and/or local version specifiers enough?
Build tags and local version specifiers will take precedence over packages without those tags and/or local version specifiers. In a pool of packages, builds that have these additions hosted on a server other than PyPI will take priority over packages on PyPI, which rarely use build tags, and forbid local version specifiers. This approach is viable when package providers want to provide their own local override, such as HPC maintainers who provide optimized builds for their users. It is less viable in some ways, such as build tags not showing up in pip freeze metadata, and local version specifiers not being allowed on PyPI. There is also significant work entailed in building and maintaining package collections with local build tag variants. See https://discuss.python.org/t/dependency-notation-including-the-index-url/5659/21
- What about PEP 708? Isn’t that enough?
PEP 708 is aimed specifically at addressing dependency confusion attacks, and doesn’t address the potential for implementation variants among indexes. It is a way of filtering external URLs and encoding an allow-list for external indexes in index metadata. It does not change the lack of priority or preference among channels that currently exists.
- Namespacing
Namespacing is a means of specifying a package such that the Python usage of the package does not change, but the package installation restricts where the package comes from. PEP 752 recently proposed a way to multiplex a package’s owners in a flat package namespace (e.g. PyPI) by reserving prefixes as grouping elements. NPM’s concept of “scopes” has been raised as another good example of how this might look. This PEP differs in that it is targeted at multiple indexes, not a flat package namespace. The net effect is roughly the same in terms of predictably choosing a particular package source, except that the namespacing approach relies more on naming packages with these namespace prefixes, whereas this PEP would be less granular, pulling in packages on whatever higher-priority index the user specifies. The namespacing approach relies on all configured indexes treating a given namespace similarly, which leaves the usual concern that not all configured indexes are trusted equally. The namespace idea is not incompatible with this PEP, but it also does not improve expression of trust of indexes in the way that this PEP does.
Open Issues
[Any points that are still being decided/discussed.]
Acknowledgements
This work was supported financially by NVIDIA through employment of the author. NVIDIA teammates dramatically improved this PEP with their input. Astral Software pioneered the behaviors of index priority and thus laid the foundation of this document. The pip authors deserve great praise for their consistent direction and patient communication of the version priority behavior, especially in the face of contentious security concerns.
Copyright
This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.
Source: https://github.com/python/peps/blob/main/peps/pep-0766.rst
Last modified: 2024-11-21 20:00:24 GMT