PEP: 708
Title: Extending the Repository API to Mitigate Dependency Confusion Attacks
Author: Donald Stufft <donald@stufft.io>
PEP-Delegate: Paul Moore <p.f.moore@gmail.com>
Discussions-To: https://discuss.python.org/t/24179
Status: Provisional
Type: Standards Track
Topic: Packaging
Content-Type: text/x-rst
Created: 20-Feb-2023
Post-History: 01-Feb-2023, 23-Feb-2023
Resolution: https://discuss.python.org/t/24179/72

Provisional Acceptance

This PEP has been provisionally accepted, with the following required
conditions before the PEP is made Final:

1.  An implementation of the PEP in PyPI (Warehouse) including any
    necessary UI elements to allow project owners to set the tracking
    data.
2.  An implementation of the PEP in at least one repository other than
    PyPI, as you can’t really test merging indexes without at least two
    indexes.
3.  An implementation of the PEP in pip, which supports the intended
    semantics and can be used to demonstrate that the expected security
    benefits are achieved. This implementation will need to be "off by
    default" initially, which means that users will have to opt in to
    testing it. Ideally, we should collect explicit positive reports
    from users (both project owners and project users) who have
    successfully tried out the new feature, rather than just assuming
    that "no news is good news".

Abstract

Dependency confusion attacks, in which a malicious package is installed
instead of the one the user expected, are an increasingly common supply
chain threat. Most such attacks against Python dependencies, including
the recent PyTorch incident, occur with multiple package repositories,
where a dependency expected to come from one repository (e.g. a custom
index) is installed from another (e.g. PyPI).

To help address this problem, this PEP proposes extending the
Simple Repository API to allow repository operators to indicate that a
project found on their repository "tracks" a project on different
repositories, and to allow projects to extend their namespaces across
multiple repositories.

These features will allow installers to determine when a project being
made available from a particular mix of repositories is expected and
should be allowed, and when it is not and should halt the install with
an error to protect the user.

Motivation

There is a long-standing class of attacks called "dependency
confusion" attacks, which roughly boil down to an individual user
expecting to get package A but instead getting package B. In Python,
this almost always happens due to the configuration of multiple
repositories (possibly including the default of PyPI), where the user
expected package A to come from repository X, but someone is able to
publish package B to repository Y under the same name.

Dependency confusion attacks have long been possible, but they have
recently gained press through public examples of attacks that were
successfully executed.

A specific example of this is the recent case where the PyTorch project
had an internal package named torchtriton which was only ever intended
to be installed from their repositories located at
https://download.pytorch.org/, but that repository was designed to be
used in conjunction with PyPI, and the name of torchtriton was not
claimed on PyPI, which allowed the attacker to use that name and publish
a malicious version.

There are a number of ways to mitigate these attacks today, but they
all require that the end user go out of their way to protect
themselves, rather than being protected by default. This means that the
vast bulk of users are likely to remain vulnerable, even if they are
aware of these types of attacks.

Ultimately the underlying cause of these attacks is that there is no
globally unique namespace that all Python package names come from.
Instead, each repository is its own distinct namespace, and when
given an "abstract" name such as spam to install, an installer has to
implicitly turn that into a "concrete" name such as pypi.org:spam or
example.com:spam. Currently the standard behavior in Python installation
tools is to implicitly flatten these multiple namespaces into one that
contains the files from all namespaces.

This assumption that collapsing the namespaces is what was expected
means that when packages with the same name in different repositories
are authored by different parties (such as in the torchtriton case)
dependency confusion attacks become possible.

This is made particularly tricky in that there is no "right" answer;
there are valid use cases both for wanting two repositories merged into
one namespace and for wanting two repositories to be treated as distinct
namespaces. This means that an installer needs some mechanism by which
to determine when it should merge the namespaces of multiple
repositories and when it should not, rather than a blanket always merge
or never merge rule.

This functionality could be pushed directly to the end user, since
ultimately the end user is the person whose expectations of what gets
installed from what repository actually matter. However, by extending
the repository specification to allow a repository to indicate when it
is safe, we can enable individual projects and repositories to "work by
default", even when their project naturally spans multiple distinct
namespaces, while maintaining the ability for an installer to be secure
by default.

On its own, this PEP does not solve dependency confusion attacks, but
what it does do is provide enough information so that installers can
prevent them without causing too much collateral damage to otherwise
valid and safe use cases.

Rationale

There are two broad use cases for merging names across repositories that
this PEP seeks to enable.

The first use case is when one repository is not defining its own names,
but rather is extending names defined in other repositories. This
commonly happens in cases where a project is being mirrored from one
repository to another (see Bandersnatch) or when a repository is
providing supplementary artifacts for a specific platform (see
Piwheels).

In this case neither the repositories nor the projects that are being
extended may have any knowledge that they are being extended or by whom,
so this cannot rely on any information that isn't present in the
"extending" repository itself.

The second use case is when the project wants to publish to one "main"
repository, but then have additional repositories that provide binaries
for additional platforms, GPUs, CPUs, etc. Currently wheel tags are not
sufficiently able to express these types of binary compatibility, so
projects that wish to rely on them are forced to set up multiple
repositories and have their users manually configure them to get the
correct binaries for their platform, GPU, CPU, etc.

This use case is similar to the first, but the important difference that
makes it a distinct use case on its own is who is providing the
information and what their level of trust is.

When a user configures a specific repository (or relies on the default)
there is no ambiguity as to what repository they mean. A repository is
identified by a URL, and through the domain system, URLs are globally
unique identifiers. This lack of ambiguity means that an installer can
assume that the repository operator is trustworthy and can trust
metadata that they provide without needing to validate it.

On the flip side, when an installer finds a name in multiple
repositories, it is ambiguous which of them the installer should trust.
This ambiguity means that an installer cannot assume that the project
owner on either repository is trustworthy and needs to validate that
they are indeed the same project and that one isn't a dependency
confusion attack.

Without some way for the installer to validate the metadata between
multiple repositories, projects would be forced into becoming repository
operators to safely support this use case. That wouldn't be a
particularly wrong choice to make; however, there is a danger that if we
don't provide a way for repositories to let project owners express this
relationship safely, they will be incentivized to let them use the
repository operator's metadata instead which would reintroduce the
original insecurity.

Specification

This specification defines the changes in version 1.2 of the simple
repository API, adding two new metadata items: Repository "Tracks"
and "Alternate Locations".

Repository "Tracks" Metadata

To enable one repository to host a project that is intended to "extend"
a project that is hosted at other repositories, this PEP allows the
extending repository to declare that a particular project "tracks" a
project at another repository or repositories by adding the URLs of the
project and repositories that it is extending.

This is exposed in JSON as the key meta.tracks and in HTML as a meta
element named pypi:tracks on the project-specific URLs
($root/$project/).

There are a few key properties that MUST be preserved when using this
metadata:

-   It MUST be under the control of the repository operators themselves,
    not any individual publisher using that repository.
    -   "Repository Operator" can also include anyone who manages the
        overall namespace for a particular repository, which may be the
        case in situations like hosted repository services where one
        entity operates the software but another owns/manages the entire
        namespace of that repository.
-   All URLs MUST represent the same "project" as the project in the
    extending repository.
    -   This does not mean that they need to serve the same files. It is
        valid for them to include binaries built on different platforms,
        copies with local patches being applied, etc. This is
        purposefully left vague as it's ultimately up to the
        expectations that the users have of the repository and its
        operators what exactly constitutes the "same" project.
-   It MUST point to the repositories that "own" the namespaces, not
    another repository that is also tracking that namespace.
-   It MUST point to a project with the exact same name (after
    normalization).
-   It MUST point to the actual URLs for that project, not the base URL
    for the extended repositories.

It is NOT required that every name in a repository tracks the same
repository, or that they all track a repository at all. Mixed use
repositories where some names track a repository and some names do not
are explicitly allowed.
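
Two of the properties above (same name after normalization, and a
project URL rather than a repository base URL) can be checked
mechanically. The following is a minimal sketch, not a normative check:
it assumes project pages live at $root/$project/ and uses the PEP 503
normalization rule. The helper names are hypothetical and not part of
any existing tool.

```python
import re
from urllib.parse import urlsplit


def normalize(name):
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def tracked_url_matches_project(tracked_url, project_name):
    """Return True if the last path segment of tracked_url names the project.

    Hypothetical helper: it assumes the $root/$project/ layout, so the last
    non-empty path segment is taken to be the project name.
    """
    segments = [s for s in urlsplit(tracked_url).path.split("/") if s]
    return bool(segments) and normalize(segments[-1]) == normalize(project_name)
```

A repository operator could run a check like this before publishing
tracks metadata. It cannot verify the remaining properties, such as
whether the target actually "owns" the namespace.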

JSON

    {
      "meta": {
        "api-version": "1.2",
        "tracks": ["https://pypi.org/simple/holygrail/", "https://test.pypi.org/simple/holygrail/"]
      },
      "name": "holygrail",
      "files": [
        {
          "filename": "holygrail-1.0.tar.gz",
          "url": "https://example.com/files/holygrail-1.0.tar.gz",
          "hashes": {"sha256": "...", "blake2b": "..."},
          "requires-python": ">=3.7",
          "yanked": "Had a vulnerability"
        },
        {
          "filename": "holygrail-1.0-py3-none-any.whl",
          "url": "https://example.com/files/holygrail-1.0-py3-none-any.whl",
          "hashes": {"sha256": "...", "blake2b": "..."},
          "requires-python": ">=3.7",
          "dist-info-metadata": true
        }
      ]
    }

HTML

    <!DOCTYPE html>
    <html>
      <head>
        <meta name="pypi:repository-version" content="1.2">
        <meta name="pypi:tracks" content="https://pypi.org/simple/holygrail/">
        <meta name="pypi:tracks" content="https://test.pypi.org/simple/holygrail/">
      </head>
      <body>
        <a href="https://example.com/files/holygrail-1.0.tar.gz#sha256=...">holygrail-1.0.tar.gz</a>
        <a href="https://example.com/files/holygrail-1.0-py3-none-any.whl#sha256=...">holygrail-1.0-py3-none-any.whl</a>
      </body>
    </html>
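
As an illustration of how a client might read this metadata from either
representation, here is a minimal sketch using only the Python standard
library. The function and class names are hypothetical; only the
meta.tracks key and the pypi:tracks meta element name come from this
specification.

```python
import json
from html.parser import HTMLParser


class _MetaCollector(HTMLParser):
    """Collect the content of every <meta> element with a given name."""

    def __init__(self, wanted_name):
        super().__init__()
        self.wanted_name = wanted_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == self.wanted_name and attr_map.get("content"):
            self.values.append(attr_map["content"])


def tracks_from_json(body):
    """Return the list stored under meta.tracks, or [] if it is absent."""
    return list(json.loads(body).get("meta", {}).get("tracks", []))


def tracks_from_html(body):
    """Return the content of every pypi:tracks meta element, in document order."""
    collector = _MetaCollector("pypi:tracks")
    collector.feed(body)
    return collector.values
```

Both functions return an empty list when a project page declares no
tracks metadata, which matches the mixed-use case where only some names
on a repository track another repository.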

"Alternate Locations" Metadata

To enable a project to extend its namespace across multiple
repositories, this PEP allows a project owner to declare a list of
"alternate locations" for their project. This is exposed in JSON as the
key alternate-locations and in HTML as a meta element named
pypi:alternate-locations, which may be used multiple times.

There are a few key properties that MUST be observed when using this
metadata:

-   In order for this metadata to be trusted, there MUST be agreement
    between all locations where that project is found as to what the
    alternate locations are.
-   When using alternate locations, clients MUST implicitly assume that
    the URL the response was fetched from was included in the list. This
    means that if you fetch from https://pypi.org/simple/foo/ and it has
    an alternate-locations metadata that has the value
    ["https://example.com/simple/foo/"], then you MUST treat it as if it
    had the value
    ["https://example.com/simple/foo/", "https://pypi.org/simple/foo/"].
-   Order of the elements within the array does not have any particular
    meaning.

When an installer encounters a project that is using the alternate
locations metadata it SHOULD consider that all repositories named are
extending the same namespace across multiple repositories.
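
The implicit-inclusion and agreement rules above can be sketched as
follows. This is one possible reading, not a normative algorithm: it
assumes that "agreement" means every location reports exactly the same
effective set, and it compares URLs as plain strings without any
normalization. The function names are hypothetical.

```python
def effective_alternate_locations(fetched_url, declared):
    """Declared alternate locations plus the URL the response was fetched from."""
    return frozenset(declared) | {fetched_url}


def alternate_locations_agree(responses):
    """Check agreement across every location where the project was found.

    `responses` maps each fetched project URL to the alternate-locations
    list declared there. Assumption: agreement means all effective sets are
    identical. Because each effective set contains its own fetched URL,
    identical sets necessarily cover every URL in `responses`.
    """
    effective = {
        effective_alternate_locations(url, declared)
        for url, declared in responses.items()
    }
    return len(effective) == 1
```

For example, if https://pypi.org/simple/foo/ declares
["https://example.com/simple/foo/"] and the example.com page declares
["https://pypi.org/simple/foo/"], both effective sets are equal and the
check passes. If only one side declares the other, the sets differ and
the metadata is not trusted.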

Note

This alternate locations metadata is project level metadata, not
artifact level metadata, which means it doesn't get included as part of
the core metadata spec, but rather it is something that each repository
will have to provide a configuration option for (if they choose to
support it).

JSON

    {
      "meta": {
        "api-version": "1.2"
      },
      "name": "holygrail",
      "alternate-locations": ["https://pypi.org/simple/holygrail/", "https://test.pypi.org/simple/holygrail/"],
      "files": [
        {
          "filename": "holygrail-1.0.tar.gz",
          "url": "https://example.com/files/holygrail-1.0.tar.gz",
          "hashes": {"sha256": "...", "blake2b": "..."},
          "requires-python": ">=3.7",
          "yanked": "Had a vulnerability"
        },
        {
          "filename": "holygrail-1.0-py3-none-any.whl",
          "url": "https://example.com/files/holygrail-1.0-py3-none-any.whl",
          "hashes": {"sha256": "...", "blake2b": "..."},
          "requires-python": ">=3.7",
          "dist-info-metadata": true
        }
      ]
    }

HTML

    <!DOCTYPE html>
    <html>
      <head>
        <meta name="pypi:repository-version" content="1.2">
        <meta name="pypi:alternate-locations" content="https://pypi.org/simple/holygrail/">
        <meta name="pypi:alternate-locations" content="https://test.pypi.org/simple/holygrail/">
      </head>
      <body>
        <a href="https://example.com/files/holygrail-1.0.tar.gz#sha256=...">holygrail-1.0.tar.gz</a>
        <a href="https://example.com/files/holygrail-1.0-py3-none-any.whl#sha256=...">holygrail-1.0-py3-none-any.whl</a>
      </body>
    </html>

Recommendations

This section is non-normative; it provides recommendations to installers
on how to interpret this metadata in a way that this PEP believes offers
the best tradeoff between protecting users by default and minimizing
breakage to existing workflows. These recommendations are not binding,
and installers are free to ignore them, or apply them selectively as
they make sense in their specific situations.

File Discovery Algorithm

Note

This algorithm is written based on how pip currently discovers files;
other installers may adapt this based on their own discovery procedures.

Currently the "standard" file discovery algorithm looks something like
this:

1.  Generate a list of all files across all configured repositories.
2.  Filter out any files that do not match known hashes from a lockfile
    or requirements file.
3.  Filter out any files that do not match the current platform, Python
    version, etc.
4.  Pass that list of files into the resolver where it will attempt to
    resolve the "best" match out of those files, irrespective of which
    repository it came from.

It is recommended that installers change their file discovery algorithm
to take into account the new metadata, and instead do:

1.  Generate a list of all files across all configured repositories.
2.  Filter out any files that do not match known hashes from a lockfile
    or requirements file.
3.  If the end user has explicitly told the installer to fetch the
    project from specific repositories, filter out all other
    repositories and skip to 5.
4.  Look to see if the discovered files span multiple repositories; if
    they do then determine if either "Tracks" or "Alternate Locations"
    metadata allows safely merging ALL of the repositories where files
    were discovered together. If that metadata does NOT allow that, then
    generate an error, otherwise continue.
    -   Note: This only applies to remote repositories; repositories
        that exist on the local filesystem SHOULD always be implicitly
        allowed to be merged to any remote repository.
5.  Filter out any files that do not match the current platform, Python
    version, etc.
6.  Pass that list of files into the resolver where it will attempt to
    resolve the "best" match out of those files, irrespective of what
    repository it came from.

This is somewhat subtle, but the key things in the recommendation are:

-   Users who are using lock files or requirements files that include
    specific hashes of artifacts that are "valid" are assumed to be
    protected by nature of those hashes, since the rest of these
    recommendations would apply during hash generation. Thus, we filter
    out unknown hashes up front.
-   If the user has explicitly told the installer that it wants to fetch
    a project from a certain set of repositories, then there is no
    reason to question that and we assume that they've made sure it is
    safe to merge those namespaces.
-   If the project in question only comes from a single repository, then
    there is no chance of dependency confusion, so there's no reason to
    do anything but allow.
-   We check for the metadata in this PEP before filtering out based on
    platform, Python version, etc., because we don't want errors that
    only show up on certain platforms, Python versions, etc.
-   If nothing tells us merging the namespaces is safe, we refuse to
    implicitly assume it is, and generate an error instead.
-   Otherwise we merge the namespaces, and continue on.

This algorithm ensures that an installer never assumes that two
disparate namespaces can be flattened into one, which for all practical
purposes eliminates the possibility of any kind of dependency confusion
attack, while still giving power throughout the stack in a safe way to
allow people to explicitly declare when those disparate namespaces are
actually one logical namespace that can be safely merged.

The above algorithm is mostly a conceptual model. In reality the
algorithm may end up being slightly different in order to be more
privacy preserving and faster, or even just adapted to fit a specific
installer better.
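
To make the conceptual model concrete, the recommended steps could be
sketched roughly as below. Everything here is an assumption made for
illustration: the Candidate and ProjectPage types, the discover and
can_merge functions, and in particular the merge rule itself, since
this PEP does not prescribe exactly how an installer decides that the
metadata "allows safely merging". The rule used here accepts a merge
when all pages agree on their alternate locations, or when all pages
resolve to at least one common owning project URL via tracks.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Candidate:
    filename: str
    sha256: str
    project_url: str         # project page the file was listed on
    compatible: bool = True  # stand-in for platform / Python version checks


@dataclass
class ProjectPage:
    tracks: list = field(default_factory=list)
    alternate_locations: list = field(default_factory=list)


class NamespaceMergeError(Exception):
    """Raised when files span repositories that are not declared mergeable."""


def can_merge(urls, pages):
    """Assumed merge rule; the exact check is left to installers."""
    if len(urls) <= 1:
        return True
    # Alternate Locations: every page reports the same effective set
    # (declared locations plus the URL the page was fetched from).
    effective = {frozenset(pages[u].alternate_locations) | {u} for u in urls}
    if len(effective) == 1:
        return True
    # Tracks: a page that tracks nothing is treated as owning its name;
    # all pages must share at least one owning project URL.
    owners = [set(pages[u].tracks) or {u} for u in urls]
    return bool(set.intersection(*owners))


def discover(candidates, pages, known_hashes=None, pinned_urls=None,
             local_urls=frozenset()):
    files = list(candidates)  # step 1: files from all configured repositories
    if known_hashes is not None:  # step 2: keep only files with known hashes
        files = [f for f in files if f.sha256 in known_hashes]
    if pinned_urls is not None:  # step 3: explicit user choice, skip the check
        files = [f for f in files if f.project_url in pinned_urls]
    else:  # step 4: refuse to merge remote repositories without metadata
        remote = {f.project_url for f in files} - set(local_urls)
        if not can_merge(remote, pages):
            raise NamespaceMergeError(sorted(remote))
    # step 5: platform / Python version filtering
    files = [f for f in files if f.compatible]
    return files  # step 6: hand the merged list to the resolver
```

Note that the merge check runs before the compatibility filter, so the
error does not depend on the platform or Python version in use, and
that local repositories are excluded from the check entirely.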

Explicit Configuration for End Users

This PEP avoids dictating or recommending a specific mechanism by which
an installer allows an end user to configure exactly what repositories
they want a specific package to be installed from. However, it does
recommend that installers provide some mechanism for end users to
supply that configuration, as without it users can end up in a
denial-of-service situation in cases like torchtriton, where installs
are completely broken unless they resolve the namespace collision
externally (get the name taken down on one repository, stand up a
personal repository that handles the merging, etc.).

This configuration also allows end users to pre-emptively secure
themselves during what is likely to be a long transition until the
default behavior is safe.

How to Communicate This

Note

This example is pip specific and assumes specifics about how pip will
choose to implement this PEP; it's included as an example of how we can
communicate this change, and not intended to constrain pip or any other
installer in how they implement this. This may ultimately be the actual
basis for communication, and if so will need to be edited for accuracy
and clarity.

This section should be read as if it were an entire "post" to
communicate this change that could be used for a blog post, email, or
discourse post.

There's a long-standing class of attacks called "dependency
confusion" attacks, which roughly boil down to an individual expecting
to get package A but instead getting package B. In Python, this almost
always happens due to the end user having configured multiple
repositories, where they expect package A to come from repository X,
but someone is able to publish package B with the same name as package
A in repository Y.

There are a number of ways to mitigate these attacks today, but they
all require that the end user explicitly go out of their way to protect
themselves, rather than it being inherently safe.

In an effort to secure pip's users and protect them from these types of
attacks, we will be changing how pip discovers packages to install.

What is Changing?

When pip discovers that the same project is available from multiple
remote repositories, by default it will generate an error and refuse to
proceed rather than make a guess about which repository was the correct
one to install from.

Projects that natively publish to multiple repositories will be given
the ability to safely "link" their repositories together so that pip
does not error when those repositories are used together.

End users of pip will be given the ability to explicitly define one or
more repositories that are valid for a specific project, causing pip to
only consider those repositories for that project, and avoiding
generating an error altogether.

See TBD for more information.

Who is Affected?

Users who are installing from multiple remote (i.e. not present on the
local filesystem) repositories may be affected by having pip error
instead of successfully installing if:

-   They install a project where the same "name" is being served by
    multiple remote repositories.
-   The project name that is available from multiple remote repositories
    has not used one of the defined mechanisms to link those
    repositories together.
-   The user invoking pip has not used the defined mechanism to
    explicitly control what repositories are valid for a particular
    project.

Users who are not using multiple remote repositories will not be
affected at all, which includes users who are only using a single remote
repository, plus a local filesystem "wheel house".

What do I need to do?

As a pip User?

If you're using only a single remote repository you do not have to do
anything.

If you're using multiple remote repositories, you can opt into the new
behavior by adding --use-feature=TBD to your pip invocation to see if
any of your dependencies are being served from multiple remote
repositories. If they are, you should audit them to determine why they
are, and what the best remediation step will be for you.

Once this behavior becomes the default, you can opt out of it
temporarily by adding --use-deprecated=TBD to your pip invocation.

If you're using projects that are not hosted on a public repository, but
you still have the public repository as a fallback, consider configuring
pip with a repository file that states explicitly where that dependency
is meant to come from, so that registration of that name in a public
repository cannot cause pip to error for you.

As a Project Owner?

If you only publish your project to a single repository, then you do not
have to do anything.

If you publish your project to multiple repositories that are intended
to be used together at the same time, configure all repositories to
serve the alternate locations metadata to prevent breakages for your
end users.

If you publish your project to a single repository, but it is commonly
used in conjunction with other repositories, consider preemptively
registering your names with those repositories to prevent a third party
from being able to cause your users' pip install invocations to start
failing. This may not be available if your project name is too generic
or if the repositories have policies that prevent defensive name
squatting.

As a Repository Operator?

You'll need to decide how you intend for your repository to be used by
your end users and how you want them to use it.

For private repositories that host private projects, it is recommended
that you mirror the public projects that your users depend on into your
own repository, taking care not to let a public project merge with a
private project, and tell your users to use the --index-url option to
use only your repository.

For public repositories that host public projects, you should implement
the alternate locations mechanism and enable the owners of those
projects to configure the list of repositories that their project is
available from if they make it available from more than one repository.

For public repositories that "track" another repository, but provide
supplemental artifacts such as wheels built for a specific platform, you
should implement the "tracks" metadata for your repository. However,
this information MUST NOT be settable by end users who are publishing
projects to your repository. See TBD for more information.

Rejected Ideas

Note: Some of these are somewhat specific to pip, but any solution that
doesn't work for pip isn't a particularly useful solution.

Implicitly allow mirrors when the list of files is the same

If every repository returns the exact same list of files, then it is
safe to consider those repositories to be the same namespace and
implicitly merge them. This would possibly mean that mirrors would be
automatically allowed without any work on any user or repository
operator's part.

Unfortunately, this has two failings that make it undesirable:

-   It only solves the case of mirrors that are exact copies of each
    other, but not repositories that "track" another one, and
    supporting tracking is the more general solution.
-   Even in the case of exact mirrors, multiple repositories mirroring
    each other form a distributed system that will not always be fully
    consistent, making it effectively an eventually consistent system.
    This means that repositories that relied on this implicit heuristic
    to work would have sporadic failures due to drift between the
    source repository and the mirror repositories.

Provide a mechanism to order the repositories

Providing some mechanism to give the repositories an order, and then
short circuiting the discovery algorithm when it finds the first
repository that provides files for that project is another workable
solution that is safe if the order is specified correctly.

However, this has been rejected for a number of reasons:

-   We've spent 15+ years educating users that the ordering of
    repositories being specified is not meaningful, and they effectively
    have an undefined order. It would be difficult to backpedal on that
    and start saying that now order matters.
-   Users can easily rearrange the order that they specify their
    repositories in within a single location, but when loading
    repositories from multiple locations (env var, conf file,
    requirements file, cli arguments) the order is hard coded into pip.
    While it would be a deterministic and documented order, there's no
    reason to assume it's the order that the user wants their
    repositories to be defined in, forcing them to contort how they
    configure pip so that the implicit ordering ends up being the
    correct one.
-   The above can be mitigated by providing a way to explicitly declare
    the order rather than by implicitly using the order they were
    defined in; however, that then means that the protections are not
    provided unless the user does some explicit configuration.
-   Ordering assumes that one repository is always preferred over
    another repository without any way to decide on a project by project
    basis.
-   Relying on ordering is subtle; if I look at an ordering of
    repositories, I have no way of knowing or ensuring in advance what
    names are going to come from what repositories. I can only know in
    that moment what names are provided by which repositories.
-   Relying on ordering is fragile. There's no reason to assume that two
    disparate repositories are not going to have random naming
    collisions—what happens if I'm using a library from a lower priority
    repository and then a higher priority repository happens to start
    having a colliding name?
-   In cases where ordering does the wrong thing, it does so silently,
    with no feedback given to the user. This is by design because it
    doesn't actually know what the wrong or right thing is, it's just
    hoping that order will give the right thing, and if it does then
    users are protected without any breakage. However, when it does the
    wrong thing, users are left with a very confusing behavior coming
    from pip, where it's just silently installing the wrong thing.

There is a variant of this idea which effectively says that it's really
just PyPI's nature of open registration that causes the real problems,
so if we treat all repositories but the "default" one as equal priority,
and then treat the default one as a lower priority then we'll fix
things.

That is true in that it does improve things, but it has many of the same
problems as the general ordering idea (though not all of them).

It also assumes that PyPI, or whatever repository is configured as the
"default", is the only repository with open registration of names.
However, projects like Piwheels exist, which users are expected to use
in addition to PyPI, and which effectively also have open registration
of names since they track whatever names are registered on PyPI.

Rely on repository proxies

One possible solution is, instead of having the installer solve this,
to depend on repository proxies that can intelligently merge multiple
repositories safely. This could provide a better experience for people
with complex needs because the proxies can have configuration and
features that are dedicated to the problem space.

However, that has been rejected because:

-   It requires users to opt into using them, unless we also remove the
    facilities to have more than one repository in installers to force
    users into using a repository proxy when they need multiple
    repositories.
    -   Removing facilities to have more than one repository configured
        has been rejected because it would be too disruptive to end
        users.
-   A user may need different outcomes of merging multiple repositories
    in different contexts, or may need to merge different, mutually
    exclusive repositories. This means they'll need to actually set up
    multiple repository proxies for each unique set of options.
-   It requires users to maintain infrastructure or it requires adding
    features in installers to automatically spin up a repository for
    each invocation.
-   It doesn't actually change the requirement to need to have a
    solution to these problems, it just shifts the responsibility of
    implementation from installers to some repository proxy, but in
    either case we still need something that figures out how to merge
    these disparate namespaces.
-   Ultimately, most users do not want to have to stand up a repository
    proxy just to safely interact with multiple repositories.

Rely only on hash checking

Another possible solution is to rely on hash checking, since with hash
checking enabled users cannot get an artifact that they didn't expect;
it doesn't matter if the namespaces are incorrectly merged or not.

This is certainly a solution; unfortunately it also suffers from
problems that make it unworkable:

-   It requires users to opt in to it, so users are still unprotected by
    default.
-   It requires users to do a bunch of labor to manage their hashes,
    which is something that most users are unlikely to be willing to do.
-   It is difficult and verbose to get the protection when users are not
    using a requirements.txt file as the source of their dependencies
    (this affects build time dependencies, and dependencies provided at
    the command line).
-   It only partially solves the problem: it shifts the responsibility
    to whatever system generates the hashes that the installer would
    use. If that system isn't a human manually validating hashes, which
    is unlikely, then we've just moved the question of how to merge
    these namespaces to whatever tool maintains the hashes.

Require all projects to exist in the "default" repository

Another idea is that we can narrow the scope of --extra-index-url such
that its only supported use is to refer to supplemental repositories to
the default repository, effectively saying that the default repository
defines the namespace, and every additional repository just extends it
with extra packages.

The implementation of this would roughly be to require that the project
MUST be registered with the default repository in order for any
additional repositories to work.
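
A minimal sketch of that rule, assuming each repository can be
represented as a set of normalized project names. The function, the
"default" label, and the sample repository contents are hypothetical and
exist only to illustrate the behavior.

```python
def repositories_to_search(project, default_projects, extra_repositories):
    """Return the labels of the repositories that may serve ``project``.

    default_projects: set of names registered on the default repository.
    extra_repositories: dict mapping a repository label to its set of names.
    """
    # Under this rejected idea, extra repositories are only consulted
    # when the name is also registered on the default repository.
    if project not in default_projects:
        return []
    extras = [
        label
        for label, projects in extra_repositories.items()
        if project in projects
    ]
    return ["default"] + extras


# Hypothetical repository contents.
default_projects = {"requests", "torch"}
extra_repositories = {"internal": {"torch", "acme-internal"}}
```

In this sketch `torch` is served from both repositories purely because
the names match, while `acme-internal` stops working until someone
registers that name on the default repository.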

This sort of works if you successfully narrow the scope in that way, but
ultimately it has been rejected because:

-   Users are unlikely to understand or accept this reduced scope, and
    thus are likely to attempt to continue to use it in the now
    unsupported fashion.
    -   This is complicated by the fact that with the scope now
        narrowed, users who have the excluded workflow no longer have
        any alternative besides setting up a repository proxy, which
        takes infrastructure and effort that they previously didn't have
        to do.
-   It assumes that because a name in an "extra" repository matches a
    name in the default repository, the two are the same project.
    If we were starting from scratch in a brand new ecosystem then maybe
    we could make this assumption from the start and make it stick, but
    it's going to be incredibly difficult to get the ecosystem to adjust
    to that change.
    -   This is a fundamental issue with this approach; the underlying
        problem that drives dependency confusion is that we're taking
        disparate namespaces and flattening them into one. This approach
        essentially just declares that OK, and attempts to mitigate it
        by requiring everyone to register their names.
-   Because of the above assumption, in cases where a name in an extra
    repository collides by accident with the default repository, it's
    going to appear to work for those users, but they are going to be
    silently in a state of dependency confusion.
    -   This is made worse by the fact that the person who owns the name
        that is allowing this to work is going to be completely unaware
        of the role that they're playing for that user, and might
        delete their project or hand it off to someone else,
        inadvertently allowing a malicious user to take it over.
-   Users are likely to attempt to get back to a working state by
    registering their names in their default repository as a defensive
    name squat. Their ability to do this will depend on the specific
    policies of their default repository, whether someone already has
    that name, whether it's too generic, etc. As a best case scenario it
    will cause needless placeholder projects that serve no purpose other
    than to secure some internal use of a name.

Move to Globally Unique Names

The main reason this problem exists is that we don't have globally
unique names; we have locally unique names that exist under multiple
namespaces that we are attempting to merge into a single flat namespace.
If we could instead come up with a way to have globally unique names, we
could sidestep the entire issue.

This idea has been rejected because:

-   Generating globally unique but secure names that are also meaningful
    to humans is a nearly impossible feat without piggybacking off of
    some kind of centralized database. To my knowledge the only systems
    that have managed to do this end up piggybacking off of the domain
    system and refer to packages by URLs with domains etc.
-   Even if we come up with a mechanism to get globally unique names,
    our ability to retrofit that into our decades old system is
    practically zero without burning it all to the ground and starting
    over. The best we could probably do is declare that all names that
    are not globally unique are implicitly names on the PyPI domain, and
    force everyone with a non-PyPI package to rename their package.
-   This would upend so many core assumptions and fundamental parts of
    our current system it's hard to even know where to start to list
    them.

Only recommend that installers offer explicit configuration

One idea that has come up is to essentially just implement the explicit
configuration and not make any other changes. The specific proposal for
a mapping policy is what actually inspired the explicit configuration
option, and it used a file that looked something like:

    {
      "repositories": {
        "PyTorch": ["https://download.pytorch.org/whl/nightly"],
        "PyPI": ["https://pypi.org/simple"]
      },
      "mapping": [
        {
          "paths": ["torch*"],
          "repositories": ["PyTorch"],
          "terminating": true
        },
        {
          "paths": ["*"],
          "repositories": ["PyPI"]
        }
      ]
    }
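
One way an installer might evaluate a file like this is sketched below.
The semantics are assumptions made for illustration, not something this
PEP specifies: rules are tried in order, `paths` entries are treated as
shell-style globs matched against an already-normalized project name,
and a `terminating` rule stops later rules from being consulted. The
function name is hypothetical.

```python
from fnmatch import fnmatchcase

# The example file from above, expressed as a Python dict.
CONFIG = {
    "repositories": {
        "PyTorch": ["https://download.pytorch.org/whl/nightly"],
        "PyPI": ["https://pypi.org/simple"],
    },
    "mapping": [
        {"paths": ["torch*"], "repositories": ["PyTorch"], "terminating": True},
        {"paths": ["*"], "repositories": ["PyPI"]},
    ],
}


def repository_urls_for(project, config):
    """Return the repository URLs that may be consulted for ``project``."""
    urls = []
    for rule in config["mapping"]:
        # Skip rules whose glob patterns do not match the project name.
        if not any(fnmatchcase(project, pattern) for pattern in rule["paths"]):
            continue
        for label in rule["repositories"]:
            urls.extend(config["repositories"][label])
        # Assumed semantics: a terminating rule ends the search, so the
        # catch-all "*" rule never applies to names matching "torch*".
        if rule.get("terminating", False):
            break
    return urls
```

With these assumptions, `repository_urls_for("torchvision", CONFIG)`
yields only the PyTorch URL, and `repository_urls_for("requests", CONFIG)`
yields only the PyPI URL.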

The recommendation to have explicit configuration pushes the decision on
how to implement that onto each installer, allowing them to choose what
works best for their users.

Ultimately, only implementing some kind of explicit configuration was
rejected because by its nature it's opt-in, so it doesn't protect the
average users who are least capable of solving the problem with the
existing tools; by adding additional protections alongside the explicit
configuration, we are able to protect all users by default.

Additionally, relying on only explicit configuration also means that
every end user has to resolve the same problem over and over again, even
in cases like mirrors of PyPI, Piwheels, PyTorch, etc. In each and every
case they have to sit there and make decisions (or find some example to
cargo cult) in order to be secure. Adding extra features into the mix
allows us to centralize those protections where we can, while still
giving advanced end users the ability to completely control their own
destiny.

Scopes à la npm

There's been some suggestion that scopes, similar to how npm has
implemented them, may ultimately solve this. However, scopes do not
change anything about this problem. As far as I know, scopes in npm are
not globally unique; they're tied to a specific registry just like
unscoped names are. What scopes do enable is an obvious mechanism for
grouping related projects, and the ability for a user or organization on
npmjs.com to claim an entire scope. This makes explicit configuration
significantly easier to handle, because you can be assured that there's
a whole slice of the namespace that wholly belongs to you, and you can
easily write a rule that assigns an entire scope to a specific
non-public registry.

Unfortunately, it basically ends up being an easier version of the idea
to only use explicit configuration, which works OK in npm because it's
not particularly common for people to use their own registries, but in
Python we encourage you to do just that.

Define and Standardize the "Explicit Configuration"

This PEP recommends that installers have a mechanism for explicit
configuration of which repository a particular project comes from, but
it does not define what that mechanism is. We are purposefully leaving
that undefined, as it is closely tied to the UX of each individual
installer, and we want to allow each installer to expose that
configuration in whatever way it sees fit for its particular use
cases.

Further, when the idea of defining that mechanism came up, none of the
other installers seemed particularly interested in having that mechanism
defined for them, suggesting that they were happy to treat that as part
of their UX.

Finally, that mechanism, if we did choose to define it, deserves its
own PEP rather than being baked into the changes to the repository API
in this PEP. It can be a future PEP if we ultimately decide we do want
to go down the path of standardization for it.

Acknowledgements

Thanks to Trishank Kuppusamy for kick-starting the discussion that led
to this PEP with his proposal.

Thanks to Paul Moore, Pradyun Gedam, Steve Dower, and Trishank Kuppusamy
for providing early feedback and discussion on the ideas in this PEP.

Thanks to Jelle Zijlstra, C.A.M. Gerlach, Hugo van Kemenade, and Stefano
Rivera for copy editing and improving the structure and quality of this
PEP.

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.