PEP 691 – JSON-based Simple API for Python Package Indexes
- Author: Donald Stufft <donald at stufft.io>, Pradyun Gedam <pradyunsg at gmail.com>, Cooper Lees <me at cooperlees.com>, Dustin Ingram <di at python.org>
- PEP-Delegate: Brett Cannon <brett at python.org>
- Discussions-To: Discourse thread
- Status: Accepted
- Type: Standards Track
- Topic: Packaging
- Created: 04-May-2022
- Post-History: 05-May-2022
- Resolution: Discourse message
Table of Contents
- Abstract
- Goals
- Specification
- Recommendations
- FAQ
- Does this mean PyPI is planning to drop support for HTML/PEP 503?
- Why JSON instead of X format?
- Why not add X feature?
- Why include the filename when the URL has it already?
- Why not break out other pieces of information from the filename?
- Why Content Negotiation instead of multiple URLs?
- Does this mean that static servers are no longer supported?
- Why not add an application/json alias like text/html?
- Why add a application/vnd.pypi.simple.v1+html?
- Why v1.0 and not v1.1 or v2.0?
- Appendix 1: Survey of use cases to cover
- Appendix 2: Rough Underlying Data Models
- Copyright
Abstract
The “Simple Repository API” that was defined in PEP 503 (and was in use much longer than that) has served us reasonably well for a very long time. However, the reliance on using HTML as the data exchange mechanism has several shortcomings.
There are two major issues with an HTML-based API:
- While HTML5 is a standard, it’s an incredibly complex standard and ensuring completely correct parsing of it involves complex logic that does not currently exist within the Python standard library (nor the standard library of many other languages).
  This means that to actually accept everything that is technically valid, tools have to pull in large dependencies or they have to rely on the standard library’s html.parser library, which is lighter weight but potentially doesn’t fully support HTML5.
- HTML5 is primarily designed as a markup language to present documents for human consumption. Our use of it is driven largely for historical and accidental reasons, and it’s unlikely anyone would design an API that relied on it if they were starting from scratch.
  The primary issue with using a markup format designed for human consumption is that there’s not a great way to actually encode data within HTML. We’ve gotten around this by limiting the data we put in this API and being creative with how we can cram data into the API (for instance, hashes are embedded as URL fragments, and the data-yanked attribute was added in PEP 592).
PEP 503 was largely an attempt to standardize what was already in use, so it did not propose any large changes to the API.
In the intervening years, we’ve regularly talked about an “API V2” that would re-envision the entire API of PyPI. However, due to time constraints, that effort has not gained much, if any, traction beyond people thinking that it would be nice to do.
This PEP attempts to take a different route. It doesn’t fundamentally change the overall API structure, but instead specifies a new serialization of the existing data contained in existing PEP 503 responses in a format that is easier for software to parse rather than using a human centric document format.
Goals
- Enable zero configuration discovery. Clients of the simple API MUST be able to gracefully determine whether a target repository supports this PEP without relying on any form of out of band communication (configuration, prior knowledge, etc). Individual clients MAY choose to require configuration to enable the use of this API, however.
- Enable clients to drop support for “legacy” HTML parsing. While it is expected that most clients will keep supporting HTML-only repositories for a while, if not forever, it should be possible for a client to choose to support only the new API formats and no longer invoke an HTML parser.
- Enable repositories to drop support for “legacy” HTML formats. Similar to clients, it is expected that most repositories will continue to support HTML responses for a long time, or forever. It should be possible for a repository to choose to only support the new formats.
- Maintain full support for existing HTML-only clients. We MUST NOT break existing clients that are accessing the API as a strictly PEP 503 API. The only exception to this is if the repository itself has chosen to no longer support the HTML format.
- Minimal additional HTTP requests. Using this API MUST NOT drastically increase the number of HTTP requests an installer must do in order to function. Ideally it will require 0 additional requests, but if needed it may require one or two additional requests (total, not per dependency).
- Minimal additional unique responses. Due to the nature of how large repositories like PyPI cache responses, this PEP should not introduce a significantly or combinatorially large number of additional unique responses that the repository may produce.
- Supports TUF. This PEP MUST be able to function within the bounds of what TUF can support (PEP 458), and must be able to be secured using it.
- Require only the standard library, or small external dependencies for clients. Parsing an API response should ideally require nothing but the standard library, however it would be acceptable to require a small, pure Python dependency.
Specification
To enable response parsing with only the standard library, this PEP specifies that all responses (besides the files themselves, and the HTML responses from PEP 503) should be serialized using JSON.
To enable zero configuration discovery and to minimize the amount of additional HTTP requests, this PEP extends PEP 503 such that all of the API endpoints (other than the files themselves) will utilize HTTP content negotiation to allow client and server to select the correct serialization format to serve, i.e. either HTML or JSON.
Versioning
Versioning will adhere to PEP 629 format (Major.Minor), which has defined the existing HTML responses to be 1.0. Since this PEP does not introduce new features into the API, rather it describes a different serialization format for the existing features, this PEP does not change the existing 1.0 version, and instead just describes how to serialize that into JSON.
Similar to PEP 629, the major version number MUST be incremented if any changes to the new format would result in no longer being able to expect existing clients to meaningfully understand the format.
Likewise, the minor version MUST be incremented if features are added or removed from the format, but existing clients would be expected to continue to meaningfully understand the format.
Changes that would not result in existing clients being unable to meaningfully understand the format and which do not represent features being added or removed may occur without changing the version number.
This is intentionally vague, as this PEP believes it is best left up to future PEPs that make any changes to the API to investigate and decide whether or not that change should increment the major or minor version.
Future versions of the API may add things that can only be represented in a subset of the available serializations of that version. The version numbers of all serializations, within a major version, SHOULD be kept in sync, but the specifics of how a feature serializes into each format may differ, including whether or not that feature is present at all.
It is the intent of this PEP that the API should be thought of as URL endpoints that return data, whose interpretation is defined by the version of that data, and then serialized into the target serialization format.
JSON Serialization
The URL structure from PEP 503 still applies, as this PEP only adds an additional serialization format for the already existing API.
The following constraints apply to all JSON serialized responses described in this PEP:
- All JSON responses will always be a JSON object rather than an array or other type.
- While JSON doesn’t natively support an URL type, any value that represents an URL in this API may be either absolute or relative as long as it points to the correct location. If relative, it is relative to the current URL as if it were HTML.
- Additional keys may be added to any dictionary objects in the API responses and clients MUST ignore keys that they don’t understand.
- All JSON responses will have a meta key, which contains information related to the response itself, rather than the content of the response.
- All JSON responses will have a meta.api-version key, which will be a string that contains the PEP 629 Major.Minor version number, with the same fail/warn semantics as defined in PEP 629.
- All requirements of PEP 503 that are not HTML specific still apply.
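The constraints above can be sketched as a small validation step. This is a minimal illustration, not a reference implementation: the names check_api_version, UnsupportedAPIVersion, SUPPORTED_MAJOR and SUPPORTED_MINOR are hypothetical, and the fail/warn behaviour follows PEP 629 (fail on a newer major version, warn on a newer minor version).

```python
import json
import warnings

# Assumption: this hypothetical client implements API version 1.0.
SUPPORTED_MAJOR = 1
SUPPORTED_MINOR = 0


class UnsupportedAPIVersion(Exception):
    """Raised when the repository reports a newer major version."""


def check_api_version(response_body: str) -> dict:
    """Decode a JSON response and apply the PEP 629 fail/warn semantics."""
    data = json.loads(response_body)
    if not isinstance(data, dict):
        raise ValueError("a Simple API JSON response must be a JSON object")

    major_text, minor_text = data["meta"]["api-version"].split(".", 1)
    major, minor = int(major_text), int(minor_text)

    if major > SUPPORTED_MAJOR:
        # A newer major version cannot be meaningfully understood: fail.
        raise UnsupportedAPIVersion(f"{major}.{minor}")
    if major == SUPPORTED_MAJOR and minor > SUPPORTED_MINOR:
        # A newer minor version is still usable, but worth a warning.
        warnings.warn(f"repository uses newer API version {major}.{minor}")

    # Unknown keys are simply left in place and ignored by the caller.
    return data
```

A caller would pass the decoded response text to check_api_version before reading any other key.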
Project List
The root URL / for this PEP (which represents the base URL) will be a JSON encoded dictionary which has two keys:
- projects: An array where each entry is a dictionary with a single key, name, which is a string containing the project name.
- meta: The general response metadata as described earlier.
As an example:
{
  "meta": {
    "api-version": "1.0"
  },
  "projects": [
    {"name": "Frob"},
    {"name": "spamspamspam"}
  ]
}
Note
The name field is the same as the one from PEP 503, which does not specify whether it is the non-normalized display name or the normalized name. In practice different implementations of these PEPs are choosing differently here, so relying on it being either non-normalized or normalized is relying on an implementation detail of the repository in question.
Note
While the projects key is an array, and thus is required to be in some kind of an order, neither PEP 503 nor this PEP requires any specific ordering nor that the ordering is consistent from one request to the next. Mentally this is best thought of as a set, but both JSON and HTML lack the functionality to have sets.
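As a rough sketch of how a client might consume the project list, the snippet below reads the projects array into a Python set, which mirrors the "best thought of as a set" guidance. The function name project_names is hypothetical, and only the standard library is used.

```python
import json


def project_names(response_body: str) -> set[str]:
    """Return the project names from a project list response as a set.

    Ordering is discarded on purpose, since the API makes no promise
    about it. Names are returned exactly as the repository sent them,
    which may or may not be normalized.
    """
    data = json.loads(response_body)
    return {entry["name"] for entry in data["projects"]}
```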
Project Detail
The format of this URL is /<project>/ where the <project> is replaced by the PEP 503 normalized name for that project, so a project named “Silly_Walk” would have a URL like /silly-walk/.
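For illustration, the normalization rule referenced here comes from PEP 503 (runs of "-", "_" and "." collapse to a single "-", and the result is lowercased). A minimal sketch of building the project detail path follows; project_detail_path is a hypothetical helper name.

```python
import re


def normalize(name: str) -> str:
    """Normalize a project name following the rule given in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def project_detail_path(name: str) -> str:
    """Build the path of the project detail URL, relative to the API root."""
    return f"/{normalize(name)}/"
```

With this sketch, project_detail_path("Silly_Walk") yields "/silly-walk/", matching the example in the text.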
This URL must respond with a JSON encoded dictionary that has three keys:
- name: The normalized name of the project.
- files: A list of dictionaries, each one representing an individual file.
- meta: The general response metadata as described earlier.
Each individual file dictionary has the following keys:
- filename: The filename that is being represented.
- url: The URL that the file can be fetched from.
- hashes: A dictionary mapping a hash name to a hex encoded digest of the file. Multiple hashes can be included, and it is up to the client to decide what to do with multiple hashes (it may validate all of them or a subset of them, or nothing at all). These hash names SHOULD always be normalized to be lowercase.
  The hashes dictionary MUST be present, even if no hashes are available for the file, however it is HIGHLY recommended that at least one secure, guaranteed-to-be-available hash is always included.
  By default, any hash algorithm available via hashlib (specifically any that can be passed to hashlib.new() and do not require additional parameters) can be used as a key for the hashes dictionary. At least one secure algorithm from hashlib.algorithms_guaranteed SHOULD always be included. At the time of this PEP, sha256 specifically is recommended.
- requires-python: An optional key that exposes the Requires-Python metadata field, specified in PEP 345. Where this is present, installer tools SHOULD ignore the download when installing to a Python version that doesn’t satisfy the requirement.
  Unlike data-requires-python in PEP 503, the requires-python key does not require any special escaping other than anything JSON does naturally.
- dist-info-metadata: An optional key that indicates that metadata for this file is available, via the same location as specified in PEP 658 ({file_url}.metadata). Where this is present, it MUST be either a boolean to indicate if the file has an associated metadata file, or a dictionary mapping hash names to a hex encoded digest of the metadata’s hash.
  When this is a dictionary of hashes instead of a boolean, then all the same requirements and recommendations as the hashes key hold true for this key as well.
  If this key is missing then the metadata file may or may not exist. If the key value is truthy, then the metadata file is present, and if it is falsey then it is not.
  It is recommended that servers make the hashes of the metadata file available if possible.
- gpg-sig: An optional key that acts as a boolean to indicate if the file has an associated GPG signature or not. The URL for the signature file follows what is specified in PEP 503 ({file_url}.asc). If this key does not exist, then the signature may or may not exist.
- yanked: An optional key which may be either a boolean to indicate if the file has been yanked, or a non empty, but otherwise arbitrary, string to indicate that a file has been yanked with a specific reason. If the yanked key is present and is a truthy value, then it SHOULD be interpreted as indicating that the file pointed to by the url field has been “Yanked” as per PEP 592.
As an example:
{
  "meta": {
    "api-version": "1.0"
  },
  "name": "holygrail",
  "files": [
    {
      "filename": "holygrail-1.0.tar.gz",
      "url": "https://example.com/files/holygrail-1.0.tar.gz",
      "hashes": {"sha256": "...", "blake2b": "..."},
      "requires-python": ">=3.7",
      "yanked": "Had a vulnerability"
    },
    {
      "filename": "holygrail-1.0-py3-none-any.whl",
      "url": "https://example.com/files/holygrail-1.0-py3-none-any.whl",
      "hashes": {"sha256": "...", "blake2b": "..."},
      "requires-python": ">=3.7",
      "dist-info-metadata": true
    }
  ]
}
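Since the PEP leaves it to the client to decide what to do with multiple hashes, the following is only one possible policy, sketched under that assumption: check every hash the local hashlib can compute without extra parameters, and refuse the file if none could be checked. The function name verify_file is hypothetical.

```python
import hashlib


def verify_file(content: bytes, hashes: dict[str, str]) -> bool:
    """Validate content against a file entry's hashes dictionary.

    Policy assumed for this sketch: every usable hash must match, and at
    least one hash must have been usable.
    """
    checked = 0
    for name, expected in hashes.items():
        if name not in hashlib.algorithms_available:
            continue  # unknown to this interpreter, skip it
        try:
            digest = hashlib.new(name, content).hexdigest()
        except (TypeError, ValueError):
            # Algorithms that need additional parameters are not usable
            # as keys here, so they are skipped as well.
            continue
        if digest != expected.lower():
            return False
        checked += 1
    return checked > 0
```

A more lenient client could validate only a preferred algorithm such as sha256; the specification permits either choice.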
Note
While the files key is an array, and thus is required to be in some kind of an order, neither PEP 503 nor this PEP requires any specific ordering nor that the ordering is consistent from one request to the next. Mentally this is best thought of as a set, but both JSON and HTML lack the functionality to have sets.
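To make the optional keys concrete, here is a hedged sketch of turning a decoded project detail response into simple records. FileEntry and parse_files are hypothetical names; relative URLs are resolved against the page URL with urllib.parse.urljoin, and the truthy/falsey rules for yanked and dist-info-metadata follow the descriptions above.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urljoin


@dataclass
class FileEntry:
    filename: str
    url: str
    hashes: dict
    requires_python: Optional[str]
    has_metadata: Optional[bool]  # None means "may or may not exist"
    yanked: bool
    yanked_reason: Optional[str]


def parse_files(detail: dict, page_url: str) -> list[FileEntry]:
    """Convert the files array of a project detail response into records."""
    entries = []
    for raw in detail["files"]:
        yanked = raw.get("yanked", False)
        metadata = raw.get("dist-info-metadata")
        entries.append(
            FileEntry(
                filename=raw["filename"],
                # Relative URLs are relative to the URL of the response.
                url=urljoin(page_url, raw["url"]),
                hashes=raw["hashes"],
                requires_python=raw.get("requires-python"),
                has_metadata=None if metadata is None else bool(metadata),
                yanked=bool(yanked),
                yanked_reason=yanked if isinstance(yanked, str) and yanked else None,
            )
        )
    return entries
```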
Content-Types
This PEP proposes that all responses from the Simple API will have a standard content type that describes what the response is (a Simple API response), what version of the API it represents, and what serialization format has been used.
The structure of this content type will be:
application/vnd.pypi.simple.$version+format
Since only major versions should be disruptive to clients attempting to understand one of these API responses, only the major version will be included in the content type, and will be prefixed with a v to clarify that it is a version number.
This means that for the existing 1.0 API, the content types would be:
- JSON: application/vnd.pypi.simple.v1+json
- HTML: application/vnd.pypi.simple.v1+html
In addition to the above, a special “meta” version is supported named latest, whose purpose is to allow clients to request the absolute latest version, without having to know ahead of time what that version is. It is recommended however, that clients be explicit about what versions they support.
To support existing clients which expect the existing PEP 503 API responses to use the text/html content type, this PEP further defines text/html as an alias for the application/vnd.pypi.simple.v1+html content type.
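The content type structure can be illustrated with a small parser. This is a sketch only: split_content_type is a hypothetical helper, and the regular expression assumes that the only formats of interest are json and html, as in this PEP.

```python
import re

# Matches application/vnd.pypi.simple.<version>+<format>, where the version
# is either "v" followed by the major version or the meta version "latest".
_SIMPLE_TYPE = re.compile(
    r"^application/vnd\.pypi\.simple\.(v\d+|latest)\+(json|html)$"
)


def split_content_type(content_type: str) -> tuple[str, str]:
    """Return (version, format) for a Simple API content type."""
    content_type = content_type.strip().lower()
    if content_type == "text/html":
        # text/html is an alias for application/vnd.pypi.simple.v1+html.
        return ("v1", "html")
    match = _SIMPLE_TYPE.match(content_type)
    if match is None:
        raise ValueError(f"not a Simple API content type: {content_type!r}")
    return (match.group(1), match.group(2))
```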
Version + Format Selection
Now that there are multiple possible serializations, we need a mechanism to allow clients to indicate what serialization formats they’re able to understand. In addition, it would be beneficial if any possible new major version to the API can be added without disrupting existing clients expecting the previous API version.
To enable this, this PEP standardizes on the use of HTTP’s Server-Driven Content Negotiation.
While this PEP won’t fully describe the entirety of server-driven content negotiation, the flow is roughly:
- The client makes an HTTP request containing an Accept header listing all of the version+format content types that they are able to understand.
- The server inspects that header, selects one of the listed content types, then returns a response using that content type (treating the absence of an Accept header as Accept: */*).
- If the server does not support any of the content types in the Accept header then they are able to choose between 3 different options for how to respond:
  - Select a default content type other than what the client has requested and return a response with that.
  - Return a HTTP 406 Not Acceptable response to indicate that none of the requested content types were available, and the server was unable or unwilling to select a default content type to respond with.
  - Return a HTTP 300 Multiple Choices response that contains a list of all of the possible responses that could have been chosen.
- The client interprets the response, handling the different types of responses that the server may have responded with.
This PEP does not specify which choices the server makes in regards to handling a content type that it isn’t able to return, and clients SHOULD be prepared to handle all of the possible responses in whatever way makes the most sense for that client.
However, as there is no standard format for how a 300 Multiple Choices response can be interpreted, this PEP highly discourages servers from utilizing that option, as clients will have no way to understand and select a different content-type to request. In addition, it’s unlikely that the client could understand a different content type anyways, so at best this response would likely just be treated the same as a 406 Not Acceptable error.
This PEP does require that if the meta version latest is being used, the server MUST respond with the content type for the actual version that is contained in the response (i.e. an Accept: application/vnd.pypi.simple.latest+json request that returns a v1.x response should have a Content-Type of application/vnd.pypi.simple.v1+json).
The Accept header is a comma separated list of content types that the client understands and is able to process. It supports three different formats for each content type that is being requested:
- $type/$subtype
- $type/*
- */*
For the use of selecting a version+format, the most useful of these is $type/$subtype, as that is the only way to actually specify the version and format you want.
The order of the content types listed in the Accept header does not have any specific meaning, and the server SHOULD consider all of them to be equally valid to respond with. If a client wishes to specify that they prefer a specific content type over another, they may use the Accept header’s quality value syntax.
This allows a client to specify a priority for a specific entry in their Accept header, by appending a ;q= followed by a value between 0 and 1 inclusive, with up to 3 decimal digits. When interpreting this value, an entry with a higher quality has priority over an entry with a lower quality, and any entry without a quality present will default to a quality of 1.
However, clients should keep in mind that a server is free to select any of the content types they’ve asked for, regardless of their requested priority, and it may even return a content type that they did not ask for.
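On the server side, the negotiation described above might be sketched as follows. This is an illustrative simplification, not a complete implementation of HTTP content negotiation: the names SERVER_TYPES, parse_accept and select_content_type are hypothetical, ties are broken by the server's own preference order (an assumption, since the PEP leaves the choice to the server), and a return value of None stands for "nothing acceptable", after which the server may send a 406 or fall back to a default.

```python
from typing import Optional

# Assumption: the server's own preference order, most preferred first.
SERVER_TYPES = [
    "application/vnd.pypi.simple.v1+json",
    "application/vnd.pypi.simple.v1+html",
    "text/html",
]


def parse_accept(header: str) -> list[tuple[str, float]]:
    """Split an Accept header into (media range, quality) pairs."""
    entries = []
    for part in header.split(","):
        media_range, *params = [piece.strip() for piece in part.split(";")]
        if not media_range:
            continue
        quality = 1.0  # entries without an explicit q default to 1
        for param in params:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q as not acceptable
        entries.append((media_range.lower(), quality))
    return entries


def _specificity(media_range: str, candidate: str) -> int:
    """Rank how specifically a media range matches (0 means no match)."""
    if media_range == candidate:
        return 3
    if media_range == "*/*":
        return 1
    if media_range.endswith("/*") and candidate.startswith(media_range[:-1]):
        return 2
    return 0


def select_content_type(accept: Optional[str], available=SERVER_TYPES) -> Optional[str]:
    """Pick the available type with the highest client quality value."""
    # A missing or empty Accept header is treated as "*/*".
    entries = parse_accept(accept or "") or [("*/*", 1.0)]
    best_type, best_quality = None, 0.0
    for candidate in available:
        rank, quality = 0, 0.0
        for media_range, q in entries:
            score = _specificity(media_range, candidate)
            if score > rank:  # the most specific matching range wins
                rank, quality = score, q
        if quality > best_quality:
            best_type, best_quality = candidate, quality
    return best_type
```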
To aid clients in determining the content type of the response that they have received from an API request, this PEP requires that servers always include a Content-Type header indicating the content type of the response. This is technically a backwards incompatible change, however in practice pip has been enforcing this requirement so the risk of actual breakage is low.
An example of how a client can operate would look like:
import email.message
import requests


def parse_content_type(header: str) -> str:
    m = email.message.Message()
    m["content-type"] = header
    return m.get_content_type()


# Construct our list of acceptable content types, we want to prefer
# that we get a v1 response serialized using JSON, however we also
# can support a v1 response serialized using HTML. For compatibility
# we also request text/html, but we prefer it least of all since we
# don't know if it's actually a Simple API response, or just some
# random HTML page that we've gotten due to a misconfiguration.
CONTENT_TYPES = [
    "application/vnd.pypi.simple.v1+json",
    "application/vnd.pypi.simple.v1+html;q=0.2",
    "text/html;q=0.01",  # For legacy compatibility
]
ACCEPT = ", ".join(CONTENT_TYPES)

# Actually make our request to the API, requesting all of the content
# types that we find acceptable, and letting the server select one of
# them out of the list.
resp = requests.get("https://pypi.org/simple/", headers={"Accept": ACCEPT})

# If the server does not support any of the content types you requested,
# AND it has chosen to return a HTTP 406 error instead of a default
# response then this will raise an exception for the 406 error.
resp.raise_for_status()

# Determine what kind of response we've gotten to ensure that it is one
# that we can support, and if it is, dispatch to a function that will
# understand how to interpret that particular version+serialization. If
# we don't understand the content type we've gotten, then we'll raise
# an exception.
# Note: handle_v1_json() and handle_v1_html() stand in for the client's
# own parsing functions; they are not defined in this example.
content_type = parse_content_type(resp.headers.get("content-type", ""))
match content_type:
    case "application/vnd.pypi.simple.v1+json":
        handle_v1_json(resp)
    case "application/vnd.pypi.simple.v1+html" | "text/html":
        handle_v1_html(resp)
    case _:
        raise Exception(f"Unknown content type: {content_type}")
If a client wishes to only support HTML or only support JSON, then they would just remove the content types that they do not want from the Accept header, and turn receiving them into an error.
Alternative Negotiation Mechanisms
While using HTTP’s Content negotiation is considered the standard way for a client and server to coordinate to ensure that the client is getting an HTTP response that it is able to understand, there are situations where that mechanism may not be sufficient. For those cases this PEP has alternative negotiation mechanisms that may optionally be used instead.
URL Parameter
Servers that implement the Simple API may choose to support an URL parameter named format to allow the clients to request a specific version of the URL.
The value of the format parameter should be one of the valid content types. Passing multiple content types, wild cards, quality values, etc… is not supported.
Supporting this parameter is optional, and clients SHOULD NOT rely on it for interacting with the API. This negotiation mechanism is intended to allow for easier human based exploration of the API within a browser, or to allow documentation or notes to link to a specific version+format.
Servers that do not support this parameter may choose to return an error when it is present, or they may simply ignore its presence.
When a server does implement this parameter, it SHOULD take precedence over any values in the client’s Accept header, and if the server does not support the requested format, it may choose to fall back to the Accept header, or choose any of the error conditions that standard server-driven content negotiation typically has (e.g. 406 Not Acceptable, 300 Multiple Choices, or selecting a default type to return).
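As a sketch of how a link using the format parameter could be constructed, the helper below (with_format, a hypothetical name) appends the parameter using the standard library. One detail worth noting is that the content type contains a "+", which is commonly decoded as a space in query strings, so the value is percent-encoded here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_format(url: str, content_type: str) -> str:
    """Return url with a single format=<content type> query parameter."""
    parts = urlsplit(url)
    # Drop any existing format parameter, keep everything else.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "format"
    ]
    query.append(("format", content_type))
    # urlencode percent-encodes "+" as %2B and "/" as %2F.
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Since support for the parameter is optional, a link built this way is only useful against servers known to implement it.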
Endpoint Configuration
This option technically is not a special option at all, it is just a natural consequence of using content negotiation and allowing servers to select which of the available content types is their default.
If a server is unwilling or unable to implement the server-driven content negotiation, and would instead rather require users to explicitly configure their client to select the version they want, then that is a supported configuration.
To enable this, a server should make multiple endpoints (for instance, /simple/v1+html/ and/or /simple/v1+json/) for each version+format that they wish to support. Under that endpoint, they can host a copy of their repository that only supports one (or a subset) of the content-types. When a client makes a request using the Accept header, the server can ignore it and return the content type that corresponds to that endpoint.
For clients that wish to require specific configuration, they can keep track of which version+format a specific repository URL was configured for, and when making a request to that server, emit an Accept header that only includes the correct content type.
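A client that requires explicit configuration could keep a mapping along these lines. This is only a sketch: the repository URLs under repo.example and the names CONFIGURED_REPOSITORIES and accept_header_for are invented for illustration.

```python
# Hypothetical user configuration: repository URL -> configured content type.
CONFIGURED_REPOSITORIES = {
    "https://repo.example/simple/v1+json/": "application/vnd.pypi.simple.v1+json",
    "https://repo.example/simple/v1+html/": "application/vnd.pypi.simple.v1+html",
}


def accept_header_for(repository_url: str) -> str:
    """Return the single content type to send in Accept for a repository."""
    try:
        return CONFIGURED_REPOSITORIES[repository_url]
    except KeyError:
        raise LookupError(f"no format configured for {repository_url}") from None
```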
TUF Support - PEP 458
PEP 458 requires that all API responses are hashable and that they can be uniquely identified by a path relative to the repository root. For a Simple API repository, the target path is the Root of our API (e.g. /simple/ on PyPI). This creates challenges when accessing the API using a TUF client instead of directly using a standard HTTP client, as the TUF client cannot handle the fact that a target could have multiple different representations that all hash differently.
PEP 458 does not specify what the target path should be for the Simple API, but TUF requires that the target paths be “file-like”, in other words, a path like simple/PROJECT/ is not acceptable, because it technically points to a directory.
The saving grace is that the target path does not have to actually match the URL being fetched from the Simple API, and it can just be a sigil that the fetching code knows how to transform into the actual URL that needs to be fetched. This same thing can hold true for other aspects of the actual HTTP request, such as the Accept header.
Ultimately figuring out how to map a directory to a filename is out of scope for this PEP (but it would be in scope for PEP 458), and this PEP defers making a decision about how exactly to represent this inside of PEP 458 metadata.
However, it appears that the current WIP branch against pip that attempts to implement PEP 458 is using a target path like simple/PROJECT/index.html. This could be modified to include the API version and serialization format using something like simple/PROJECT/vnd.pypi.simple.vN.FORMAT. So the v1 HTML format would be simple/PROJECT/vnd.pypi.simple.v1.html and the v1 JSON format would be simple/PROJECT/vnd.pypi.simple.v1.json.
In this case, since text/html is an alias to application/vnd.pypi.simple.v1+html when interacting through TUF, it likely will make the most sense to normalize to the more explicit name.
Likewise the latest metaversion should not be included in the targets, only explicitly declared versions should be supported.
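Because this PEP defers the actual decision to PEP 458, the following is only a sketch of the "something like" scheme mentioned above. The function name tuf_target_path is hypothetical; it normalizes the text/html alias to the explicit name and rejects the latest meta version.

```python
import re

# Only explicit versions (v1, v2, ...) are accepted; "latest" does not match.
_EXPLICIT_TYPE = re.compile(r"^application/vnd\.pypi\.simple\.(v\d+)\+(\w+)$")


def tuf_target_path(project: str, content_type: str) -> str:
    """Map a project and content type to an illustrative TUF target path."""
    if content_type == "text/html":
        # Normalize the alias to the more explicit content type.
        content_type = "application/vnd.pypi.simple.v1+html"
    match = _EXPLICIT_TYPE.match(content_type)
    if match is None:
        raise ValueError(f"no target path for content type {content_type!r}")
    version, fmt = match.groups()
    return f"simple/{project}/vnd.pypi.simple.{version}.{fmt}"
```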
Recommendations
This section is non-normative, and represents what the PEP authors believe to be the best default implementation decisions for something implementing this PEP, but it does not represent any sort of requirement to match these decisions.
These decisions have been chosen to maximize the number of requests that can be moved onto the newest version of an API, while maintaining the greatest amount of compatibility. In addition, they try to provide guardrails that push clients toward making the best choices they can.
It is recommended that servers:
- Support all 3 content types described in this PEP, using server-driven content negotiation, for as long as they reasonably can, or at least as long as they’re receiving non trivial traffic that uses the HTML responses.
- When encountering an Accept header that does not contain any content types that it knows how to work with, the server should not ever return a 300 Multiple Choices response, and instead return a 406 Not Acceptable response.
  - However, if choosing to use the endpoint configuration, you should prefer to return a 200 OK response in the expected content type for that endpoint.
- When selecting an acceptable version, the server should choose the highest version that the client supports, with the most expressive/featureful serialization format, taking into account the specificity of the client requests as well as any quality priority values they have expressed, and it should only use the text/html content type as a last resort.
It is recommended that clients:
- Support all 3 content types described in this PEP, using server-driven content negotiation, for as long as they reasonably can.
- When constructing an Accept header, include all of the content types that you support.
  You should generally not include a quality priority value for your content types, unless you have implementation specific reasons that you want the server to take into account (for example, if you’re using the standard library HTML parser and you’re worried that there may be some kinds of HTML responses that you’re unable to parse in some edge cases).
  The one exception to this recommendation is that you should include a ;q=0.01 value on the legacy text/html content type, unless it is the only content type that you are requesting.
- Explicitly select what versions they are looking for, rather than using the latest meta version during normal operation.
- Check the Content-Type of the response and ensure it matches something that you were expecting.
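The client recommendations on constructing an Accept header can be sketched as a small helper. The name build_accept is hypothetical; the only quality value it adds is the ;q=0.01 on text/html, and only when other content types are also requested.

```python
def build_accept(supported: list[str]) -> str:
    """Build an Accept header from the content types a client supports."""
    parts = []
    for content_type in supported:
        if content_type == "text/html" and len(supported) > 1:
            # De-prioritize the legacy alias when alternatives exist.
            parts.append("text/html;q=0.01")
        else:
            parts.append(content_type)
    return ", ".join(parts)
```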
FAQ
Does this mean PyPI is planning to drop support for HTML/PEP 503?
No, PyPI has no plans at this time to drop support for PEP 503 or HTML responses.
While this PEP does give repositories the flexibility to do that, that largely exists to ensure that things like the Endpoint Configuration mechanism are able to work, and to ensure that clients do not make any assumptions that would prevent, at some point in the future, gracefully dropping support for HTML.
The existing HTML responses incur almost no maintenance burden on PyPI and there is no pressing need to remove them. The only real benefit to dropping them would be to reduce the number of items cached in our CDN.
If in the future PyPI does wish to drop support for them, doing so would almost certainly be the topic of a PEP, or at a minimum a public, open, discussion and would be informed by metrics showing any impact to end users.
Why JSON instead of X format?
JSON parsers are widely available in most, if not every, language. A JSON parser is also available in the Python standard library. It’s not the perfect format, but it’s good enough.
Why not add X feature?
The general goal of this PEP is to change or add very little. We will instead focus largely on translating the existing information contained within our HTML responses into a sensible JSON representation. This will include PEP 658 metadata required for packaging tooling.
The only real new capability that is added in this PEP is the ability to have multiple hashes for a single file. That was done because the current mechanism being limited to a single hash has made it painful in the past to migrate hashes (md5 to sha256) and the cost of making the hashes a dictionary and allowing multiple is pretty low.
The API was generally designed to allow further extension through adding new keys, so if there’s some new piece of data that an installer might need, future PEPs can easily make that available.
Why include the filename when the URL has it already?
We could reduce the size of our responses by removing the filename key and expecting clients to pull that information out of the URL.
Currently this PEP chooses not to do that, largely because PEP 503 explicitly required that the filename be available via the anchor tag of the links, though that was largely because something had to be there. It’s not clear if repositories in the wild always have a filename as the last part of the URL or if they’re relying on the filename in the anchor tag.
It also makes the responses slightly nicer to read for a human, as you get a nice short unique identifier.
If we gained reasonable confidence that we could mandate that the filename is in the URL, then we could drop this data and reduce the size of the JSON response.
Why not break out other pieces of information from the filename?
Currently clients are expected to parse a number of pieces of information from the filename such as project name, version, ABI tags, etc. We could break these out and add them as keys to the file object.
This PEP has chosen not to do that because doing so would increase the size of the API responses, and most clients are going to require the ability to parse that information out of file names anyways regardless of what the API does. Thus it makes sense to keep that functionality inside of the clients.
Why Content Negotiation instead of multiple URLs?
Another reasonable way to implement this would be to duplicate the API routes and include some marker in the URL itself for JSON, such as making the URLs be something like /simple/foo.json, /simple/_index.json, etc.
This makes some things simpler like TUF integration and fully static serving of a repository (since .json files can just be written out).
However, this has three pretty major issues:
- Our current URL structure relies on the fact that there is a URL that represents the “root”, /, to serve the list of projects. If we want to have separate URLs for JSON and HTML, we would need to come up with some way to have two root URLs.
  Something like / being HTML and /_index.json being JSON could work, since _index isn’t a valid project name. But / being HTML doesn’t work great if a repository wants to remove support for HTML.
  Another option could be moving all of the existing HTML URLs under a namespace while making a new namespace for JSON. Since /<project>/ was already defined, we would have to make these namespaces not valid project names, so something like /_html/ and /_json/ could work, then just redirect the non-namespaced URLs to whatever the “default” for that repository is (likely HTML, unless they’ve disabled HTML, in which case JSON).
- With separate URLs, there’s no good way to support zero configuration discovery that a repository supports the JSON URLs without making additional HTTP requests to determine if the JSON URL exists or not.
  The most naive implementation of this would be to request the JSON URL and fall back to the HTML URL for every single request, but that would perform horribly and violate the goal of minimal additional HTTP requests.
  The most likely implementation of this would be to make some sort of repository level configuration file that somehow indicates what is supported. We would have the same namespace problem as above, with the same solution: something like /_config.json could hold that data, and a client could first make an HTTP request to that, and if it exists pull it down and parse it to learn about the capabilities of this particular repository.
- The use of Accept also allows us to add versioning into this field.
All being said, it is the opinion of this PEP that those three issues combined make using separate API routes a less desirable solution than relying on content negotiation to select the most ideal representation of the data.
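A minimal sketch of the content negotiation approach from the client side, using only the standard library: the same URL is requested with an Accept header listing the representations the client understands, and the response's Content-Type decides how the body is parsed. The function names and the specific quality values are assumptions for illustration. No request is actually sent here.

```python
from urllib.request import Request

JSON_V1 = "application/vnd.pypi.simple.v1+json"
HTML_V1 = "application/vnd.pypi.simple.v1+html"

# Prefer JSON, but still accept HTML (q values here are illustrative).
ACCEPT = ", ".join([JSON_V1, HTML_V1 + ";q=0.2", "text/html;q=0.01"])


def build_request(url):
    """Build one request for one URL; the server picks the representation."""
    return Request(url, headers={"Accept": ACCEPT})


def response_format(content_type):
    """Map a response Content-Type header to the parser the client should use."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == JSON_V1:
        return "json"
    if media_type in (HTML_V1, "text/html"):
        return "html"
    raise ValueError("unsupported content type: " + content_type)
```

Because the URL never changes, a repository that only serves HTML keeps working with this client without any discovery step or extra round trip.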
Does this mean that static servers are no longer supported?
In short, no, static servers are still (almost) fully supported by this PEP.
The specifics of how they are supported will depend on the static server in question. For example:
- S3: S3 fully supports custom content types, however it does not support any form of content negotiation. In order to have a server hosted on S3, you would have to use the “Endpoint configuration” style of negotiation, and users would have to configure their clients explicitly.
- GitHub Pages: GitHub Pages does not support custom content types, so the S3 solution is not currently workable, which means that only text/html repositories would function.
- Apache: Apache fully supports server-driven content negotiation, and would just need to be configured to map the custom content types to specific extensions.
Why not add an application/json alias like text/html?
This PEP believes that it is best for both clients and servers to be explicit about the types of the API responses that are being used, and a content type like application/json is the exact opposite of explicit.
The text/html alias exists as a compromise primarily to ensure that existing consumers of the API continue to function as they already do. There is no such expectation of existing clients using the Simple API with an application/json content type.
In addition, application/json has no versioning in it, which means that if there is ever a 2.x version of the Simple API, we will be forced to make a decision. Should application/json preserve backwards compatibility and continue to be an alias for application/vnd.pypi.simple.v1+json, or should it be updated to be an alias for application/vnd.pypi.simple.v2+json?
This problem doesn’t exist for text/html, because the assumption is that HTML will remain a legacy format, and will likely not gain any new features, much less features that require breaking compatibility. So having it be an alias for application/vnd.pypi.simple.v1+html is effectively the same as having it be an alias for application/vnd.pypi.simple.latest+html, since 1.x will likely be the only HTML version to exist.
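The versioning argument can be seen in a small sketch: the custom content types carry both a version and a serialization that a client can read directly, while application/json carries neither. The regular expression and function name below are assumptions for illustration.

```python
import re

# Matches e.g. application/vnd.pypi.simple.v1+json or ...simple.latest+html
_SIMPLE_TYPE = re.compile(
    r"application/vnd\.pypi\.simple\.(v\d+|latest)\+(json|html)"
)


def parse_simple_content_type(content_type):
    """Return (version, serialization), or None if the type is not self-describing."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    match = _SIMPLE_TYPE.fullmatch(media_type)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

A bare application/json response yields None here, which is exactly the ambiguity described above: the client would have to guess which major version of the API it is looking at.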
The largest benefit to adding the application/json content type is that there are things that do not allow you to have custom content types, and require you to select one of their preset content types. The main example of this is GitHub Pages: the lack of application/json support in this PEP means that static repositories will no longer be able to be hosted on GitHub Pages unless GitHub adds the application/vnd.pypi.simple.v1+json content type.
This PEP believes that the benefits are not large enough to add that content type alias at this time, and that its inclusion would likely be a footgun waiting for unsuspecting people to accidentally pick it up. Especially given that we can always add it in the future, but removing things is a lot harder to do.
Why add an application/vnd.pypi.simple.v1+html?
The PEP expects the HTML version of the API to become legacy, so one option it could take is to not add the application/vnd.pypi.simple.v1+html content type, and just use text/html for that.
This PEP has decided that adding the new content type is better overall, since it makes even the legacy format more self-describing and makes them both more consistent with each other. Overall I think it’s more confusing if the +html version doesn’t exist.
Why v1.0 and not v1.1 or v2.0?
This PEP is still wholly backwards compatible: clients that could read the existing v1.0 API can still continue to read the API after these changes have been made. In PEP 629, the qualification for a major version bump is:
Incrementing the major version is used to signal a backwards incompatible change such that existing clients would no longer be expected to be able to meaningfully use the API.
The changes in this PEP do not meet that bar: nothing has changed in a way that existing clients would no longer be expected to be able to meaningfully use the API.
That means we should still be within the v1.x version line.
The question of whether we should be v1.1 or v1.0 is a more interesting one, and there are a few ways of looking at it:
- We’ve exposed new features to the API (the project name on the project page, multiple hashes), which is a sign that we should increment the minor version.
- The new features exist wholly within the JSON serialization, which means that no client that is currently requesting the HTML 1.0 page would ever see any of the new features anyway, so for them it is effectively still v1.0.
- No major client has implemented support for PEP 629 yet, which means that the minor version numbering is largely academic at this point anyways, since it exists to let clients provide feedback to end users.
The second and third points above end up making the first point kind of meaningless, and with that, it makes more sense to just call everything v1.0 and be stricter about updating versions into the future.
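For context, the kind of client-side check that PEP 629 versioning is meant to enable might look roughly like the sketch below, applied to the meta object of a JSON response: fail on a newer major version, warn on a newer minor version. The exception class, function name, and message wording are hypothetical.

```python
import warnings

# Highest API version this hypothetical client understands.
SUPPORTED_MAJOR, SUPPORTED_MINOR = 1, 0


class UnsupportedAPIVersion(Exception):
    """Raised when the repository speaks a newer major version."""


def check_api_version(meta):
    """Inspect a response's meta object and react to its api-version."""
    major, minor = (int(part) for part in meta["api-version"].split("."))
    if major > SUPPORTED_MAJOR:
        # Backwards incompatible: the client cannot meaningfully use the response.
        raise UnsupportedAPIVersion(meta["api-version"])
    if major == SUPPORTED_MAJOR and minor > SUPPORTED_MINOR:
        # Compatible, but the repository may offer features we do not know about.
        warnings.warn("repository uses a newer minor API version: "
                      + meta["api-version"])
```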
Appendix 1: Survey of use cases to cover
This was done through a discussion between pip, PyPI, and bandersnatch maintainers, who are the first two potential users for the new API. This is how they use the Simple + JSON APIs today or how they currently plan to use them:
- pip:
  - List of all files for a particular release
  - Metadata of each individual artifact:
    - was it yanked? (data-yanked)
    - what’s the python-requires? (data-python-requires)
    - what’s the hash of this file? (currently, hash in URL)
    - Full metadata (data-dist-info-metadata)
    - [Bonus] what are the declared dependencies, if available (list-of-strings, null if unavailable)?
- bandersnatch - Only uses legacy JSON API + XMLRPC today:
  - Generates Simple HTML rather than copying from PyPI
    - Maybe this changes with the new API and we verbatim pull these API assets from PyPI
  - List of all files for a particular release.
    - Work out URL for release files to download
  - Metadata of each individual artifact.
    - Write out the JSON to mirror storage today (disk/S3)
      - Required metadata used (via Package class):
        - metadata["info"]
        - metadata["last_serial"]
        - metadata["releases"]
          - digests
          - URL
  - XML-RPC calls (we’d love to deprecate - but we don’t think should go in the Simple API)
    - [Bonus] Get packages since serial X (or all)
      - XML-RPC Call: changelog_since_serial
    - [Bonus] Get all packages with serial
      - XML-RPC Call: list_packages_with_serial
Appendix 2: Rough Underlying Data Models
These are not intended to perfectly match the server, client, or wire formats. Rather, these are conceptual models, put to code to make them more explicit as to the abstract models underlying the repository API as it evolved through PEP 503, PEP 592, PEP 629, PEP 658, and now this PEP, PEP 691.
The existing HTML, and the new JSON serialization of these models then represent how these underlying conceptual models get mapped onto the actual wire formats.
How servers or clients choose to model this data is out of scope for this PEP.
from dataclasses import dataclass
from typing import Literal


@dataclass
class Meta:
    api_version: Literal["1.0"]


@dataclass
class Project:
    # Normalized or Non-Normalized Name
    name: str
    # Computed in JSON, Included in HTML
    url: str | None


@dataclass
class File:
    filename: str
    url: str
    # Limited to a len() of 1 in HTML
    hashes: dict[str, str]
    gpg_sig: bool | None
    requires_python: str | None


# Yank support (PEP 592)
@dataclass
class PEP592File(File):
    yanked: bool | str


@dataclass
class PEP658File(PEP592File):
    # Limited to a len() of 1 in HTML
    dist_info_metadata: bool | dict[str, str]


# Simple Index page (/simple/)
@dataclass
class PEP503_Index:
    projects: set[Project]


@dataclass
class PEP629_Index(PEP503_Index):
    meta: Meta


@dataclass
class Index(PEP629_Index):
    pass


# Simple Detail page (/simple/$PROJECT/)
@dataclass
class PEP503_Detail:
    files: set[File]


@dataclass
class PEP592_Detail(PEP503_Detail):
    files: set[PEP592File]


@dataclass
class PEP629_Detail(PEP592_Detail):
    meta: Meta


@dataclass
class PEP658_Detail(PEP629_Detail):
    files: set[PEP658File]


@dataclass
class PEP691_Detail(PEP658_Detail):
    name: str  # Normalized Name


@dataclass
class Detail(PEP691_Detail):
    pass
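As one illustration of how a client might map the JSON serialization onto a conceptual model of this kind, the sketch below reads a project detail response into simple records. The ParsedFile class and parse_detail function are hypothetical and deliberately cover only a subset of the keys. The key names themselves (meta, api-version, name, files, filename, url, hashes, requires-python, yanked) are the ones this PEP specifies for the JSON format.

```python
import json
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class ParsedFile:
    filename: str
    url: str
    hashes: dict
    requires_python: Optional[str]
    yanked: Union[bool, str]


def parse_detail(payload):
    """Parse a JSON project detail response into (name, api_version, files)."""
    data = json.loads(payload)
    files = [
        ParsedFile(
            filename=entry["filename"],
            url=entry["url"],
            hashes=entry["hashes"],
            # Optional keys: absent means "no constraint" / "not yanked".
            requires_python=entry.get("requires-python"),
            yanked=entry.get("yanked", False),
        )
        for entry in data["files"]
    ]
    return data["name"], data["meta"]["api-version"], files
```

Unknown keys are simply ignored here, which matches the expectation that future PEPs extend the format by adding new keys.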
Copyright
This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.
Source: https://github.com/python/peps/blob/main/peps/pep-0691.rst
Last modified: 2023-09-09 17:39:29 GMT