PEP 694 – Upload 2.0 API for Python Package Repositories
- Author:
- Donald Stufft <donald at stufft.io>
- Discussions-To:
- Discourse thread
- Status:
- Draft
- Type:
- Standards Track
- Topic:
- Packaging
- Created:
- 11-Jun-2022
- Post-History:
- 27-Jun-2022
Table of Contents
Abstract
There is currently no standardized API for uploading files to a Python package repository such as PyPI. Instead, everyone has been forced to reverse engineer the non-standard API from PyPI.
That API, while functional, leaks a lot of implementation details of the original PyPI code base, which have now had to have been faithfully replicated in the new code base, and alternative implementations.
Beyond the above, there are a number of major issues with the current API:
- It is a fully synchronous API, which means that we’re forced to have a single request being held open for potentially a long time, both for the upload itself, and then while the repository processes the uploaded file to determine success or failure.
- It does not support any mechanism for resuming an upload, with the largest file size on PyPI being just under 1GB in size, that’s a lot of wasted bandwidth if a large file has a network blip towards the end of an upload.
- It treats a single file as the atomic unit of operation, which can be problematic when a release might have multiple binary wheels which can cause people to get different versions while the files are uploading, and if the sdist happens to not go last, possibly some hard to build packages are attempting to be built from source.
- It has very limited support for communicating back to the user, with no support for multiple errors, warnings, deprecations, etc. It is limited entirely to the HTTP status code and reason phrase, of which the reason phrase has been deprecated since HTTP/2 (RFC 7540).
- The metadata for a release/file is submitted alongside the file, however this metadata is famously unreliable, and most installers instead choose to download the entire file and read that in part due to that unreliability.
- There is no mechanism for allowing a repository to do any sort of sanity checks before bandwidth starts getting expended on an upload, whereas a lot of the cases of invalid metadata or incorrect permissions could be checked prior to upload.
- It has no support for “staging” a draft release prior to publishing it to the repository.
- It has no support for creating new projects, without uploading a file.
This PEP proposes a new API for uploads, and deprecates the existing non standard API.
Status Quo
This does not attempt to be a fully exhaustive documentation of the current API, but give a high level overview of the existing API.
Endpoint
The existing upload API (and the now removed register API) lives at an url, currently
https://upload.pypi.org/legacy/
, and to communicate which specific API you want
to call, you add a :action
url parameter with a value of file_upload
. The values
of submit
, submit_pkg_info
, and doc_upload
also used to be supported, but
no longer are.
It also has a protocol_version
parameter, in theory to allow new versions of the
API to be written, but in practice that has never happened, and the value is always
1
.
So in practice, on PyPI, the endpoint is
https://upload.pypi.org/legacy/?:action=file_upload&protocol_version=1
.
Encoding
The data to be submitted is submitted as a POST
request with the content type
of multipart/form-data
. This is due to the historical nature, that this API
was not actually designed as an API, but rather was a form on the initial PyPI
implementation, then client code was written to programmatically submit that form.
Content
Roughly speaking, the metadata contained within the package is submitted as parts
where the content-disposition is form-data
, and the name is the name of the
field. The names of these various pieces of metadata are not documented, and they
sometimes, but not always match the names used in the METADATA
files. The casing
rarely matches though, but overall the METADATA
to form-data
conversion is
extremely inconsistent.
The file itself is then sent as a application/octet-stream
part with the name
of content
, and if there is a PGP signature attached, then it will be included
as a application/octet-stream
part with the name of gpg_signature
.
Specification
This PEP traces the root cause of most of the issues with the existing API to be roughly two things:
- The metadata is submitted alongside the file, rather than being parsed from the
file itself.
- This is actually fine if used as a pre-check, but it should be validated
against the actual
METADATA
or similar files within the distribution.
- This is actually fine if used as a pre-check, but it should be validated
against the actual
- It supports a single request, using nothing but form data, that either succeeds or fails, and everything is done and contained within that single request.
We then propose a multi-request workflow, that essentially boils down to:
- Initiate an upload session.
- Upload the file(s) as part of the upload session.
- Complete the upload session.
- (Optional) Check the status of an upload session.
All URLs described here will be relative to the root endpoint, which may be
located anywhere within the url structure of a domain. So it could be at
https://upload.example.com/
, or https://example.com/upload/
.
Versioning
This PEP uses the same MAJOR.MINOR
versioning system as used in PEP 691,
but it is otherwise independently versioned. The existing API is considered by
this spec to be version 1.0
, but it otherwise does not attempt to modify
that API in any way.
Endpoints
Create an Upload Session
To create a new upload session, you can send a POST
request to /
,
with a payload that looks like:
{
"meta": {
"api-version": "2.0"
},
"name": "foo",
"version": "1.0"
}
This currently has three keys, meta
, name
, and version
.
The meta
key is included in all payloads, and it describes information about the
payload itself.
The name
key is the name of the project that this session is attempting to
add files to.
The version
key is the version of the project that this session is attepmting to
add files to.
If creating the session was successful, then the server must return a response that looks like:
{
"meta": {
"api-version": "2.0"
},
"urls": {
"upload": "...",
"draft": "...",
"publish": "..."
},
"valid-for": 604800,
"status": "pending",
"files": {},
"notices": [
"a notice to display to the user"
]
}
Besides the meta
key, this response has five keys, urls
, valid-for
,
status
, files
, and notices
.
The urls
key is a dictionary mapping identifiers to related URLs to this
session.
The valid-for
key is an integer representing how long, in seconds, until the
server itself will expire this session (and thus all of the URLs contained in it).
The session SHOULD live at least this much longer unless the client itself
has canceled the session. Servers MAY choose to increase this time, but should
never decrease it, except naturally through the passage of time.
The status
key is a string that contains one of pending
, published
,
errored
, or canceled
, this string represents the overall status of
the session.
The files
key is a mapping containing the filenames that have been uploaded
to this session, to a mapping containing details about each file.
The notices
key is an optional key that points to an array of notices that
the server wishes to communicate to the end user that are not specific to any
one file.
For each filename in files
the mapping has three keys, status
, url
,
and notices
.
The status
key is the same as the top level status
key, except that it
indicates the status of a specific file.
The url
key is the absolute URL that the client should upload that specific
file to (or use to delete that file).
The notices
key is an optional key, that is an array of notices that the server
wishes to communicate to the end user that are specific to this file.
The required response code to a successful creation of the session is a
201 Created
response and it MUST include a Location
header that is the
URL for this session, which may be used to check its status or cancel it.
For the urls
key, there are currently three keys that may appear:
The upload
key, which is the upload endpoint for this session to initiate
a file upload.
The draft
key, which is the repository URL that these files are available at
prior to publishing.
The publish
key, which is the endpoint to trigger publishing the session.
In addition to the above, if a second session is created for the same name+version pair, then the upload server MUST return the already existing session rather than creating a new, empty one.
Upload Each File
Once you have initiated an upload session for one or more files, then you have to actually upload each of those files.
There is no set endpoint for actually uploading the file, that is given to the client by the server as part of the creation of the upload session, and clients MUST NOT assume that there is any commonality to what those URLs look like from one session to the next.
To initiate a file upload, a client sends a POST
request to the upload URL
in the session, with a request body that looks like:
{
"meta": {
"api-version": "2.0"
},
"filename": "foo-1.0.tar.gz",
"size": 1000,
"hashes": {"sha256": "...", "blake2b": "..."},
"metadata": "..."
}
Besides the standard meta
key, this currently has 4 keys:
filename
: The filename of the file being uploaded.size
: The size, in bytes, of the file that is being uploaded.hashes
: A mapping of hash names to hex encoded digests, each of these digests are the digests of that file, when hashed by the hash identified in the name.By default, any hash algorithm available via hashlib (specifically any that can be passed to
hashlib.new()
and do not require additional parameters) can be used as a key for the hashes dictionary. At least one secure algorithm fromhashlib.algorithms_guaranteed
MUST always be included. At the time of this PEP,sha256
specifically is recommended.Multiple hashes may be passed at a time, but all hashes must be valid for the file.
metadata
: An optional key that is a string containing the file’s core metadata.
Servers MAY use the data provided in this response to do some sanity checking prior to allowing the file to be uploaded, which may include but is not limited to:
- Checking if the
filename
already exists. - Checking if the
size
would invalidate some quota. - Checking if the contents of the
metadata
, if provided, are valid.
If the server determines that the client should attempt the upload, it will return
a 201 Created
response, with an empty body, and a Location
header pointing
to the URL that the file itself should be uploaded to.
At this point, the status of the session should show the filename, with the above url included in it.
Upload Data
To upload the file, a client has two choices, they may upload the file as either a single chunk, or as multiple chunks. Either option is acceptable, but it is recommended that most clients should choose to upload each file as a single chunk as that requires fewer requests and typically has better performance.
However for particularly large files, uploading within a single request may result in timeouts, so larger files may need to be uploaded in multiple chunks.
In either case, the client must generate a unique token (or nonce) for each upload
attempt for a file, and MUST include that token in each request in the Upload-Token
header. The Upload-Token
is a binary blob encoded using base64 surrounded by
a :
on either side. Clients SHOULD use at least 32 bytes of cryptographically
random data. You can generate it using the following:
import base64
import secrets
header = ":" + base64.b64encode(secrets.token_bytes(32)).decode() + ":"
The one time that it is permissible to omit the Upload-Token
from an upload
request is when a client wishes to opt out of the resumable or chunked file upload
feature completely. In that case, they MAY omit the Upload-Token
, and the
file must be successfully uploaded in a single HTTP request, and if it fails, the
entire file must be resent in another single HTTP request.
To upload in a single chunk, a client sends a POST
request to the URL from the
session response for that filename. The client MUST include a Content-Length
header that is equal to the size of the file in bytes, and this MUST match the
size given in the original session creation.
As an example, if uploading a 100,000 byte file, you would send headers like:
Content-Length: 100000
Upload-Token: :nYuc7Lg2/Lv9S4EYoT9WE6nwFZgN/TcUXyk9wtwoABg=:
If the upload completes successfully, the server MUST respond with a
201 Created
status. At this point this file MUST not be present in the
repository, but merely staged until the upload session has completed.
To upload in multiple chunks, a client sends multiple POST
requests to the same
URL as before, one for each chunk.
This time however, the Content-Length
is equal to the size, in bytes, of the
chunk that they are sending. In addition, the client MUST include a
Upload-Offset
header which indicates a byte offset that the content included
in this request starts at and a Upload-Incomplete
header set to 1
.
As an example, if uploading a 100,000 byte file in 1000 byte chunks, and this chunk represents bytes 1001 through 2000, you would send headers like:
Content-Length: 1000
Upload-Token: :nYuc7Lg2/Lv9S4EYoT9WE6nwFZgN/TcUXyk9wtwoABg=:
Upload-Offset: 1001
Upload-Incomplete: 1
However, the final chunk of data omits the Upload-Incomplete
header, since
at that point the upload is no longer incomplete.
For each successful chunk, the server MUST respond with a 202 Accepted
header, except for the final chunk, which MUST be a 201 Created
.
The following constraints are placed on uploads regardless of whether they are single chunk or multiple chunks:
- A client MUST NOT perform multiple
POST
requests in parallel for the same file to avoid race conditions and data loss or corruption. The server MAY terminate any ongoingPOST
request that utilizes the sameUpload-Token
. - If the offset provided in
Upload-Offset
is not0
or the next chunk in an incomplete upload, then the server MUST respond with a 409 Conflict. - Once an upload has started with a specific token, you may not use another token for that file without deleting the in progress upload.
- Once a file has uploaded successfully, you may initiate another upload for that file, and doing so will replace that file.
Resume Upload
To resume an upload, you first have to know how much of the data the server has already received, regardless of if you were originally uploading the file as a single chunk, or in multiple chunks.
To get the status of an individual upload, a client can make a HEAD
request
with their existing Upload-Token
to the same URL they were uploading to.
The server MUST respond back with a 204 No Content
response, with an
Upload-Offset
header that indicates what offset the client should continue
uploading from. If the server has not received any data, then this would be 0
,
if it has received 1007 bytes then it would be 1007
.
Once the client has retrieved the offset that they need to start from, they can upload the rest of the file as described above, either in a single request containing all of the remaining data or in multiple chunks.
Canceling an In Progress Upload
If a client wishes to cancel an upload of a specific file, for instance because
they need to upload a different file, they may do so by issuing a DELETE
request to the file upload URL with the Upload-Token
used to upload the
file in the first place.
A successful cancellation request MUST response with a 204 No Content
.
Delete an uploaded File
Already uploaded files may be deleted by issuing a DELETE
request to the file
upload URL without the Upload-Token
.
A successful deletion request MUST response with a 204 No Content
.
Session Status
Similarly to file upload, the session URL is provided in the response to creating the upload session, and clients MUST NOT assume that there is any commonality to what those URLs look like from one session to the next.
To check the status of a session, clients issue a GET
request to the
session URL, to which the server will respond with the same response that
they got when they initially created the upload session, except with any
changes to status
, valid-for
, or updated files
reflected.
Session Cancellation
To cancel an upload session, a client issues a DELETE
request to the
same session URL as before. At which point the server marks the session as
canceled, MAY purge any data that was uploaded as part of that session,
and future attempts to access that session URL or any of the file upload URLs
MAY return a 404 Not Found
.
To prevent a lot of dangling sessions, servers may also choose to cancel a session on their own accord. It is recommended that servers expunge their sessions after no less than a week, but each server may choose their own schedule.
Session Completion
To complete a session, and publish the files that have been included in it,
a client MUST send a POST
request to the publish
url in the
session status payload.
If the server is able to immediately complete the session, it may do so
and return a 201 Created
response. If it is unable to immediately
complete the session (for instance, if it needs to do processing that may
take longer than reasonable in a single HTTP request), then it may return
a 202 Accepted
response.
In either case, the server should include a Location
header pointing
back to the session status url, and if the server returned a 202 Accepted
,
the client may poll that URL to watch for the status to change.
Errors
All Error responses that contain a body will have a body that looks like:
{
"meta": {
"api-version": "2.0"
},
"message": "...",
"errors": [
{
"source": "...",
"message": "..."
}
]
}
Besides the standard meta
key, this has two top level keys, message
and errors
.
The message
key is a singular message that encapsulates all errors that
may have happened on this request.
The errors
key is an array of specific errors, each of which contains
a source
key, which is a string that indicates what the source of the
error is, and a message
key for that specific error.
The message
and source
strings do not have any specific meaning, and
are intended for human interpretation to figure out what the underlying issue
was.
Content-Types
Like PEP 691, this PEP proposes that all requests and responses from the Upload API will have a standard content type that describes what the content is, what version of the API it represents, and what serialization format has been used.
The structure of this content type will be:
application/vnd.pypi.upload.$version+format
Since only major versions should be disruptive to systems attempting to
understand one of these API content bodies, only the major version will be
included in the content type, and will be prefixed with a v
to clarify
that it is a version number.
Unlike PEP 691, this PEP does not change the existing 1.0
API in any
way, so servers will be required to host the new API described in this PEP at
a different endpoint than the existing upload API.
Which means that for the new 2.0 API, the content types would be:
- JSON:
application/vnd.pypi.upload.v2+json
In addition to the above, a special “meta” version is supported named latest
,
whose purpose is to allow clients to request the absolute latest version, without
having to know ahead of time what that version is. It is recommended however,
that clients be explicit about what versions they support.
These content types DO NOT apply to the file uploads themselves, only to the
other API requests/responses in the upload API. The files themselves should use
the application/octet-stream
content-type.
Version + Format Selection
Again similar to PEP 691, this PEP standardizes on using server-driven
content negotiation to allow clients to request different versions or
serialization formats, which includes the format
url parameter.
Since this PEP expects the existing legacy 1.0
upload API to exist at a
different endpoint, and it currently only provides for JSON serialization, this
mechanism is not particularly useful, and clients only have a single version and
serialization they can request. However clients SHOULD be setup to handle
content negotiation gracefully in the case that additional formats or versions
are added in the future.
FAQ
Does this mean PyPI is planning to drop support for the existing upload API?
At this time PyPI does not have any specific plans to drop support for the existing upload API.
Unlike with PEP 691 there are wide benefits to doing so, so it is likely that we will want to drop support for it at some point in the future, but until this API is implemented, and receiving broad use it would be premature to make any plans for actually dropping support for it.
Is this Resumable Upload protocol based on anything?
Yes!
It’s actually the protocol specified in an Active Internet-Draft, where the authors took what they learned implementing tus to provide the idea of resumable uploads in a wholly generic, standards based way.
The only deviation we’ve made from that spec is that we don’t use the
104 Upload Resumption Supported
informational response in the first
POST
request. This decision was made for a few reasons:
- The
104 Upload Resumption Supported
is the only part of that draft which does not rely entirely on things that are already supported in the existing standards, since it was adding a new informational status. - Many clients and web frameworks don’t support
1xx
informational responses in a very good way, if at all, adding it would complicate implementation for very little benefit. - The purpose of the
104 Upload Resumption Supported
support is to allow clients to determine that an arbitrary endpoint that they’re interacting with supports resumable uploads. Since this PEP is mandating support for that in servers, clients can just assume that the server they are interacting with supports it, which makes using it unneeded. - In theory, if the support for
1xx
responses got resolved and the draft gets accepted with it in, we can add that in at a later date without changing the overall flow of the API.
There is a risk that the above draft doesn’t get accepted, but even if it does not, that doesn’t actually affect us. It would just mean that our support for resumable uploads is an application specific protocol, but is still wholly standards compliant.
Open Questions
Multipart Uploads vs tus
This PEP currently bases the actual uploading of files on an internet draft from tus.io that supports resumable file uploads.
That protocol requires a few things:
- That the client selects a secure
Upload-Token
that they use to identify uploading a single file. - That if clients don’t upload the entire file in one shot, that they have
to submit the chunks serially, and in the correct order, with all but the
final chunk having a
Upload-Incomplete: 1
header. - Resumption of an upload is essentially just querying the server to see how much data they’ve gotten, then sending the remaining bytes (either as a single request, or in chunks).
- The upload implicitly is completed when the server successfully gets all of the data from the client.
This has one big benefit, that if a client doesn’t care about resuming their
download, the work to support, from a client side, resumable uploads is able
to be completely ignored. They can just POST
the file to the URL, and if
it doesn’t succeed, they can just POST
the whole file again.
The other benefit is that even if you do want to support resumption, you can
still just POST
the file, and unless you need to resume the download,
that’s all you have to do.
Another, possibly theoretical, benefit is that for hashing the uploaded files, the serial chunks requirement means that the server can maintain hashing state between requests, update it for each request, then write that file back to storage. Unfortunately this isn’t actually possible to do with Python’s hashlib, though there are some libraries like Rehash that implement it, but they don’t support every hash that hashlib does (specifically not blake2 or sha3 at the time of writing).
We might also need to reconstitute the download for processing anyways to do things like extract metadata, etc from it, which would make it a moot point.
The downside is that there is no ability to parallelize the upload of a single file because each chunk has to be submitted serially.
AWS S3 has a similar API (and most blob stores have copied it either wholesale or something like it) which they call multipart uploading.
The basic flow for a multipart upload is:
- Initiate a Multipart Upload to get an Upload ID.
- Break your file up into chunks, and upload each one of them individually.
- Once all chunks have been uploaded, finalize the upload. - This is the step where any errors would occur.
It does not directly support resuming an upload, but it allows clients to control the “blast radius” of failure by adjusting the size of each part they upload, and if any of the parts fail, they only have to resend those specific parts.
This has a big benefit in that it allows parallelization in uploading files, allowing clients to maximize their bandwidth using multiple threads to send the data.
We wouldn’t need an explicit step (1), because our session would implicitly initiate a multipart upload for each file.
It does have its own downsides:
- Clients have to do more work on every request to have something resembling resumable uploads. They would have to break the file up into multiple parts rather than just making a single POST request, and only needing to deal with the complexity if something fails.
- Clients that don’t care about resumption at all still have to deal with
the third explicit step, though they could just upload the file all as a
single part.
- S3 works around this by having another API for one shot uploads, but I’d rather not have two different APIs for uploading the same file.
- Verifying hashes gets somewhat more complicated. AWS implements hashing
multipart uploads by hashing each part, then the overall hash is just a
hash of those hashes, not of the content itself. We need to know the
actual hash of the file itself for PyPI, so we would have to reconstitute
the file and read its content and hash it once it’s been fully uploaded,
though we could still use the hash of hashes trick for checksumming the
upload itself.
- See above about whether this is actually a downside in practice, or if it’s just in theory.
I lean towards the tus style resumable uploads as I think they’re simpler to use and to implement, and the main downside is that we possibly leave some multi-threaded performance on the table, which I think that I’m personally fine with?
I guess one additional benefit of the S3 style multi part uploads is that you don’t have to try and do any sort of protection against parallel uploads, since they’re just supported. That alone might erase most of the server side implementation simplification.
Copyright
This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.
Source: https://github.com/python/peps/blob/main/peps/pep-0694.rst
Last modified: 2024-07-10 21:28:34 GMT