PEP 694 – Upload 2.0 API for Python Package Indexes
- Author: Barry Warsaw <barry at python.org>, Donald Stufft <donald at stufft.io>
- Discussions-To: Discourse thread
- Status: Draft
- Type: Standards Track
- Topic: Packaging
- Created: 11-Jun-2022
- Post-History: 27-Jun-2022
Abstract
This PEP proposes a standard API for uploading files to a Python package index such as PyPI. Along with standardization, the upload API provides additional useful features such as support for:
- an upload session, which can be used to simultaneously publish all wheels in a package release;
- “staging” a release, which can be used to test uploads before publicly publishing them, without the need for test.pypi.org;
- artifacts which can be overwritten and replaced, until a session is published;
- asynchronous and “chunked”, resumable file uploads, for more efficient use of network bandwidth;
- detailed status on the state of artifact uploads;
- new project creation without requiring the uploading of an artifact.
Once this new upload API is adopted, the existing legacy API can be deprecated; however, this PEP does not propose a deprecation schedule for the legacy API.
Rationale
There is currently no standardized API for uploading files to a Python package index such as PyPI. Instead, everyone has been forced to reverse engineer the existing “legacy” API.
The legacy API, while functional, leaks implementation details of the original PyPI code base, which has been faithfully replicated in the new code base and alternative implementations.
In addition, there are a number of major issues with the legacy API:
- It is fully synchronous, which forces requests to be held open both for the upload itself, and while the index processes the uploaded file to determine success or failure.
- It does not support any mechanism for resuming an upload. With the largest default file size limit on PyPI being around 1 GB, requiring the entire upload to complete successfully means bandwidth is wasted when such uploads experience a network interruption while the request is in progress.
- The atomic unit of operation is a single file. This is problematic when a release logically includes an sdist and multiple binary wheels, leading to race conditions where consumers get different versions of the package if they are unlucky enough to require a package before their platform’s wheel has completely uploaded. If the release uploads its sdist first, this may also manifest in some consumers seeing only the sdist, triggering a local build from source.
- Status reporting is very limited. There’s no support for reporting multiple errors, warnings, deprecations, etc. Status is limited to the HTTP status code and reason phrase, of which the reason phrase has been deprecated since HTTP/2 (RFC 7540).
- Metadata for a release is submitted alongside the file. However, as this metadata is famously unreliable, most installers instead choose to download the entire file and read the metadata from there.
- There is no mechanism for allowing an index to do any sort of sanity checks before bandwidth gets expended on an upload. Many cases of invalid metadata or incorrect permissions could be checked prior to uploading files.
- There is no support for “staging” a release prior to publishing it to the index.
- Creation of new projects requires the uploading of at least one file, leading to “stub” uploads to claim a project namespace.
The new upload API proposed in this PEP solves all of these problems, providing for a much more flexible, bandwidth friendly approach, with better error reporting, a better release testing experience, and atomic and simultaneous publishing of all release artifacts.
Legacy API
The following is an overview of the legacy API. For the detailed description, consult the PyPI user guide documentation.
Endpoint
The existing upload API lives at a base URL. For PyPI, that URL is currently https://upload.pypi.org/legacy/. Clients performing uploads specify the API they want to call by adding an :action URL parameter with a value of file_upload. [1]

The legacy API also has a protocol_version parameter, in theory allowing new versions of the API to be defined. In practice this has never happened, and the value is always 1.

Thus, the effective upload API on PyPI is: https://upload.pypi.org/legacy/?:action=file_upload&protocol_version=1.
Encoding
The data is submitted as a POST request with a content type of multipart/form-data. This reflects the legacy API's historical nature: it was originally designed not as an API, but as a web form on the initial PyPI implementation, with client code written to programmatically submit that form.
Content
Roughly speaking, the metadata contained within the package is submitted as parts where the content disposition is form-data, and the metadata key is the name of the field. The names of these various pieces of metadata are not documented, and they sometimes, but not always, match the names used in the METADATA files for package artifacts. The case rarely matches, and the form-data to METADATA conversion is inconsistent.

The upload artifact file itself is sent as an application/octet-stream part with the name content, and if there is a PGP signature attached, it is included as an application/octet-stream part with the name gpg_signature.
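To make the shape of a legacy upload concrete, the following is a rough, non-normative sketch using the third-party requests library; the form field names other than content are illustrative since, as noted above, they are undocumented, and the token value is a placeholder.

import requests

# A sketch of a legacy upload. Field names other than "content" are
# illustrative, since the legacy API's form fields are undocumented.
with open("foo-1.0.tar.gz", "rb") as f:
    resp = requests.post(
        "https://upload.pypi.org/legacy/",
        params={":action": "file_upload", "protocol_version": "1"},
        data={"name": "foo", "version": "1.0"},  # metadata as form fields
        files={"content": ("foo-1.0.tar.gz", f, "application/octet-stream")},
        auth=("__token__", "<api-token>"),  # PyPI API token authentication
    )
resp.raise_for_status()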
Authentication
Upload authentication is also not standardized. On PyPI, authentication is through API tokens or Trusted Publisher (OpenID Connect). Other indexes may support different authentication methods.
Upload 2.0 API Specification
This PEP draws inspiration from the Resumable Uploads for HTTP internet draft, however there are significant differences. This is largely due to the unique nature of Python package releases (i.e. metadata, multiple related artifacts, etc.), and the support for an upload session and release stages. Where it makes sense to adopt details of the draft, this PEP does so.
This PEP traces the root cause of most of the issues with the existing API to roughly two things:
- The metadata is submitted alongside the file, rather than being parsed from the file itself. [2]
- It supports only a single request, using only form data, that either succeeds or fails, and all actions are atomic within that single request.
To address these issues, this PEP proposes a multi-request workflow, which at a high level involves these steps:
- Initiate an upload session, creating a release stage.
- Upload the file(s) to that stage as part of the upload session.
- Complete the upload session, publishing or discarding the stage.
- Optionally check the status of an upload session.
Versioning
This PEP uses the same MAJOR.MINOR versioning system as PEP 691, but it is otherwise independently versioned. The legacy API is considered by this PEP to be version 1.0, but this PEP does not modify the legacy API in any way.

The API proposed in this PEP therefore has the version number 2.0.
Root Endpoint
All URLs described here are relative to the "root endpoint", which may be located anywhere within the URL structure of a domain. For example, the root endpoint could be https://upload.example.com/ or https://example.com/upload/.

Specifically for PyPI, this PEP proposes to implement the root endpoint at https://upload.pypi.org/2.0. This root URL will be considered provisional while the feature is being tested, and will be blessed as permanent after sufficient testing with live projects.
Create an Upload Session
A release starts by creating a new upload session. To create the session, a client submits a POST request to the root URL, with a payload that looks like:
{
    "meta": {
        "api-version": "2.0"
    },
    "name": "foo",
    "version": "1.0",
    "nonce": "<string>"
}
The request includes the following top-level keys:

- meta (required): Describes information about the payload itself. Currently, the only defined sub-key is api-version, the value of which must be the string "2.0".
- name (required): The name of the project that this session is attempting to release a new version of.
- version (required): The version of the project that this session is attempting to add files to.
- nonce (optional): An additional client-side string input to the "session token" algorithm. Details are provided below, but if this key is omitted, it is equivalent to passing the empty string.
Upon successful session creation, the server returns a 201 Created response. If an error occurs, the appropriate 4xx code will be returned, as described in the Errors section.
If a session is created for a project which has no previous release, then the index MAY reserve the project name before the session is published, however it MUST NOT be possible to navigate to that project using the “regular” (i.e. unstaged) access protocols, until the stage is published. If this first-release stage gets canceled, then the index SHOULD delete the project record, as if it were never uploaded.
The session is owned by the user that created it, and all subsequent requests MUST be performed with the same credentials, otherwise a 403 Forbidden will be returned on those subsequent requests.
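As a non-normative illustration, creating a session might look like the following sketch, using the third-party requests library; the credentials and nonce are placeholders, and the content type is the one defined in the Content Types section below.

import requests

ROOT = "https://upload.pypi.org/2.0"  # the proposed (provisional) root endpoint
HEADERS = {"Content-Type": "application/vnd.pypi.upload.v2+json"}
AUTH = ("__token__", "<api-token>")  # placeholder credentials

resp = requests.post(
    ROOT,
    json={
        "meta": {"api-version": "2.0"},
        "name": "foo",
        "version": "1.0",
        "nonce": "s3kr1t",  # optional; omitting it makes the token guessable
    },
    headers=HEADERS,
    auth=AUTH,
)
resp.raise_for_status()  # expect 201 Created
session = resp.json()
upload_url = session["links"]["upload"]
session_url = session["links"]["session"]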
Response Body
The successful response includes the following JSON content:
{
    "meta": {
        "api-version": "2.0"
    },
    "links": {
        "stage": "...",
        "upload": "...",
        "session": "..."
    },
    "session-token": "<token-string>",
    "valid-for": 604800,
    "status": "pending",
    "files": {},
    "notices": [
        "a notice to display to the user"
    ]
}
Besides the meta key, which has the same format as the request JSON, the success response has the following keys:

- links: A dictionary mapping keys to URLs related to this session, the details of which are provided below.
- session-token: If the index supports previewing staged releases, this key will contain the unique "session token" that can be provided to installers in order to preview the staged release before it's published. If the index does not support stage previewing, this key MUST be omitted.
- valid-for: An integer representing how long, in seconds, until the server itself will expire this session, and thus all of its content, including any uploaded files and the URL links related to the session. This value is roughly relative to the time at which the session was created or extended. The session SHOULD live at least this much longer unless the client itself has canceled or published the session. Servers MAY choose to increase this time, but should never decrease it, except naturally through the passage of time. Clients can query the session status to get the time remaining in the session.
- status: A string containing one of pending, published, error, or canceled, representing the overall status of the session.
- files: A mapping from the filenames that have been uploaded to this session to a mapping containing details about each file referenced in this session.
- notices: An optional key that points to an array of human-readable informational notices that the server wishes to communicate to the end user. These notices are specific to the overall session, not to any particular file in the session.
Session Links
For the links key in the success JSON, the following sub-keys are valid:

- upload: The endpoint that clients will use to initiate uploads for each file to be included in this session.
- stage: The endpoint where this staged release can be previewed prior to publishing the session. This can be used to download and verify the not-yet-public files. If the index does not support previewing staged releases, this key MUST be omitted.
- session: The endpoint where actions for this session can be performed, including publishing this session, canceling and discarding the session, querying the current session status, and requesting an extension of the session lifetime (if the server supports it).
Session Files
The files key contains a mapping from the names of the files uploaded in this session to a sub-mapping with the following keys:

- status: A string with the same values and semantics as the session status key, except that it indicates the status of the specific referenced file.
- link: The absolute URL that the client should use to reference this specific file. This URL is used to retrieve, replace, or delete the referenced file. If a nonce was provided, this URL MUST be obfuscated with a non-guessable token as described in the session token section.
- notices: An optional key with a similar format and semantics as the notices session key, except that these notices are specific to the referenced file.
If a second session is created for the same name-version pair while a session for that pair is in the pending state, then the server MUST return the JSON status response for the already existing session, along with a 200 OK status code, rather than creating a new, empty session.
File Upload
After creating the session, the upload endpoint from the response's session links mapping is used to begin the upload of new files into that session. Clients MUST use the provided upload URL and MUST NOT assume there is any pattern or commonality to those URLs from one session to the next.

To initiate a file upload, a client first sends a POST request to the upload URL. The request body has the following JSON format:
{
    "meta": {
        "api-version": "2.0"
    },
    "filename": "foo-1.0.tar.gz",
    "size": 1000,
    "hashes": {"sha256": "...", "blake2b": "..."},
    "metadata": "..."
}
Besides the standard meta key, the request JSON has the following additional keys:

- filename (required): The name of the file being uploaded.
- size (required): The size in bytes of the file being uploaded.
- hashes (required): A mapping of hash names to hex-encoded digests. Each digest is the checksum of the file being uploaded, as computed by the algorithm identified by the name. By default, any hash algorithm available in hashlib can be used as a key for the hashes dictionary [3]. At least one secure algorithm from hashlib.algorithms_guaranteed MUST always be included; this PEP specifically recommends sha256. Multiple hashes may be passed at a time, but all hashes provided MUST be valid for the file.
- metadata (optional): If given, a string value containing the file's core metadata.
Servers MAY use the data provided in this request to do some sanity checking prior to allowing the file to be uploaded. These checks may include, but are not limited to:
- checking if the filename already exists in a published release;
- checking if the size would exceed any project or file quota;
- checking if the contents of the metadata, if provided, are valid.
If the server determines that the upload should proceed, it will return a 201 Created response with an empty body, and a Location header pointing to the URL that the file content should be uploaded to. The status of the session will also include the filename in the files mapping, with the above Location URL included under the link sub-key.
Important
The IETF draft calls this the URL of the upload resource, and this PEP uses that nomenclature as well.
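Continuing the earlier non-normative sketch (and reusing its HEADERS and AUTH placeholders), initiating a file upload might look like this, for a hypothetical sdist:

import hashlib
from pathlib import Path

path = Path("foo-1.0.tar.gz")  # hypothetical artifact
data = path.read_bytes()

resp = requests.post(
    upload_url,  # the "upload" link from the session creation response
    json={
        "meta": {"api-version": "2.0"},
        "filename": path.name,
        "size": len(data),
        "hashes": {"sha256": hashlib.sha256(data).hexdigest()},
    },
    headers=HEADERS,
    auth=AUTH,
)
resp.raise_for_status()  # expect 201 Created with an empty body
upload_resource_url = resp.headers["Location"]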
Upload File Contents
The actual file contents are uploaded by issuing a POST request to the upload resource URL [5]. The client may either upload the entire file in a single request, or it may opt for a "chunked" upload where the file contents are split across multiple requests, as described below.
Important
The protocol defined in this PEP differs from the IETF draft in a few ways:

- For chunked uploads, the second and subsequent chunks are uploaded using POST requests instead of PATCH requests. Similarly, this PEP uses application/octet-stream for the Content-Type headers of all chunks.
- No Upload-Draft-Interop-Version header is required.
- Some of the server responses are different.
When uploading the entire file in a single request, the request MUST include the following headers (e.g. for a 100,000 byte file):
Content-Length: 100000
Content-Type: application/octet-stream
Upload-Length: 100000
Upload-Complete: ?1
The body of this request contains all 100,000 bytes of the unencoded raw binary data.
- Content-Length: The number of file bytes contained in the body of this request.
- Content-Type: MUST be application/octet-stream.
- Upload-Length: Indicates the total number of bytes that will be uploaded for this file. For single-request uploads this will always be equal to Content-Length, but these values will likely differ for chunked uploads. This value MUST equal the number of bytes given in the size field of the file upload initiation request.
- Upload-Complete: A flag indicating whether more chunks are coming for this file. For single-request uploads, the value of this header MUST be ?1.

If the upload completes successfully, the server MUST respond with a 201 Created status. The response body has no content.
If this single-request upload fails, the entire file must be resent in another single HTTP request. Single-request upload is the recommended, preferred approach, since fewer requests are required.
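A single-request upload, continuing the same non-normative sketch, might look like:

# Upload the entire file in one request; requests derives Content-Length
# from the body automatically.
resp = requests.post(
    upload_resource_url,
    data=data,  # the raw, unencoded file bytes
    headers={
        "Content-Type": "application/octet-stream",
        "Upload-Length": str(len(data)),
        "Upload-Complete": "?1",  # no more chunks follow
    },
    auth=AUTH,
)
resp.raise_for_status()  # expect 201 Created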
Clients can opt to upload the file in multiple chunks. Because the upload resource URL provided in the metadata response will be unique per file, clients MUST use the given upload resource URL for all chunks. Clients upload file chunks by sending multiple POST requests to this URL, with one request per chunk.

For chunked uploads, the Content-Length is equal to the size in bytes of the chunk that is currently being sent. The client MUST include an Upload-Offset header indicating the byte offset at which the content of this chunk's request starts, and an Upload-Complete header with the value ?0. For the first chunk, the Upload-Offset header MUST be set to 0. As with single-request uploads, the Content-Type header is application/octet-stream and the body is the raw, unencoded bytes of the chunk.
For example, if uploading a 100,000 byte file in 1000 byte chunks, the first chunk’s request headers would be:
Content-Length: 1000
Content-Type: application/octet-stream
Upload-Offset: 0
Upload-Length: 100000
Upload-Complete: ?0
For the second chunk representing bytes 1000 through 1999, include the following headers:
Content-Length: 1000
Content-Type: application/octet-stream
Upload-Offset: 1000
Upload-Length: 100000
Upload-Complete: ?0
These requests would continue sequentially until the last chunk is ready to be uploaded.
For each successfully uploaded chunk, the server MUST respond with a 202 Accepted status, except for the final chunk, which MUST receive a 201 Created; as with non-chunked uploads, the bodies of these responses have no content.

The final chunk of data MUST include the Upload-Complete: ?1 header, since at that point the entire file has been uploaded.
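Under the same assumptions as the earlier sketches, a chunked upload might look like:

# Upload the file in 1000-byte chunks, serially and in order.
CHUNK_SIZE = 1000
for offset in range(0, len(data), CHUNK_SIZE):
    chunk = data[offset:offset + CHUNK_SIZE]
    is_last = offset + len(chunk) >= len(data)
    resp = requests.post(
        upload_resource_url,
        data=chunk,
        headers={
            "Content-Type": "application/octet-stream",
            "Upload-Offset": str(offset),
            "Upload-Length": str(len(data)),
            "Upload-Complete": "?1" if is_last else "?0",
        },
        auth=AUTH,
    )
    # Intermediate chunks get 202 Accepted; the final chunk gets 201 Created.
    assert resp.status_code == (201 if is_last else 202)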
With both chunked and non-chunked uploads, once completed successfully, the file MUST NOT be publicly visible in the repository; it is merely staged until the upload session is completed. If the server supports previews, the file MUST be visible at the stage URL. Partially uploaded chunked files SHOULD NOT be visible at the stage URL.
The following constraints are placed on uploads regardless of whether they are single chunk or multiple chunks:
- A client MUST NOT perform multiple POST requests in parallel for the same file, to avoid race conditions and data loss or corruption.
- If the offset provided in Upload-Offset is neither 0 nor the byte offset of the next expected chunk in an incomplete upload, then the server MUST respond with a 409 Conflict. This means that a client MAY NOT upload chunks out of order.
- Once a file upload has completed successfully, another upload may be initiated for that file, which, once completed, will replace that file. This is possible until the entire session is completed, at which point no further file uploads (either creating or replacing a session file) are accepted; i.e., once a session is published, the files included in that release are immutable [4].
Resume an Upload
To resume an upload, the client first has to know how much of the file's contents the server has already received. If this is not already known, a client can make a HEAD request to the upload resource URL.

The server MUST respond with a 204 No Content response, with an Upload-Offset header that indicates what offset the client should continue uploading from. If the server has not received any data, this would be 0; if it has received 1007 bytes, it would be 1007. For this example, the full response headers would look like:
Upload-Offset: 1007
Upload-Complete: ?0
Cache-Control: no-store
Once the client has retrieved the offset that they need to start from, they can upload the rest of the file as described above, either in a single request containing all of the remaining bytes, or in multiple chunks as per the above protocol.
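A resumption sketch, with the same assumed placeholder names as the earlier examples:

# Discover how many bytes the server already has, then send the rest.
resp = requests.head(upload_resource_url, auth=AUTH)
offset = int(resp.headers["Upload-Offset"])

resp = requests.post(
    upload_resource_url,
    data=data[offset:],  # all remaining bytes, here in a single request
    headers={
        "Content-Type": "application/octet-stream",
        "Upload-Offset": str(offset),
        "Upload-Length": str(len(data)),
        "Upload-Complete": "?1",
    },
    auth=AUTH,
)
resp.raise_for_status()  # expect 201 Created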
Canceling an In-Progress Upload
If a client wishes to cancel an upload of a specific file, for instance because it needs to upload a different file, it may do so by issuing a DELETE request to the upload resource URL of the file it wants to delete.

The server MUST respond to a successful cancellation request with a 204 No Content.

Once the upload is canceled, a client MUST NOT assume that the previous upload resource URL can be reused.
Delete a Partial or Fully Uploaded File
Similarly, for files which have already been completely uploaded, clients can delete the file by issuing a DELETE request to the upload resource URL.

The server MUST respond to a successful deletion request with a 204 No Content.

Once the file is deleted, a client MUST NOT assume that the previous upload resource URL can be reused.
Replacing a Partially or Fully Uploaded File
To replace a session file, the upload of that file MUST have previously been completed or deleted; it is not possible to replace a file while its upload is incomplete. Clients have two options to deal with an incomplete upload they wish to replace:

- Cancel the in-progress upload by issuing a DELETE to the upload resource URL for the file they want to replace. After this, the new file upload can be initiated by beginning the entire file upload sequence over again, which means providing the metadata request again to retrieve a new upload resource URL. Clients MUST NOT assume that the previous upload resource URL can be reused after deletion.
- Complete the in-progress upload by uploading a zero-length chunk with the Upload-Complete: ?1 header (see the example after this list). This effectively truncates and completes the in-progress upload, after which the new upload can commence. In this case, clients SHOULD reuse the previous upload resource URL and do not need to begin the entire file upload sequence over again.
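For example, assuming the server has received 1007 bytes of an incomplete upload (a hypothetical figure), a plausible set of request headers for the completing zero-length chunk would be:

Content-Length: 0
Content-Type: application/octet-stream
Upload-Offset: 1007
Upload-Complete: ?1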
Session Status
At any time, a client can query the status of the session by issuing a GET request to the session link given in the session creation response body.

The server will respond to this GET request with the same response the client got when it initially created the upload session, except with any changes to status, valid-for, or files reflected.
Session Extension
Servers MAY allow clients to extend sessions, but the overall lifetime and number of extensions allowed is left to the server. To extend a session, a client issues a POST request to the session link given in the session creation response body.

The JSON body of this request looks like:
{
    "meta": {
        "api-version": "2.0"
    },
    ":action": "extend",
    "extend-for": 3600
}
The number of seconds specified is just a suggestion to the server for the number of additional seconds to extend the current session. For example, if the client wants to extend the current session for another hour, extend-for would be 3600. Upon successful extension, the server will respond with the same response the client got when it initially created the upload session, except with any changes to status, valid-for, or files reflected.

If the server refuses to extend the session for the requested number of seconds, it still returns a success response, and the valid-for key will simply include the number of seconds remaining in the current session.
Session Cancellation
To cancel an entire session, a client issues a DELETE request to the session link given in the session creation response body. The server then marks the session as canceled, and SHOULD purge any data that was uploaded as part of that session. Future attempts to access that session URL or any of the upload session URLs MUST return a 404 Not Found.

To prevent dangling sessions, servers may also choose to cancel timed-out sessions of their own accord. It is recommended that servers expunge their sessions after no less than a week, but each server may choose its own schedule. Servers MAY support client-directed session extensions.
Session Completion
To complete a session and publish the files that have been included in it, a client issues a POST request to the session link given in the session creation response body.
The JSON body of this request looks like:
{
    "meta": {
        "api-version": "2.0"
    },
    ":action": "publish"
}
If the server is able to immediately complete the session, it may do so and return a 201 Created response. If it is unable to immediately complete the session (for instance, if it needs to do processing that may take longer than is reasonable for a single HTTP request), then it may return a 202 Accepted response.

In either case, the server should include a Location header pointing back to the session status URL, and if the server returned a 202 Accepted, the client may poll that URL to watch for the status to change.
If a session is published that has no staged files, the operation is effectively a no-op, except where a new project name is being reserved. In this case, the new project is created, reserved, and owned by the user that created the session.
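Publishing, and polling on a 202 Accepted, might look like the following sketch, reusing the earlier placeholder names:

import time

# Ask the server to publish the session.
resp = requests.post(
    session_url,
    json={"meta": {"api-version": "2.0"}, ":action": "publish"},
    headers=HEADERS,
    auth=AUTH,
)
if resp.status_code == 202:
    # The server is still processing; poll the session status URL.
    status_url = resp.headers["Location"]
    while True:
        time.sleep(5)  # the polling interval is a client's choice
        status = requests.get(status_url, auth=AUTH).json()["status"]
        if status != "pending":
            break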
Session Token
When creating a session, clients can provide a nonce in the initial session creation request. This nonce is a string with arbitrary content. The nonce is optional, and if omitted, is equivalent to providing an empty string.
In order to support previewing of staged uploads, the package name and version, along with this nonce, are used as input into a hashing algorithm to produce a unique "session token". This session token is valid for the life of the session (i.e., until it is completed, either by cancellation or publishing), and can be provided to supporting installers to gain access to the staged release.
The use of the nonce allows clients to decide whether they want to obscure the visibility of their staged releases or not, and there can be good reasons for either choice. For example, if a CI system wants to upload some wheels for a new release, and wants to allow independent validation of a stage before it's published, the client may opt to not include a nonce. On the other hand, if a client would like to pre-seed a release which it publishes atomically at the time of a public announcement, that client will likely opt to provide a nonce.
The SHA256 algorithm is used to turn these inputs into a unique token, in the order name, version, nonce, as shown in the following example Python code:
from hashlib import sha256

def gentoken(name: bytes, version: bytes, nonce: bytes = b'') -> str:
    # Feed the name, version, and nonce, in that order, into SHA256,
    # and return the hex-encoded digest as the session token.
    h = sha256()
    h.update(name)
    h.update(version)
    h.update(nonce)
    return h.hexdigest()
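For example, a client that created a session for foo 1.0 with the hypothetical nonce s3kr1t would compute its session token as:

token = gentoken(b"foo", b"1.0", b"s3kr1t")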
It should be evident that if no nonce is provided in the session creation request, then the preview token is easily guessable from the package name and version number alone. Clients can elect to omit the nonce (or set it to the empty string themselves) if they want to allow previewing by anybody, since the token can then be derived from public information. By providing a non-empty nonce, clients can elect for security-through-obscurity, but this does not protect staged files behind any kind of authentication.
Stage Previews
The ability to preview staged releases before they are published is an important feature of this PEP, enabling an additional level of last-mile testing before the release is available to the public. Indexes MAY provide this functionality through the URL provided in the stage sub-key of the links key returned when the session is created. The stage URL can be passed to installers such as pip by setting the --extra-index-url flag to this value. Multiple stages can even be previewed by repeating this flag with multiple values.
In the future, it may be valuable to include something like a Stage-Token header in Simple Repository API requests or the PEP 691 JSON-based Simple API, with the value from the session-token sub-key of the JSON response to the session creation request. Multiple Stage-Token headers could be allowed, and installers could support enabling stage previews by adding a --staged <token> or similarly named option to set the Stage-Token header at the command line. This feature is not currently supported, nor proposed by this PEP, though it could be proposed by a separate PEP in the future.
In either case, the index will return views that expose the staged releases to the installer tool, making them available to download and install into virtual environments built for that last-mile testing. The former option allows for existing installers to preview staged releases with no changes, although perhaps in a less user-friendly way. The latter option can be a better user experience, but the details of this are left to installer tool maintainers.
Errors
All error responses that contain content will have a body that looks like:
{
    "meta": {
        "api-version": "2.0"
    },
    "message": "...",
    "errors": [
        {
            "source": "...",
            "message": "..."
        }
    ]
}
Besides the standard meta key, this has the following top-level keys:

- message: A singular message that encapsulates all errors that may have happened on this request.
- errors: An array of specific errors, each of which contains a source key, a string that indicates what the source of the error is, and a message key for that specific error.

The message and source strings do not have any specific meaning, and are intended for human interpretation to aid in diagnosing the underlying issue.
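As a non-normative illustration, a client might surface such an error body like this, continuing the resp name from the earlier sketches:

# Print the overall message, then each specific error with its source.
if resp.status_code >= 400 and resp.content:
    body = resp.json()
    print(body["message"])
    for error in body.get("errors", []):
        print(f"  {error['source']}: {error['message']}")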
Content Types
Like PEP 691, this PEP proposes that all requests and responses from this upload API will have a standard content type that describes what the content is, what version of the API it represents, and what serialization format has been used.
This standard request content type applies to all requests except file upload requests, which, since they contain only binary data, are always application/octet-stream.
The structure of the Content-Type header for all other requests is:

application/vnd.pypi.upload.$version+$format

Since minor API version differences should never be disruptive, only the major version is included in the content type; the version number is prefixed with a v.
Unlike PEP 691, this PEP does not change the existing legacy 1.0 upload API in any way, so servers are required to host the new API described in this PEP at a different endpoint than the existing upload API.

Since JSON is the only request format defined in this PEP, all non-file-upload requests defined in this PEP MUST include a Content-Type header value of application/vnd.pypi.upload.v2+json.
As with PEP 691, a special "meta" version named latest is supported, the purpose of which is to allow clients to request the latest version implemented by the server, without having to know ahead of time what that version is. It is recommended, however, that clients be explicit about what versions they support.

Similar to PEP 691, this PEP also standardizes on using server-driven content negotiation to allow clients to request different versions or serialization formats, which includes the format part of the content type. However, since this PEP expects the existing legacy 1.0 upload API to exist at a different endpoint, and this PEP currently only provides for JSON serialization, this mechanism is not particularly useful: clients have only a single version and serialization they can request. Clients SHOULD nevertheless be prepared to handle content negotiation gracefully in case additional formats or versions are added in the future.
FAQ
Does this mean PyPI is planning to drop support for the existing upload API?
At this time PyPI does not have any specific plans to drop support for the existing upload API. Unlike with PEP 691, there are significant benefits to doing so, so it is likely that support for the legacy upload API will be (responsibly) deprecated and removed at some point in the future. Such future deprecation planning is explicitly out of scope for this PEP.
Is this Resumable Upload protocol based on anything?
Yes!
It’s actually based on the protocol specified in an active internet draft, where the authors took what they learned implementing tus to provide the idea of resumable uploads in a wholly generic, standards-based way.
This PEP deviates from that spec in several ways, as described in the body of the proposal. This decision was made for a few reasons:
- The 104 Upload Resumption Supported informational response is the only part of that draft which does not rely entirely on things that are already supported in existing standards, since it adds a new informational status code.
- Many clients and web frameworks don’t support 1xx informational responses very well, if at all, so adding it would complicate implementation for very little benefit.
- The purpose of the 104 Upload Resumption Supported response is to allow clients to determine that an arbitrary endpoint they’re interacting with supports resumable uploads. Since this PEP mandates support for that in servers, clients can simply assume that the server they are interacting with supports it, which makes the informational response unneeded.
- In theory, if support for 1xx responses gets resolved and the draft is accepted with it included, that support can be added at a later date without changing the overall flow of the API.
Can I use the upload 2.0 API to reserve a project name?
Yes! If you’re not ready to upload files to make a release, you can still reserve a project name (assuming of course that the name doesn’t already exist).
To do this, create a new session, then publish the session without uploading any files. While the version key is required in the JSON body of the create-session request, you can simply use the placeholder version number "0.0.0".
The user that created the session will become the owner of the new project.
Open Questions
Multipart Uploads vs tus
This PEP currently bases the actual uploading of files on an internet draft (originally designed by tus.io) that supports resumable file uploads.
That protocol requires a few things:
- If clients don’t upload the entire file in one shot, they have to submit the chunks serially, and in the correct order, with all but the final chunk having an Upload-Complete: ?0 header.
- Resumption of an upload is essentially just querying the server to see how much data it has received, then sending the remaining bytes (either as a single request, or in chunks).
- The upload is implicitly completed when the server successfully receives all of the data from the client.
This has the benefit that if a client doesn’t care about resuming its upload, it can essentially ignore the protocol. Clients can just POST the file to the file upload URL, and if it doesn’t succeed, they can just POST the whole file again.

The other benefit is that even if clients do want to support resumption, unless they actually need to resume the upload, they can still just POST the file.
Another, possibly theoretical, benefit is that for hashing the uploaded files, the serial-chunks requirement means that the server can maintain hashing state between requests, update it for each request, then write that state back to storage. Unfortunately, this isn’t actually possible to do with Python’s hashlib standard library module. There are some third-party libraries, such as Rehash, that do implement the necessary APIs, but they don’t support every hash that hashlib does (e.g. blake2 or sha3 at the time of writing).

We might also need to reconstitute the upload for processing anyway, to do things like extract metadata from it, which would make this a moot point.
The downside is that there is no ability to parallelize the upload of a single file because each chunk has to be submitted serially.
AWS S3 has a similar API, and most blob stores have copied it, either wholesale or with something closely resembling it, which they call multipart uploads.
The basic flow for a multipart upload is:
- Initiate a multipart upload to get an upload ID.
- Break your file up into chunks, and upload each one of them individually.
- Once all chunks have been uploaded, finalize the upload. This is the step where any errors would occur.
Such multipart uploads do not directly support resuming an upload, but they allow clients to control the “blast radius” of failure by adjusting the size of each part they upload; if any of the parts fail, only those specific parts have to be resent. A further benefit is more parallelism when uploading a single file, letting clients maximize their bandwidth by using multiple threads to send the file data.
We wouldn’t need an explicit step (1), because our session would implicitly initiate a multipart upload for each file.
There are downsides to this though:
- Clients have to do more work on every request to have something resembling resumable uploads. They would have to break the file up into multiple parts rather than just making a single POST request, and only needing to deal with the complexity if something fails.
- Clients that don’t care about resumption at all still have to deal with the third explicit step, though they could just upload the file all as a single part. (S3 works around this by having another API for one shot uploads, but the PEP authors place a high value on having a single API for uploading any individual file.)
- Verifying hashes gets somewhat more complicated. AWS implements hashing multipart uploads by hashing each part, then the overall hash is just a hash of those hashes, not of the content itself. Since PyPI needs to know the actual hash of the file itself anyway, we would have to reconstitute the file, read its content, and hash it once it’s been fully uploaded, though it could still use the hash of hashes trick for checksumming the upload itself.
The PEP authors lean towards tus-style resumable uploads, due to them being simpler to use, easier to implement, and more consistent, with the main downside being that multi-threaded performance is theoretically left on the table.
One other possible benefit of the S3-style multipart uploads is that servers don’t have to implement any sort of protection against parallel uploads, since they’re simply supported. That alone might erase most of the server-side implementation simplification.
Footnotes
Copyright
This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.
Source: https://github.com/python/peps/blob/main/peps/pep-0694.rst
Last modified: 2025-01-06 23:53:01 GMT