
Python Enhancement Proposals

PEP 694 – Upload 2.0 API for Python Package Indexes

Author:
Barry Warsaw <barry at python.org>, Donald Stufft <donald at stufft.io>
Discussions-To:
Discourse thread
Status:
Draft
Type:
Standards Track
Topic:
Packaging
Created:
11-Jun-2022
Post-History:
27-Jun-2022

Abstract

This PEP proposes a standard API for uploading files to a Python package index such as PyPI. Along with standardization, the upload API provides additional useful features such as support for:

  • an upload session, which can be used to simultaneously publish all wheels in a package release;
  • “staging” a release, which can be used to test uploads before publicly publishing them, without the need for test.pypi.org;
  • artifacts which can be overwritten and replaced, until a session is published;
  • asynchronous and “chunked”, resumable file uploads, for more efficient use of network bandwidth;
  • detailed status on the state of artifact uploads;
  • new project creation without requiring the uploading of an artifact.

Once this new upload API is adopted, the existing legacy API can be deprecated; however, this PEP does not propose a deprecation schedule for the legacy API.

Rationale

There is currently no standardized API for uploading files to a Python package index such as PyPI. Instead, everyone has been forced to reverse engineer the existing “legacy” API.

The legacy API, while functional, leaks implementation details of the original PyPI code base, which has been faithfully replicated in the new code base and alternative implementations.

In addition, there are a number of major issues with the legacy API:

  • It is fully synchronous, which forces requests to be held open both for the upload itself, and while the index processes the uploaded file to determine success or failure.
  • It does not support any mechanism for resuming an upload. With the largest default file size limit on PyPI being around 1 GB, requiring the entire upload to complete successfully means bandwidth is wasted when such uploads experience a network interruption while the request is in progress.
  • The atomic unit of operation is a single file. This is problematic when a release logically includes an sdist and multiple binary wheels, leading to race conditions where consumers get different versions of the package if they are unlucky enough to require a package before their platform’s wheel has completely uploaded. If the release uploads its sdist first, this may also manifest in some consumers seeing only the sdist, triggering a local build from source.
  • Status reporting is very limited. There’s no support for reporting multiple errors, warnings, deprecations, etc. Status is limited to the HTTP status code and reason phrase, of which the reason phrase has been deprecated since HTTP/2 (RFC 7540).
  • Metadata for a release is submitted alongside the file. However, as this metadata is famously unreliable, most installers instead choose to download the entire file and read the metadata from there.
  • There is no mechanism for allowing an index to do any sort of sanity checks before bandwidth gets expended on an upload. Many cases of invalid metadata or incorrect permissions could be checked prior to uploading files.
  • There is no support for “staging” a release prior to publishing it to the index.
  • Creation of new projects requires the uploading of at least one file, leading to “stub” uploads to claim a project namespace.

The new upload API proposed in this PEP solves all of these problems, providing for a much more flexible, bandwidth friendly approach, with better error reporting, a better release testing experience, and atomic and simultaneous publishing of all release artifacts.

Legacy API

The following is an overview of the legacy API. For the detailed description, consult the PyPI user guide documentation.

Endpoint

The existing upload API lives at a base URL. For PyPI, that URL is currently https://upload.pypi.org/legacy/. Clients performing uploads specify the API they want to call by adding an :action URL parameter with a value of file_upload. [1]

The legacy API also has a protocol_version parameter, in theory allowing new versions of the API to be defined. In practice this has never happened, and the value is always 1.

Thus, the effective upload API on PyPI is: https://upload.pypi.org/legacy/?:action=file_upload&protocol_version=1.

Encoding

The data to be submitted is submitted as a POST request with the content type of multipart/form-data. This reflects the legacy API’s historical nature, which was originally designed not as an API, but rather as a web form on the initial PyPI implementation, with client code written to programmatically submit that form.

Content

Roughly speaking, the metadata contained within the package is submitted as parts where the content disposition is form-data, and the metadata key is the name of the field. The names of these various pieces of metadata are not documented, and they sometimes, but not always, match the names used in the METADATA files for package artifacts. The case rarely matches, and the form-data to METADATA conversion is inconsistent.

The upload artifact file itself is sent as an application/octet-stream part with the name content, and if there is a PGP signature attached, it is included as an application/octet-stream part with the name gpg_signature.
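To make the legacy request shape concrete, the following sketch builds the form-data fields a legacy client submits. The field names here are assumptions based on what tools such as twine send; the exact set and spelling are not formally documented.

```python
# A rough sketch of the legacy form-data fields; the field names are
# assumptions based on observed client behavior, not a formal specification.
def legacy_form_fields(name: str, version: str) -> dict:
    return {
        ":action": "file_upload",   # selects the upload API
        "protocol_version": "1",    # in practice, always 1
        "name": name,               # maps (inconsistently) to Name: in METADATA
        "version": version,         # maps to Version: in METADATA
        # ...plus many more metadata fields, with inconsistent naming...
    }
```

These fields, together with the file part named content, are sent as a single multipart/form-data POST.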

Authentication

Upload authentication is also not standardized. On PyPI, authentication is through API tokens or Trusted Publisher (OpenID Connect). Other indexes may support different authentication methods.

Upload 2.0 API Specification

This PEP draws inspiration from the Resumable Uploads for HTTP internet draft, however there are significant differences. This is largely due to the unique nature of Python package releases (i.e. metadata, multiple related artifacts, etc.), and the support for an upload session and release stages. Where it makes sense to adopt details of the draft, this PEP does so.

This PEP traces the root cause of most of the issues with the existing API to roughly two things:

  • The metadata is submitted alongside the file, rather than being parsed from the file itself. [2]
  • It supports only a single request, using only form data, that either succeeds or fails, and all actions are atomic within that single request.

To address these issues, this PEP proposes a multi-request workflow, which at a high level involves these steps:

  1. Initiate an upload session, creating a release stage.
  2. Upload the file(s) to that stage as part of the upload session.
  3. Complete the upload session, publishing or discarding the stage.
  4. Optionally check the status of an upload session.

Versioning

This PEP uses the same MAJOR.MINOR versioning system as used in PEP 691, but it is otherwise independently versioned. The legacy API is considered by this PEP to be version 1.0, but this PEP does not modify the legacy API in any way.

The API proposed in this PEP therefore has the version number 2.0.

Root Endpoint

All URLs described here are relative to the “root endpoint”, which may be located anywhere within the url structure of a domain. For example, the root endpoint could be https://upload.example.com/, or https://example.com/upload/.

Specifically for PyPI, this PEP proposes to implement the root endpoint at https://upload.pypi.org/2.0. This root URL will be considered provisional while the feature is being tested, and will be blessed as permanent after sufficient testing with live projects.

Create an Upload Session

A release starts by creating a new upload session. To create the session, a client submits a POST request to the root URL, with a payload that looks like:

{
  "meta": {
    "api-version": "2.0"
  },
  "name": "foo",
  "version": "1.0",
  "nonce": "<string>"
}

The request includes the following top-level keys:

meta (required)
Describes information about the payload itself. Currently, the only defined sub-key is api-version, whose value must be the string "2.0".
name (required)
The name of the project that this session is attempting to release a new version of.
version (required)
The version of the project that this session is attempting to add files to.
nonce (optional)
An additional client-side string input to the “session token” algorithm. Details are provided below, but if this key is omitted, it is equivalent to passing the empty string.

Upon successful session creation, the server returns a 201 Created response. If an error occurs, the appropriate 4xx code will be returned, as described in the Errors section.

If a session is created for a project which has no previous release, then the index MAY reserve the project name before the session is published, however it MUST NOT be possible to navigate to that project using the “regular” (i.e. unstaged) access protocols, until the stage is published. If this first-release stage gets canceled, then the index SHOULD delete the project record, as if it were never uploaded.

The session is owned by the user that created it, and all subsequent requests MUST be performed with the same credentials, otherwise a 403 Forbidden will be returned on those subsequent requests.
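The session-creation request above can be sketched as follows. Only the JSON shape comes from this PEP; the helper function itself is illustrative, and transmitting the body (over HTTPS, with the content type described in the Content Types section) is left to the client's HTTP library.

```python
import json

# A minimal sketch of the session-creation request body; only the JSON
# shape comes from this PEP, the helper itself is an illustration.
def create_session_payload(name: str, version: str, nonce: str = "") -> bytes:
    body = {
        "meta": {"api-version": "2.0"},
        "name": name,
        "version": version,
    }
    if nonce:  # omitting the key is equivalent to passing the empty string
        body["nonce"] = nonce
    return json.dumps(body).encode("utf-8")

# POST this body to the root endpoint and expect a 201 Created response.
```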

Response Body

The successful response includes the following JSON content:

{
  "meta": {
    "api-version": "2.0"
  },
  "links": {
    "stage": "...",
    "upload": "...",
    "session": "..."
  },
  "session-token": "<token-string>",
  "valid-for": 604800,
  "status": "pending",
  "files": {},
  "notices": [
    "a notice to display to the user"
  ]
}

Besides the meta key, which has the same format as the request JSON, the success response has the following keys:

links
A dictionary mapping keys to URLs related to this session, the details of which are provided below.
session-token
If the index supports previewing staged releases, this key will contain the unique “session token” that can be provided to installers in order to preview the staged release before it’s published. If the index does not support stage previewing, this key MUST be omitted.
valid-for
An integer representing how long, in seconds, until the server itself will expire this session, and thus all of its content, including any uploaded files and the URL links related to the session. This value is roughly relative to the time at which the session was created or extended. The session SHOULD live at least this much longer unless the client itself has canceled or published the session. Servers MAY choose to increase this time, but should never decrease it, except naturally through the passage of time. Clients can query the session status to get time remaining in the session.
status
A string that contains one of pending, published, error, or canceled, representing the overall status of the session.
files
A mapping from the filenames that have been uploaded in this session to details about each referenced file.
notices
An optional key that points to an array of human-readable informational notices that the server wishes to communicate to the end user. These notices are specific to the overall session, not to any particular file in the session.

Session Files

The files key contains a mapping from the names of the files uploaded in this session to a sub-mapping with the following keys:

status
A string with the same values and semantics as the session status key, except that it indicates the status of the specific referenced file.
link
The absolute URL that the client should use to reference this specific file. This URL is used to retrieve, replace, or delete the referenced file. If a nonce was provided, this URL MUST be obfuscated with a non-guessable token as described in the session token section.
notices
An optional key with similar format and semantics as the notices session key, except that these notices are specific to the referenced file.

If a second session is created for the same name-version pair while a session for that pair is in the pending state, then the server MUST return the JSON status response for the already existing session, along with a 200 OK status code, rather than creating a new, empty session.

File Upload

After creating the session, the upload endpoint from the response’s session links mapping is used to begin the upload of new files into that session. Clients MUST use the provided upload URL and MUST NOT assume there is any pattern or commonality to those URLs from one session to the next.

To initiate a file upload, a client first sends a POST request to the upload URL. The request body has the following JSON format:

{
  "meta": {
    "api-version": "2.0"
  },
  "filename": "foo-1.0.tar.gz",
  "size": 1000,
  "hashes": {"sha256": "...", "blake2b": "..."},
  "metadata": "..."
}

Besides the standard meta key, the request JSON has the following additional keys:

filename (required)
The name of the file being uploaded.
size (required)
The size in bytes of the file being uploaded.
hashes (required)
A mapping of hash names to hex-encoded digests. Each digest is the checksum of the file being uploaded, computed with the algorithm named in the key.

By default, any hash algorithm available in hashlib can be used as a key for the hashes dictionary [3]. At least one secure algorithm from hashlib.algorithms_guaranteed MUST always be included. This PEP specifically recommends sha256.

Multiple hashes may be passed at a time, but all hashes provided MUST be valid for the file.

metadata (optional)
If given, this is a string value containing the file’s core metadata.

Servers MAY use the data provided in this request to do some sanity checking prior to allowing the file to be uploaded. These checks may include, but are not limited to:

  • checking if the filename already exists in a published release;
  • checking if the size would exceed any project or file quota;
  • checking if the contents of the metadata, if provided, are valid.

If the server determines that the upload should proceed, it will return a 201 Created response with an empty body, and a Location header pointing to the URL that the file content should be uploaded to. The status of the session will also include the filename in the files mapping, with the above Location URL included under the link sub-key.

Important

The IETF draft calls this the URL of the upload resource, and this PEP uses that nomenclature as well.
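The initiation request can be sketched as follows. The helper signature and the decision to hold the file in memory are assumptions for brevity; the JSON keys come from this PEP.

```python
import hashlib
import json

# Sketch of the file-upload initiation body; the helper signature is an
# assumption, the JSON keys come from this PEP.
def initiate_upload_payload(filename: str, data: bytes) -> bytes:
    return json.dumps({
        "meta": {"api-version": "2.0"},
        "filename": filename,
        "size": len(data),
        # sha256 satisfies the requirement that at least one algorithm
        # from hashlib.algorithms_guaranteed be included.
        "hashes": {"sha256": hashlib.sha256(data).hexdigest()},
    }).encode("utf-8")

# POST this body to the session's upload URL; a 201 Created response
# carries the upload resource URL in its Location header.
```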

Upload File Contents

The actual file contents are uploaded by issuing a POST request to the upload resource URL [5]. The client may either upload the entire file in a single request, or it may opt for “chunked” upload where the file contents are split into multiple requests, as described below.

Important

The protocol defined in this PEP differs from the IETF draft in a few ways:

  • For chunked uploads, the second and subsequent chunks are uploaded using a POST request instead of PATCH requests. Similarly, this PEP uses application/octet-stream for the Content-Type headers for all chunks.
  • No Upload-Draft-Interop-Version header is required.
  • Some of the server responses are different.

When uploading the entire file in a single request, the request MUST include the following headers (e.g. for a 100,000 byte file):

Content-Length: 100000
Content-Type: application/octet-stream
Upload-Length: 100000
Upload-Complete: ?1

The body of this request contains all 100,000 bytes of the unencoded raw binary data.

Content-Length
The number of file bytes contained in the body of this request.
Content-Type
MUST be application/octet-stream.
Upload-Length
Indicates the total number of bytes that will be uploaded for this file. For single-request uploads this will always be equal to Content-Length, but these values will likely differ for chunked uploads. This value MUST equal the number of bytes given in the size field of the file upload initiation request.
Upload-Complete
A flag indicating whether more chunks are coming for this file. For single-request uploads, the value of this header MUST be ?1.

If the upload completes successfully, the server MUST respond with a 201 Created status. The response body has no content.

If this single-request upload fails, the entire file must be resent in another single HTTP request. This is the recommended, preferred approach for file uploads, since fewer requests are required.
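The header set for a single-request upload can be sketched as a small helper. The header names and values follow the example above; the helper itself is illustrative.

```python
# Builds the headers for a single-request upload of `size` bytes; the
# header semantics come from this PEP, the helper is illustrative.
def single_upload_headers(size: int) -> dict:
    return {
        "Content-Length": str(size),
        "Content-Type": "application/octet-stream",
        "Upload-Length": str(size),  # equals Content-Length in one-shot uploads
        "Upload-Complete": "?1",     # no more chunks will follow
    }
```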

Clients can opt to upload the file in multiple chunks. Because the upload resource URL provided in the metadata response will be unique per file, clients MUST use the given upload resource URL for all chunks. Clients upload file chunks by sending multiple POST requests to this URL, with one request per chunk.

For chunked uploads, the Content-Length is equal to the size in bytes of the chunk that is currently being sent. The client MUST include an Upload-Offset header which indicates the byte offset at which the content included in this chunk’s request starts, and an Upload-Complete header with the value ?0. For the first chunk, the Upload-Offset header MUST be set to 0. As with single-request uploads, the Content-Type header is application/octet-stream and the body is the raw, unencoded bytes of the chunk.

For example, if uploading a 100,000 byte file in 1000 byte chunks, the first chunk’s request headers would be:

Content-Length: 1000
Content-Type: application/octet-stream
Upload-Offset: 0
Upload-Length: 100000
Upload-Complete: ?0

For the second chunk representing bytes 1000 through 1999, include the following headers:

Content-Length: 1000
Content-Type: application/octet-stream
Upload-Offset: 1000
Upload-Length: 100000
Upload-Complete: ?0

These requests would continue sequentially until the last chunk is ready to be uploaded.

For each successfully uploaded chunk, the server MUST respond with a 202 Accepted status, except for the final chunk, which MUST be a 201 Created; as with non-chunked uploads, the bodies of these responses have no content.

The final chunk of data MUST include the Upload-Complete: ?1 header, since at that point the entire file has been uploaded.
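The per-chunk headers described above can be sketched with a generator. The chunking logic is illustrative; the header semantics come from this PEP.

```python
# Yields the headers for each sequential POST of a chunked upload; the
# chunking strategy is illustrative, the header semantics come from
# this PEP.
def chunk_headers(total: int, chunk_size: int):
    offset = 0
    while offset < total:
        size = min(chunk_size, total - offset)
        last = offset + size >= total
        yield {
            "Content-Length": str(size),
            "Content-Type": "application/octet-stream",
            "Upload-Offset": str(offset),  # where this chunk starts
            "Upload-Length": str(total),   # full file size, in every request
            "Upload-Complete": "?1" if last else "?0",
        }
        offset += size
```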

With both chunked and non-chunked uploads, once completed successfully, the file MUST NOT be publicly visible in the repository, but merely staged until the upload session is completed. If the server supports previews, the file MUST be visible at the stage URL. Partially uploaded chunked files SHOULD NOT be visible at the stage URL.

The following constraints are placed on uploads regardless of whether they are single chunk or multiple chunks:

  • A client MUST NOT perform multiple POST requests in parallel for the same file to avoid race conditions and data loss or corruption.
  • If the offset provided in Upload-Offset does not correctly specify the byte offset of the next chunk in an incomplete upload (0 for the first chunk), then the server MUST respond with a 409 Conflict. This means that a client MUST NOT upload chunks out of order.
  • Once a file upload has completed successfully, you may initiate another upload for that file, which, once completed, will replace that file. This is possible until the entire session is completed, at which point no further file uploads (either creating or replacing a session file) are accepted. That is, once a session is published, the files included in that release are immutable [4].

Resume an Upload

To resume an upload, you first have to know how much of the file’s contents the server has already received. If this is not already known, a client can make a HEAD request to the upload resource URL.

The server MUST respond with a 204 No Content response, with an Upload-Offset header that indicates what offset the client should continue uploading from. If the server has not received any data, this would be 0; if it has received 1007 bytes, it would be 1007. For this example, the full response headers would look like:

Upload-Offset: 1007
Upload-Complete: ?0
Cache-Control: no-store

Once the client has retrieved the offset that they need to start from, they can upload the rest of the file as described above, either in a single request containing all of the remaining bytes, or in multiple chunks as per the above protocol.
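Given the offset from the HEAD response, computing the remainder to send is straightforward. The helper below is illustrative; holding the whole file in memory is an assumption for brevity.

```python
# Sketch of resuming an interrupted upload: given the Upload-Offset value
# reported by a HEAD request, compute the slice still to be sent.  The
# helper is illustrative; production clients would likely seek within the
# file rather than hold it all in memory.
def remaining_bytes(data: bytes, upload_offset: str) -> bytes:
    offset = int(upload_offset)
    # Send this remainder in one final request, or chunk it as above.
    return data[offset:]
```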

Canceling an In-Progress Upload

If a client wishes to cancel an upload of a specific file, for instance because they need to upload a different file, they may do so by issuing a DELETE request to the upload resource URL of the file they want to delete.

A successful cancellation request MUST respond with a 204 No Content.

Once an upload is canceled, a client MUST NOT assume that the previous upload resource URL can be reused.

Delete a Partial or Fully Uploaded File

Similarly, for files which have already been completely uploaded, clients can delete the file by issuing a DELETE request to the upload resource URL.

A successful deletion request MUST respond with a 204 No Content.

Once a file is deleted, a client MUST NOT assume that the previous upload resource URL can be reused.

Replacing a Partially or Fully Uploaded File

To replace a session file, the file upload MUST have been previously completed or deleted; it is not possible to replace a file while its upload is incomplete. Clients wishing to replace an incomplete upload have two options:

  • Cancel the in-progress upload by issuing a DELETE to the upload resource URL for the file they want to replace. After this, the new file upload can be initiated by beginning the entire file upload sequence over again, providing the metadata request again to retrieve a new upload resource URL. Clients MUST NOT assume that the previous upload resource URL can be reused after deletion.
  • Complete the in-progress upload by uploading a zero-length chunk providing the Upload-Complete: ?1 header. This effectively truncates and completes the in-progress upload, after which point the new upload can commence. In this case, clients SHOULD reuse the previous upload resource URL and do not need to begin the entire file upload sequence over again.
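The second option, a zero-length completing chunk, can be sketched as the following header set. The inclusion of Upload-Offset here is an assumption; the Upload-Complete value comes from this PEP.

```python
# Sketch of a zero-length chunk that completes (and thereby truncates) an
# in-progress upload so the file can then be replaced in place.  Including
# Upload-Offset is an assumption; Upload-Complete: ?1 comes from this PEP.
def truncate_headers(current_offset: int) -> dict:
    return {
        "Content-Length": "0",                 # no body bytes in this request
        "Content-Type": "application/octet-stream",
        "Upload-Offset": str(current_offset),  # offset reported by a HEAD request
        "Upload-Complete": "?1",               # marks the upload as complete
    }
```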

Session Status

At any time, a client can query the status of the session by issuing a GET request to the session link given in the session creation response body.

The server will respond to this GET request with the same response returned when the upload session was initially created, except with any changes to status, valid-for, or files reflected.

Session Extension

Servers MAY allow clients to extend sessions, but the overall lifetime and number of extensions allowed is left to the server. To extend a session, a client issues a POST request to the session link given in the session creation response body.

The JSON body of this request looks like:

{
  "meta": {
    "api-version": "2.0"
  },
  ":action": "extend",
  "extend-for": 3600
}

The number of seconds specified is just a suggestion to the server for the number of additional seconds to extend the current session. For example, if the client wants to extend the current session for another hour, extend-for would be 3600. Upon successful extension, the server will respond with the same response returned when the upload session was initially created, except with any changes to status, valid-for, or files reflected.

If the server refuses to extend the session for the requested number of seconds, it still returns a success response, and the valid-for key will simply include the number of seconds remaining in the current session.

Session Cancellation

To cancel an entire session, a client issues a DELETE request to the session link given in the session creation response body. The server then marks the session as canceled, and SHOULD purge any data that was uploaded as part of that session. Future attempts to access that session URL or any of the upload session URLs MUST return a 404 Not Found.

To prevent dangling sessions, servers may also choose to cancel timed-out sessions of their own accord. It is recommended that servers expunge sessions after no less than a week, but each server may choose its own schedule. Servers MAY support client-requested session extensions.

Session Completion

To complete a session and publish the files that have been included in it, a client issues a POST request to the session link given in the session creation response body.

The JSON body of this request looks like:

{
  "meta": {
    "api-version": "2.0"
  },
  ":action": "publish"
}

If the server is able to immediately complete the session, it may do so and return a 201 Created response. If it is unable to immediately complete the session (for instance, if it needs to do processing that may take longer than reasonable in a single HTTP request), then it may return a 202 Accepted response.

In either case, the server should include a Location header pointing back to the session status URL, and if the server returned a 202 Accepted, the client may poll that URL to watch for the status to change.

If a session is published that has no staged files, the operation is effectively a no-op, except where a new project name is being reserved. In this case, the new project is created, reserved, and owned by the user that created the session.
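The publish request body can be sketched as follows. The ":action" key and the 201/202 semantics come from this PEP; the polling suggestion in the comments is an assumption about client behavior.

```python
import json

# Sketch of the session-completion request body; the ":action" key comes
# from this PEP, the polling note below is an assumption about clients.
PUBLISH_BODY = json.dumps({
    "meta": {"api-version": "2.0"},
    ":action": "publish",
}).encode("utf-8")

# POST PUBLISH_BODY to the session URL: a 201 Created means the session
# published immediately; a 202 Accepted means processing continues, and
# the Location header points at the session status URL to poll.
```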

Session Token

When creating a session, clients can provide a nonce in the initial session creation request. This nonce is a string with arbitrary content. The nonce is optional; if omitted, it is equivalent to providing an empty string.

In order to support previewing of staged uploads, the package name and version, along with this nonce are used as input into a hashing algorithm to produce a unique “session token”. This session token is valid for the life of the session (i.e., until it is completed, either by cancellation or publishing), and can be provided to supporting installers to gain access to the staged release.

The use of the nonce allows clients to decide whether they want to obscure the visibility of their staged releases or not, and there can be good reasons for either choice. For example, if a CI system wants to upload some wheels for a new release, and wants to allow independent validation of a stage before it’s published, the client may opt for not including a nonce. On the other hand, if a client would like to pre-seed a release which it publishes atomically at the time of a public announcement, that client will likely opt for providing a nonce.

The SHA256 algorithm is used to turn these inputs into a unique token, in the order name, version, nonce, using the following Python code as an example:

from hashlib import sha256

def gentoken(name: bytes, version: bytes, nonce: bytes = b'') -> str:
    # Hash the inputs in order: name, version, nonce.  An omitted nonce is
    # equivalent to the empty string, yielding a guessable token.
    h = sha256()
    h.update(name)
    h.update(version)
    h.update(nonce)
    # The hex digest of the concatenated inputs is the session token.
    return h.hexdigest()

It should be evident that if no nonce is provided in the session creation request, then the preview token is easily guessable from the package name and version number alone. Clients can elect to omit the nonce (or set it to the empty string themselves) if they want to allow previewing from anybody without access to the preview token. By providing a non-empty nonce, clients can elect for security-through-obscurity, but this does not protect staged files behind any kind of authentication.
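A client wanting an unguessable token would generate a random nonce, for example with the standard library's secrets module. The 16-byte length and the project name and version used here are arbitrary illustrations, not mandated by this PEP.

```python
from hashlib import sha256
from secrets import token_hex

# Sketch: a client wanting an unguessable preview token generates a random
# nonce; the 16-byte length is an arbitrary choice, not mandated by this PEP.
nonce = token_hex(16)

# The preview token now depends on a secret only the client knows
# (hypothetical project "foo", version "1.0"):
token = sha256(b"foo" + b"1.0" + nonce.encode()).hexdigest()
```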

Stage Previews

The ability to preview staged releases before they are published is an important feature of this PEP, enabling an additional level of last-mile testing before the release is available to the public. Indexes MAY provide this functionality through the URL provided in the stage sub-key of the links key returned when the session is created. The stage URL can be passed to installers such as pip via the --extra-index-url flag. Multiple stages can even be previewed by repeating this flag with multiple values.

In the future, it may be valuable to include something like a Stage-Token header to the Simple Repository API requests or the PEP 691 JSON-based Simple API, with the value from the session-token sub-key of the JSON response to the session creation request. Multiple Stage-Token headers could be allowed, and installers could support enabling stage previews by adding a --staged <token> or similarly named option to set the Stage-Token header at the command line. This feature is not currently supported, nor proposed by this PEP, though it could be proposed by a separate PEP in the future.

In either case, the index will return views that expose the staged releases to the installer tool, making them available to download and install into virtual environments built for that last-mile testing. The former option allows for existing installers to preview staged releases with no changes, although perhaps in a less user-friendly way. The latter option can be a better user experience, but the details of this are left to installer tool maintainers.

Errors

All error responses that contain content will have a body that looks like:

{
  "meta": {
    "api-version": "2.0"
  },
  "message": "...",
  "errors": [
    {
      "source": "...",
      "message": "..."
    }
  ]
}

Besides the standard meta key, this has the following top level keys:

message
A singular message that encapsulates all errors that may have happened on this request.
errors
An array of specific errors, each of which contains a source key, which is a string that indicates what the source of the error is, and a message key for that specific error.

The message and source strings do not have any specific meaning, and are intended for human interpretation to aid in diagnosing the underlying issue.
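A client might surface such an error body to the user as follows. The body shape comes from this PEP; the helper and its output format are assumptions.

```python
import json

# Illustrative helper that flattens an error response body into printable
# lines; the body shape comes from this PEP, the formatting is an assumption.
def format_errors(body: bytes) -> list:
    doc = json.loads(body)
    lines = [doc["message"]]  # the overall, singular message
    for err in doc.get("errors", []):
        lines.append("%s: %s" % (err["source"], err["message"]))
    return lines
```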

Content Types

Like PEP 691, this PEP proposes that all requests and responses from this upload API will have a standard content type that describes what the content is, what version of the API it represents, and what serialization format has been used.

This standard request content type applies to all requests except file upload requests, which, since they contain only binary data, always use application/octet-stream.

The structure of the Content-Type header for all other requests is:

application/vnd.pypi.upload.$version+$format

Since minor API version differences should never be disruptive, only the major version is included in the content type; the version number is prefixed with a v.

Unlike PEP 691, this PEP does not change the existing legacy 1.0 upload API in any way, so servers are required to host the new API described in this PEP at a different endpoint than the existing upload API.

Since JSON is the only request format defined in this PEP, all non-file-upload requests defined in this PEP MUST include a Content-Type header value of:

  • application/vnd.pypi.upload.v2+json.

As with PEP 691, a special “meta” version named latest is supported, the purpose of which is to allow clients to request the latest version implemented by the server, without having to know ahead of time what that version is. It is recommended, however, that clients be explicit about the versions they support.

Similar to PEP 691, this PEP also standardizes on using server-driven content negotiation to allow clients to request different versions or serialization formats, which includes the format part of the content type. However, since this PEP expects the existing legacy 1.0 upload API to exist at a different endpoint, and this PEP currently only provides for JSON serialization, this mechanism is not particularly useful: clients have only a single version and serialization they can request. However, clients SHOULD be prepared to handle content negotiation gracefully in case additional formats or versions are added in the future.

FAQ

Does this mean PyPI is planning to drop support for the existing upload API?

At this time PyPI does not have any specific plans to drop support for the existing upload API.

Unlike with PEP 691, there are significant benefits to doing so, so it is likely that support for the legacy upload API will be (responsibly) deprecated and removed at some point in the future. Such future deprecation planning is explicitly out of scope for this PEP.

Is this Resumable Upload protocol based on anything?

Yes!

It’s actually based on the protocol specified in an active internet draft, where the authors took what they learned implementing tus to provide resumable uploads in a wholly generic, standards-based way.

This PEP deviates from that spec in several ways, as described in the body of the proposal. This decision was made for a few reasons:

  • The 104 Upload Resumption Supported is the only part of that draft which does not rely entirely on things that are already supported in the existing standards, since it was adding a new informational status.
  • Many clients and web frameworks don’t support 1xx informational responses well, if at all; adding it would complicate implementation for very little benefit.
  • The purpose of the 104 Upload Resumption Supported support is to allow clients to determine that an arbitrary endpoint that they’re interacting with supports resumable uploads. Since this PEP mandates support for that in servers, clients can just assume that the server they are interacting with supports it, which makes the 104 response unnecessary.
  • In theory, if support for 1xx responses improves and the draft is accepted with the 104 response included, it can be added at a later date without changing the overall flow of the API.

Can I use the upload 2.0 API to reserve a project name?

Yes! If you’re not ready to upload files to make a release, you can still reserve a project name (assuming of course that the name doesn’t already exist).

To do this, create a new session, then publish the session without uploading any files. While the version key is required in the JSON body of the create session request, you can simply use the placeholder version number "0.0.0".

The user that created the session will become the owner of the new project.
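As a sketch, the create-session body for a name reservation needs only the project name and the placeholder version (the helper below is illustrative; the session endpoints and publish mechanics are defined elsewhere in this PEP):

```python
import json

def reservation_payload(project_name: str) -> bytes:
    """JSON body for a create-session request that only reserves a name;
    "0.0.0" is the placeholder version used when no files will be uploaded."""
    return json.dumps({"name": project_name, "version": "0.0.0"}).encode()

# A client would POST this body (with the v2+json content type) to create
# the session, then immediately publish the session without uploading files.
```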

Open Questions

Multipart Uploads vs tus

This PEP currently bases the actual uploading of files on an internet draft (originally designed by tus.io) that supports resumable file uploads.

That protocol requires a few things:

  • If clients don’t upload the entire file in one shot, they have to submit the chunks serially and in the correct order, with all but the final chunk having an Upload-Complete: ?0 header.
  • Resumption of an upload is essentially just querying the server to see how much data it has received, then sending the remaining bytes (either as a single request, or in chunks).
  • The upload is implicitly completed when the server successfully receives all of the data from the client.
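The serial chunking rules above can be sketched as a pure function that plans the request sequence. The Upload-Complete values follow the draft’s boolean structured-field syntax, and resumption is modeled by starting from the offset the server reports (all names here are illustrative):

```python
def plan_chunks(data: bytes, chunk_size: int, start: int = 0):
    """Return (offset, body, headers) for each serial chunk.

    `start` is 0 for a fresh upload; when resuming, it is the number of
    bytes the server reports it has already received.
    """
    plan = []
    for offset in range(start, len(data), chunk_size):
        body = data[offset:offset + chunk_size]
        last = offset + len(body) == len(data)
        # All but the final chunk carry Upload-Complete: ?0; the final
        # chunk carries ?1, implicitly completing the upload.
        plan.append((offset, body, {"Upload-Complete": "?1" if last else "?0"}))
    return plan
```

A client that does not care about resumption simply calls this with `start=0` and one chunk the size of the whole file.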

This has the benefit that if a client doesn’t care about resuming their upload, it can essentially ignore the protocol. Clients can just POST the file to the file upload URL, and if it doesn’t succeed, they can just POST the whole file again.

The other benefit is that even if clients do want to support resumption, unless they actually need to resume the upload, they can still just POST the file.

Another, possibly theoretical, benefit is that for hashing the uploaded files, the serial chunks requirement means that the server can maintain hashing state between requests, update it for each chunk, then write that state back to storage. Unfortunately this isn’t actually possible with Python’s hashlib standard library module, since its hash objects cannot be serialized. There are some third-party libraries, such as Rehash, that do implement the necessary APIs, but they don’t support every hash that hashlib does (e.g. blake2 or sha3 at the time of writing).
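The limitation can be demonstrated directly: incremental hashing works fine within one process, but a hashlib object cannot be serialized between requests:

```python
import hashlib
import pickle

chunks = [b"first chunk, ", b"second chunk"]

# Within a single process, feeding chunks incrementally gives the same
# digest as hashing the reassembled content in one shot.
incremental = hashlib.sha256()
for chunk in chunks:
    incremental.update(chunk)
one_shot = hashlib.sha256(b"".join(chunks))

# But the hash object cannot be persisted between requests: attempting to
# pickle it raises TypeError, so a server handling each chunk in a separate
# request cannot simply write the hashlib state back to storage.
try:
    pickle.dumps(incremental)
    can_serialize = True
except TypeError:
    can_serialize = False
```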

We might also need to reconstitute the uploaded file anyway for processing, e.g. to extract metadata from it, which would make this a moot point.

The downside is that there is no ability to parallelize the upload of a single file because each chunk has to be submitted serially.

AWS S3 has an API like this, which it calls multipart uploading, and most blob stores have copied it either wholesale or with minor variations.

The basic flow for a multipart upload is:

  1. Initiate a multipart upload to get an upload ID.
  2. Break your file up into chunks, and upload each one of them individually.
  3. Once all chunks have been uploaded, finalize the upload. This is the step where any errors would occur.

Such multipart uploads do not directly support resuming an upload, but they allow clients to control the “blast radius” of failure by adjusting the size of each part they upload; if any of the parts fail, only those specific parts have to be resent. They also allow for more parallelism when uploading a single file, letting clients maximize their bandwidth by using multiple threads to send the file data.
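The multipart flow above can be sketched as follows, with the upload call stubbed out (in a real client it would be an HTTP PUT of one part, retried independently on failure):

```python
from concurrent.futures import ThreadPoolExecutor

def split_parts(data: bytes, part_size: int) -> list[tuple[int, bytes]]:
    """Step 2: break the file into numbered parts."""
    return [
        (number, data[offset:offset + part_size])
        for number, offset in enumerate(range(0, len(data), part_size), start=1)
    ]

def upload_part(part: tuple[int, bytes]) -> tuple[int, bytes]:
    # Stand-in for uploading one part; on failure, only this part is resent.
    return part

data = bytes(range(256)) * 40          # 10,240 bytes of sample content
parts = split_parts(data, 4096)

# Parts are independent, so they can be sent in parallel across threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    received = dict(pool.map(upload_part, parts))

# Step 3: finalizing reassembles the parts in order on the server side.
reassembled = b"".join(received[n] for n in sorted(received))
```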

We wouldn’t need an explicit step (1), because our session would implicitly initiate a multipart upload for each file.

There are downsides to this though:

  • Clients have to do more work on every request to have something resembling resumable uploads. They would have to break the file up into multiple parts rather than just making a single POST request and dealing with the complexity only if something fails.
  • Clients that don’t care about resumption at all still have to deal with the third explicit step, though they could just upload the file all as a single part. (S3 works around this by having another API for one shot uploads, but the PEP authors place a high value on having a single API for uploading any individual file.)
  • Verifying hashes gets somewhat more complicated. AWS implements hashing multipart uploads by hashing each part, then the overall hash is just a hash of those hashes, not of the content itself. Since PyPI needs to know the actual hash of the file itself anyway, we would have to reconstitute the file, read its content, and hash it once it’s been fully uploaded, though it could still use the hash of hashes trick for checksumming the upload itself.
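The hash-of-hashes scheme is easy to illustrate. This sketch mirrors the shape of AWS’s multipart checksums (part count appended after a dash) and shows that the result differs from the hash of the file content itself:

```python
import hashlib

def hash_of_hashes(parts: list[bytes]) -> str:
    """Checksum each part, then checksum the concatenated binary digests;
    the part count is appended after a dash, AWS-style."""
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"

parts = [b"a" * 1024, b"b" * 1024]
multipart_checksum = hash_of_hashes(parts)
content_hash = hashlib.md5(b"".join(parts)).hexdigest()

# The multipart checksum validates the upload itself, but it is not the hash
# of the actual file content, which the index still has to compute separately.
```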

The PEP authors lean towards tus-style resumable uploads, due to their being simpler to use, easier to implement, and more consistent, with the main downside being that multi-threaded upload performance is theoretically left on the table.

One other possible benefit of the S3 style multipart uploads is that the server doesn’t have to implement any sort of protection against parallel uploads, since they are simply supported. That alone might erase most of the server-side implementation simplification of the tus-style approach.

Source: https://github.com/python/peps/blob/main/peps/pep-0694.rst

Last modified: 2025-01-06 23:53:01 GMT