PEP: 694 Title: Upload 2.0 API for Python Package Repositories Author:
Donald Stufft <donald@stufft.io> Discussions-To:
https://discuss.python.org/t/pep-694-upload-2-0-api-for-python-package-repositories/16879
Status: Draft Type: Standards Track Topic: Packaging Content-Type:
text/x-rst Created: 11-Jun-2022 Post-History: 27-Jun-2022

Abstract

There is currently no standardized API for uploading files to a Python
package repository such as PyPI. Instead, everyone has been forced to
reverse engineer the non-standard API from PyPI.

That API, while functional, leaks a lot of implementation details of the
original PyPI code base, which have now had to have been faithfully
replicated in the new code base, and alternative implementations.

Beyond the above, there are a number of major issues with the current
API:

-   It is a fully synchronous API, which means that we're forced to have
    a single request being held open for potentially a long time, both
    for the upload itself, and then while the repository processes the
    uploaded file to determine success or failure.
-   It does not support any mechanism for resuming an upload, with the
    largest file size on PyPI being just under 1GB in size, that's a lot
    of wasted bandwidth if a large file has a network blip towards the
    end of an upload.
-   It treats a single file as the atomic unit of operation, which can
    be problematic when a release might have multiple binary wheels
    which can cause people to get different versions while the files are
    uploading, and if the sdist happens to not go last, possibly some
    hard to build packages are attempting to be built from source.
-   It has very limited support for communicating back to the user, with
    no support for multiple errors, warnings, deprecations, etc. It is
    limited entirely to the HTTP status code and reason phrase, of which
    the reason phrase has been deprecated since HTTP/2
    (RFC 7540 <7540#section-8.1.2.4>).
-   The metadata for a release/file is submitted alongside the file,
    however this metadata is famously unreliable, and most installers
    instead choose to download the entire file and read that in part due
    to that unreliability.
-   There is no mechanism for allowing a repository to do any sort of
    sanity checks before bandwidth starts getting expended on an upload,
    whereas a lot of the cases of invalid metadata or incorrect
    permissions could be checked prior to upload.
-   It has no support for "staging" a draft release prior to publishing
    it to the repository.
-   It has no support for creating new projects, without uploading a
    file.

This PEP proposes a new API for uploads, and deprecates the existing non
standard API.

Status Quo

This does not attempt to be a fully exhaustive documentation of the
current API, but give a high level overview of the existing API.

Endpoint

The existing upload API (and the now removed register API) lives at an
url, currently https://upload.pypi.org/legacy/, and to communicate which
specific API you want to call, you add a :action url parameter with a
value of file_upload. The values of submit, submit_pkg_info, and
doc_upload also used to be supported, but no longer are.

It also has a protocol_version parameter, in theory to allow new
versions of the API to be written, but in practice that has never
happened, and the value is always 1.

So in practice, on PyPI, the endpoint is
https://upload.pypi.org/legacy/?:action=file_upload&protocol_version=1.

Encoding

The data to be submitted is submitted as a POST request with the content
type of multipart/form-data. This is due to the historical nature, that
this API was not actually designed as an API, but rather was a form on
the initial PyPI implementation, then client code was written to
programmatically submit that form.

Content

Roughly speaking, the metadata contained within the package is submitted
as parts where the content-disposition is form-data, and the name is the
name of the field. The names of these various pieces of metadata are not
documented, and they sometimes, but not always match the names used in
the METADATA files. The casing rarely matches though, but overall the
METADATA to form-data conversion is extremely inconsistent.

The file itself is then sent as a application/octet-stream part with the
name of content, and if there is a PGP signature attached, then it will
be included as a application/octet-stream part with the name of
gpg_signature.

Specification

This PEP traces the root cause of most of the issues with the existing
API to be roughly two things:

-   The metadata is submitted alongside the file, rather than being
    parsed from the file itself.
    -   This is actually fine if used as a pre-check, but it should be
        validated against the actual METADATA or similar files within
        the distribution.
-   It supports a single request, using nothing but form data, that
    either succeeds or fails, and everything is done and contained
    within that single request.

We then propose a multi-request workflow, that essentially boils down
to:

1.  Initiate an upload session.
2.  Upload the file(s) as part of the upload session.
3.  Complete the upload session.
4.  (Optional) Check the status of an upload session.

All URLs described here will be relative to the root endpoint, which may
be located anywhere within the url structure of a domain. So it could be
at https://upload.example.com/, or https://example.com/upload/.

Versioning

This PEP uses the same MAJOR.MINOR versioning system as used in PEP 691,
but it is otherwise independently versioned. The existing API is
considered by this spec to be version 1.0, but it otherwise does not
attempt to modify that API in any way.

Endpoints

Create an Upload Session

To create a new upload session, you can send a POST request to /, with a
payload that looks like:

    {
      "meta": {
        "api-version": "2.0"
      },
      "name": "foo",
      "version": "1.0"
    }

This currently has three keys, meta, name, and version.

The meta key is included in all payloads, and it describes information
about the payload itself.

The name key is the name of the project that this session is attempting
to add files to.

The version key is the version of the project that this session is
attepmting to add files to.

If creating the session was successful, then the server must return a
response that looks like:

    {
      "meta": {
        "api-version": "2.0"
      },
      "urls": {
        "upload": "...",
        "draft": "...",
        "publish": "..."
      },
      "valid-for": 604800,
      "status": "pending",
      "files": {},
      "notices": [
        "a notice to display to the user"
      ]
    }

Besides the meta key, this response has five keys, urls, valid-for,
status, files, and notices.

The urls key is a dictionary mapping identifiers to related URLs to this
session.

The valid-for key is an integer representing how long, in seconds, until
the server itself will expire this session (and thus all of the URLs
contained in it). The session SHOULD live at least this much longer
unless the client itself has canceled the session. Servers MAY choose to
increase this time, but should never decrease it, except naturally
through the passage of time.

The status key is a string that contains one of pending, published,
errored, or canceled, this string represents the overall status of the
session.

The files key is a mapping containing the filenames that have been
uploaded to this session, to a mapping containing details about each
file.

The notices key is an optional key that points to an array of notices
that the server wishes to communicate to the end user that are not
specific to any one file.

For each filename in files the mapping has three keys, status, url, and
notices.

The status key is the same as the top level status key, except that it
indicates the status of a specific file.

The url key is the absolute URL that the client should upload that
specific file to (or use to delete that file).

The notices key is an optional key, that is an array of notices that the
server wishes to communicate to the end user that are specific to this
file.

The required response code to a successful creation of the session is a
201 Created response and it MUST include a Location header that is the
URL for this session, which may be used to check its status or cancel
it.

For the urls key, there are currently three keys that may appear:

The upload key, which is the upload endpoint for this session to
initiate a file upload.

The draft key, which is the repository URL that these files are
available at prior to publishing.

The publish key, which is the endpoint to trigger publishing the
session.

In addition to the above, if a second session is created for the same
name+version pair, then the upload server MUST return the already
existing session rather than creating a new, empty one.

Upload Each File

Once you have initiated an upload session for one or more files, then
you have to actually upload each of those files.

There is no set endpoint for actually uploading the file, that is given
to the client by the server as part of the creation of the upload
session, and clients MUST NOT assume that there is any commonality to
what those URLs look like from one session to the next.

To initiate a file upload, a client sends a POST request to the upload
URL in the session, with a request body that looks like:

    {
      "meta": {
        "api-version": "2.0"
      },
      "filename": "foo-1.0.tar.gz",
      "size": 1000,
      "hashes": {"sha256": "...", "blake2b": "..."},
      "metadata": "..."
    }

Besides the standard meta key, this currently has 4 keys:

-   filename: The filename of the file being uploaded.

-   size: The size, in bytes, of the file that is being uploaded.

-   hashes: A mapping of hash names to hex encoded digests, each of
    these digests are the digests of that file, when hashed by the hash
    identified in the name.

    By default, any hash algorithm available via hashlib (specifically
    any that can be passed to hashlib.new() and do not require
    additional parameters) can be used as a key for the hashes
    dictionary. At least one secure algorithm from
    hashlib.algorithms_guaranteed MUST always be included. At the time
    of this PEP, sha256 specifically is recommended.

    Multiple hashes may be passed at a time, but all hashes must be
    valid for the file.

-   metadata: An optional key that is a string containing the file's
    core metadata.

Servers MAY use the data provided in this response to do some sanity
checking prior to allowing the file to be uploaded, which may include
but is not limited to:

-   Checking if the filename already exists.
-   Checking if the size would invalidate some quota.
-   Checking if the contents of the metadata, if provided, are valid.

If the server determines that the client should attempt the upload, it
will return a 201 Created response, with an empty body, and a Location
header pointing to the URL that the file itself should be uploaded to.

At this point, the status of the session should show the filename, with
the above url included in it.

Upload Data

To upload the file, a client has two choices, they may upload the file
as either a single chunk, or as multiple chunks. Either option is
acceptable, but it is recommended that most clients should choose to
upload each file as a single chunk as that requires fewer requests and
typically has better performance.

However for particularly large files, uploading within a single request
may result in timeouts, so larger files may need to be uploaded in
multiple chunks.

In either case, the client must generate a unique token (or nonce) for
each upload attempt for a file, and MUST include that token in each
request in the Upload-Token header. The Upload-Token is a binary blob
encoded using base64 surrounded by a : on either side. Clients SHOULD
use at least 32 bytes of cryptographically random data. You can generate
it using the following:

    import base64
    import secrets

    header = ":" + base64.b64encode(secrets.token_bytes(32)).decode() + ":"

The one time that it is permissible to omit the Upload-Token from an
upload request is when a client wishes to opt out of the resumable or
chunked file upload feature completely. In that case, they MAY omit the
Upload-Token, and the file must be successfully uploaded in a single
HTTP request, and if it fails, the entire file must be resent in another
single HTTP request.

To upload in a single chunk, a client sends a POST request to the URL
from the session response for that filename. The client MUST include a
Content-Length header that is equal to the size of the file in bytes,
and this MUST match the size given in the original session creation.

As an example, if uploading a 100,000 byte file, you would send headers
like:

    Content-Length: 100000
    Upload-Token: :nYuc7Lg2/Lv9S4EYoT9WE6nwFZgN/TcUXyk9wtwoABg=:

If the upload completes successfully, the server MUST respond with a
201 Created status. At this point this file MUST not be present in the
repository, but merely staged until the upload session has completed.

To upload in multiple chunks, a client sends multiple POST requests to
the same URL as before, one for each chunk.

This time however, the Content-Length is equal to the size, in bytes, of
the chunk that they are sending. In addition, the client MUST include a
Upload-Offset header which indicates a byte offset that the content
included in this request starts at and a Upload-Incomplete header set to
1.

As an example, if uploading a 100,000 byte file in 1000 byte chunks, and
this chunk represents bytes 1001 through 2000, you would send headers
like:

    Content-Length: 1000
    Upload-Token: :nYuc7Lg2/Lv9S4EYoT9WE6nwFZgN/TcUXyk9wtwoABg=:
    Upload-Offset: 1001
    Upload-Incomplete: 1

However, the final chunk of data omits the Upload-Incomplete header,
since at that point the upload is no longer incomplete.

For each successful chunk, the server MUST respond with a 202 Accepted
header, except for the final chunk, which MUST be a 201 Created.

The following constraints are placed on uploads regardless of whether
they are single chunk or multiple chunks:

-   A client MUST NOT perform multiple POST requests in parallel for the
    same file to avoid race conditions and data loss or corruption. The
    server MAY terminate any ongoing POST request that utilizes the same
    Upload-Token.
-   If the offset provided in Upload-Offset is not 0 or the next chunk
    in an incomplete upload, then the server MUST respond with a 409
    Conflict.
-   Once an upload has started with a specific token, you may not use
    another token for that file without deleting the in progress upload.
-   Once a file has uploaded successfully, you may initiate another
    upload for that file, and doing so will replace that file.

Resume Upload

To resume an upload, you first have to know how much of the data the
server has already received, regardless of if you were originally
uploading the file as a single chunk, or in multiple chunks.

To get the status of an individual upload, a client can make a HEAD
request with their existing Upload-Token to the same URL they were
uploading to.

The server MUST respond back with a 204 No Content response, with an
Upload-Offset header that indicates what offset the client should
continue uploading from. If the server has not received any data, then
this would be 0, if it has received 1007 bytes then it would be 1007.

Once the client has retrieved the offset that they need to start from,
they can upload the rest of the file as described above, either in a
single request containing all of the remaining data or in multiple
chunks.

Canceling an In Progress Upload

If a client wishes to cancel an upload of a specific file, for instance
because they need to upload a different file, they may do so by issuing
a DELETE request to the file upload URL with the Upload-Token used to
upload the file in the first place.

A successful cancellation request MUST response with a 204 No Content.

Delete an uploaded File

Already uploaded files may be deleted by issuing a DELETE request to the
file upload URL without the Upload-Token.

A successful deletion request MUST response with a 204 No Content.

Session Status

Similarly to file upload, the session URL is provided in the response to
creating the upload session, and clients MUST NOT assume that there is
any commonality to what those URLs look like from one session to the
next.

To check the status of a session, clients issue a GET request to the
session URL, to which the server will respond with the same response
that they got when they initially created the upload session, except
with any changes to status, valid-for, or updated files reflected.

Session Cancellation

To cancel an upload session, a client issues a DELETE request to the
same session URL as before. At which point the server marks the session
as canceled, MAY purge any data that was uploaded as part of that
session, and future attempts to access that session URL or any of the
file upload URLs MAY return a 404 Not Found.

To prevent a lot of dangling sessions, servers may also choose to cancel
a session on their own accord. It is recommended that servers expunge
their sessions after no less than a week, but each server may choose
their own schedule.

Session Completion

To complete a session, and publish the files that have been included in
it, a client MUST send a POST request to the publish url in the session
status payload.

If the server is able to immediately complete the session, it may do so
and return a 201 Created response. If it is unable to immediately
complete the session (for instance, if it needs to do processing that
may take longer than reasonable in a single HTTP request), then it may
return a 202 Accepted response.

In either case, the server should include a Location header pointing
back to the session status url, and if the server returned a
202 Accepted, the client may poll that URL to watch for the status to
change.

Errors

All Error responses that contain a body will have a body that looks
like:

    {
      "meta": {
        "api-version": "2.0"
      },
      "message": "...",
      "errors": [
        {
          "source": "...",
          "message": "..."
        }
      ]
    }

Besides the standard meta key, this has two top level keys, message and
errors.

The message key is a singular message that encapsulates all errors that
may have happened on this request.

The errors key is an array of specific errors, each of which contains a
source key, which is a string that indicates what the source of the
error is, and a message key for that specific error.

The message and source strings do not have any specific meaning, and are
intended for human interpretation to figure out what the underlying
issue was.

Content-Types

Like PEP 691, this PEP proposes that all requests and responses from the
Upload API will have a standard content type that describes what the
content is, what version of the API it represents, and what
serialization format has been used.

The structure of this content type will be:

    application/vnd.pypi.upload.$version+format

Since only major versions should be disruptive to systems attempting to
understand one of these API content bodies, only the major version will
be included in the content type, and will be prefixed with a v to
clarify that it is a version number.

Unlike PEP 691, this PEP does not change the existing 1.0 API in any
way, so servers will be required to host the new API described in this
PEP at a different endpoint than the existing upload API.

Which means that for the new 2.0 API, the content types would be:

-   JSON: application/vnd.pypi.upload.v2+json

In addition to the above, a special "meta" version is supported named
latest, whose purpose is to allow clients to request the absolute latest
version, without having to know ahead of time what that version is. It
is recommended however, that clients be explicit about what versions
they support.

These content types DO NOT apply to the file uploads themselves, only to
the other API requests/responses in the upload API. The files themselves
should use the application/octet-stream content-type.

Version + Format Selection

Again similar to PEP 691, this PEP standardizes on using server-driven
content negotiation to allow clients to request different versions or
serialization formats, which includes the format url parameter.

Since this PEP expects the existing legacy 1.0 upload API to exist at a
different endpoint, and it currently only provides for JSON
serialization, this mechanism is not particularly useful, and clients
only have a single version and serialization they can request. However
clients SHOULD be setup to handle content negotiation gracefully in the
case that additional formats or versions are added in the future.

FAQ

Does this mean PyPI is planning to drop support for the existing upload API?

At this time PyPI does not have any specific plans to drop support for
the existing upload API.

Unlike with PEP 691 there are wide benefits to doing so, so it is likely
that we will want to drop support for it at some point in the future,
but until this API is implemented, and receiving broad use it would be
premature to make any plans for actually dropping support for it.

Is this Resumable Upload protocol based on anything?

Yes!

It's actually the protocol specified in an Active Internet-Draft, where
the authors took what they learned implementing tus to provide the idea
of resumable uploads in a wholly generic, standards based way.

The only deviation we've made from that spec is that we don't use the
104 Upload Resumption Supported informational response in the first POST
request. This decision was made for a few reasons:

-   The 104 Upload Resumption Supported is the only part of that draft
    which does not rely entirely on things that are already supported in
    the existing standards, since it was adding a new informational
    status.
-   Many clients and web frameworks don't support 1xx informational
    responses in a very good way, if at all, adding it would complicate
    implementation for very little benefit.
-   The purpose of the 104 Upload Resumption Supported support is to
    allow clients to determine that an arbitrary endpoint that they're
    interacting with supports resumable uploads. Since this PEP is
    mandating support for that in servers, clients can just assume that
    the server they are interacting with supports it, which makes using
    it unneeded.
-   In theory, if the support for 1xx responses got resolved and the
    draft gets accepted with it in, we can add that in at a later date
    without changing the overall flow of the API.

There is a risk that the above draft doesn't get accepted, but even if
it does not, that doesn't actually affect us. It would just mean that
our support for resumable uploads is an application specific protocol,
but is still wholly standards compliant.

Open Questions

Multipart Uploads vs tus

This PEP currently bases the actual uploading of files on an internet
draft from tus.io that supports resumable file uploads.

That protocol requires a few things:

-   That the client selects a secure Upload-Token that they use to
    identify uploading a single file.
-   That if clients don't upload the entire file in one shot, that they
    have to submit the chunks serially, and in the correct order, with
    all but the final chunk having a Upload-Incomplete: 1 header.
-   Resumption of an upload is essentially just querying the server to
    see how much data they've gotten, then sending the remaining bytes
    (either as a single request, or in chunks).
-   The upload implicitly is completed when the server successfully gets
    all of the data from the client.

This has one big benefit, that if a client doesn't care about resuming
their download, the work to support, from a client side, resumable
uploads is able to be completely ignored. They can just POST the file to
the URL, and if it doesn't succeed, they can just POST the whole file
again.

The other benefit is that even if you do want to support resumption, you
can still just POST the file, and unless you need to resume the
download, that's all you have to do.

Another, possibly theoretical, benefit is that for hashing the uploaded
files, the serial chunks requirement means that the server can maintain
hashing state between requests, update it for each request, then write
that file back to storage. Unfortunately this isn't actually possible to
do with Python's hashlib, though there are some libraries like Rehash
that implement it, but they don't support every hash that hashlib does
(specifically not blake2 or sha3 at the time of writing).

We might also need to reconstitute the download for processing anyways
to do things like extract metadata, etc from it, which would make it a
moot point.

The downside is that there is no ability to parallelize the upload of a
single file because each chunk has to be submitted serially.

AWS S3 has a similar API (and most blob stores have copied it either
wholesale or something like it) which they call multipart uploading.

The basic flow for a multipart upload is:

1.  Initiate a Multipart Upload to get an Upload ID.
2.  Break your file up into chunks, and upload each one of them
    individually.
3.  Once all chunks have been uploaded, finalize the upload.
    -   This is the step where any errors would occur.

It does not directly support resuming an upload, but it allows clients
to control the "blast radius" of failure by adjusting the size of each
part they upload, and if any of the parts fail, they only have to resend
those specific parts.

This has a big benefit in that it allows parallelization in uploading
files, allowing clients to maximize their bandwidth using multiple
threads to send the data.

We wouldn't need an explicit step (1), because our session would
implicitly initiate a multipart upload for each file.

It does have its own downsides:

-   Clients have to do more work on every request to have something
    resembling resumable uploads. They would have to break the file up
    into multiple parts rather than just making a single POST request,
    and only needing to deal with the complexity if something fails.
-   Clients that don't care about resumption at all still have to deal
    with the third explicit step, though they could just upload the file
    all as a single part.
    -   S3 works around this by having another API for one shot uploads,
        but I'd rather not have two different APIs for uploading the
        same file.
-   Verifying hashes gets somewhat more complicated. AWS implements
    hashing multipart uploads by hashing each part, then the overall
    hash is just a hash of those hashes, not of the content itself. We
    need to know the actual hash of the file itself for PyPI, so we
    would have to reconstitute the file and read its content and hash it
    once it's been fully uploaded, though we could still use the hash of
    hashes trick for checksumming the upload itself.
    -   See above about whether this is actually a downside in practice,
        or if it's just in theory.

I lean towards the tus style resumable uploads as I think they're
simpler to use and to implement, and the main downside is that we
possibly leave some multi-threaded performance on the table, which I
think that I'm personally fine with?

I guess one additional benefit of the S3 style multi part uploads is
that you don't have to try and do any sort of protection against
parallel uploads, since they're just supported. That alone might erase
most of the server side implementation simplification.

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.