PEP: 721
Title: Using tarfile.data_filter for source distribution extraction
Author: Petr Viktorin <encukou@gmail.com>
PEP-Delegate: Paul Moore <p.f.moore@gmail.com>
Status: Final
Type: Standards Track
Topic: Packaging
Content-Type: text/x-rst
Requires: 706
Created: 12-Jul-2023
Python-Version: 3.12
Post-History: 04-Jul-2023
Resolution: 02-Aug-2023

packaging:sdist-archive-features

Abstract

Extracting a source distribution archive should normally use the data
filter added in PEP 706. We clarify details, and specify the behaviour
for tools that cannot use the filter directly.

Motivation

The source distribution (sdist) is defined as a tar archive.

The tar format is designed to capture all metadata of Unix-like files.
Some of this metadata is dangerous, unnecessary for source code, and/or
platform-dependent. As explained in PEP 706, when extracting a tarball,
one should always either limit the allowed features, or explicitly give
the tarball total control.

Rationale

For source distributions, the data filter introduced in PEP 706 is
enough. It allows slightly more features than git and zip (both commonly
used in packaging workflows).

However, not all tools can use the data filter, so this PEP specifies an
explicit set of expectations. The aim is that the current behaviour of
pip download and setuptools.archive_util.unpack_tarfile is valid, except
in cases deemed too dangerous to allow. Another consideration is ease of
implementation for non-Python tools.

Unpatched versions of Python

Tools are allowed to ignore this PEP when running on Python without
tarfile filters.

The feature has been backported to all versions of Python supported by
python.org. Vendoring it in third-party libraries is tricky, and we
should not force all tools to do so. This shifts the responsibility to
keep up with security updates from the tools to the users.

Permissions

Common tools (git, zip) don't preserve Unix permissions (mode bits).
Telling users to not rely on them in sdists, and allowing tools to
handle them relatively freely, seems fair.

The only exception is the executable permission. We recommend, but do not
require, that tools preserve it. Given that scripts are generally
platform-specific, it seems fitting to say that keeping them executable
is tool-specific behaviour.

Note that while git preserves executability, zip (and thus wheel)
doesn't do it natively. (It is possible to encode it in “external
attributes”, but Python's ZipFile.extract does not honour that.)
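
As an illustration of the “external attributes” mentioned above, the
following sketch stores a Unix mode in a zip entry and reads it back. The
file name and contents are made up for the example; the convention of
keeping the mode in the upper 16 bits of ZipInfo.external_attr is the one
commonly used by Unix zip tools, and is an assumption here rather than
something this PEP specifies.

```python
import io
import stat
import zipfile

# Write an in-memory zip whose single entry carries a Unix mode.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    info = zipfile.ZipInfo("bin/run.sh")  # hypothetical file name
    # Upper 16 bits of external_attr: file type plus permission bits.
    info.external_attr = (stat.S_IFREG | 0o755) << 16
    archive.writestr(info, "#!/bin/sh\necho hello\n")

# Read the mode back; a tool would have to apply it to the file itself.
with zipfile.ZipFile(buffer) as archive:
    stored_mode = archive.getinfo("bin/run.sh").external_attr >> 16
    is_executable = bool(stored_mode & stat.S_IXUSR)
```

A tool that wants executable scripts after unpacking a zip would need to
apply the stored mode explicitly, for example with os.chmod.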

Specification

The following will be added to the PyPA source distribution format spec
under a new heading, “Source distribution archive features”:

Because extracting tar files as-is is dangerous, and the results are
platform-specific, archive features of source distributions are limited.

Unpacking with the data filter

When extracting a source distribution, tools MUST either use
tarfile.data_filter (e.g. TarFile.extractall(..., filter='data')), OR
follow the Unpacking without the data filter section below.

As an exception, on Python interpreters without
hasattr(tarfile, 'data_filter') (PEP 706), tools that normally use that
filter (directly or indirectly) MAY warn the user and ignore this
specification. The trade-off between usability (e.g. fully trusting the
archive) and security (e.g. refusing to unpack) is left up to the tool
in this case.
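
A minimal sketch of this rule is shown below. The function name
extract_sdist and the choice to warn and then extract anyway are
assumptions made for illustration; a tool may equally refuse to unpack
when the filter is unavailable.

```python
import tarfile
import warnings


def extract_sdist(archive_path, dest_dir):
    """Extract an sdist, preferring the 'data' filter from PEP 706."""
    with tarfile.open(archive_path) as tar:
        if hasattr(tarfile, "data_filter"):
            # Filters are available: let the 'data' filter reject or
            # sanitise dangerous members.
            tar.extractall(dest_dir, filter="data")
        else:
            # No filter support on this interpreter. This sketch picks
            # usability: warn, then extract while trusting the archive.
            warnings.warn(
                "tarfile filters are unavailable; extracting without "
                "the data filter",
                RuntimeWarning,
            )
            tar.extractall(dest_dir)
```

A security-focused tool could raise an exception in the else branch
instead of extracting.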

Unpacking without the data filter

Tools that do not use the data filter directly (e.g. for backwards
compatibility, allowing additional features, or not using Python) MUST
follow this section. (At the time of this writing, the data filter also
follows this section, but it may get out of sync in the future.)

The following files are invalid in an sdist archive. Upon encountering
such an entry, tools SHOULD notify the user, MUST NOT unpack the entry,
and MAY abort with a failure:

-   Files that would be placed outside the destination directory.
-   Links (symbolic or hard) pointing outside the destination directory.
-   Device files (including pipes).

The following are also invalid. Tools MAY treat them as above, but are
NOT REQUIRED to do so:

-   Files with a .. component in the filename or link target.
-   Links pointing to a file that is not part of the archive.

Tools MAY unpack links (symbolic or hard) as regular files, using
content from the archive.
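
The mandatory checks above could be sketched roughly as follows. The
helper names is_within and rejection_reason are hypothetical, and the
sketch covers only the three entry kinds that MUST NOT be unpacked. It
does not check whether a link target is part of the archive, and it is
not a substitute for tarfile.data_filter.

```python
import os
import tarfile


def is_within(directory, target):
    """Return True if `target` resolves to a path inside `directory`."""
    directory = os.path.realpath(directory)
    target = os.path.realpath(target)
    return os.path.commonpath([directory, target]) == directory


def rejection_reason(member, dest_dir):
    """Return why `member` must not be unpacked, or None if it passes."""
    name = member.name.lstrip("/")  # leading slashes are dropped
    if not is_within(dest_dir, os.path.join(dest_dir, name)):
        return "file would be placed outside the destination directory"
    if member.issym() or member.islnk():
        if member.issym():
            # Symlink targets are relative to the link's own directory.
            base = os.path.dirname(os.path.join(dest_dir, name))
        else:
            # Hard link targets are relative to the archive root.
            base = dest_dir
        # An absolute link target replaces `base` in os.path.join and is
        # therefore rejected by the containment check as well.
        if not is_within(dest_dir, os.path.join(base, member.linkname)):
            return "link points outside the destination directory"
    elif member.isdev():
        # isdev() covers character devices, block devices and FIFOs.
        return "device file"
    return None
```

A tool would call rejection_reason for each TarInfo in the archive, skip
entries for which it returns a reason, and report that reason to the user.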

When extracting sdist archives:

-   Leading slashes in file names MUST be dropped. (This is nowadays
    standard behaviour for tar unpacking.)
-   For each mode (Unix permission) bit, tools MUST either:
    -   use the platform's default for a new file/directory
        (respectively),
    -   set the bit according to the archive, or
    -   use the bit from rw-r--r-- (0o644) for non-executable files or
        rwxr-xr-x (0o755) for executable files and directories.
-   High mode bits (setuid, setgid, sticky) MUST be cleared.
-   It is RECOMMENDED to preserve the user executable bit.
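
One possible reading of these rules, as a sketch: the function names
below are hypothetical, and treating a file as executable when its user
executable bit is set in the archive is an assumption, since the text
does not define “executable files” further.

```python
import stat

HIGH_BITS = stat.S_ISUID | stat.S_ISGID | stat.S_ISVTX


def strip_leading_slashes(name):
    """Drop leading slashes so the entry is extracted relative to the target."""
    return name.lstrip("/")


def clear_high_bits(archive_mode):
    """Keep the archive's permission bits but clear setuid, setgid and sticky."""
    return stat.S_IMODE(archive_mode) & ~HIGH_BITS


def fixed_mode(archive_mode, is_dir=False):
    """Use 0o755 for directories and executable files, 0o644 otherwise."""
    executable = is_dir or bool(archive_mode & stat.S_IXUSR)
    return 0o755 if executable else 0o644
```

Both clear_high_bits and fixed_mode preserve the user executable bit, as
recommended; they differ in whether the remaining bits follow the archive
or a fixed default.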

Further hints

Tool authors are encouraged to consider how the hints for further
verification in the tarfile documentation apply to their tool.

Backwards Compatibility

The existing behaviour is unspecified, and treated differently by
different tools. This PEP makes the expectations explicit.

There is no known case of backwards incompatibility, but some project
out there probably does rely on details that aren't guaranteed. This PEP
bans the most dangerous of those features, and the rest is made
tool-specific.

Security Implications

The recommended data filter is believed safe against common exploits,
and is a single place to amend if flaws are found in the future.

The explicit specification includes protections from the data filter.

How to Teach This

The PEP is aimed at authors of packaging tools, who should be fine with
a PEP and an updated packaging spec.

Reference Implementation

TBD

Rejected Ideas

None yet.

Open Issues

None yet.

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.