PEP: 706
Title: Filter for tarfile.extractall
Author: Petr Viktorin <encukou@gmail.com>
Discussions-To: https://discuss.python.org/t/23903
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 09-Feb-2023
Python-Version: 3.12
Post-History: 25-Jan-2023, 15-Feb-2023
Resolution: https://discuss.python.org/t/23903/10

tarfile documentation <python:tarfile-extraction-filter>

Abstract

The extraction methods in tarfile gain a filter argument, which allows
rejecting files or modifying metadata as the archive is extracted. Three
built-in named filters are provided, aimed at limiting features that
might be surprising or dangerous. These can be used as-is, or serve as a
base for custom filters.

After a deprecation period, a strict (but safer) filter will become the
default.

Motivation

The tar format is used for several use cases, many of which have
different needs. For example:

-   A backup of a UNIX workstation should faithfully preserve all kinds
    of details like file permissions, symlinks to system configuration,
    and various kinds of special files.
-   When unpacking a data bundle, it’s much more important that the
    unpacking will not have unintended consequences – like exposing a
    password file by symlinking it to a public place.

To support all its use cases, the tar format has many features. In many
cases, it's best to ignore or disallow some of them when extracting an
archive.

Python allows extracting tar archives using tarfile.TarFile.extractall,
whose docs warn to never extract archives from untrusted sources without
prior inspection. However, it’s not clear what kind of inspection should
be done. Indeed, it’s quite tricky to do such an inspection correctly.
As a result, many people don’t bother, or do the check incorrectly,
resulting in security issues such as CVE-2007-4559.

Since tarfile was first written, it's become more accepted that warnings
in documentation are not enough. Whenever possible, an unsafe operation
should be explicitly requested; potentially dangerous operations should
look dangerous. However, TarFile.extractall looks benign in a code
review.

Tarfile extraction is also exposed via shutil.unpack_archive, which
allows the user to not care about the kind of archive they're dealing
with. The API is very inviting for extracting archives without prior
inspection, even though the docs again warn against it.

It has been argued that Python is not wrong -- it behaves exactly as
documented -- but that's beside the point. Let's improve the situation
rather than assign/avoid blame. Python and its docs are the best place
to improve things.

Rationale

How do we improve things? Unfortunately, we will need to change the
defaults, which implies breaking backwards compatibility.
TarFile.extractall is what people reach for when they need to extract a
tarball. Its default behaviour needs to change.

What would be the best behaviour? That depends on the use case. So,
we'll add several general “policies” to control extraction. They are
based on use cases, and ideally they should have straightforward
security implications:

-   Current behavior: trusting the archive. Suitable e.g. as a building
    block for libraries that do the check themselves, or extracting an
    archive you just made yourself.
-   Unpacking a UNIX archive: roughly following GNU tar, e.g. stripping
    leading / from filenames.
-   Unpacking a general data archive: the shutil.unpack_archive use
    case, where it's not important to preserve details specific to tar
    or Unix-like filesystems.

After a deprecation period, the last option -- the most limited but most
secure one -- will become the default.

Even with better general defaults, users should still verify the
archives they extract, and perhaps modify some of the metadata.
Superficially, the following looks like a reasonable way to do this
today:

-   Call TarFile.getmembers
-   Verify or modify each member's TarInfo
-   Pass the result to extractall's members argument

However, there are some issues with this approach:

-   It's possible to modify TarInfo objects, but the changes to them
    affect all subsequent operations on the same TarFile object. This
    behavior is fine for most uses, but despite that, it would be very
    surprising if TarFile.extractall did this by default.
-   Calling getmembers can be expensive and it requires a seekable
    archive.
-   When verifying members in advance, it may be necessary to track how
    each member would have changed the filesystem, e.g. how symlinks are
    being set up. This is hard. We can't expect users to do it.

To solve these issues we'll:

-   Provide a supported way to “clone” and modify TarInfo objects. A
    replace method, similar to dataclasses.replace or
    namedtuple._replace, should do the trick.
-   Provide a “filter” hook in extractall's loop that can modify or
    discard members before they are processed.
-   Require that this hook is called just before extracting each member,
    so it can scan the current state of the disk. This will greatly
    simplify the implementation of policies (both in stdlib and user
    code), at the cost of not being able to do a precise “dry run”.

The hook API will be very similar to the existing filter argument for
TarFile.add. We'll also name it filter. (In some
cases “policy” would be a more fitting name, but the API can be used for
more than security policies.)

The built-in policies/filters described above will be implemented using
the public filter API, so they can be used as building blocks or
examples.

Setting a precedent

If and when other libraries for archive extraction, such as zipfile,
gain similar functionality, they should mimic this API as much as is
reasonable.

To enable this for simple cases, the built-in filters will have string
names; e.g. users can pass filter='data' instead of a specific function
that deals with TarInfo objects.

The shutil.unpack_archive function will get a filter argument, which it
will pass to extractall.

Adding a function-based API that would work across archive formats is
out of scope of this PEP.

Full disclosure & redistributor info

The PEP author works for Red Hat, a redistributor of Python with
different security needs and support periods than CPython in general.
Such redistributors may want to carry vendor patches to:

-   Allow configuring the defaults system-wide, and
-   Change the default as soon as possible, even in older Python
    versions.

The proposal makes this easy to do, and it allows users to query the
settings.

Specification

Modifying and forgetting member metadata

The TarInfo class will gain a new method, replace(), which will
work similarly to dataclasses.replace. It will return a copy of the
TarInfo object with attributes replaced as specified by keyword-only
arguments:

-   name
-   mtime
-   mode
-   linkname
-   uid
-   gid
-   uname
-   gname

Any of these, except name and linkname, will be allowed to be set to
None. When extract or extractall encounters such a None, it will not set
that piece of metadata. (If uname or gname is None, it will fall back to
uid or gid as if the name wasn't found.) When addfile or tobuf
encounters such a None, it will raise a ValueError. When list encounters
such a None, it will print a placeholder string.

The documentation will mention why the method is there: TarInfo objects
retrieved from TarFile.getmembers are “live”; modifying them directly
will affect subsequent unrelated operations.

Filters

The TarFile.extract and TarFile.extractall methods will grow a filter
keyword-only parameter, which takes a callable that can be called as:

    filter(member: TarInfo, path: str, /) -> TarInfo | None

where member is the member to be extracted, and path is the path to
where the archive is extracted (i.e., it'll be the same for every
member).

When used it will be called on each member as it is extracted, and
extraction will work with the result. If it returns None, the member
will be skipped.

The function can also raise an exception. This can, depending on
TarFile.errorlevel, abort the extraction or cause the member to be
skipped.

Note

If extraction is aborted, the archive may be left partially extracted.
It is the user’s responsibility to clean up.

We will also provide a set of defaults for common use cases. In addition
to a function, the filter argument can be one of the following strings:

-   'fully_trusted': Current behavior: honor the metadata as is. Should
    be used if the user trusts the archive completely, or implements
    their own complex verification.
-   'tar': Roughly follow defaults of the GNU tar command (when run as a
    normal user):
    -   Strip leading '/' and os.sep from filenames
    -   Refuse to extract files with absolute paths (after the /
        stripping above, e.g. C:/foo on Windows).
    -   Refuse to extract files whose absolute path (after following
        symlinks) would end up outside the destination. (Note that GNU
        tar instead delays creating some links.)
    -   Clear high mode bits (setuid, setgid, sticky) and group/other
        write bits (S_IWGRP|S_IWOTH). (This is an approximation of GNU
        tar's default, which limits the mode by the current umask
        setting.)
-   'data': Extract a "data" archive, disallowing common attack vectors
    but limiting functionality. In particular, many features specific to
    UNIX-style filesystems (or equivalently, to the tar archive format)
    are ignored, making this a good filter for cross-platform archives.
    In addition to tar:
    -   Refuse to extract links (hard or soft) that link to absolute
        paths.
    -   Refuse to extract links (hard or soft) which end up linking to a
        path outside of the destination. (On systems that don't support
        links, tarfile will, in most cases, fall back to creating
        regular files. This proposal doesn't change that behaviour.)
    -   Refuse to extract device files (including pipes).
    -   For regular files and hard links:
        -   Set the owner read and write permissions
            (S_IRUSR|S_IWUSR).
        -   Remove the group & other executable permission
            (S_IXGRP|S_IXOTH) if the owner doesn't have it (S_IXUSR).
    -   For other files (directories), ignore mode entirely (set it to
        None).
    -   Ignore user and group info (set uid, gid, uname, gname to None).

Any other string will cause a ValueError.

The corresponding filter functions will be available as
tarfile.fully_trusted_filter(), tarfile.tar_filter(), etc., so they can
be easily used in custom policies.

Note that these filters never return None. Skipping members this way is
a feature for user-defined filters.

Defaults and their configuration

TarFile will gain a new attribute, extraction_filter, to allow
configuring the default filter. By default it will be None, but users
can set it to a callable that will be used if the filter argument is
missing or None.

Note

String names won't be accepted here. That would encourage code like
my_tarfile.extraction_filter = 'data'. On Python versions without this
feature, this would do nothing, silently ignoring a security-related
request.

If both the argument and attribute are None:

-   In Python 3.12-3.13, a DeprecationWarning will be emitted and
    extraction will use the 'fully_trusted' filter.
-   In Python 3.14+, it will use the 'data' filter.

Applications and system integrators may wish to change extraction_filter
of the TarFile class itself to set a global default. When using a
function, they will generally want to wrap it in staticmethod() to
prevent injection of a self argument.

Subclasses of TarFile can also override extraction_filter.

FilterError

A new exception, FilterError, will be added to the tarfile module. It'll
have several new subclasses, one for each of the refusal reasons above.
FilterError's member attribute will contain the relevant TarInfo.

In the lists above, “refusing” to extract a file means that a
FilterError will be raised. As with other extraction errors, if the
TarFile.errorlevel is 1 or more, this will abort the extraction; with
errorlevel=0 the error will be logged and the member will be ignored,
but extraction will continue. Note that extractall() may leave the
archive partially extracted; it is the user's responsibility to clean
up.

Errorlevel, and fatal/non-fatal errors

Currently, TarFile has an errorlevel argument/attribute, which
specifies how errors are handled:

-   With errorlevel=0, documentation says that “all errors are ignored
    when using extract and extractall”. The code only ignores non-fatal
    and fatal errors (see below), so, for example, you still get
    TypeError if you pass None as the destination path.

-   With errorlevel=1 (the default), all non-fatal errors are ignored.
    (They may be logged to sys.stderr by setting the debug
    argument/attribute.) Which errors are non-fatal is not defined in
    documentation, but code treats ExtractionError as such.
    Specifically, it's these issues:

    -   “unable to resolve link inside archive” (raised on systems that
        do not support symlinks)
    -   “fifo/special devices not supported by system” (not used for
        failures if the system supports these, e.g. for a
        PermissionError)
    -   “could not change owner/mode/modification time”

    Note that, for example, file name too long or out of disk space
    don't qualify. The non-fatal errors are not very likely to appear on
    a Unix-like system.

-   With errorlevel=2, all errors are raised, including fatal ones.
    Which errors are fatal is, again, not defined; in practice it's
    OSError.

A filter refusing to extract a member does not fit neatly into the
fatal/non-fatal categories.

-   This PEP does not change existing behavior. (Ideas for improvements
    are welcome in Discourse topic 25970.)
-   When a filter refuses to extract a member, the error should not pass
    silently by default.

To satisfy this, FilterError will be considered a fatal error, that is,
it'll be ignored only with errorlevel=0.

Users that want to ignore FilterError but not other fatal errors should
create a custom filter function, and call another filter in a try block.

Hints for further verification

Even with the proposed changes, tarfile will not be suited for
extracting untrusted files without prior inspection. Among other issues,
the proposed policies don't prevent denial-of-service attacks. Users
should do additional checks.

New docs will tell users to consider:

-   extracting to a new empty directory,
-   using external (e.g. OS-level) limits on disk, memory and CPU usage,
-   checking filenames against an allow-list of characters (to filter
    out control characters, confusables, etc.),
-   checking that filenames have expected extensions (discouraging files
    that execute when you “click on them”, or extension-less files like
    Windows special device names),
-   limiting the number of extracted files, total size of extracted
    data, and size of individual files,
-   checking for files that would be shadowed on case-insensitive
    filesystems.

Also, the docs will note that:

-   tar files commonly contain multiple versions of the same file: later
    ones are expected to overwrite earlier ones on extraction,
-   tarfile does not protect against issues with “live” data, e.g. an
    attacker tinkering with the destination directory while extracting
    (or adding) is going on (see the GNU tar manual for more info).

This list is not comprehensive, but the documentation is a good place to
collect such general tips. It can be moved into a separate document if
it grows too long or if it needs to be consolidated with zipfile or shutil
(which is out of scope for this proposal).

TarInfo identity, and offset

With filters that use replace(), the TarInfo objects handled by the
extraction machinery will not necessarily be the same objects as those
present in members. This may affect TarInfo subclasses that override
methods like makelink and rely on object identity.

Such code can switch to comparing offset, the position of the member
header inside the file.

Note that both the overridable methods and offset are only documented in
source comments.

tarfile CLI

The CLI (python -m tarfile) will gain a --filter option that will take
the name of one of the provided default filters. It won't be possible to
specify a custom filter function.

If --filter is not given, the CLI will use the default filter
('fully_trusted' with a deprecation warning now, and 'data' from Python
3.14 on).

There will be no short option. (-f would be confusingly similar to the
filename option of GNU tar.)

Other archive libraries

If and when other archive libraries, such as zipfile, grow similar
functionality, their extraction functions should use a filter argument
that takes, at least, the strings 'fully_trusted' (which should disable
any security precautions) and 'data' (which should avoid features that
might surprise users).

Standardizing a function-based filter API is out of scope of this PEP.

Shutil

shutil.unpack_archive will gain a filter argument. If it's given, it
will be passed to the underlying extraction function. Passing it for a
zip archive will fail for now (until zipfile gains a filter argument, if
it ever does).

If filter is not specified (or left as None), it won't be passed on, so
extracting a tarball will use the default filter ('fully_trusted' with a
deprecation warning now, and 'data' from Python 3.14 on).

Complex filters

Note that some user-defined filters need, for example, to count
extracted members or do post-processing. This requires a more complex
API than a filter callable. However, that complex API need not be
exposed to tarfile. For example, with a hypothetical StatefulFilter
users would write:

    with StatefulFilter() as filter_func:
        my_tar.extractall(path, filter=filter_func)

A simple StatefulFilter example will be added to the docs.

Note

The need for stateful filters is a reason against allowing registration
of custom filter names in addition to 'fully_trusted', 'tar' and 'data'.
With such a mechanism, API for (at least) set-up and tear-down would
need to be set in stone.

Backwards Compatibility

The default behavior of TarFile.extract and TarFile.extractall will
change, after raising DeprecationWarning for 2 releases (the shortest
deprecation period allowed in Python's backwards compatibility policy,
PEP 387).

Additionally, code that relies on tarfile.TarInfo object identity may
break; see the “TarInfo identity, and offset” section.

Backporting & Forward Compatibility

This feature may be backported to older versions of Python.

In CPython, we don't add warnings to patch releases, so the default
filter should be changed to 'fully_trusted' in backports.

Other than that, all of the changes to tarfile should be backported, so
hasattr(tarfile, 'data_filter') becomes a reliable check for all of the
new functionality.

Note that CPython's usual policy is to avoid adding new APIs in security
backports. This feature does not make sense without a new API
(TarFile.extraction_filter and the filter argument), so we'll make an
exception. (See Discourse comment 23149/16 for details.)

Here are examples of code that takes into account that tarfile may or
may not have the proposed feature.

When copying these snippets, note that setting extraction_filter will
affect subsequent operations.

-   Fully trusted archive:

        my_tarfile.extraction_filter = (lambda member, path: member)
        my_tarfile.extractall()

-   Use the 'data' filter if available, but revert to Python 3.11
    behavior ('fully_trusted') if this feature is not available:

        my_tarfile.extraction_filter = getattr(tarfile, 'data_filter',
                                               (lambda member, path: member))
        my_tarfile.extractall()

    (This is an unsafe operation, so it should be spelled out
    explicitly, ideally with a comment.)

-   Use the 'data' filter; fail if it is not available:

        my_tarfile.extractall(filter=tarfile.data_filter)

    or:

        my_tarfile.extraction_filter = tarfile.data_filter
        my_tarfile.extractall()

-   Use the 'data' filter; warn if it is not available:

        if hasattr(tarfile, 'data_filter'):
            my_tarfile.extractall(filter='data')
        else:
            # remove this when no longer needed
            warn_the_user('Extracting may be unsafe; consider updating Python')
            my_tarfile.extractall()

Security Implications

This proposal improves security, at the expense of backwards
compatibility. In particular, it will help users avoid CVE-2007-4559.

How to Teach This

The API, usage notes and tips for further verification will be added to
the documentation. These should be usable for users who are familiar
with archives in general, but not with the specifics of UNIX filesystems
nor the related security issues.

Reference Implementation

See pull request #102953 on GitHub.

Rejected Ideas

SafeTarFile

An initial idea from Lars Gustäbel was to provide a separate class that
implements security checks (see gh-65308). There are two major issues
with this approach:

-   The name is misleading. General archive operations can never be made
    “safe” from all kinds of unwanted behavior, without impacting
    legitimate use cases.
-   It does not solve the problem of unsafe defaults.

However, many of the ideas behind SafeTarFile were reused in this PEP.

Add absolute_path option to tarfile

Issue gh-73974 asks for adding an absolute_path option to extraction
methods. This would be a minimal change to formally resolve
CVE-2007-4559. It doesn't go far enough to protect the unaware, nor to
empower the diligent and curious.

Other names for the 'tar' filter

The 'tar' filter exposes features specific to UNIX-like filesystems, so
it could be named 'unix'. Or 'unix-like', 'nix', '*nix', 'posix'?

Feature-wise, tar format and UNIX-like filesystem are essentially
equivalent, so tar is a good name.

Possible Further Work

Adding filters to zipfile and shutil.unpack_archive

For consistency, zipfile and shutil.unpack_archive could gain support
for a filter argument. However, this would require research that this
PEP's author can't promise for Python 3.12.

Filters for zipfile would probably not help security. Zip is used
primarily for cross-platform data bundles, and correspondingly,
ZipFile.extract's defaults are already similar
to what a 'data' filter would do. A 'fully_trusted' filter, which would
newly allow absolute paths and .. path components, might not be useful
for much except a unified unpack_archive API.

Filters should be useful for use cases other than security, but those
would usually need custom filter functions, and those would need an API
that works with both TarInfo and ZipInfo. That is
definitely out of scope of this PEP.

If only this PEP is implemented and nothing changes for zipfile, the
effect for callers of unpack_archive is that the default for tar files
is changing from 'fully_trusted' to the more appropriate 'data'. In the
interim period, Python 3.12-3.13 will emit DeprecationWarning. That's
annoying, but there are several ways to handle it: e.g. add a filter
argument conditionally, set TarFile.extraction_filter globally, or
ignore/suppress the warning until Python 3.14.

Also, since many calls to unpack_archive are likely to be unsafe,
there's hope that the DeprecationWarning will often turn out to be a
helpful hint to review affected code.

Thanks

This proposal is based on prior work and discussions by many people, in
particular Lars Gustäbel, Gregory P. Smith, Larry Hastings, Joachim
Wagner, Jan Matejek, Jakub Wilk, Daniel Garcia, Lumír Balhar, Miro
Hrončok, and many others.

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.