PEP: 722 Title: Dependency specification for single-file scripts Author:
Paul Moore <p.f.moore@gmail.com> PEP-Delegate: Brett Cannon
<brett@python.org> Discussions-To: https://discuss.python.org/t/29905
Status: Rejected Type: Standards Track Topic: Packaging Content-Type:
text/x-rst Created: 19-Jul-2023 Post-History: 19-Jul-2023 Resolution:
https://discuss.python.org/t/pep-722-723-decision/36763/

Abstract

This PEP specifies a format for including 3rd-party dependencies in a
single-file Python script.

Motivation

Not all Python code is structured as a "project", in the sense of having
its own directory complete with a pyproject.toml file, and being built
into an installable distribution package. Python is also routinely used
as a scripting language, with Python scripts as a (better) alternative
to shell scripts, batch files, etc. When used to create scripts, Python
code is typically stored as a single file, often in a directory
dedicated to such "utility scripts", which might be in a mix of
languages with Python being only one possibility among many. Such
scripts may be shared, often by something as simple as email, or a link
to a URL such as a Github gist. But they are typically not "distributed"
or "installed" as part of a normal workflow.

One problem when using Python as a scripting language in this way is how
to run the script in an environment that contains whatever third party
dependencies are required by the script. There is currently no standard
tool that addresses this issue, and this PEP does not attempt to define
one. However, any tool that does address this issue will need to know
what 3rd party dependencies a script requires. By defining a standard
format for storing such data, existing tools, as well as any future
tools, will be able to obtain that information without requiring users
to include tool-specific metadata in their scripts.

Rationale

Because a key requirement is writing single-file scripts, and simple
sharing by giving someone a copy of the script, the PEP defines a
mechanism for embedding dependency data within the script itself, and
not in an external file.

We define the concept of a dependency block that contains information
about what 3rd party packages a script depends on.

In order to identify dependency blocks, the script can simply be read as
a text file. This is deliberate, as Python syntax changes over time, so
attempting to parse the script as Python code would require choosing a
specific version of Python syntax. Also, it is likely that at least some
tools will not be written in Python, and expecting them to implement a
Python parser is too much of a burden.

However, to avoid needing changes to core Python, the format is designed
to appear as comments to the Python parser. It is possible to write code
where a dependency block is not interpreted as a comment (for example,
by embedding it in a Python multi-line string), but such uses are
discouraged and can easily be avoided assuming you are not deliberately
trying to create a pathological example.

A review of how other languages allow scripts to specify their
dependencies shows that a "structured comment" like this is a
commonly-used approach.

Specification

The content of this section will be published in the Python Packaging
user guide, PyPA Specifications section, as a document with the title
"Embedding Metadata in Script Files".

Any Python script may contain a dependency block. The dependency block
is identified by reading the script as a text file (i.e., the file is
not parsed as Python source code), looking for the first line of the
form:

    # Script Dependencies:

The hash character must be at the start of the line with no preceding
whitespace. The text "Script Dependencies" is recognised regardless of
case, and the spaces represent arbitrary whitespace (although at least
one space must be present). The following regular expression recognises
the dependency block header line:

    (?i)^#\s+script\s+dependencies:\s*$

Tools reading the dependency block MAY respect the standard Python
encoding declaration. If they choose not to do so, they MUST process the
file as UTF-8.

After the header line, all lines in the file up to the first line that
doesn't start with a # sign are considered dependency lines and are
treated as follows:

1.  The initial # sign is stripped.
2.  If the line contains the character sequence " # " (SPACE HASH
    SPACE), then those characters and any subsequent characters are
    discarded. This allows dependency blocks to contain inline comments.
3.  Whitespace at the start and end of the remaining text is discarded.
4.  If the line is now empty, it is ignored.
5.  The content of the line MUST now be a valid PEP 508 dependency
    specifier.

The requirement for spaces before and after the # in an inline comment
is necessary to distinguish them from part of a PEP 508 URL specifier
(which can contain a hash, but without surrounding whitespace).

Consumers MUST validate that at a minimum, all dependencies start with a
name as defined in PEP 508, and they MAY validate that all dependencies
conform fully to PEP 508. They MUST fail with an error if they find an
invalid specifier.

Example

The following is an example of a script with an embedded dependency
block:

    # In order to run, this script needs the following 3rd party libraries
    #
    # Script Dependencies:
    #    requests
    #    rich     # Needed for the output
    #
    #    # Not needed - just to show that fragments in URLs do not
    #    # get treated as comments
    #    pip @ https://github.com/pypa/pip/archive/1.3.1.zip#sha1=da9234ee9982d4bbb3c72346a6de940a148ea686

    import requests
    from rich.pretty import pprint

    resp = requests.get("https://peps.python.org/api/peps.json")
    data = resp.json()
    pprint([(k, v["title"]) for k, v in data.items()][:10])

Backwards Compatibility

As dependency blocks take the form of a structured comment, they can be
added without altering the meaning of existing code.

It is possible that a comment may already exist which matches the form
of a dependency block. While the identifying header text, "Script
Dependencies" is chosen to minimise this risk, it is still possible.

In the rare case where an existing comment would be interpreted
incorrectly as a dependency block, this can be addressed by adding an
actual dependency block (which can be empty if the script has no
dependencies) earlier in the code.

Security Implications

If a script containing a dependency block is run using a tool that
automatically installs dependencies, this could cause arbitrary code to
be downloaded and installed in the user's environment.

The risk here is part of the functionality of the tool being used to run
the script, and as such should already be addressed by the tool itself.
The only additional risk introduced by this PEP is if an untrusted
script with a dependency block is run, when a potentially malicious
dependency might be installed. This risk is addressed by the normal good
practice of reviewing code before running it.

How to Teach This

The format is intended to be close to how a developer might already
specify script dependencies in an explanatory comment. The required
structure is deliberately minimal, so that formatting rules are easy to
learn.

Users will need to know how to write Python dependency specifiers. This
is covered by PEP 508, but for simple examples (which is expected to be
the norm for inexperienced users) the syntax is either just a package
name, or a name and a version restriction, which is fairly
well-understood syntax.

Users will also know how to run a script using a tool that interprets
dependency data. This is not covered by this PEP, as it is the
responsibility of such a tool to document how it should be used.

Note that the core Python interpreter does not interpret dependency
blocks. This may be a point of confusion for beginners, who try to run
python some_script.py and do not understand why it fails. This is no
different than the current status quo, though, where running a script
without its dependencies present will give an error.

In general, it is assumed that if a beginner is given a script with
dependencies (regardless of whether they are specified in a dependency
block), the person supplying the script should explain how to run that
script, and if that involves using a script runner tool, that should be
noted.

Recommendations

This section is non-normative and simply describes "good practices" when
using dependency blocks.

While it is permitted for tools to do minimal validation of
requirements, in practice they should do as much "sanity check"
validation as possible, even if they cannot do a full check for PEP 508
syntax. This helps to ensure that dependency blocks that are not
correctly terminated are reported early. A good compromise between the
minimal approach of checking just that the requirement starts with a
name, and full PEP 508 validation, is to check for a bare name, or a
name followed by optional whitespace, and then one of [ (extra), @
(urlspec), ; (marker) or one of (<!=>~ (version).

Scripts should, in general, place the dependency block at the top of the
file, either immediately after any shebang line, or straight after the
script docstring. In particular, the dependency block should always be
placed before any executable code in the file. This makes it easy for
the human reader to locate it.

Reference Implementation

Code to implement this proposal in Python is fairly straightforward, so
the reference implementation can be included here.

    import re
    import tokenize
    from packaging.requirements import Requirement

    DEPENDENCY_BLOCK_MARKER = r"(?i)^#\s+script\s+dependencies:\s*$"

    def read_dependency_block(filename):
        # Use the tokenize module to handle any encoding declaration.
        with tokenize.open(filename) as f:
            # Skip lines until we reach a dependency block (OR EOF).
            for line in f:
                if re.match(DEPENDENCY_BLOCK_MARKER, line):
                    break
            # Read dependency lines until we hit a line that doesn't
            # start with #, or we are at EOF.
            for line in f:
                if not line.startswith("#"):
                    break
                # Remove comments. An inline comment is introduced by
                # a hash, which must be preceded and followed by a
                # space.
                line = line[1:].split(" # ", maxsplit=1)[0]
                line = line.strip()
                # Ignore empty lines
                if not line:
                    continue
                # Try to convert to a requirement. This will raise
                # an error if the line is not a PEP 508 requirement
                yield Requirement(line)

A format similar to the one proposed here is already supported in pipx
and in pip-run.

Rejected Ideas

Why not include other metadata?

The core use case addressed by this proposal is that of identifying what
dependencies a standalone script needs in order to run successfully.
This is a common real-world issue that is currently solved by script
runner tools, using implementation-specific ways of storing the data.
Standardising the storage format improves interoperability by not typing
the script to a particular runner.

While it is arguable that other forms of metadata could be useful in a
standalone script, the need is largely theoretical at this point. In
practical terms, scripts either don't use other metadata, or they store
it in existing, widely used (and therefore de facto standard) formats.
For example, scripts needing README style text typically use the
standard Python module docstring, and scripts wanting to declare a
version use the common convention of having a __version__ variable.

One case which was raised during the discussion on this PEP, was the
ability to declare a minimum Python version that a script needed to run,
by analogy with the Requires-Python core metadata item for packages.
Unlike packages, scripts are normally only run by one user or in one
environment, in contexts where multiple versions of Python are uncommon.
The need for this metadata is therefore much less critical in the case
of scripts. As further evidence of this, the two key script runners
currently available, pipx and pip-run do not offer a means of including
this data in a script.

Creating a standard "metadata container" format would unify the various
approaches, but in practical terms there is no real need for
unification, and the disruption would either delay adoption, or more
likely simply mean script authors would ignore the standard.

This proposal therefore chooses to focus just on the one use case where
there is a clear need for something, and no existing standard or common
practice.

Why not use a marker per line?

Rather than using a comment block with a header, another possibility
would be to use a marker on each line, something like:

    # Script-Dependency: requests
    # Script-Dependency: click

While this makes it easier to parse lines individually, it has a number
of issues. The first is simply that it's rather verbose, and less
readable. This is clearly affected by the chosen keyword, but all of the
suggested options were (in the author's opinion) less readable than the
block comment form.

More importantly, this form by design makes it impossible to require
that the dependency specifiers are all together in a single block. As a
result, it's not possible for a human reader, without a careful check of
the whole file, to be sure that they have identified all of the
dependencies. See the question below, "Why not allow multiple dependency
blocks and merge them?", for further discussion of this problem.

Finally, as the reference implementation demonstrates, parsing the
"comment block" form isn't, in practice, significantly more difficult
than parsing this form.

Why not use a distinct form of comment for the dependency block?

A previous version of this proposal used ## to identify dependency
blocks. Unfortunately, however, the flake8 linter implements a rule
requiring that comments must have a space after the initial # sign.
While the PEP author considers that rule misguided, it is on by default
and as a result would cause checks to fail when faced with a dependency
block.

Furthermore, the black formatter, although it allows the ## form, does
add a space after the # for most other forms of comment. This means that
if we chose an alternative like #%, automatic reformatting would corrupt
the dependency block. Forms including a space, like # # are possible,
but less natural for the average user (omitting the space is an obvious
mistake to make).

While it is possible that linters and formatters could be changed to
recognise the new standard, the benefit of having a dedicated prefix did
not seem sufficient to justify the transition cost, or the risk that
users might be using older tools.

Why not allow multiple dependency blocks and merge them?

Because it's too easy for the human reader to miss the fact that there's
a second dependency block. This could simply result in the script runner
unexpectedly downloading extra packages, or it could even be a way to
smuggle malicious packages onto a user's machine (by "hiding" a second
dependency block in the body of the script).

While the principle of "don't run untrusted code" applies here, the
benefits aren't sufficient to be worth the risk.

Why not use a more standard data format (e.g., TOML)?

First of all, the only practical choice for an alternative format is
TOML. Python packaging has standardised on TOML for structured data, and
using a different format, such as YAML or JSON, would add complexity and
confusion for no real benefit.

So the question is essentially, "why not use TOML?"

The key idea behind the "dependency block" format is to define something
that reads naturally as a comment in the script. Dependency data is
useful both for tools and for the human reader, so having a human
readable format is beneficial. On the other hand, TOML of necessity has
a syntax of its own, which distracts from the underlying data.

It is important to remember that developers who write scripts in Python
are often not experienced in Python, or Python packaging. They are often
systems administrators, or data analysts, who may simply be using Python
as a "better batch file". For such users, the TOML format is extremely
likely to be unfamiliar, and the syntax will be obscure to them, and not
particularly intuitive. Such developers may well be copying dependency
specifiers from sources such as Stack Overflow, without really
understanding them. Having to embed such a requirement into a TOML
structure is an additional complexity --and it is important to remember
that the goal here is to make using 3rd party libraries easy for such
users.

Furthermore, TOML, by its nature, is a flexible format intended to
support very general data structures. There are many ways of writing a
simple list of strings in it, and it will not be clear to inexperienced
users which form to use.

Another potential issue is that using a generalised TOML parser can in
some cases result in a measurable performance overhead. Startup time is
often quoted as an issue when running small scripts, so this may be a
problem for script runners that are aiming for high performance.

And finally, there will be tools that expect to write dependency data
into scripts -- for example, an IDE with a feature that automatically
adds an import and a dependency specifier when you reference a library
function. While libraries exist that allow editing TOML data, they are
not always good at preserving the user's layout. Even if libraries exist
which do an effective job at this, expecting all tools to use such a
library is a significant imposition on code supporting this PEP.

By choosing a simple, line-based format with no quoting rules,
dependency data is easy to read (for humans and tools) and easy to
write. The format doesn't have the flexibility of something like TOML,
but the use case simply doesn't demand that sort of flexibility.

Why not use (possibly restricted) Python syntax?

This would typically involve storing the dependencies as a (runtime)
list variable with a conventional name, such as:

    __requires__ = [
        "requests",
        "click",
    ]

Other suggestions include a static multi-line string, or including the
dependencies in the script's docstring.

The most significant problem with this proposal is that it requires all
consumers of the dependency data to implement a Python parser. Even if
the syntax is restricted, the rest of the script will use the full
Python syntax, and trying to define a syntax which can be successfully
parsed in isolation from the surrounding code is likely to be extremely
difficult and error-prone.

Furthermore, Python's syntax changes in every release. If extracting
dependency data needs a Python parser, the parser will need to know
which version of Python the script is written for, and the overhead for
a generic tool of having a parser that can handle multiple versions of
Python is unsustainable.

Even if the above issues could be addressed, the format would give the
impression that the data could be altered at runtime. However, this is
not the case in general, and code that tries to do so will encounter
unexpected and confusing behaviour.

And finally, there is no evidence that having dependency data available
at runtime is of any practical use. Should such a use be found, it is
simple enough to get the data by parsing the source -
read_dependency_block(__file__).

It is worth noting, though, that the pip-run utility does implement (an
extended form of) this approach. Further discussion of the pip-run
design is available on the project's issue tracker.

Why not embed a pyproject.toml file in the script?

First of all, pyproject.toml is a TOML based format, so all of the
previous concerns around TOML as a format apply. However, pyproject.toml
is a standard used by Python packaging, and re-using an existing
standard is a reasonable suggestion that deserves to be addressed on its
own merits.

The first issue is that the suggestion rarely implies that all of
pyproject.toml is to be supported for scripts. A script is not intended
to be "built" into any sort of distributable artifact like a wheel (see
below for more on this point), so the [build-system] section of
pyproject.toml makes little sense, for example. And while the
tool-specific sections of pyproject.toml might be useful for scripts,
it's not at all clear that a tool like ruff would want to support
per-file configuration in this way, leading to confusion when users
expect it to work, but it doesn't. Furthermore, this sort of
tool-specific configuration is just as useful for individual files in a
larger project, so we have to consider what it would mean to embed a
pyproject.toml into a single file in a larger project that has its own
pyproject.toml.

In addition, pyproject.toml is currently focused on projects that are to
be built into wheels. There is an ongoing discussion about how to use
pyproject.toml for projects that are not intended to be built as wheels,
and until that question is resolved (which will likely require some PEPs
of its own) it seems premature to be discussing embedding pyproject.toml
into scripts, which are definitely not intended to be built and
distributed in that manner.

The conclusion, therefore (which has been stated explicitly in some, but
not all, cases) is that this proposal is intended to mean that we would
embed part of pyproject.toml. Typically this is the [project] section
from PEP 621, or even just the dependencies item from that section.

At this point, the first issue is that by framing the proposal as
"embedding pyproject.toml", we would be encouraging the sort of
confusion discussed in the previous paragraphs - developers will expect
the full capabilities of pyproject.toml, and be confused when there are
differences and limitations. It would be better, therefore, to consider
this suggestion as simply being a proposal to use an embedded TOML
format, but specifically re-using the structure of a particular part of
pyproject.toml. The problem then becomes how we describe that structure,
without causing confusion for people familiar with pyproject.toml. If we
describe it with reference to pyproject.toml, the link is still there.
But if we describe it in isolation, people will be confused by the
"similar but different" nature of the structure.

It is also important to remember that a key part of the target audience
for this proposal is developers who are simply using Python as a "better
batch file" solution. These developers will generally not be familiar
with Python packaging and its conventions, and are often the people most
critical of the "complexity" and "difficulty" of packaging solutions. As
a result, proposals based on those existing solutions are likely to be
unwelcome to that audience, and could easily result in people simply
continuing to use existing adhoc solutions, and ignoring the standard
that was intended to make their lives easier.

Why not infer the requirements from import statements?

The idea would be to automatically recognize import statements in the
source file and turn them into a list of requirements.

However, this is infeasible for several reasons. First, the points above
about the necessity to keep the syntax easily parsable, for all Python
versions, also by tools written in other languages, apply equally here.

Second, PyPI and other package repositories conforming to the Simple
Repository API do not provide a mechanism to resolve package names from
the module names that are imported (see also this related discussion).

Third, even if repositories did offer this information, the same import
name may correspond to several packages on PyPI. One might object that
disambiguating which package is wanted would only be needed if there are
several projects providing the same import name. However, this would
make it easy for anyone to unintentionally or malevolently break working
scripts, by uploading a package to PyPI providing an import name that is
the same as an existing project. The alternative where, among the
candidates, the first package to have been registered on the index is
chosen, would be confusing in case a popular package is developed with
the same import name as an existing obscure package, and even harmful if
the existing package is malware intentionally uploaded with a
sufficiently generic import name that has a high probability of being
reused.

A related idea would be to attach the requirements as comments to the
import statements instead of gathering them in a block, with a syntax
such as:

    import numpy as np # requires: numpy
    import rich # requires: rich

This still suffers from parsing difficulties. Also, where to place the
comment in the case of multiline imports is ambiguous and may look ugly:

    from PyQt5.QtWidgets import (
        QCheckBox, QComboBox, QDialog, QDialogButtonBox,
        QGridLayout, QLabel, QSpinBox, QTextEdit
    ) # requires: PyQt5

Furthermore, this syntax cannot behave as might be intuitively expected
in all situations. Consider:

    import platform
    if platform.system() == "Windows":
        import pywin32 # requires: pywin32

Here, the user's intent is that the package is only required on Windows,
but this cannot be understood by the script runner (the correct way to
write it would be requires: pywin32 ; sys_platform == 'win32').

(Thanks to Jean Abou-Samra for the clear discussion of this point)

Why not simply manage the environment at runtime?

Another approach to running scripts with dependencies is simply to
manage those dependencies at runtime. This can be done by using a
library that makes packages available. There are many options for
implementing such a library, for example by installing them directly
into the user's environment or by manipulating sys.path to make them
available from a local cache.

These approaches are not incompatible with this PEP. An API such as

    env_mgr.install("rich")
    env_mgr.install("click")

    import rich
    import click

    ...

is certainly feasible. However, such a library could be written without
the need for any new standards, and as far as the PEP author is aware,
this has not happened. This suggests that an approach like this is not
as attractive as it first seems. There is also the bootstrapping issue
of making the env_mgr library available in the first place. And finally,
this approach doesn't actually offer any interoperability benefits, as
it does not use a standard form for the dependency list, and so other
tools cannot access the data.

In any case, such a library could still benefit from this proposal, as
it could include an API to read the packages to install from the script
dependency block. This would give the same functionality while allowing
interoperability with other tools that support this specification.

    # Script Dependencies:
    #     rich
    #     click
    env_mgr.install_dependencies(__file__)

    import rich
    import click

    ...

Why not just set up a Python project with a pyproject.toml?

Again, a key issue here is that the target audience for this proposal is
people writing scripts which aren't intended for distribution. Sometimes
scripts will be "shared", but this is far more informal than
"distribution" - it typically involves sending a script via an email
with some written instructions on how to run it, or passing someone a
link to a gist.

Expecting such users to learn the complexities of Python packaging is a
significant step up in complexity, and would almost certainly give the
impression that "Python is too hard for scripts".

In addition, if the expectation here is that the pyproject.toml will
somehow be designed for running scripts in place, that's a new feature
of the standard that doesn't currently exist. At a minimum, this isn't a
reasonable suggestion until the current discussion on Discourse about
using pyproject.toml for projects that won't be distributed as wheels is
resolved. And even then, it doesn't address the "sending someone a
script in a gist or email" use case.

Why not use a requirements file for dependencies?

Putting your requirements in a requirements file, doesn't require a PEP.
You can do that right now, and in fact it's quite likely that many adhoc
solutions do this. However, without a standard, there's no way of
knowing how to locate a script's dependency data. And furthermore, the
requirements file format is pip-specific, so tools relying on it are
depending on a pip implementation detail.

So in order to make a standard, two things would be required:

1.  A standardised replacement for the requirements file format.
2.  A standard for how to locate the requirements file for a given
    script.

The first item is a significant undertaking. It has been discussed on a
number of occasions, but so far no-one has attempted to actually do it.
The most likely approach would be for standards to be developed for
individual use cases currently addressed with requirements files. One
option here would be for this PEP to simply define a new file format
which is simply a text file containing PEP 508 requirements, one per
line. That would just leave the question of how to locate that file.

The "obvious" solution here would be to do something like name the file
the same as the script, but with a .reqs extension (or something
similar). However, this still requires two files, where currently only a
single file is needed, and as such, does not match the "better batch
file" model (shell scripts and batch files are typically
self-contained). It requires the developer to remember to keep the two
files together, and this may not always be possible. For example, system
administration policies may require that all files in a certain
directory are executable (the Linux filesystem standards require this of
/usr/bin, for example). And some methods of sharing a script (for
example, publishing it on a text file sharing service like Github's
gist, or a corporate intranet) may not allow for deriving the location
of an associated requirements file from the script's location (tools
like pipx support running a script directly from a URL, so "download and
unpack a zip of the script and its dependencies" may not be an
appropriate requirement).

Essentially, though, the issue here is that there is an explicitly
stated requirement that the format supports storing dependency data in
the script file itself. Solutions that don't do that are simply ignoring
that requirement.

Should scripts be able to specify a package index?

Dependency metadata is about what package the code depends on, and not
where that package comes from. There is no difference here between
metadata for scripts, and metadata for distribution packages (as defined
in pyproject.toml). In both cases, dependencies are given in "abstract"
form, without specifying how they are obtained.

Some tools that use the dependency information may, of course, need to
locate concrete dependency artifacts - for example if they expect to
create an environment containing those dependencies. But the way they
choose to do that will be closely linked to the tool's UI in general,
and this PEP does not try to dictate the UI for tools.

There is more discussion of this point, and in particular of the UI
choices made by the pip-run tool, in the previously mentioned pip-run
issue <pip-run issue_>.

What about local dependencies?

These can be handled without needing special metadata and tooling,
simply by adding the location of the dependencies to sys.path. This PEP
simply isn't needed for this case. If, on the other hand, the "local
dependencies" are actual distributions which are published locally, they
can be specified as usual with a PEP 508 requirement, and the local
package index specified when running a tool by using the tool's UI for
that.

Open Issues

None at this point.

References

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.