PEP: 750
Title: Tag Strings For Writing Domain-Specific Languages
Author: Jim Baker <jim.baker@python.org>,
        Guido van Rossum <guido@python.org>,
        Paul Everitt <pauleveritt@me.com>
Sponsor: Lysandros Nikolaou <lisandrosnik@gmail.com>
Discussions-To: https://discuss.python.org/t/pep-750-tag-strings-for-writing-domain-specific-languages/60408
Status: Draft
Type: Standards Track
Created: 08-Jul-2024
Python-Version: 3.14

Abstract

This PEP introduces tag strings for custom, repeatable string
processing. Tag strings are an extension to f-strings, with a custom
function -- the "tag" -- in place of the f prefix. This function can
then provide rich features such as safety checks, lazy evaluation,
domain-specific languages (DSLs) for web templating, and more.

Tag strings are similar to JavaScript tagged template literals and
related ideas in other languages. The following tag string usage shows
how similar it is to an f string, albeit with the ability to process the
literal string and embedded values:

    name = "World"
    greeting = greet"hello {name}"
    assert greeting == "Hello WORLD!"

Tag functions accept prepared arguments and return a string:

    def greet(*args):
        """Tag function to return a greeting with an upper-case recipient."""
        salutation, recipient, *_ = args
        getvalue, *_ = recipient
        return f"{salutation.title().strip()} {getvalue().upper()}!"

Below you can find richer examples. As a note, an implementation based
on CPython 3.14 exists, as discussed in this document.

Relationship With Other PEPs

Python introduced f-strings in Python 3.6 with PEP 498. The grammar was
then formalized in PEP 701, which also lifted some restrictions. This PEP
builds on PEP 701.

At nearly the same time PEP 498 arrived, PEP 501 was written to provide
"i-strings" -- that is, "interpolation template strings". The PEP was
deferred pending further experience with f-strings. Work on this PEP was
resumed by a different author in March 2023, introducing "t-strings" as
template literal strings, and built atop PEP 701.

The authors of this PEP consider tag strings as a generalization of the
updated work in PEP 501.

Motivation

Python f-strings became very popular, very fast. The syntax was simple,
convenient, and interpolated expressions had access to regular scoping
rules. However, f-strings have two main limitations: expressions are
eagerly evaluated, and interpolated values cannot be intercepted. The
former means that f-strings cannot be re-used like templates, and the
latter means that how values are interpolated cannot be customized.

Templating in Python is currently achieved using packages like Jinja2
which bring their own templating languages for generating dynamic
content. In addition to being one more thing to learn, these languages
are not nearly as expressive as Python itself. This means that business
logic, which cannot be expressed in the templating language, must be
written in Python instead, spreading the logic across different
languages and files.

Likewise, the inability to intercept interpolated values means that they
cannot be sanitized or otherwise transformed before being integrated
into the final string. Here, the convenience of f-strings could be
considered a liability. For example, a user executing a query with
sqlite3 may be tempted to use an f-string to embed values into their SQL
expression instead of using the ? placeholder and passing the values as
a tuple to avoid an SQL injection attack.

Tag strings address both these problems by extending the f-string syntax
to provide developers access to the string and its interpolated values
before they are combined. In doing so, tag strings may be interpreted in
many different ways, opening up the possibility for DSLs and other
custom string processing.

Proposal

This PEP proposes customizable prefixes for f-strings. These f-strings
then become a "tag string": an f-string with a "tag function." The tag
function is a callable which is given a sequence of arguments for the
parsed tokens in the string.

Here's a very simple example. Imagine we want a certain kind of string
with some custom business policies: uppercase the value and add an
exclamation point.

Let's start with a tag string which simply returns a static greeting:

    def greet(*args):
        """Give a static greeting."""
        return "Hello!"

    assert greet"Hi" == "Hello!"  # Use the custom "tag" on the string

As you can see, greet is just a callable, in the place that the f prefix
would go. Let's look at the args:

    def greet(*args):
        """Uppercase and add exclamation."""
        salutation = args[0].upper()
        return f"{salutation}!"

    greeting = greet"Hello"  # Use the custom "tag" on the string
    assert greeting == "HELLO!"

The tag function is passed a sequence of arguments. Since our tag string
is simply "Hello", the args sequence only contains a string-like value
of 'Hello'.

With this in place, let's introduce an interpolation. That is, a place
where a value should be inserted:

    def greet(*args):
        """Handle an interpolation."""
        # The first arg is the string-like value "Hello " with a space
        salutation = args[0].strip()
        # The second arg is an "interpolation"
        interpolation = args[1]
        # Interpolations are tuples, the first item is a lambda
        getvalue = interpolation[0]
        # It gets called in the scope where it was defined, so
        # the interpolation returns "World"
        result = getvalue()
        recipient = result.upper()
        return f"{salutation} {recipient}!"

    name = "World"
    greeting = greet"Hello {name}"
    assert greeting == "Hello WORLD!"

The f-string interpolation of {name} leads to the new machinery in tag
strings:

-   args[0] is still the string-like 'Hello ', this time with a trailing
    space
-   args[1] is an expression -- the {name} part
-   Tag strings represent this part as an interpolation object as
    discussed below

The *args list is a sequence of Decoded and Interpolation values. A
"decoded" object is a string-like object with extra powers, as described
below. An "interpolation" object is a tuple-like value representing how
Python processed the interpolation into a form useful for your tag
function. Both are fully described below in Specification.

Here is a more generalized version using structural pattern matching and
type hints:

    from typing import Decoded, Interpolation  # Get the new protocols

    def greet(*args: Decoded | Interpolation) -> str:
        """Handle arbitrary args using structural pattern matching."""
        result = []
        for arg in args:
            match arg:
                case Decoded() as decoded:
                    result.append(decoded)
                case Interpolation() as interpolation:
                    value = interpolation.getvalue()
                    result.append(value.upper())

        return f"{''.join(result)}!"

    name = "World"
    greeting = greet"Hello {name} nice to meet you"
    assert greeting == "Hello WORLD nice to meet you!"

Tag strings extract more than just a callable from the Interpolation.
They also provide Python string formatting info, as well as the original
text:

    def greet(*args: Decoded | Interpolation) -> str:
        """Interpolations can have string formatting specs and conversions."""
        result = []
        for arg in args:
            match arg:
                case Decoded() as decoded:
                    result.append(decoded)
                case getvalue, raw, conversion, format_spec:  # Unpack
                    gv = f"gv: {getvalue()}"
                    r = f"r: {raw}"
                    c = f"c: {conversion}"
                    f = f"f: {format_spec}"
                    result.append(", ".join([gv, r, c, f]))

        return f"{''.join(result)}!"

    name = "World"
    assert greet"Hello {name!r:s}" == "Hello gv: World, r: name, c: r, f: s!"

You can see each of the Interpolation parts getting extracted:

-   The lambda expression to call and get the value in the scope it was
    defined
-   The raw string of the interpolation (name)
-   The Python "conversion" field (r)
-   Any format specification (s)

Specification

In the rest of this specification, mytag will be used for an arbitrary
tag. For example:

    def mytag(*args):
        return args

    trade = 'shrubberies'
    mytag'Did you say "{trade}"?'

Valid Tag Names

The tag name can be any undotted name that isn't already an existing
valid string or bytes prefix, as seen in the lexical analysis
specification. Therefore these prefixes can't be used as a tag:

    stringprefix: "r" | "u" | "R" | "U" | "f" | "F"
                : | "fr" | "Fr" | "fR" | "FR" | "rf" | "rF" | "Rf" | "RF"

    bytesprefix: "b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB"

Python restricts certain keywords from being used as identifiers. This
restriction also applies to tag names. Usage of keywords should trigger
a helpful error, as done in recent CPython releases.

Tags Must Immediately Precede the Quote Mark

As with other string literal prefixes, no whitespace can be between the
tag and the quote mark.

PEP 701

Tag strings support the full syntax of PEP 701 in that any string
literal, with any quote mark, can be nested in the interpolation. This
nesting includes, of course, tag strings themselves.

Evaluating Tag Strings

When the tag string is evaluated, the tag must have a binding, or a
NameError is raised; and it must be a callable, or a TypeError is
raised. The callable must accept a sequence of positional arguments.
This behavior follows from the de-sugaring of:

    trade = 'shrubberies'
    mytag'Did you say "{trade}"?'

to:

    mytag(DecodedConcrete(r'Did you say "'), InterpolationConcrete(lambda: trade, 'trade', None, None), DecodedConcrete(r'"?'))

Note

DecodedConcrete and InterpolationConcrete are just example
implementations. If approved, tag strings will have concrete types in
builtins.

Decoded Strings

In the mytag'Did you say "{trade}"?' example, there are two strings:
r'Did you say "' and r'"?'.

Strings are internally stored as objects with a Decoded structure,
meaning that they conform to the Decoded protocol:

    @runtime_checkable
    class Decoded(Protocol):
        def __str__(self) -> str:
            ...

        raw: str

These Decoded objects have access to raw strings. Raw strings are used
because tag strings are meant to target a variety of DSLs, such as the
shell and regexes. Such DSLs have their own specific treatment of
metacharacters, namely the backslash.

However, often the "cooked" string is what is needed, obtained by
decoding the string as if it were a standard Python string. In the
proposed implementation, the decoded object's __new__ stores the raw
string and returns an object whose string value is the "cooked" string.

The protocol is marked as @runtime_checkable to allow structural pattern
matching to test against the protocol instead of a type. This can incur
a small performance penalty. Since the case tests are in user-code tag
functions, authors can choose to optimize by testing for the
implementation type discussed next.

The Decoded protocol will be available from typing. In CPython, Decoded
will be implemented in C, but for discussion of this PEP, the following
is a compatible implementation:

    class DecodedConcrete(str):
        _raw: str

        def __new__(cls, raw: str):
            decoded = raw.encode("utf-8").decode("unicode-escape")
            if decoded == raw:
                decoded = raw
            chunk = super().__new__(cls, decoded)
            chunk._raw = raw
            return chunk

        @property
        def raw(self):
            return self._raw
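As a brief illustration of the raw versus "cooked" distinction, the
sketch below repeats the example class (minus the no-op equality check)
so that it runs on its own, then builds one instance from a raw literal.
The sample text is an arbitrary choice, not something this PEP
prescribes.

```python
class DecodedConcrete(str):
    """Example implementation from this PEP, repeated so the sketch is standalone."""

    _raw: str

    def __new__(cls, raw: str):
        # Decode backslash escapes the way a normal string literal would.
        decoded = raw.encode("utf-8").decode("unicode-escape")
        chunk = super().__new__(cls, decoded)
        chunk._raw = raw
        return chunk

    @property
    def raw(self):
        return self._raw


chunk = DecodedConcrete(r"col1\tcol2")

# The string value is "cooked": the two-character escape \t became a tab.
assert chunk == "col1\tcol2"
# The raw attribute keeps the backslash exactly as written in the source.
assert chunk.raw == "col1\\tcol2"
```

A tag targeting a regex or shell DSL would likely read chunk.raw, while a
tag producing ordinary text would use the object as a plain str.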

Interpolation

An Interpolation is the data structure representing an expression inside
the tag string. Interpolations enable a delayed evaluation model, where
the interpolation expression is computed, transformed, memoized, or
processed in any way.

In addition, the original text of the interpolation expression is made
available to the tag function. This can be useful for debugging or
metaprogramming.

Interpolation is a Protocol which will be made available from typing. It
has the following definition:

    @runtime_checkable
    class Interpolation(Protocol):
        def __len__(self):
            ...

        def __getitem__(self, index: int):
            ...

        getvalue: Callable[[], Any]

        expr: str
        conv: Literal["a", "r", "s"] | None
        format_spec: str | None

Given this example interpolation:

    mytag'{trade!r:some-formatspec}'

these attributes are as follows:

-   getvalue is a zero argument closure for the interpolation. In this
    case, lambda: trade.
-   expr is the expression text of the interpolation. Example: 'trade'.
-   conv is the optional conversion to be used by the tag function, one
    of r, s, and a, corresponding to repr, str, and ascii conversions.
    Note that as with f-strings, no other conversions are supported.
    Example: 'r'.
-   format_spec is the optional format_spec string. A format_spec is
    eagerly evaluated if it contains any expressions before being passed
    to the tag function. Example: 'some-formatspec'.

In all cases, the tag function determines what to do with valid
Interpolation attributes.

In the CPython reference implementation, implementing Interpolation in C
would use the equivalent Struct Sequence Objects (see such code as
os.stat_result). For purposes of this PEP, here is an example of a pure
Python implementation:

    class InterpolationConcrete(NamedTuple):
        getvalue: Callable[[], Any]
        expr: str
        conv: Literal['a', 'r', 's'] | None = None
        format_spec: str | None = None

Interpolation Expression Evaluation

Expression evaluation for interpolations is the same as described in the
"Expression evaluation" section of PEP 498, except that all expressions
are always implicitly wrapped with a lambda:

  The expressions that are extracted from the string are evaluated in
  the context where the tag string appeared. This means the expression
  has full access to its lexical scope, including local and global
  variables. Any valid Python expression can be used, including function
  and method calls.

However, there's one additional nuance to consider: function scope
versus annotation scope. Consider this somewhat contrived example to
configure captions:

    class CaptionConfig:
        tag = 'b'
        figure = f'<{tag}>Figure</{tag}>'

Let's now attempt to rewrite the above example to use tag strings:

    class CaptionConfig:
        tag = 'b'
        figure = html'<{tag}>Figure</{tag}>'

Unfortunately, this rewrite doesn't work if using the usual lambda
wrapping to implement interpolations, namely lambda: tag. When the
interpolations are evaluated by the tag function, it will result in
NameError: name 'tag' is not defined. The root cause of this name error
is that lambda: tag uses function scope, and it's therefore not able to
use the class definition where tag is defined.

Desugaring how the tag string could be evaluated will result in the same
NameError even using f-strings; the lambda wrapping here also uses
function scoping:

    class CaptionConfig:
        tag = 'b'
        figure = f'<{(lambda: tag)()}>Figure</{(lambda: tag)()}>'

For tag strings, getting such a NameError would be surprising. It would
also be a rough edge in using tag strings in this specific case of
working with class variables. After all, tag strings are supposed to
support a superset of the capabilities of f-strings.
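The failure mode can be reproduced in current Python with an ordinary
lambda, which is roughly what a naive implementation would generate. The
class and attribute names below are illustrative only.

```python
class CaptionConfig:
    caption_tag = "b"
    # A plain lambda uses function scope, so it cannot see class-level names.
    get_tag = lambda: caption_tag


try:
    CaptionConfig.get_tag()
except NameError as exc:
    error_message = str(exc)
else:
    error_message = None

# The lookup skipped the class body and found no global of that name.
assert error_message is not None and "caption_tag" in error_message
```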

The solution is to use annotation scope for tag string interpolations.
While the name "annotation scope" suggests it's only about annotations,
it solves this problem by lexically resolving names in the class
definition, such as tag, unlike function scope.

Note

The use of annotation scope means it's not possible to fully desugar
interpolations into Python code. Instead it's as if one is writing
interpolation_lambda: tag, not lambda: tag, where a hypothetical
interpolation_lambda keyword variant uses annotation scope instead of
the standard function scope.

This is more or less how the reference implementation implements this
concept (but without creating a new keyword of course).

This PEP and its reference implementation therefore use the support for
annotation scope. Note that this usage is separable from the
implementation of PEP 649 and PEP 695, which provide a somewhat similar
deferred execution model for annotations. Instead it's up to the tag
function to evaluate any interpolations.

With annotation scope in place, lambda-wrapped expressions in
interpolations then provide the usual lexical scoping seen with
f-strings. So there's no need to use locals(), globals(), or frame
introspection with sys._getframe to evaluate the interpolation. In
addition, the code of each expression is available and does not have to
be looked up with inspect.getsource or some other means.

Format Specification

The format_spec is by default None if it is not specified in the tag
string's corresponding interpolation.

Because the tag function is completely responsible for processing
Decoded and Interpolation values, there is no required interpretation
for the format spec and conversion in an interpolation. For example,
this is a valid usage:

    html'<div id={id:int}>{content:HTML|str}</div>'

In this case the format_spec for the second interpolation is the string
'HTML|str'; it is up to the html tag to do something with the "format
spec" here, if anything.
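One possible interpretation, shown here only as a sketch, is for a tag
function to mimic f-string behavior by applying the conversion and then
the format spec. The helper name format_like_fstring is hypothetical, and
the example class is repeated so the block runs on its own.

```python
from typing import Any, Callable, NamedTuple, Optional


class InterpolationConcrete(NamedTuple):
    """Example implementation from this PEP, repeated so the sketch is standalone."""

    getvalue: Callable[[], Any]
    expr: str
    conv: Optional[str] = None
    format_spec: Optional[str] = None


# The three conversions allowed by f-strings, mapped to their builtins.
_CONVERSIONS = {"r": repr, "s": str, "a": ascii}


def format_like_fstring(interpolation) -> str:
    """Hypothetical helper: evaluate, convert, then format, as an f-string would."""
    value = interpolation.getvalue()
    if interpolation.conv is not None:
        value = _CONVERSIONS[interpolation.conv](value)
    return format(value, interpolation.format_spec or "")


price = 3.14159
# Stand-ins for the interpolations in mytag'{price:.2f}' and mytag'{"hi"!r}'.
assert format_like_fstring(InterpolationConcrete(lambda: price, "price", None, ".2f")) == "3.14"
assert format_like_fstring(InterpolationConcrete(lambda: "hi", '"hi"', "r", None)) == "'hi'"
```

A tag such as html is equally free to ignore this convention and treat a
spec like 'HTML|str' as its own mini-language.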

f-string-style = Evaluation

mytag'{expr=}' is parsed the same as mytag'expr={expr}', as implemented
in the issue "Add = to f-strings for easier debugging".

Tag Function Arguments

The tag function has the following signature:

    def mytag(*args: Decoded | Interpolation) -> Any:
        ...

This corresponds to the following protocol:

    class TagFunction(Protocol):
        def __call__(self, *args: Decoded | Interpolation) -> Any:
            ...

Because of subclassing, the signature for mytag can of course be widened
to the following, at the cost of losing some type specificity:

    def mytag(*args: str | tuple) -> Any:
        ...

A user might write a tag string as follows:

    def tag(*args):
        return args

    tag"\N{{GRINNING FACE}}"

Tag strings will represent this as exactly one Decoded argument. In this
case, Decoded.raw would be '\\N{GRINNING FACE}'. The "cooked"
representation via encode and decode would be:

    >>> '\\N{GRINNING FACE}'.encode('utf-8').decode('unicode-escape')
    '😀'

Named unicode characters immediately followed by more text will still
produce just one Decoded argument:

    def tag(*args):
        return args

    assert tag"\N{{GRINNING FACE}}sometext" == (DecodedConcrete(r"\N{GRINNING FACE}sometext"),)

Return Value

Tag functions can return any type. Often they will return a string, but
richer systems can be built by returning richer objects. See below for a
motivating example.

Function Application

Tag strings desugar as follows:

    mytag'Hi, {name!s:format_spec}!'

This is equivalent to:

    mytag(DecodedConcrete(r'Hi, '), InterpolationConcrete(lambda: name, 'name',
    's', 'format_spec'), DecodedConcrete(r'!'))

Note

To keep it simple, this and subsequent desugaring omits an important
scoping aspect in how names in interpolation expressions are resolved,
specifically when defining classes. See Interpolation Expression
Evaluation.

No Empty Decoded String

Decoded strings and interpolations commonly alternate, but whether they
do depends on the tag string. Decoded strings will never have a value
that is the empty string:

    mytag'{a}{b}{c}'

...which results in this desugaring:

    mytag(InterpolationConcrete(lambda: a, 'a', None, None), InterpolationConcrete(lambda: b, 'b', None, None), InterpolationConcrete(lambda: c, 'c', None, None))

Likewise:

    mytag''

...results in this desugaring:

    mytag()
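The argument shapes above can be approximated in current Python. The
sketch below is an emulation, not how the reference implementation works:
the helper emulate_args is hypothetical, it leans on string.Formatter to
split the template, it uses eval against an explicit namespace in place
of real lexical scoping, and it passes plain str objects rather than
Decoded ones. It only handles simple expressions.

```python
from string import Formatter
from typing import Any, Callable, NamedTuple, Optional


class InterpolationConcrete(NamedTuple):
    """Example implementation from this PEP, repeated so the sketch is standalone."""

    getvalue: Callable[[], Any]
    expr: str
    conv: Optional[str] = None
    format_spec: Optional[str] = None


def emulate_args(template: str, namespace: dict) -> tuple:
    """Hypothetical helper: approximate the arguments a tag function would receive."""
    args = []
    for literal, field, spec, conv in Formatter().parse(template):
        if literal:  # Skip empty literals, mirroring "no empty Decoded string".
            args.append(literal)
        if field is not None:
            getvalue = lambda expr=field: eval(expr, namespace)
            # Formatter reports a missing spec as "", so normalize it to None.
            args.append(InterpolationConcrete(getvalue, field, conv, spec or None))
    return tuple(args)


args = emulate_args("{a}{b}{c}", {"a": 1, "b": 2, "c": 3})
assert [arg.expr for arg in args] == ["a", "b", "c"]
assert emulate_args("", {}) == ()
```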

HTML Example of Rich Return Types

Tag functions can be a powerful part of larger processing chains by
returning richer objects. JavaScript tagged template literals, for
example, are not constrained by a requirement to return a string. As an
example, let's look at an HTML generation system, showing both usage and
a "subcomponent":

    def Menu(*, logo: str, class_: str) -> HTML:
        return html'<img alt="Site Logo" src={logo} class={class_} />'

    icon = 'acme.png'
    result = html'<header><{Menu} logo={icon} class="my-menu"/></header>'
    img = result.children[0]
    assert img.tag == "img"
    assert img.attrs == {"src": "acme.png", "class": "my-menu", "alt": "Site Logo"}
    # We can also treat the return type as a string of specially-serialized HTML
    assert str(result).startswith('<header>')  # etc.

This html tag function might have the following signature:

    def html(*args: Decoded | Interpolation) -> HTML:
        ...

The HTML return class might have the following shape as a Protocol:

    @runtime_checkable
    class HTML(Protocol):
        tag: str
        attrs: dict[str, Any]
        children: Sequence[str | HTML]

In summary, the returned instance can be used as:

-   A string, for serializing to the final output
-   An iterable, for working with WSGI/ASGI, so that output is streamed
    and interpolations are evaluated in the order they are written out
-   A DOM (data) structure of nested Python data

In each case, the result can be lazily and recursively composed in a
safe fashion, because the return value isn't required to be a string.
Recommended practice is that return values are "passive" objects.

What benefits might come from returning rich objects instead of strings?
A DSL for a domain such as HTML templating can provide a toolchain of
post-processing, as Babel does for JavaScript with AST-based
transformation plugins. Similarly, systems that provide middleware
processing can operate on richer, standard objects with more
capabilities. Tag string results can be tested as nested Python objects,
rather than string manipulation. Finally, the intermediate results can
be cached/persisted in useful ways.

Tool Support

Python Semantics in Tag Strings

Python template languages and other DSLs have semantics quite apart from
Python: different scope rules, different calling semantics (e.g. for
macros), their own grammar for loops, and the like.

This means all tools need to write special support for each language.
Even then, it is usually difficult to find all the possible scopes, for
example to autocomplete values.

However, f-strings do not have this issue. An f-string is considered
part of Python. Expressions in curly braces behave as expected and
values should resolve based on regular scoping rules. Tools such as mypy
can see inside f-string expressions, but will likely never look inside a
Jinja2 template.

DSLs written with tag strings will inherit much of this value. While we
can't expect standard tooling to understand the "domain" in the DSL,
they can still inspect anything expressible in an f-string.

Backwards Compatibility

Like f-strings, use of tag strings will be a syntactic backwards
incompatibility with previous versions.

Security Implications

The security implications of working with interpolations are as follows:

1.  Scope lookup is the same as f-strings (lexical scope). This model
    has been shown to work well in practice.
2.  Tag functions can ensure that any interpolations are done in a safe
    fashion, including respecting the context in the target DSL.
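As a sketch of the second point, a tag function for SQL could replace
every interpolation with a ? placeholder and return the values
separately, so that user input never becomes part of the SQL text. The
sql function below is hypothetical and is called with hand-desugared
arguments; only the sqlite3 calls are existing API.

```python
import sqlite3
from typing import Any, Callable, NamedTuple, Optional


class InterpolationConcrete(NamedTuple):
    """Example implementation from this PEP, repeated so the sketch is standalone."""

    getvalue: Callable[[], Any]
    expr: str
    conv: Optional[str] = None
    format_spec: Optional[str] = None


def sql(*args) -> tuple:
    """Hypothetical tag function: return (query_text, parameters)."""
    text, params = [], []
    for arg in args:
        if isinstance(arg, str):
            text.append(arg)  # Static SQL written by the developer.
        else:
            text.append("?")  # Values travel separately from the SQL text.
            params.append(arg.getvalue())
    return "".join(text), tuple(params)


user_input = "Robert'); DROP TABLE students;--"
# Hand-desugared form of: sql"SELECT name FROM students WHERE name = {user_input}"
query, params = sql(
    "SELECT name FROM students WHERE name = ",
    InterpolationConcrete(lambda: user_input, "user_input"),
)

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE students (name TEXT)")
connection.execute("INSERT INTO students VALUES ('Alice')")
rows = connection.execute(query, params).fetchall()
assert query == "SELECT name FROM students WHERE name = ?"
assert rows == []  # The hostile input is compared as data, not executed.
```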

How To Teach This

Tag strings have several audiences: consumers of tag functions, authors
of tag functions, and framework authors who provide interesting
machinery for tag functions.

All three groups can start from an important framing:

-   Existing solutions (such as template engines) can do parts of tag
    strings
-   But tag strings move logic closer to "normal Python"

Consumers can look at tag strings as starting from f-strings:

-   They look familiar
-   Scoping and syntax rules are the same

The first thing they need to absorb: unlike f-strings, the string isn't
immediately evaluated "in-place". Something else (the tag function)
happens. The second thing to teach: each tag function does something
particular, hence the concept of "domain-specific languages" (DSLs).
One extra point to teach: you need to import the tag function before
tagging a string.

Tag function authors think in terms of making a DSL. They have business
policies they want to provide in a Python-familiar way. With tag
functions, Python is going to do much of the pre-processing. This lowers
the bar for making a DSL.

Tag authors can begin with simple use cases. After authors gain
experience, tag strings can be used to add larger patterns: lazy
evaluation, intermediate representations, registries, and more.

Each of these points also matches the teaching of decorators. In that
case, a learner consumes something which applies to the code just after
it. They don't need to know too much about decorator theory to take
advantage of the utility.

Common Patterns Seen In Writing Tag Functions

Structural Pattern Matching

Iterating over the arguments with structural pattern matching is the
expected best practice for many tag function implementations:

    def tag(*args: Decoded | Interpolation) -> Any:
        for arg in args:
            match arg:
                case Decoded() as decoded:
                    ... # handle each decoded string
                case Interpolation() as interpolation:
                    ... # handle each interpolation

Lazy Evaluation

The example tag functions above each call the interpolation's getvalue
lambda immediately. Python developers have frequently wished that
f-strings could be deferred, or lazily evaluated. It would be
straightforward to write a wrapper that, for example, defers calling the
lambda until an __str__ was invoked.
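A minimal sketch of such a wrapper follows. The names LazyText and lazy
are hypothetical, the tag is called with hand-desugared arguments, and
caching the rendered text after the first str() call is one design choice
among several.

```python
from typing import Any, Callable, NamedTuple, Optional


class InterpolationConcrete(NamedTuple):
    """Example implementation from this PEP, repeated so the sketch is standalone."""

    getvalue: Callable[[], Any]
    expr: str
    conv: Optional[str] = None
    format_spec: Optional[str] = None


class LazyText:
    """Hypothetical wrapper: hold the arguments, render only when str() is called."""

    def __init__(self, *args):
        self._args = args
        self._rendered = None

    def __str__(self) -> str:
        if self._rendered is None:
            parts = [
                arg if isinstance(arg, str) else str(arg.getvalue())
                for arg in self._args
            ]
            self._rendered = "".join(parts)  # Cache so lambdas run only once.
        return self._rendered


def lazy(*args) -> LazyText:
    """Hypothetical tag function that defers all interpolation."""
    return LazyText(*args)


calls = []


def expensive():
    calls.append("called")
    return "World"


# Hand-desugared form of: lazy"Hello {expensive()}"
result = lazy("Hello ", InterpolationConcrete(lambda: expensive(), "expensive()"))
assert calls == []  # Nothing has been evaluated yet.
assert str(result) == "Hello World"
assert calls == ["called"]
```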

Memoizing

Tag function authors have control of processing the static string parts
and the dynamic interpolation parts. For higher performance, they can
deploy approaches for memoizing processing, for example by generating
keys.
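One way to generate such keys, sketched below under the assumption that
the static parts are hashable strings, is to cache any expensive
preparation on the tuple of static parts with None marking each
interpolation slot. The names prepare_static and shout are hypothetical,
and uppercasing stands in for real work such as parsing.

```python
from functools import lru_cache
from typing import Any, Callable, NamedTuple, Optional


class InterpolationConcrete(NamedTuple):
    """Example implementation from this PEP, repeated so the sketch is standalone."""

    getvalue: Callable[[], Any]
    expr: str
    conv: Optional[str] = None
    format_spec: Optional[str] = None


prepare_log = []


@lru_cache(maxsize=None)
def prepare_static(shape: tuple) -> tuple:
    """Stand-in for costly work on the static parts; runs once per distinct shape."""
    prepare_log.append(shape)
    return tuple(None if part is None else part.upper() for part in shape)


def shout(*args) -> str:
    """Hypothetical tag function that memoizes processing of its static parts."""
    # The cache key: static strings in place, None for each interpolation.
    shape = tuple(arg if isinstance(arg, str) else None for arg in args)
    prepared = prepare_static(shape)
    values = (arg.getvalue() for arg in args if not isinstance(arg, str))
    return "".join(str(next(values)) if part is None else part for part in prepared)


outputs = []
for name in ("Ada", "Grace"):
    # Hand-desugared form of: shout"hello {name}!"
    outputs.append(shout("hello ", InterpolationConcrete(lambda: name, "name"), "!"))

assert outputs == ["HELLO Ada!", "HELLO Grace!"]
assert len(prepare_log) == 1  # The static parts were prepared only once.
```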

Order of Evaluation

Imagine a tag that generates a number of sections in HTML. The tag needs
inputs for each section. But what if the last input argument takes a
while? You can't return the HTML for the first section until all the
arguments are available.

You'd prefer to emit markup as the inputs are available. Some templating
tools support this approach, as do tag strings.
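A tag function written as a generator is one way to get this behavior,
since each interpolation is evaluated only when its chunk is requested.
The sketch below uses a hypothetical stream tag called with
hand-desugared arguments; the fast and slow functions stand in for
section inputs of differing cost.

```python
from typing import Any, Callable, NamedTuple, Optional


class InterpolationConcrete(NamedTuple):
    """Example implementation from this PEP, repeated so the sketch is standalone."""

    getvalue: Callable[[], Any]
    expr: str
    conv: Optional[str] = None
    format_spec: Optional[str] = None


def stream(*args):
    """Hypothetical tag function: yield output chunks in source order."""
    for arg in args:
        if isinstance(arg, str):
            yield arg
        else:
            # Evaluated only when the consumer asks for this chunk.
            yield str(arg.getvalue())


events = []


def fast():
    events.append("fast evaluated")
    return "<p>intro</p>"


def slow():
    events.append("slow evaluated")
    return "<p>report</p>"


# Hand-desugared form of: stream"<main>{fast()}{slow()}</main>"
chunks = stream(
    "<main>",
    InterpolationConcrete(lambda: fast(), "fast()"),
    InterpolationConcrete(lambda: slow(), "slow()"),
    "</main>",
)

first = next(chunks)
second = next(chunks)
# Two chunks are already available, yet the slow input has not been computed.
assert (first, second) == ("<main>", "<p>intro</p>")
assert events == ["fast evaluated"]
rest = list(chunks)
```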

Reference Implementation

At the time of this PEP's announcement, a fully-working implementation
is available.

This implementation is not final, as the PEP discussion will likely
provide changes.

Rejected Ideas

Enable Exact Round-Tripping of conv and format_spec

There are two limitations with respect to exactly round-tripping to the
original source text.

First, the format_spec can be arbitrarily nested:

    mytag'{x:{a{b{c}}}}'

In this PEP and corresponding reference implementation, the format_spec
is eagerly evaluated to set the format_spec in the interpolation,
thereby losing the original expressions.

While it would be feasible to preserve round-tripping in every usage,
this would require an extra flag, equals, to support, for example,
{x=}, and a recursive Interpolation definition for format_spec. The following
is roughly the pure Python equivalent of this type, including preserving
the sequence unpacking (as used in case statements):

    class InterpolationConcrete(NamedTuple):
        getvalue: Callable[[], Any]
        raw: str
        conv: str | None = None
        format_spec: str | None | tuple[Decoded | Interpolation, ...] = None
        equals: bool = False

        def __len__(self):
            return 4

        def __iter__(self):
            return iter((self.getvalue, self.raw, self.conv, self.format_spec))

However, the additional complexity to support exact round-tripping seems
unnecessary and is thus rejected.

No Implicit String Concatenation

Implicit tag string concatenation isn't supported, which is unlike other
string literals.

The expectation is that triple quoting is sufficient. If implicit string
concatenation is supported, results from tag evaluations would need to
support the + operator with __add__ and __radd__.

Because tag strings target embedded DSLs, this complexity introduces
other issues, such as determining appropriate separators. This seems
unnecessarily complicated and is thus rejected.

Arbitrary Conversion Values

Python allows only r, s, or a as possible conversion type values. Trying
to assign a different value results in SyntaxError.

In theory, tag functions could choose to handle other conversion types.
But this PEP adheres closely to PEP 701. Any changes to allowed values
should be in a separate PEP.

Acknowledgements

Thanks to Ryan Morshead for contributions during development of the
ideas leading to tag strings. Thanks also to Koudai Aono for
infrastructure work on contributing materials. Special mention also to
Dropbox's pyxl for tackling similar ideas years ago.

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.