PEP: 701
Title: Syntactic formalization of f-strings
Author: Pablo Galindo <pablogsal@python.org>,
        Batuhan Taskaya <batuhan@python.org>,
        Lysandros Nikolaou <lisandrosnik@gmail.com>,
        Marta Gómez Macías <cyberwitch@google.com>
Discussions-To: https://discuss.python.org/t/pep-701-syntactic-formalization-of-f-strings/22046
Status: Accepted
Type: Standards Track
Content-Type: text/x-rst
Created: 15-Nov-2022
Python-Version: 3.12
Post-History: 19-Dec-2022
Resolution: 14-Mar-2023

Abstract

This document proposes to lift some of the restrictions originally
formulated in PEP 498 and to provide a formalized grammar for f-strings
that can be integrated into the parser directly. The proposed syntactic
formalization of f-strings will have some small side-effects on how
f-strings are parsed and interpreted, allowing for a considerable number
of advantages for end users and library developers, while also
dramatically reducing the maintenance cost of the code dedicated to
parsing f-strings.

Motivation

When f-strings were originally introduced in PEP 498, the specification
did not include a formal grammar for f-strings.
Additionally, the specification contains several restrictions that are
imposed so the parsing of f-strings could be implemented into CPython
without modifying the existing lexer. These limitations have been
recognized previously and previous attempts have been made to lift them
in PEP 536, but none of this work was ever implemented. Some of these
limitations (collected originally by PEP 536) are:

1.  It is impossible to use the quote character delimiting the f-string
    within the expression portion:

        >>> f'Magic wand: { bag['wand'] }'
                                     ^
        SyntaxError: invalid syntax

2.  A previously considered way around it would lead to escape sequences
    in executed code and is prohibited in f-strings:

        >>> f'Magic wand { bag[\'wand\'] } string'
        SyntaxError: f-string expression portion cannot include a backslash

3.  Comments are forbidden even in multi-line f-strings:

        >>> f'''A complex trick: {
        ... bag['bag']  # recursive bags!
        ... }'''
        SyntaxError: f-string expression part cannot include '#'

4.  Arbitrary nesting of expressions without expansion of escape
    sequences is available in many other languages that employ a string
    interpolation method that uses expressions instead of just variable
    names. Some examples:

        # Ruby
        "#{ "#{1+2}" }"

        # JavaScript
        `${`${1+2}`}`

        # Swift
        "\("\(1+2)")"

        # C#
        $"{$"{1+2}"}"

These limitations serve no purpose from a language user perspective and
can be lifted by giving f-string literals a regular grammar without
exceptions and implementing it using dedicated parse code.

The other issue that f-strings have is that the current implementation
in CPython relies on tokenising f-strings as STRING tokens and a post
processing of these tokens. This has the following problems:

1.  It adds a considerable maintenance cost to the CPython parser. This
    is because the parsing code needs to be written by hand, which has
    historically led to a considerable number of inconsistencies and
    bugs. Writing and maintaining parsing code by hand in C has always
    been considered error prone and dangerous as it needs to deal with a
    lot of manual memory management over the original lexer buffers.
2.  The f-string parsing code is not able to use the new improved error
    message mechanisms that the new PEG parser, originally introduced in
    PEP 617, has allowed. The improvements that these error messages
    brought have been greatly celebrated but unfortunately f-strings
    cannot benefit from them because they are parsed in a separate piece
    of the parsing machinery. This is especially unfortunate, since
    there are several syntactical features of f-strings that can be
    confusing due to the different implicit tokenization that happens
    inside the expression part (for instance f"{y:=3}" is not an
    assignment expression).
3.  Other Python implementations have no way to know if they have
    implemented f-strings correctly because, contrary to other language
    features, f-strings are not part of the official Python grammar.
    This is important
    because several prominent alternative implementations are using
    CPython's PEG parser, such as PyPy, and/or are basing their grammars
    on the official PEG grammar. The fact that f-strings use a separate
    parser prevents these alternative implementations from leveraging
    the official grammar and benefiting from improvements in error
    messages derived from the grammar.

A version of this proposal was originally discussed on Python-Dev and
presented at the Python Language Summit 2022 where it was
enthusiastically received.

Rationale

By building on top of the new Python PEG Parser (PEP 617), this PEP
proposes to redefine “f-strings”, especially emphasizing the clear
separation of the string component and the expression (or replacement,
{...}) component. PEP 498 summarizes the syntactical part of “f-strings”
as the following:

  In Python source code, an f-string is a literal string, prefixed with
  ‘f’, which contains expressions inside braces. The expressions are
  replaced with their values.

However, PEP 498 also contained a formal list of exclusions on what can
or cannot be contained inside the expression component (primarily due to
the limitations of the existing parser). By clearly establishing the
formal grammar, we now also have the ability to define the expression
component of an f-string as truly "any applicable Python expression" (in
that particular context) without being bound by the limitations imposed
by the details of our implementation.

The formalization effort and the premise above also have a significant
benefit for Python programmers due to its ability to simplify and
eliminate the obscure limitations. This reduces the mental burden and
the cognitive complexity of f-string literals (as well as the Python
language in general).

1.  The expression component can include any string literal that a
    normal Python expression can include. This opens up the possibility
    of nesting string literals (formatted or not) inside the expression
    component of an f-string with the same quote type (and length):

        >>> f"These are the things: {", ".join(things)}"

        >>> f"{source.removesuffix(".py")}.c: $(srcdir)/{source}"

        >>> f"{f"{f"infinite"}"}" + " " + f"{f"nesting!!!"}"

    This "feature" is not universally agreed to be desirable, and some
    users find this unreadable. For a discussion on the different views
    on this, see the considerations regarding quote reuse section.

2.  Another issue that has felt unintuitive to most is the lack of
    support for backslashes within the expression component of an
    f-string. One example that keeps coming up is including a newline
    character in the expression part for joining containers. For
    example:

        >>> a = ["hello", "world"]
        >>> f"{'\n'.join(a)}"
        File "<stdin>", line 1
            f"{'\n'.join(a)}"
                            ^
        SyntaxError: f-string expression part cannot include a backslash

    A common work-around for this was to either assign the newline to an
    intermediate variable or pre-create the whole string prior to
    creating the f-string:

        >>> a = ["hello", "world"]
        >>> joined = '\n'.join(a)
        >>> f"{joined}"
        'hello\nworld'

    It only feels natural to allow backslashes in the expression part
    now that the new PEG parser can easily support it.

        >>> a = ["hello", "world"]
        >>> f"{'\n'.join(a)}"
        'hello\nworld'

3.  Before the changes proposed in this document, there was no explicit
    limit in how f-strings can be nested, but the fact that string
    quotes cannot be reused inside the expression component of f-strings
    made it impossible to nest f-strings arbitrarily. In fact, this is
    the most deeply nested f-string that could be written:

        >>> f"""{f'''{f'{f"{1+1}"}'}'''}"""
        '2'

    As this PEP allows placing any valid Python expression inside the
    expression component of the f-strings, it is now possible to reuse
    quotes and it is therefore possible to nest f-strings arbitrarily:

        >>> f"{f"{f"{f"{f"{f"{1+1}"}"}"}"}"}"
        '2'

    Although this is just a consequence of allowing arbitrary
    expressions, the authors of this PEP do not believe that this is a
    fundamental benefit and we have decided that the language
    specification will not explicitly mandate that this nesting can be
    arbitrary. This is because allowing arbitrarily-deep nesting imposes
    a lot of extra complexity to the lexer implementation (particularly
    as lexer/parser pipelines need to allow "untokenizing" to support
    the 'f-string debugging expressions' and this is especially taxing
    when arbitrary nesting is allowed). Implementations are therefore
    free to impose a limit on the nesting depth if they need to. Note
    that this is not an uncommon situation, as the CPython
    implementation already imposes several limits all over the place,
    including a limit on the nesting depth of parentheses and brackets,
    a limit on the nesting of the blocks, a limit in the number of
    branches in if statements, a limit on the number of expressions in
    star-unpacking, etc.
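
The effect of quote reuse on nesting can be checked with a small script.
The following is only a sketch: nested_fstring is a hypothetical helper
that builds the source text of an f-string nested to a given depth while
reusing the double quote at every level, and the deeper case is guarded by
a version check on the assumption that only Python 3.12 or newer
implements the grammar proposed here.

```python
import sys


def nested_fstring(depth: int, inner: str = "1+1") -> str:
    """Return source code for an f-string nested `depth` levels deep,
    reusing the double quote at every level (hypothetical helper)."""
    source = inner
    for _ in range(depth):
        source = 'f"{' + source + '}"'
    return source


print(nested_fstring(3))  # f"{f"{f"{1+1}"}"}"

# One level is valid on every Python version that has f-strings.
assert eval(nested_fstring(1)) == "2"

# Deeper nesting with quote reuse needs the grammar proposed here.
if sys.version_info >= (3, 12):
    assert eval(nested_fstring(5)) == "2"
```

Building the source as a plain string keeps the script itself parseable by
older interpreters, which would otherwise reject the file outright.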

Specification

The formal proposed PEG grammar specification for f-strings is (see PEP
617 for details on the syntax):

    fstring:
        | FSTRING_START fstring_middle* FSTRING_END
    fstring_middle:
        | fstring_replacement_field
        | FSTRING_MIDDLE
    fstring_replacement_field:
        | '{' (yield_expr | star_expressions) "="? [ "!" NAME ] [ ':' fstring_format_spec* ] '}'
    fstring_format_spec:
        | FSTRING_MIDDLE
        | fstring_replacement_field

The new tokens (FSTRING_START, FSTRING_MIDDLE, FSTRING_END) are defined
later in this document.

This PEP leaves up to the implementation the level of f-string nesting
allowed (f-strings within the expression parts of other f-strings) but
specifies a lower bound of 5 levels of nesting. This is to ensure that
users can have a reasonable expectation of being able to nest f-strings
with "reasonable" depth. This PEP implies that limiting nesting is not
part of the language specification but also the language specification
doesn't mandate arbitrary nesting.

Similarly, this PEP leaves up to the implementation the level of
expression nesting in format specifiers but specifies a lower bound of 2
levels of nesting. This means that the following should always be valid:

    f"{'':*^{1:{1}}}"

but the following can be valid or not depending on the implementation:

    f"{'':*^{1:{1:{1}}}}"

The new grammar will preserve the Abstract Syntax Tree (AST) of the
current implementation. This means that no semantic changes will be
introduced by this PEP on existing code that uses f-strings.
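
As an illustration of the AST shape that is being preserved, the sketch
below parses an f-string with the standard ast module and inspects the
resulting nodes. The f-string used (with the names value and width) is
made up for the example, and the node names assume Python 3.8 or newer,
where literal parts are reported as ast.Constant.

```python
import ast

tree = ast.parse('f"{value!r:>{width}} items"', mode="eval")
joined = tree.body  # an ast.JoinedStr node

for part in joined.values:
    print(type(part).__name__)
# FormattedValue
# Constant

field = joined.values[0]
# The conversion is stored as the ord() of its character, here "r".
assert field.conversion == ord("r")
# The format spec is itself a JoinedStr, which is how replacement
# fields can nest inside it.
assert isinstance(field.format_spec, ast.JoinedStr)
```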

Handling of f-string debug expressions

Since Python 3.8, f-strings can be used to debug expressions by using
the = operator. For example:

    >>> a = 1
    >>> f"{1+1=}"
    '1+1=2'

These semantics were not introduced formally in a PEP and they were
implemented in the current string parser as a special case in bpo-36817
and documented in the f-string lexical analysis section.

This feature is not affected by the changes proposed in this PEP, but it
is important to specify that the formal handling of this feature requires
the lexer to be able to "untokenize" the expression part of the
f-string. This is not a problem for the current string parser as it can
operate directly on the string token contents. However, incorporating
this feature into a given parser implementation requires the lexer to
keep track of the raw string contents of the expression part of the
f-string and make them available to the parser when the parse tree is
constructed for f-string nodes. A pure "untokenization" is not enough
because as specified currently, f-string debug expressions preserve
whitespace in the expression, including spaces after the { and the =
characters. This means that the raw string contents of the expression
part of the f-string must be kept intact and not just the associated
tokens.

How parser/lexer implementations deal with this problem is of course up
to the implementation.
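
The whitespace-preserving behaviour described above can be observed
directly. This short sketch (the variable names are arbitrary) shows why
the raw text of the expression part is needed and a token stream alone is
not enough: all three f-strings contain the same tokens but produce
different results.

```python
a = 1

compact = f"{a=}"
spaced = f"{a = }"
padded = f"{ a  =  }"

print(repr(compact))  # 'a=1'
print(repr(spaced))   # 'a = 1'
print(repr(padded))   # ' a  =  1'
```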

New tokens

Three new tokens are introduced: FSTRING_START, FSTRING_MIDDLE and
FSTRING_END. Different lexers may have different implementations that
may be more efficient than the ones proposed here given the context of
the particular implementation. However, the following definitions will
be used as part of the public APIs of CPython (such as the tokenize
module) and are also provided as a reference so that the reader can have
a better understanding of the proposed grammar changes and how the
tokens are used:

-   FSTRING_START: This token includes the f-string prefix (f/F/fr) and
    the opening quote(s).
-   FSTRING_MIDDLE: This token includes a portion of text inside the
    string that's not part of the expression part and isn't an opening
    or closing brace. This can include the text between the opening
    quote and the first expression brace ({), the text between two
    expression braces (} and {) and the text between the last expression
    brace (}) and the closing quote.
-   FSTRING_END: This token includes the closing quote.

These tokens are always string parts and they are semantically
equivalent to the STRING token with the restrictions specified. These
tokens must be produced by the lexer when lexing f-strings. This means
that the tokenizer cannot produce a single token for f-strings anymore.
How the lexer emits these tokens is not specified as this will heavily
depend on every implementation (even the Python version of the lexer in
the standard library is implemented differently to the one used by the
PEG parser).

As an example:

    f'some words {a+b:.3f} more words {c+d=} final words'

will be tokenized as:

    FSTRING_START - "f'"
    FSTRING_MIDDLE - 'some words '
    LBRACE - '{'
    NAME - 'a'
    PLUS - '+'
    NAME - 'b'
    OP - ':'
    FSTRING_MIDDLE - '.3f'
    RBRACE - '}'
    FSTRING_MIDDLE - ' more words '
    LBRACE - '{'
    NAME - 'c'
    PLUS - '+'
    NAME - 'd'
    OP - '='
    RBRACE - '}'
    FSTRING_MIDDLE - ' final words'
    FSTRING_END - "'"

while f"""some words""" will be tokenized simply as:

    FSTRING_START - 'f"""'
    FSTRING_MIDDLE - 'some words'
    FSTRING_END - '"""'

Changes to the tokenize module

The tokenize module will be adapted to emit these tokens as described in
the previous section when parsing f-strings so tools can take advantage
of this new tokenization schema and avoid having to implement their own
f-string tokenizer and parser.
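
A tool can observe the token stream with the existing tokenize API. The
sketch below uses a hypothetical helper, token_pairs, and deliberately
prints whatever the running interpreter produces: the expectation is that
an interpreter implementing this PEP reports FSTRING_START, FSTRING_MIDDLE
and FSTRING_END, while an older one reports a single STRING token.

```python
import io
import tokenize


def token_pairs(source: str):
    """Return (token name, token text) pairs for the given source code."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.exact_type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]


for name, text in token_pairs("f'some words {a+b:.3f}'\n"):
    print(f"{name:15} {text!r}")
```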

How to produce these new tokens

One way existing lexers can be adapted to emit these tokens is to
incorporate a stack of "lexer modes" or to use a stack of different
lexers. This is because the lexer needs to switch from "regular Python
lexing" to "f-string lexing" when it encounters an f-string start token
and as f-strings can be nested, the context needs to be preserved until
the f-string closes. Also, the "lexer mode" inside an f-string
expression part needs to behave as a "super-set" of the regular Python
lexer (as it needs to be able to switch back to f-string lexing when it
encounters the } terminator for the expression part as well as handling
f-string formatting and debug expressions). For reference, here is a
draft of the algorithm to modify a CPython-like tokenizer to emit these
new tokens:

1.  If the lexer detects that an f-string is starting (by detecting the
    letter 'f/F' and one of the possible quotes) keep advancing until a
    valid quote is detected (one of ", """, ' or ''') and emit a
    FSTRING_START token with the contents captured (the 'f/F' and the
    starting quote). Push a new tokenizer mode to the tokenizer mode
    stack for "F-string tokenization". Go to step 2.

2.  Keep consuming tokens until one of the following is encountered:

    -   A closing quote equal to the opening quote.
    -   If in "format specifier mode" (see step 3), an opening brace
        ({), a closing brace (}), or a newline token (\n).
    -   If not in "format specifier mode" (see step 3), an opening brace
        ({) or a closing brace (}) that is not immediately followed by
        another opening/closing brace.

    In all cases, if the character buffer is not empty, emit a
    FSTRING_MIDDLE token with the contents captured so far but transform
    any double opening/closing braces into single opening/closing
    braces. Now, proceed as follows depending on the character
    encountered:

    -   If a closing quote matching the opening quote is encountered go
        to step 4.
    -   If an opening bracket (not immediately followed by another
        opening bracket) is encountered, go to step 3.
    -   If a closing bracket (not immediately followed by another
        closing bracket) is encountered, emit a token for the closing
        bracket and go to step 2.

3.  Push a new tokenizer mode to the tokenizer mode stack for "Regular
    Python tokenization within f-string" and proceed to tokenize with
    it. This mode tokenizes as the "Regular Python tokenization" until a
    : or a } character is encountered with the same level of nesting as
    the opening bracket token that was pushed when we enter the f-string
    part. Using this mode, emit tokens until one of the stop points is
    reached. When this happens, emit the corresponding token for the
    stopping character encountered, pop the current tokenizer mode
    from the tokenizer mode stack and go to step 2. If the stopping
    point is a : character, enter step 2 in "format specifier" mode.

4.  Emit a FSTRING_END token with the contents captured and pop the
    current tokenizer mode (corresponding to "F-string tokenization")
    and go back to "Regular Python mode".

Of course, as mentioned before, it is not possible to provide a precise
specification of how this should be done for an arbitrary tokenizer as
it will depend on the specific implementation and nature of the lexer to
be changed.
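
To make the steps above more concrete, here is a deliberately simplified
sketch in Python. It is not the CPython implementation. The function names
(lex_fstring, _lex_text, _lex_expr) and the EXPR and COLON token names are
invented for this example. The sketch uses recursion in place of an
explicit mode stack, assumes a plain f prefix, and emits each expression
as a single raw EXPR token instead of running the regular Python lexer on
it. String literals inside expressions are not recognized, so braces or
colons inside them would confuse it.

```python
def lex_fstring(src, pos=0):
    """Return (tokens, end position) for the f-string starting at src[pos]."""
    # Step 1: consume the prefix and the opening quote(s).
    quote = src[pos + 1:pos + 4]
    if quote not in ('"""', "'''"):
        quote = src[pos + 1]
    tokens = [("FSTRING_START", src[pos] + quote)]
    pos = _lex_text(src, pos + 1 + len(quote), quote, tokens, in_spec=False)
    # Step 4: _lex_text stopped at the closing quote.
    tokens.append(("FSTRING_END", quote))
    return tokens, pos + len(quote)


def _lex_text(src, pos, quote, tokens, in_spec):
    """Step 2: collect literal text; return the index of the terminator."""
    buf = []

    def flush():
        if buf:
            tokens.append(("FSTRING_MIDDLE", "".join(buf)))
            buf.clear()

    while pos < len(src):
        ch = src[pos]
        if not in_spec and src.startswith(quote, pos):
            flush()
            return pos
        if not in_spec and src[pos:pos + 2] in ("{{", "}}"):
            buf.append(ch)  # a doubled brace becomes a single one
            pos += 2
        elif ch == "{":
            flush()
            tokens.append(("LBRACE", "{"))
            pos = _lex_expr(src, pos + 1, quote, tokens)
        elif ch == "}":
            flush()
            if in_spec:
                return pos  # the caller emits the RBRACE
            tokens.append(("RBRACE", "}"))  # stray brace, left to the parser
            pos += 1
        else:
            buf.append(ch)
            pos += 1
    raise SyntaxError("unterminated f-string")


def _lex_expr(src, pos, quote, tokens):
    """Step 3: scan an expression up to a ':' or '}' at nesting depth 0."""
    start, depth = pos, 0
    while pos < len(src):
        ch = src[pos]
        if ch in "([{":
            depth += 1
        elif ch in ")]}" and depth:
            depth -= 1
        elif ch in ":}" and depth == 0:
            tokens.append(("EXPR", src[start:pos]))
            if ch == ":":
                tokens.append(("COLON", ":"))
                pos = _lex_text(src, pos + 1, quote, tokens, in_spec=True)
            tokens.append(("RBRACE", "}"))
            return pos + 1
        pos += 1
    raise SyntaxError("unterminated replacement field")


tokens, end = lex_fstring('f"{d["k"]:>{w}}!"')
for kind, text in tokens:
    print(kind, repr(text))
```

Because the closing quote is only looked for while collecting literal
text, the quote characters inside the expression d["k"] do not end the
f-string, which is the quote reuse behaviour the new grammar allows.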

Consequences of the new grammar

All restrictions mentioned in the PEP are lifted from f-string literals,
as explained below:

-   Expression portions may now contain strings delimited with the same
    kind of quote that is used to delimit the f-string literal.

-   Backslashes may now appear within expressions just like anywhere
    else in Python code. In case of strings nested within f-string
    literals, escape sequences are expanded when the innermost string is
    evaluated.

-   New lines are now allowed within expression brackets. This means
    that these are now allowed:

        >>> x = 1
        >>> f"___{
        ...     x
        ... }___"
        '___1___'

        >>> f"___{(
        ...     x
        ... )}___"
        '___1___'

-   Comments, using the # character, are allowed within the expression
    part of an f-string. Note that comments require the closing bracket
    (}) of the expression part to be on a different line from the one
    the comment is on; otherwise the bracket will be ignored as part of
    the comment.
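
One way to see which of these constructs a given interpreter accepts is to
compile them from source text. In the sketch below, is_valid_expression is
a hypothetical helper and the example sources are adapted from this
document. The assumption is that every line prints True on an interpreter
implementing this PEP and False on an older one.

```python
import sys


def is_valid_expression(source: str) -> bool:
    """Report whether `source` compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


examples = {
    "quote reuse": 'f"{", ".join(things)}"',
    "backslash": r'''f"{'\n'.join(things)}"''',
    "newline": 'f"___{\n    x\n}___"',
    "comment": 'f"""{\n    x  # a comment\n}"""',
}

for label, source in examples.items():
    print(f"{label:12} {is_valid_expression(source)}")
```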

Considerations regarding quote reuse

One of the consequences of the grammar proposed here is that, as
mentioned above, f-string expressions can now contain strings delimited
with the same kind of quote that is used to delimit the external
f-string literal. For example:

  >>> f" something { my_dict["key"] } something else "

In the discussion thread for this PEP, several concerns have been raised
regarding this aspect and we want to collect them here, as these should
be taken into consideration when accepting or rejecting this PEP.

Some of these objections include:

-   Many people find quote reuse within the same string confusing and
    hard to read. This is because allowing quote reuse will violate a
    current property of Python as it stands today: the fact that strings
    are fully delimited by two consecutive pairs of the same kind of
    quote, which by itself is a very simple rule. One of the reasons
    quote reuse may be harder for humans to parse, leading to less
    readable code, is that the quote character is the same for both
    start and end (as opposed to other delimiters).
-   Some users have raised concerns that quote reuse may break some
    lexer and syntax highlighting tools that rely on simple mechanisms
    to detect strings and f-strings, such as regular expressions or
    simple delimiter matching tools. Introducing quote reuse in
    f-strings will either make it trickier to keep these tools working
    or will break the tools altogether (as, for instance, regular
    expressions cannot parse arbitrary nested structures with
    delimiters). The IDLE editor, included in the standard library, is
    an example of a tool which may need some work to correctly apply
    syntax highlighting to f-strings.

Here are some of the arguments in favour:

-   Many languages that allow similar syntactic constructs (normally
    called "string interpolation") allow quote reuse and arbitrary
    nesting. These languages include JavaScript, Ruby, C#, Bash, Swift
    and many others. The fact that many languages allow quote reuse can
    be a compelling argument in favour of allowing it in Python. This is
    because it will make the language more familiar to users coming from
    other languages.

-   As many other popular languages allow quote reuse in string
    interpolation constructs, this means that editors that support
    syntax highlighting for these languages will already have the
    necessary tools to support syntax highlighting for f-strings with
    quote reuse in Python. This means that although the files that
    handle syntax highlighting for Python will need to be updated to
    support this new feature, doing so is not expected to be impossible
    or very hard.

-   One advantage of allowing quote reuse is that it composes cleanly
    with other syntax. Sometimes this is referred to as "referential
    transparency". An example of this is that if we have f(x+1),
    assuming a is a brand new variable, it should behave the same as
    a = x+1; f(a). And vice versa. So if we have:

        def py2c(source):
            prefix = source.removesuffix(".py")
            return f"{prefix}.c"

    It should be expected that if we replace the variable prefix with
    its definition, the answer should be the same:

        def py2c(source):
            return f"{source.removesuffix(".py")}.c"

-   Code generators (like ast.unparse from the standard library) in their
    current form rely on complicated algorithms to ensure expressions
    within an f-string are properly suited for the context in which they
    are being used. These non-trivial algorithms come with challenges
    such as finding an unused quote type (by tracking the outer quotes),
    and generating string representations which would not include
    backslashes if possible. Allowing quote reuse and backslashes would
    simplify the code generators which deal with f-strings considerably,
    as the regular Python expression logic can be used inside and
    outside of f-strings without any special treatment.

-   Limiting quote reuse will considerably increase the complexity of
    the implementation of the proposed changes. This is because it will
    force the parser to track that it is parsing an expression part of
    an f-string with a given quote in order to know if it needs
    to reject an expression that reuses the quote. Carrying this context
    around is not trivial in parsers that can backtrack arbitrarily
    (such as the PEG parser). The issue becomes even more complex if we
    consider that f-strings can be arbitrarily nested and therefore
    several quote types may need to be rejected.

    To gather feedback from the community, a poll has been initiated to
    get a sense of how the community feels about this aspect of the PEP.
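
The code generator argument above can be illustrated with ast.unparse
itself. This is a minimal sketch, assuming Python 3.9 or newer (where
ast.unparse exists). The exact quoting of the regenerated source is left
unstated because it depends on the interpreter version; the only property
checked is that the regenerated source parses back to the same tree.

```python
import ast

source = "f'{bag[\"wand\"]}'"
tree = ast.parse(source, mode="eval")

# The generator has to choose quotes for the outer f-string and for the
# string inside the expression part.
regenerated = ast.unparse(tree)
print(regenerated)

# Whatever quotes were chosen, the result parses to the same tree.
assert ast.dump(ast.parse(regenerated, mode="eval")) == ast.dump(tree)
```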

Backwards Compatibility

This PEP does not introduce any backwards incompatible syntactic or
semantic changes to the Python language. However, the tokenize module (a
quasi-public part of the standard library) will need to be updated to
support the new f-string tokens (to allow tool authors to correctly
tokenize f-strings). See changes to the tokenize module for more details
regarding how the public API of tokenize will be affected.

How to Teach This

As the concept of f-strings is already ubiquitous in the Python
community, there is no fundamental need for users to learn anything new.
However, as the formalized grammar allows some new possibilities, it is
important that the formal grammar is added to the documentation and
explained in detail, explicitly mentioning what constructs are possible
since this PEP is aiming to avoid confusion.

It is also beneficial to provide users with a simple framework for
understanding what can be placed inside an f-string expression. In this
case the authors think that this work will make it even simpler to
explain this aspect of the language, since it can be summarized as:

  You can place any valid Python expression inside an f-string
  expression.

With the changes in this PEP, there is no need to clarify that string
quotes are limited to be different from the quotes of the enclosing
string, because this is now allowed: as an arbitrary Python string can
contain any possible choice of quotes, so can any f-string expression.
Additionally there is no need to clarify that certain things are not
allowed in the expression part because of implementation restrictions
such as comments, new line characters or backslashes.

The only "surprising" difference is that as f-strings allow specifying a
format, expressions that contain a : character at the top level still need
to be enclosed in parentheses. This is not new to this work, but it is
important to emphasize that this restriction is still in place. This
allows for an easier modification of the summary:

  You can place any valid Python expression inside an f-string
  expression, and everything after a : character at the top level will
  be identified as a format specification.
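
A few examples may help when teaching the top-level : rule. The snippet
below is an illustrative sketch with made-up values; it runs on any
interpreter that supports assignment expressions (Python 3.8 or newer).

```python
y = 5

# A top-level ":" starts the format specification, so "=3" is a format
# spec here (pad to width 3 after the sign), not an assignment.
assert f"{y:=3}" == "  5"

# Parentheses turn it into an assignment expression.
assert f"{(y:=3)}" == "3" and y == 3

# Colons inside brackets or parentheses are part of the expression.
assert f"{[1, 2, 3][1:]}" == "[2, 3]"
assert f"{(lambda x: x * 2)(21)}" == "42"

# Everything after the top-level ":" is the format specification.
assert f"{3.14159:.2f}" == "3.14"
```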

Reference Implementation

A reference implementation can be found in the implementation fork.

Rejected Ideas

1.  Although we think the readability arguments that have been raised
    against allowing quote reuse in f-string expressions are valid and
    very important, we have decided not to reject quote reuse in
    f-strings at the parser level. The reason is that one of the
    cornerstones of this PEP is to reduce the complexity and maintenance
    of parsing f-strings in CPython and this will not only work against
    that goal, but it may even make the implementation even more complex
    than the current one. We believe that forbidding quote reuse should
    be done in linters and code style tools and not in the parser, the
    same way other confusing or hard-to-read constructs in the language
    are handled today.

2.  We have decided not to lift the restriction that some expression
    portions need to wrap ':' and '!' in parentheses at the top level,
    e.g.:

        >>> f'Useless use of lambdas: { lambda x: x*2 }'
        SyntaxError: unexpected EOF while parsing

    The reason is that this will introduce a considerable amount of
    complexity for no real benefit. This is due to the fact that the :
    character normally separates the f-string format specification. This
    format specification is currently tokenized as a string. As the
    tokenizer MUST tokenize what's on the right of the : as either a
    string or a stream of tokens, this won't allow the parser to
    differentiate between the different semantics as that would require
    the tokenizer to backtrack and produce a different set of tokens
    (that is, first try as a stream of tokens, and if it fails, try as a
    string for a format specifier).

    As there is no fundamental advantage in being able to allow lambdas
    and similar expressions at the top level, we have decided to keep
    the restriction that these must be parenthesized if needed:

        >>> f'Useless use of lambdas: { (lambda x: x*2) }'

3.  We have decided to disallow (for the time being) using escaped
    braces (\{ and \}) in addition to the {{ and }} syntax. Although the
    authors of the PEP believe that allowing escaped braces is a good
    idea, we have decided to not include it in this PEP, as it is not
    strictly necessary for the formalization of f-strings proposed here,
    and it can be added independently in a regular CPython issue.

Open Issues

None yet

Footnotes

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.