PEP: 672
Title: Unicode-related Security Considerations for Python
Author: Petr Viktorin <encukou@gmail.com>
Status: Active
Type: Informational
Content-Type: text/x-rst
Created: 01-Nov-2021
Post-History: 01-Nov-2021

Abstract

This document explains possible ways to misuse Unicode to write Python
programs that appear to do something other than what they actually do.

This document does not give any recommendations or solutions.

Introduction

Unicode is a system for handling all kinds of written language. It aims
to allow any character from any human language to be used. Python code
may consist of almost all valid Unicode characters. While this allows
programmers from all around the world to express themselves, it also
allows writing code that is potentially confusing to readers.

It is possible to misuse Python's Unicode-related features to write code
that appears to do something other than what it does. Evildoers could
take advantage of this to trick code reviewers into accepting malicious
code.

The possible issues generally can't be solved in Python itself without
excessive restrictions of the language. They should be solved in code
editors and review tools (such as diff displays), by enforcing
project-specific policies, and by raising awareness among individual
programmers.

This document purposefully does not give any solutions or
recommendations: it is rather a list of things to keep in mind.

This document is specific to Python. For general security considerations
in Unicode text, see [tr36] and [tr39].

Acknowledgement

Investigation for this document was prompted by CVE-2021-42574, Trojan
Source Attacks, reported by Nicholas Boucher and Ross Anderson, which
focuses on Bidirectional override characters and homoglyphs in a variety
of programming languages.

Confusing Features

This section lists some Unicode-related features that can be surprising
or misusable.

ASCII-only Considerations

ASCII is a subset of Unicode, consisting of the most common symbols,
numbers, Latin letters and control characters.

While issues with the ASCII character set are generally well understood,
they're presented here to aid understanding of the non-ASCII
cases.

Confusables and Typos

Some characters look alike. Before the age of computers, many mechanical
typewriters lacked the keys for the digits 0 and 1: users typed O
(capital o) and l (lowercase L) instead. Human readers could tell them
apart by context only. In programming languages, however, the distinction
between digits and letters is critical -- and most fonts designed for
programmers make it easy to tell them apart.

Similarly, in fonts designed for human languages, the uppercase “I” and
lowercase “l” can look similar. Or the letters “rn” may be virtually
indistinguishable from the single letter “m”. Again, programmers' fonts
make these pairs of confusables noticeably different.

However, what is “noticeably” different always depends on the context.
Humans tend to ignore details in longer identifiers: the variable name
accessibi1ity_options can still look indistinguishable from
accessibility_options, even though the two are distinct to the compiler. The
same can be said for plain typos: most humans will not notice the typo
in responsbility_chain_delegate.
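As a small illustration of how little separates the two spellings above, the following sketch compares them character by character. The variable names `intended`, `lookalike` and `differences` are illustrative and do not come from any real code base.

```python
# Two names that differ in a single character: a lowercase L versus the digit 1.
intended = "accessibility_options"
lookalike = "accessibi1ity_options"

# Collect every position at which the two names disagree.
differences = [
    (index, a, b)
    for index, (a, b) in enumerate(zip(intended, lookalike))
    if a != b
]

print(intended == lookalike)  # False: these are two different names
print(differences)            # [(9, 'l', '1')]
```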

Control Characters

Python generally treats CR (\r), LF (\n), and CR-LF pairs (\r\n) as
end-of-line markers. Most code editors do as well, but there
are editors that display “non-native” line endings as unknown characters
(or nothing at all), rather than ending the line, displaying this
example:

    # Don't call this function:
    fire_the_missiles()

as a harmless comment like:

    # Don't call this function:⬛fire_the_missiles()
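The difference between what such an editor shows and what Python runs can be sketched as follows. The stand-in `fire_the_missiles` below only records that it was called, and the source text is built by hand with a lone CR; both are assumptions made for this illustration.

```python
calls = []

def fire_the_missiles():
    # Stand-in for the function named in the example; it only records the call.
    calls.append("fired")

# A lone CR (\r) separates the comment from the call.
source = "# Don't call this function:\rfire_the_missiles()\n"

# str.splitlines() treats the lone CR as a line boundary ...
print(source.splitlines())

# ... and so does the compiler: the second "line" is executed.
exec(source, {"fire_the_missiles": fire_the_missiles})
print(calls)
```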

CPython may treat the control character NUL (\0) as end of input, but
many editors simply skip it, possibly showing code that Python will not
run as a regular part of a file.

Some characters can be used to hide/overwrite other characters when
source is listed in common terminals. For example:

-   BS (\b, Backspace) moves the cursor back, so the character after it
    will overwrite the character before.
-   CR (\r, carriage return) moves the cursor to the start of the line, so
    subsequent characters overwrite the start of the line.
-   SUB (\x1A, Ctrl+Z) means “End of text” on Windows. Some programs
    (such as type) ignore the rest of the file after it.
-   ESC (\x1B) commonly initiates escape codes which allow arbitrary
    control of the terminal.
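One way to see such characters, rather than their effect on a terminal, is to print an escaped form of the text. The helper `show_controls` below is a hypothetical name for a minimal sketch that escapes every character in the Unicode category Cc (control).

```python
import unicodedata

def show_controls(text):
    """Return text with each control character (category Cc) as a \\xNN escape."""
    return "".join(
        f"\\x{ord(ch):02x}" if unicodedata.category(ch) == "Cc" else ch
        for ch in text
    )

# On many terminals the CR moves the cursor back to the start of the line, so
# the second call is drawn over the first one.
line = "print('evil')\rprint('good')"

print(show_controls(line))  # print('evil')\x0dprint('good')
print(ascii(line))          # the built-in ascii() gives a similar escaped view
```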

Confusable Characters in Identifiers

Python is not limited to ASCII. It allows characters of all scripts – from
Latin letters to ancient Egyptian hieroglyphs – in identifiers (such as
variable names). See PEP 3131 for details and rationale. Only “letters
and numbers” are allowed, so while γάτα is a valid Python identifier, 🐱
is not. (See Identifiers and keywords for details.)

Non-printing control characters are also not allowed in identifiers.
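These rules can be checked with `str.isidentifier()`. The sketch below spells the characters with escapes to make the code points explicit; it assumes the cat emoji is U+1F431 (CAT FACE) and uses BEL (\x07) as an arbitrary control character.

```python
candidates = {
    "Greek word": "\u03b3\u03ac\u03c4\u03b1",   # the Greek word from the text above
    "cat emoji": "\N{CAT FACE}",                 # a symbol, not a letter
    "control character inside": "a\x07b",       # BEL between two letters
}

results = {label: text.isidentifier() for label, text in candidates.items()}
for label, is_valid in results.items():
    print(f"{label}: {is_valid}")
```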

However, within the allowed set there is a large number of
“confusables”. For example, the uppercase versions of the Latin b, Greek
β (Beta), and Cyrillic в (Ve) often look identical: B, Β and В,
respectively.

This allows identifiers that look the same to humans, but not to Python.
For example, all of the following are distinct identifiers:

-   scope (Latin, ASCII-only)
-   scоpe (with a Cyrillic о)
-   scοpe (with a Greek ο)
-   ѕсоре (all Cyrillic letters)
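The four spellings can be told apart by looking at the Unicode names of their characters. This is a minimal sketch; the strings are written with escapes so the code points are explicit, on the assumption that they match the characters shown in the list above.

```python
import unicodedata

spellings = {
    "Latin, ASCII-only": "scope",
    "with a Cyrillic o": "sc\u043epe",
    "with a Greek o": "sc\u03bfpe",
    "all Cyrillic letters": "\u0455\u0441\u043e\u0440\u0435",
}

# All four are valid identifiers, and no two of them are equal.
assert all(name.isidentifier() for name in spellings.values())
assert len(set(spellings.values())) == 4

for label, name in spellings.items():
    print(label, [unicodedata.name(ch) for ch in name])
```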

Additionally, some letters can look like non-letters:

-   The letter for the Hawaiian ʻokina looks like an apostrophe; ʻHelloʻ
    is a Python identifier, not a string.
-   The East Asian word for ten looks like a plus sign, so 十= 10 is a
    complete Python statement. (The “十” is a word: “ten” rather than
    “10”.)
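Both cases can be tried out with `exec`. The sketch below assumes that the ʻokina is U+02BB and that the word for ten is U+5341; the `namespace` dictionary is only there to hold the resulting variables.

```python
namespace = {}

# U+5341 is a letter, so this is an assignment to a name, not an addition.
exec("\u5341= 10", namespace)

# U+02BB is a letter too, so this assigns to a name instead of creating a string.
exec("\u02bbHello\u02bb = 'not a string literal'", namespace)

print(namespace["\u5341"])             # 10
print(namespace["\u02bbHello\u02bb"])  # not a string literal
```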

Note

The converse also applies – some symbols look like letters – but since
Python does not allow arbitrary symbols in identifiers, this is not an
issue.

Confusable Digits

Numeric literals in Python only use the ASCII digits 0-9 (and non-digits
such as . or e).

However, when numbers are converted from strings, such as in the int and
float constructors or by the str.format method, any decimal digit can be
used. For example ߅ (NKO DIGIT FIVE) or ௫ (TAMIL DIGIT FIVE) work as the
digit 5.

Some scripts include digits that look similar to ASCII ones, but have a
different value. For example:

    >>> int('৪୨')
    42
    >>> '{٥}'.format('zero', 'one', 'two', 'three', 'four', 'five')
    'five'
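The same conversions can be written with `\N{...}` escapes, which make the digits' identities visible. This sketch assumes the two digits in the `int` example are BENGALI DIGIT FOUR and ORIYA DIGIT TWO; the variable names are illustrative.

```python
import unicodedata

confusable = "\N{BENGALI DIGIT FOUR}\N{ORIYA DIGIT TWO}"
fives = ["\N{NKO DIGIT FIVE}", "\N{TAMIL DIGIT FIVE}"]

print(int(confusable))                    # 42
print([int(ch) for ch in fives])          # [5, 5]
print(float("\N{TAMIL DIGIT FIVE}"))      # 5.0
print([unicodedata.name(ch) for ch in confusable])

# In source code, by contrast, only ASCII digits can form a numeric literal.
literal_rejected = False
try:
    compile("x = \N{TAMIL DIGIT FIVE}", "<example>", "exec")
except SyntaxError:
    literal_rejected = True
print(literal_rejected)
```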

Bidirectional Text

Some scripts, such as Hebrew or Arabic, are written right-to-left.
Phrases in such scripts interact with nearby text in ways that can be
surprising to people who aren't familiar with these writing systems and
their computer representation.

The exact process is complicated, and explained in Unicode Standard
Annex #9, Unicode Bidirectional Algorithm.

Consider the following code, which assigns a 100-character string to the
variable s:

    s = "X" * 100 #    "X" is assigned

When the X is replaced by the Hebrew letter א, the line becomes:

    s = "א" * 100 #    "א" is assigned

This command still assigns a 100-character string to s, but when
displayed as general text following the Bidirectional Algorithm (e.g. in
a browser), it appears as s = "א" followed by a comment.

Other surprising examples include:

-   In the statement ערך = 23, the variable ערך is set to the integer
    23.
-   In the statement قيمة = ערך, the variable قيمة is set to the value
    of ערך.
-   In the statement قيمة - (ערך ** 2), the value of ערך is squared and
    then subtracted from قيمة. The opening parenthesis is displayed as
    ).
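What Python stores is independent of how a renderer orders the text. A minimal sketch, using `unicodedata.bidirectional` to show the classes that drive the display order ("L" is left-to-right; "R" and "AL" are right-to-left):

```python
import unicodedata

s = "\N{HEBREW LETTER ALEF}" * 100
print(len(s))  # 100, whatever a bidi-aware renderer displays

samples = ["X", "\N{HEBREW LETTER ALEF}", "\N{ARABIC LETTER QAF}"]
classes = {unicodedata.name(ch): unicodedata.bidirectional(ch) for ch in samples}
for name, bidi_class in classes.items():
    print(name, bidi_class)
```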

Bidirectional Marks, Embeddings, Overrides and Isolates

Default reordering rules do not always yield the intended direction of
text, so Unicode provides several ways to alter it.

The most basic are directional marks, which are invisible but affect
text as a left-to-right (or right-to-left) character would. Continuing
with the s = "X" example above, in the next example the X is replaced by
the Latin x followed or preceded by a right-to-left mark (U+200F). This
assigns a 200-character string to s (100 copies of x interspersed with
100 invisible marks), but under Unicode rules for general text, it is
rendered as s = "x" followed by an ASCII-only comment:

    s = "x‏" * 100 #    "‏x" is assigned
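The invisible marks show up as soon as the string is inspected rather than rendered. A short sketch, with the mark written as an escape:

```python
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK, invisible when rendered
s = ("x" + RLM) * 100

print(len(s))                     # 200
print(s.count("x"))               # 100
print(unicodedata.name(RLM))      # RIGHT-TO-LEFT MARK
print(unicodedata.category(RLM))  # Cf (format character)
print(ascii(s[:4]))               # 'x\u200fx\u200f'
```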

The directional embedding, override and isolate characters are also
invisible, but affect the ordering of all text after them until either
ended by a dedicated character, or until the end of line. (Unicode
specifies the effect to last until the end of a “paragraph” (see Unicode
Bidirectional Algorithm), but allows tools to interpret newline
characters as paragraph ends (see Unicode Newline Guidelines). Most code
editors and terminals do so.)

These characters essentially allow arbitrary reordering of the text that
follows them. Python only allows them in strings and comments, which
does limit their potential (especially in combination with the fact that
Python's comments always extend to the end of a line), but it doesn't
render them harmless.
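For inspection purposes, these characters can be located by their bidirectional class. The function `find_bidi_controls` and the `sample` line below are hypothetical, and the sketch is meant as an illustration of what such characters look like to a program, not as a complete checker.

```python
import unicodedata

# Bidirectional classes of the explicit embedding, override and isolate
# characters, and of the characters that end them.
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def find_bidi_controls(source):
    """Yield (line number, column, character name) for each bidi control."""
    for lineno, line in enumerate(source.splitlines(), start=1):
        for column, ch in enumerate(line):
            if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES:
                yield lineno, column, unicodedata.name(ch)

# A comment containing an override (U+202E) and an isolate (U+2066).
sample = 'access = "user"  # \u202e comment text \u2066\n'
found = list(find_bidi_controls(sample))
print(found)
```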

Normalizing identifiers

Python strings are sequences of Unicode codepoints, not “characters”.

For reasons like compatibility with earlier encodings, Unicode often has
several ways to encode what is essentially a single “character”. For
example, all these are different ways of writing Å as a Python string,
each of which is unequal to the others.

-   "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}" (1 codepoint)
-   "\N{LATIN CAPITAL LETTER A}\N{COMBINING RING ABOVE}" (2 codepoints)
-   "\N{ANGSTROM SIGN}" (1 codepoint, but different)
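The three spellings can be compared directly. In this sketch, `ascii()` is used so the output shows code points instead of three identical-looking glyphs:

```python
forms = [
    "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}",
    "\N{LATIN CAPITAL LETTER A}\N{COMBINING RING ABOVE}",
    "\N{ANGSTROM SIGN}",
]

for text in forms:
    print(ascii(text), len(text))

# No two of the three strings compare equal.
print(len(set(forms)))  # 3
```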

For another example, the ligature ﬁ has a dedicated Unicode codepoint,
even though it has the same meaning as the two letters fi.

Also, common letters frequently have several distinct variations.
Unicode provides them for contexts where the difference has some
semantic meaning, like mathematics. For example, some variations of n
are:

-   n (LATIN SMALL LETTER N)
-   𝐧 (MATHEMATICAL BOLD SMALL N)
-   𝘯 (MATHEMATICAL SANS-SERIF ITALIC SMALL N)
-   ｎ (FULLWIDTH LATIN SMALL LETTER N)
-   ⁿ (SUPERSCRIPT LATIN SMALL LETTER N)

Unicode includes algorithms to normalize variants like these to a single
form, and Python identifiers are normalized. (There are several normal
forms; Python uses NFKC.)

For example, xn and xⁿ are the same identifier in Python:

    >>> xⁿ = 8
    >>> xn
    8

… as are ﬁ and fi, and as are the different ways to encode Å.
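The same normalization is available for strings through `unicodedata.normalize`. A minimal sketch, with `nfkc` as an illustrative helper name:

```python
import unicodedata

def nfkc(text):
    """Apply the normalization form that Python uses for identifiers."""
    return unicodedata.normalize("NFKC", text)

variants_of_n = [
    "n",
    "\N{MATHEMATICAL BOLD SMALL N}",
    "\N{MATHEMATICAL SANS-SERIF ITALIC SMALL N}",
    "\N{FULLWIDTH LATIN SMALL LETTER N}",
    "\N{SUPERSCRIPT LATIN SMALL LETTER N}",
]
a_ring_forms = [
    "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}",
    "\N{LATIN CAPITAL LETTER A}\N{COMBINING RING ABOVE}",
    "\N{ANGSTROM SIGN}",
]

# Every variant collapses to one normalized form.
assert {nfkc(ch) for ch in variants_of_n} == {"n"}
assert {nfkc(text) for text in a_ring_forms} == {"\u00c5"}
assert nfkc("\N{LATIN SMALL LIGATURE FI}") == "fi"

# The compiler applies the same normalization to identifiers.
namespace = {}
exec("x\N{SUPERSCRIPT LATIN SMALL LETTER N} = 8", namespace)
print(namespace["xn"])  # 8
```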

This normalization applies only to identifiers, however. Functions that
treat strings as identifiers, such as getattr, do not perform
normalization:

    >>> class Test:
    ...     def ﬁnalize(self):
    ...         print('OK')
    ...
    >>> Test().finalize()
    OK
    >>> Test().ﬁnalize()
    OK
    >>> getattr(Test(), 'ﬁnalize')
    Traceback (most recent call last):
      ...
    AttributeError: 'Test' object has no attribute 'ﬁnalize'
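Code that looks up attributes by string has to normalize explicitly if it wants to match what the compiler does. The following is a sketch of that step, not a recommendation; the class mirrors the `Test` example above, with the method name written in ASCII.

```python
import unicodedata

class Test:
    def finalize(self):
        return "OK"

name = "\N{LATIN SMALL LIGATURE FI}nalize"   # starts with the single-character ligature

# The raw string does not match the attribute name.
print(hasattr(Test(), name))  # False

# After NFKC normalization, it does.
normalized = unicodedata.normalize("NFKC", name)
print(getattr(Test(), normalized)())  # OK
```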

This also applies when importing:

-   import ﬁnalization performs normalization, and looks for a file
    named finalization.py (and other finalization.* files).
-   importlib.import_module("ﬁnalization") does not normalize, so it
    looks for a file named ﬁnalization.py.
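The two behaviours can be compared without touching the filesystem by registering a stand-in module in `sys.modules`. This sketch assumes that no real module is stored under the ligature spelling; `stand_in` and `lookup_failed` are illustrative names.

```python
import importlib
import sys
import types

# Register a stand-in module under the normalized name, so no file is needed.
stand_in = types.ModuleType("finalization")
sys.modules["finalization"] = stand_in

ligature_name = "\N{LATIN SMALL LIGATURE FI}nalization"

# The import statement normalizes the identifier and finds the stand-in.
namespace = {}
exec(f"import {ligature_name} as found", namespace)
print(namespace["found"] is stand_in)  # True

# import_module() uses the string exactly as given, so the lookup misses.
lookup_failed = False
try:
    importlib.import_module(ligature_name)
except ModuleNotFoundError:
    lookup_failed = True
finally:
    del sys.modules["finalization"]  # clean up the stand-in
print(lookup_failed)
```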

Some filesystems independently apply normalization and/or case folding.
On some systems, ﬁnalization.py, finalization.py and FINALIZATION.py are
three distinct filenames; on others, some or all of these name the same
file.

Source Encoding

The encoding of a Python source file is determined by a comment matching a
specific regex on one of the first two lines of the file, as per Encoding
declarations. This mechanism is very liberal in what it accepts, and is thus
easy to obfuscate.
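How liberal the matching is can be explored with `tokenize.detect_encoding` from the standard library. The header lines below are made up for this sketch, and `declared_encoding` is an illustrative helper name.

```python
import io
import tokenize

def declared_encoding(first_lines):
    """Return the source encoding Python's tokenize module detects."""
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(first_lines).readline)
    return encoding

# A conventional declaration.
print(declared_encoding(b"# -*- coding: latin-1 -*-\n"))

# The declaration can also hide inside an ordinary-looking sentence.
print(declared_encoding(b"# this sentence mentions decoding: cp1252 in passing\n"))

# Without any declaration, the default applies.
print(declared_encoding(b"x = 1\n"))
```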

This can be misused in combination with Python-specific special-purpose
encodings (see Text Encodings). With encoding: unicode_escape, for
instance, characters like quotes or braces can be hidden in an
(f-)string, with many tools (syntax highlighters, linters, etc.)
considering them part of the string. For example:

    # For writing Japanese, you don't need an editor that supports
    # UTF-8 source encoding: unicode_escape sequences work just as well.

    import os

    message = '''
    This is "Hello World" in Japanese:
    \u3053\u3093\u306b\u3061\u306f\u7f8e\u3057\u3044\u4e16\u754c

    This runs `echo WHOA` in your shell:
    \u0027\u0027\u0027\u002c\u0028\u006f\u0073\u002e
    \u0073\u0079\u0073\u0074\u0065\u006d\u0028
    \u0027\u0065\u0063\u0068\u006f\u0020\u0057\u0048\u004f\u0041\u0027
    \u0029\u0029\u002c\u0027\u0027\u0027
    '''

Here, encoding: unicode_escape in the initial comment is an encoding
declaration. The unicode_escape encoding instructs Python to treat
\u0027 as a single quote (which can start/end a string), \u002c as a
comma (punctuator), etc.
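What the escape sequences in the example turn into can be checked without running them. The sketch below detects the encoding of the two comment lines and then decodes the escapes from the example; for brevity, the escapes are joined here without the line breaks and indentation of the original, and nothing is executed.

```python
import io
import tokenize

header = (
    b"# For writing Japanese, you don't need an editor that supports\n"
    b"# UTF-8 source encoding: unicode_escape sequences work just as well.\n"
)
encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(header).readline)
print(encoding)  # unicode_escape

# The escapes from the second half of the string in the example.
hidden = (
    rb"\u0027\u0027\u0027\u002c\u0028\u006f\u0073\u002e"
    rb"\u0073\u0079\u0073\u0074\u0065\u006d\u0028"
    rb"\u0027\u0065\u0063\u0068\u006f\u0020\u0057\u0048\u004f\u0041\u0027"
    rb"\u0029\u0029\u002c\u0027\u0027\u0027"
)
revealed = hidden.decode(encoding)
print(revealed)  # ''',(os.system('echo WHOA')),'''
```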

Open Issues

We should probably write and publish:

-   Recommendations for Text Editors and Code Tools
-   Recommendations for Programmers and Teams
-   Possible Improvements in Python

References

tr36

    Unicode Technical Report #36: Unicode Security Considerations
    http://www.unicode.org/reports/tr36/

tr39

    Unicode® Technical Standard #39: Unicode Security Mechanisms
    http://www.unicode.org/reports/tr39/

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.