PEP: 756 Title: Add PyUnicode_Export() and PyUnicode_Import() C
functions Author: Victor Stinner <vstinner@python.org> PEP-Delegate: C
API Working Group Discussions-To: https://discuss.python.org/t/63891
Status: Draft Type: Standards Track Created: 13-Sep-2024 Python-Version:
3.14 Post-History: 14-Sep-2024

Abstract

Add functions to the limited C API version 3.14:

-   PyUnicode_Export(): export a Python str object as a Py_buffer view.
-   PyUnicode_Import(): import a Python str object.

On CPython, PyUnicode_Export() has an O(1) complexity: no memory is
copied and no conversion is done.

Rationale

PEP 393

PEP 393 "Flexible String Representation" changed string internals in
Python 3.3 to use three formats:

-   PyUnicode_1BYTE_KIND: Unicode range [U+0000; U+00ff], UCS-1, 1
    byte/character.
-   PyUnicode_2BYTE_KIND: Unicode range [U+0000; U+ffff], UCS-2, 2
    bytes/character.
-   PyUnicode_4BYTE_KIND: Unicode range [U+0000; U+10ffff], UCS-4, 4
    bytes/character.

A Python str object must always use the most compact format. For
example, a string which only contains ASCII characters must use the
UCS-1 format.

The PyUnicode_KIND() function can be used to know the format used by a
string.

One of the following functions can be used to access data:

-   PyUnicode_1BYTE_DATA() for PyUnicode_1BYTE_KIND.
-   PyUnicode_2BYTE_DATA() for PyUnicode_2BYTE_KIND.
-   PyUnicode_4BYTE_DATA() for PyUnicode_4BYTE_KIND.

To get the best performance, a C extension should have 3 code paths for
each of these 3 string native formats.

Limited C API

PEP 393 functions such as PyUnicode_KIND() and PyUnicode_1BYTE_DATA()
are excluded from the limited C API. It's not possible to write code
specialized for UCS formats. A C extension using the limited C API can
only use less efficient code paths and string formats.

For example, the MarkupSafe project has a C extension specialized for
UCS formats for best performance, and so cannot use the limited C API.

Specification

API

Add the following API to the limited C API version 3.14:

    int32_t PyUnicode_Export(
        PyObject *unicode,
        int32_t requested_formats,
        Py_buffer *view);
    PyObject* PyUnicode_Import(
        const void *data,
        Py_ssize_t nbytes,
        int32_t format);

    #define PyUnicode_FORMAT_UCS1  0x01   // Py_UCS1*
    #define PyUnicode_FORMAT_UCS2  0x02   // Py_UCS2*
    #define PyUnicode_FORMAT_UCS4  0x04   // Py_UCS4*
    #define PyUnicode_FORMAT_UTF8  0x08   // char*
    #define PyUnicode_FORMAT_ASCII 0x10   // char* (ASCII string)

The int32_t type is used instead of int to have a well defined type size
and not depend on the platform or the compiler. See Avoid C-specific
Types for the longer rationale.

PyUnicode_Export()

API:

    int32_t PyUnicode_Export(
        PyObject *unicode,
        int32_t requested_formats,
        Py_buffer *view)

Export the contents of the unicode string in one of the
requested_formats.

-   On success, fill view, and return a format (greater than 0).
-   On error, set an exception, and return -1. view is left unchanged.

After a successful call to PyUnicode_Export(), the view buffer must be
released by PyBuffer_Release(). The contents of the buffer are valid
until they are released.

The buffer is read-only and must not be modified.

The view->len member must be used to get the string length. The buffer
should end with a trailing NUL character, but it's not recommended to
rely on that because of embedded NUL characters.

unicode and view must not be NULL.

Available formats:

  Constant Identifier      Value   Description
  ------------------------ ------- -------------------------
  PyUnicode_FORMAT_UCS1    0x01    UCS-1 string (Py_UCS1*)
  PyUnicode_FORMAT_UCS2    0x02    UCS-2 string (Py_UCS2*)
  PyUnicode_FORMAT_UCS4    0x04    UCS-4 string (Py_UCS4*)
  PyUnicode_FORMAT_UTF8    0x08    UTF-8 string (char*)
  PyUnicode_FORMAT_ASCII   0x10    ASCII string (Py_UCS1*)

UCS-2 and UCS-4 use the native byte order.

requested_formats can be a single format or a bitwise combination of the
formats in the table above. On success, the returned format will be set
to a single one of the requested formats.

Note that future versions of Python may introduce additional formats.

No memory is copied and no conversion is done.

Export complexity

On CPython, an export has a complexity of O(1): no memory is copied and
no conversion is done.

To get the best performance on CPython and PyPy, it's recommended to
support these 4 formats:

    (PyUnicode_FORMAT_UCS1 \
     | PyUnicode_FORMAT_UCS2 \
     | PyUnicode_FORMAT_UCS4 \
     | PyUnicode_FORMAT_UTF8)

PyPy uses UTF-8 natively and so the PyUnicode_FORMAT_UTF8 format is
recommended. It requires a memory copy, since PyPy str objects can be
moved in memory (PyPy uses a moving garbage collector).

Py_buffer format and item size

Py_buffer uses the following format and item size depending on the
export format:

  Export format            Buffer format   Item size
  ------------------------ --------------- -----------
  PyUnicode_FORMAT_UCS1    "B"             1 byte
  PyUnicode_FORMAT_UCS2    "=H"            2 bytes
  PyUnicode_FORMAT_UCS4    "=I"            4 bytes
  PyUnicode_FORMAT_UTF8    "B"             1 byte
  PyUnicode_FORMAT_ASCII   "B"             1 byte

PyUnicode_Import()

API:

    PyObject* PyUnicode_Import(
        const void *data,
        Py_ssize_t nbytes,
        int32_t format)

Create a Unicode string object from a buffer in a supported format.

-   Return a reference to a new string object on success.
-   Set an exception and return NULL on error.

data must not be NULL. nbytes must be positive or zero.

See PyUnicode_Export() for the available formats.

UTF-8 format

CPython 3.14 doesn't use the UTF-8 format internally and doesn't support
exporting a string as UTF-8. The PyUnicode_AsUTF8AndSize() function can
be used instead.

The PyUnicode_FORMAT_UTF8 format is provided for compatibility with
alternate implementations which may use UTF-8 natively for strings.

ASCII format

When the PyUnicode_FORMAT_ASCII format is request for export, the
PyUnicode_FORMAT_UCS1 export format is used for ASCII strings.

The PyUnicode_FORMAT_ASCII format is mostly useful for
PyUnicode_Import() to validate that a string only contains ASCII
characters.

Surrogate characters and embedded NUL characters

Surrogate characters are allowed: they can be imported and exported.

Embedded NUL characters are allowed: they can be imported and exported.

Implementation

https://github.com/python/cpython/pull/123738

Backwards Compatibility

There is no impact on the backward compatibility, only new C API
functions are added.

Usage of PEP 393 C APIs

A code search on PyPI top 7,500 projects (in March 2024) shows that
there are many projects importing and exporting UCS formats with the
regular C API.

PyUnicode_FromKindAndData()

25 projects call PyUnicode_FromKindAndData():

-   Cython (3.0.9)
-   Levenshtein (0.25.0)
-   PyICU (2.12)
-   PyICU-binary (2.7.4)
-   PyQt5 (5.15.10)
-   PyQt6 (6.6.1)
-   aiocsv (1.3.1)
-   asyncpg (0.29.0)
-   biopython (1.83)
-   catboost (1.2.3)
-   cffi (1.16.0)
-   mojimoji (0.0.13)
-   mwparserfromhell (0.6.6)
-   numba (0.59.0)
-   numpy (1.26.4)
-   orjson (3.9.15)
-   pemja (0.4.1)
-   pyahocorasick (2.0.0)
-   pyjson5 (1.6.6)
-   rapidfuzz (3.6.2)
-   regex (2023.12.25)
-   srsly (2.4.8)
-   tokenizers (0.15.2)
-   ujson (5.9.0)
-   unicodedata2 (15.1.0)

PyUnicode_4BYTE_DATA()

21 projects call PyUnicode_2BYTE_DATA() and/or PyUnicode_4BYTE_DATA():

-   Cython (3.0.9)
-   MarkupSafe (2.1.5)
-   Nuitka (2.1.2)
-   PyICU (2.12)
-   PyICU-binary (2.7.4)
-   PyQt5_sip (12.13.0)
-   PyQt6_sip (13.6.0)
-   biopython (1.83)
-   catboost (1.2.3)
-   cement (3.0.10)
-   cffi (1.16.0)
-   duckdb (0.10.0)
-   mypy (1.9.0)
-   numpy (1.26.4)
-   orjson (3.9.15)
-   pemja (0.4.1)
-   pyahocorasick (2.0.0)
-   pyjson5 (1.6.6)
-   pyobjc-core (10.2)
-   sip (6.8.3)
-   wxPython (4.2.1)

Rejected Ideas

Reject embedded NUL characters and require trailing NUL character

In C, it's convenient to have a trailing NUL character. For example, the
for (; *str != 0; str++) loop can be used to iterate on characters and
strlen() can be used to get a string length.

The problem is that a Python str object can embed NUL characters.
Example: "ab\0c". If a string contains an embedded NUL character, code
relying on the NUL character to find the string end truncates the
string. It can lead to bugs, or even security vulnerabilities. See a
previous discussion in the issue Change PyUnicode_AsUTF8() to return
NULL on embedded null characters.

Rejecting embedded NUL characters require to scan the string which has
an O(n) complexity.

Reject surrogate characters

Surrogate characters are characters in the Unicode range [U+D800;
U+DFFF]. They are disallowed by UTF codecs such as UTF-8. A Python str
object can contain arbitrary lone surrogate characters. Example:
"\uDC80".

Rejecting surrogate characters prevents exporting a string which
contains such a character. It can be surprising and annoying since the
PyUnicode_Export() caller doesn't control the string contents.

Allowing surrogate characters allows to export any string and so avoid
this issue. For example, the UTF-8 codec can be used with the
surrogatepass error handler to encode and decode surrogate characters.

Conversions on demand

It would be convenient to convert formats on demand. For example,
convert UCS-1 and UCS-2 to UCS-4 if an export to only UCS-4 is
requested.

The problem is that most users expect an export to require no memory
copy and no conversion: an O(1) complexity. It is better to have an API
where all operations have an O(1) complexity.

Export to UTF-8

CPython 3.14 has a cache to encode a string to UTF-8. It is tempting to
allow exporting to UTF-8.

The problem is that the UTF-8 cache doesn't support surrogate
characters. An export is expected to provide the whole string content,
including embedded NUL characters and surrogate characters. To export
surrogate characters, a different code path using the surrogatepass
error handler is needed and each export operation has to allocate a
temporary buffer: O(n) complexity.

An export is expected to have an O(1) complexity, so the idea to export
UTF-8 in CPython was abadonned.

Discussions

-   https://discuss.python.org/t/63891
-   https://github.com/capi-workgroup/decisions/issues/33
-   https://github.com/python/cpython/issues/119609

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.