PEP 756 – Add PyUnicode_Export() and PyUnicode_Import() C functions
- Author:
- Victor Stinner <vstinner at python.org>
- PEP-Delegate:
- C API Working Group
- Status:
- Draft
- Type:
- Standards Track
- Created:
- 13-Sep-2024
- Python-Version:
- 3.14
Abstract
Add functions to the limited C API version 3.14:
PyUnicode_Export()
: export a Python str object as aPy_buffer
view.PyUnicode_Import()
: import a Python str object.
In general, PyUnicode_Export()
has an O(1) complexity: no memory
copy is needed. See the specification for
cases when a copy is needed.
Rationale
PEP 393
PEP 393 “Flexible String Representation” changed string internals in Python 3.3 to use three formats:
PyUnicode_1BYTE_KIND
: Unicode range [U+0000; U+00ff], UCS-1, 1 byte/character.PyUnicode_2BYTE_KIND
: Unicode range [U+0000; U+ffff], UCS-2, 2 bytes/character.PyUnicode_4BYTE_KIND
: Unicode range [U+0000; U+10ffff], UCS-4, 4 bytes/character.
A Python str
object must always use the most compact format. For
example, a string which only contains ASCII characters must use the
UCS-1 format.
The PyUnicode_KIND()
function can be used to know the format used by
a string.
One of the following functions can be used to access data:
PyUnicode_1BYTE_DATA()
forPyUnicode_1BYTE_KIND
.PyUnicode_2BYTE_DATA()
forPyUnicode_2BYTE_KIND
.PyUnicode_4BYTE_DATA()
forPyUnicode_4BYTE_KIND
.
To get the best performance, a C extension should have 3 code paths for each of these 3 string native formats.
Limited C API
PEP 393 functions such as PyUnicode_KIND()
and
PyUnicode_1BYTE_DATA()
are excluded from the limited C API. It’s not
possible to write code specialized for UCS formats. A C extension using
the limited C API can only use less efficient code paths and string
formats.
For example, the MarkupSafe project has a C extension specialized for UCS formats for best performance, and so cannot use the limited C API.
Specification
API
Add the following API to the limited C API version 3.14:
int32_t PyUnicode_Export(
PyObject *unicode,
int32_t requested_formats,
Py_buffer *view);
PyObject* PyUnicode_Import(
const void *data,
Py_ssize_t nbytes,
int32_t format);
#define PyUnicode_FORMAT_UCS1 0x01 // Py_UCS1*
#define PyUnicode_FORMAT_UCS2 0x02 // Py_UCS2*
#define PyUnicode_FORMAT_UCS4 0x04 // Py_UCS4*
#define PyUnicode_FORMAT_UTF8 0x08 // char*
#define PyUnicode_FORMAT_ASCII 0x10 // char* (ASCII string)
The int32_t
type is used instead of int
to have a well defined
type size and not depend on the platform or the compiler.
See Avoid C-specific Types for the
longer rationale.
PyUnicode_Export()
API:
int32_t PyUnicode_Export(
PyObject *unicode,
int32_t requested_formats,
Py_buffer *view)
Export the contents of the unicode string in one of the requested_formats.
- On success, fill view, and return a format (greater than
0
). - On error, set an exception, and return
-1
. view is left unchanged.
After a successful call to PyUnicode_Export()
,
the view buffer must be released by PyBuffer_Release()
.
The contents of the buffer are valid until they are released.
The buffer is read-only and must not be modified.
The view->len
member must be used to get the string length. The
buffer should end with a trailing NUL character, but it’s not
recommended to rely on that because of embedded NUL characters.
unicode and view must not be NULL.
Available formats:
Constant Identifier | Value | Description |
---|---|---|
PyUnicode_FORMAT_UCS1 |
0x01 |
UCS-1 string (Py_UCS1* ) |
PyUnicode_FORMAT_UCS2 |
0x02 |
UCS-2 string (Py_UCS2* ) |
PyUnicode_FORMAT_UCS4 |
0x04 |
UCS-4 string (Py_UCS4* ) |
PyUnicode_FORMAT_UTF8 |
0x08 |
UTF-8 string (char* ) |
PyUnicode_FORMAT_ASCII |
0x10 |
ASCII string (Py_UCS1* ) |
UCS-2 and UCS-4 use the native byte order.
requested_formats can be a single format or a bitwise combination of the formats in the table above. On success, the returned format will be set to a single one of the requested flags.
Note that future versions of Python may introduce additional formats.
Export complexity
In general, an export has a complexity of O(1): no memory copy is needed. There are cases when a copy is needed, O(n) complexity:
- If only UCS-2 is requested and the native format is UCS-1.
- If only UCS-4 is requested and the native format is UCS-1 or UCS-2.
- If only UTF-8 is requested: the string is encoded to UTF-8 at the first call, and then the encoded UTF-8 string is cached.
To get the best performance on CPython and PyPy, it’s recommended to support these 4 formats:
(PyUnicode_FORMAT_UCS1 \
| PyUnicode_FORMAT_UCS2 \
| PyUnicode_FORMAT_UCS4 \
| PyUnicode_FORMAT_UTF8)
PyPy uses UTF-8 natively and so the PyUnicode_FORMAT_UTF8
format is
recommended. It requires a memory copy, since PyPy str
objects can
be moved in memory (PyPy uses a moving garbage collector).
Py_buffer format and item size
Py_buffer
uses the following format and item size depending on the
export format:
Export format | Buffer format | Item size |
---|---|---|
PyUnicode_FORMAT_UCS1 |
"B" |
1 byte |
PyUnicode_FORMAT_UCS2 |
"=H" |
2 bytes |
PyUnicode_FORMAT_UCS4 |
"=I" |
4 bytes |
PyUnicode_FORMAT_UTF8 |
"B" |
1 byte |
PyUnicode_FORMAT_ASCII |
"B" |
1 byte |
PyUnicode_Import()
API:
PyObject* PyUnicode_Import(
const void *data,
Py_ssize_t nbytes,
int32_t format)
Create a Unicode string object from a buffer in a supported format.
- Return a reference to a new string object on success.
- Set an exception and return
NULL
on error.
data must not be NULL. nbytes must be positive or zero.
See PyUnicode_Export()
for the available formats.
UTF-8 format
CPython 3.14 doesn’t use the UTF-8 format internally. The format is provided for compatibility with PyPy which uses UTF-8 natively for strings. However, in CPython, the encoded UTF-8 string is cached which makes it convenient to be exported.
On CPython, the UTF-8 format has the lowest priority: ASCII and UCS formats are preferred.
ASCII format
When the PyUnicode_FORMAT_ASCII
format is request for export, the
PyUnicode_FORMAT_UCS1
export format is used for ASCII and Latin-1
strings.
The PyUnicode_FORMAT_ASCII
format is mostly useful for
PyUnicode_Import()
to validate that the string only contains ASCII
characters.
Surrogate characters and NUL characters
Surrogate characters are allowed: they can be imported and exported. For
example, the UTF-8 format uses the surrogatepass
error handler.
Embedded NUL characters are allowed: they can be imported and exported.
Implementation
Backwards Compatibility
There is no impact on the backward compatibility, only new C API functions are added.
Usage of PEP 393 C APIs
A code search on PyPI top 7,500 projects (in March 2024) shows that there are many projects importing and exporting UCS formats with the regular C API.
PyUnicode_FromKindAndData()
25 projects call PyUnicode_FromKindAndData()
:
- Cython (3.0.9)
- Levenshtein (0.25.0)
- PyICU (2.12)
- PyICU-binary (2.7.4)
- PyQt5 (5.15.10)
- PyQt6 (6.6.1)
- aiocsv (1.3.1)
- asyncpg (0.29.0)
- biopython (1.83)
- catboost (1.2.3)
- cffi (1.16.0)
- mojimoji (0.0.13)
- mwparserfromhell (0.6.6)
- numba (0.59.0)
- numpy (1.26.4)
- orjson (3.9.15)
- pemja (0.4.1)
- pyahocorasick (2.0.0)
- pyjson5 (1.6.6)
- rapidfuzz (3.6.2)
- regex (2023.12.25)
- srsly (2.4.8)
- tokenizers (0.15.2)
- ujson (5.9.0)
- unicodedata2 (15.1.0)
PyUnicode_4BYTE_DATA()
21 projects call PyUnicode_2BYTE_DATA()
and/or
PyUnicode_4BYTE_DATA()
:
- Cython (3.0.9)
- MarkupSafe (2.1.5)
- Nuitka (2.1.2)
- PyICU (2.12)
- PyICU-binary (2.7.4)
- PyQt5_sip (12.13.0)
- PyQt6_sip (13.6.0)
- biopython (1.83)
- catboost (1.2.3)
- cement (3.0.10)
- cffi (1.16.0)
- duckdb (0.10.0)
- mypy (1.9.0)
- numpy (1.26.4)
- orjson (3.9.15)
- pemja (0.4.1)
- pyahocorasick (2.0.0)
- pyjson5 (1.6.6)
- pyobjc-core (10.2)
- sip (6.8.3)
- wxPython (4.2.1)
Rejected Ideas
Reject embedded NUL characters and require trailing NUL character
In C, it’s convenient to have a trailing NUL character. For example,
the for (; *str != 0; str++)
loop can be used to iterate on
characters and strlen()
can be used to get a string length.
The problem is that a Python str
object can embed NUL characters.
Example: "ab\0c"
. If a string contains an embedded NUL character,
code relying on the NUL character to find the string end truncates the
string. It can lead to bugs, or even security vulnerabilities.
See a previous discussion in the issue Change PyUnicode_AsUTF8()
to return NULL on embedded null characters.
Rejecting embedded NUL characters require to scan the string which has an O(n) complexity.
Reject surrogate characters
Surrogate characters are characters in the Unicode range [U+D800;
U+DFFF]. They are disallowed by UTF codecs such as UTF-8. A Python
str
object can contain arbitrary lone surrogate characters. Example:
"\uDC80"
.
Rejecting surrogate characters prevents exporting a string which contains
such a character. It can be surprising and annoying since the
PyUnicode_Export()
caller doesn’t control the string contents.
Allowing surrogate characters allows to export any string and so avoid
this issue. For example, the UTF-8 codec can be used with the
surrogatepass
error handler to encode and decode surrogate
characters.
Discussions
Copyright
This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.
Source: https://github.com/python/peps/blob/main/peps/pep-0756.rst
Last modified: 2024-09-17 13:34:14 GMT