PEP: 624 Title: Remove Py_UNICODE encoder APIs Author: Inada Naoki
<songofacandy@gmail.com> Status: Final Type: Standards Track
Content-Type: text/x-rst Created: 06-Jul-2020 Python-Version: 3.11
Post-History: 08-Jul-2020

Abstract

This PEP proposes to remove deprecated Py_UNICODE encoder APIs in Python
3.11:

-   PyUnicode_Encode()
-   PyUnicode_EncodeASCII()
-   PyUnicode_EncodeLatin1()
-   PyUnicode_EncodeUTF7()
-   PyUnicode_EncodeUTF8()
-   PyUnicode_EncodeUTF16()
-   PyUnicode_EncodeUTF32()
-   PyUnicode_EncodeUnicodeEscape()
-   PyUnicode_EncodeRawUnicodeEscape()
-   PyUnicode_EncodeCharmap()
-   PyUnicode_TranslateCharmap()
-   PyUnicode_EncodeDecimal()
-   PyUnicode_TransformDecimalToASCII()

Note

PEP 623 propose to remove Unicode object APIs relating to Py_UNICODE. On
the other hand, this PEP is not relating to Unicode object. These PEPs
are split because they have different motivations and need different
discussions.

Motivation

In general, reducing the number of APIs that have been deprecated for a
long time and have few users is a good idea for not only it improves the
maintainability of CPython, but it also helps API users and other Python
implementations.

Rationale

Deprecated since Python 3.3

Py_UNICODE and APIs using it has been deprecated since Python 3.3.

Inefficient

All of these APIs are implemented using PyUnicode_FromWideChar. So these
APIs are inefficient when user want to encode Unicode object.

Not used widely

When searching from the top 4000 PyPI packages[1], only pyodbc use these
APIs.

-   PyUnicode_EncodeUTF8()
-   PyUnicode_EncodeUTF16()

pyodbc uses these APIs to encode Unicode object into bytes object. So it
is easy to fix it.[2]

Alternative APIs

There are alternative APIs to accept PyObject *unicode instead of
Py_UNICODE *. Users can migrate to them.

+-------------------------------------+--------------------------------------+
| Deprecated API                      | Alternative APIs                     |
+=====================================+======================================+
| PyUnicode_Encode()                  | PyUnicode_AsEncodedString()          |
+-------------------------------------+--------------------------------------+
| PyUnicode_EncodeASCII()             | PyUnicode_AsASCIIString() (1)        |
+-------------------------------------+--------------------------------------+
| PyUnicode_EncodeLatin1()            | PyUnicode_AsLatin1String() (1)       |
+-------------------------------------+--------------------------------------+
| PyUnicode_EncodeUTF7()              | (2)                                  |
+-------------------------------------+--------------------------------------+
| PyUnicode_EncodeUTF8()              | PyUnicode_AsUTF8String() (1)         |
+-------------------------------------+--------------------------------------+
| PyUnicode_EncodeUTF16()             | PyUnicode_AsUTF16String() (3)        |
+-------------------------------------+--------------------------------------+
| PyUnicode_EncodeUTF32()             | PyUnicode_AsUTF32String() (3)        |
+-------------------------------------+--------------------------------------+
| PyUnicode_EncodeUnicodeEscape()     | PyUnicode_AsUnicodeEscapeString()    |
+-------------------------------------+--------------------------------------+
| PyUnicode_EncodeRawUnicodeEscape()  | PyUnicode_AsRawUnicodeEscapeString() |
+-------------------------------------+--------------------------------------+
| PyUnicode_EncodeCharmap()           | PyUnicode_AsCharmapString() (1)      |
+-------------------------------------+--------------------------------------+
| PyUnicode_TranslateCharmap()        | PyUnicode_Translate()                |
+-------------------------------------+--------------------------------------+
| PyUnicode_EncodeDecimal()           |   (4)                                |
+-------------------------------------+--------------------------------------+
| PyUnicode_TransformDecimalToASCII() |   (4)                                |
+-------------------------------------+--------------------------------------+

Notes:

(1)

    const char *errors parameter is missing.

(2)

    There is no public alternative API. But user can use generic
    PyUnicode_AsEncodedString() instead.

(3)

    const char *errors, int byteorder parameters are missing.

(4)

    There is no direct replacement. But Py_UNICODE_TODECIMAL can be used
    instead. CPython uses _PyUnicode_TransformDecimalAndSpaceToASCII for
    converting from Unicode to numbers instead.

Plan

Remove these APIs in Python 3.11. They have been deprecated already.

-   PyUnicode_Encode()
-   PyUnicode_EncodeASCII()
-   PyUnicode_EncodeLatin1()
-   PyUnicode_EncodeUTF7()
-   PyUnicode_EncodeUTF8()
-   PyUnicode_EncodeUTF16()
-   PyUnicode_EncodeUTF32()
-   PyUnicode_EncodeUnicodeEscape()
-   PyUnicode_EncodeRawUnicodeEscape()
-   PyUnicode_EncodeCharmap()
-   PyUnicode_TranslateCharmap()
-   PyUnicode_EncodeDecimal()
-   PyUnicode_TransformDecimalToASCII()

Alternative Ideas

Replace Py_UNICODE* with PyObject*

As described in the "Alternative APIs" section, some APIs don't have
public alternative APIs accepting PyObject *unicode input. And some
public alternative APIs have restrictions like missing errors and
byteorder parameters.

Instead of removing deprecated APIs, we can reuse their names for
alternative public APIs.

Since we have private alternative APIs already, it is just renaming from
private name to public and deprecated names.

+--------------------------+-------------------------------+
| Rename to                | Rename from                   |
+==========================+===============================+
| PyUnicode_EncodeASCII()  |   _PyUnicode_AsASCIIString()  |
+--------------------------+-------------------------------+
| PyUnicode_EncodeLatin1() |   _PyUnicode_AsLatin1String() |
+--------------------------+-------------------------------+
| PyUnicode_EncodeUTF7()   |   _PyUnicode_EncodeUTF7()     |
+--------------------------+-------------------------------+
| PyUnicode_EncodeUTF8()   |   _PyUnicode_AsUTF8String()   |
+--------------------------+-------------------------------+
| PyUnicode_EncodeUTF16()  |   _PyUnicode_EncodeUTF16()    |
+--------------------------+-------------------------------+
| PyUnicode_EncodeUTF32()  |   _PyUnicode_EncodeUTF32()    |
+--------------------------+-------------------------------+

Pros:

-   We have a more consistent API set.

Cons:

-   Backward incompatible.
-   We have more public APIs to maintain for rare use cases.
-   Existing public APIs are enough for most use cases, and
    PyUnicode_AsEncodedString() can be used in other cases.

Replace Py_UNICODE* with Py_UCS4*

We can replace Py_UNICODE with Py_UCS4 and undeprecate these APIs.

UTF-8, UTF-16, UTF-32 encoders support Py_UCS4 internally. So
PyUnicode_EncodeUTF8(), PyUnicode_EncodeUTF16(), and
PyUnicode_EncodeUTF32() can avoid to create a temporary Unicode object.

Pros:

-   We can avoid creating temporary Unicode object when encoding from
    Py_UCS4* into bytes object with UTF-8, UTF-16, UTF-32 codecs.

Cons:

-   Backward incompatible.
-   We have more public APIs to maintain for rare use cases.
-   Other Python implementations that want to support Python/C API need
    to support these APIs too.
-   If we change the Unicode internal representation to UTF-8 in the
    future, we need to keep UCS-4 support only for these APIs.

Replace Py_UNICODE* with wchar_t*

We can replace Py_UNICODE with wchar_t. Since Py_UNICODE is typedef of
wchar_t already, this is status quo.

On platforms where sizeof(wchar_t) == 4, we can avoid to create a
temporary Unicode object when encoding from wchar_t* to bytes objects
using UTF-8, UTF-16, and UTF-32 codec, like the "Replace Py_UNICODE*
with Py_UCS4*" idea.

Pros:

-   Backward compatible.
-   We can avoid creating temporary Unicode object when encode from
    Py_UCS4* into bytes object with UTF-8, UTF-16, UTF-32 codecs on
    platform where sizeof(wchar_t) == 4.

Cons:

-   Although Windows is the most major platform that uses wchar_t
    heavily, these APIs need to create a temporary Unicode object always
    because sizeof(wchar_t) == 2 on Windows.
-   We have more public APIs to maintain for rare use cases.
-   Other Python implementations that want to support Python/C API need
    to support these APIs too.
-   If we change the Unicode internal representation to UTF-8 in the
    future, we need to keep UCS-4 support only for these APIs.

Rejected Ideas

Emit runtime warning

In addition to existing compiler warning, emitting runtime
DeprecationWarning is suggested.

But these APIs doesn't release GIL for now. Emitting a warning from such
APIs is not safe. See this example.

    PyObject *u = PyList_GET_ITEM(list, i);  // u is borrowed reference.
    PyObject *b = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u),
            PyUnicode_GET_SIZE(u), NULL);
    // Assumes u is still living reference.
    PyObject *t = PyTuple_Pack(2, u, b);
    Py_DECREF(b);
    return t;

If we emit Python warning from PyUnicode_EncodeUTF8(), warning filters
and other threads may change the list and u can be a dangling reference
after PyUnicode_EncodeUTF8() returned.

Discussions

-   [python-dev] Plan to remove Py_UNICODE APis except PEP 623
-   bpo-41123: Remove Py_UNICODE APIs except PEP 623
-   [python-dev] PEP 624: Remove Py_UNICODE encoder APIs

Objections

-   Removing these APIs removes ability to use codec without temporary
    Unicode.
    -   Codecs can not encode Unicode buffer directly without temporary
        Unicode object since Python 3.3. All these APIs creates
        temporary Unicode object for now. So removing them doesn't
        reduce any abilities.
-   Why not remove decoder APIs too?
    -   They are part of stable ABI.
    -   PyUnicode_DecodeASCII() and PyUnicode_DecodeUTF8() are used very
        widely. Deprecating them is not worth enough.
    -   Decoder APIs can decode from byte buffer directly, without
        creating temporary bytes object. On the other hand, encoder APIs
        can not avoid temporary Unicode object.

References

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.

[1] Source package list chosen from top 4000 PyPI packages.
(https://github.com/methane/notes/blob/master/2020/wchar-cache/package_list.txt)

[2] pyodbc -- Don't use PyUnicode_Encode API #792
(https://github.com/mkleehammer/pyodbc/pull/792)