PEP: 623 Title: Remove wstr from Unicode Author: Inada Naoki
<songofacandy@gmail.com> BDFL-Delegate: Victor Stinner
<vstinner@python.org> Discussions-To:
https://mail.python.org/archives/list/python-dev@python.org/thread/BO2TQHSXWL2RJMINWQQRBF5LANDDJNHH/
Status: Final Type: Standards Track Content-Type: text/x-rst Created:
25-Jun-2020 Python-Version: 3.10 Resolution:
https://mail.python.org/archives/list/python-dev@python.org/thread/VQKDIZLZ6HF2MLTNCUFURK2IFTXVQEYA/

Abstract

PEP 393 deprecated some unicode APIs, and introduced wchar_t *wstr, and
Py_ssize_t wstr_length in the Unicode structure to support these
deprecated APIs.

This PEP is planning removal of wstr, and wstr_length with deprecated
APIs using these members by Python 3.12.

Deprecated APIs which doesn't use the members are out of scope because
they can be removed independently.

Motivation

Memory usage

str is one of the most used types in Python. Even most simple ASCII
strings have a wstr member. It consumes 8 bytes per string on 64-bit
systems.

Runtime overhead

To support legacy Unicode object, many Unicode APIs must call
PyUnicode_READY().

We can remove this overhead too by dropping support of legacy Unicode
object.

Simplicity

Supporting legacy Unicode object makes the Unicode implementation more
complex. Until we drop legacy Unicode object, it is very hard to try
other Unicode implementation like UTF-8 based implementation in PyPy.

Rationale

Python 4.0 is not scheduled yet

PEP 393 introduced efficient internal representation of Unicode and
removed border between "narrow" and "wide" build of Python.

PEP 393 was implemented in Python 3.3 which is released in 2012. Old
APIs were deprecated since then, and the removal was scheduled in Python
4.0.

Python 4.0 was expected as next version of Python 3.9 when PEP 393 was
accepted. But the next version of Python 3.9 is Python 3.10, not 4.0.
This is why this PEP schedule the removal plan again.

Python 2 reached EOL

Since Python 2 didn't have PEP 393 Unicode implementation, legacy APIs
might help C extension modules supporting both of Python 2 and 3.

But Python 2 reached the EOL in 2020. We can remove legacy APIs kept for
compatibility with Python 2.

Plan

Python 3.9

These macros and functions are marked as deprecated, using Py_DEPRECATED
macro.

-   Py_UNICODE_WSTR_LENGTH()
-   PyUnicode_GET_SIZE()
-   PyUnicode_GetSize()
-   PyUnicode_GET_DATA_SIZE()
-   PyUnicode_AS_UNICODE()
-   PyUnicode_AS_DATA()
-   PyUnicode_AsUnicode()
-   _PyUnicode_AsUnicode()
-   PyUnicode_AsUnicodeAndSize()
-   PyUnicode_FromUnicode()

Python 3.10

-   Following macros, enum members are marked as deprecated.
    Py_DEPRECATED(3.10) macro are used as possible. But they are
    deprecated only in comment and document if the macro can not be used
    easily.
    -   PyUnicode_WCHAR_KIND
    -   PyUnicode_READY()
    -   PyUnicode_IS_READY()
    -   PyUnicode_IS_COMPACT()
-   PyUnicode_FromUnicode(NULL, size) and
    PyUnicode_FromStringAndSize(NULL, size) emit DeprecationWarning when
    size > 0.
-   PyArg_ParseTuple() and PyArg_ParseTupleAndKeywords() emit
    DeprecationWarning when u, u#, Z, and Z# formats are used.

Python 3.12

-   Following members are removed from the Unicode structures:
    -   wstr
    -   wstr_length
    -   state.compact
    -   state.ready
-   The PyUnicodeObject structure is removed.
-   Following macros and functions, and enum members are removed:
    -   Py_UNICODE_WSTR_LENGTH()
    -   PyUnicode_GET_SIZE()
    -   PyUnicode_GetSize()
    -   PyUnicode_GET_DATA_SIZE()
    -   PyUnicode_AS_UNICODE()
    -   PyUnicode_AS_DATA()
    -   PyUnicode_AsUnicode()
    -   _PyUnicode_AsUnicode()
    -   PyUnicode_AsUnicodeAndSize()
    -   PyUnicode_FromUnicode()
    -   PyUnicode_WCHAR_KIND
    -   PyUnicode_READY()
    -   PyUnicode_IS_READY()
    -   PyUnicode_IS_COMPACT()
-   PyUnicode_FromStringAndSize(NULL, size)) raises RuntimeError when
    size > 0.
-   PyArg_ParseTuple() and PyArg_ParseTupleAndKeywords() raise
    SystemError when u, u#, Z, and Z# formats are used, as other
    unsupported format character.

Discussion

-   Draft PEP: Remove wstr from Unicode
-   When can we remove wchar_t* cache from string?
-   PEP 623: Remove wstr from Unicode object #1462

References

-   bpo-38604: Schedule Py_UNICODE API removal
-   bpo-36346: Prepare for removing the legacy Unicode C API
-   bpo-30863: Rewrite PyUnicode_AsWideChar() and
    PyUnicode_AsWideCharString(): They no longer cache the wchar_t*
    representation of string objects.

Copyright

This document has been placed in the public domain.