Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-119182: Add PyUnicodeWriter_DecodeUTF8Stateful() #120639

Merged
merged 9 commits into from
Jun 21, 2024

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Jun 17, 2024

Add PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful() functions.


📚 Documentation preview 📚: https://cpython-previews--120639.org.readthedocs.build/

@vstinner
Copy link
Member Author

vstinner commented Jun 17, 2024

PR to discuss extensions to the PyUnicodeWriter API:

PyAPI_FUNC(int) PyUnicodeWriter_WriteWideChar(
    PyUnicodeWriter *writer,
    wchar_t *str,
    Py_ssize_t size);

PyAPI_FUNC(int) PyUnicodeWriter_DecodeUTF8Stateful(
    PyUnicodeWriter *writer,
    const char *string,         /* UTF-8 encoded string */
    Py_ssize_t length,          /* size of string */
    const char *errors,         /* error handling */
    Py_ssize_t *consumed);      /* bytes consumed */

Add PyUnicodeWriter_WriteWideChar() and
PyUnicodeWriter_DecodeUTF8Stateful() functions.
@vstinner vstinner changed the title [WIP] gh-119182: Add PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful() gh-119182: Add PyUnicodeWriter_DecodeUTF8Stateful() Jun 17, 2024
@vstinner vstinner marked this pull request as ready for review June 17, 2024 15:56
@vstinner
Copy link
Member Author

if (size < 0) {
size = wcslen(str);
}
PyObject *obj = PyUnicode_FromWideChar(str, size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this API will be used a lot to build Python Unicode objects from wchar_t input, I think it's better to try to optimize it and avoid creating a temporary object.

The PyUnicode_FromWideChar() could be refactored using a private helper shared by both PyUnicode_FromWideChar () and this PyUnicodeWriter_WriteWideChar() to make this possible: https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L1794

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I optimized PyUnicodeWriter_WriteWideChar(). I ran a benchmark on _testcapi.test_unicodewriter_widechar():

$ env/bin/python -m pyperf timeit -s 'from _testcapi import test_unicodewriter_widechar' 'test_unicodewriter_widechar()' -o ref.json -v

(...)

$ python3 -m pyperf compare_to ref.json optim.json 
Mean +- std dev: [ref] 203 ns +- 9 ns -> [optim] 150 ns +- 3 ns: 1.35x faster

It's a 1.4x faster, so it's worth it. It saves around 53 ns for 3 calls to PyUnicodeWriter_WriteWideChar().

Avoid a temporary Unicode object, write directly into the writer.
@vstinner
Copy link
Member Author

@malemburg: Is PyUnicodeWriter_DecodeUTF8Stateful() the API that you wanted?

@malemburg
Copy link
Member

@malemburg: Is PyUnicodeWriter_DecodeUTF8Stateful() the API that you wanted?

Yes, thanks for adding that.

@serhiy-storchaka serhiy-storchaka self-requested a review June 19, 2024 14:55
PyObject *
PyUnicode_FromWideChar(const wchar_t *u, Py_ssize_t size)
static inline int
unicode_fromwidechar(const wchar_t *u, Py_ssize_t size,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that more than a half of this function is now specific to the caller. This is a mess. I wonder, would not it be simpler if write it as two different functions specialized for their case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored PyUnicode_FromWideChar() and PyUnicodeWriter_WriteWideChar(): I added unicode_write_widechar() and removed unicode_convert_wchar_to_ucs4(). Does it look better?

Remove unicode_convert_wchar_to_ucs4().

Refactor PyUnicode_FromWideChar() and
PyUnicodeWriter_WriteWideChar().
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also unicode_fromformat_write_wcstr. Do you leave it to the next PR?

Comment on lines 1779 to 1780
// This code assumes that unicode can hold one more code point than
// wstr characters for a terminating null character.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is no longer true, after adding the (iter+1) < end check.

Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Modules/_testcapi/unicode.c Show resolved Hide resolved

// consumed is 0 if write fails
consumed = 12345;
assert(PyUnicodeWriter_DecodeUTF8Stateful(writer, "invalid\xFF", -1, NULL, &consumed) < 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This do nothing in non-debug build.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assertions are always built in _testcapi.c: the NDEBUG macro is undefined early in parts.h.

Modules/_testcapi/unicode.c Show resolved Hide resolved
Modules/_testcapi/unicode.c Show resolved Hide resolved
if (PyUnicodeWriter_WriteWideChar(writer, L"-", 1) < 0) {
goto error;
}
if (PyUnicodeWriter_WriteWideChar(writer, L"euro=\u20AC", -1) < 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also test surrogate pairs and non-BMP characters.

Since the code depends on the kind of the buffer string, you need to test different combinations: write different strings after writing a UCS2 or UCS4 string.

I suggest to implement in C a function which creates a PyUnicodeWriter, write the first argument as a Python string, then covert the second argument to the wchar_t* string and write it with size specified as optional third argument, and return the result. This helper function can be called in Python code with different arguments. The result will be checked even in non-debug build. You can test much more cases.

@vstinner
Copy link
Member Author

@serhiy-storchaka: I tried to address most of your reviews. Would you mind to review the updated PR?

For tests, it's really complicated to write tests in C. I think that I will try to expose the C API PyUnicodeWriter in Python to write tests in Python in a following PR. I wanted to do that at the beginning, but it was quicker to start with C. Now the C test suite of PyUnicodeWriter is already quite big!

@serhiy-storchaka:

There is also unicode_fromformat_write_wcstr. Do you leave it to the next PR?

Right, I prefer to leave it as it is for now and write a following PR.

@vstinner vstinner merged commit 4123226 into python:main Jun 21, 2024
36 checks passed
@vstinner vstinner deleted the WIP_unicode_writer_more branch June 21, 2024 17:33
@vstinner
Copy link
Member Author

Ok, I merged this PR as a starting point. I will rework tests in a follow-up PR.

Thanks @serhiy-storchaka and @malemburg for your reviews.

@vstinner
Copy link
Member Author

I will rework tests in a follow-up PR.

Rewrite tests in Python: #120845

mrahtz pushed a commit to mrahtz/cpython that referenced this pull request Jun 30, 2024
)

Add PyUnicodeWriter_WriteWideChar() and
PyUnicodeWriter_DecodeUTF8Stateful() functions.

Co-authored-by: Serhiy Storchaka <[email protected]>
noahbkim pushed a commit to hudson-trading/cpython that referenced this pull request Jul 11, 2024
)

Add PyUnicodeWriter_WriteWideChar() and
PyUnicodeWriter_DecodeUTF8Stateful() functions.

Co-authored-by: Serhiy Storchaka <[email protected]>
estyxx pushed a commit to estyxx/cpython that referenced this pull request Jul 17, 2024
)

Add PyUnicodeWriter_WriteWideChar() and
PyUnicodeWriter_DecodeUTF8Stateful() functions.

Co-authored-by: Serhiy Storchaka <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants