Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB #82052

Closed
andy-ms mannequin opened this issue Aug 16, 2019 · 9 comments · Fixed by #101103
Closed

Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB #82052

andy-ms mannequin opened this issue Aug 16, 2019 · 9 comments · Fixed by #101103
Labels
3.8 (EOL) end of life 3.9 only security fixes 3.10 only security fixes OS-windows topic-IO type-bug An unexpected behavior, bug, or error

Comments

@andy-ms
Copy link
Mannequin

andy-ms mannequin commented Aug 16, 2019

BPO 37871
Nosy @gpshead, @pfmoore, @tjguk, @zware, @eryksun, @zooba, @andy-ms

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2019-08-16.00:36:04.457>
labels = ['type-bug', '3.8', '3.9', 'expert-IO', '3.10', 'OS-windows']
title = 'Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB'
updated_at = <Date 2021-03-20.23:35:16.331>
user = 'https://github.com/andy-ms'

bugs.python.org fields:

activity = <Date 2021-03-20.23:35:16.331>
actor = 'eryksun'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Windows', 'IO']
creation = <Date 2019-08-16.00:36:04.457>
creator = 'anhans'
dependencies = []
files = []
hgrepos = []
issue_num = 37871
keywords = []
message_count = 5.0
messages = ['349837', '349844', '349872', '389191', '389199']
nosy_count = 7.0
nosy_names = ['gregory.p.smith', 'paul.moore', 'tim.golden', 'zach.ware', 'eryksun', 'steve.dower', 'anhans']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue37871'
versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']

Linked PRs

@andy-ms
Copy link
Mannequin Author

andy-ms mannequin commented Aug 16, 2019

To reproduce:

Put this text in a file a.py and run py a.py.

Or just run: py -c "print(('é' * 40 + '\n') * 473)"

Scroll up for a while. One of the lines will be:

éééééééééééééééééééé��ééééééééééééééééééé

(You can spot this because it's slightly longer than the other lines.)

The error is consistently on line 237, column 21 (1-indexed).

# The error reproduces on Windows but not Linux. Tested in both powershell and CMD.
# (Failed to reproduce on either a real Linux machine or on Ubuntu with WSL.)
# On Windows, the error reproduces every time consistently.

# There is no error if N = 472 or 474.
N = 473
# There is no error if W = 39 or 41.
# (I tested with console windows of varying sizes, all well over 40 characters.)
W = 40
# There is no error if ch = "e" with no accent.
# There is still an error for other unicode characters like "Ö" or "ü".
ch = "é"
# There is no error without newlines.
s = (ch * W + "\n") * N
# Assert the string itself is correct.
assert all(c in (ch, "\n") for c in s)
print(s)

# There is no error if we use N separate print statements
# instead of printing a single string with N newlines.

# Similar scripts written in Groovy, JS and Ruby have no error.
# Groovy: System.out.println(("é" * 40 + "\n") * 473)
# JS: console.log(("é".repeat(40) + "\n").repeat(473))
# Ruby: puts(("é" * 40 + "\n") * 473)

@andy-ms andy-ms mannequin added 3.7 (EOL) end of life OS-windows type-bug An unexpected behavior, bug, or error labels Aug 16, 2019
@eryksun
Copy link
Contributor

eryksun commented Aug 16, 2019

To be compatible with Windows 7, _io__WindowsConsoleIO_write_impl in Modules/_io/winconsoleio.c is forced to write to the console in chunks that do not exceed 32 KiB. It does so by repeatedly dividing the length to decode by 2 until the decoded buffer size is small enough.

    wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
    while (wlen > 32766 / sizeof(wchar_t)) {
        len /= 2;
        wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
    }

With ('é' * 40 + '\n') * 473, encoded as UTF-8, we have 473 82-byte lines (note that "\n" has been translated to "\r\n"). This is 38,786 bytes, which is too much for a single write, so it splits it in two.

    >>> 38786 // 2
    19393
    >>> 19393 // 82
    236
    >>> 19393 % 82
    41

This means line 237 ends up with 20 'é' characters (UTF-8 b'\xc3\xa9') and one partial character sequjence, b'\xc3'. When this buffer is passed to MultiByteToWideChar to decode from UTF-8 to UTF-16, the partial sequence gets decoded as the replacement character U+FFFD. For the next write, the remaining b'\xa9' byte also gets decoded as U+FFFD.

To avoid this, _io__WindowsConsoleIO_write_impl could decode the whole buffer in one pass, and slice that up into writes that are less than 32 KiB. Or it could ensure that its UTF-8 slices are always at character boundaries.

@zooba
Copy link
Member

zooba commented Aug 16, 2019

I'd rather keep encoding incrementally, and reduce the length of each attempt until the last UTF-8 character does not have its top bit set (i.e. is the final character in a multi-byte sequence).

Otherwise the people who like to print >2GB worth of data to the console will complain about the memory error :)

@vstinner vstinner changed the title 40 * 473 grid of "é" has a single wrong character on Windows Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB Aug 21, 2019
@eryksun eryksun added 3.8 (EOL) end of life 3.9 only security fixes 3.10 only security fixes and removed 3.7 (EOL) end of life labels Mar 20, 2021
@gpshead
Copy link
Member

gpshead commented Mar 20, 2021

Steve's approach makes sense and should be robust.

side note: do we need to care about Windows 7 anymore in 3.10 given that microsoft no longer supports it?

@eryksun
Copy link
Contributor

eryksun commented Mar 20, 2021

side note: do we need to care about Windows 7 anymore in
3.10 given that microsoft no longer supports it?

If the fix comes in time for Python 3.8, then it needs to support Windows 7. For Python 3.9+, the 32 KiB limit can be removed.

The console documentation still includes the misleading disclaimer about "available heap". This refers to a relatively small block of shared memory (64 KiB IIRC) that's overlayed by a heap, not the default process heap. Shared memory is used by system LPC ports to efficiently pass large messages between a system server (e.g. csrss.exe, conhost.exe) and a client process. The console API used to use an LPC port, but in Windows 8.1+ it uses a driver instead, so none of the "available heap" warnings apply anymore. Microsoft should clarify the docs to stress that the warning is for Windows 7 and earlier.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@pfmoore
Copy link
Member

pfmoore commented Dec 23, 2022

The following change to _io__WindowsConsoleIO_write_impl appears to fix the issue, by implementing @zooba's suggestion.

    while (wlen > 32766 / sizeof(wchar_t)) {
        len /= 2;
        /* Reduce the length until we hit the final byte of a UTF-8 sequence
         * (top bit is unset). Fix for github issue 82052.
         */
        while (len > 0 && (((char *)b->buf)[len-1] & 0x80) != 0)
            --len;
        wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
    }

I can make this into a PR (likely sometime after the holidays), but I'm not entirely clear how I'd create a test for it. And my C skills are rather rusty, so if someone would check I haven't made a stupid error that would be great 🙂

One thing this doesn't handle is if there isn't a UTF-8 final byte in the part of the buffer we are planning on converting, so len goes to 0. I can't imagine how that would happen, but I haven't checked if that's a possibility. And if it is, I don't know what else we can do about it...

cc @willmcgugan who flagged this issue on Twitter...

@zooba
Copy link
Member

zooba commented Jan 9, 2023

One thing this doesn't handle is if there isn't a UTF-8 final byte in the part of the buffer we are planning on converting, so len goes to 0. I can't imagine how that would happen, but I haven't checked if that's a possibility. And if it is, I don't know what else we can do about it...

It's not possible in a valid string, so probably if len hits zero then we can either raise a Unicode error of some kind, or just restore the original length and let the console deal with it.

(It's not possible because each character in a multi-byte sequence has progressively more top-bits set. So even theoretically, you can only have 9 bytes in a row representing a single character, but IIRC only 3 are allowed. We're at over 16,000, so we're good.)

@ofek
Copy link
Sponsor Contributor

ofek commented Jan 10, 2023

Is there an open pull request for the fix?

@pfmoore
Copy link
Member

pfmoore commented Jan 10, 2023

Not yet. I was planning on doing one, but haven’t had the time yet.

pfmoore added a commit that referenced this issue Jan 17, 2023
…01103)

Don't send partial UTF-8 sequences to the Windows API
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jan 17, 2023
…pythonGH-101103)

Don't send partial UTF-8 sequences to the Windows API
(cherry picked from commit f34176b)

Co-authored-by: Paul Moore <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jan 17, 2023
…pythonGH-101103)

Don't send partial UTF-8 sequences to the Windows API
(cherry picked from commit f34176b)

Co-authored-by: Paul Moore <[email protected]>
miss-islington added a commit that referenced this issue Jan 17, 2023
…01103)

Don't send partial UTF-8 sequences to the Windows API
(cherry picked from commit f34176b)

Co-authored-by: Paul Moore <[email protected]>
miss-islington added a commit that referenced this issue Jan 17, 2023
…01103)

Don't send partial UTF-8 sequences to the Windows API
(cherry picked from commit f34176b)

Co-authored-by: Paul Moore <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.8 (EOL) end of life 3.9 only security fixes 3.10 only security fixes OS-windows topic-IO type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants