Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns. #4471

Merged
merged 10 commits into from
Dec 4, 2017

Conversation

serhiy-storchaka
Copy link
Member

@serhiy-storchaka serhiy-storchaka commented Nov 19, 2017

Also fixed searching patterns that could match an empty string.

This will fix bpo-852532, bpo-1647489, bpo-3262, bpo-25054, and maybe others.

https://bugs.python.org/issue25054

…rns.

Fixed searching patterns that could match an empty string.
@serhiy-storchaka serhiy-storchaka added type-bug An unexpected behavior, bug, or error type-feature A feature request or enhancement labels Nov 19, 2017
@AraHaan
Copy link
Contributor

AraHaan commented Nov 20, 2017

Looks like the tests are failing for some reason.

@michaelkarlcoleman
Copy link

Ha ha ha ha!!! Bless you--you are my hero for today! :-)

@@ -768,6 +772,22 @@ Changes in the Python API
avoid a warning escape them with a backslash.
(Contributed by Serhiy Storchaka in :issue:`30349`.)

* The result of splitting a string on a :mod:`regular expression <re>`
that could match an empty string like has been changed. For example
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop like

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

>>> re.split(r'\b', 'Words, words, words.')
['', 'Words', ', ', 'words', ', ', 'words', '.']
>>> re.split(r'\W*', 'Words, words, words.')
['', 'W', 'o', 'r', 'd', 's', 'w', 'o', 'r', 'd', 's', 'w', 'o', 'r', 'd', 's', '']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example with a shorter result may be easier to understand:

>>> re.split(r'\W*', 'Wo, rd.')
['', 'W', 'o', 'r', 'd', '']

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to use an example similar to the above examples (but with * instead of +) for taking the example of the incorrect usage which "worked" in previous versions due to a bug. I'll simplify the example.

self.assertEqual(re.split(r"\b", "a::bc"), ['', 'a', '::', 'bc', ''])
self.assertEqual(re.split(r"\b|:+", "a::bc"), ['', 'a', '', 'bc', ''])
self.assertEqual(re.sub(r"\b", "-", "a::bc"), '-a-::-bc-')
self.assertEqual(re.sub(r"\b|:+", "-", "a::bc"), '-a--bc-')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps use a callback to verify that the second match is empty, not the third.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll use the replacement template that includes the matched string.

def test_zerowidth(self):
# Issues 852532, 1647489, 3262, 25054.
self.assertEqual(re.split(r"\b", "a::bc"), ['', 'a', '::', 'bc', ''])
self.assertEqual(re.split(r"\b|:+", "a::bc"), ['', 'a', '', 'bc', ''])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps break this down so I can infer what is going on here :)

re.split(r"\b|:", "a:")  # How many matches after "a"?
re.split(r"\b|:", ":b")  # Is there an empty match before "b"?
re.split(r":??", ":")  # Does it match the colon?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\b matches too much. I'll add separate tests for beginning and ending of words. They are less ambiguous.

But the main purpose of this test is testing that the new behavior differs from the old one. In older Python (2.7 and 3.4) re.split(r"\b|:+", "a::bc") returns ['a:', 'bc'] that doesn't look sane.

@@ -364,6 +364,10 @@ The flags :const:`re.ASCII`, :const:`re.LOCALE` and :const:`re.UNICODE`
can be set within the scope of a group.
(Contributed by Serhiy Storchaka in :issue:`31690`.)

:func:`re.split` now supports splitting on a pattern that matches an empty
string like ``r'\b'``, ``'^$'`` or ``(?=-)``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a pattern like . . . that matches an empty string.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

non-empty strings also can be changed. For example ``r'(?m)^\s*?$'``
will match in string ``'a\n\n'`` not only empty strings at positions
2 and 3, but also the string ``'\n'`` at positions 2--3. For matching
only blank lines the pattern should be rewritten as ``r'(?m)^[\S\n]*?$'``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rewritten expression seems wrong. Won’t the \S (uppercase S) match the non-whitespace a at the start of the string, which is not a blank line? The first expression (which you say is wrong) seemed the most obvious; failing that perhaps r"(?m)^[^\S\n]*$". I.e. complement the character set, and no need for the non-greedy *? repetition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Yes, the negation was missed here. The actual pattern in doctest.py contains it.

There is no difference between greedy and non-greedy repetitions here, but I'll change the pattern to use the greedy repetition as you have suggested.

The result of repeated searching patterns that could match empty and
non-empty strings also can be changed. For example ``r'(?m)^\s*?$'``
will match in string ``'a\n\n'`` not only empty strings at positions
2 and 3, but also the string ``'\n'`` at positions 2--3. For matching
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it would be better to clearly document the new behaviour, at least on the module page. But without knowing the exact new behaviour, I suggest to reword this paragraph something like this:

“For patterns that match both empty and non-empty strings, the result of searching for all matches may also be changed in other cases. For example in the string 'a\n\n', the pattern r'(?m)^\s*?$' will not only match empty strings at positions 2 and 3, but also the string '\n' at positions 2–3. To match only blank lines, the pattern should be rewritten as . . .”

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks!

@serhiy-storchaka serhiy-storchaka merged commit 70d56fb into python:master Dec 4, 2017
@serhiy-storchaka serhiy-storchaka deleted the re-search-zerowidth branch December 4, 2017 12:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-bug An unexpected behavior, bug, or error type-feature A feature request or enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants