bpo-31714: Improved regular expression documentation. #3907

serhiy-storchaka · 2017-10-06T14:24:29Z

https://bugs.python.org/issue31714

ezio-melotti

I added a few comments.
There are also a couple of paragraphs with long lines that you could rewrap.

ezio-melotti · 2017-10-06T15:30:19Z

Doc/howto/regex.rst

-   lowercasing doesn't take the current locale into account; it will if you also
-   set the :const:`LOCALE` flag.
+   letters, too. Full Unicode matching also works unless the :const:`re.ASCII`
+   flag is also used to disable non-ASCII matches.  ``[A-Z]`` will also match


"disable non-ASCII matches" makes it sound like it won't match any non-ASCII characters, but that is false:

>>> re.match('负鼠', '负鼠', re.ASCII)
<_sre.SRE_Match object; span=(0, 2), match='负鼠'>

I think it would be more correct to say that regex sets will only match characters in the ASCII range.

This was a copy from existing re module documentation. :(

regex sets will only match characters in the ASCII range.

This doesn't sound good either. Regex sets can match characters outside the ASCII range with the re.ASCII flag.

>>> re.match('[耀-鿐]+', '负鼠', re.ASCII)
<re.Match object; span=(0, 2), match='负鼠'>

But case-insensitive matching works only in the ASCII range. 'é' doesn't match 'É' with the re.ASCII flag.
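
For reference, the points made in this thread can be checked with a small sketch. The sample strings are only illustrative, and the accented letters are written as escapes so that the precomposed code points are unambiguous.

```python
import re

# A literal non-ASCII pattern still matches under re.ASCII.
literal = re.match('负鼠', '负鼠', re.ASCII)

# A set containing a non-ASCII range also still matches under re.ASCII.
in_set = re.match('[耀-鿐]+', '负鼠', re.ASCII)

# Case-insensitive matching is what gets restricted to the ASCII range.
# '\xe9' is 'é' and '\xc9' is 'É'.
unicode_ci = re.match('\xe9', '\xc9', re.IGNORECASE)
ascii_ci = re.match('\xe9', '\xc9', re.IGNORECASE | re.ASCII)

print(literal is not None, in_set is not None)       # True True
print(unicode_ci is not None, ascii_ci is not None)  # True False
```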

ezio-melotti · 2017-10-06T15:36:29Z

Doc/howto/regex.rst

+   flag is also used to disable non-ASCII matches.  ``[A-Z]`` will also match
+   letters 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131,
+   Latin small letter dotless i), 'ſ' (U+017f, Latin small letter long s) and
+   'K' (U+212a, Kelvin sign) in Unicode mode.  ``Spam`` will match ``Spam``,


I think this is wrong: unless I'm mistaken [A-Z] should be limited to upper case ASCII letters, even with re.UNICODE.
Perhaps you meant to use \w?
It would also be better to specify what Unicode categories are matched, instead of just providing a few examples and letting the user figure it out from there.
I think this is already explained below, so a link or a mention to that section is fine.

Note that this is about case-insensitive matching. 'S' matches both 's' and 'ſ'.

I did some tests:

>>> unichars = ''.join(chr(cp) for cp in range(0x110000))
>>> ''.join(re.findall('[a-z]', unichars))
'abcdefghijklmnopqrstuvwxyz'
>>> ''.join(re.findall('[A-Z]', unichars))
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> ''.join(re.findall('[a-z]', unichars, re.I))
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzİıſK'
>>> ''.join(re.findall('[A-Z]', unichars, re.I))
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzİıſK'

Now I understand what the paragraph means: [a-z] and [A-Z] will only match ASCII lower and upper case letters respectively (so 26 chars each), however if re.I is used with either one, since those 4 letters (and only those 4) are valid capitalization of the 26 ASCII letters, they will be matched as well (bringing the total up to 26 + 26 + 4 == 56).
Do you think we should rephrase it to make it clearer?

It would be very good to make it clearer. Now, since you understand what the paragraph means, could you please suggest a clear wording?

This example is not artificial. See bpo-31672. I think this caveat should be documented specially.

When the patterns ``[a-z]`` or ``[A-Z]`` are used in combination with the re.I and re.U flags, they will match the 52 ASCII letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131, Latin small letter dotless i), 'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
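
A compact way to check this wording, without scanning the whole Unicode range, is to test only the four extra letters. This is a sketch; the `sample` string is made up for the illustration.

```python
import re

# The four non-ASCII letters named above, written as escapes:
# U+0130, U+0131, U+017F and U+212A (Kelvin sign).
extras = '\u0130\u0131\u017f\u212a'
sample = 'Az' + extras

plain = re.findall('[a-z]', sample)
ignorecase = re.findall('[a-z]', sample, re.IGNORECASE)
ascii_only = re.findall('[a-z]', sample, re.IGNORECASE | re.ASCII)

print(plain)            # only 'z'
print(len(ignorecase))  # 'A', 'z' and the four extra letters
print(ascii_only)       # 'A' and 'z'
```

With `re.ASCII` added, the four extra letters drop out again, which is the behaviour the re.rst hunk describes for the ASCII flag.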

ezio-melotti · 2017-10-06T15:43:50Z

Doc/howto/regex.rst

+   that take account of language differences.  For example, if you're
+   processing encoded French text, you'd want to be able to write ``\w+`` to
+   match words, but ``\w`` only matches the character class ``[A-Za-z]`` in
+   bytes patterns; it won't match bytes corresponding to ``'é'`` or ``'ç'``.


I would remove the '...' here and perhaps add b'\xe9' and b'\xe7' in parentheses.

It is not clear why b'\xe9' and b'\xe7' should be matched. But 'é' and 'ç' are French letters, and I have added "bytes corresponding to" to make this phrase Python 3 compatible.

The reason I suggested that is that 'é' and 'ç' are Unicode str objects in Python 3, whereas without quotes they are just letters. Adding b'\xe9' and b'\xe7' might help clarify what is being matched, but it's not essential.
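
To illustrate what is being discussed, here is a minimal sketch comparing a bytes pattern with a str pattern. It assumes the text is encoded as Latin-1, so that 'é' is the single byte b'\xe9' and 'ç' is b'\xe7'; the sample text is invented.

```python
import re

text = 'caf\xe9 fran\xe7ais'    # 'café français'
data = text.encode('latin-1')   # one byte per character

# In a bytes pattern, \w matches only ASCII word characters, so the
# bytes for 'é' and 'ç' split the words.
bytes_words = re.findall(rb'\w+', data)

# In a str pattern, \w uses Unicode matching by default.
str_words = re.findall(r'\w+', text)

print(bytes_words)  # [b'caf', b'fran', b'ais']
print(str_words)    # ['café', 'français']
```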

ezio-melotti · 2017-10-06T15:47:52Z

Doc/howto/regex.rst

+   bytes patterns; it won't match bytes corresponding to ``'é'`` or ``'ç'``.
+   If your system is configured properly and a French locale is selected,
+   certain C functions will tell the program that the byte corresponding
+   ``'é'`` should also be considered a letter.


This should read "corresponding to é".
I also don't much like "certain C functions will tell the program".

Do you suggest removing the quotes?

The phrase "certain C functions will tell the program" was already there; it looks correct to me, and I don't know how to improve it.

I suggest removing the quotes, since we are talking about the character, not about a Unicode string.
Fair enough about the wording being there already.

ezio-melotti · 2017-10-06T15:52:42Z

Doc/howto/regex.rst

+   is very unreliable, and it only handles one "culture" at a time anyway;
+   and it only works with 8-bit locales;
+   you should use Unicode matching instead, which is the default in Python 3
+   for Unicode (str) patterns.


... is very unreliable, it only handles one "culture" at a time, and it only works with 8-bit locales. You should use ...

Since Unicode matching is the default, I wouldn't say "you should use", but just something like "Unicode matching is already enabled by default in Python 3, and it is able to handle different locales/languages."
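
As a rough illustration of the suggested wording, the sketch below shows default Unicode matching handling several scripts in one str pattern. It also shows that, as far as I know for Python 3.6 and later, combining re.LOCALE with a str pattern raises ValueError. The sample words are arbitrary.

```python
import re

# Words from several scripts in one string; 'gar\xe7on' is 'garçon'.
mixed = 'gar\xe7on Straße мир 负鼠'

# Default (Unicode) matching for str patterns treats all of them as words.
words = re.findall(r'\w+', mixed)
print(len(words))  # 4

# re.LOCALE is only accepted with bytes patterns.
locale_error = None
try:
    re.compile(r'\w+', re.LOCALE)
except ValueError as exc:
    locale_error = str(exc)
print(locale_error)
```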

ezio-melotti · 2017-10-06T15:54:42Z

Doc/howto/regex.rst

-   Alternation, or the "or" operator.   If A and B are regular expressions,
-   ``A|B`` will match any string that matches either ``A`` or ``B``. ``|`` has very
+   Alternation, or the "or" operator.   If *A* and *B* are regular expressions,
+   ``A|B`` will match any string that matches either *A* or *B*. ``|`` has very


Why * instead of ``?

Because A and B are not literal regexes, but variables.

ezio-melotti · 2017-10-06T15:59:22Z

Doc/library/re.rst

-   matches.
-
+   match lowercase letters.  Full Unicode matching (such as ``Ü`` matching
+   ``ü``) also works unless the :const:`re.ASCII` flag is also used to disable


is ~~also~~ used to
(you already have one before)

ezio-melotti · 2017-10-06T16:00:35Z

Doc/library/re.rst

+   only letters 'A' to 'Z' and 'a' to 'z', but will also match letters 'İ'
+   (U+0130, Latin capital letter I with dot above), 'ı' (U+0131, Latin small
+   letter dotless i), 'ſ' (U+017f, Latin small letter long s) and 'K' (U+212a,
+   Kelvin sign).  If the :const:`ASCII` flag is used, only letters 'a' to 'z'


This is duplicated from above, so the same comments apply (maybe it shouldn't be duplicated?).

Could you suggest how to avoid duplication?

I didn't notice the duplicated part was on two separate files, so it's probably ok to leave it.

ezio-melotti · 2017-10-06T16:01:39Z

Doc/library/re.rst

   you should use Unicode matching instead, which is the default in Python 3
-   for Unicode (str) patterns. This flag can be used only with bytes patterns.
+   for Unicode (str) patterns.
+   Correcsponds the inline flag ``(?L)``.


Correcsponds to
(extra c and missing to)

Also applies below.
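
For readers unfamiliar with inline flags, here is a short sketch of the correspondence this sentence documents. `(?i)` is used for the str example because `(?L)` is only valid in bytes patterns; the patterns themselves are arbitrary.

```python
import re

# An inline flag at the start of the pattern behaves like the
# corresponding flag argument.
inline = re.compile('(?i)spam')
explicit = re.compile('spam', re.IGNORECASE)

inline_hit = inline.fullmatch('SPAM')
explicit_hit = explicit.fullmatch('SPAM')
has_ignorecase = bool(inline.flags & re.IGNORECASE)

# (?L) corresponds to re.LOCALE and requires a bytes pattern.
loc = re.compile(rb'(?L)\w+')
has_locale = bool(loc.flags & re.LOCALE)

print(has_ignorecase, has_locale)  # True True
```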

ezio-melotti · 2017-10-06T16:06:55Z

Doc/library/re.rst

+
+      >>> int_re = r'\d+'
+      >>> print(re.sub('INT', int_re.replace('\\', r'\\'), r'INT(\.INT)?(e[+-]?INT)?'))
+      \d+(\.\d+)?(e[+-]?\d+)?


I don't find this example particularly clear. Why would someone want to use re.escape() on the replacement string? Wouldn't using int_re = r'\\d+' (and a normal str.replace on INT) be easier?

I tried to provide a simplified real-world example. In Mailman, re.sub() is used to create a regular expression. They passed a pattern containing \d as the replacement string and got an error when this became invalid. Someone could use re.escape() on the replacement string because the replacement string looks similar to a simple pattern (it expands \n and \1). And this will work as long as the replacement string doesn't contain any metacharacters other than a backslash.

I'll replace this example with a better one.
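
For context, here is a sketch of the situation being described, together with two alternatives that avoid escape processing in the replacement string. My understanding is that on Python 3.7 and later, passing r'\d+' directly as the replacement raises re.error, which is the failure mentioned above. The variable names other than `int_re` are made up for the illustration.

```python
import re

int_re = r'\d+'
template = r'INT(\.INT)?(e[+-]?INT)?'

# Option 1 (the approach in the diff): double each backslash so that
# re.sub() turns it back into a single literal backslash.
escaped = re.sub('INT', int_re.replace('\\', r'\\'), template)

# Option 2: pass a callable; its return value is inserted as-is,
# without any escape processing.
via_func = re.sub('INT', lambda m: int_re, template)

# Option 3: for a fixed placeholder, plain str.replace() is enough.
via_str = template.replace('INT', int_re)

print(escaped)  # \d+(\.\d+)?(e[+-]?\d+)?
```

All three produce the same pattern text; the callable form is probably the least error-prone when the replacement may contain arbitrary backslashes.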

bedevere-bot · 2017-10-06T16:07:59Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I didn't expect the Spanish Inquisition!. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

cryvate · 2017-10-06T16:11:11Z

Doc/howto/regex.rst

@@ -1284,12 +1292,12 @@ doesn't work because of the greedy nature of ``.*``. ::
   >>> print(re.match('<.*>', s).group())
   <html><head><title>Title</title>

-The RE matches the ``'<'`` in ``<html>``, and the ``.*`` consumes the rest of
+The RE matches the ``<`` in ``'<html>'``, and the ``.*`` consumes the rest of


I think you do want '<' as in the original here. See also line 1299 in this commit.
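
For reference, the greedy behaviour that this paragraph of the HOWTO describes can be reproduced as follows, with the non-greedy form added for comparison. The string `s` is reconstructed from the output shown in the hunk.

```python
import re

s = '<html><head><title>Title</title>'

# Greedy: '.*' consumes the rest of the string, then backtracks only as
# far as the last '>'.
greedy = re.match('<.*>', s).group()

# Non-greedy: '.*?' stops at the first '>' that lets the match succeed.
lazy = re.match('<.*?>', s).group()

print(greedy)  # <html><head><title>Title</title>
print(lazy)    # <html>
```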

cryvate · 2017-10-06T16:11:16Z

Doc/howto/regex.rst

-   Alternation, or the "or" operator.   If A and B are regular expressions,
-   ``A|B`` will match any string that matches either ``A`` or ``B``. ``|`` has very
+   Alternation, or the "or" operator.   If *A* and *B* are regular expressions,
+   ``A|B`` will match any string that matches either *A* or *B*. ``|`` has very
   low precedence in order to make it work reasonably when you're alternating
   multi-character strings. ``Crow|Servo`` will match either ``Crow`` or ``Servo``,


Shouldn't everything after "will match either" be written as '[word]' instead of [word]?
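
As an aside, the low precedence of ``|`` described in this hunk can be seen in a small sketch. The second pattern spells out what a high-precedence reading would have meant; the names are hypothetical.

```python
import re

alternation = re.compile('Crow|Servo')

# Either whole alternative matches.
crow = alternation.fullmatch('Crow')
servo = alternation.fullmatch('Servo')

# It is not 'Cro', then 'w' or 'S', then 'ervo'.
not_mixed = alternation.fullmatch('CroServo')

# That other reading would have to be written with a character class.
char_choice = re.compile('Cro[wS]ervo')
mixed = char_choice.fullmatch('CroServo')

print(crow is not None, servo is not None)  # True True
print(not_mixed is None, mixed is not None)  # True True
```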

cryvate · 2017-10-06T16:11:18Z

Doc/howto/regex.rst

+   flag is also used to disable non-ASCII matches.  ``[A-Z]`` will also match
+   letters 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131,
+   Latin small letter dotless i), 'ſ' (U+017f, Latin small letter long s) and
+   'K' (U+212a, Kelvin sign) in Unicode mode.  ``Spam`` will match ``Spam``,


Shouldn't all those after "will match" be '[word]' instead of [word]?

Sorry, I'm not sure that I understood your comment.

Spam, spam, spAM, or ſpam -> 'Spam', 'spam', 'spAM', or 'ſpam'? I might have gotten it wrong though.

I think he's referring to the 4 non-ASCII letters: they should be enclosed in `'..'`

cryvate · 2017-10-06T16:11:23Z

Doc/howto/regex.rst

@@ -229,24 +226,23 @@ Another repeating metacharacter is ``+``, which matches one or more times.  Pay
 careful attention to the difference between ``*`` and ``+``; ``*`` matches
 *zero* or more times, so whatever's being repeated may not be present at all,
 while ``+`` requires at least *one* occurrence.  To use a similar example,
-``ca+t`` will match ``cat`` (1 ``a``), ``caaat`` (3 ``a``'s), but won't match
-``ct``.
+``ca+t`` will match ``'cat'`` (1 ``'a'``), ``'caaat'`` (3 ``a``'s), but won't


(3 'a's)?
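
For completeness, the difference between ``*`` and ``+`` that this paragraph explains can be checked directly. This is a minimal sketch using the candidate strings from the text.

```python
import re

results = {}
for candidate in ('ct', 'cat', 'caaat'):
    # '+' needs at least one 'a'; '*' also accepts zero.
    plus = re.fullmatch('ca+t', candidate) is not None
    star = re.fullmatch('ca*t', candidate) is not None
    results[candidate] = (plus, star)

print(results['ct'])     # (False, True)
print(results['cat'])    # (True, True)
print(results['caaat'])  # (True, True)
```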

serhiy-storchaka

Thank you @ezio-melotti and @cryvate! I'll rewrap long lines after resolving all issues with wording.

warsaw

Looks pretty good to me. I have only one dumb comment; feel free to ignore it :)

warsaw · 2017-10-06T18:19:45Z

Doc/howto/regex.rst

@@ -526,7 +522,7 @@ of each one.
 +=================================+============================================+
 | :const:`ASCII`, :const:`A`      | Makes several escapes like ``\w``, ``\b``, |
 |                                 | ``\s`` and ``\d`` match only on ASCII      |
-|                                 | characters with the respective property.   |
+|                                 | characters with the respective property    |


I guess this is done for consistency? Any reason these shouldn't be complete sentences? I'd have rather added missing periods instead of removed them. Does this make the diff smaller? Is that worth it?

No other reasons besides consistency.

If it is single sentences in a table, I would normally not have a period, but that's just me.
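
To illustrate what this table row documents, here is a short sketch of the ASCII flag's effect on ``\d`` and ``\w`` in a str pattern. The sample text is invented; the non-ASCII digits are ARABIC-INDIC DIGIT FOUR and TWO, written as escapes.

```python
import re

text = '42 \u0664\u0662 caf\xe9'

# By default, \d and \w use Unicode properties.
digits_unicode = re.findall(r'\d+', text)
words_unicode = re.findall(r'\w+', text)

# With re.ASCII, the same escapes match only ASCII characters.
digits_ascii = re.findall(r'\d+', text, re.ASCII)
words_ascii = re.findall(r'\w+', text, re.ASCII)

print(len(digits_unicode), len(digits_ascii))  # 2 1
print(words_ascii)                             # ['42', 'caf']
```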

serhiy-storchaka · 2017-10-12T16:15:20Z

I didn't expect the Spanish Inquisition! (Neither Italian)

bedevere-bot · 2017-10-12T16:15:22Z

Nobody expects the Spanish Inquisition!

@ezio-melotti: please review the changes made to this pull request.

ezio-melotti · 2017-10-13T12:52:10Z

There is still (at least) a comment that is not addressed, but GitHub decided to hide it. You can see by clicking the second "show outdated" link from the top, or by checking the latest emails for this issue to make sure you don't miss any comment.

serhiy-storchaka · 2017-10-13T13:39:04Z

Oh, sorry, @ezio-melotti, I missed this comment. Please check whether I understood you correctly.

serhiy-storchaka · 2017-10-14T08:13:51Z

Thank you all for your reviews and suggestions!

miss-islington · 2017-10-14T08:14:29Z

Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.6.
🐍🍒⛏🤖

miss-islington · 2017-10-14T08:14:37Z

Sorry, @serhiy-storchaka, I could not cleanly backport this to 3.6 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker cd195e2a7ac5c9b2574d5462752b7939641de4a9 3.6

bedevere-bot · 2017-10-14T08:42:47Z

GH-3994 is a backport of this pull request to the 3.6 branch.

bpo-31714: Improved regular expression documentation.

67afa80

serhiy-storchaka added the "needs backport to 3.6", "skip news", and "docs" labels Oct 6, 2017

serhiy-storchaka requested a review from warsaw October 6, 2017 14:24

the-knights-who-say-ni added the CLA signed label Oct 6, 2017

bedevere-bot added the awaiting merge label Oct 6, 2017

ezio-melotti requested changes Oct 6, 2017

bedevere-bot added awaiting changes and removed awaiting merge labels Oct 6, 2017

cryvate reviewed Oct 6, 2017

serhiy-storchaka commented Oct 6, 2017

serhiy-storchaka added 3 commits October 6, 2017 20:40

Address review comments.

25aed19

Address a comment on the tracker.

aefa161

Add missed "to".

81623a6

warsaw reviewed Oct 6, 2017

serhiy-storchaka added 2 commits October 6, 2017 21:32

Add periods in the table.

e826629

Tweak quotes.

8316598

bedevere-bot added awaiting change review and removed awaiting changes labels Oct 12, 2017

Improve wording.

8934ee8

ezio-melotti approved these changes Oct 14, 2017

bedevere-bot added awaiting merge and removed awaiting change review labels Oct 14, 2017

serhiy-storchaka merged commit cd195e2 into python:master Oct 14, 2017

bedevere-bot removed the awaiting merge label Oct 14, 2017

serhiy-storchaka deleted the re-docs branch October 14, 2017 08:14

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this pull request Oct 14, 2017

[3.6] bpo-31714: Improved regular expression documentation. (pythonGH…

a12cf92

…-3907). (cherry picked from commit cd195e2)

bedevere-bot removed the needs backport to 3.6 label Oct 14, 2017

serhiy-storchaka added a commit that referenced this pull request Oct 14, 2017

[3.6] bpo-31714: Improved regular expression documentation. (GH-3907). (

69ed5b6

#3994) (cherry picked from commit cd195e2)
