Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-31714: Improved regular expression documentation. #3907

Merged
merged 7 commits into from
Oct 14, 2017

Conversation

serhiy-storchaka
Copy link
Member

@serhiy-storchaka serhiy-storchaka commented Oct 6, 2017

Copy link
Member

@ezio-melotti ezio-melotti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few comments.
There are also a couple of paragraphs with long lines that you could rewrap.

lowercasing doesn't take the current locale into account; it will if you also
set the :const:`LOCALE` flag.
letters, too. Full Unicode matching also works unless the :const:`re.ASCII`
flag is also used to disable non-ASCII matches. ``[A-Z]`` will also match
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disable non-ASCII matches makes it sounds like it won't match any non-ASCII characters, but that is false:

>>> re.match('负鼠', '负鼠', re.ASCII)
<_sre.SRE_Match object; span=(0, 2), match='负鼠'>

I think it would be more correct to say that regex sets will only match characters in the ASCII range.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a copy from existing re module documentation. :(

regex sets will only match characters in the ASCII range.

This doesn't sound good too. Regex sets can match characters outsides the ASCII range with the re.ASCII flag.

re.match('[耀-鿐]+', '负鼠', re.ASCII)
<re.Match object; span=(0, 2), match='负鼠'>

But case-insensitive matching works only in the ASCII range. 'é' doesn't match 'É' with the re.ASCII flag.

flag is also used to disable non-ASCII matches. ``[A-Z]`` will also match
letters 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131,
Latin small letter dotless i), 'ſ' (U+017f, Latin small letter long s) and
'K' (U+212a, Kelvin sign) in Unicode mode. ``Spam`` will match ``Spam``,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is wrong: unless I'm mistaken [A-Z] should be limited to upper case ASCII letters, even with re.UNICODE.
Perhaps you meant to use \w?
It would also be better to specify what Unicode categories are matched, instead of just providing a few examples and letting the user figure it out from there.
I think this is already explained below, so a link or a mention to that section is fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that this is about case-insensitive matching. 'S' matches both 's' and 'ſ'.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some tests:

>>> unichars = ''.join(chr(cp) for cp in range(0x110000))
>>> ''.join(re.findall('[a-z]', bmp))
'abcdefghijklmnopqrstuvwxyz'
>>> ''.join(re.findall('[A-Z]', bmp))
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> ''.join(re.findall('[a-z]', bmp, re.I))
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzİıſK'
>>> ''.join(re.findall('[A-Z]', bmp, re.I))
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzİıſK'

Now I understand what the paragraph means: [a-z] and [A-Z] will only match ASCII lower and upper case letters respectively (so 26 chars each), however if re.I is used with either one, since those 4 letters (and only those 4) are valid capitalization of the 26 ASCII letters, they will be matched as well (bringing the total up to 26 + 26 + 4 == 56).
Do you think we should rephrase it to make it clearer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be very good to make it clearer. Now, since you understand what the paragraph means, could you please suggest a clear wording?

This example is not artificial. See bpo-31672. I think this caveat should be documented specially.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the patterns ``[a-z]`` or ``[A-Z]`` are used in combination with the re.I and re.U flag,
they will match the 52 ASCII letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin
capital letter I with dot above), 'ı' (U+0131, Latin small letter dotless i), 'ſ' (U+017F,
Latin small letter long s) and 'K' (U+212A, Kelvin sign).

that take account of language differences. For example, if you're
processing encoded French text, you'd want to be able to write ``\w+`` to
match words, but ``\w`` only matches the character class ``[A-Za-z]`` in
bytes patterns; it won't match bytes corresponding to ``'é'`` or ``'ç'``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the '...' here and perhaps add b'\xe9' and b'\xe7' within parenthesis.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear why b'\xe9' and b'\xe7' should be matched. But 'é' and 'ç' are French letters, and I have added "bytes corresponding to" for making this phrase Python 3 compatible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why I suggested that, is because 'é' and 'ç' are Unicode str in Python 3, whereas without quotes they are just letters. The addition of b'\xe9' and b'\xe7' might help clarify what is being matched, but it's not essential.

bytes patterns; it won't match bytes corresponding to ``'é'`` or ``'ç'``.
If your system is configured properly and a French locale is selected,
certain C functions will tell the program that the byte corresponding
``'é'`` should also be considered a letter.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corresponding to é
I also don't like certain C functions will tell the program too much.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you suggest remove quotes?

certain C functions will tell the program already was here, it looks correct to me, and I don't know how improve it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to remove quotes, since we are talking about the character, not about a Unicode string.
Fair enough about the wording being there already.

is very unreliable, and it only handles one "culture" at a time anyway;
and it only works with 8-bit locales;
you should use Unicode matching instead, which is the default in Python 3
for Unicode (str) patterns.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... is very unreliable, it only handles one "culture" at a time, and it only works with 8-bit locales. You should use ...

Since Unicode matching is the default, I wouldn't say "you should use", but just something like "Unicode matching is already enabled by default in Python 3, and it is able to handle different locales/languages."

Alternation, or the "or" operator. If A and B are regular expressions,
``A|B`` will match any string that matches either ``A`` or ``B``. ``|`` has very
Alternation, or the "or" operator. If *A* and *B* are regular expressions,
``A|B`` will match any string that matches either *A* or *B*. ``|`` has very
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why * instead of ``?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because A and B are not literal regexpes, but variables.

matches.

match lowercase letters. Full Unicode matching (such as ``Ü`` matching
``ü``) also works unless the :const:`re.ASCII` flag is also used to disable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is also used to
(you already have one before)

only letters 'A' to 'Z' and 'a' to 'z', but will also match letters 'İ'
(U+0130, Latin capital letter I with dot above), 'ı' (U+0131, Latin small
letter dotless i), 'ſ' (U+017f, Latin small letter long s) and 'K' (U+212a,
Kelvin sign). If the :const:`ASCII` flag is used, only letters 'a' to 'z'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is duplicated from above, so the same comments applied (maybe it shouldn't be duplicated?).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you suggest how to avoid duplication?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't notice the duplicated part was on two separate files, so it's probably ok to leave it.

you should use Unicode matching instead, which is the default in Python 3
for Unicode (str) patterns. This flag can be used only with bytes patterns.
for Unicode (str) patterns.
Correcsponds the inline flag ``(?L)``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correcsponds to
(extra c and missing to)

Also applies below.


>>> int_re = r'\d+'
>>> print(re.sub('INT', int_re.replace('\\', r'\\'), r'INT(\.INT)?(e[+-]?INT)?'))
\d+(\.\d+)?(e[+-]?\d+)?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find this example particularly clear. Why would someone want to use re.escape() on the replacement string? Wouldn't using int_re = r'\\d+' (and a normal str.replace on INT) be easier?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to provide simplified real-word example. In Mailman re.sub() is used for creating a regular expression. They passed the pattern containing a \d as a replacement string and got an error when this became invalid. Someone could use re.escape() on the replacement string, because the replacement string looks similar to simple pattern (it expands \n and \1). And this will work while the replacement string don't contain other metacharacters except a backslash.

I'll replace this example with the better one.

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I didn't expect the Spanish Inquisition!. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@@ -1284,12 +1292,12 @@ doesn't work because of the greedy nature of ``.*``. ::
>>> print(re.match('<.*>', s).group())
<html><head><title>Title</title>

The RE matches the ``'<'`` in ``<html>``, and the ``.*`` consumes the rest of
The RE matches the ``<`` in ``'<html>'``, and the ``.*`` consumes the rest of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you do want '<' as in the original here. See also line 1299 in this commit.

Alternation, or the "or" operator. If A and B are regular expressions,
``A|B`` will match any string that matches either ``A`` or ``B``. ``|`` has very
Alternation, or the "or" operator. If *A* and *B* are regular expressions,
``A|B`` will match any string that matches either *A* or *B*. ``|`` has very
low precedence in order to make it work reasonably when you're alternating
multi-character strings. ``Crow|Servo`` will match either ``Crow`` or ``Servo``,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everything after "will match either" have '[word]' instead of [word]?

flag is also used to disable non-ASCII matches. ``[A-Z]`` will also match
letters 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131,
Latin small letter dotless i), 'ſ' (U+017f, Latin small letter long s) and
'K' (U+212a, Kelvin sign) in Unicode mode. ``Spam`` will match ``Spam``,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't all those after "will match" be '[word]' instead of [word]?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm not sure that I understood your comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spam, spam, spAM, or ſpam -> 'Spam', 'spam', 'spAM', or 'ſpam'? I might have gotten it wrong though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think he's referring to the 4 non-ASCII letters: they should be enclosed in `'..'`

@@ -229,24 +226,23 @@ Another repeating metacharacter is ``+``, which matches one or more times. Pay
careful attention to the difference between ``*`` and ``+``; ``*`` matches
*zero* or more times, so whatever's being repeated may not be present at all,
while ``+`` requires at least *one* occurrence. To use a similar example,
``ca+t`` will match ``cat`` (1 ``a``), ``caaat`` (3 ``a``'s), but won't match
``ct``.
``ca+t`` will match ``'cat'`` (1 ``'a'``), ``'caaat'`` (3 ``a``'s), but won't
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(3 'a's)?

Copy link
Member Author

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @ezio-melotti and @cryvate! I'll rewrap long lines after resolving all issues with wording.

lowercasing doesn't take the current locale into account; it will if you also
set the :const:`LOCALE` flag.
letters, too. Full Unicode matching also works unless the :const:`re.ASCII`
flag is also used to disable non-ASCII matches. ``[A-Z]`` will also match
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a copy from existing re module documentation. :(

regex sets will only match characters in the ASCII range.

This doesn't sound good too. Regex sets can match characters outsides the ASCII range with the re.ASCII flag.

re.match('[耀-鿐]+', '负鼠', re.ASCII)
<re.Match object; span=(0, 2), match='负鼠'>

But case-insensitive matching works only in the ASCII range. 'é' doesn't match 'É' with the re.ASCII flag.

flag is also used to disable non-ASCII matches. ``[A-Z]`` will also match
letters 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131,
Latin small letter dotless i), 'ſ' (U+017f, Latin small letter long s) and
'K' (U+212a, Kelvin sign) in Unicode mode. ``Spam`` will match ``Spam``,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that this is about case-insensitive matching. 'S' matches both 's' and 'ſ'.

flag is also used to disable non-ASCII matches. ``[A-Z]`` will also match
letters 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131,
Latin small letter dotless i), 'ſ' (U+017f, Latin small letter long s) and
'K' (U+212a, Kelvin sign) in Unicode mode. ``Spam`` will match ``Spam``,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm not sure that I understood your comment.

that take account of language differences. For example, if you're
processing encoded French text, you'd want to be able to write ``\w+`` to
match words, but ``\w`` only matches the character class ``[A-Za-z]`` in
bytes patterns; it won't match bytes corresponding to ``'é'`` or ``'ç'``.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear why b'\xe9' and b'\xe7' should be matched. But 'é' and 'ç' are French letters, and I have added "bytes corresponding to" for making this phrase Python 3 compatible.

bytes patterns; it won't match bytes corresponding to ``'é'`` or ``'ç'``.
If your system is configured properly and a French locale is selected,
certain C functions will tell the program that the byte corresponding
``'é'`` should also be considered a letter.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you suggest remove quotes?

certain C functions will tell the program already was here, it looks correct to me, and I don't know how improve it.

Alternation, or the "or" operator. If A and B are regular expressions,
``A|B`` will match any string that matches either ``A`` or ``B``. ``|`` has very
Alternation, or the "or" operator. If *A* and *B* are regular expressions,
``A|B`` will match any string that matches either *A* or *B*. ``|`` has very
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because A and B are not literal regexpes, but variables.

only letters 'A' to 'Z' and 'a' to 'z', but will also match letters 'İ'
(U+0130, Latin capital letter I with dot above), 'ı' (U+0131, Latin small
letter dotless i), 'ſ' (U+017f, Latin small letter long s) and 'K' (U+212a,
Kelvin sign). If the :const:`ASCII` flag is used, only letters 'a' to 'z'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you suggest how to avoid duplication?


>>> int_re = r'\d+'
>>> print(re.sub('INT', int_re.replace('\\', r'\\'), r'INT(\.INT)?(e[+-]?INT)?'))
\d+(\.\d+)?(e[+-]?\d+)?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to provide simplified real-word example. In Mailman re.sub() is used for creating a regular expression. They passed the pattern containing a \d as a replacement string and got an error when this became invalid. Someone could use re.escape() on the replacement string, because the replacement string looks similar to simple pattern (it expands \n and \1). And this will work while the replacement string don't contain other metacharacters except a backslash.

I'll replace this example with the better one.

Copy link
Member

@warsaw warsaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good to me. I have only one dumb comment; feel free to ignore it :)

@@ -526,7 +522,7 @@ of each one.
+=================================+============================================+
| :const:`ASCII`, :const:`A` | Makes several escapes like ``\w``, ``\b``, |
| | ``\s`` and ``\d`` match only on ASCII |
| | characters with the respective property. |
| | characters with the respective property |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is done for consistency? Any reason these shouldn't be complete sentences? I'd have rather added missing periods instead of removed them. Does this make the diff smaller? Is that worth it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No other reasons besides consistency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is single sentences in a table, I would normally not have a period, but that's just me.

@serhiy-storchaka
Copy link
Member Author

I didn't expect the Spanish Inquisition! (Neither Italian)

@bedevere-bot
Copy link

Nobody expects the Spanish Inquisition!

@ezio-melotti: please review the changes made to this pull request.

@ezio-melotti
Copy link
Member

There is still (at least) a comment that is not addressed, but GitHub decided to hide it. You can see by clicking the second "show outdated" link from the top, or by checking the latest emails for this issue to make sure you don't miss any comment.

@serhiy-storchaka
Copy link
Member Author

Oh, sorry, @ezio-melotti, I missed this comment. Please check whether I understood you correctly.

@serhiy-storchaka
Copy link
Member Author

Thank you all for your reviews and suggestions!

@serhiy-storchaka serhiy-storchaka merged commit cd195e2 into python:master Oct 14, 2017
@miss-islington
Copy link
Contributor

Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.6.
🐍🍒⛏🤖

@serhiy-storchaka serhiy-storchaka deleted the re-docs branch October 14, 2017 08:14
@miss-islington
Copy link
Contributor

Sorry, @serhiy-storchaka, I could not cleanly backport this to 3.6 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker cd195e2a7ac5c9b2574d5462752b7939641de4a9 3.6

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this pull request Oct 14, 2017
@bedevere-bot
Copy link

GH-3994 is a backport of this pull request to the 3.6 branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Documentation in the Doc dir skip news
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants