
bpo-35808: Retire pgen and use pgen2 to generate the parser #11814

Merged
merged 9 commits into from
Mar 1, 2019

Conversation

pablogsal
Member

@pablogsal pablogsal commented Feb 11, 2019

This problem turned out to be a bit more complex because pgen2 needed some modifications to make it compatible with the old pgen output (apart from the extra glue to generate graminit.c and graminit.h). I also had to downgrade some f-strings and '**'-unpacking because Travis uses an old Python for the regeneration step.

This PR implements the following:

  • Removes pgen and all related files from the source tree and Makefile.

  • Replaces pgen with a modified and reduced version of Lib/lib2to3/pgen2
    (only the pgen functionality is included):

    • Adds new tokens (like COLONEQUAL, TYPE_IGNORE, etc.) to
      synchronize with the current Grammar.

    • All dfa names are included in the label list (this has the effect
      of including 'file_input', 'eval_input', ... in the list of
      labels to guarantee compatibility with the old pgen generated
      files).

    • The order in which dfas are generated is preserved to maintain
      compatibility with old pgen generated files and avoid empty stacks
      when the parser processes the rule. If this change is not made,
      the parser fails with:

      """' ... It's a token we know
      DFA 'and_expr', state 0: Push 'shift_expr'
      DFA 'shift_expr', state 0: Push 'arith_expr'
      DFA 'arith_expr', state 0: Push 'term'
      DFA 'term', state 0: Push 'factor'
      DFA 'factor', state 0: Push 'power'
      DFA 'power', state 0: Push 'atom_expr'
      DFA 'atom_expr', state 0: Push 'atom'
      DFA 'atom', state 0: Shift.
      Token NEWLINE/'' ... It's a token we know
      DFA 'atom', state 5: Pop ...
      DFA 'atom_expr', state 2: Pop ...
      DFA 'power', state 1: Pop ...
      DFA 'factor', state 2: Pop ...
      DFA 'term', state 1: Pop ...
      DFA 'arith_expr', state 1: Pop ...
      DFA 'shift_expr', state 1: Pop ...
      DFA 'and_expr', state 1: Pop ...
      Error: bottom of stack.

      This seems to happen because instead of starting with
      'single_input' / 'file_input', etc., it starts with
      'and_expr' just because it is the second element in the
      dfa list in graminit.c. This makes the parser push
      tokens from 'and_expr', so before the shift can happen,
      it arrives at an empty stack when popping the nonterminals.

    • First sets are represented with Python sets instead of dictionaries
      with 1 as the values.

    • Added (optional) verbose output that dumps different elements of
      the grammar and displays the first sets of every nonterminal.

    • pgen2 now is invoked as:

      python -m Parser.pgen2 Grammar/Grammar Include/graminit.h Python/graminit.c

      (optionally add '-v' for verbose output).

    • pgen included "<>" in the list of labels but pgen2 does not, as it
      thinks it is already in the list of tokens because it collides with
      the value of "!=". This is not a problem because this label is a
      terminal, is handled manually, and Parser/token.c produces the
      token NOTEQUAL anyway. This is the reason there are now 183 labels
      where pgen generated 184.

  • Changes the regen-grammar target to use pgen2 instead of pgen.

  • Regenerates Makefiles using autoreconf.
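The dfa-ordering point above can be illustrated with a minimal sketch (hypothetical code, not the actual pgen2 implementation; only the rule names are real grammar nonterminals): the generator must emit DFAs in the order the rules appear in the Grammar file, so the parser's start symbols keep their expected positions in graminit.c.

```python
# Sketch only: preserve Grammar-file order when collecting DFAs, so that
# 'single_input' / 'file_input' / ... come first in the emitted dfa list,
# matching what the old pgen produced.  An OrderedDict (or an explicit list
# of names) makes the ordering intent explicit even on older Pythons where
# plain dicts did not guarantee insertion order.
from collections import OrderedDict

def make_dfas(rules):
    """rules: iterable of (name, dfa) pairs in Grammar-file order."""
    dfas = OrderedDict()
    for name, dfa in rules:
        dfas[name] = dfa
    return dfas

dfas = make_dfas([("single_input", None), ("file_input", None), ("and_expr", None)])
assert list(dfas)[0] == "single_input"   # start symbol stays first
```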
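The first-set representation change mentioned above amounts to the following comparison (illustrative token names; the actual first sets are computed from the grammar):

```python
# Old style: a dict whose values are always 1, used only for membership tests.
first_old = {"NAME": 1, "NUMBER": 1, "STRING": 1}
# New style: a plain set, which expresses membership directly.
first_new = {"NAME", "NUMBER", "STRING"}

# Both carry the same information; the set is the idiomatic form.
assert set(first_old) == first_new
```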
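The "<>" label point above comes down to a value collision; here is a hedged toy model of it (the NOTEQUAL number and the table are made up for illustration; only the token name is CPython's):

```python
# Both spellings of "not equal" map to the same token value, so any table
# keyed by token value can hold only one of them -- which is why pgen2 ends
# up with one fewer label than the old pgen.
NOTEQUAL = 29  # illustrative value, not the real token number

opmap = {}
for spelling in ("!=", "<>"):
    opmap[spelling] = NOTEQUAL

labels_by_value = {}
for spelling, value in opmap.items():
    labels_by_value[value] = spelling   # the second spelling overwrites the first

assert len(labels_by_value) == 1        # one label where old pgen had two
```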

https://bugs.python.org/issue35808

This is just a (working) first proposal to start discussing the issue.

Member

@ned-deily ned-deily left a comment


It doesn't appear that there are changes to configure.ac, so is there a need to churn aclocal.m4 and configure?

@gvanrossum
Member

Do you need my review? I'm super behind on stuff so it may be a while, and I don't want to block you.

@pablogsal pablogsal requested a review from ambv February 11, 2019 17:27
@eamanu
Contributor

eamanu commented Feb 11, 2019

Hi, in the next line:

https://github.com/python/cpython/blob/54487675ffab7998898939e9fa578576dc8b4862/Makefile.pre.in#L1736

there is a reference to $(PGEN) and that does not exist.

By the way, that doesn't give any error because it is empty.

@pablogsal
Member Author

there is a reference to $(PGEN) and that does not exist.

Good catch!

@pablogsal
Member Author

Do you need my review? I'm super behind on stuff so it may be a while, and I don't want to block you.

@gvanrossum Don't worry. I will try to get someone else to review, but the fact that the interpreter builds and passes the tests is a good assurance that the patch is not fundamentally wrong.

@vstinner
Member

Hum. I'm not super excited by duplicating Lib/lib2to3/pgen2/ as Parser/pgen2/ "with a few changes". It sounds like a good recipe to duplicate the maintenance.

Would it be possible to modify Lib/lib2to3/pgen2/ to respect your constraints instead? Maybe even with "#ifdef" (if PGEN: ... ?).

Maybe my question doesn't make sense and the code diverges enough to justify having two "copies" of the "same code".

@pablogsal
Member Author

pablogsal commented Feb 11, 2019

Hum. I'm not super excited by duplicating Lib/lib2to3/pgen2/ as Parser/pgen2/ "with a few changes". It sounds like a good recipe to duplicate the maintenance.

@vstinner We could add some branching to the original pgen2, but IMHO having the code in Parser/pgen2 makes sense and helps with discovery, as opposed to invoking something in Lib/lib2to3/pgen2. Also, only the relevant files are ported (the parser and the driver; the rest of the suite was not moved), so it is not the full pgen2 that will need double maintenance. (Also, we retired the original pgen, so you can think of pgen2 as taking over the maintenance of the original pgen.) Finally, I did not want to affect any lib2to3 behaviour, so that is why I opted for "duplicating" pgen2 initially.

That being said, if we think it is OK to add branching to pgen2 and invoke that from lib2to3, I am happy to do so :)

P.S. Thanks for reviewing this big PR! 😄

Contributor

@eamanu eamanu left a comment


Searching for PGEN in the CPython root, I found some issues:

  • .gitignore: I think we have to delete the Parser/pgen and Parser/pgen.exe entries, because now we won't have a pgen binary, will we?
  • In parsetok.h, on line 15, I don't know if the #ifndef PGEN is necessary...
  • There are several #ifndefs in other files: tokenizer.c, parsetok.c, ... are they necessary?

Member

@vstinner vstinner left a comment


Also, I did not want to affect any lib2to3 behaviour, so that is why I opted for "duplicating" pgen2 initially.

There should be a way to modify lib2to3.pgen2 just enough to allow a user to change some constants and functions, no?

I compared Parser/pgen2/ and Lib/lib2to3/pgen2/ using meld. Most code is duplicated, but "some" is modified. That's my worry.

The 194 lines of grammar.py are almost all the same, except for the opmap_raw constant, which is different. That's why I talked about "#ifdef". Would it be possible to modify lib2to3 to allow modifying opmap_raw? For this specific case, it looks like lib2to3 doesn't support PEP 572 whereas your change adds support for ":="... It means that we have 2 copies of the same file, but one is outdated? What's the point of having an outdated parser which doesn't support PEP 572 in lib2to3? lib2to3 should also be modified to support ":=", no?

Can't Parser/pgen2/ reuse Lib/lib2to3/pgen2/grammar.py? For a practical reason, I would suggest renaming Parser/pgen2/ to Parser/pgen/: it would avoid a conflict with the lib2to3.pgen2 package and it reuses the "pgen" name :-)

Parser/pgen2/pgen.py: multiple changes are to add support for older Python versions. Would it be acceptable to modify lib2to3 and reuse its pgen.py in Parser/pgen2/?

Parser/pgen2/tokenize.py: most code is the same, but you replaced " with ' and made other coding style changes. I'm not against coding style changes, except that here they make it harder to keep Parser/pgen2/tokenize.py in sync with Lib/lib2to3/pgen2/tokenize.py. Either revert the coding style changes or also update Lib/lib2to3/pgen2/tokenize.py.

Parser/pgen2/token.py: you added BACKQUOTE, removed ELLIPSIS, etc. That's maybe one of the trickiest differences, and it may prevent code reuse. Or should the other modules be made configurable to accept a "custom" token module, rather than being "hardcoded" to use a specific "token" module? I'm talking about from . import grammar, token, tokenize in pgen.py, for example.
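A hedged sketch of the "configurable token module" idea raised in this review (the ParserGenerator signature here is entirely hypothetical; pgen2 does not actually accept such a parameter): instead of a hardcoded from . import token, the generator could accept any object exposing the token names it needs.

```python
# Hypothetical dependency-injection sketch: the default is the stdlib token
# module, but callers may pass any module-like object with the same names.
import token as stdlib_token

class ParserGenerator:
    def __init__(self, grammar_file, token_module=stdlib_token):
        self.token = token_module      # injected, not hardcoded
        self.grammar_file = grammar_file

    def op_value(self):
        # Wherever the generator needs a token number, it goes through
        # the injected module instead of a fixed import.
        return self.token.OP

pg = ParserGenerator("Grammar/Grammar")
assert pg.op_value() == stdlib_token.OP
```

This would let lib2to3 keep its own token table while a CPython-specific caller passes in tokens built from Grammar/Tokens.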

@bedevere-bot

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@pablogsal
Member Author

@vstinner I will work on changes that address these issues, trying to reuse pgen2 as much as possible. I will ping you when it is ready :)

Member

@zooba zooba left a comment


Signing off on the changes to the Windows build files. Haven't looked at the rest.

**{f'{prefix}"""': double3prog for prefix in _strprefixes},
**{prefix: None for prefix in _strprefixes}}
"'''": single3prog, '"""': double3prog}
endprogs.update({"{}'''".format(prefix): single3prog for prefix in _strprefixes})
Member Author

@pablogsal pablogsal Feb 11, 2019


These are needed because Travis uses an old version of Python for the regeneration step.

@pablogsal
Member Author

pablogsal commented Feb 11, 2019

I have made some commits trying to use pgen2, but there are some problems that only show up on Travis and Windows... I will try to find out what is happening.

@pablogsal
Member Author

pablogsal commented Feb 20, 2019

As I have rebased, the old conversations that were attached to commits can be found here:

80c1217#commitcomment-32381091

104b8e7#commitcomment-32380993

@gvanrossum Ok, I have decided to go to the option you initially suggested and leave lib2to3 intact (at least in this PR). The implementation of this is in commit 2bc3198. The reasons are the following:

  1. Changing lib2to3's pgen can be backwards incompatible without us really knowing, because the test suite does not check everything.
  2. Modifying pgen2 to allow using Grammar/Tokens involves a bigger diff because we need to make every file (optionally) compatible with a token file. Also, Grammar/Tokens and the tokens in lib2to3 are slightly different, which makes this a bit more complex and therefore generates a bigger diff.
  3. I don't feel comfortable with just a comment stating that lib2to3's pgen needs to be backwards compatible and compatible with a top-level package, because comments can be ignored and this can lead to future problems.
  4. This allows evolving the parser independently of lib2to3 if in the future we decide to change the Parser or the parser generator.

The reasons there is a full implementation of the parser generator in Parser/pgen/pgen.py are the following:

  1. Subclassing the parser generator in Lib/lib2to3/pgen2/pgen.py is not enough, because several methods need to be changed to stabilize the output files. This is because some of the containers and classes that the parser generator uses produce a list of dfas, names, labels, arcs and states that changes between runs.
  2. There are several modifications that need to be done anyway to the parser generator to make sure that the dfas are produced in the order that the Parser expects, making it compatible with the old pgen output.
  3. This PR is already very big. If in the future we want to unify lib2to3 and Parser/pgen we can do it, but I suggest doing it in another PR with another issue, because this is a different task in its own right. This way, we can discuss all the problems with the different approach without producing even bigger diffs.
  4. I have added a --verbose flag to the generation step that dumps some information about the process (lib2to3 does not do that). This helps a lot if there are problems when generating a modification of the Grammar.
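Point 1 above, about output files changing between runs, can be sketched like this (names and structure are illustrative, not the real generator code): when intermediate containers are unordered sets, iterating over them directly makes graminit.c differ between runs, whereas sorting before emission makes the output reproducible.

```python
# Sketch: stabilize the emitted label list by sorting the unordered
# intermediate container before writing anything out.
def emit_labels(labels):
    """labels: a set of (token_number, name) pairs gathered during generation."""
    return [name for _, name in sorted(labels)]

# Two "runs" that gathered the same labels in different orders produce
# identical output once sorted.
run1 = emit_labels({(1, "NAME"), (2, "NUMBER"), (3, "STRING")})
run2 = emit_labels({(3, "STRING"), (1, "NAME"), (2, "NUMBER")})
assert run1 == run2 == ["NAME", "NUMBER", "STRING"]
```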

Member

@gvanrossum gvanrossum left a comment


(Honestly I think at this point you might as well have started over with a new PR. But oh well.)

Member

@gvanrossum gvanrossum left a comment


Getting closer. Maybe we can be completely independent from lib2to3.

"grammar", type=str, help="The file with the grammar definition in EBNF format"
)
parser.add_argument(
"gramminit_h",
Member


Normally this is spelled with one ‘m’. (Also below.)

# Add token to the module cache so tokenize.py uses that exact one instead of
# the one in the stdlib of the interpreter executing this file.
sys.modules['token'] = token
tokenize = importlib.machinery.SourceFileLoader('tokenize',
Member


This still looks fragile. Why do we need to use the latest tokenize to parse the Grammar file? The “meta” grammar is super simple, it just has NAME, string literals, and some basic punctuation and operators. The tokenize module from Python 2.4 can handle this. :-)

Member Author

@pablogsal pablogsal Feb 21, 2019


The tokenize module from Python 2.4 can handle this.

:)

Why do we need to use the latest tokenize to parse the Grammar file?

It is not that tokenize cannot handle the grammar; it is that the tokenizer uses different values for the tokens, so it fails when constructing the dfas when calling self.parse():

Traceback (most recent call last):
  File "/usr/lib/python3.4/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.4/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/src/Parser/pgen/__main__.py", line 36, in <module>
    main()
  File "/src/Parser/pgen/__main__.py", line 29, in main
    p = ParserGenerator(args.grammar, token_lines, verbose=args.verbose)
  File "/src/Parser/pgen/pgen.py", line 20, in __init__
    self.dfas, self.startsymbol = self.parse()
  File "/src/Parser/pgen/pgen.py", line 173, in parse
    self.expect(self.tokens['OP'], ":")
  File "/src/Parser/pgen/pgen.py", line 337, in expect
    type, value, self.type, self.value)
  File "/src/Parser/pgen/pgen.py", line 356, in raise_error
    self.end[1], self.line))
  File "./Grammar/Grammar", line 13
    single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
                ^
SyntaxError: expected 54/:, got 52/:

This is because OP has the value of 52 in Python 3.5 (in this example) and 54 in the tokens that we construct from Grammar/Tokens (or in Lib/token.py). This difference exists because the value of 52 is yielded by tokenize (from Python 3.5) when calling next(self.generator) in gettoken. Maybe I am missing something here, but that is the problem I found when trying to use the tokenize from the running Python :(
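The mismatch described above can be reproduced in miniature (the numeric values are the ones quoted from the traceback, not looked up from any interpreter; the expect helper is a toy, not the real pgen.py method):

```python
# The running interpreter's tokenize yields OP with its *own* numeric value,
# while the generator compares against the value built from Grammar/Tokens.
RUNNING_OP = 52   # what Python 3.5's tokenize yielded in the example
GRAMMAR_OP = 54   # what the tokens built from Grammar/Tokens expected

def expect(expected, got):
    if expected != got:
        raise SyntaxError("expected %d/:, got %d/:" % (expected, got))

try:
    expect(GRAMMAR_OP, RUNNING_OP)
except SyntaxError as exc:
    message = str(exc)

assert "expected 54" in message   # same shape of failure as the traceback
```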

@@ -0,0 +1,97 @@
from lib2to3.pgen2 import grammar
Member


Maybe also copy this, so we’re completely independent from lib2to3?

import collections
import importlib.machinery

# Use Lib/token.py and Lib/tokenize.py to obtain the tokens. To maintain this
Member


I would get them directly from Grammar/Tokens

@gvanrossum
Member

gvanrossum commented Feb 22, 2019 via email

@gvanrossum
Member

I think I have the solution, at least for tokenize. If in pgen.py, in the function parse() and all parse_xxx() functions, you replace self.tokens.XXX with tokenize.XXX, the parser that reads the Grammar file (the meta-parser if you will) will work independently from the tokens defined for the target parser/lexer. POC:
@.patch.txt
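The suggested fix can be sketched as follows (class and method names here are illustrative, not the real pgen.py code): the meta-parser that reads the Grammar file compares token types against the running interpreter's tokenize module, while the numbers built from Grammar/Tokens are reserved for the generated output.

```python
# tokenize re-exports the token constants (OP, NAME, ...) of the running
# interpreter, so comparisons against tokenize.OP are version-independent.
import tokenize

class MetaParser:
    def __init__(self, target_tokens):
        # Token numbers destined for graminit.c; NOT used while parsing
        # the Grammar file itself.
        self.tokens = target_tokens

    def expect_op(self, tok_type, tok_value, wanted):
        # Before the fix: compare against self.tokens['OP'] (version-dependent).
        # After the fix:  compare against tokenize.OP (always the running
        # interpreter's own value for what its tokenizer just yielded).
        return tok_type == tokenize.OP and tok_value == wanted

mp = MetaParser({"OP": 54})
assert mp.expect_op(tokenize.OP, ":", ":")
```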

@pablogsal
Member Author

pablogsal commented Feb 22, 2019

I think I have the solution, at least for tokenize. If in pgen.py, in the function parse() and all parse_xxx() functions, you replace self.tokens.XXX with tokenize.XXX, the parser that reads the Grammar file (the meta-parser if you will) will work independently from the tokens defined for the target parser/lexer. POC:
@.patch.txt

Of course! Yeah, I can confirm that this works perfectly. Thanks for your patience in going through this problem with me. :)

With this, if we read from Grammar/Tokens as you suggested, we don't need any import hacks. I have pushed 8ff1e44 including this patch and some changes addressing the rest of your feedback.

Member

@gvanrossum gvanrossum left a comment


Ship it!

@gvanrossum
Member

@vstinner can you re-review this?

@gvanrossum
Member

@pablogsal You can remove the other requested reviewers who haven't said anything yet.

@vstinner vstinner dismissed their stale review March 1, 2019 17:24

Code changed a lot since I reviewed the PR

@vstinner
Member

vstinner commented Mar 1, 2019

@gvanrossum @pablogsal: Sorry, I was busy with other stuff these last days, so I failed to follow this PR. Well, I reviewed an earlier version of the code. The code changed a lot in the meantime. I trust Guido to be able to review such a change properly. Pablo: please don't wait for me if you want to merge this PR.

@gvanrossum
Member

@pablogsal, feel free to merge.

@pablogsal pablogsal merged commit 1f24a71 into python:master Mar 1, 2019
@gvanrossum
Member

Whee!

@pablogsal pablogsal deleted the pgen2 branch May 19, 2021 18:57