Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing robustness #8436

Closed
wants to merge 11 commits into from
Closed

Parsing robustness #8436

wants to merge 11 commits into from

Conversation

Traumflug
Copy link

@Traumflug Traumflug commented Apr 19, 2021

Fallout of a hacking Sunday. Tackles #7567 and #8417.

Tested against my own project, only, as I'm not aware how the testsuite works.

Not so sure about commit 0d91578, "don't shortcut on previously seen elements", as I'm not sure how much performance impact it has. It solves handling ignored tags, but probably needs either removal of then dead code or a different solution for dealing with ignored tags appearing multiple times.

@CLAassistant
Copy link

CLAassistant commented Apr 19, 2021

CLA assistant check
All committers have signed the CLA.

@Traumflug
Copy link
Author

In case somebody wants to enhance the testsuite, here are additional cases:

<script type="text/javascript">
    for (var i = 0; i < imgDefer.length; i++) {
        doNothing();
    }
</script>
<script
  type="text/javascript"
>
  if (a<b && a>c) {
    haha();
  }
</script>
<form
	action="[URL]"
	method="post"
>
<hr	id=a>

Yes, newlines and tab vs. space matters :-)

Failure of each is recognized by non-alphabetical characters appearing in the hugo_stats.json.

@bep
Copy link
Member

bep commented Apr 20, 2021

Tested against my own project, only, as I'm not aware how the testsuite works.

go test ./publisher

@bep bep closed this Apr 20, 2021
@bep bep reopened this Apr 20, 2021
@bep
Copy link
Member

bep commented Apr 20, 2021

Thanks for this.

There are failing tests. It looks mostly to be elements inside pre/textarea, which I don't see any reason to collect (but please prove me wrong).

Also, it would be great if you could

  1. Add some test cases for the cases this PR is supposed to fix.
  2. Run the benchmark inside the ./publisher package and compare it to master (see below)
name                     old time/op    new time/op    delta
ClassCollectorWriter-16    21.2µs ± 2%    47.3µs ± 1%  +123.21%  (p=0.029 n=4+4)

name                     old alloc/op   new alloc/op   delta
ClassCollectorWriter-16    35.3kB ± 0%    86.4kB ± 0%  +144.95%  (p=0.029 n=4+4)

name                     old allocs/op  new allocs/op  delta
ClassCollectorWriter-16       155 ± 0%       329 ± 0%  +112.26%  (p=0.029 n=4+4)

Note that I don't mind it getting slower if really needed, but I fail to understand the motivation behind the removal of elementSet, which I suspect is the reason why it is now so much slower and memory hungry.

@bep
Copy link
Member

bep commented Apr 20, 2021

Can you check whether the latest commits re this in the master branch covers your needs?

@Traumflug
Copy link
Author

Traumflug commented Apr 22, 2021

Thanks for the hint on how to run these testcases easily. I just pushed a reviewed set of commits, rebased on current master.

There are failing tests. It looks mostly to be elements inside pre/textarea, which I don't see any reason to collect

Well, the simple reason for these failures was me removing special handling of pre/textarea, but not knowing how to remove these tests as well.

That said, removing these was a mistake, I misunderstood the intention of this code earlier. It's not about not collecting non-styleable tags, but about ignoring stuff inside them. Accordingly, I melted down the first commit to just adding another two such tags with preformatted code. Where <title> isn't exactly preformatted, but of no use for styling anyways.

Can you check whether the latest commits re this in the master branch covers your needs?

Well, some, but not all. This time, knowing how to deal with these tests, I managed to find a testcase for each. Committed testcases separately to allow seeing them failing. Luckily, I also found a fix for each.

Also one or two testcases which fell into the keyboard somehow, without a matching reported issue I'd be aware of :-)

Note that I don't mind it getting slower if really needed, but I fail to understand the motivation behind the removal of elementSet

This time I moved the commit with this removal, along with a commit with a demonstrating (previously failing, now working) testcase just before it, all to the end, to allow (cherry-)picking earlier commits.

The problem to solve appears when such a preformatted tag appears twice and contains a < (happens rarely in testcases, often in real world usage). First round everything is fine and dandy. On the second round, the processing time saver misses the part where w.inPreTag gets set. Accordingly, the following parsing doesn't recognize the preformatted tag as such and treats stuff inside the preformatted tag like normal tags.

Perhaps you have a better idea on how to get the best of both, quick processing as well as recognizing repeated preformatted tags.

These are two of the symptoms reported with #7567. The second test
passes already.

The first test fails, because the '<' lets the parsing logic start
a new tag, which only ends on the actual closing tag. This means
garbage in front of the actual closing tag, making parseEndTag()
fail to recognize it as a closing tag.
This stops parsing of 'preformatted' tags not on any tag found,
but only on finding the matching end tag. Key element here is
looking at the end of a tag (strings.HasSuffix()), rather than
the entire tag. Remainder is simplification of the logic flow.

On what failed before, see also previous commit.

This is related to issue #7567 and fixes the failing test case
introduced with the previous commit.
First case actually fails. Second one tests against a situation I
had during development of the previous commit (forgotten ToLower()
in parseEndTag()).
This makes the previously introduced testcase pass.
A regular whitespace, a tab and a newline are all equivelant and
perfectly valid whitespace characters. Solve this by using
strings.Fields(), which considers all these.

This appears to solve #8417
Same for the other table-related tags, of course. This makes the
testcase introduced with the previous commit pass.
Here, the strategy to avoid processing tags twice fails. On the
first round, everything works properly. Next round, the same tag
is no longer recognized as a preformatted tag due to the
shortcut taken, the mess happens again.
This fixes the previously introduced testcase, details see there.
Also helps with issue #7567.
@ghost
Copy link

ghost commented May 16, 2021

I just rebased the pull request branch to tag v0.83.1 and force-pushed it.

@bep
Copy link
Member

bep commented May 16, 2021

Thanks, but I've already spent a considerably amount of time simplifying this in another. Would welcome any failed test cases after that PR is merged.

@bep bep closed this May 16, 2021
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 17, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants