Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hugo --minify breaks HTML element extraction #7567

Closed
Traumflug opened this issue Aug 15, 2020 · 21 comments
Closed

hugo --minify breaks HTML element extraction #7567

Traumflug opened this issue Aug 15, 2020 · 21 comments

Comments

@Traumflug
Copy link

$ ../hugo/hugo version
Hugo Static Site Generator v0.74.3/extended linux/amd64 BuildDate: unknown

... where 'unknown' should be 'today'.

Let code speak:

$ ../hugo/hugo
[...]
$ wc -l hugo_stats.json 
196 hugo_stats.json
$ ../hugo/hugo --minify
[...]
$ wc -l hugo_stats.json 
176 hugo_stats.json

Which means, minification looses 20 HTML elements for unknown reasons. Which breaks my site :-)

Further diagnosis:

This shorter list contains this line:

{
  "htmlElements": {
    "tags": [
      [...]
      "img",
      "imgdefer.length;i++){if(imgdefer[i].hasattribute('data-src')){imgdefer[i].setattribute('src',imgdefer[i].getattribute('data-src'));imgdefer[i].removeattribute('data-src');}}}\u003c/script",
      "li",
[...]

This long line is certainly not a HTML tag, but similar to this chunk in one of the partials:

<script type="text/javascript">
  window.onload = function () {
    var imgDefer = document.getElementsByTagName('img');

    for (var i = 0; i < imgDefer.length; i++) {
      if (imgDefer[i].hasAttribute('data-src')) {
        imgDefer[i].setAttribute('src', imgDefer[i].getAttribute('data-src'));
        imgDefer[i].removeAttribute('data-src');
      }
    }
  }
</script>
@Traumflug
Copy link
Author

It looks like HTML parsing trips over the JavaScript <. It also looks like HTML extraction happens after minification, which might be not the best idea for consistent results.

@stale
Copy link

stale bot commented Dec 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. The resources of the Hugo team are limited, and so we are asking for your help.
If this is a bug and you can still reproduce this error on the master branch, please reply with all of the information you have about it in order to keep the issue open.
If this is a feature request, and you feel that it is still relevant and valuable, please tell us why.
This issue will automatically be closed in the near future if no further activity occurs. Thank you for all your contributions.

@stale stale bot added the Stale label Dec 19, 2020
@Traumflug
Copy link
Author

This still happens with latest master, Hugo 0.80.0-DEV.

@stale stale bot removed the Stale label Dec 19, 2020
@davidsneighbour
Copy link
Contributor

Some more triage ;)

  • What is the diff of both runs on the json file? Meaning, not what is inside the minified one, but what actual differences are there?
  • Are you aware, that there is a < in for (var i = 0; i < imgDefer.length; i++) { - I think that might be the issue here.

As far as I remember you need to somehow printf the javascript code in connection with a safeJS.

If you can, move the javascript into a static file outside of the layout file.

I think this is a limitation of what is going on, not a bug per se. Maybe opening a thread over at discourse.gohugo.io might bring a speedy-er solution.

@Traumflug
Copy link
Author

Well, the simple solution is to not minify, distinction is less than a kilobyte for a small site. Workarounds around this bug are of no help, because the next user will run into it again. That's not how a reliable software should deal with issues.

That said, here's a diff with Hugo 0.79.0, IIRC it used to be larger with earlier versions:

$ diff -u hugo_stats.json hugo_stats.json.minify
--- hugo_stats.json	2020-12-19 21:30:30.559248875 +0100
+++ hugo_stats.json.minify	2020-12-19 21:30:20.487292074 +0100
@@ -1,7 +1,6 @@
 {
   "htmlElements": {
     "tags": [
-      "!--",
       "!doctype",
       "a",
       "article",
@@ -23,7 +22,7 @@
       "html",
       "i",
       "img",
-      "img\n",
+      "imgdefer.length;i++){if(imgdefer[i].hasattribute('data-src')){imgdefer[i].setattribute('src',imgdefer[i].getattribute('data-src'));imgdefer[i].removeattribute('data-src');}}}\u003c/script",
       "li",
       "link",
       "meta",

@davidsneighbour
Copy link
Contributor

Isn't it weird, that there is an "img" and an "img\n" in that list on un-minified pages? Could it be that some weird line ending is introduced in one of the partials and while browsers display the image nicely the HTML(XML) is not valid? Can you provide a unminified page sample and test it in a validator? https://validator.w3.org/

Hugo is only running the command. The problem is upstream in how the tags get extracted.

@Traumflug
Copy link
Author

Well, in HTML, a newline is perfectly valid whitespace. I often spread tags across multiple lines to make them readable.

Hugo is only running the command.

Which command or project would this be?

@davidsneighbour
Copy link
Contributor

Which command or project would this be?

Let's ask @bep... internals are not something I can compute 👍

And regarding your answer to the two img tags... i would expect the parser to know, that a newline is a whitespace. so it should know that img and img-newline is the same. that is what I am saying. for the parser to acknowledge this as two separate instances of something might be an error in the parser, or a known inability depending on how the files get parsed. If some windowless browser is involved, it might fix things somehow.

I am not saying the newline is an error in your document, I am saying it might be an indicator that the parser somehow mis-qualifies one specific img tag because there is some weird new lining going on that irritates it.

@davidsneighbour
Copy link
Contributor

There is no note in the code about how this get's extracted when I search for write stats

https://github.com/gohugoio/hugo/search?q=writeStats

@philoserf
Copy link
Contributor

I believe #8180 fixed this.

@ghost
Copy link

ghost commented Apr 5, 2021

With Hugo v0.81.0 (and v0.82.0), which includes #8180, it gets not perfect, but closer:

$ diff -u hugo_stats.json hugo_stats.json.minify
--- hugo_stats.json	2021-04-05 16:23:07.790034356 +0200
+++ hugo_stats.json.minify	2021-04-05 16:23:17.842074077 +0200
@@ -1,10 +1,10 @@
 {
   "htmlElements": {
     "tags": [
-      "!--",
       "!doctype",
       "a",
       "article",
+      "b.length;a++)b[a].hasattribute('data-src')\u0026\u0026(b[a].setattribute('src',b[a].getattribute('data-src')),b[a].removeattribute('data-src'))}\u003c/script",
       "body",
       "button",
       "code",

@Traumflug
Copy link
Author

D'oh. I just see I was logged in with my other account. @merchantsedition, that's me :-)

@davidsneighbour
Copy link
Contributor

The truth comes to light ;D

@bep bep added the Bug label Apr 6, 2021
@bep bep added this to the v0.83 milestone Apr 6, 2021
@bep
Copy link
Member

bep commented Apr 6, 2021

It also looks like HTML extraction happens after minification, which might be not the best idea for consistent results.

We just listen to the stream written to disk and look for HTML elements, which is extremely fast compared to doing a full scan.

@bep
Copy link
Member

bep commented Apr 6, 2021

I have added a PR patch that fixes the obvious part of this bug, but without any "failing input" example, this is not possible to verify. It could be possible to move this before minify, but anyone who wants to take on that challenge needs to also write a proper benchmark.

@ghost
Copy link

ghost commented Apr 7, 2021

Thanks for the work. I just gave it a go and find no distinction in the output.

To find out what's going on, I added some printf-debugging (I'm a PHP guy, so this is normal):

diff --git a/publisher/htmlElementsCollector.go b/publisher/htmlElementsCollector.go
index d9479aaf..f2c35190 100644
--- a/publisher/htmlElementsCollector.go
+++ b/publisher/htmlElementsCollector.go
@@ -19,6 +19,7 @@ import (
 	"sort"
 	"strings"
 	"sync"
+	"fmt"
 
 	"github.com/gohugoio/hugo/helpers"
 	"golang.org/x/net/html"
@@ -118,7 +119,9 @@ func (w *cssClassCollectorWriter) Write(p []byte) (n int, err error) {
 					continue
 				}
 
+				fmt.Print("\n");
 				s := w.buff.String()
+				fmt.Print(s, "\n");
 
 				w.buff.Reset()
 
@@ -129,6 +132,7 @@ func (w *cssClassCollectorWriter) Write(p []byte) (n int, err error) {
 				key := s
 
 				s, tagName := w.insertStandinHTMLElement(s)
+				fmt.Print("tagName: ", tagName, "\n");
 				el := parseHTMLElement(s)
 				el.Tag = tagName
 				if w.isPreFormatted(tagName) {

This gives the following output without --minify, note the empty tagName:

$ ../hugo/hugo | grep -C8 'has[aA]ttribute'

</i

</a

</div

< imgDefer.length; i++) {
      if (imgDefer[i].hasAttribute('data-src')) {
        imgDefer[i].setAttribute('src', imgDefer[i].getAttribute('data-src'));
        imgDefer[i].removeAttribute('data-src');
      }
    }
  }
</script
tagName: 

... and with --minify:

$ ../hugo/hugo --minify | grep -C4 'has[aA]ttribute'
</a

</div

<b.length;a++)b[a].hasAttribute('data-src')&&(b[a].setAttribute('src',b[a].getAttribute('data-src')),b[a].removeAttribute('data-src'))}</script
tagName: b.length;a++)b[a].hasattribute('data-src')&&(b[a].setattribute('src',b[a].getattribute('data-src')),b[a].removeattribute('data-src'))}</script

<script type=text/javascript src=../js/jquery.min.js
tagName: script

To me it looks like the mismatch happens earlier already, this entire JS snippet is recognized as tag. Only with minification, because minification removes the space between < and the following word. Accordingly, the new isPreFormatted() doesn't recognize the script snippet as such.

@dirkolbrich
Copy link
Contributor

first find:
tdewolff/minify/v2/html.Minifier should be configured with KeepQuotes: true, the default is false:

in minifiers/config.go:

var defaultTdewolffConfig = tdewolffConfig{
	HTML: html.Minifier{
		KeepDocumentTags:        true,
		KeepConditionalComments: true,
		KeepEndTags:             true,
		KeepDefaultAttrVals:     true,
		KeepQuotes:              true,  // <- should be added
		KeepWhitespace:          false,
	},

For the other part of the "garbled" string from minifiy, I was mistaken and didn't understood the stream of data. Minify does not send the complete string or complete html tags to the html extraction. It sends token chunks of the start tag, if it has ids, classes or other attributes. It will leave end tags complete.

@dirkolbrich
Copy link
Contributor

Reactivating my former deleted comment: for the skipped test on minified <script>...</script> html content.
If adding test fmt.Printf()statements:

func (w *cssClassCollectorWriter) Write(p []byte) (n int, err error) {
	n = len(p)
	i := 0
	fmt.Printf("%v\n", string(p)) // <-- added test print
for _, minify := range []bool{false, true} {
			c.Run(fmt.Sprintf("%s--minify-%t", test.name, minify), func(c *qt.C) {
				w := newHTMLElementsCollectorWriter(newHTMLElementsCollector())
				fmt.Printf("%v\n", test.html) // <- added test print

the incoming stream from minify for <script>...</script> html content just breaks off. Minify seem to delete anything behind the start tag.

=== RUN   TestClassCollector/Script_tags_content_should_be_skipped--minify-false
<script><span>foo</span><span>bar</span></script><div class="foo"></div>
<script><span>foo</span><span>bar</span></script><div class="foo"></div>
=== RUN   TestClassCollector/Script_tags_content_should_be_skipped--minify-true
<script><span>foo</span><span>bar</span></script><div class="foo"></div>
<script
>
    htmlElementsCollector_test.go:126: 
        error:
          values are not deep equal
        diff (-got +want):
            publisher.HTMLElements{
                Tags: []string{
          +             "div",
                        "script",
                },
          -     Classes: nil,
          +     Classes: []string{"foo"},
                IDs:     nil,
            }
        got:
          publisher.HTMLElements{
              Tags:    {"script"},
              Classes: nil,
              IDs:     nil,
          }
        want:
          publisher.HTMLElements{
              Tags:    {"div", "script"},
              Classes: {"foo"},
              IDs:     nil,
          }
        stack:
          /Users/dirkolbrich/Coding/gohugo/hugo/publisher/htmlElementsCollector_test.go:126
            c.Assert(got, qt.DeepEquals, test.expect)

Using:

➜  publisher git:(master) ✗ hugo version
hugo v0.82.0+extended darwin/amd64 BuildDate=unknown
➜  publisher git:(master) ✗ git rev-parse --short HEAD        
fa432b17

@bep bep modified the milestones: v0.104.0, v0.105.0 Sep 23, 2022
@bep bep modified the milestones: v0.105.0, v0.106.0 Oct 26, 2022
@bep bep modified the milestones: v0.106.0, v0.107.0 Nov 18, 2022
@bep bep modified the milestones: v0.107.0, v0.108.0 Dec 3, 2022
@bep bep modified the milestones: v0.108.0, v0.109.0 Dec 14, 2022
@bep bep modified the milestones: v0.109.0, v0.111.0, v0.110.0 Jan 26, 2023
@bep bep modified the milestones: v0.111.0, v0.112.0 Feb 15, 2023
@bep bep modified the milestones: v0.112.0, v0.113.0 Apr 15, 2023
@bep bep modified the milestones: v0.113.0, v0.115.0 Jun 13, 2023
@bep bep modified the milestones: v0.115.0, v0.116.0 Jun 30, 2023
@jmooring
Copy link
Member

jmooring commented Jul 6, 2023

Closing. If you feel this is still a problem, please open a new issue with a minimal reproducible example (i.e., a repository we can clone and test with).

@github-actions
Copy link

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

7 participants