hugo --minify breaks HTML element extraction #7567

Traumflug · 2020-08-15T22:48:04Z

$ ../hugo/hugo version
Hugo Static Site Generator v0.74.3/extended linux/amd64 BuildDate: unknown

... where 'unknown' should be 'today'.

Let code speak:

$ ../hugo/hugo
[...]
$ wc -l hugo_stats.json 
196 hugo_stats.json
$ ../hugo/hugo --minify
[...]
$ wc -l hugo_stats.json 
176 hugo_stats.json

Which means minification loses 20 HTML elements for unknown reasons, which breaks my site :-)

Further diagnosis:

This shorter list contains this line:

{
  "htmlElements": {
    "tags": [
      [...]
      "img",
      "imgdefer.length;i++){if(imgdefer[i].hasattribute('data-src')){imgdefer[i].setattribute('src',imgdefer[i].getattribute('data-src'));imgdefer[i].removeattribute('data-src');}}}\u003c/script",
      "li",
[...]

This long line is certainly not an HTML tag, but it resembles this chunk in one of the partials:

<script type="text/javascript">
  window.onload = function () {
    var imgDefer = document.getElementsByTagName('img');

    for (var i = 0; i < imgDefer.length; i++) {
      if (imgDefer[i].hasAttribute('data-src')) {
        imgDefer[i].setAttribute('src', imgDefer[i].getAttribute('data-src'));
        imgDefer[i].removeAttribute('data-src');
      }
    }
  }
</script>

Traumflug · 2020-08-15T22:51:40Z

It looks like HTML parsing trips over the JavaScript <. It also looks like HTML extraction happens after minification, which might not be the best idea for consistent results.

stale · 2020-12-19T08:30:38Z

This issue has been automatically marked as stale because it has not had recent activity. The resources of the Hugo team are limited, and so we are asking for your help.
If this is a bug and you can still reproduce this error on the master branch, please reply with all of the information you have about it in order to keep the issue open.
If this is a feature request, and you feel that it is still relevant and valuable, please tell us why.
This issue will automatically be closed in the near future if no further activity occurs. Thank you for all your contributions.

Traumflug · 2020-12-19T15:36:24Z

This still happens with latest master, Hugo 0.80.0-DEV.

davidsneighbour · 2020-12-19T16:25:44Z

Some more triage ;)

What is the diff of both runs on the json file? Meaning, not what is inside the minified one, but what actual differences are there?
Are you aware that there is a < in for (var i = 0; i < imgDefer.length; i++) {? I think that might be the issue here.

As far as I remember you need to somehow printf the javascript code in connection with a safeJS.

If you can, move the javascript into a static file outside of the layout file.

I think this is a limitation of what is going on, not a bug per se. Maybe opening a thread over at discourse.gohugo.io might bring a speedier solution.

Traumflug · 2020-12-19T20:38:01Z

Well, the simple solution is to not minify; the difference is less than a kilobyte for a small site. Workarounds for this bug are of no help, because the next user will run into it again. That's not how reliable software should deal with issues.

That said, here's a diff with Hugo 0.79.0, IIRC it used to be larger with earlier versions:

$ diff -u hugo_stats.json hugo_stats.json.minify
--- hugo_stats.json	2020-12-19 21:30:30.559248875 +0100
+++ hugo_stats.json.minify	2020-12-19 21:30:20.487292074 +0100
@@ -1,7 +1,6 @@
 {
   "htmlElements": {
     "tags": [
-      "!--",
       "!doctype",
       "a",
       "article",
@@ -23,7 +22,7 @@
       "html",
       "i",
       "img",
-      "img\n",
+      "imgdefer.length;i++){if(imgdefer[i].hasattribute('data-src')){imgdefer[i].setattribute('src',imgdefer[i].getattribute('data-src'));imgdefer[i].removeattribute('data-src');}}}\u003c/script",
       "li",
       "link",
       "meta",
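
The comparison asked for above can also be done on the tag lists themselves rather than on the text of the files. The following is a minimal sketch, assuming only the "htmlElements" / "tags" layout visible in this thread; tagDiff and the stats type are hypothetical names, and the sample data is shortened from the diff above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// stats mirrors only the part of hugo_stats.json shown in this thread.
type stats struct {
	HTMLElements struct {
		Tags []string `json:"tags"`
	} `json:"htmlElements"`
}

// tagDiff returns the tags found only in a and only in b, each sorted.
// Hypothetical helper, not part of Hugo.
func tagDiff(a, b []byte) (onlyA, onlyB []string, err error) {
	var sa, sb stats
	if err = json.Unmarshal(a, &sa); err != nil {
		return nil, nil, err
	}
	if err = json.Unmarshal(b, &sb); err != nil {
		return nil, nil, err
	}
	inA := make(map[string]bool)
	for _, t := range sa.HTMLElements.Tags {
		inA[t] = true
	}
	inB := make(map[string]bool)
	for _, t := range sb.HTMLElements.Tags {
		inB[t] = true
	}
	for t := range inA {
		if !inB[t] {
			onlyA = append(onlyA, t)
		}
	}
	for t := range inB {
		if !inA[t] {
			onlyB = append(onlyB, t)
		}
	}
	sort.Strings(onlyA)
	sort.Strings(onlyB)
	return onlyA, onlyB, nil
}

func main() {
	// Shortened sample data; the "..." stands for the elided JavaScript.
	plain := []byte(`{"htmlElements":{"tags":["!--","!doctype","a","img","img\n","li"]}}`)
	minified := []byte(`{"htmlElements":{"tags":["!doctype","a","img","imgdefer.length;i++){...}\u003c/script","li"]}}`)

	onlyPlain, onlyMinified, err := tagDiff(plain, minified)
	if err != nil {
		panic(err)
	}
	fmt.Printf("only without --minify: %q\n", onlyPlain)
	fmt.Printf("only with --minify:    %q\n", onlyMinified)
}
```

Comparing decoded values instead of lines makes entries such as "img\n" stand out, because %q prints the escaped newline.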

davidsneighbour · 2020-12-20T04:42:37Z

Isn't it weird that there is an "img" and an "img\n" in that list on un-minified pages? Could it be that some weird line ending is introduced in one of the partials, and while browsers display the image nicely, the HTML (XML) is not valid? Can you provide an unminified page sample and test it in a validator? https://validator.w3.org/

Hugo is only running the command. The problem is upstream in how the tags get extracted.

Traumflug · 2020-12-20T10:24:43Z

Well, in HTML, a newline is perfectly valid whitespace. I often spread tags across multiple lines to make them readable.

> Hugo is only running the command.

Which command or project would this be?

davidsneighbour · 2020-12-20T11:13:53Z

> Which command or project would this be?

Let's ask @bep... internals are not something I can compute 👍

And regarding your answer about the two img tags: I would expect the parser to know that a newline is whitespace, so it should know that img and img-newline are the same. That is what I am saying. For the parser to treat these as two separate elements might be an error in the parser, or a known limitation depending on how the files get parsed. If some headless browser is involved, it might fix things somehow.

I am not saying the newline is an error in your document; I am saying it might indicate that the parser somehow misclassifies one specific img tag because some unusual line breaking confuses it.
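
To illustrate the point about whitespace, here is a minimal sketch of how a raw tag token could be normalized so that "img" and "img\n" collapse into one entry. normalizeTag is a hypothetical helper for illustration, not a function from Hugo.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeTag cuts a raw tag token at the first whitespace character and
// lower-cases the result, so "img\n" and "IMG\tsrc=x" both become "img".
// Hypothetical helper, not part of Hugo.
func normalizeTag(raw string) string {
	fields := strings.Fields(raw)
	if len(fields) == 0 {
		return ""
	}
	return strings.ToLower(fields[0])
}

func main() {
	for _, raw := range []string{"img", "img\n", "IMG\tsrc=x"} {
		fmt.Printf("%q -> %q\n", raw, normalizeTag(raw))
	}
}
```

strings.Fields splits on any Unicode whitespace, which covers the space, tab and newline cases discussed here.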

davidsneighbour · 2020-12-20T11:18:27Z

There is no note in the code about how this gets extracted when I search for writeStats:

https://github.com/gohugoio/hugo/search?q=writeStats

philoserf · 2021-03-27T10:30:50Z

I believe #8180 fixed this.

ghost · 2021-04-05T14:46:40Z

With Hugo v0.81.0 (and v0.82.0), which includes #8180, it is not perfect, but closer:

$ diff -u hugo_stats.json hugo_stats.json.minify
--- hugo_stats.json	2021-04-05 16:23:07.790034356 +0200
+++ hugo_stats.json.minify	2021-04-05 16:23:17.842074077 +0200
@@ -1,10 +1,10 @@
 {
   "htmlElements": {
     "tags": [
-      "!--",
       "!doctype",
       "a",
       "article",
+      "b.length;a++)b[a].hasattribute('data-src')\u0026\u0026(b[a].setattribute('src',b[a].getattribute('data-src')),b[a].removeattribute('data-src'))}\u003c/script",
       "body",
       "button",
       "code",

Traumflug · 2021-04-06T13:16:05Z

D'oh. I just see I was logged in with my other account. @merchantsedition, that's me :-)

davidsneighbour · 2021-04-06T13:30:08Z

The truth comes to light ;D

bep · 2021-04-06T15:31:18Z

> It also looks like HTML extraction happens after minification, which might not be the best idea for consistent results.

We just listen to the stream written to disk and look for HTML elements, which is extremely fast compared to doing a full scan.

bep · 2021-04-06T17:10:56Z

I have added a PR patch that fixes the obvious part of this bug, but without any "failing input" example, this is not possible to verify. It could be possible to move this before minify, but anyone who wants to take on that challenge needs to also write a proper benchmark.
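
According to the commit title referenced in this issue, the patch skips script, pre and textarea content when looking for HTML elements. The general technique can be sketched as follows. This is an assumption-laden illustration working on a whole string rather than a stream, and collectTags and skipContent are hypothetical names, not code from the patch.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// skipContent lists the elements named in the commit title. Treating all
// three identically is an assumption of this sketch.
var skipContent = map[string]bool{"script": true, "pre": true, "textarea": true}

// collectTags returns the sorted, lower-cased names of all start tags in
// doc, ignoring anything between <script>, <pre> or <textarea> and the
// matching end tag. Hypothetical helper, not part of Hugo.
func collectTags(doc string) []string {
	doc = strings.ToLower(doc)
	seen := make(map[string]bool)
	for i := 0; i < len(doc); {
		if doc[i] != '<' {
			i++
			continue
		}
		end := strings.IndexByte(doc[i:], '>')
		if end < 0 {
			break // unterminated tag
		}
		fields := strings.Fields(doc[i+1 : i+end])
		i += end + 1
		if len(fields) == 0 || strings.HasPrefix(fields[0], "/") {
			continue // empty token or end tag
		}
		name := strings.TrimSuffix(fields[0], "/")
		seen[name] = true
		if skipContent[name] {
			// Jump straight to the end tag so that a '<' inside the
			// content is never inspected.
			j := strings.Index(doc[i:], "</"+name)
			if j < 0 {
				break
			}
			i += j
		}
	}
	tags := make([]string, 0, len(seen))
	for t := range seen {
		tags = append(tags, t)
	}
	sort.Strings(tags)
	return tags
}

func main() {
	fmt.Println(collectTags(`<script><span>foo</span></script><div class="foo"></div>`))
	fmt.Println(collectTags(`<div><script>for(a=0;a<b.length;a++){}</script><img src=x></div>`))
}
```

In the second call the `<b.length` inside the script is never looked at, so no bogus tag is recorded regardless of whether a space follows the `<`.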

ghost · 2021-04-07T14:30:11Z

Thanks for the work. I just gave it a go and found no difference in the output.

To find out what's going on, I added some printf-debugging (I'm a PHP guy, so this is normal):

diff --git a/publisher/htmlElementsCollector.go b/publisher/htmlElementsCollector.go
index d9479aaf..f2c35190 100644
--- a/publisher/htmlElementsCollector.go
+++ b/publisher/htmlElementsCollector.go
@@ -19,6 +19,7 @@ import (
 	"sort"
 	"strings"
 	"sync"
+	"fmt"
 
 	"github.com/gohugoio/hugo/helpers"
 	"golang.org/x/net/html"
@@ -118,7 +119,9 @@ func (w *cssClassCollectorWriter) Write(p []byte) (n int, err error) {
 					continue
 				}
 
+				fmt.Print("\n");
 				s := w.buff.String()
+				fmt.Print(s, "\n");
 
 				w.buff.Reset()
 
@@ -129,6 +132,7 @@ func (w *cssClassCollectorWriter) Write(p []byte) (n int, err error) {
 				key := s
 
 				s, tagName := w.insertStandinHTMLElement(s)
+				fmt.Print("tagName: ", tagName, "\n");
 				el := parseHTMLElement(s)
 				el.Tag = tagName
 				if w.isPreFormatted(tagName) {

This gives the following output without --minify, note the empty tagName:

$ ../hugo/hugo | grep -C8 'has[aA]ttribute'

</i

</a

</div

< imgDefer.length; i++) {
      if (imgDefer[i].hasAttribute('data-src')) {
        imgDefer[i].setAttribute('src', imgDefer[i].getAttribute('data-src'));
        imgDefer[i].removeAttribute('data-src');
      }
    }
  }
</script
tagName:

... and with --minify:

$ ../hugo/hugo --minify | grep -C4 'has[aA]ttribute'
</a

</div

<b.length;a++)b[a].hasAttribute('data-src')&&(b[a].setAttribute('src',b[a].getAttribute('data-src')),b[a].removeAttribute('data-src'))}</script
tagName: b.length;a++)b[a].hasattribute('data-src')&&(b[a].setattribute('src',b[a].getattribute('data-src')),b[a].removeattribute('data-src'))}</script

<script type=text/javascript src=../js/jquery.min.js
tagName: script

To me it looks like the mismatch happens earlier already: this entire JS snippet is recognized as a tag. This happens only with minification, because minification removes the space between < and the following word. Accordingly, the new isPreFormatted() doesn't recognize the script snippet as such.
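
The effect of the missing space can be reproduced in isolation. The sketch below is one plausible reading of the debug output above, not the actual logic of insertStandinHTMLElement: it assumes the tag name is whatever precedes the first whitespace in the collected text. naiveTagName is a hypothetical helper, and the unminified sample is shortened.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveTagName takes the text collected between '<' and the next '>' and
// returns the lower-cased run of characters before the first whitespace.
// Hypothetical helper that only mimics the behaviour seen in the debug
// output; it is not Hugo's code.
func naiveTagName(collected string) string {
	end := strings.IndexAny(collected, " \t\r\n")
	if end < 0 {
		end = len(collected)
	}
	return strings.ToLower(collected[:end])
}

func main() {
	// Text following the '<' of "i < imgDefer.length", shortened.
	plain := " imgDefer.length; i++) {\n  }\n</script"
	// The same spot after minification, copied from the output above.
	minified := "b.length;a++)b[a].hasAttribute('data-src')&&(b[a].setAttribute('src',b[a].getAttribute('data-src')),b[a].removeAttribute('data-src'))}</script"

	fmt.Printf("plain:    %q\n", naiveTagName(plain))
	fmt.Printf("minified: %q\n", naiveTagName(minified))
}
```

With the leading space the name comes out empty, matching the empty tagName in the run without --minify. Without the space the whole snippet becomes the name, matching the entry that ends up in hugo_stats.json.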

dirkolbrich · 2021-04-14T23:13:01Z

first find:
tdewolff/minify/v2/html.Minifier should be configured with KeepQuotes: true, the default is false:

in minifiers/config.go:

var defaultTdewolffConfig = tdewolffConfig{
	HTML: html.Minifier{
		KeepDocumentTags:        true,
		KeepConditionalComments: true,
		KeepEndTags:             true,
		KeepDefaultAttrVals:     true,
		KeepQuotes:              true,  // <- should be added
		KeepWhitespace:          false,
	},

For the other part of the "garbled" string from minify, I was mistaken and didn't understand the stream of data. Minify does not send the complete string or complete HTML tags to the HTML extraction. It sends token chunks of the start tag if it has ids, classes or other attributes. It leaves end tags complete.

dirkolbrich · 2021-04-16T09:44:26Z

Reactivating my previously deleted comment about the skipped test on minified <script>...</script> HTML content.
When adding test fmt.Printf() statements:

func (w *cssClassCollectorWriter) Write(p []byte) (n int, err error) {
	n = len(p)
	i := 0
	fmt.Printf("%v\n", string(p)) // <-- added test print

for _, minify := range []bool{false, true} {
	c.Run(fmt.Sprintf("%s--minify-%t", test.name, minify), func(c *qt.C) {
		w := newHTMLElementsCollectorWriter(newHTMLElementsCollector())
		fmt.Printf("%v\n", test.html) // <- added test print

the incoming stream from minify for <script>...</script> HTML content just breaks off. Minify seems to delete anything after the start tag.

=== RUN   TestClassCollector/Script_tags_content_should_be_skipped--minify-false
<script><span>foo</span><span>bar</span></script><div class="foo"></div>
<script><span>foo</span><span>bar</span></script><div class="foo"></div>
=== RUN   TestClassCollector/Script_tags_content_should_be_skipped--minify-true
<script><span>foo</span><span>bar</span></script><div class="foo"></div>
<script
>
    htmlElementsCollector_test.go:126: 
        error:
          values are not deep equal
        diff (-got +want):
            publisher.HTMLElements{
                Tags: []string{
          +             "div",
                        "script",
                },
          -     Classes: nil,
          +     Classes: []string{"foo"},
                IDs:     nil,
            }
        got:
          publisher.HTMLElements{
              Tags:    {"script"},
              Classes: nil,
              IDs:     nil,
          }
        want:
          publisher.HTMLElements{
              Tags:    {"div", "script"},
              Classes: {"foo"},
              IDs:     nil,
          }
        stack:
          /Users/dirkolbrich/Coding/gohugo/hugo/publisher/htmlElementsCollector_test.go:126
            c.Assert(got, qt.DeepEquals, test.expect)

Using:

➜  publisher git:(master) ✗ hugo version
hugo v0.82.0+extended darwin/amd64 BuildDate=unknown
➜  publisher git:(master) ✗ git rev-parse --short HEAD        
fa432b17

jmooring · 2023-07-06T19:16:15Z

Closing. If you feel this is still a problem, please open a new issue with a minimal reproducible example (i.e., a repository we can clone and test with).

github-actions · 2023-07-28T01:51:23Z

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

stale bot added the Stale label Dec 19, 2020

stale bot removed the Stale label Dec 19, 2020

bep added the Bug label Apr 6, 2021

bep added this to the v0.83 milestone Apr 6, 2021

bep added the NeedsInvestigation label Apr 6, 2021

bep added a commit to bep/hugo that referenced this issue Apr 6, 2021

publisher: Skip script, pre and textarea content when looking for HTML elements

5a7502f

Updates gohugoio#7567

bep mentioned this issue Apr 6, 2021

publisher: Skip script, pre and textarea content when looking for HTML elements #8391

Merged

bep added a commit that referenced this issue Apr 6, 2021

publisher: Skip script, pre and textarea content when looking for HTML elements

8a30894

Updates #7567

bep added a commit to bep/hugo that referenced this issue Apr 7, 2021

publisher: Also test minified HTML in the element collector

4098136

Updates gohugoio#7567

bep mentioned this issue Apr 7, 2021

publisher: Also test minified HTML in the element collector #8393

Merged

bep added a commit that referenced this issue Apr 7, 2021

publisher: Also test minified HTML in the element collector

3d5dbdc

Updates #7567

Traumflug mentioned this issue Apr 19, 2021

Parsing robustness #8436

Closed

bep modified the milestones: v0.104.0, v0.105.0 Sep 23, 2022

bep modified the milestones: v0.105.0, v0.106.0 Oct 26, 2022

bep modified the milestones: v0.106.0, v0.107.0 Nov 18, 2022

bep modified the milestones: v0.107.0, v0.108.0 Dec 3, 2022

bep modified the milestones: v0.108.0, v0.109.0 Dec 14, 2022

bep modified the milestones: v0.109.0, v0.111.0, v0.110.0 Jan 26, 2023

bep modified the milestones: v0.111.0, v0.112.0 Feb 15, 2023

bep modified the milestones: v0.112.0, v0.113.0 Apr 15, 2023

bep modified the milestones: v0.113.0, v0.115.0 Jun 13, 2023

bep modified the milestones: v0.115.0, v0.116.0 Jun 30, 2023

jmooring closed this as completed Jul 6, 2023

dirkolbrich mentioned this issue Jul 7, 2023

Fixed minify and fingerprint for production/export dirkolbrich/hugo-tailwindcss-starter-theme#48

Merged

github-actions bot added the Outdated label Jul 28, 2023

github-actions bot locked as resolved and limited conversation to collaborators Jul 28, 2023
