Save 24% With The Last Frontier of Minification: HTML!

As a web developer, front end developer, or web performance enthusiast (or all of those), it’s likely that you’re already minifying your JavaScript (or uglifying it) and most likely your CSS too.

Why do we do this?

We minify specifically to reduce the bytes going over the wire, trying to get our websites as tiny as possible in order to shoot them through the internet faster than our competitors’.

We obsess over optimising our images, carefully choosing the appropriate format and tweaking the quality percentage, hoping to achieve the balance between clarity and file size.

So we have teeny tiny JavaScript; we have clean, minified, uncssed css; we have perfectly small images (being lazy loaded, no doubt).

So what’s left?…

The ever-overlooked … HTML minification!

That’s right! HTML! BOOM!

Seriously though; HTML may be the one remaining frontier for optimisation. If you’ve covered off all the other static files – the css, js, and images – then not optimising html seems like a wasted opportunity.

If you view the source of most websites you visit you’ll probably find line after line of white space, acres of indents, reams of padding, novels in the form of comments.

Every single one of these is a wasted opportunity to save bytes; to squeeze the last few bits out of your HTTP response, turning it from a hulking oil tanker into a zippy land speeder.

html-minifier

There is a wonderful package on npm and GitHub called html-minifier (which also has a handy online HTML minification tool) that eases HTML minification into your existing automation flow; I’ll be using it to demonstrate the potential savings.

html-minifier has a ton of options, so there’s plenty of scope for tweaking. Most of them can be enabled without breaking the resulting page, but you really need to test them out for yourself. Table taken from the html-minifier GitHub page:

Option | Description | Default
removeComments | Strip HTML comments | false
removeCommentsFromCDATA | Strip HTML comments from scripts and styles | false
removeCDATASectionsFromCDATA | Remove CDATA sections from script and style elements | false
collapseWhitespace | Collapse white space that contributes to text nodes in a document tree. | false
conservativeCollapse | Always collapse to 1 space (never remove it entirely). Must be used in conjunction with collapseWhitespace=true | false
preserveLineBreaks | Always collapse to 1 line break (never remove it entirely) when whitespace between tags include a line break. Must be used in conjunction with collapseWhitespace=true | false
collapseBooleanAttributes | Omit attribute values from boolean attributes | false
removeAttributeQuotes | Remove quotes around attributes when possible. | false
removeRedundantAttributes | Remove attributes when value matches default. | false
preventAttributesEscaping | Prevents the escaping of the values of attributes. | false
useShortDoctype | Replaces the doctype with the short (HTML5) doctype | false
removeEmptyAttributes | Remove all attributes with whitespace-only values | false
removeScriptTypeAttributes | Remove type="text/javascript" from script tags. Other type attribute values are left intact. | false
removeStyleLinkTypeAttributes | Remove type="text/css" from style and link tags. Other type attribute values are left intact. | false
removeOptionalTags | Remove unrequired tags | false
removeIgnored | Remove all tags starting and ending with <%, %>, <?, ?> | false
removeEmptyElements | Remove all elements with empty contents | false
lint | Toggle linting | false
keepClosingSlash | Keep the trailing slash on singleton elements | false
caseSensitive | Treat attributes in case sensitive manner (useful for custom HTML tags.) | false
minifyJS | Minify JavaScript in script elements and on* attributes (uses UglifyJS) | false (could be true, false, Object (options))
minifyCSS | Minify CSS in style elements and style attributes (uses clean-css) | false (could be true, false, Object (options))
minifyURLs | Minify URLs in various attributes (uses relateurl) | false (could be Object (options))
ignoreCustomComments | Array of regex’es that allow to ignore certain comments, when matched | [ ]
ignoreCustomFragments | Array of regex’es that allow to ignore certain fragments, when matched (e.g. <?php ... ?>, {{ ... }}, etc.) | [ ]
processScripts | Array of strings corresponding to types of script elements to process through minifier (e.g. text/ng-template, text/x-handlebars-template, etc.) | [ ]
maxLineLength | Specify a maximum line length. Compressed output will be split by newlines at valid HTML split-points. |
customAttrAssign | Arrays of regex’es that allow to support custom attribute assign expressions (e.g. '<div flex?="{{mode != cover}}"></div>') | [ ]
customAttrSurround | Arrays of regex’es that allow to support custom attribute surround expressions (e.g. <input {{#if value}}checked="checked"{{/if}}>) | [ ]
customAttrCollapse | Regex that specifies custom attribute to strip newlines from (e.g. /ng\-class/) |
quoteCharacter | Type of quote to use for attribute values (' or ") |
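
To get a rough feel for what the two simplest options (removeComments and collapseWhitespace) do to a byte count, here is a deliberately naive Python sketch. It is not how html-minifier works internally: it uses plain regular expressions, the function name and sample markup are made up for illustration, and it would happily mangle pre blocks, inline scripts and IE conditional comments.

```python
import re


def naive_minify(html: str) -> str:
    """Very rough stand-in for removeComments + collapseWhitespace (illustration only)."""
    # Drop HTML comments (this would also wrongly drop IE conditional comments).
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    # Collapse every run of whitespace to a single space.
    html = re.sub(r"\s+", " ", html)
    # Remove the remaining space between adjacent tags.
    html = re.sub(r">\s+<", "><", html)
    return html.strip()


# Hypothetical sample page, padded the way real-world markup often is.
sample = """<!DOCTYPE html>
<html>
    <head>
        <!-- a novel in the form of a comment -->
        <title>Demo</title>
    </head>
    <body>
        <p>Hello,   world!</p>
    </body>
</html>
"""

before = len(sample.encode("utf-8"))
after = len(naive_minify(sample).encode("utf-8"))
print(f"{before} bytes -> {after} bytes ({100 - 100 * after / before:.1f}% saved)")
```

Removing the space between adjacent tags can change how inline elements render, which is exactly the kind of edge case a real minifier has to handle and this sketch does not.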

But how much can you save?

As a big fan of the HTTPArchive and BigQuery, I took this opportunity to quickly pull down a few thousand urls (just under 9000 in fact) from a recent run of HTTPArchive.

I then implemented a very crude bit of code to request each url, measure the original HTML response size, process it with html-minifier, and measure the difference.

Just by implementing html minification, the average web page can shrink by more than 24%! That’s almost a quarter off for free! Nice.

That was an easy read, wasn’t it? Now go turn on html minification.

Code?

Ok, fine, let’s embarrass me by looking at my hacky code. It’s most likely possible to implement this whole thing with a single async, online, blah blah, nodejs script, but I took the old-school route of using distinctly different, mainly offline, stages:

  1. install html-minifier via npm
  2. plain old curl commands to download the html to be processed locally,
  3. a dos batch file to run html-minifier in a loop for each un-minified file,
  4. and finally a powershell script to work out the difference in unprocessed vs processed file sizes, resulting in an average saving value

1. Install html-minifier – npm

npm install -g html-minifier

Easy.

Admittedly you need npm for this step, but you get that for free with nodejs. Don’t have nodejs? Go and install it!

2. Getting the original HTML – curl

I used some mad find-and-replace skillz to turn a list of 8,900 URLs from the HTTPArchive query into a batch file full of curl commands, so I could download the HTML for processing locally. The file ended up looking a bit like this repeated a few thousand times with loads of different URLs:

curl -L http://www.google.com/ -o www.google.com.html
curl -L http://www.facebook.com/ -o www.facebook.com.html
curl -L http://www.youtube.com/ -o www.youtube.com.html
curl -L http://www.baidu.com/ -o www.baidu.com.html

This resulted in 8,900 HTML files in my “html” directory; henceforth to be known as “the original html directory”!
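
If find-and-replace isn’t your thing, the same batch file could be generated with a few lines of Python. This is only a sketch of the idea, not what I actually ran: the curl_command helper and the short URL list are hypothetical, and naming each output file after the host means two URLs on the same host would overwrite each other.

```python
from urllib.parse import urlparse


def curl_command(url: str) -> str:
    """Build one line like: curl -L http://www.google.com/ -o www.google.com.html"""
    host = urlparse(url).netloc  # e.g. "www.google.com"
    return f"curl -L {url} -o {host}.html"


# Stand-in for the list of URLs exported from the HTTPArchive query.
urls = [
    "http://www.google.com/",
    "http://www.facebook.com/",
]

# Print one curl command per URL; redirect the output into a batch file to use it.
for url in urls:
    print(curl_command(url))
```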

Don’t have curl? Just install it via chocolatey with a quick choco install curl! Don’t have chocolatey?! Go and install it!

3. Minifying the original HTML – DOS batch file

I then process each file with html-minifier (checking if it hasn’t already been processed, in case I need to run the script multiple times), putting the results in a min directory using this batch file:

for /f "delims=|" %%f in ('dir /b *.html') do (
    if not exist ..\min\%%f (
        html-minifier %%f -o ..\min\%%f -c ../html-minifier.config
    )
)

This just loops through all the files in a directory with a “.html” extension and checks whether each one already exists in the output directory. If not, it processes the file with html-minifier.

The key command here is:

html-minifier %%f -o ..\min\%%f -c ../html-minifier.config

Those parameters are:

  • %%f – the input file name from the outer for loop (something like “www.google.com.html”)
  • -o ..\min\%%f – the output location and file name (e.g. “..\min\www.google.com.html”)
  • -c – specifies a config file which defines how aggressive html-minifier is, using the list of options further up this article
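
For anyone not on Windows, the skip-if-already-processed loop can be sketched in Python. This version only builds the html-minifier command lines (using the same -o and -c parameters described above) rather than running them; the function name and the directory names in the usage part are assumptions for illustration.

```python
from pathlib import Path


def pending_commands(src_dir: Path, min_dir: Path, config: Path) -> list[list[str]]:
    """Return one html-minifier command per .html file not yet present in min_dir."""
    commands = []
    for src in sorted(src_dir.glob("*.html")):
        dest = min_dir / src.name
        if dest.exists():
            continue  # already minified on an earlier run, so skip it
        commands.append(["html-minifier", str(src), "-o", str(dest), "-c", str(config)])
    return commands


if __name__ == "__main__":
    # Assumed layout: originals in ./html, output in ./min, config alongside them.
    for cmd in pending_commands(Path("html"), Path("min"), Path("html-minifier.config")):
        print(" ".join(cmd))
```

Each command could then be handed to subprocess.run(cmd, check=True). On Windows the npm-installed command is a .cmd shim, so that call may need shell=True or the full html-minifier.cmd name.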

This is the config file I used for the test:

{
  "collapseBooleanAttributes": true,
  "collapseInlineTagWhitespace" : true,
  "collapseWhitespace" : true,
  "includeAutoGeneratedTags" : false,
  "minifyCSS" : true,
  "minifyJS" : true,
  "minifyURLs" : true,
  "quoteCharacter": "'",
  "removeAttributeQuotes": true,
  "removeComments": true,
  "removeEmptyAttributes" : true,
  "removeEmptyElements": true,
  "removeOptionalTags": true,
  "removeRedundantAttributes": true,
  "removeScriptTypeAttributes": true,
  "removeStyleLinkTypeAttributes": true,
  "removeTagWhitespace" :true
}

Theoretically you should be able to just pass --input-dir <directory> and --output-dir <directory> as params to html-minifier; however, this didn’t seem to work for me, hence the batch file. These params aren’t documented anywhere, but if you run html-minifier -h you can see them as options.

4. Comparing original HTML to minified HTML – PowerShell

I used some PowerShell to compare the original to the minified and give an overall average difference in size. In case of empty files where minification completely failed, or perhaps the original file was itself empty, I’ve added a simple “Length” check near the start of this script to ignore them:

$origDir = "\path\to\original\files\"
$minDir = "\path\to\minified\files"

$processedFiles = gci -Path $minDir | where {($_.Length) -gt 0}
$origFiles = gci -Path $origDir

$results = New-Object System.Collections.ArrayList

foreach ($processedFile in $processedFiles) {
    foreach ($origFile in $origFiles) {
        if ($processedFile.Name -eq $origFile.Name) {
            $percentDifference = (100/$origFile.Length)*$processedFile.Length
            $results.Add($percentDifference) > $null
        }
    }
}

Write-Host (100-($results | Measure-Object -Average).Average)%

Here I’m getting the child items (gci – a.k.a. Get-ChildItem) of the processed html directory, ignoring those with 0 bytes; i.e., empty files.

$processedFiles = gci -Path $minDir | where {($_.Length) -gt 0}

Then for each processed file I’m finding the original HTML file by comparing their name.

if ($processedFile.Name -eq $origFile.Name) {

When I find a match, I’m getting the percentage difference in size and adding it to an array.

$percentDifference = (100/$origFile.Length)*$processedFile.Length
$results.Add($percentDifference) > $null

Finally I average all the values in this array to get a result and output it.

Write-Host (100-($results | Measure-Object -Average).Average)%

I’m sure there’s a better way of doing this that doesn’t involve a nested for loop, but I don’t really mind inefficiency for a one-off task.
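
One way to avoid the nested loop is to index the original file sizes by name first, so each minified file needs only a single lookup. Here is a minimal Python sketch of that idea, computing the same per-file average as the PowerShell above; the function name is made up, and unlike the PowerShell it also skips empty originals to avoid dividing by zero.

```python
from pathlib import Path


def average_saving(orig_dir: Path, min_dir: Path) -> float:
    """Mean per-file percentage saved, skipping empty or unmatched files."""
    # Index original sizes by file name so each lookup is a single dict access.
    orig_sizes = {p.name: p.stat().st_size for p in orig_dir.iterdir() if p.is_file()}

    savings = []
    for minified in min_dir.iterdir():
        if not minified.is_file():
            continue
        min_size = minified.stat().st_size
        orig_size = orig_sizes.get(minified.name, 0)
        if min_size == 0 or orig_size == 0:
            continue  # failed minification, empty original, or no matching file
        savings.append(100 - 100 * min_size / orig_size)

    if not savings:
        raise ValueError("no comparable files found")
    return sum(savings) / len(savings)


# Example call (paths are placeholders):
#   print(f"{average_saving(Path('html'), Path('min')):.2f}%")
```

Note that this, like the PowerShell, is the mean of the per-file percentages, so a tiny page counts as much as a huge one; dividing total minified bytes by total original bytes would generally give a different figure.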

Summary

Across the 8,900 HTML files downloaded and successfully processed, the average saving from this process was approximately 24.32%!

That’s surely worth investigating for your own site, isn’t it?

There are gulp and grunt wrappers for this process, so you can even automate it as part of your build or deploy process.

Alternatives

html-minifier is not the only HTML minifier; you can even try to write your own if you like, which is what Dean Hume has previously done, specifically for .NET MVC Razor files. If you try this you need to be aware of the many edge cases (IE conditional comments, for example).

Good luck!

3 thoughts on “Save 24% With The Last Frontier of Minification: HTML!”

  1. If you are using GZIP then I assume the gains are much less impressive?

    A re-run with gzip enabled would be useful 🙂

  2. It would be interesting to compare pages with compression. How big they were compressed, how big they were uncompressed, then both of those statistics after minification. I once worked on a web service with XML in/out. Management wanted me to change my clean/easy to read tag names to single characters “for efficiency”. Once I showed (using real numbers) that it made ~1% difference because of compression, the issue was dropped.

  3. I tested it. Pages on https://www.peterbe.com are generated by Django, stored to disk and served by Nginx as flat files. Nginx will gzip compress if the client supports it.

    The https://www.peterbe.com/plog page is huge. It’s 298KB of just pure HTML. On disk. After I ran html-minifier on it, it became 250KB on disk.

    Then I gzipp’ed both files. 64KB vs. 62KB. In other words, a saving of a whopping 2KB!

    Conclusion, not really worth it considering the risk that it might break something I haven’t foreseen.
