As a web developer, front-end developer, or web performance enthusiast (or all of those), it’s likely that you’re already minifying your JavaScript (or uglifying it), and most likely your CSS too.
Why do we do this?
We minify specifically to reduce the bytes going over the wire, trying to get our websites as tiny as possible so they shoot through the internet faster than our competitors’.
We obsess over optimising our images, carefully choosing the appropriate format and tweaking the quality percentage, hoping to achieve the balance between clarity and file size.
So we have teeny tiny JavaScript; we have clean, minified, unCSS’d CSS; we have perfectly small images (lazy loaded, no doubt).
So what’s left?…
The ever-overlooked … HTML minification!
That’s right! HTML! BOOM!
Seriously though: HTML may be the one remaining frontier for optimisation. If you’ve covered off all the other static files – the CSS, JS, and images – then not optimising the HTML itself seems like a wasted opportunity.
If you view the source of most websites you visit, you’ll probably find line after line of white space, acres of indents, reams of padding, and novels in the form of comments.
Every single one of these is a wasted opportunity to save bytes; to squeeze the last few bits out of your HTTP response, turning it from a hulking oil tanker into a zippy land speeder.
html-minifier
There is a wonderful package on npm and GitHub called html-minifier (which also has a handy online HTML minification tool) to ease HTML minification into your existing automation flow, and it’s what I’ll be using to demonstrate the potential savings.

html-minifier has a ton of options, so there’s plenty of scope for tweaking. Most of them can be enabled without breaking the resulting page, but you really need to test them out for yourself. The table below is taken from the html-minifier GitHub page:
| Option | Description | Default |
|---|---|---|
| `removeComments` | Strip HTML comments | `false` |
| `removeCommentsFromCDATA` | Strip HTML comments from scripts and styles | `false` |
| `removeCDATASectionsFromCDATA` | Remove CDATA sections from script and style elements | `false` |
| `collapseWhitespace` | Collapse white space that contributes to text nodes in a document tree | `false` |
| `conservativeCollapse` | Always collapse to 1 space (never remove it entirely). Must be used in conjunction with `collapseWhitespace=true` | `false` |
| `preserveLineBreaks` | Always collapse to 1 line break (never remove it entirely) when whitespace between tags includes a line break. Must be used in conjunction with `collapseWhitespace=true` | `false` |
| `collapseBooleanAttributes` | Omit attribute values from boolean attributes | `false` |
| `removeAttributeQuotes` | Remove quotes around attributes when possible | `false` |
| `removeRedundantAttributes` | Remove attributes when value matches default | `false` |
| `preventAttributesEscaping` | Prevents the escaping of the values of attributes | `false` |
| `useShortDoctype` | Replaces the doctype with the short (HTML5) doctype | `false` |
| `removeEmptyAttributes` | Remove all attributes with whitespace-only values | `false` |
| `removeScriptTypeAttributes` | Remove `type="text/javascript"` from script tags. Other `type` attribute values are left intact | `false` |
| `removeStyleLinkTypeAttributes` | Remove `type="text/css"` from style and link tags. Other `type` attribute values are left intact | `false` |
| `removeOptionalTags` | Remove unrequired tags | `false` |
| `removeIgnored` | Remove all tags starting and ending with `<%`, `%>`, `<?`, `?>` | `false` |
| `removeEmptyElements` | Remove all elements with empty contents | `false` |
| `lint` | Toggle linting | `false` |
| `keepClosingSlash` | Keep the trailing slash on singleton elements | `false` |
| `caseSensitive` | Treat attributes in case-sensitive manner (useful for custom HTML tags) | `false` |
| `minifyJS` | Minify JavaScript in script elements and `on*` attributes (uses UglifyJS) | `false` (could be `true`, `false`, `Object` (options)) |
| `minifyCSS` | Minify CSS in style elements and style attributes (uses clean-css) | `false` (could be `true`, `false`, `Object` (options)) |
| `minifyURLs` | Minify URLs in various attributes (uses relateurl) | `false` (could be `Object` (options)) |
| `ignoreCustomComments` | Array of regexes that allow to ignore certain comments, when matched | `[]` |
| `ignoreCustomFragments` | Array of regexes that allow to ignore certain fragments, when matched (e.g. `<?php ... ?>`, `{{ ... }}`, etc.) | `[]` |
| `processScripts` | Array of strings corresponding to types of script elements to process through minifier (e.g. `text/ng-template`, `text/x-handlebars-template`, etc.) | `[]` |
| `maxLineLength` | Specify a maximum line length. Compressed output will be split by newlines at valid HTML split-points | |
| `customAttrAssign` | Arrays of regexes that allow to support custom attribute assign expressions (e.g. `'<div flex?="{{mode != cover}}"></div>'`) | `[]` |
| `customAttrSurround` | Arrays of regexes that allow to support custom attribute surround expressions (e.g. `<input {{#if value}}checked="checked"{{/if}}>`) | `[]` |
| `customAttrCollapse` | Regex that specifies custom attribute to strip newlines from (e.g. `/ng\-class/`) | |
| `quoteCharacter` | Type of quote to use for attribute values (`'` or `"`) | `"` |
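If you want a feel for how these options behave before wiring anything into a build, html-minifier also exposes a `minify(html, options)` function in Node. Here’s a minimal sketch – the input snippet is just an illustration of mine:

```js
// A quick taste of the html-minifier API in Node
// (assumes `npm install html-minifier` in the current project)
var minify = require('html-minifier').minify;

var input = '<p title="blah" id="moo">    foo    </p>  <!-- a comment -->';

var output = minify(input, {
  collapseWhitespace: true,   // squash the runs of spaces
  removeComments: true,       // drop the HTML comment
  removeAttributeQuotes: true // quotes aren't needed around these values
});

console.log(output); // something like: <p title=blah id=moo>foo</p>
```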
But how much can you save?
As a big fan of the HTTPArchive and BigQuery, I took this opportunity to quickly pull down a few thousand URLs (just under 9,000, in fact) from a recent run of the HTTPArchive.
I then implemented a very crude bit of code to request each URL, measure the original HTML response size, process it with html-minifier, and measure the difference.
Just by implementing HTML minification, the average web page can shrink by more than 24%! That’s almost a quarter off for free! Nice.
That was an easy read, wasn’t it? Now go turn on HTML minification.
Code?
OK, fine, let me embarrass myself by showing you my hacky code. It’s most likely possible to implement this whole thing with a single async, online, blah blah Node.js script, but I took the old-school route of using distinctly different, mainly offline, stages:
- install `html-minifier` via `npm`
- plain old `curl` commands to download the HTML to be processed locally
- a DOS batch file to run `html-minifier` in a loop for each un-minified file
- and finally a PowerShell script to work out the difference in unprocessed vs processed file sizes, resulting in an average saving value
1. Install html-minifier – npm
npm install -g html-minifier
Easy.
Admittedly you need `npm` for this step, but you get that for free with Node.js. Don’t have Node.js? Go and install it!
2. Getting the original HTML – curl
I used some mad find-and-replace skillz to turn a list of 8,900 URLs from the HTTPArchive query into a batch file full of `curl` commands, so I could download the HTML for processing locally. The file ended up looking a bit like this, repeated a few thousand times with loads of different URLs:
```
curl -L http://www.google.com/ -o www.google.com.html
curl -L http://www.facebook.com/ -o www.facebook.com.html
curl -L http://www.youtube.com/ -o www.youtube.com.html
curl -L http://www.baidu.com/ -o www.baidu.com.html
```
This resulted in 8,900 HTML files in my “html” directory; henceforth to be known as “the original html directory”!
Don’t have `curl`? Just install it via Chocolatey with a quick `choco install curl`! Don’t have Chocolatey?! Go and install it!
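Incidentally, the mad find-and-replace could equally be a few lines of Node – a sketch, assuming a `urls.txt` with one URL per line (the file names here are my own invention):

```js
// make-curl.js – generate a batch file of curl commands from a list of URLs
var fs = require('fs');

var urls = fs.readFileSync('urls.txt', 'utf8')
  .split(/\r?\n/)
  .filter(Boolean); // ignore blank lines

var lines = urls.map(function (url) {
  // derive a file name from the URL, e.g. http://www.google.com/ -> www.google.com.html
  var name = url.replace(/^https?:\/\//, '').replace(/\/+$/, '');
  return 'curl -L ' + url + ' -o ' + name + '.html';
});

fs.writeFileSync('get-html.bat', lines.join('\r\n'));
```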
3. Minifying the original HTML – DOS batch file
I then process each file with html-minifier (first checking whether it has already been processed, in case I need to run the script multiple times), putting the results in a `min` directory using this batch file:
```
for /f "delims=|" %%f in ('dir /b *.html') do (
    if not exist ..\min\%%f (
        html-minifier %%f -o ..\min\%%f -c ../html-minifier.config
    )
)
```
This just loops through all the files in the directory with a “.html” extension and checks whether each one already exists in the output directory. If not, it processes the file with html-minifier.
The key command here is:
html-minifier %%f -o ..\min\%%f -c ../html-minifier.config
Those parameters are:
- `%%f` – the input file name from the outer `for` loop (something like “www.google.com.html”)
- `-o ..\min\%%f` – the output location and file name (e.g. “..\min\www.google.com.html”)
- `-c` – specifies a config file which defines how aggressive `html-minifier` is, using the list of options further up this article
This is the config file I used for the test:
```json
{
  "collapseBooleanAttributes": true,
  "collapseInlineTagWhitespace": true,
  "collapseWhitespace": true,
  "includeAutoGeneratedTags": false,
  "minifyCSS": true,
  "minifyJS": true,
  "minifyURLs": true,
  "quoteCharacter": "'",
  "removeAttributeQuotes": true,
  "removeComments": true,
  "removeEmptyAttributes": true,
  "removeEmptyElements": true,
  "removeOptionalTags": true,
  "removeRedundantAttributes": true,
  "removeScriptTypeAttributes": true,
  "removeStyleLinkTypeAttributes": true,
  "removeTagWhitespace": true
}
```
Theoretically you should be able to just pass `--input-dir <directory>` and `--output-dir <directory>` as params to `html-minifier`; however, this didn’t seem to work for me, hence the batch file. These params aren’t documented anywhere, but if you run `html-minifier -h` you can see them listed as options.
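If you’d rather avoid the batch file entirely, the same loop is easy to sketch in Node against html-minifier’s API (the directory names and config file location here are assumptions on my part):

```js
// batch-minify.js – process every .html file in ./html into ./min
// (assumes `npm install html-minifier` and the JSON config shown above)
var fs = require('fs');
var path = require('path');
var minify = require('html-minifier').minify;

var options = JSON.parse(fs.readFileSync('html-minifier.config', 'utf8'));
var srcDir = 'html';
var outDir = 'min';

fs.readdirSync(srcDir).forEach(function (file) {
  if (path.extname(file) !== '.html') return;  // only process .html files
  var outFile = path.join(outDir, file);
  if (fs.existsSync(outFile)) return;          // skip already-processed files

  try {
    var html = fs.readFileSync(path.join(srcDir, file), 'utf8');
    fs.writeFileSync(outFile, minify(html, options));
  } catch (e) {
    // badly-formed pages can make the minifier throw; log and move on
    console.error('failed on ' + file + ': ' + e.message);
  }
});
```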
4. Comparing original HTML to minified HTML – Powershell
I used some PowerShell to compare the original files to the minified ones and give an overall average difference in size. To handle empty files – where minification completely failed, or perhaps the original file was itself empty – I’ve added a simple “Length” check near the start of this script to ignore them:
```powershell
$origDir = "\path\to\original\files\"
$minDir = "\path\to\minified\files\"
$processedFiles = gci -Path $minDir | where {($_.Length) -gt 0}
$origFiles = gci -Path $origDir
$results = New-Object System.Collections.ArrayList

foreach ($processedFile in $processedFiles) {
    foreach ($origFile in $origFiles) {
        if ($processedFile.Name -eq $origFile.Name) {
            $percentDifference = (100/$origFile.Length)*$processedFile.Length
            $results.Add($percentDifference) > $null
        }
    }
}

Write-Host (100-($results | Measure-Object -Average).Average)%
```
Here I’m getting the child items (`gci`, an alias for `Get-ChildItem`) of the processed HTML directory, ignoring those with 0 bytes, i.e. empty files.
$processedFiles = gci -Path $minDir | where {($_.Length) -gt 0}
Then, for each processed file, I find the original HTML file by comparing their names.
if ($processedFile.Name -eq $origFile.Name) {
When I find a match, I work out the minified file’s size as a percentage of the original and add it to an array. For example, a 100,000-byte original minified to 76,000 bytes gives (100/100000)*76000 = 76 – i.e. the minified file is 76% of the original’s size.
$percentDifference = (100/$origFile.Length)*$processedFile.Length
$results.Add($percentDifference) > $null
Finally, I average all the values in this array, subtract the result from 100 to get the overall percentage saving, and output it.
Write-Host (100-($results | Measure-Object -Average).Average)%
I’m sure there’s a better way of doing this that doesn’t involve a nested `for` loop, but I don’t really mind inefficiency for a one-off task.
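For what it’s worth, here’s one nested-loop-free way, sketched in Node rather than PowerShell (directory names are assumptions):

```js
// compare.js – average percentage saving of minified vs original files
var fs = require('fs');
var path = require('path');

var origDir = 'html';
var minDir = 'min';
var savings = [];

fs.readdirSync(minDir).forEach(function (file) {
  var minSize = fs.statSync(path.join(minDir, file)).size;
  var origPath = path.join(origDir, file);

  // skip empty outputs and files with no matching original
  if (minSize === 0 || !fs.existsSync(origPath)) return;

  var origSize = fs.statSync(origPath).size;
  savings.push(100 - (100 / origSize) * minSize); // percentage saved for this file
});

var average = savings.reduce(function (a, b) { return a + b; }, 0) / savings.length;
console.log(average.toFixed(2) + '%');
```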
Summary
The average saving across the 8,900 HTML files downloaded and successfully processed was approximately 24.32%!
That’s surely worth investigating for your own site, isn’t it?
There are `gulp` and `grunt` wrappers for this process, so you can even automate it as part of your build or deploy process.
Alternatives
html-minifier is not the only HTML minifier; you can even write your own if you like, which is what Dean Hume has previously done, specifically for .NET MVC Razor files. If you try this, you need to be aware of the many edge-case scenarios (IE conditional comments, for example).
Good luck!
Comments

> If you are using GZIP then I assume the gains are much less impressive? A re-run with gzip enabled would be useful 🙂

> It would be interesting to compare pages with compression: how big they were compressed, how big they were uncompressed, then both of those statistics after minification. I once worked on a web service with XML in/out. Management wanted me to change my clean, easy-to-read tag names to single characters “for efficiency”. Once I showed (using real numbers) that it made ~1% difference because of compression, the issue was dropped.

> I tested it. Pages on https://www.peterbe.com are generated by Django, stored to disk, and served by Nginx as flat files. Nginx will gzip-compress if the client supports it.
>
> The https://www.peterbe.com/plog page is huge: 298KB of pure HTML on disk. After I ran html-minifier on it, it became 250KB on disk.
>
> Then I gzipped both files: 64KB vs. 62KB. In other words, a saving of a whopping 2KB!
>
> Conclusion: not really worth it, considering the risk that it might break something I haven’t foreseen.
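For anyone wanting to try the gzip comparison the commenters suggest, Node’s built-in zlib makes it a one-minute job (the file names here are placeholders):

```js
// gzip-compare.js – compare raw vs gzipped sizes of two files
var fs = require('fs');
var zlib = require('zlib');

['www.example.com.html', 'www.example.com.min.html'].forEach(function (file) {
  var raw = fs.readFileSync(file);
  var gz = zlib.gzipSync(raw); // gzip with zlib's default settings
  console.log(file + ': ' + raw.length + ' bytes raw, ' + gz.length + ' bytes gzipped');
});
```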