Code, Computers & Random Junk

Converting “.webarchive” to “.mht”

I found a great PHP script to convert all my .webarchive files to .mht.

When I started using Firefox as my default browser in OS X (some years ago), I also switched from webarchives to .mht (MHTML), so that I can use the files in both Linux and OS X. Now that I'm moving to Linux as my main OS, I need to convert all my old archives, since all my work will take place there and OS X will (initially) stay around as a backup/fallback/compatibility system for some time.

I have a good/bad habit of saving a lot of pages when browsing around. Since I install/compile most things manually, I first need to find guides, howtos and resources to learn from. That means a lot of research. So it's always nice to have some references available offline, or for when I can't find a page again. It has also saved me a few times when a page was no longer online: it had been removed, or the domain was gone. Anyway, over the years this has become quite a large number of files that I now need to convert. While I'm still using OS X on the side they will work there, but not in Linux. I have already converted some of the files manually, but that is too time-consuming considering how many there are.

So, I found this gist - a PHP script that uses the CFPropertyList classes. But webarchive2mhtml.php needs to be edited first to work. It's ~4 years old, while CFPropertyList is more up to date.

The changes I made were:

  • Added namespace to work with the “CFPropertyList” files.
  • Changed the date to use the webarchive’s timestamp instead.

    When you open a .mht, there's an icon in the address bar you can click on. It'll show you the file location, where it was saved from …and the date. I don't want today's date added there. Seeing the original date also tells you how old and/or irrelevant the file may be.

  • Removed (any) linebreaks in output of $cData.

    After the first test run (≈ 10 archives) I noticed that most styling was broken, and half-broken </tags> were randomly displayed in a few places. The converted code had been wrapped into lines of roughly 80 characters, which broke the URLs, styling etc.

  • And also processed the file with fmt to format the code (PSR-2).

So, here’s how to use it…

Getting the files

First, download or clone the CFPropertyList files. We're not going to use all of them, just the classes. Then get the gist I made of this (incl. the new webarchive2mht.php).

Example setup using ~/Repos/w2m

cd ~/Repos

Clone CFPropertyList

git clone

Clone my gist, as w2m

git clone w2m

Symlink the class folder from CFPropertyList

cd w2m
ln -s ../CFPropertyList/classes/CFPropertyList .
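
Before converting anything, it may be worth checking that the folder really looks the way the steps above describe. The following is only a sketch: check_w2m_setup is a hypothetical helper name (it is not part of the gist), and it assumes the script is called webarchive2mht.php and that the symlink is named CFPropertyList, as in the example setup.

```shell
#!/bin/sh
# check_w2m_setup DIR
# Hypothetical helper: verifies that DIR holds the converter script and a
# working CFPropertyList link. Both names are taken from the example setup.
check_w2m_setup() {
    dir="${1:-.}"
    status=0

    if [ ! -f "$dir/webarchive2mht.php" ]; then
        echo "missing: $dir/webarchive2mht.php" >&2
        status=1
    fi

    # -d follows symlinks, so a dangling link fails this test too.
    if [ ! -d "$dir/CFPropertyList" ]; then
        echo "missing or broken link: $dir/CFPropertyList" >&2
        status=1
    fi

    return "$status"
}
```

After defining the function, something like `check_w2m_setup ~/Repos/w2m && echo "setup looks ok"` should tell you whether the symlink step worked.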

Converting archives

Always run the Terminal commands from the source folder of the PHP script (~/Repos/w2m). I didn't want to convert everything at once, so I used it on a few selected folders first to check that the results looked ok. Also… sometimes I label files with a few different colors to mark a good/better page compared to the other ones saved, so I needed to get the colors to move along with the .mht archives. A manual process, of course. Then I could run a delete command to remove the .webarchives once the .mht files were created and ok.

Here are the commands I used:


$ find /path/to/foobar -name '*.webarchive' -exec php webarchive2mht.php \{\} \;
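
If you want to know how much work a folder holds before converting it, a small counter can help you pick the test folders. This is just a sketch; count_webarchives is a made-up helper name, not something from the gist.

```shell
#!/bin/sh
# count_webarchives DIR
# Hypothetical helper: prints how many .webarchive files live under DIR.
# It counts lines, so a file name containing a newline would be counted twice.
count_webarchives() {
    find "$1" -type f -name '*.webarchive' | wc -l | tr -d ' '
}
```

For example, `count_webarchives /path/to/foobar` prints a single number, which makes it easy to start with a small folder.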


If you want to delete your old .webarchive files…

$ find /path/to/foobar -name '*.webarchive' -exec rm -v '{}' \;
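
The command above deletes every .webarchive it finds, converted or not. A more careful variant could look like the sketch below. It rests on an assumption that you should check against your own results first: that the converted file sits next to the original with the same name and an .mht extension. remove_converted is a hypothetical helper name.

```shell
#!/bin/sh
# remove_converted DIR
# Hypothetical helper: deletes a .webarchive only when a non-empty .mht with
# the same base name exists in the same folder (assumed output naming).
remove_converted() {
    find "$1" -type f -name '*.webarchive' -exec sh -c '
        for src in "$@"; do
            mht="${src%.webarchive}.mht"
            if [ -s "$mht" ]; then
                # -s is true only for an existing file larger than 0 bytes.
                rm -v "$src"
            else
                echo "kept (no .mht found): $src" >&2
            fi
        done
    ' sh {} +
}
```

Running `remove_converted /path/to/foobar` then leaves any archive without a matching .mht in place and lists it on stderr, so you can see what still needs converting.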

To use on a full disk - same thing:

$ find /Volumes/DiskName -name '*.webarchive' -exec php webarchive2mht.php \{\} \;
$ find /Volumes/DiskName -name '*.webarchive' -exec rm -v '{}' \;


Here’s an image showing a random old file, but with correct date (when it was saved as a webarchive), instead of when it was converted into .mht.


To use and integrate MHT in OS X, there are a few extras.

  • In Firefox I use this add-on: unMHT. There is another one, but I think this one is doing a better job. I also like the “multi-save” function.
  • There’s a QuickLook-plugin as well. Perfect to Quicklook the archives, and it also integrates with Spotlight …both search/indexing and previews: ql_unmht

    Note: There’s a notice about the QuickLook plugin and Gatekeeper with >= 10.9.5. See known problems.

Since I'm using both Linux and OS X, though I'm on my way to Linux, it was really great to find this script. Now I can move over the last parts without the stress of finding the time to convert everything. All my webarchives are gone now.

Updated post: 2016-01-06 - Included the gist info, and matching instructions.

Happy hacking…