In a previous post about the LZMA
compression algorithm, I made a negative comment about zip archives
and moved on. I would like to go into more detail about it now.
A zip archive serves three functions all-in-one: archiving, compression,
and encryption. On a unix-like system, these functions would normally be
provided by three separate tools, like tar, gzip/bzip2, and GnuPG. The
unix philosophy says to "write programs that do one thing and do it
well".
So in the case of zip archives, we are doing three things poorly when,
instead, we should be using three separate tools that each do one
thing well.
When we use three different tools, our encrypted archive is a lot like
an onion. On the outside we have encryption. After we peel that off by
decrypting it, we have compression, and after removing that layer,
finally the archive. This is reflected in the filename:
.tar.gz.gpg. As a side note, if GPG didn't already
support it, we could add base-64 encoding if needed as another layer
on the onion: .tar.gz.gpg.b64.
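To make the onion concrete, here is a minimal sketch in Python that builds and peels a .tar.gz.b64 using only the standard library. The encryption layer is left out because the standard library has no GnuPG equivalent, and the names wrap and unwrap are made up for this illustration.

```python
import base64
import gzip
import io
import tarfile


def wrap(files):
    """Archive, then compress, then base-64 encode: a .tar.gz.b64 onion."""
    tar_bytes = io.BytesIO()
    with tarfile.open(fileobj=tar_bytes, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    compressed = gzip.compress(tar_bytes.getvalue())
    return base64.b64encode(compressed)


def unwrap(blob):
    """Peel the layers off in the reverse order they were applied."""
    compressed = base64.b64decode(blob)
    tar_bytes = gzip.decompress(compressed)
    files = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar.getmembers():
            files[member.name] = tar.extractfile(member).read()
    return files


original = {"a.txt": b"hello\n", "b.txt": b"world\n"}
assert unwrap(wrap(original)) == original
```

Each layer only sees opaque bytes from the layer below it, which is why any one of them can be swapped out without the others noticing.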
By using separate tools, we can also swap different tools in and out
without breaking any spec. Previously I mentioned using LZMA, which
could be used in place of gzip or bzip2. Instead of
.tar.gz.gpg you can have .tar.lzma.gpg. Or
you can swap out GPG for encryption and use, say, CipherSaber as
.tar.lzma.cs2. If we use a single one-size-fits-all
format, we are limited by the spec.
Compression
Both zip and gzip basically use the same compression algorithm. The
zip spec actually allows for a variety of other compression
algorithms, but you cannot rely on other tools to support them.
Zip archives are also inside out. Instead of solid
compression, which is what happens in tarballs, each file is
compressed individually. Redundancy between different files cannot be
exploited. The equivalent would be an inside out tarball:
.gz.tar. This would be produced by first individually
gzipping each file in a directory tree, then archiving them with
tar. This results in larger archive sizes.
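The size difference is easy to see with a contrived example. The sketch below, assuming 20 small files with identical, otherwise incompressible contents (a best case for solid compression), compares a .tar.gz against an inside out .gz.tar. The function names are made up for this illustration.

```python
import gzip
import io
import random
import tarfile


def make_files(count=20, size=2000, seed=0):
    """Return `count` files that all share one pseudo-random payload."""
    rng = random.Random(seed)
    payload = bytes(rng.getrandbits(8) for _ in range(size))
    return {f"file{i:02d}.bin": payload for i in range(count)}


def build_tar(members):
    """Pack a {name: bytes} mapping into an uncompressed tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def solid_size(files):
    """.tar.gz: archive first, then compress the whole stream at once."""
    return len(gzip.compress(build_tar(files)))


def inside_out_size(files):
    """.gz.tar: compress each file on its own, then archive the results."""
    gzipped = {name + ".gz": gzip.compress(data) for name, data in files.items()}
    return len(build_tar(gzipped))


files = make_files()
print("solid (.tar.gz):     ", solid_size(files), "bytes")
print("inside out (.gz.tar):", inside_out_size(files), "bytes")
```

The inside out number is inflated a little by tar's block padding, but most of the gap comes from compressing the same payload 20 separate times. Real directory trees have far less redundancy than this, so expect a smaller gap in practice.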
However, there is an advantage to inside out archives: random
access. We can access a file in the middle of the archive without
having to take the whole thing apart. In general use, this sort of
thing isn't really needed, and solid compression would be more useful.
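Random access looks like this with Python's zipfile module. This is only a sketch with made-up helper names: the module looks the member up in the archive's central directory and decompresses just that one entry.

```python
import io
import zipfile


def build_zip(files):
    """Pack a {name: bytes} mapping into an in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def read_one(zip_bytes, name):
    """Read a single member without decompressing the other members."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(name)


archive = build_zip({"first.txt": b"1" * 1000, "middle.txt": b"target", "last.txt": b"3" * 1000})
assert read_one(archive, "middle.txt") == b"target"
```

Doing the same with a .tar.gz means decompressing the stream from the start until the wanted member turns up.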
Archive
In a zip archive, timestamp resolution is limited to 2 seconds, which
is based on the old FAT filesystem time resolution. If your system
supports finer timestamps, you will lose information. But really, this
isn't a big deal.
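The 2 second resolution is easy to observe. In this sketch (the helper name is made up), Python's zipfile module writes an entry with an odd number of seconds and reads it back. Some tools add extra fields with finer timestamps, but the basic header field is what is shown here.

```python
import io
import zipfile


def stored_timestamp(date_time):
    """Write one entry with the given timestamp and return what the archive kept."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(zipfile.ZipInfo("note.txt", date_time=date_time), b"x")
    with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
        return zf.getinfo("note.txt").date_time


# 45 seconds goes in, 44 seconds comes out: the odd second is lost.
print(stored_timestamp((2009, 3, 15, 12, 30, 45)))
```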
It also does not store file ownership information, but this is also
not a big deal. It may even be desirable as a privacy measure.
Actually, the archive part of zip seems to be pretty reasonable, and
better than I thought it was. There don't seem to be any really
annoying problems with it.
Tar still has advantages over zip. Zip doesn't quite allow the same
range of filenames as unix-like systems do, but it does allow
characters like * and ?. What happens when you extract files with
names containing these characters on an inferior operating system that
forbids them will depend on the tool.
Encryption
Encryption is where zip has been awful in the past. The original
spec's encryption algorithm had serious flaws and no one should even
consider using it today.
Since then, AES encryption has been worked into the standard and
implemented differently by different tools. Unless the same zip tool
is used on each end, you can't be sure AES encryption will work.
By placing encryption as part of the file spec, each tool has to
implement its own encryption, probably leaving out considerations like
using secure memory. These tools are concentrating on archiving and
compression, and so encryption will likely not be given a solid
effort.
In the implementations I know of, the archive index isn't encrypted,
so someone could open it up and see lots of file metadata, including
filenames.
When you encrypt a tarball with GnuPG, you have all the flexibility of
PGP available. Asymmetric encryption, web of trust, multiple strong
encryption algorithms, digital signatures, strong key management,
etc. It would be unreasonable for an archive format to have this kind
of thing built in.
Conclusion
You are almost always better off using a tarball rather than a zip
archive. Unfortunately the receiver of an archive will often be unable
to open anything else, so you may have no choice.
Any developer that uses a non-toy operating system will be familiar
with gzip and bzip2 tarballs (.tar.gz, .tgz, and
.tar.bz2). Most places will provide both versions so that the user can
use his preferred decompresser.
Both types are useful because they make tradeoffs at different points:
gzip is very fast with low memory requirements and bzip2 has much
better compression ratios at the cost of more memory and CPU
time. Users of older hardware will prefer gzip, because the benefits
of bzip2 are negated by the long decompression times, around 6 times
longer. This is why OpenBSD prefers
gzip.
But there is a new compression algorithm in town. Well, it has been
around for about 10 years now, but, if I understand correctly, was
patent encumbered (aka useless) for a while. It is called the Lempel-Ziv-Markov chain
algorithm (LZMA). It is still maturing and different programs that
use LZMA still can't quite talk to each other. 7-zip and LZMA Utils are a couple of examples.
GNU tar added an
--lzma option just last April, and finally gave it a
short option, -J, this past December. I take this as a
sign that LZMA tarballs (.tar.lzma) are going to become common over
the next several years. It also would seem that the GNU project has
officially blessed LZMA.
And not only that, I think LZMA tarballs will supplant bzip2
tarballs. The reason is that it is even more asymmetric than bzip2.
According to the LZMA Utils page, LZMA compression ratios are 15%
better than those of bzip2, but at the cost of being 4 to 12 times
slower on compression. In many applications, including tarball
distribution, this is completely acceptable because decompression
is faster than bzip2! There is an extreme asymmetry here that can
readily be exploited.
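If you want to check the asymmetry on your own data, here is a small measuring harness using Python's zlib, bz2, and lzma modules. It is only a sketch: the measure function is made up for this illustration, the numbers depend heavily on the input and the machine, and the Python bindings are not the same code as the command line tools, so don't expect it to reproduce the figures above exactly.

```python
import bz2
import lzma
import time
import zlib


def measure(data):
    """Return {codec: (compressed size, compress secs, decompress secs)}."""
    codecs = {
        "gzip (deflate)": (lambda d: zlib.compress(d, 9), zlib.decompress),
        "bzip2": (lambda d: bz2.compress(d, 9), bz2.decompress),
        # FORMAT_ALONE is the legacy .lzma container rather than .xz
        "lzma": (
            lambda d: lzma.compress(d, format=lzma.FORMAT_ALONE),
            lambda d: lzma.decompress(d, format=lzma.FORMAT_ALONE),
        ),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        middle = time.perf_counter()
        unpacked = decompress(packed)
        end = time.perf_counter()
        assert unpacked == data  # every codec must round-trip the input
        results[name] = (len(packed), middle - start, end - middle)
    return results


if __name__ == "__main__":
    # Placeholder input; a real source tarball makes a much better test.
    sample = b"The quick brown fox jumps over the lazy dog.\n" * 20000
    for name, (size, c_secs, d_secs) in measure(sample).items():
        print(f"{name:15s} {size:8d} bytes  compress {c_secs:.4f}s  decompress {d_secs:.4f}s")
```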
So, when a developer has a new release he tells his version control
system, or maybe his build system, to make a tar archive and compress
it with LZMA. If he has a computer from this millennium, it won't take
a lifetime to do, but it will still take some time. Since it could
take as much as two orders of magnitude longer to make than a gzip
tarball, he could make a gzip tarball first and put it up for
distribution. When the LZMA tarball is done, it will be about 30%
smaller and decompress almost as fast as the gzip tarball (but while
using a large amount of memory).
At this point, why would someone download a bzip2 archive? It's bigger
and slower. Right now possible reasons may be a lack of an LZMA
decompresser and/or lack of familiarity. Over time, these will both be
remedied.
Don't get me wrong. I don't hate bzip2. It is a very interesting
algorithm. In fact, I was breathless when I first understood the
Burrows-Wheeler transform, which bzip2 uses at one stage. I would
argue that bzip2 is more elegant than gzip and LZMA because it is less
arbitrary. But I do think it will become obsolete.
Unfortunately, the confused zip archive is here to stay for now
because it is the only compression tool that a certain popular, but
inferior, operating system ships with. I say "confused" because it
makes the mistake of combining three tools into one: archive,
compression, and encryption. As a result, instead of doing one thing
well it does three things poorly. Cell phone designers also make the
same mistake. Fortunately I don't have to touch zip archives often.
Finally, don't forget that LZMA is mostly useful where the asymmetry
can be exploited: data is compressed once and decompressed many
times. Take the gitweb interface, which provides access to a git
repository through a browser. I
run one myself. It will provide a gzip tarball of any commit on
the fly. It doesn't do this by having all these tarballs lying around,
but creates them on demand. Data is compressed once and decompressed
once. Because of this, gzip is, and will remain, the best option for
this setting.
In conclusion, consider creating LZMA tarballs next time, and don't be
afraid to use them when you come across them.
Another useful program I use every week is GNU Screen. It
provides multiple virtual terminals inside a single terminal. It's a bit like a
window manager for a text terminal. If you are a command line junkie
(and if you are at all serious about computing, you should be), this
is an essential piece of software.
The main reason I use screen is for its persistence. If I am running a
long-running job on a remote machine (i.e. over ssh), like a large
apt-get upgrade, I'll put it in a screen session. This
way I can log out and, later, log in from anywhere and check on it. I
have even used it to persist nethack
sessions, though this isn't really necessary.
The only annoying part is that all of its mappings are underneath C-a
(ctrl+a), which is also a very common Emacs/bash command that I use a
lot. To get the effect of C-a inside screen, you have to type C-a
followed by a, because screen captures the first C-a.
If you don't already use it, try it out sometime.