Git is Better

I finally finished dumping the rest of my lingering Subversion repositories: I have converted them all to Git. If you manage a Subversion (or CVS, or Perforce, etc.) repository, you should consider doing the same. Git became my version control system (VCS) of choice in June and I haven't looked back since.

Why? Because Git is better.

Yes, it really is. Much better.

Git is faster, smaller, more secure, and more powerful. This is a virtue of decentralized version control systems. Subversion is Blub.

It all starts with the Source Code Control System (SCCS) and the Revision Control System (RCS). These systems could only track single files, which created headaches for projects with multiple files being worked on by multiple people.

Then came Concurrent Versions System (CVS), which improved things slightly, but still sucked. It still really only tracks individual files.

Now, CVS did anonymous reads, allowing anyone to access the repository and see code history. OpenBSD was the first code base to take advantage of this. These days, coming across projects that don't give public read access to their repository seems backwards.

Not using any of these systems would probably be better than using them. Their flaws are obvious as soon as you start using them.

Finally, in the year 2000, Subversion arrived. It's a huge step up from CVS, fixing many of its problems. It has a much better interface and uses atomic commits, finally tracking more than one file at a time. But we still need to talk to some server every time we want to do something. Branching and merging also suck so much that no one wants to use them. But branching is overrated, right? Wrong. I use branches all the time, now that they are easy.

The reason branching sucks in Subversion can be explained with a famous quote attributed to Albert Einstein:

Make everything as simple as possible, but not simpler.

Instead of implementing tagging, branching, and merging, the Subversion guys just implemented "cheap copy". It's a pretty clever idea, but in practice it doesn't work out well. It's too simple.

CVS solves the wrong problem, and Subversion solves the right problem wrongly.

Since Subversion, a number of decentralized VCSs have arisen. We have GNU arch (2001), monotone (2002), darcs (2003), Bazaar (2005), Git (2005), Mercurial (2005), and fossil (2007). I played around with all of these when looking for a distributed VCS (except fossil) and none struck me the same way that Git did. I would recommend most of them over Subversion.

Distributed VCS has gotten a lot of attention in the past couple of years, which has much to do with the Linux kernel switching to one (Git). In fact, Git and Mercurial were written precisely for this event. Since then, some major projects have been switching to Git, or at least to some sort of distributed VCS: Perl, Ruby on Rails, Android, WINE, Fedora, and VLC, to name a few.

You can also see the chatter on the Internet about Git. It is really popular with fresh, innovative projects, like the Arc programming language. It's pretty easy to accidentally run into various Git tutorials on the web. It has a real presence.

No Authority. But why distributed VCS? Why are they better?

First of all, when you "check out" a repository from a distributed VCS, it's really a "clone" operation, which is what most of them call it. You get everything. After that, the only reason you need to talk upstream, which really isn't "up" anymore, is if either end has updates to the code they wish to share. The only way one clone might be more important than another is human politics. Technically they are equals.

Small. But won't this be huge? A Subversion repository can easily be several gigabytes. That would be a lot to transfer on the initial clone.

Actually, distributed VCSs are extremely efficient. A Git clone will usually be smaller than a Subversion checkout. For example, I once cloned Freeciv's Subversion repository using Git (converting it to Git). It was about 15000 revisions. The bare version of the Git repository, containing all ~15000 commits, was half the size of the Subversion checkout, which contained only a single revision! The non-bare version was still smaller by a few megabytes. I can't even imagine how much space the server was using.

I would have some numbers on this example, but, alas, that clone was lost on a failed hard drive and it took me a week to make. Note, Git clones of Git repositories aren't that slow: Subversion isn't optimized for cloning, and the Freeciv Subversion server is extremely overloaded.

Update: I managed to get another clone, and it only took me a couple hours. The Freeciv Subversion checkout at revision 15574 is 281MB. Remember, this contains just one single revision. The Git clone after a repack and garbage collection, which contains all 15574 revisions, is 225MB. It's 56MB smaller! If I told it to leave out the Subversion metadata it would be even smaller than that. On the server side, the Subversion repository likely takes up gigabytes. And finally, to add insult to injury, the Git "bare" clone is 144MB.
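For the skeptical, here is the arithmetic behind those figures as a small Python sketch. The sizes are the ones I measured above; the variable names are just mine for illustration.

```python
# Sizes in MB, taken from the measurements above.
svn_checkout = 281  # a single revision (r15574)
git_clone = 225     # all 15574 revisions, after repack and garbage collection
git_bare = 144      # the same history without a working tree

saved = svn_checkout - git_clone
clone_ratio = git_clone / svn_checkout
bare_ratio = git_bare / svn_checkout

print(f"Git clone is {saved}MB smaller than the Subversion checkout")
print(f"Git clone is {clone_ratio:.0%} of the checkout's size")
print(f"Git bare clone is {bare_ratio:.0%} of the checkout's size")
```

In other words, the entire history in a bare clone takes roughly half the space of one checked-out Subversion revision.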

Someone does have an example over here: Git's Major Features Over Subversion.

The Mozilla project's CVS repository is about 3 GB; it's about 12 GB in Subversion's fsfs format. In Git it's around 300 MB.

Git's packing format is fairly simple, yet so effective.

Fast. Well, duh. With everything being local, operations that work on multiple revisions will be fast. Beyond this, decentralized VCS is generally faster on all operations, except the initial clone.

Reduced politics. With a central repository, someone or some group has to decide who has write access and who doesn't. Developers without write access are basically stuck without version control, unless they hack in their own. In the decentralized model, everyone has write access to their own personal repository, and others can choose, on their own, to pull revisions from it.

Secure. A centralized VCS has a central, single point of failure. If that single point is compromised, the server needs to be restored from backups. Or worse, the compromise goes unnoticed and the repository history is modified without anyone ever being able to tell.

In a distributed model, each revision (and in Git's case, each file as well) is referenced by a hash (SHA-1 in Git's case) of its contents, which makes the repository a content-addressable storage system. Thanks to this, a file, no matter where it is in the tree or in history, is stored only once. The main purpose is to avoid collisions between revision identifiers on parallel lines of development. It also happens to make the repository tamper-proof.
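To make content addressing concrete, here is a minimal Python sketch. The `git_blob_id` function follows the way Git names a file's contents (SHA-1 over a short `blob <size>` header plus the bytes). The `ObjectStore` class is a toy of my own for illustration, not how Git lays things out on disk.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 ID Git assigns to a file (blob) with this content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

class ObjectStore:
    """Toy content-addressable store: objects are keyed by their own hash."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        oid = git_blob_id(content)
        self.objects[oid] = content  # storing identical content again changes nothing
        return oid

store = ObjectStore()
first = store.put(b"hello world\n")   # a file in one revision
second = store.put(b"hello world\n")  # same bytes at another path or revision
other = store.put(b"goodbye\n")

print(first)               # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(first == second)     # True
print(len(store.objects))  # 2
```

Three files went in, but only two objects are stored, because two of the files were byte-for-byte identical.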

If you know the revision ID of your HEAD no one will be able to change any of its history. This is because each revision contains the ID of its immediate ancestor, all the way back to the initial commit. If a previous commit changed, it would change the ID of every following commit. An attacker would have to find a desired collision for each one: simply impossible.
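The chaining is easy to demonstrate with a toy history in Python. This is only a sketch: the commit format here (a parent ID followed by a description string) is a simplification I made up, not Git's real commit object format, but the principle is the same.

```python
import hashlib

def commit_id(parent_id: str, snapshot: str) -> str:
    """ID of a toy commit: a hash over its parent's ID and its own content."""
    data = f"parent {parent_id}\n{snapshot}".encode()
    return hashlib.sha1(data).hexdigest()

def build_history(snapshots):
    """Return the list of commit IDs for a linear history, oldest first."""
    ids = []
    parent = ""  # the initial commit has no parent
    for snapshot in snapshots:
        parent = commit_id(parent, snapshot)
        ids.append(parent)
    return ids

original = build_history(["initial import", "add parser", "fix bug"])
tampered = build_history(["initial import", "add backdoor", "fix bug"])

print(original[0] == tampered[0])    # True: the untouched commit keeps its ID
print(original[1] == tampered[1])    # False: the altered commit gets a new ID
print(original[-1] == tampered[-1])  # False: and so does everything after it
```

Anyone who remembers the original HEAD ID will notice the mismatch immediately, even though the final commit's own content ("fix bug") was never touched.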

The hash addressing also provides integrity, as corruption in the repository is easily detected.
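Detecting corruption amounts to re-hashing every object and comparing the result with the name it is stored under. Git's own tool for this is `git fsck`; the Python below is just a sketch of the idea, with a plain dictionary standing in for the object database and a hypothetical `find_corrupt` helper.

```python
import hashlib

def object_id(content: bytes) -> str:
    """Name an object by the SHA-1 of its content."""
    return hashlib.sha1(content).hexdigest()

def find_corrupt(objects: dict) -> list:
    """Return the IDs whose stored content no longer hashes to that ID."""
    return [oid for oid, content in objects.items() if object_id(content) != oid]

# Build a tiny store of two objects.
objects = {}
for content in (b"int main(void) { return 0; }\n", b"all: main\n"):
    objects[object_id(content)] = content

print(find_corrupt(objects))  # []

# Simulate a single flipped bit on disk in one object.
victim = next(iter(objects))
damaged = bytearray(objects[victim])
damaged[0] ^= 0x01
objects[victim] = bytes(damaged)

print(find_corrupt(objects) == [victim])  # True
```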

Another security gain, related to the reduced politics note, is the web of trust. This is the same way PGP handles key authentication. In a large project, a single developer may only trust a handful of people to be competent programmers, and therefore only pull from those developers' repositories. The developers they pull from also have their own set of people they trust. In this way, revisions can safely be pulled from distant, strange repositories through the web of trust.
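The web of trust is just reachability in a graph of who pulls from whom. Here is a small Python sketch; the developer names and the `pulls_from` table are entirely hypothetical.

```python
from collections import deque

# Hypothetical project: each developer and the people they pull from directly.
pulls_from = {
    "maintainer": ["alice", "bob"],
    "alice": ["carol"],
    "bob": ["carol", "dave"],
    "carol": [],
    "dave": ["erin"],
    "erin": [],
    "mallory": ["erin"],  # nobody pulls from mallory
}

def reachable_sources(start, graph):
    """Everyone whose revisions can reach `start` through a chain of pulls."""
    seen = set()
    queue = deque(graph.get(start, []))
    while queue:
        dev = queue.popleft()
        if dev in seen:
            continue
        seen.add(dev)
        queue.extend(graph.get(dev, []))
    return seen

print(sorted(reachable_sources("maintainer", pulls_from)))
# ['alice', 'bob', 'carol', 'dave', 'erin']
```

The maintainer never pulls from erin directly, yet erin's work can arrive through dave and bob, each of whom vouched for the next link. Nothing from mallory gets in, because no one in the chain trusts that repository.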

The only reason to interact with a Subversion repository is for legacy reasons. Luckily, you don't have to use Subversion to use Subversion.

That wasn't a typo. Git has a Subversion/Git interface called git-svn. I used it to convert my Subversion repositories to Git, but it can be used as a fully functional Subversion client. It can clone the Subversion repository and continue to pull changes from it as it updates.

On your end, you can make commits to your local repository, use cheap branches, and so on, all of which stay local. Changes can be pushed back upstream to Subversion with the dcommit command, which would be done after rebasing any changes on top of the current Subversion HEAD. This provides most of the advantages of Git without worrying about having the central repository change.
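One cycle of that workflow might look roughly like the following, sketched in Python as a list of commands so the order is explicit. The `git svn clone`, `git svn rebase`, and `git svn dcommit` subcommands are real; the URL, directory, and branch name are placeholders I made up, and the `run` helper is an assumption about how you might drive it.

```python
import subprocess

def git_svn_workflow(svn_url, workdir, branch):
    """Return the commands for one clone / hack / dcommit cycle, in order."""
    return [
        ["git", "svn", "clone", svn_url, workdir],  # import the Subversion history
        ["git", "checkout", "-b", branch],          # cheap local branch
        # ... edit and `git commit` as often as you like; all of it stays local ...
        ["git", "svn", "rebase"],                   # replay local commits onto Subversion HEAD
        ["git", "svn", "dcommit"],                  # send each local commit upstream
    ]

def run(commands, workdir):
    """Execute the commands: the clone in the current directory, the rest inside it."""
    for i, cmd in enumerate(commands):
        subprocess.run(cmd, cwd=None if i == 0 else workdir, check=True)

# Hypothetical URL and names, for illustration only.
steps = git_svn_workflow("https://svn.example.org/project/trunk", "project", "my-feature")
for cmd in steps:
    print(" ".join(cmd))
```

Calling `run(steps, "project")` would actually execute the cycle against a real server; the snippet above only prints the commands.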

One of the major complaints about Git is that it once lacked the plethora of GUIs that CVS and Subversion have. Git does have GUIs. I only looked at a couple of them out of curiosity, so I am not sure how good they are by comparison. I have also barely used any other VCS GUIs. The ones I have used I find incredibly annoying.

I don't understand why people insist on using them anyway. It's like using training wheels on a bike and claiming that it's better that way. No, those training wheels just get in the way.

To be brutally honest, if you don't want to use Git because you are afraid of the command line, what are you doing coding in the first place?

One topic left is issue tracking. In a centralized VCS, you have a centralized tracker. Subversion has Trac, for example. Well, what about distributed VCSs? They should have distributed issue tracking, right?

I will go into this in my next post.

null program

Chris Wellons (PGP)