Advanced Git Use

by Alessandro Rubini

Reproduced and translated with permission of Linux & C, Edizioni Vinco.

The git package was originally written by Linus Torvalds and has since been maintained by other developers under the lead of Junio Hamano. The program is being adopted by an ever-increasing number of projects, from the kernel and U-Boot to Xorg and busybox. It belongs to the class of distributed version control systems.

A first introduction to using the package was published in this same magazine by Rodolfo Giometti. This article introduces some more advanced features, which are needed to interact with complex products, while trying to uncover the ideas used within the program. Box 1 summarizes the most important git commands, as a quick reference for less experienced readers.

Box 1 - Most Common Git Commands

The following list shows the most important commands for git users. Command arguments are not shown as there often are several alternate uses, with different argument lists:

Distributed version control systems are more and more widespread; git is not the first and won't be the last one. The ideas described in this article are also present, to varying degrees, in other packages, like mercurial; the aim of this text is not to show how git is superior to other tools, but rather to show interesting features that can be useful in other contexts as well. Git is the excuse, as it has been adopted quite widely.

To avoid ambiguity, we will never use the word tree to refer to a directory with files and subdirectories. In the git context the word is best reserved for the development history of a package, with all its ramifications.

The problem of verification

One of the problems most often found in managing big software projects is the imperfect matching of the working directories of different programmers: completely removing a directory, to restart from a known archive or repository, is common practice, though unpleasant. The same problem occurs when there are collisions while applying a patch, or when a source file has been damaged by some mishap.

Moving tens or hundreds of megabytes to recover a source package is certainly not a nice experience. Verifying that one's copy is exactly the same as the copy of a different developer is even worse if you can't directly transfer the files.

The solution to both such problems is the identification of every object managed within git with a control code: a number that is derived from the whole amount of related data through a non-invertible mathematical algorithm. Such a control code, called hash or message digest, describes (or summarizes) the datum or data set, and the algorithm is designed so that creating different data with the same hash is not feasible in practice.

The algorithm used in git is SHA1 (Secure Hash Algorithm 1), which returns a 160-bit hash. The number is usually represented as 40 hex digits. For example, if you want to group all files called COPYING in your system that are verbatim copies of the same license file, you can issue the command:

   locate COPYING | xargs sha1sum | sort

Within git, thus, every individual file, every directory and every commit is identified by its hash value. A specific point in the development history of a package is not referenced by a sequential version number, but rather by a unique 160-bit number. An object of type commit includes a reference to the contents of the directory it represents, the commit message and the hash of the previous commit. If two programmers have the same commit, they are sure to be accessing the same source package. During technical discussions on public mailing lists, referring to code or patches using the hash value or an abbreviation thereof is common practice.
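The identifier of a file can be reproduced by hand, which may help to see that nothing magic is involved. The following is a minimal sketch, assuming GNU sha1sum is installed; the file name demo.txt and the scratch directory are only illustrative. It relies on the fact that git computes the SHA1 over a short header (the word "blob", the size in bytes and a NUL byte) followed by the file content.

```shell
set -eu
# Scratch directory; demo.txt is a hypothetical example file.
work=$(mktemp -d)
cd "$work"
printf 'hello\n' > demo.txt

# The identifier git would assign to this content (a "blob" object).
git_id=$(git hash-object demo.txt)

# The same value computed by hand: header "blob <size>" plus a NUL
# byte, followed by the content, all fed to sha1sum.
size=$(wc -c < demo.txt | tr -d ' ')
manual_id=$( { printf 'blob %s\0' "$size"; cat demo.txt; } | sha1sum | cut -d' ' -f1 )

echo "git:    $git_id"
echo "manual: $manual_id"
```

The two printed values should be identical, and any other file with the same content gets the same identifier on any machine.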

Box 2 - Uniqueness of hash codes

Message digest algorithms warrant that the hash they return is unique by statistical probability. The most common such algorithms are SHA1 and MD5, which return values with a uniform distribution over the value space. By recalling that 2^10 is roughly equivalent to 10^3, we can say there's a chance of 1 in 4 billion for two 32-bit hashes to be equal. The number is on the order of 10^38 for 128-bit MD5 hashes and 10^48 for 160-bit SHA1 values.

Even with 1 million files, the chance for any two of them to have the same SHA1 hash is about 1 in 10^36, while for a billion files the chance is about 1 in 10^30. 10^30 is roughly the number of grains of sand in the mass of the whole earth; you can hardly say the hash is not unique.

Hash algorithms are designed such that a difference of even a single bit in the input data will perturb all output bits. So, in practice, the first digits of the hash are enough to identify a file and be decently sure of not picking the wrong one. Thus, git accepts abbreviated identifiers, as long as the abbreviation is unique within its database. Also, git only shows 7 digits if you ask it to show the abbreviated form of the hash.

Even within a project with 100 thousand objects, like the kernel, the chance for a 7-digit code to be ambiguous is about one in 2500. When picking objects within a project you don't need the hash you specify to be globally unique, just unique within the project: when the 7-digit code is ambiguous, git will show 8 digits or more, just enough to resolve the local ambiguity. It's somewhat like what happens at school: teachers call students by surname, and in case of ambiguity they add the initial of the first name, or more letters if needed. Please note that internally git always uses the codes in their entirety.
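The figures quoted in this box can be checked with a few lines of arithmetic. This is only a rough sketch, under the assumption that the numbers come from the usual birthday approximation (n^2 / 2^(bits+1) for n objects) and, for abbreviated codes, from the ratio between the number of objects and the 16^7 possible 7-digit prefixes; the helper name collision is made up for this example.

```shell
set -eu
# Birthday approximation: chance that any two of n objects share
# the same hash of the given width in bits.
collision() {
    awk -v n="$1" -v bits="$2" \
        'BEGIN { printf "%.1e\n", n * n / 2 ^ (bits + 1) }'
}

p_million=$(collision 1e6 160)   # one million files, full SHA1
p_billion=$(collision 1e9 160)   # one billion files, full SHA1

# A 7-digit hex prefix has 16^7 possible values; with 100000 objects
# a given prefix matches some other object about once in this many tries.
one_in=$(awk 'BEGIN { printf "%d\n", 16 ^ 7 / 100000 }')

echo "1 million files: $p_million"
echo "1 billion files: $p_billion"
echo "7-digit prefix ambiguous: about 1 in $one_in"
```

The results are within a small factor of the rounded values given in the text, which is all that such order-of-magnitude estimates are meant to convey.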

We could test the idea by checking the SHA1 hash of Linux-2.6.30 as extracted from a git repository and that of the official tar file. To do this we need to use git cat-file to look into the relevant commits:

   bash$ cd linux-2.6.git
   bash$ git checkout v2.6.30
   HEAD is now at 07a2039... Linux 2.6.30
   bash$ git cat-file commit 07a2039 | grep tree
   tree 0cea46e43f0625244c3d06a71d6559e5ec5419ca

   bash$ tar xjf linux-2.6.30.tar.bz2
   bash$ cd linux-2.6.30
   bash$ git init
   bash$ git add .
   bash$ git commit -m "from tar file"
   Created initial commit 7a6212e: from  tar file
   bash$ git cat-file commit 7a6212e | grep tree
   tree 0cea46e43f0625244c3d06a71d6559e5ec5419ca

Box 3 - The git cat-file command

Since git records all of its objects in compressed format, naming them according to their SHA1 hash, the tool offers the command "git cat-file" in order to print to stdout (like cat, as the name suggests) the contents of an object. The command receives two arguments: the object's type and its SHA1.

The command returns the contents of the file in case of blob objects, but for commit objects its output is a short text that includes the log message, the author's name, date of the commit in Unix format and the SHA1 codes of both parent and tree. The former is the previous commit in history and the latter is the directory of files described by this commit.

The user interface for git cat-file isn't what you'd call friendly. For example, if you git cat-file a tree object, you'll get binary data sent to the terminal. This happens because cat-file is one of the low-level subcommands, used by other subcommands to get their work done.

In git documentation, low-level commands like this one are called plumbing, while the ones meant to be typed by real users are called porcelain -- the user interface that sits above plumbing. Subcommands listed in box 1, for example, are part of the porcelain set.
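As a small illustration of the box above, the sketch below builds a throw-away repository and inspects its first commit. The repository, the README file and the "Demo" identity are all made up for the example; the -t and -p options of git cat-file (report the type, pretty-print the content) are used in addition to the two-argument form described in the text.

```shell
set -eu
# Throw-away identity so that "git commit" works in a clean environment.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
echo 'hello' > README
git add README
git commit -q -m 'first commit'

# Two-argument form: object type, then object name.
commit_text=$(git cat-file commit HEAD)

# -t asks git for the type; -p pretty-prints, which makes even
# tree objects readable on a terminal.
obj_type=$(git cat-file -t HEAD)
tree_listing=$(git cat-file -p 'HEAD^{tree}')

echo "$commit_text"
echo "$tree_listing"
```

The commit text shows the tree line together with author, committer and log message; a parent line appears only from the second commit onwards.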

As is apparent, the tree object contained in the two commits is the same one. This is enough to demonstrate that the two file sets, 400MB each, are identical even if retrieved in different ways.

With "git ls-tree" you can check what the internal git representation for such objects is; just like in a Unix directory, the object includes names and identifiers for other objects. Included objects of type blob are normal files, those of type tree are subdirectories. While a directory associates names to their inode numbers, which are unique within the filesystem, a tree object in git associates names to their hashes, which are unique globally. The content of each object, then, is stored in a file whose name is exactly the associated SHA1 value. A consequence of this approach is that objects in git are immutable: every modification, even of a single bit, creates a new object with a new hash and, thus, a new file.
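The immutability of objects is easy to observe in a scratch repository. In the sketch below (file names, contents and the "Demo" identity are invented for the example) one file is modified between two commits: its blob gets a new identifier, while the untouched file keeps the old one and is therefore shared by both commits.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
mkdir src
echo 'int main(void) { return 0; }' > src/main.c
echo 'demo' > README
git add .
git commit -q -m 'first'

# One line per entry: mode, type (blob or tree), hash, name.
listing=$(git ls-tree HEAD)
echo "$listing"

readme_before=$(git rev-parse HEAD:README)
main_before=$(git rev-parse HEAD:src/main.c)

# Modify one file only and commit again.
echo 'more text' >> README
git commit -q -a -m 'second'

readme_after=$(git rev-parse HEAD:README)
main_after=$(git rev-parse HEAD:src/main.c)

echo "README:     $readme_before -> $readme_after"
echo "src/main.c: $main_before -> $main_after"
```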

A secondary effect, and not a trivial one, of the massive use of hash codes is how easy signing becomes. If the author wants to sign a specific version of the software package, she can just sign the SHA1 that represents the whole source directory. To testify about the whole development, you just need to sign the last commit, which includes the whole directory and the previous commit, as already noted. Signing such hashes is done with the usual asymmetric-key tools.

That's why, if a developer signs a specific commit, whoever gets hold of the same commit or the tree it refers to can be certain about source code integrity by just verifying the signature, even if the sources have been retrieved from untrusted places. The only components that need to be trusted are the tools that create and check SHA1 hashes and signatures, i.e. git, gpg and other tools that are usually part of the operating system. All such tools are usually signed by the relevant package maintainers.

Creating a branch

The main difference between distributed and centralized version control systems is how easy the creation of branches is in distributed ones.

Personally, as a git user, I find that the concept of branch is very similar to the concept of tag, and I hope this idea won't upset those who know the internal representation of them both. We may say that a branch is like a tag because, just like "git tag v1.0" binds a meaningful name to the SHA1 name of the current development status, the command "git branch 1.0-fixes" binds a new symbolic name to the current status. Both such names can be used to retrieve that version by calling "git checkout <name>".

But a branch name is a moving tag: whenever the source directory is modified and the changes are committed, the tag name will keep referring to the original SHA1, while the branch name will move, following the development.
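The difference between the two kinds of names can be seen in a scratch repository. The sketch below reuses the names v1.0 and 1.0-fixes from the text; the repository, the file and the "Demo" identity are invented for the example.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
echo one > file
git add file
git commit -q -m 'first'

# Both names are bound to the same commit at this point.
git tag v1.0
git branch 1.0-fixes

# Commit on the branch: the branch name follows, the tag stays put.
git checkout -q 1.0-fixes
echo two >> file
git commit -q -a -m 'fix'

tag_id=$(git rev-parse v1.0)
branch_id=$(git rev-parse 1.0-fixes)
parent_id=$(git rev-parse '1.0-fixes^')

echo "tag    v1.0      -> $tag_id"
echo "branch 1.0-fixes -> $branch_id (parent $parent_id)"
```

After the commit, the tag still names the original commit, which is now the parent of the branch tip.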

Unless you are using a detached head, which is not common and not covered here, the current source status on the disk (the head position, in git wording) corresponds to one of the development branches. Thus every commit action corresponds to growth of one branch.

The name of the main branch, the one called trunk by other packages, is master. The master branch is created when you make the first commit ever, and it is not a special case at all. All branches are managed in the same way; you can rename or remove any branch you like, including master. Deleting a branch is like deleting a tag: all the git objects remain in the repository and you can retrieve them if you know their SHA1, at least until you garbage collect, a topic not covered in these pages.

To move from one branch to another you can use the command git checkout <branch>, but the program refuses to perform the task if uncommitted local modifications would be overwritten by the switch, to prevent losing your work in unexpected ways.

The idea that a branch is just a label, without any reference to where and how it got detached from the original branch, is a remarkable one: if during development you get to a dead end, you can always create a new branch from some place in past history and try a different way to attack your problem; if the new way turns out to be the winning one, you can delete the initial branch without any effect on the new branch. Deleting a branch is like deleting a tag: the only effect is that you can't reach the associated status through the mnemonic name any more. The fact that a new branch had been started from the one now deleted is irrelevant, as the tip of the branch that has been kept identifies the whole history from the project inception up to there, without any reference to other branches or splitting points.

Obviously, you can find out the differences, or the log, between your head and another branch, like the one you split from. Because branches refer to no other branch, but only record their own history, what the system does to compare branches is scan back the history of both until it finds a common point, a SHA1 value that is common to both branches. This match in hash values is the only information you can use to know that the two branches have a common ancestor, and such an ancestor can then be used as a starting point to find the commits that differentiate the two branches.
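The search for the common point is available directly as "git merge-base". The following sketch, with invented branch names (attempt-1, attempt-2), file and identity, creates two branches from the same commit, asks git for their common ancestor and then deletes the abandoned branch to show that the surviving one is unaffected.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
echo start > plan.txt
git add plan.txt
git commit -q -m 'common start'
start=$(git rev-parse HEAD)

# First attempt, later found to be a dead end.
git checkout -q -b attempt-1
echo 'first idea' >> plan.txt
git commit -q -a -m 'first idea'

# Second attempt, restarted from the same point in past history.
git checkout -q -b attempt-2 "$start"
echo 'second idea' >> plan.txt
git commit -q -a -m 'second idea'

# The common ancestor is found by walking back both histories.
fork=$(git merge-base attempt-1 attempt-2)
only_in_2=$(git rev-list --count attempt-1..attempt-2)

# Dropping the dead end does not touch the surviving branch.
git branch -q -D attempt-1
still_there=$(git rev-list --count attempt-2)

echo "fork point: $fork"
echo "commits only in attempt-2: $only_in_2"
echo "commits in attempt-2 after deleting attempt-1: $still_there"
```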

Such flexibility in branch management can easily lead a developer to have tens of branches in her tree; you must therefore be careful in choosing branch names, and remember to delete inactive branches or move them to another git tree; otherwise, you'll find it hard to keep track of your work.

Box 4 - Version numbers used in git

The identifiers used in git command lines to identify commits or other objects fall in several categories. The most useful and most used are:

Some commands can act on intervals (like git log or git diff). The most common expression for intervals is "v1..v2", which represents all commits that are reachable from v2 but not from v1. Since each commit allows you to trace back all of its past history, reachable refers to any ancestor of the specific version. Therefore, the notation .. identifies the history from v1 to v2, if v1 is an ancestor of v2, or from the splitting point up to v2 if the versions are in different branches.
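Both cases of the interval notation can be tried in a scratch repository. In this sketch the tags v1 and v2 follow the text, while the branch name side, the files and the "Demo" identity are invented; "git rev-list --count" simply counts the commits that "git log" would list for the same interval.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
echo 1 > f
git add f
git commit -q -m 'one'
git tag v1
echo 2 >> f
git commit -q -a -m 'two'
echo 3 >> f
git commit -q -a -m 'three'
git tag v2

# v1 is an ancestor of v2: the interval is the history from v1 to v2.
forward=$(git rev-list --count v1..v2)
# Nothing is reachable from v1 that is not also reachable from v2.
backward=$(git rev-list --count v2..v1)

# A branch that splits at v1: the interval starts at the splitting point.
git checkout -q -b side v1
echo s > s
git add s
git commit -q -m 'side work'
split=$(git rev-list --count side..v2)

git log --oneline side..v2
echo "v1..v2: $forward  v2..v1: $backward  side..v2: $split"
```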

Cleaning up and reordering the code history

Developers who write working code from scratch do not really exist; there are a very few exceptions, but we can't make tools that work just for them. Moreover, being able to devote oneself to a single problem until it is completely solved, while ignoring other issues, is something few people can afford. The net result of these internal and external limits to development activity is that in practice everyone writes code in a disorderly way. On one side we tend to add and then remove diagnostic messages and other tricks one may be ashamed of; on the other, the available time is split among several issues, alternating between them and temporarily abandoning each of them before it is completely solved, even if eventually it gets solved.

History in a working branch is thus often quite a mess of changes: commits about different logical problems alternate in a seemingly random way, and some diagnostic code fragments get added and then removed soon after. Before such a mess is delivered to the net and becomes part of the official history of computer science, the author needs to clean up. This means changing the relative order of the commits, collapsing several work steps into a single patch that fixes a bug or adds a feature in one step, and removing irrelevant modifications.

The tool git offers to this aim is "git rebase -i", where the i means "interactive". The command allows rewriting the history of the current branch, starting from a specified version. For example, "git rebase -i HEAD~10" allows reordering, collapsing and dropping anything since 10 commits ago. To do that, git opens a text editor on a file that hosts both the list of the last commits, one per line, and the instructions about how to edit it. The procedure is well described in that file, so I won't repeat it here.

Whoever wants to save the current status before daring a reordering can simply create a new branch and try rebasing the new one. Otherwise, one might just take note of the starting ID. It's always possible, at a later time, to ask git what the differences between the old and the new branch are, using git diff.
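The safety net described above can be sketched as follows. Since an interactive editor cannot be shown in a script, this example makes two assumptions that are not part of the article: it sets GIT_SEQUENCE_EDITOR to a GNU sed command that edits the commit list in place of a human (turning the second "pick" into "squash"), and it sets GIT_EDITOR to "true" so the combined log message is accepted as proposed. The branch name backup, the file and the "Demo" identity are invented.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_EDITOR=true   # accept proposed commit messages unchanged

repo=$(mktemp -d)
cd "$repo"
git init -q
for n in 1 2 3; do
    echo "line $n" >> notes.txt
    git add notes.txt
    git commit -q -m "step $n"
done

# Save the current status under another name before rewriting.
git branch backup

# Collapse the last two commits into one. A person would edit the
# list by hand; here sed plays the role of the editor (GNU sed syntax).
GIT_SEQUENCE_EDITOR="sed -i -e '2s/^pick/squash/'" \
    git rebase -i HEAD~2 >/dev/null 2>&1

commits=$(git rev-list --count HEAD)
changes=$(git diff backup HEAD)

echo "commits after rebase: $commits"
echo "differences from backup: ${changes:-none}"
```

The history is shorter, but the final content is the same as in the saved branch, which is what "git diff" confirms.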

Moving branches between trees: fetch, rebase, cherry-pick

Since the history of a package, including all branches, is hosted in the .git folder within the package itself, a common need is moving branches between different trees, both within the same disk and between remote computers.

To copy objects between different trees, git uses the fetch subcommand. Working on the receiving side, you tell on the command line what remote branch you want to download, as well as the name of the local branch where commits should be placed. The program retrieves the history of the remote branch and copies only the objects that are missing from the local tree. During the copy, any remote tag on the relevant branch will be reproduced locally.

The name you use for the local branch may already exist. Git will create the branch if it doesn't exist; otherwise it will grow the existing local branch. In this case the local branch should match the previous history of the branch being fetched, or it wouldn't be possible to copy the objects while preserving the local commits. When this happens, the error is "rejected: non fast-forward". Therefore, fetch can only grow a branch, without changing it in any other way, unless you explicitly force this behaviour.

Usually, the "small programmers" keep a local copy of the branches by "big programmers", and they periodically git fetch to follow development of the upstream package. A local branch that is used to follow remote development is called remote tracking branch or remote branch for short. Whoever is working on modifying an external project usually makes a new local branch, hanging off the remote branch. The fetch subcommand is used in this way:

   git fetch id-remote-tree source-branch:target-branch

The remote tree may be a pathname, a remote folder specified in ssh format or a URL, either http:// or git://.
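As a concrete instance of the command line above, the sketch below fetches from a pathname on the same disk. The two repositories (upstream and local), the branch name upstream-master and the "Demo" identity are invented for the example; the initial branch of the upstream repository is explicitly named master so that the example does not depend on local configuration.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

top=$(mktemp -d)

# The "remote" tree: here just another folder on the same disk.
git init -q "$top/upstream"
cd "$top/upstream"
git symbolic-ref HEAD refs/heads/master   # name the initial branch
echo v1 > file
git add file
git commit -q -m 'upstream 1'

# The receiving side: the local branch is created by the first fetch.
git init -q "$top/local"
cd "$top/local"
git fetch -q "$top/upstream" master:upstream-master
first=$(git rev-parse upstream-master)

# Upstream grows; a second fetch makes the local branch grow too.
cd "$top/upstream"
echo v2 >> file
git commit -q -a -m 'upstream 2'
cd "$top/local"
git fetch -q "$top/upstream" master:upstream-master
second=$(git rev-parse upstream-master)
upstream_tip=$(git -C "$top/upstream" rev-parse master)

echo "after first fetch:  $first"
echo "after second fetch: $second"
```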

Please note that all branches of a tree are local, even the ones called "remote". All information managed by git is included in the .git folder of the source package, and this is a design choice. A branch may be remote tracking or not according to how it is used. In any case, to simplify your command lines, you can spell out your predefined arguments for specific remote trees in your .git/config file.

After you worked on a local branch, spun out of a remote-tracking one, a further fetch on the remote branch will lead to split branches. You'll thus frequently need to move the local branch in order to have it rooted in the current tip of the remote one. This operation is called rebase and to perform it you need to be on the local branch and issue "git rebase <otherbranch>". Git does the following work: it identifies the most recent common commit, it rewinds all local commits after that forking point, it applies the commits that lead to the new base and finally re-applies the ones that have been rewound, managing possible conflicts.
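The rebase steps just listed can be observed in a scratch repository. In this sketch the branch master plays the role of the remote-tracking branch that has grown after a fetch, and topic is the local branch that was spun out of it; all names, files and the "Demo" identity are invented.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the initial branch
echo base > base.txt
git add base.txt
git commit -q -m 'base'

# Local work, hanging off the base commit.
git checkout -q -b topic
echo mine > topic.txt
git add topic.txt
git commit -q -m 'local work'

# Meanwhile the other branch grows: the two branches are now split.
git checkout -q master
echo theirs > upstream.txt
git add upstream.txt
git commit -q -m 'upstream work'

# Move the local branch so that it is rooted in the new tip.
git checkout -q topic
git rebase -q master

new_parent=$(git rev-parse 'topic^')
master_tip=$(git rev-parse master)
echo "parent of topic: $new_parent"
echo "tip of master:   $master_tip"
```

After the rebase the local commit sits directly on top of the other branch, and both files are present in the working directory.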

If you think about it, the "rebase -i" already described is similar, but the patch sequence is just reapplied to the same base. Of course, you can add -i when rebasing to a different branch as well.

Unlike what happens with normal patch application, a rebase allows git to build a better context for each commit, so the number of errors and conflicts is greatly reduced. Rebasing relies on the 3-way merge technique, which uses the common ancestor of the two versions being combined; the same machinery underlies git's merge strategies, such as the octopus merge used to join more than two branches at once.

Another common requirement for programmers is importing into a branch some patches that have already been applied to another branch. The subcommand cherry-pick allows you to choose and pick the commits you want, one at a time, among all the commits in the local tree; such commits are applied to the tip of the current branch. The command is very useful when you leave development on a branch where you experimented with stuff that is still useful, or when you track branches where other developers work, and you need to use specific parts of their work within yours.
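A minimal sketch of cherry-pick follows. The branch name experiment, the two files and the "Demo" identity are invented: one commit on the experimental branch is worth keeping, the other is not, and only the first is imported into master.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the initial branch
echo base > base.txt
git add base.txt
git commit -q -m 'base'

# An experimental branch with one useful commit and one to forget.
git checkout -q -b experiment
echo 'useful' > useful.txt
git add useful.txt
git commit -q -m 'useful fix'
useful=$(git rev-parse HEAD)
echo 'junk' > junk.txt
git add junk.txt
git commit -q -m 'debug junk'

# Back on the main branch, pick just the commit we want.
git checkout -q master
git cherry-pick "$useful" >/dev/null

picked_subject=$(git log -1 --format=%s)
echo "tip of master is now: $picked_subject"
```

The picked commit gets a new identifier on master, since its parent differs, but it carries the same change and log message.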

Conflict handling

In all situations where several people are developing concurrently, one of the most common problems is conflict handling. A conflict happens when you try to apply a patch to a code fragment, but that same fragment has been modified after the patch was taken. The two changes start from the same code but they are not compatible, and cannot be merged automatically. A conflict may also happen within the same tree, during a merge (not covered here) or a rebase, whether interactive or not. For example, a conflict happens when you reverse the order of two patches, the first one being a variable renaming and the second one some change to code using that very variable. The second patch can't be applied to the original code, as the lines being modified didn't exist when the variable was using the old name.

In case of conflict, git reports the conflict in its messages and stops the rebase operation, leaving some so-called "conflict markers" in the source file. Such markers are the usual <<<<<<<, ======= and >>>>>>>. The user is then expected to manually solve the issue and then "git add" the fixed files before continuing the merge or rebase operation (with commands such as "git rebase --continue"). As an alternative, the user can abort the whole rebase with "git rebase --abort". Before you explicitly call git add, the conflicting file is not saved as a git object; it's thus almost impossible to insert files with the conflict markers inside into git.
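The whole sequence can be rehearsed safely in a scratch repository. In the sketch below two branches change the same line of a file, so the rebase stops; the conflict is then resolved by overwriting the file, which stands in for the manual editing a person would do. The file conf, its contents, the branch names and the "Demo" identity are invented, and GIT_EDITOR is set to "true" only so that no editor pops up when the rebase continues.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_EDITOR=true   # accept log messages unchanged

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the initial branch
echo 'colour = red' > conf
git add conf
git commit -q -m 'base'

# Two incompatible changes to the same line, on different branches.
git checkout -q -b topic
echo 'colour = blue' > conf
git commit -q -a -m 'topic: blue'
git checkout -q master
echo 'colour = green' > conf
git commit -q -a -m 'master: green'

# The rebase stops at the conflict and leaves markers in the file.
git checkout -q topic
git rebase master >/dev/null 2>&1 || true
markers=$(grep -c '^<<<<<<<' conf || true)
cat conf

# Resolve (here by brute force), mark as resolved, continue.
echo 'colour = blue' > conf
git add conf
git rebase --continue >/dev/null 2>&1

result=$(cat conf)
echo "conflict markers found: $markers; final content: $result"
```

Running "git rebase --abort" instead of the last three commands would bring the branch back to where it was before the rebase started.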

Unlike CVS, git finds conflicts only in merging files that are already known to the system, so there is no information loss. With CVS and some other centralized version systems, the conflicts happen between a local version and a file recorded in the repository. The program in this case adds the conflict markers in the local file, so the user won't have the original local file any more, and can only recover by hand-editing. With git, the local file modified by the markers is just a temporary copy, which is considered derived from two parent files, both known to git. To express this dual-parent situation, the git diff command uses a different output format in this case, to show separately the differences of the descendant from both ancestors at the same time. It takes some time to get accustomed to this new format, but with some practice you'll appreciate the usefulness of such information.

In conflict management it is helpful, once again, that git records history as immutable objects. Even in the most horrible source corruption, you can recover a good version to restart from, ignoring the result of the erroneous operation. If you tried a merge or rebase hoping it would succeed, but then it doesn't and you have no time to fix the conflicts, you can easily check out one of the original branches: all the files riddled with conflict markers will just be discarded, together with the files whose merge succeeded.

Another mistake that may happen is a local modification of a remote-tracking branch. In this case, a later git pull will perform a merge, and the local branch will have the wrong history, with the identifiers for all remote commits that don't match upstream any more, as they have been applied to a different local tree. Sometimes conflicts may arise, but they are the wrong way: instead of being unable to apply the local patch to the upstream code, git tried to apply the upstream patches to the local development.

The solution here is relatively easy: you can rename this branch and repeat your fetch or pull: no significant data transfer will take place, because remote objects have already been downloaded, and a new, correct, remote-tracking branch will be created. Later on, you can delete the previous branch, or cherry pick some local commits from it, or checkout one of the local commits in its history to rebase it to the current remote branch.

Code exchange through email

After a developer has cleaned up the code to make it acceptable, after she rebased to the new upstream version and after she solved any conflict, the next step is usually publication, sending the patches to maintainers. The command "git format-patch" is used to create in the current directory one file for each commit, starting from the version named on the command line up to the tip of the current branch. The name of such file starts with a 4-digit number, so every Unix command will use or show them in order.

The files git format-patch creates are laid out like email messages, with all the headers. According to the options passed to the command, messages may include all information needed to be identified as a thread, if sent as-is. If you want to contribute your work to discussion lists for the relevant package, you can simply send those messages. If your email client changes messages in an unpleasant way (like breaking long lines or encoding in some non-plain-text MIME representation) you can run git send-email directly. This however assumes some more configuration of the git package, which must know how to actually send out the messages.

At the other side of the net there are people who need to apply locally the patches they received by email. To do that they simply need to run git am (apply mailbox). The command applies the patch and reproduces the log message in the branch where it is invoked, preserving authorship and other attributions. If the code base is not the same as the original poster's, the program will apply the patch using the same techniques (and the same limits) as the patch command, since it lacks the whole history needed to do a 3-way merge. If a conflict happens, upstream maintainers usually discard the contribution and send back a terse and cold message to the original poster: "please rebase and resubmit". Actually, rebasing is based on complete history, so it's less likely for the contributor to have conflicts than it is for upstream maintainers.
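The round trip from format-patch to am can be tried without any email involved, by handing the generated files over through a directory. This is only a sketch: the two repositories (author and maintainer), the outbox directory, the file and the "Demo" identity are invented for the example.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

top=$(mktemp -d)

# The contributor's tree, and the maintainer's copy of the same base.
git init -q "$top/author"
cd "$top/author"
git symbolic-ref HEAD refs/heads/master   # name the initial branch
echo base > file
git add file
git commit -q -m 'base'
git clone -q "$top/author" "$top/maintainer"

# The contributor adds a commit and exports it as a mail-like file.
echo 'feature' >> file
git commit -q -a -m 'add feature'
git format-patch -q -o "$top/outbox" HEAD~1
ls "$top/outbox"

# The maintainer applies the "mailbox" on top of the same base.
cd "$top/maintainer"
git am -q "$top"/outbox/*.patch

subject=$(git log -1 --format=%s)
author=$(git log -1 --format='%an <%ae>')
echo "applied: $subject, by $author"
```

The file name starts with the 4-digit sequence number followed by the subject of the commit, and the log message and author arrive intact in the maintainer's tree.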

As git format-patch is able to detect if files have been renamed, or if a new file has been copied from another and then modified slightly, it reports this information in its output. Thus, the patch command is not always able to do the same job as git am. In the absence of renames or copies, the two diff formats are the same, but in those special cases the git format is more compact and more readable than the standard diff output (patch input), at least until the new feature is added to the two Unix commands.

Using shared data space

While working with git on major projects, one problem developers usually feel is the huge amount of data that is hosted in each working directory. For example, my busybox folder currently hosts 18MB of source code and 20MB in .git. While code can be compressed to 2.5MB, the git data is already compressed and remains 20MB.

As soon as you work on several branches at the same time, because you are following different use cases of the same software package, it's useful to have different folders, with different checkouts of the same project, to avoid switching branches too often, as each time you need to recompile everything, which takes time. In this situation, the amount of common history becomes a heavy load, both on the work disk and on the backup device. Clearly, most git objects are repeated in the various folders, as the past history of the package is the same, and local differences across branches are relatively small.

To avoid such data duplication, git allows you to specify other archives for objects, called alternates. These additional archives are read by git but they are never written to: commits always happen locally. Such alternate locations can be specified in the environment, as GIT_ALTERNATE_OBJECT_DIRECTORIES, or in the file .git/objects/info/alternates, within the working directory of the project.

My personal choice, for my kernel work, is keeping a git archive that only hosts branches I download from the network, where I periodically run git fetch. Work in progress then lives in git trees that refer to that one as alternate. I still need to run git fetch in each of the projects, as the status of branches is kept locally, but the fetch operation run in the secondary git repositories finds objects that are already available (in the alternate directory), and won't make another local copy of objects that are part of the upstream package. In fact, in the project-specific directories I always fetch from the other folder in the same computer, to avoid network traffic and to avoid storing new objects in the wrong place, if the upstream branch has grown in the meantime.
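A setup of this kind can be sketched with two scratch repositories. The names mirror (the archive that follows upstream) and work (a project-specific tree), the branch name upstream and the "Demo" identity are invented; the alternates file is written by hand, one object directory per line, as described above.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

top=$(mktemp -d)

# The archive that only hosts upstream history.
git init -q "$top/mirror"
cd "$top/mirror"
git symbolic-ref HEAD refs/heads/master   # name the initial branch
echo 'upstream code' > file
git add file
git commit -q -m 'upstream history'

# A working tree that borrows objects from the archive.
git init -q "$top/work"
cd "$top/work"
echo "$top/mirror/.git/objects" > .git/objects/info/alternates

# Branch status is still local, so a fetch is needed; the objects
# themselves are already reachable through the alternate.
git fetch -q "$top/mirror" master:upstream

tip_type=$(git cat-file -t upstream)
stats=$(git count-objects -v)
echo "$stats"
```

The output of "git count-objects -v" lists the alternate directory and shows how many objects are stored locally in the working tree's own .git folder.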

Working with alternates you can avoid duplicating quite a lot of data, and the .git folder for each project will weigh only a few megabytes. Moreover, you can choose to only back up .git, ignoring the checked-out files, since the checkout can always be extracted from the git repository. Finally, in some cases you can even avoid backing up the master local archive, as a copy of upstream can be recovered from the net at any time.

To avoid inadvertently losing some of my work, I also created a new git repository that still uses alternates for upstream data. Into such repository I move all branches I'm not using any more (with a fetch within the local system), before I remove them from my working repositories -- actually, I now use git remote, but that's an advanced topic. In this way I keep a local copy of my complete history, without keeping the working place crowded with old branches and without wasting more than a few megs of storage for such complete history.

Figure 1 - gitk

In addition to the command line, which remains the preferred interaction tool for developers, there are some graphic tools for git users, which are useful both to understand how a project's history evolved and to navigate among the various development branches.

The figure shows a window of gitk (written in Tcl/Tk). Another approach to visualization is that of gitweb, which is usually installed on the servers that offer source code through git.

To probe further

The git package is distributed together with extensive documentation, as man pages (man command). For each subcommand you find a manual page whose name begins with git-, so for example you can invoke "man git-fetch". This convention reflects the origins of git, when each subcommand was actually a standalone command (with a dash in the name); but it also allows splitting a big corpus of documentation into useful parts, whereas a single man page would be unmanageable. The main page, "man git", is available nonetheless, and provides introductory and general information.

Something more introductory, designed for beginners, is gittutorial(7) (i.e., the gittutorial man page in section 7 of the manual), and its follow-up gittutorial-2(7), which goes into more depth. Other manual pages that can be useful are listed in the SEE ALSO section of git(1).

The official project site is git.or.cz, and includes among other things an interesting "git for svn users", and other course material, within http://git.or.cz/course/.

The http://www.youtube.com/watch?v=4XpnKHJAok8 video is a recording of Linus Torvalds talking about git to Google technicians. It's more like informal chatting than a technical presentation, but it's quite interesting nonetheless.

Box 5 in this page briefly lists other git subcommands that I originally planned to describe as useful or otherwise interesting; detailed information about such tools can be found elsewhere, as hinted in this section.

Box 5 - Other important subcommands

This box lists other git commands that are not covered in this article, but that I suggest studying if you want to become a serious git user. Some have been touched on in box 1.