Thought Parsing

Most git tutorials start with teaching you a series of commands to manipulate a repository and only follow up later, if at all, with the effects those commands have on the underlying data. This is great for getting started quickly, but I think it underprepares users for more advanced usage: there's a fairly low limit to the number of opaque commands one can memorize to resolve a particular situation, and eventually every advanced user will need to learn what the underlying data model is and how git's commands manipulate it. So this tutorial is backwards: instead of starting with commands and working down to the underlying model, we'll start with the model and work up.

This tutorial was inspired by comments on this lobste.rs post.

I won't assume any familiarity with git for this tutorial, but a base level of computer science knowledge (in particular what a graph is) will be helpful. I will try to explain everything as I go, but there are lots of little concepts it's easy to miss.

So. In extreme short, a git repository is a directed acyclic graph (a.k.a. a “DAG”) with commits as nodes and the parent/child relationship as edges. This is less intimidating than it sounds:

A “commit” is a snapshot of the repository at a moment in time. There are a bunch of technical details to how commits are stored, manipulated and shared which make it seem more complicated than this (in particular, you've probably heard lots of talk of “diffs” of changes between one commit and another being passed around), but as far as the data model is concerned, each commit describes the entire state of the repository at that instant.
“Parent” and “child” just mean one commit came immediately before or after another, respectively. Since one of the goals of a source control system is to be able to view how the contents of a repository have changed over time, we want to know the order commits came in.
A “graph” is a bunch of arbitrary things called “nodes” which are connected together. The connections are called “edges”. As mentioned, in a git repository, commits are nodes and edges mean “this commit came immediately before/after this other one”.
“Directed” just means that the edges of the graph have a direction to them: when one commit is the parent of another, it is not also the child, and vice versa.
“Acyclic” means that there are no loops (“cycles”) in the repository. Starting at one commit, you cannot follow the chain of connections between that commit, its children, its children's children, etc. and get back to the original commit. This is important for two major reasons: first, it doesn't make sense (how can a commit be derived through a bunch of changes from itself?) and second, manipulating acyclic graphs is much easier than manipulating graphs that may contain cycles. (For one thing, if you follow a chain of connected commits, you are guaranteed to always reach a last commit with no children.)
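The definitions above fit in a few lines of Python. This is only a toy sketch of the model, not how git stores anything: the Commit class, its snapshot dictionary, and the ancestors helper are names I've made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True, eq=False)
class Commit:
    """One node in the graph: a full snapshot plus links to its parents."""
    snapshot: dict        # hypothetical layout: file path -> file contents
    parents: tuple = ()   # directed edges, pointing from child to parent(s)


def ancestors(commit):
    """Collect every commit reachable by following parent edges."""
    seen, stack = [], list(commit.parents)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(current.parents)
    return seen


root = Commit({"README": "hello"})
second = Commit({"README": "hello, world"}, parents=(root,))
third = Commit({"README": "hello, world", "notes.txt": "todo"}, parents=(second,))

history = ancestors(third)  # [second, root]
```

Because parents is fixed when a commit is created, a commit can only point at commits that already exist, so no chain of parent links can loop back on itself. That is the "acyclic" property falling out of the construction.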

Generally, I think diagrams are helpful, so here's a labeled diagram of a git repository:

(Commits are generally identified by a truncation of the hash of their contents. This is conveniently unambiguous and compact. “d54be4f” is an example of a commit hash in the above diagram.)
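To make "a truncation of the hash of their contents" a little more concrete, here is a rough Python sketch. It assumes git's default SHA-1 object format, where the hash covers a short header (the object type and the body length) followed by the body. The commit body below is a simplified, hypothetical one; a real commit also records an author, a committer, and any parents.

```python
import hashlib


def object_id(kind: str, body: bytes) -> str:
    """SHA-1 over "<kind> <length>\\0" followed by the body."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical, simplified commit body (illustration only).
body = b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n\nInitial commit\n"

full_id = object_id("commit", body)   # 40 hexadecimal characters
short_id = full_id[:7]                # the kind of short form shown in diagrams
```

Since the id is derived from the contents, changing anything about a commit (including which parents it has) produces a different id.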

There's one more important concept to understand, which is a “branch”. Branches in git are somewhat misleadingly named; a branch is just a pointer to a particular commit. Thus, two or more different branches may point to the same commit (indeed, this is the normal state of affairs immediately after creating a new branch). Annotated with branches, our diagram becomes the following:
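In toy Python form (a sketch with made-up commit ids, not git's actual storage), "a branch is just a pointer" amounts to a mapping from names to commits:

```python
# commit id -> list of parent ids (hypothetical ids, oldest first)
commits = {"a1": [], "b2": ["a1"], "c3": ["b2"]}

# branch name -> the single commit it points at
branches = {"master": "c3"}

# A freshly created branch starts out pointing at the same commit.
branches["feature"] = branches["master"]
```

Nothing in commits changed when the second name was added, which is why creating a branch is cheap.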

Now that I've explained what the underlying data model looks like, let's take a crack at manipulating it. The most fundamental thing you can do to a git repository is add commits to it. This is done, unsurprisingly, with the git commit command, which creates a new commit from the changes we've selected in the working tree, with the current commit as its parent. As a side effect (which is almost always what we want), it also updates the current branch to point to the new commit. Thus, we use the following command (the -a flag selects every change to files git already tracks):

git commit -a

to have the following effect on the repository:
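In terms of the model, that effect can be sketched roughly as follows. This is a toy, not git's implementation: the commits and branches dictionaries, the head variable, and the counter standing in for content hashes are all assumptions of mine.

```python
import itertools

commits = {}                  # commit id -> {"parents": [...], "snapshot": {...}}
branches = {"master": None}   # branch name -> commit id (None before the first commit)
head = "master"               # name of the current branch
_ids = itertools.count(1)     # stand-in for real content hashes


def commit(snapshot):
    """Record a snapshot as a new commit and advance the current branch."""
    new_id = f"c{next(_ids)}"
    parent = branches[head]
    commits[new_id] = {
        "parents": [] if parent is None else [parent],
        "snapshot": dict(snapshot),
    }
    branches[head] = new_id   # the side effect: the current branch moves
    return new_id


first = commit({"README": "hello"})
second = commit({"README": "hello, world"})
```

After the two calls, second has first as its only parent and master points at second.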

The second most fundamental aspect of manipulating a repository is managing branches. First, we'll switch between existing branches. This is done with the git checkout command:

git checkout branchname

This doesn't change the underlying repository, but it does change the contents of the working tree to reflect the commit that branch points at, and future git commit commands will update that branch to point to newly created commits.

Creating branches is typically done with the git branch command:

git branch branchname

creates a new branch pointing to the current commit:

Note that this doesn't change which branch is current; you still need to run git checkout branchname for new commits to apply to it.
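The difference between creating a branch and switching to it is easy to see in a toy model. The Repo class below is a hypothetical sketch (its branch and checkout methods only mimic the effect of the commands on the graph), and it ignores the working tree entirely:

```python
class Repo:
    """Toy model: branch pointers plus the name of the current branch."""

    def __init__(self, commit_id):
        self.branches = {"master": commit_id}
        self.head = "master"

    def branch(self, name):
        # Like `git branch name`: a new pointer at the current commit.
        # The current branch is left alone.
        self.branches[name] = self.branches[self.head]

    def checkout(self, name):
        # Like `git checkout name`: only the notion of "current branch" moves.
        if name not in self.branches:
            raise KeyError(f"no such branch: {name}")
        self.head = name


repo = Repo("d54be4f")
repo.branch("branchname")
head_after_branch = repo.head   # still "master"
repo.checkout("branchname")
head_after_checkout = repo.head  # now "branchname"
```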

These are the basics! Switching between and committing to different branches will cause the line of commits each branch refers to to diverge. This is the source of the name “branch”.

git checkout branchname

(edit working tree)

git commit -a

git checkout master

(edit working tree)

git commit -a
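Replaying that sequence against a toy model shows the divergence. As before, this is a sketch with made-up commit ids, and it assumes both branches start out pointing at the same commit, c1:

```python
parents = {"c1": None}                          # commit id -> parent id
branches = {"master": "c1", "branchname": "c1"}
head = "master"


def checkout(name):
    global head
    head = name


def commit(new_id):
    parents[new_id] = branches[head]   # parent is the current commit
    branches[head] = new_id            # current branch advances


def history(name):
    """Commit ids from the branch tip back to the root."""
    out, current = [], branches[name]
    while current is not None:
        out.append(current)
        current = parents[current]
    return out


checkout("branchname")
commit("c2")
checkout("master")
commit("c3")

shared = [c for c in history("master") if c in history("branchname")]
```

The two histories are now ["c3", "c1"] and ["c2", "c1"]: they share c1 and differ after it.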

Of course, you need to be able to merge divergent branches back together, or there wasn't much point in branching to begin with; you might as well just copy the repository. This is done with the git merge command, which makes a new commit whose parents are the current commit and every commit you've named in the command, and updates the current branch to point at this new “merge commit”.

git merge branchname

Recall that each commit is a snapshot of the working tree, so combining two commits like this is not always easy! Git will do its best to merge automatically, but if it can't, it will tell you there were “merge conflicts” and ask you to resolve them manually. Once you've resolved any conflicts, you then manually commit the finished merge.
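Here is a rough sketch of why combining snapshots can fail. It assumes a whole-file, three-way comparison against a common ancestor snapshot, which I'm calling base (a concept this tutorial hasn't introduced). Git's real merge is more refined, since it can also combine changes within a single file, so treat merge_snapshots as a hypothetical illustration only.

```python
def merge_snapshots(base, ours, theirs):
    """File-level three-way merge; returns (merged snapshot, conflicting paths)."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o                 # both sides agree
        elif o == b:
            result = t                 # only their side changed the file
        elif t == b:
            result = o                 # only our side changed the file
        else:
            conflicts.append(path)     # both changed it differently
            continue
        if result is not None:         # None means the file is absent
            merged[path] = result
    return merged, conflicts


base = {"a.txt": "1", "b.txt": "1"}
ours = {"a.txt": "2", "b.txt": "1"}
theirs = {"a.txt": "1", "b.txt": "3", "c.txt": "new"}
snapshot, conflicts = merge_snapshots(base, ours, theirs)

# With no conflicts, the merge commit simply lists both tips as parents.
merge_commit = {"parents": ["ours_tip", "theirs_tip"], "snapshot": snapshot}
```

If both sides had changed a.txt in different ways, that path would land in conflicts instead, which corresponds to the case where git asks you to resolve things by hand.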

To manipulate a repository sensibly, you need to be able to see the structure of its graph. In general, git log shows information about past commits; to see the parent/child relationships between commits drawn explicitly, use git log --graph. Graphical tools such as gitk also show this information.
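For a feel of what such a listing has to compute, here is a toy sketch that prints every commit reachable from a tip together with its parents, always showing a commit before any of its parents. The log function and the commit ids are hypothetical, and the output format is my own, not git's.

```python
def log(commits, tip):
    """Lines of the form 'id <- parents' for everything reachable from tip."""
    # Find the reachable commits by following parent edges.
    reachable, stack = set(), [tip]
    while stack:
        current = stack.pop()
        if current not in reachable:
            reachable.add(current)
            stack.extend(commits[current])

    # Count how many reachable children each commit has.
    pending_children = {c: 0 for c in reachable}
    for c in reachable:
        for parent in commits[c]:
            pending_children[parent] += 1

    # Emit a commit only once all of its children have been emitted.
    ready, lines = [tip], []
    while ready:
        current = ready.pop()
        lines.append(f"{current} <- {', '.join(commits[current]) or '(root)'}")
        for parent in commits[current]:
            pending_children[parent] -= 1
            if pending_children[parent] == 0:
                ready.append(parent)
    return lines


# commit id -> parent ids; m4 is a merge of c3 and c2
commits = {"c1": [], "c2": ["c1"], "c3": ["c1"], "m4": ["c3", "c2"]}
lines = log(commits, "m4")
```

The merge commit shows up with two parents, and c1 is listed last because both c2 and c3 descend from it.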

In part two, we'll go over more advanced bits of manipulation such as moving and editing commits.

A Backwards Git Tutorial, Part 1
