
Version Control Primer

tobey edited this page Aug 7, 2015 · 1 revision

Version Control Concerns

The primary goal is to have a meaningful history to which you can easily add new code and from which you can easily revert to specific points in order to duplicate results or work.

Meaningful history

There are two (related) parts to this: making your diffs (commits) easy to read and having a strategy or pattern for bringing different lines of development together (merging).

If your diffs are unreadable, then the history graph quickly becomes unusable and meaningless. People will lose trust in the system and stop using it. So it is important to know how git works so that you can make commits that are concise, organized, and readable. This comes down to using git on a day-to-day basis and learning:

  • what is a commit,
  • how to make a commit,
  • how to separate different concerns into different commits,
  • how to fine tune a commit,
  • what types of files or information should not be kept under version control,
  • how to use branches,
  • how to merge branches,
  • and the implications of making merges in an environment with multiple developers.

Low friction workflow - easily add code and revert

  • You need to choose a workflow: Integration Manager or Shared Repository. Each has its advantages.
  • You must learn to use the tools. Everything is possible with the command line, but the graphical tools are pretty handy. I use gitk and git gui in addition to the command line.

There are many purely graphical clients for Git, and many IDEs have some kind of built-in integration with version control. But I have found these difficult to use without understanding the fundamentals of git. The fundamentals are best learned by using the command line.

It will also help to get familiar with the following:

  • adding the git branch to your terminal prompt
  • learning how to use .gitignore
  • how to use git stash

A big part of maintaining a low friction workflow revolves around understanding what types of files or information should not be included in version control and figuring out how to exclude these files. The general idea is that you don't want to keep generated files (e.g. *.o files or Doxygen output), but you do want to track the code that generates them. If you need the outputs, then you run the generating code to produce them.
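
As a rough illustration of the tracked-versus-generated split, the Python sketch below sorts file names by a few ignore-style patterns. The pattern list and file names are hypothetical, and `fnmatch` only approximates the matching rules git applies to `.gitignore` entries, so treat this as a way to reason about the idea rather than a substitute for `.gitignore` itself.

```python
from fnmatch import fnmatch

# Hypothetical patterns for generated files; a real project would list
# these in .gitignore rather than in a script.
GENERATED_PATTERNS = ["*.o", "*.pyc", "docs/html/*"]


def is_generated(path, patterns=GENERATED_PATTERNS):
    """Return True if path matches any generated-file pattern."""
    return any(fnmatch(path, pattern) for pattern in patterns)


def split_files(paths):
    """Split paths into (tracked, generated) lists, preserving order."""
    tracked = [p for p in paths if not is_generated(p)]
    generated = [p for p in paths if is_generated(p)]
    return tracked, generated


if __name__ == "__main__":
    files = ["src/model.c", "src/model.o", "Doxyfile", "docs/html/index.html"]
    tracked, generated = split_files(files)
    print("track: ", tracked)
    print("ignore:", generated)
```

Here the source file and the Doxygen configuration stay in the repository, while the object file and the rendered documentation are left out because they can be regenerated.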

Another common sticking point is figuring out how to track host-specific settings, such as specific environment variables, build settings, or the project settings files generated by many IDEs. It is common to want to track these settings locally at the level of an individual developer or workstation, but not to push them to a central shared repository. There are many possible ways to handle this.
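
One possible approach, sketched below in Python, is to commit a file of shared defaults and let each workstation overlay its own untracked file on top. The file names `settings.json` and `settings.local.json`, the keys, and the `load_settings` helper are all hypothetical; the local file would be listed in `.gitignore` so it never reaches the shared repository.

```python
import json
import tempfile
from pathlib import Path


def load_settings(shared_path, local_path):
    """Merge tracked shared defaults with optional untracked local overrides."""
    settings = json.loads(Path(shared_path).read_text())
    local = Path(local_path)
    if local.exists():
        # Local values win; keys missing locally keep the shared default.
        settings.update(json.loads(local.read_text()))
    return settings


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shared = Path(tmp) / "settings.json"        # would be committed
        local = Path(tmp) / "settings.local.json"   # would be ignored by git
        shared.write_text(json.dumps({"compiler": "gcc", "threads": 1}))
        local.write_text(json.dumps({"threads": 8}))
        print(load_settings(shared, local))
```

The merge is shallow, so nested settings would need a recursive variant. A developer without a local file simply gets the shared defaults.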

Duplicating results

In the scientific context, one of the primary uses for version control is to be able to revert to a specific place, and either duplicate specific results, or use the specific place as a starting point for further work. Usually to duplicate your work you need:

  • the code at a specific place,
  • a set of inputs,
  • a set of parameters,
  • and (sometimes) a specific computing environment.
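
To make the list above concrete, here is a minimal Python sketch that gathers those four pieces into a single record that could be saved next to a run's outputs. The function names, the input path, and the parameter name are hypothetical, and the environment section only captures the Python version and platform string; a real project would likely want more detail.

```python
import json
import platform
import subprocess
import sys


def current_commit():
    """Return the current git commit hash, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or we are not inside a repository.
        return None
    return result.stdout.strip()


def run_record(input_files, parameters):
    """Collect code version, inputs, parameters, and environment in one dict."""
    return {
        "commit": current_commit(),
        "inputs": sorted(input_files),
        "parameters": parameters,
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }


if __name__ == "__main__":
    # Hypothetical input file and parameter, for illustration only.
    record = run_record(["inputs/climate.nc"], {"timestep_days": 30})
    print(json.dumps(record, indent=2))
```

Note that the commit hash only identifies the code if the working tree has no uncommitted changes at the time of the run.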

Because git is not set up to track large binary files (such as our NetCDF inputs), we have to come up with a solution for handling inputs other than simply checking them into version control. Possibilities include:

  • track scripts that can generate inputs
  • track notes about how/where inputs were obtained
  • keep inputs in an alternate directory structure and track links to the inputs
  • add a metadata generation system that appends input source info to model outputs
  • other?
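
As one way of combining the "track notes" and "track links" ideas, the Python sketch below builds a small manifest describing each input file by relative path, size, and SHA-256 checksum, plus a free-text note on where it came from. The manifest is small enough to commit even when the inputs are not. The function names, the manifest fields, and the example file are assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large inputs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(input_dir, source_notes=None):
    """Describe every file under input_dir by path, size, checksum, and source."""
    source_notes = source_notes or {}
    root = Path(input_dir)
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        name = path.relative_to(root).as_posix()
        entries.append({
            "path": name,
            "bytes": path.stat().st_size,
            "sha256": sha256_of(path),
            "source": source_notes.get(name, "unknown"),
        })
    return entries


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for a real NetCDF input kept outside the repository.
        (Path(tmp) / "example.nc").write_bytes(b"placeholder bytes")
        notes = {"example.nc": "hypothetical note about where this came from"}
        print(json.dumps(build_manifest(tmp, notes), indent=2))
```

Comparing a stored manifest against a freshly built one shows whether the inputs on disk still match the ones used for an earlier result.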

Bonus stuff

In addition to using git for version control, when you use GitHub as a host for the source code repository, you get a variety of "bonus" items that can be extremely handy:

  • online code browser
  • wiki
  • issue tracker
  • comment system with support for images
  • notifications
  • various graphs/metrics
  • custom website