Version Control with Git

Modern Plain Text Social Science: Week 4

Kieran Healy

September 18, 2023

Plain Text Credo Again

Knowing what you did

… O that I were as great

As is my grief, or lesser than my name!

Or that I could forget what I have been,

Or not remember what I must be now!

We need to keep track

  • In the “Engineering” model, changes are tracked outside of files, at the level of the project.
  • When you get down to it a “project” is a folder that contains a bunch of files and subfolders.
  • There’s a long history of tools that keep track of what happens to the files in projects and who did things to which files.
  • A tool like this is called a Version Control System or a Revision Control System.

Git

  • Git is the industry-standard VCS.
  • It’s very powerful.
  • It’s used to run the software development projects of gigantic corporations that must coordinate huge projects with contributions by hundreds or thousands of developers.
  • It makes professionals and amateurs alike cry on the regular.

Git

  • We can start by using a very small fraction of git’s functionality.

Git as a Logbook

Our sample project

What must we track?

This is our local folder

This is what we need to track

  • Even though they are generated, for figures and tables (and final outputs like papers) we will often want to know which version of the code and data they were produced with.
  • We’ll come back to this in more detail later on. But in essence we’ll write an R function to stamp a version identifier on the output.

Initialise a repository

Initialise a repository

  • The .git folder is “hidden” because its name begins with a dot.
  • Inside it contains its own file and folder hierarchy for keeping track of changes to our project.
  • Now we need to tell git what to track.

Staging a file

  • The first step is to “stage” a file with the command git add <filename>
  • This adds the file to an index of things git is going to manage.
  • Git keeps this index inside the .git folder.

Staging a file

  • No changes are recorded at this point.
  • Nothing changes in our local folder.
  • We’ve just added a file to a list of items to be tracked.
  • As the name suggests, conceptually the staging area is preparatory.
  • … Preparatory to what?

Committing a file

  • Committing a file makes a record of it in our local repository.

Committing a file

  • The 64-bit hexadecimal number is an SHA Hash.
  • It’s derived from the file(s) being committed and it uniquely identifies the changes associated with that commit.
  • The message added with -m is for our benefit. It should be terse and informative about what’s new, fixed, or changed in the commit.

Committing a file

  • Inside the .git folder, a compressed version of the file is created.

Staging several files …

… and committing them

Committing several files

Remember …

  • The Local Folder is our project as we (and the file system) can see it.
  • The Local Repository lives inside the special .git folder that sits at the root of our project. Conceptually we think of it as its own separate thing that git is managing for us.
  • We can move files and folders in and out of our local folder, edit them, and so on, just as we normally would.
  • But for git to track and manage any of these files or changes to them we need to add and commit them to the local repository.

Editing and committing

Editing and committing

Storing the changes

Remember (again) …

  • The Local Folder is our project as we (and the file system) can see it.
  • The Local Repository lives inside the special .git folder that sits at the root of our project. Conceptually we think of it as its own separate thing that git is managing for us.
  • We can move files and folders in and out of our local folder, edit them, and so on, just as we normally would.
  • But for git to track and manage any of these files or changes to them we need to add and commit them to the local repository.

Git as a Log Book

  • Working this way, we treat git as managing the files we want to track.
  • Physically these changes are managed in the .git folder inside the project, but think of it as a separate entity, the local repository.
  • Every time we decide to make a change to a tracked file in the project we do git add <filename> and git commit -m "Message"
  • We write git add -u to stage any changes to every already-tracked file.
  • Newly-created files that we want to stage and commit are not picked up by git add -u
  • They must be added by name, either one at a time or with shell-style paths and globs (wildcards).

GitHub

Keeping a remote Log

  • We have a local repository, but …
  • We can also keep copies of it wherever we like.
  • In the Cloud, for example …

There’s no such thing as the Cloud


It’s just someone else’s computer

Keeping a remote copy

  • We proceed as before but then push changes to GitHub or wherever our remote repository is located.

Workflow: GitHub Round Trip

Workflow: GitHub Round Trip

Workflow: GitHub Round Trip

Workflow: GitHub Round Trip

Workflow: GitHub Round Trip

Workflow: GitHub Round Trip

Workflow: GitHub Round Trip

git pull

Branches

Our project again

 git log
commit 6df6d6b10b127e87115dceac06c869150b67e800 (HEAD -> main, origin/main)
Author: Kieran Healy <kjhealy@gmail.com>
Date:   Mon Sep 18 17:49:33 2023 -0400

    Began writeup

commit ef0772106fe4d6f10ee62afd3c951ff14b16efb1
Author: Kieran Healy <kjhealy@gmail.com>
Date:   Mon Sep 18 17:38:55 2023 -0400

    Data and Rproj files

commit f43f6f10ba5cd1a57430c032c8c14f056b154e9b
Author: Kieran Healy <kjhealy@gmail.com>
Date:   Mon Sep 18 17:34:08 2023 -0400

    Added covidcases

Our project again

  • Individual commits can be referenced by their SHA hash.
  • Usually the first few bits are sufficient to ID the commit in a single project.
> git log --oneline

6df6d6b (HEAD -> main, origin/main) Began writeup
ef07721 Data and Rproj files
f43f6f1 Added covidcases
  • The current commit is called the HEAD

Our project again

  • The SHA or commit id always refers to the same code.
  • Zooming out from the various files changed in each case to the level of commits, our project looks like this:

We’re on a branch

  • By default we’re on the main branch

We can create branches

> git branch figtest 
> git checkout figtest

Switched to branch 'figtest'

We make some edits in covidcases.qmd

mtcars |> 
  ggplot(mapping = aes(x = hp, 
                       y = mpg)) + 
  geom_point() + 
  geom_smooth()

What’s our status?

# Back at the shell here
> git status
On branch figtest
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    modified:   covidcases.qmd

no changes added to commit (use "git add" and/or "git commit -a")

Commit changes on the new branch

> git add covidcases.qmd
> git commit -m "Checking ggplot code"

[figtest 912aed7] Checking ggplot code
 1 file changed, 12 insertions(+)

Schematically

Look at the whole log

> git log --abbrev-commit
commit 912aed7 (HEAD -> figtest)
Author: Kieran Healy <kjhealy@gmail.com>
Date:   Mon Sep 18 20:25:42 2023 -0400

    Checking ggplot code

commit 6df6d6b (origin/main, main)
Author: Kieran Healy <kjhealy@gmail.com>
Date:   Mon Sep 18 17:49:33 2023 -0400

    Began writeup

commit ef07721
Author: Kieran Healy <kjhealy@gmail.com>
Date:   Mon Sep 18 17:38:55 2023 -0400

    Data and Rproj files

commit f43f6f1
Author: Kieran Healy <kjhealy@gmail.com>
Date:   Mon Sep 18 17:34:08 2023 -0400

    Added covidcases

Status?

git status
On branch figtest
nothing to commit, working tree clean

Make some more changes …

> git log --abbrev-commit

commit 4a77b2f (HEAD -> figtest)
Author: Kieran Healy <kjhealy@gmail.com>
Date:   Mon Sep 18 20:33:57 2023 -0400

    Fixed the plot

commit 912aed7
Author: Kieran Healy <kjhealy@gmail.com>
Date:   Mon Sep 18 20:25:42 2023 -0400

    Checking ggplot code

Schematically

Merge it back in to main

> git checkout main
Switched to branch 'main'
  • None of the code we wrote in the last two commits is on this branch.
> git merge figtest

Updating 6df6d6b..4a77b2f
Fast-forward
 covidcases.qmd | 13 +++++++++++++
 1 file changed, 13 insertions(+)
  • And now it is.

Schematically

Write some more

  • Make some more edits and commit those changes.
## Read in the data
df <- read_csv(here::here("data", "uscases.csv"))

Schematically

Where are we?

> git log --abbrev-comit

commit 06677f4 (HEAD -> main, origin/main)
Author: Kieran Healy <kjhealy@gmail.com>
Date:   Mon Sep 18 20:45:10 2023 -0400

    Load the US data

commit 4a77b2f (origin/figtest, figtest)
Author: Kieran Healy <kjhealy@gmail.com>
Date:   Mon Sep 18 20:33:57 2023 -0400

    Fixed the plot

commit 912aed7
Author: Kieran Healy <kjhealy@gmail.com>
Date:   Mon Sep 18 20:25:42 2023 -0400

    Checking ggplot code

commit 6df6d6b
Author: Kieran Healy <kjhealy@gmail.com>
Date:   Mon Sep 18 17:49:33 2023 -0400

    Began writeup

Summary

It’s a log book

  • For single person projects you can go a long way with a very basic git workflow.
  • Essentially do your work then add, commit, and push the results.
  • At a minimum you will be left with a record of your project’s development where you can revist any commit and see what the state of the project was at that point.

It’s a testbed

  • A little discipline with branching will let you experiment with new lines of work or code without getting lost or losing the ability to revert to what was there before, all in the same project folder.
  • Successful lines of inquiry can be merged into the main branch, unsuccessful ones kept for the record or discarded altogether.

It’s a collaboration tool

  • The combination of multiple copies of the repository on the one hand and the act of branching and merging on the other generalizes to working with multiple contributors to a repository and managing all of their work.
  • Again, Git requires a bit of discipline about this—a team should have rules about branch naming, task prioritization, bug fix procedures, etc.
  • Services like GitHub provide most of their value by offering ways to manage these tasks in the context of a git-based way of working.

It’s often a pain in the butt

  • Underneath Git is kind of a mess.
  • Various design features make it easy to be confused or misled.
  • Start with the Sylor-Miller & Evans zine.