Introduction to Git and GitHub Classroom
Revision Control for Reproducible Research
A very important part of reproducible research is using revision control.
For an excellent introduction to using git with R, I recommend Professor Jenny Bryan’s web book, “Happy Git and GitHub for the useR” at http://happygitwithr.com/.
Tools for Revision Control
This document describes two related tools:
- Git is a program that helps you manage revisions in files that you edit (computer code, text documents, etc.), and coordinate sharing documents and working on them with other people.
- GitHub is a web site that allows people to share projects using Git. It is free for educational users and for open-source projects that anyone can see and copy. GitHub also allows paying customers to have private projects. Educational users can set up private projects for free.
Installing Git on Your Own Computer
There are three good ways to get Git working on your computer:
- If your computer is a Mac or runs Linux, Git may already be installed. You can check by opening a terminal window and typing `which git`. If the computer responds with something like `/usr/bin/git`, then Git is already installed. Otherwise, follow the instructions below.
- For Windows and MacOS, you can download Git from https://git-scm.com/ and install it on your computer.
- For Linux computers, you can install Git from a terminal window as follows:
  - For Debian or Ubuntu, `sudo apt-get install git`
  - For Fedora or other RedHat-type distributions, `sudo yum install git` or `sudo dnf install git`.
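Whichever option you use, a quick check from a terminal window confirms that Git is available. The sketch below uses `command -v`, which I am substituting as a portable equivalent of `which`. The variable name `git_status` is only for illustration.

```shell
# Check whether git is on the PATH and report its version.
if command -v git >/dev/null 2>&1; then
    git_status="installed"
    git --version
else
    git_status="missing"
    echo "Git is not installed yet."
fi
echo "Git is $git_status"
```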
After you install Git, it is important to run two commands:
Open a git command window:
- On Windows, open the Start menu, go to “Git” and click on “Git Bash”
- On a Mac, open a terminal window
Run the two following commands:
- `git config --global user.name "Your Name"` (using your own name instead of “Your Name”)
- `git config --global user.email "your.email.address@vanderbilt.edu"` (using your own email address)
Git uses this information to keep track of who makes changes to a file. If you are editing a file on your computer and a friend is editing it on her computer, git uses this user information to keep track of who made each change. Then when you and your friend merge your changes, git will be able to tell you which of you edited what.
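If you want to confirm that the two settings were saved, you can read them back. This is a small sketch. The shell variables `name` and `email` and the "(not set)" fallback text are my own, for illustration only.

```shell
# Read back the identity git will attach to your commits.
# If a value has not been configured, print a reminder instead.
name=$(git config --global --get user.name || echo "(not set)")
email=$(git config --global --get user.email || echo "(not set)")
echo "user.name:  $name"
echo "user.email: $email"
```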
What Git Does
Git is a very powerful tool that can do many things, and some of its more advanced features can become confusing, even to experts. Fortunately, we can ignore most of what git can do, and focus on a few simple things:
- Cloning a project from a remote server (e.g., github.com) to make a local copy of the file repository.
- Using file differencing to see what you changed since the last time you committed changes to your local repository.
- Staging and committing to record changes that you made to your project in the local file repository on your computer.
- Synchronizing repositories on multiple computers: Pushing changes from your local computer to a remote server and pulling changes from a remote computer to your local computer.
Pretty much everything you might want to do with git, you can do from inside RStudio.
Git Vocabulary:
Repository: A repository is where git stores the history of an entire project. Git tracks every change that you make to every file in a project.
Git can synchronize repositories in different computers. We will use GitHub a lot for the homework assignments in this course. When you start a homework project, you will use RStudio to create a new project on your computer based on an assignment repository on GitHub.
If you are collaborating with other people, you can each edit files and then synchronize your repository with your partner’s repository. This will let each of you see all of the changes that the other made to the files.
Clone: When you start a new homework project in R, you will clone a template that I post on GitHub Classroom. Cloning a project gives you not only a copy of the current project but also a copy of its entire history. So if you are working with a partner and one of you clones a project from the other, you will be able to use Git to coordinate your work, and each of you will be able to see everything the other has done on the project. It is like “track changes” on steroids.
An easy way to clone remote repositories from RStudio is to go to the “File” menu and choose “New Project”. Then choose the option, “Version Control”. Then select “Git” and enter the URL for the remote repository. I walked you through this in class on Jan. 14.
Behind the scenes, RStudio tells git to clone the remote repository, which makes a copy of that repository on your local computer. After you have cloned a repository, you will have not only the current version of all the files in the project, but you will have the entire history of each of those files.
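For the curious, here is roughly what cloning looks like from a terminal. This is only a sketch: so that it runs without a network connection, a small local repository (`template`, a hypothetical stand-in for an assignment repository on GitHub) plays the part of the remote. With a real assignment you would give `git clone` the GitHub URL instead. The file names and the placeholder identity are also my assumptions.

```shell
# Work in a throwaway directory so nothing else on your computer is touched.
demo=$(mktemp -d)

# Hypothetical stand-in for the assignment repository on GitHub.
git init -q "$demo/template"
cd "$demo/template"
git config user.name "Your Name"
git config user.email "your.email.address@vanderbilt.edu"
echo "Homework instructions" > README.md
git add README.md
git commit -q -m "Post assignment template"

# Clone it, much as RStudio clones the URL you enter.
cd "$demo"
git clone -q "$demo/template" homework

# The clone holds the current files and the full history of the project.
cd "$demo/homework"
git --no-pager log --oneline
```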
Commit: A commit is a snapshot of all the files in a repository at some point in time. If you edit some files, create some new files, and delete some files, then you can commit all of those changes (edits, new files, and deleted files).
The git repository will add the new files to the repository, note that the deleted files have been deleted, and note the changes in the edited files between the latest version in the repository and the edited version that you are committing. It will then note the current state of all the files in the repository, so the commit represents a snapshot of the current state of all the files in the repository when you make the commit.
Staging: Because a commit records a snapshot of the state of your entire project, it often involves multiple files. Before you commit, you need to stage the files you want included in the commit.
Because a git commit records changes, you don’t need to worry about staging files that have not changed, but for files that have changed, you need to stage each changed file that you want included in the commit.
The easiest way to do this is in the Git pane in RStudio, which shows a list of all the changed files. You can then check the box next to each file you want to commit. You can also do this in the window that you open with the “Diff” or “Commit” buttons in the Git pane.
To commit, you choose one or more files to stage, write a comment describing the changes your commit is recording, and press the Commit button.
Using Git with RStudio
File Differencing: When you are working in an RStudio project that has a git repository, if you edit a file, RStudio notices that it has changed and the file appears in the “Git” window in RStudio. If you highlight that file in the “Git” window and click on the “Diff” button, RStudio will open a window that shows you what has changed in the file compared to the latest version in the repository.
RStudio shows the changes by identifying which lines in the file have changed, and showing the old version of those lines in red, and the new version in green. If you delete a line, you will just see a red line, and if you add a new line, you will just see a green line.
If you decide that you are not happy with the changes and want to restore the file to the version that existed in the repository, you can right click on the file and select “Revert.” Be careful with this, because if you revert a file, you will lose all the changes you made!
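The same two operations are available from a terminal. The sketch below assumes that RStudio's “Diff” and “Revert” correspond roughly to `git diff` and `git checkout -- <file>`. The repository, the file name `poem.txt`, and the placeholder identity are made up for illustration.

```shell
# Build a throwaway repository with one committed file.
demo=$(mktemp -d)
git init -q "$demo/project"
cd "$demo/project"
git config user.name "Your Name"
git config user.email "your.email.address@vanderbilt.edu"
echo "Mary had a little lamb" > poem.txt
git add poem.txt
git commit -q -m "Add first line of poem"

# Edit the file, then ask git what changed since the last commit.
# The old line is prefixed with "-" and the new line with "+".
echo "Mary had a great big lamb" > poem.txt
git --no-pager diff poem.txt

# Throw away the uncommitted edit. This cannot be undone!
git checkout -- poem.txt
cat poem.txt
```

After the last command the file contains the committed text again, and the edit is gone for good.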
Staging and Committing Changes is where you tell git to save the changes that you made to your local files into your local repository. Git will only record changes in files when you commit changes. Committing is a two-step process:
First, you stage the files you want to commit. When you stage files, you tell git which of the changed files you want to include in the commit.
Second, after you have staged one or more changed files, you tell git to commit the changes on those files to the repository. This makes a permanent record of all the changes in the staged files between the previous commit and this commit.
In RStudio, any file you change will show up in the “Git” pane. Changes can be editing a file, creating a new file, deleting a file, or renaming a file. You stage a file by checking the box next to the file name in the “Git” window in RStudio. Then you tell git to permanently remember those changes by committing: click on the “Commit” button in the “Git” window.
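As a rough command-line picture of the two steps, `git add` does the staging (like checking the box) and `git commit` records the snapshot. The sketch below uses a throwaway repository. The file names and the placeholder identity are hypothetical.

```shell
demo=$(mktemp -d)
git init -q "$demo/project"
cd "$demo/project"
git config user.name "Your Name"
git config user.email "your.email.address@vanderbilt.edu"

# Two new files: both are listed as changes.
echo "answer to exercise 1" > homework.txt
echo "rough notes" > scratch.txt
git status --short

# Step 1: stage only the file you want in the commit.
git add homework.txt

# Step 2: commit the staged file with a descriptive comment.
git commit -q -m "Answered exercise 1"

# scratch.txt was never staged, so it is still listed as uncommitted.
git status --short
```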
Committing stores changes. If you delete a file, and then stage and commit it, the repository will note that the file is now deleted. However, all of the previous versions of the file, before you committed the delete, will be stored in the repository. This is very useful. If you have ever been working on a project and accidentally deleted an important file, that can be very painful. With git, if you committed that file to the repository, then even if you delete the file and commit the delete, you will be able to recover the file by checking a previous version out of the repository.
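Here is a minimal sketch of that kind of rescue from a terminal, assuming the deletion was the most recent commit, so that the file still exists one commit back (`HEAD~1`). The file name `results.txt` and the placeholder identity are made up.

```shell
demo=$(mktemp -d)
git init -q "$demo/project"
cd "$demo/project"
git config user.name "Your Name"
git config user.email "your.email.address@vanderbilt.edu"

# Commit an important file.
echo "important results" > results.txt
git add results.txt
git commit -q -m "Add results"

# Delete the file and commit the deletion.
git rm -q results.txt
git commit -q -m "Delete results"

# Check the file out of the commit just before the deletion.
git checkout HEAD~1 -- results.txt
cat results.txt
```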
Note that the repository only remembers the changes that you tell it to commit. Specifically, it records the differences between the last version of the file that you committed and the new version that you are committing. It does not know anything about anything that happened between the two commits. Thus, if you edit a file but do not commit it, and then delete the file and commit the file as a deleted file, git will only record the fact that you deleted the file. It will not remember any edits that you made before you deleted it.
It is a good idea to commit changes pretty frequently. Any time you have something that is working, it’s a good idea to commit. For instance, if you are working on a project that has many parts to it, as soon as you have answered one part you should stage and commit the changes. That way, if something goes wrong, you can recover your work from the repository.
When you commit changes to a repository, git asks you to enter a comment to describe the commit. You can give a brief description of what you changed in the commit, or remark on the state of the files (e.g., “Answered exercises 1-5.” or “Finally, the scripts are working properly!”). Think about what would be useful to you in helping you understand the commit if you are looking back over your repository history at some time in the future.
Synchronizing Repositories: Git can synchronize multiple repositories. You can push the changes you have made on your local repository to a remote repository on a server, or you can pull changes in the remote repository to your local computer and merge them into your local repository.
When you work on your computer, all the work you do on your project is stored on the local git repository on your computer. You turn in your homework by pushing your local repository to the remote repository at GitHub. Only after you have pushed your work will I be able to see what you have done.
This is also relevant to asking for help. If you push your work to GitHub, then I can clone it to my computer and take a look at your code and offer comments or suggestions.
This may seem complicated, but you can simplify it if you follow a basic practice:
Every time you start working on a project that has a remote repository, pull from the remote repository before you start working.
Every time you have committed work that you don’t want to lose, push to the remote repository.
This also means that if you push projects from your personal computer to a remote repository (e.g., on GitHub), then even if your personal computer breaks or gets stolen, the remote GitHub repository will have the whole history of the project, up through the last time you pushed it.
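The pull-before-you-start, push-when-you-commit habit can be sketched from a terminal as follows. So that it runs offline, a bare repository in a temporary directory (`remote.git`) stands in for the GitHub repository, and two ordinary directories (`laptop` and `desktop`) stand in for two computers. All of these names, and the placeholder identity, are hypothetical.

```shell
demo=$(mktemp -d)

# Hypothetical stand-in for the remote repository on GitHub.
git init -q --bare "$demo/remote.git"

# First computer: commit some work, then push it to the remote.
git init -q "$demo/laptop"
cd "$demo/laptop"
git config user.name "Your Name"
git config user.email "your.email.address@vanderbilt.edu"
git remote add origin "$demo/remote.git"
echo "exercise 1" > homework.txt
git add homework.txt
git commit -q -m "Answered exercise 1"
git push -q -u origin HEAD

# Second computer: clone, pull before starting, work, commit, push.
git clone -q "$demo/remote.git" "$demo/desktop"
cd "$demo/desktop"
git config user.name "Your Name"
git config user.email "your.email.address@vanderbilt.edu"
git pull -q
echo "exercise 2" >> homework.txt
git add homework.txt
git commit -q -m "Answered exercise 2"
git push -q

# Back on the first computer: pull before doing anything else.
cd "$demo/laptop"
git pull -q
cat homework.txt
```

Because each side pulls before editing and pushes after committing, the two copies never drift apart and no conflicts arise.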
Conflicts
If you edit the same file on two different computers, git will attempt to merge the two sets of edits automatically. Git does a good job with this if you edit different lines on the two computers. However, if you edit the same lines on the two computers, git doesn’t know which version of the changed lines you want to keep.
Original | Computer 1 | Computer 2 |
---|---|---|
Mary had a little lamb | Mary had a great big lamb | Mary had a little lamb |
Its fleece was white as snow | Its fleece was white as clouds | Its fleece was white as milk |
And everywhere that Mary went | And everywhere that Mary went | And everywhere that Mary walked |
The lamb was sure to go | The lamb was sure to go | The lamb was sure to go |
If you try to merge these, git can deal with the edits to the first and third lines, but the two computers made incompatible edits to the second line and git does not know whether to go with “clouds” or “milk”.
When you pull the changes from one computer onto the other, git will complain about a conflict, and the file will look like
Mary had a great big lamb
<<<<<<< HEAD
Its fleece was white as clouds
=======
Its fleece was white as milk
>>>>>>> change
And everywhere that Mary walked
The lamb was sure to go
Then you have to manually edit the file to resolve the conflict.
If you have conflicts, you will need to edit the files to resolve the conflicts and delete the lines git uses to mark conflicts (the ones beginning with `<<<<<<<`, `=======`, and `>>>>>>>`). Then you will need to stage the files where you resolved the conflicts and make a commit.
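To see a conflict and its resolution from start to finish, here is a simplified sketch in a throwaway repository. To keep it short, I assume a two-line poem in which both sets of edits touch only the second line. The second set of edits lives on a branch named `change`, which stands in for the other computer. The file name, the choice to keep “clouds”, and the placeholder identity are arbitrary.

```shell
demo=$(mktemp -d)
git init -q "$demo/poem"
cd "$demo/poem"
git config user.name "Your Name"
git config user.email "your.email.address@vanderbilt.edu"

# Original version.
printf '%s\n' "Mary had a little lamb" "Its fleece was white as snow" > poem.txt
git add poem.txt
git commit -q -m "Original poem"

# "Computer 2": edit the second line on a branch named change.
git checkout -q -b change
printf '%s\n' "Mary had a little lamb" "Its fleece was white as milk" > poem.txt
git commit -q -a -m "Fleece white as milk"

# "Computer 1": go back to the original branch and edit the same line.
git checkout -q -
printf '%s\n' "Mary had a little lamb" "Its fleece was white as clouds" > poem.txt
git commit -q -a -m "Fleece white as clouds"

# Merging the two sets of edits stops with a conflict.
git merge change || echo "Merge stopped with a conflict"
markers=$(grep -c '^<<<<<<<' poem.txt)
cat poem.txt

# Resolve by writing the version you want, without the marker lines,
# then stage the file and commit.
printf '%s\n' "Mary had a little lamb" "Its fleece was white as clouds" > poem.txt
git add poem.txt
git commit -q -m "Resolve conflict: keep clouds"
```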
GitHub and GitHub Classroom
GitHub is a web site devoted to sharing open-source Git repositories and allowing paying customers to operate private git repositories. You can get a free account at https://github.com and as a student, you can get some free extra features if you request a student account at https://education.github.com/students.
GitHub Classroom is an add-on service that GitHub offers for teachers, which allows teachers to post assignments on GitHub and then invite students to clone the assignment and then turn in the completed assignment via a private repository.
For each homework assignment, I will create a repository on GitHub Classroom and invite you to accept the assignment by posting a URL on the assignment page on the class website. When you accept the assignment, GitHub will clone the assignment into a private repository just for you on GitHub. Only you and I will be able to see your private repository.
You can then clone the private repository to your personal computer and complete it. As you make commits, I encourage you to push the changes back up to GitHub.