Pre-requisites

To follow along, please do the following before we start:

Git & GitHub for Data Scientists

Topics

Questions to be Answered

Conceptual

  • What is Git?
  • What is GitHub?
  • What is version control?
  • What's a repository?
  • What's a commit?
  • What's a fork?
  • What's a branch?
  • What's a diff?
  • What's a conflict?
  • What's a pull request?

Practical

  • How do I push changes for others to see?
  • How do I merge the changes others have made?
  • How do I notify others that I have made changes?
  • How can I get an overview of the repository?
  • What's the difference between branches and forks and which one should I use?

What is GitHub?

GitHub is a Git repository hosting platform.

  • Provides a central place to store your source code.
  • Enables collaboration with others through pull requests.
  • Also, it renders notebooks!

Getting Started to Follow Along

  1. Start Jupyter or JupyterLab:
    jupyter lab
    
  2. Open a terminal to run commands.

Example

Niwako and her team are collaborating on a Covid-19 dataset.

Demo - Commit & Push

  1. Create a new repo: https://github.com/new
  2. Clone the repo locally.
    git clone <repo> github-jupyter-covid
    
  3. Create a new notebook file.
    • Make a plot using data from here.
  4. Stage, commit, and push the file back to GitHub.
    git add .
    git commit -m "<commit message>"
    git push
    
  5. Take a look at the Notebook on GitHub.
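The same stage, commit, and push cycle can be rehearsed offline. The sketch below is only an illustration under stated assumptions: a local bare repository (`remote.git`) stands in for the repo created on GitHub, `covid.ipynb` is a placeholder notebook, and the demo identity is made up.

```shell
set -eu

# Throwaway identity so commits work even without a global Git config.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
cd "$work"

# Stand-in for GitHub (assumption): a bare repo with one initial commit on master.
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
git -C seed commit -q --allow-empty -m "Initial commit"
git clone -q --bare seed remote.git

# Step 2: clone the repo locally.
git clone -q remote.git github-jupyter-covid
cd github-jupyter-covid

# Step 3: placeholder notebook (an empty nbformat 4 document).
printf '{"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}\n' > covid.ipynb

# Step 4: stage, commit, and push.
git add covid.ipynb
git commit -q -m "Add covid notebook"
git push -q

# The "remote" now holds the new commit.
git -C "$work/remote.git" --no-pager log --oneline
```

Against a real GitHub repo, only the clone URL changes; the add, commit, and push steps are the same.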

What is Git?

  • A distributed version control system.

What's a version control system?

  • Software that keeps track of the history of files in a repository.

What's a repo (repository)?

  • The place where all the files in your project are stored,
  • Along with every version of those files that was ever committed,
  • Including files in other branches.

What's a branch?

  • A separate line of commits off the main branch.
  • When done, we merge the branch:
    • Either, back to the main branch.
    • Or, into another branch.
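A minimal sketch of a branch's life cycle, assuming a scratch repository in a temporary directory. The branch name `add-population` and the CSV files are hypothetical, chosen only for illustration.

```shell
set -eu

export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master   # name the main branch "master", as in these slides

echo "date,cases" > data.csv
git add data.csv
git commit -q -m "Add data file"

# Start a separate line of commits off the main branch.
git checkout -q -b add-population
echo "state,population" > population.csv
git add population.csv
git commit -q -m "Add population data"

# When done, merge the branch back into the main branch.
git checkout -q master
git merge -q --no-ff -m "Merge branch 'add-population'" add-population

git --no-pager log --graph --oneline --all
```

The `--no-ff` flag forces a merge commit so the branch stays visible in the graph; without it, Git would simply fast-forward `master` here.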

How does it work?

  • GitHub hosts your team's repository.
  • Your fork of the team's repo on GitHub is a full copy of the team's repo along with the full history of every file.
  • Your locally cloned repo of your fork is also a full copy of the one on GitHub.
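One way to see that every copy carries the full history is to count commits in each. This is a sketch under assumptions: local bare repositories (`team.git`, `fork.git`) stand in for the team's repo and your fork on GitHub, and `local` stands in for your clone.

```shell
set -eu

export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
cd "$work"

# Build a small history of three commits.
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/master
for n in 1 2 3; do
    echo "revision $n" > notes.txt
    git add notes.txt
    git commit -q -m "Revision $n"
done
cd "$work"

git clone -q --bare seed team.git       # the team's repo (stand-in for GitHub)
git clone -q --bare team.git fork.git   # your fork (stand-in for GitHub)
git clone -q fork.git local             # your local clone of the fork

# Every copy reports the same number of commits.
for repo in team.git fork.git local; do
    printf '%s: ' "$repo"
    git -C "$repo" rev-list --count --all
done
```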

What are Pull Requests?

  • A request to review changes before merging changes into the main branch.

What's a diff?

  • Reveals just the changes that were made, which makes the reviewer's life easier.
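As a small illustration, the sketch below commits a CSV file, changes one value, and asks Git for the diff. The file name and numbers are made up for the example.

```shell
set -eu

export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo

printf 'state,cases\nCA,100\nNY,200\n' > cases.csv
git add cases.csv
git commit -q -m "Add case counts"

# Change a single value, then show only what changed since the last commit.
printf 'state,cases\nCA,100\nNY,250\n' > cases.csv
git --no-pager diff -- cases.csv
```

In the output, the removed line is prefixed with `-` and the added line with `+`; unchanged lines appear only as context.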

Revisiting our Example

Demo - Fork & Pull Request

  1. Fork the github-jupyter/covid repo.
    • This is Shad's forked repo.
  2. Clone the repo locally
    git clone <repo> shadanan-covid
    
  3. Update the covid notebook: add a chart dividing confirmed cases by population.
    • Use state population data from kaggle or from the original source.
  4. Use GitHub to create a pull request against upstream.
  5. Observe the state of your repo with:
    git log --graph --all
    

Demo - ReviewNB

  • Notebooks are JSON - they don't diff well.
  1. Install ReviewNB.
  2. Use ReviewNB to view the diff and make a comment.
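To see why plain diffs of notebooks are noisy, consider the sketch below. It assumes a hand-written, minimal notebook (`analysis.ipynb`, a hypothetical name) whose code is identical in both versions; only the execution counter changes, as it would after re-running a cell.

```shell
set -eu

export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo

# Write a one-cell notebook; $1 is the cell's execution count.
write_notebook() {
    printf '{\n "cells": [\n  {\n   "cell_type": "code",\n   "execution_count": %s,\n   "metadata": {},\n   "outputs": [],\n   "source": ["print(1 + 1)"]\n  }\n ],\n "metadata": {},\n "nbformat": 4,\n "nbformat_minor": 5\n}\n' "$1" > analysis.ipynb
}

write_notebook 1
git add analysis.ipynb
git commit -q -m "Add analysis notebook"

# "Re-run" the cell: the code is unchanged, but the JSON is not.
write_notebook 2
git --no-pager diff -- analysis.ipynb
```

Git reports a change even though the code did not change. Real notebooks add embedded outputs and images on top of this, which is the noise notebook-aware tools try to hide.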

Synchronizing Changes

git fetch

git pull (on master)

git push (on branch)
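The three commands can be tried together offline. This is a sketch under assumptions: a local bare repo (`remote.git`) stands in for GitHub, `niwako` and `shad` are two clones of it, and the branch name `update-plot` is hypothetical. The fast-forward merge into master stands in for merging a pull request.

```shell
set -eu

export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
cd "$work"
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
git -C seed commit -q --allow-empty -m "Initial commit"
git clone -q --bare seed remote.git
git clone -q remote.git niwako
git clone -q remote.git shad

# git push (on branch): Niwako publishes her work on a branch.
cd "$work/niwako"
git checkout -q -b update-plot
echo "plot v2" > notes.txt
git add notes.txt
git commit -q -m "Update plot notes"
git push -q -u origin update-plot

# git fetch: Shad learns about the new branch; his working files do not change.
cd "$work/shad"
git fetch -q
git branch -r

# The branch is merged into master on the remote (stand-in for a merged pull request).
cd "$work/niwako"
git checkout -q master
git merge -q update-plot
git push -q

# git pull (on master): Shad fetches and merges master in one step.
cd "$work/shad"
git pull -q --ff-only
ls
```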

What's a conflict?

  • A change that cannot be automatically reconciled.
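A conflict is easy to provoke on purpose. In this sketch (assumed file name and values, scratch repository), two branches change the same line of the same file, so Git stops the merge and asks a human to decide.

```shell
set -eu

export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master

printf 'CA,100\n' > cases.csv
git add cases.csv
git commit -q -m "Add case count"

# One side changes the line on a branch...
git checkout -q -b niwako
printf 'CA,120\n' > cases.csv
git commit -q -am "Niwako's count"

# ...and the other side changes the same line on master.
git checkout -q master
printf 'CA,150\n' > cases.csv
git commit -q -am "Shad's count"

# Git cannot pick a winner, so the merge stops.
if git merge -q niwako; then
    outcome="merged"
else
    outcome="conflict"
    git status --short    # "UU" marks a file with unresolved conflicts
    git merge --abort     # return to the state before the merge
fi
echo "$outcome"
```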

Dealing with Conflicts

  • Jupyter notebooks aren't easy to merge because they are JSON docs.
  • The best way to deal with conflicts is to avoid them.
    • Don't work on the same notebook at the same time as someone else.
    • Have everyone on your team make changes in their own copy of the notebook, named with their initials.
    • Have a single person be in charge of merging the final changes into the source of truth notebook.

But sometimes, you end up in a situation where you have no other choice.

Let's see what options we have...

Revisiting our Example

Demo Cont'd - nbdime (NoteBook DIff & MErge)

  1. Add github-jupyter/covid to our remotes.
    git remote add upstream git@github.com:github-jupyter/covid.git
    
  2. Use git fetch upstream to get Niwako's changes.
  3. Use git log --graph --all to view the state of all the repos.
  4. Try merging upstream/master
    git merge upstream/master
    
    • Observe that the notebook is now broken.
  5. Abort the merge.
    git merge --abort
    
  6. Install nbdime.
    pip3 install --upgrade nbdime
    nbdime extensions --enable
    
  7. Enable nbdime for the current repo
    nbdime config-git --enable
    
  8. Run the merge again.
    git merge upstream/master
    
  9. Use nbdime's merge tool:
    git mergetool --tool nbdime -- *.ipynb
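Steps 1 to 5 can be rehearsed without GitHub or nbdime. The sketch below makes several assumptions: local bare repos (`upstream.git`, `fork.git`) stand in for github-jupyter/covid and Shad's fork, the notebook is a hand-written one-cell file, and the titles are invented. It shows why a plain Git merge leaves the notebook broken: conflict markers end up inside the JSON.

```shell
set -eu

export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

# Write a one-cell notebook to covid.ipynb; $1 is the markdown title.
write_notebook() {
    printf '{\n "cells": [\n  {"cell_type": "markdown", "metadata": {}, "source": ["%s"]}\n ],\n "metadata": {},\n "nbformat": 4,\n "nbformat_minor": 5\n}\n' "$1" > covid.ipynb
}

work=$(mktemp -d)
cd "$work"
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/master
write_notebook "Covid cases"
git add covid.ipynb
git commit -q -m "Add covid notebook"
cd "$work"

git clone -q --bare seed upstream.git       # stand-in for github-jupyter/covid
git clone -q --bare upstream.git fork.git   # stand-in for Shad's fork
git clone -q upstream.git niwako
git clone -q fork.git shad

# Niwako changes the notebook upstream.
cd "$work/niwako"
write_notebook "Covid cases by state"
git commit -q -am "Niwako: retitle notebook"
git push -q

# Shad changes the same cell in his clone of the fork.
cd "$work/shad"
write_notebook "Covid cases per capita"
git commit -q -am "Shad: retitle notebook"

# Steps 1-3: add the upstream remote, fetch, and look at the graph.
git remote add upstream "$work/upstream.git"
git fetch -q upstream
git --no-pager log --graph --oneline --all

# Steps 4-5: the merge conflicts, leaving markers inside the JSON; abort it.
markers=0
if ! git merge -q upstream/master; then
    markers=$(grep -c '^<<<<<<<' covid.ipynb)
    git merge --abort
fi
echo "conflict markers found: $markers"
```

With nbdime enabled for the repo (steps 6 to 9), the same merge is handled by a notebook-aware tool instead of line-based text merging.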
    

Git Cheat Sheet

  • Git Config
    • Global: ~/.gitconfig
    • Repo: .git/config
  • Status
    git status
    
  • Log / Dag
    git log --graph --oneline --all
    
  • Committing:
    git add [file]
    git commit
    
  • GitHub's Cheat Sheet

Summary

Bonus - black

  • An opinionated Python code formatter.
  • Useful when collaborating because it eliminates code style discussions in reviews.
  • Installation instructions are here.

Configure git dag

Run:

git log --graph --all

Or, for something really special, put the following into your ~/.gitconfig file:

[alias]
    dag = log --graph --abbrev-commit --decorate --date=relative --format=format:'%C(bold blue)%h%C(reset) -%C(auto)%d%C(reset) %C(bold white)%s%C(reset)%n          %C(dim white)%an%C(reset) <%ae> -%C(reset) %C(cyan)%aD%C(reset) %C(green)(%ar)%C(reset)' --all

Now you can run:

git dag
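An alias can also be set from the command line instead of editing the config file by hand. The sketch below uses a shorter format string than the one above and writes the alias to a scratch repo's `.git/config`, so it does not touch your `~/.gitconfig`; adding `--global` to the `git config` call would write it there instead.

```shell
set -eu

export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git commit -q --allow-empty -m "Initial commit"

# Repo-level alias, stored in .git/config.
git config alias.dag "log --graph --oneline --decorate --all"

# Confirm what was stored, then use the alias.
git config --get alias.dag
git --no-pager dag
```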