@pbugnion
Last active October 22, 2023 12:25
Keeping IPython notebooks under Git version control

This gist lets you keep IPython notebooks in git repositories. It tells git to ignore prompt numbers and cell outputs when checking whether a file has changed.

To use the script, follow the instructions given in the script's docstring.

For further details, read this blogpost.

The procedure outlined here is inspired by this answer on Stack Overflow.

#!/usr/bin/env python

"""
Suppress output and prompt numbers in git version control.

This script will tell git to ignore prompt numbers and cell output
when looking at ipynb files if their metadata contains:

    "git" : { "suppress_outputs" : true }

The notebooks themselves are not changed.

See also this blogpost: http://pascalbugnion.net/blog/ipython-notebooks-and-git.html.

Usage instructions
==================

1. Put this script in a directory that is on the system's path.
   For future reference, I will assume you saved it in
   `~/scripts/ipynb_drop_output`.
2. Make sure it is executable by typing the command
   `chmod +x ~/scripts/ipynb_drop_output`.
3. Register a filter for ipython notebooks by
   putting the following line in `~/.config/git/attributes`:
   `*.ipynb filter=clean_ipynb`
4. Connect this script to the filter by running the following
   git commands:

   git config --global filter.clean_ipynb.clean ipynb_drop_output
   git config --global filter.clean_ipynb.smudge cat

To tell git to ignore the output and prompts for a notebook,
open the notebook's metadata (Edit > Edit Notebook Metadata). A
panel should open containing the lines:

    {
        "name" : "",
        "signature" : "some very long hash"
    }

Add an extra line so that the metadata now looks like:

    {
        "name" : "",
        "signature" : "don't change the hash, but add a comma at the end of the line",
        "git" : { "suppress_outputs" : true }
    }

You may need to "touch" the notebooks for git to actually register a change, if
your notebooks are already under version control.

Notes
=====

This script is inspired by http://stackoverflow.com/a/20844506/827862, but
lets the user specify whether the output of a notebook should be suppressed
in the notebook's metadata, and works for IPython v3.0.
"""
import sys
import json

nb = sys.stdin.read()

json_in = json.loads(nb)
nb_metadata = json_in["metadata"]

# Only strip output when the notebook opts in through its metadata.
suppress_output = False
if "git" in nb_metadata:
    if "suppress_outputs" in nb_metadata["git"] and nb_metadata["git"]["suppress_outputs"]:
        suppress_output = True
if not suppress_output:
    # Pass the notebook through untouched.
    sys.stdout.write(nb)
    sys.exit()

ipy_version = int(json_in["nbformat"])-1 # nbformat is 1 more than actual version.

def strip_output_from_cell(cell):
    if "outputs" in cell:
        cell["outputs"] = []
    if "prompt_number" in cell:
        del cell["prompt_number"]

if ipy_version == 2:
    # nbformat 3 nests cells inside worksheets.
    for sheet in json_in["worksheets"]:
        for cell in sheet["cells"]:
            strip_output_from_cell(cell)
else:
    for cell in json_in["cells"]:
        strip_output_from_cell(cell)

json.dump(json_in, sys.stdout, sort_keys=True, indent=1, separators=(",",": "))
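
The metadata edit described in the docstring can also be made without the notebook UI. The following is a minimal sketch, assuming the notebook has already been parsed from JSON. The function name `enable_output_suppression` is hypothetical and not part of the gist; only the key names `git` and `suppress_outputs` come from the script.

```python
import json


def enable_output_suppression(notebook):
    """Return a copy of a parsed notebook with the opt-in flag set.

    `notebook` is the dict produced by json.load() on an .ipynb file.
    The key names match what the filter script checks for.
    """
    updated = dict(notebook)
    metadata = dict(updated.get("metadata", {}))
    git_options = dict(metadata.get("git", {}))
    git_options["suppress_outputs"] = True
    metadata["git"] = git_options
    updated["metadata"] = metadata
    return updated


# Example with an in-memory notebook. For a real file, json.load() it first
# and json.dump() the result back to the same path.
example = {"metadata": {"name": ""}, "nbformat": 4, "nbformat_minor": 0, "cells": []}
flagged = enable_output_suppression(example)
print(json.dumps(flagged["metadata"], sort_keys=True))
```

The input dict is left untouched, so the helper is safe to try on a notebook before deciding to write anything back to disk.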
@LonghornRach

What would be a reason for you not to run the filter on a notebook? (Like what would an example use case be?)

@jodybrookover

If you wanted to keep the data/output.

@EvdH0

EvdH0 commented Sep 25, 2015

According to https://ipython.org/ipython-doc/dev/notebook/nbformat.html#code-cells, prompt_number was renamed to execution_count in v4.0, so lines 81-82 should be:

if "execution_count" in cell:
    del cell["execution_count"]

@kkumer

kkumer commented Oct 9, 2015

Actually, if you delete the execution_count key, Jupyter will complain. What works for me is:

if "execution_count" in cell:
    cell["execution_count"] = None
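
Putting the two suggestions together, a cell-stripping function that copes with both the older `prompt_number` key and the newer `execution_count` key might look like the sketch below. It is an illustration rather than the gist author's code; the sample cell is made up.

```python
def strip_output_from_cell(cell):
    """Clear outputs and execution counters from one notebook cell, in place."""
    if "outputs" in cell:
        cell["outputs"] = []
    # Older formats: the original script deletes this key.
    if "prompt_number" in cell:
        del cell["prompt_number"]
    # nbformat 4: keep the key but reset it, since Jupyter complains if it is missing.
    if "execution_count" in cell:
        cell["execution_count"] = None


cell = {
    "cell_type": "code",
    "execution_count": 7,
    "outputs": [{"output_type": "stream"}],
    "source": "1 + 1",
}
strip_output_from_cell(cell)
print(cell)
```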

@rschutjens

Is it correct that this does not restore the output if you check out an earlier version of the notebook? That would be what I want: only the latest commit would have the full ipynb in git (and after I push it to GitHub).

I am new to notebooks, git, and GitHub, so sorry if the answer is obvious.

@jibe-b

jibe-b commented Mar 21, 2016

Here is an update for version 4 of the ipynb format:

import sys
import json

nb = sys.stdin.read()

json_in = json.loads(nb)

nb_metadata = json_in["metadata"]
suppress_output = False
if "git" in nb_metadata:
    if "suppress_outputs" in nb_metadata["git"] and nb_metadata["git"]["suppress_outputs"]:
        suppress_output = True
if not suppress_output:
    sys.stdout.write(nb)
    exit()

def strip_output_from_cell(cell):
    if "outputs" in cell:
        cell["outputs"] = []
    if "execution_count" in cell:
        del cell["execution_count"]

for cell in json_in["cells"]:
    strip_output_from_cell(cell)

json.dump(json_in, sys.stdout, sort_keys=True, indent=1, separators=(",",": "))

@jibe-b

jibe-b commented Mar 21, 2016

only the latest commit would have full ipynb in git (and after I push it to github)

@rschutjens I am not sure I understand what you mean, but the behavior seems a bit different from what you expect:

  • the script creates a temporary version of the notebook without cell outputs, while the original file is left unchanged
  • this temporary version is what gets staged, so git add, git commit, and git push record the version without outputs.

So when you check out, you never get the outputs.
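
The asymmetry between staging and checkout can be sketched in Python. This is only an illustration of the data flow, not how git invokes filters internally. `clean` and `smudge` are hypothetical stand-ins for the two configured commands, and the notebook is assumed to be nbformat 4 with output suppression enabled.

```python
import json


def clean(notebook_text):
    """Stand-in for the clean filter: applied when a file is staged."""
    notebook = json.loads(notebook_text)
    for cell in notebook["cells"]:
        if "outputs" in cell:
            cell["outputs"] = []
        if "execution_count" in cell:
            cell["execution_count"] = None
    return json.dumps(notebook, sort_keys=True, indent=1)


def smudge(stored_text):
    """Stand-in for the smudge filter `cat`: applied on checkout, changes nothing."""
    return stored_text


working_copy = json.dumps({
    "cells": [{
        "cell_type": "code",
        "execution_count": 3,
        "outputs": [{"output_type": "stream", "text": "hi"}],
        "source": "print('hi')",
    }],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 0,
})

stored = clean(working_copy)   # what the repository records
checked_out = smudge(stored)   # what a later checkout writes to disk

print(json.loads(checked_out)["cells"][0]["outputs"])
```

Because the smudge step is the identity, nothing can put the outputs back: they exist only in the working copy that produced them.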

Welcome to notebooks and git, hope you enjoy it!

@Era-Dorta

The latest version complained about "execution_count" not being defined. In @jibe-b's version, changing
del cell["execution_count"]
to
cell["execution_count"] = None
fixed the issue.

@miketrumpis

Thanks for the tool! I don't know if this is a json version quirk, but I needed to add a newline to the end of stdout to keep git happy.

json.dump(json_in, sys.stdout, sort_keys=True, indent=1, separators=(",",": "))
sys.stdout.write('\n')

@tonysherbondy

Also, if you use unicode in your notebooks, you probably want to let json.dump keep those with this:

json.dump(json_in, sys.stdout, sort_keys=True, indent=1, separators=(",",": "), ensure_ascii=False)
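
The trailing newline and the `ensure_ascii=False` suggestion can be combined in one small helper. This is a sketch assuming Python 3; `write_notebook` is a hypothetical name, and in the script the `stream` argument would be `sys.stdout`. Whether non-ASCII characters survive on a real stdout also depends on its encoding, which the sketch does not address.

```python
import io
import json


def write_notebook(notebook, stream):
    """Serialize a parsed notebook as the script does, with both tweaks applied."""
    json.dump(notebook, stream, sort_keys=True, indent=1,
              separators=(",", ": "), ensure_ascii=False)
    # End the output with a newline, as suggested above.
    stream.write("\n")


# Demonstrate on an in-memory stream instead of sys.stdout.
buffer = io.StringIO()
write_notebook({"cells": [], "metadata": {"title": "café"}}, buffer)
text = buffer.getvalue()
print(text, end="")
```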

@korakotlee

On Linux I got this error:
error: cannot run ipynb_drop_output: No such file or directory

So I went into .gitconfig and added the .py suffix.
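
An error like this usually means the name in the git config does not match an executable on the PATH. One way to check is a PATH lookup from Python, sketched below. `find_filter_command` is a hypothetical helper, and it assumes the configured filter is a bare command name without arguments.

```python
import shutil


def find_filter_command(name="ipynb_drop_output"):
    """Return the path a PATH lookup resolves `name` to, or None if there is none.

    shutil.which only reports files that are executable, so a script that
    is present but was never chmod +x'ed also yields None.
    """
    return shutil.which(name)


location = find_filter_command()
if location is None:
    print("ipynb_drop_output is not on PATH, or is not executable")
else:
    print("found at", location)
```

If the script was saved as `ipynb_drop_output.py`, passing that name instead shows whether the mismatch is only the suffix.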

@larister

Thanks @korakotlee, that saved me a bit of head scratching

@bw-matthew

With jq you can implement this entirely inside the git configuration, which then makes it easy to turn it on for a specific repo.

Add the following to .git/config:

[filter "clean_ipynb"]
    clean = jq '{ cells: [.cells[] | . + { metadata: {} } + if .cell_type == \"code\" then { outputs: [], execution_count: null } else {} end ] } + delpaths([[\"cells\"]])'
    smudge = cat

Create the file .git/info/attributes with the content from above:

*.ipynb  filter=clean_ipynb

This does not conditionally prevent the formatting of the notebooks using the metadata. I'm sure it would be possible to check for this by wrapping the overall statement in an if.

@matthewfranglen

An improvement to the jq approach is this:

[filter "clean_ipynb"]
    clean = jq --indent 1 --monochrome-output '. + if .metadata.git.suppress_outputs | not then { cells: [.cells[] | . + if .cell_type == \"code\" then { outputs: [], execution_count: null } else {} end ] } else {} end'
    smudge = cat

It honors .metadata.git.suppress_outputs, but note that the sense is the opposite of the Python scripts: this filter will not alter the notebook if the value is present and truthy, and strips outputs otherwise. It also matches the indentation I have observed.
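
For readers more comfortable in Python, the logic of the jq expression above can be sketched as follows. `jq_style_clean` is a hypothetical name. The branch on the flag follows the jq expression as written: the notebook passes through untouched when `suppress_outputs` is truthy, and code cells are cleared otherwise.

```python
def jq_style_clean(notebook):
    """Mirror the jq filter on a parsed nbformat 4 notebook, returning a new dict."""
    flag = notebook.get("metadata", {}).get("git", {}).get("suppress_outputs")
    if flag:
        # Flag present and truthy: the jq filter leaves the notebook as it is.
        return notebook
    cleaned_cells = []
    for cell in notebook["cells"]:
        if cell.get("cell_type") == "code":
            # Equivalent of `. + { outputs: [], execution_count: null }`.
            cell = {**cell, "outputs": [], "execution_count": None}
        cleaned_cells.append(cell)
    return {**notebook, "cells": cleaned_cells}


sample = {
    "metadata": {},
    "cells": [
        {"cell_type": "code", "outputs": [{"output_type": "stream"}], "execution_count": 2},
        {"cell_type": "markdown", "source": "notes"},
    ],
}
result = jq_style_clean(sample)
print(result["cells"][0])
```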

@mazzma12

mazzma12 commented Sep 6, 2017

3. Register a filter for ipython notebooks by
   putting the following line in `~/.config/git/attributes`:
   `*.ipynb  filter=clean_ipynb`

Use these commands to create a gitattributes file if you do not have one:

touch ~/.gitattributes
git config --global core.attributesfile ~/.gitattributes

@rgbkrk

rgbkrk commented Jan 11, 2018

Something you may want to try out now is the official jupyter tool, nbdime -- it does similar git integration and quite a bit more.

@rightx2

rightx2 commented Jan 26, 2018

Works well on Mac OS X but not on Ubuntu 16.04. . . . The only difference is the existence of the ~/.config/git folder. This directory exists on Mac OS X, but not on Ubuntu 16.04. So I created the directory manually and created the attributes file too, but it still doesn't work.

@chetgray

chetgray commented Jun 8, 2018

In the initial overview part of the script's docstring, the example metadata should read

"git" : { "suppress_outputs" : true }

"suppress_output", without the final s, is currently the key documented there (elsewhere it's correctly documented "suppress_outputs").

@jblom

jblom commented Feb 27, 2019

I was looking for a way to solve this for Jupyter, and it turns out you can attach this cleaning step to a pre-save hook via jupyter_notebook_config.py:

https://jupyter-notebook.readthedocs.io/en/stable/extending/savehooks.html

(in case you can't find the Jupyter config)
https://stackoverflow.com/questions/32625939/ipython-notebook-where-is-jupyter-notebook-config-py-in-mac

I'm using this now and it works well. I haven't read all the details in this thread, but I guess you can solve a lot by diving more into these Jupyter hooks & config options.
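
A pre-save hook along these lines, adapted from the example in the Jupyter save hooks documentation linked above, is sketched below. Unlike the git filter, it changes the file written to disk, so outputs are gone from the working copy as well. The `c` object only exists inside jupyter_notebook_config.py, so the registration line is shown as a comment here.

```python
def scrub_output_pre_save(model, **kwargs):
    """Clear outputs and execution counts before a notebook is saved."""
    # Only act on notebooks, not on other files the server saves.
    if model["type"] != "notebook":
        return
    # Only act on nbformat 4, where cells live directly under "cells".
    if model["content"]["nbformat"] != 4:
        return
    for cell in model["content"]["cells"]:
        if cell["cell_type"] != "code":
            continue
        cell["outputs"] = []
        cell["execution_count"] = None


# In jupyter_notebook_config.py, register the hook with:
# c.FileContentsManager.pre_save_hook = scrub_output_pre_save
```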

@ChristopherBrix

Thank you very much for this great script!
Would you mind specifying a Licence?

@amit1rrr

amit1rrr commented Apr 11, 2020

People interested in making Jupyter notebooks play nicely with git / GitHub might like to check out ReviewNB. It's a code review tool for Jupyter notebooks on GitHub.

Disclaimer: I built ReviewNB.
