@MilesCranmer
Last active February 22, 2024 23:17
Accurate word count changes in git, useful for tracking changes to a paper in Overleaf

Words changed in a git repo

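This should work for any git repository, but I use it to track the number of words changed in a (LaTeX) paper whose version history is kept in git (which Overleaf uses by default).

Counting changed words is tricky: a line-based diff marks a whole line as changed when a single word is edited, and text that is merely moved shows up as both a deletion and an addition.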

Show the added words, deleted words, and words on duplicated lines for every commit since 6am, i.e. roughly the current working day (bash):

# For each commit since 6am on master, print: added words, deleted words,
# and words on duplicated lines. sed drops the last (oldest) commit listed.
for sha in $(git rev-list --since="6am" master | sed -e '$ d'); do
    echo $(git diff --word-diff=porcelain "$sha~1..$sha"|grep -e"^+[^+]"|wc -w|xargs),\
         $(git diff --word-diff=porcelain "$sha~1..$sha"|grep -e"^-[^-]"|wc -w|xargs),\
         $(git diff "$sha~1..$sha"|grep -e"^+[^+]" -e"^-[^-]"|sed -e's/.//'|sort|uniq -d|wc -w|xargs)
done

Since we sometimes move large amounts of text, the count of words on duplicated lines flags words that only register as changes because text was moved around. If that count rivals the added and deleted counts, the commit is probably just a move commit.
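
To see why the third number flags moves, here is a minimal sketch that runs the same grep, sed, sort, uniq -d pipeline on a hand-written diff instead of real git diff output. The function name count_duplicate_words and the sample diff are made up for illustration.

```shell
# Hypothetical helper: reads a unified diff on stdin and prints the number of
# words on changed lines that occur more than once (e.g. once as "-", once as "+").
count_duplicate_words() {
    # 1. keep +/- lines but skip the "+++"/"---" file headers
    # 2. strip the leading + or -
    # 3. sort, then keep one copy of each line that appears more than once
    # 4. count the words; xargs trims any padding from wc
    grep -e "^+[^+]" -e "^-[^-]" |
        sed -e 's/.//' |
        sort | uniq -d |
        wc -w | xargs
}

# Made-up diff: one sentence is moved, one sentence is genuinely new.
sample_diff='--- a/paper.tex
+++ b/paper.tex
-We thank the referee for comments.
 Some unchanged context.
+We thank the referee for comments.
+This sentence is new.'

printf '%s\n' "$sample_diff" | count_duplicate_words   # prints 6
```

Only the moved sentence is counted, since it is the only line present on both sides of the diff. Note that two identical added lines (say, repeated LaTeX boilerplate) would be counted the same way, so the number is a heuristic rather than an exact measure of moved text.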

Assuming that in a "move commit" the words on duplicated lines amount to more than 80% of the added words, the following code shows the total number of edited words in a day. Edit the --since option at the top to cover a different range (e.g., --since="10 days ago").

total=0
# Walk every commit since 6am on master; sed drops the last (oldest) one listed.
for sha in $(git rev-list --since="6am" master | sed -e '$ d'); do
    added=$(git diff --word-diff=porcelain "$sha~1..$sha"|grep -e"^+[^+]"|wc -w|xargs)
    deleted=$(git diff --word-diff=porcelain "$sha~1..$sha"|grep -e"^-[^-]"|wc -w|xargs)
    duplicated=$(git diff "$sha~1..$sha"|grep -e"^+[^+]" -e"^-[^-]"|sed -e's/.//'|sort|uniq -d|wc -w|xargs)
    if [ "$added" -eq 0 ]; then
        # Nothing added: count only the deletions (and avoid dividing by zero below).
        changed=$deleted
    elif [ "$(echo "$duplicated/$added > 0.8" | bc -l)" -eq 1 ]; then
        # Duplicated words exceed 80% of added words: treat as a move commit.
        changed=0
    else
        changed=$((added+deleted))
    fi
    total=$((total+changed))
    echo "added: $added, deleted: $deleted, duplicated: $duplicated, changes counted: $changed"
done
echo "Total changed: $total"
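
The branching rule in the script above can be pulled out into a small function, which makes the 80% threshold easy to try on its own. This is a sketch, not the script's exact code: the name count_changed is made up, and it swaps the bc comparison for the equivalent integer test 10 * duplicated > 8 * added, so bc is not required.

```shell
# Hypothetical helper: prints how many words to count for one commit,
# given its added, deleted, and duplicated word counts.
count_changed() {
    local added=$1 deleted=$2 duplicated=$3
    if [ "$added" -eq 0 ]; then
        # Pure deletion: count the deleted words.
        echo "$deleted"
    elif [ $((duplicated * 10)) -gt $((added * 8)) ]; then
        # Same as duplicated/added > 0.8, without division: likely a move.
        echo 0
    else
        echo $((added + deleted))
    fi
}

count_changed 120 30 10     # ordinary edit      -> 150
count_changed 500 480 450   # mostly moved text  -> 0
count_changed 0 42 0        # deletion only      -> 42
```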

If you are using Overleaf, it should auto-commit frequently enough that this works.

Outside of Overleaf, you should commit before and after you move large amounts of text so that the word count changes are tracked properly.

@MilesCranmer
Author

@WPennock right, this would include the .bib files, good point.

I think you might be able to do this?

git diff --word-diff=porcelain $sha~1..$sha -- '*.tex'

rather than the existing git diff calls used in my snippet?
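
A minimal sketch of how that pathspec could be combined with the word counting, assuming the porcelain output keeps the same shape when restricted to '*.tex'. The helper names count_marked_words and tex_words_added, and the sample diff, are hypothetical.

```shell
# Hypothetical helper: reads `git diff --word-diff=porcelain` output on stdin
# and counts the words on lines that start with the given marker (+ or -).
count_marked_words() {
    local marker=$1
    # "^+[^+]" keeps added text but skips the "+++ b/file" header;
    # "^-[^-]" does the same for deletions and the "--- a/file" header.
    grep -e "^${marker}[^${marker}]" | wc -w | xargs
}

# Sketch only: words added to .tex files in one commit.
# Defined but not called here, because it needs a real git repository.
tex_words_added() {
    git diff --word-diff=porcelain "$1~1..$1" -- '*.tex' | count_marked_words +
}

# Made-up porcelain output: "simple" is replaced by "fast and accurate".
sample_porcelain='diff --git a/paper.tex b/paper.tex
--- a/paper.tex
+++ b/paper.tex
@@ -1 +1 @@
 We present a
-simple
+fast and accurate
 method.
~'

printf '%s\n' "$sample_porcelain" | count_marked_words +   # prints 3
printf '%s\n' "$sample_porcelain" | count_marked_words -   # prints 1
```

Inside a repository, tex_words_added "$sha" would then report the added-word count for .tex files only, leaving .bib changes out of the total.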

@WPennock

Thank you very much! I tried this and I got numbers in the hundreds instead of thousands, so I believe this makes the correct distinction. I incorporated this into a shell script for repeated use, and I really appreciate you contributing this.
