@MilesCranmer
Last active February 22, 2024 23:17
Accurate word count changes in git, useful for tracking changes on a paper in overleaf

Words changed in a git repo

This should work generally, but I use it to track the number of words changed in a (LaTeX) paper whose version history lives in git (which Overleaf uses by default).

Counting changed words accurately is tricky; for example, text that is merely moved shows up as both added and deleted.

Show the added words, deleted words, and words on duplicate lines for every commit since 6am (bash):

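# For each commit since 6am (sed drops the last, i.e. oldest, commit listed), print:
#   added words, deleted words, words on duplicated lines
for sha in $(git rev-list --since="6am" master | sed -e '$ d'); do
    echo "$(git diff --word-diff=porcelain "$sha~1..$sha" | grep -e '^+[^+]' | wc -w | xargs)," \
         "$(git diff --word-diff=porcelain "$sha~1..$sha" | grep -e '^-[^-]' | wc -w | xargs)," \
         "$(git diff "$sha~1..$sha" | grep -e '^+[^+]' -e '^-[^-]' | sed -e 's/.//' | sort | uniq -d | wc -w | xargs)"
done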

Since we sometimes move massive amounts of text, counting the words on duplicate lines helps flag words that only come from moving things around. If the number of words on duplicate lines rivals the number of added and removed words, it's probably just a move commit.
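To make the duplicate-line idea concrete, here is a minimal sketch that runs the same `sort | uniq -d` pipeline on a hand-written diff instead of a real commit. The function name `dup_words` and the sample diff are made up for illustration; only the pipeline itself comes from the snippet above.

```shell
# Count the words on lines that occur more than once among the added and
# deleted lines of a unified diff read from stdin.
dup_words() {
    grep -e '^+[^+]' -e '^-[^-]' |   # keep changed lines, skip the +++/--- headers
        sed -e 's/^.//' |            # strip the leading + or -
        sort | uniq -d |             # keep one copy of each repeated line
        wc -w | xargs                # count words, trim wc's padding
}

# Hypothetical diff: one sentence moved, one sentence genuinely added.
sample_diff='--- a/paper.tex
+++ b/paper.tex
-We thank the reviewers for comments.
+A new sentence here.
+We thank the reviewers for comments.'

printf '%s\n' "$sample_diff" | dup_words   # prints 6
```

The moved sentence appears once with `-` and once with `+`, so after stripping the markers it is the only repeated line, and its six words are what gets reported.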

Assuming that in a "move commit" the words on duplicated lines exceed 80% of the added words, the following code should show you the total number of edited words since 6am, leaving move commits out of the total. Edit the --since option in the rev-list call to get it for different ranges (e.g., --since="10 days ago").

total=0
for sha in $(git rev-list --since="6am" master | sed -e '$ d'); do
    # Word counts for this commit relative to its parent.
    added=$(git diff --word-diff=porcelain "$sha~1..$sha" | grep -e '^+[^+]' | wc -w | xargs)
    deleted=$(git diff --word-diff=porcelain "$sha~1..$sha" | grep -e '^-[^-]' | wc -w | xargs)
    duplicated=$(git diff "$sha~1..$sha" | grep -e '^+[^+]' -e '^-[^-]' | sed -e 's/.//' | sort | uniq -d | wc -w | xargs)
    if [ "$added" -eq 0 ]; then
        # Nothing added: count the deletions (this also avoids dividing by zero below).
        changed=$deleted
        total=$((total+changed))
    elif [ "$(echo "$duplicated/$added > 0.8" | bc -l)" -eq 1 ]; then
        # Mostly duplicated lines: treat as a move commit and count nothing.
        changed=0
    else
        changed=$((added+deleted))
        total=$((total+changed))
    fi
    echo "added: $added, deleted: $deleted, duplicated: $duplicated, changes counted: $changed"
done
echo "Total changed: $total"
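The three-way decision in the loop can also be tried in isolation. The sketch below is not part of the original snippet: the function name `counted_words` is hypothetical, and it swaps the `bc` comparison for integer arithmetic (`duplicated * 10 > added * 8`), which should be equivalent to `duplicated/added > 0.8` for non-negative counts.

```shell
# Decide how many words a single commit contributes to the total.
# Arguments: added, deleted, duplicated (word counts for that commit).
counted_words() {
    local added=$1 deleted=$2 duplicated=$3
    if [ "$added" -eq 0 ]; then
        # Deletions only: count them as-is.
        echo "$deleted"
    elif [ $((duplicated * 10)) -gt $((added * 8)) ]; then
        # More than 80% of the added words are duplicated: move commit.
        echo 0
    else
        echo $((added + deleted))
    fi
}

counted_words 120 30 10     # ordinary edit      -> 150
counted_words 500 480 450   # mostly moved text  -> 0
counted_words 0 42 0        # deletions only     -> 42
```

Keeping the rule in one small function makes it easy to experiment with a different cutoff than 80% without touching the git plumbing.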

If you are using Overleaf, it should auto-commit frequently enough that this works.

Outside of Overleaf, you should commit before and after you move large amounts of text so that you can track proper word count changes in a file.

@Merovex

Merovex commented Aug 19, 2020

Very creative. I'm working on a book/paper compiler using GitHub actions and was looking for a word count alternative based on commits. I think this may be the best example of how it's done. It doesn't fit my use case since I'm only counting identified directories within a repo. But, it did a fabulous job for the use case it targets.

Best of luck on the PhD.

@kortina

kortina commented Mar 30, 2021

I kind of want something like this for my journal and blog repos, where I could see stats like new documents per week, words written per day or week, graphed over time.

Either of y'all have ideas on tools for that?

@WPennock

WPennock commented Feb 21, 2024

@MilesCranmer thank you for putting this together! Pardon my ignorance, but I got a high word count, which I think may mean the search includes things like modifying .bib files. Is there a way to ensure it only reports changes within the main .tex file and not the whole repository?

@MilesCranmer
Author

@WPennock right, this would include the .bib files, good point.

I think you might be able to do this?

git diff --word-diff=porcelain $sha~1..$sha -- '*.tex'

rather than the existing git diff calls used in my snippet?

@WPennock

Thank you very much! I tried this and I got numbers in the hundreds instead of thousands, so I believe this makes the correct distinction. I incorporated this into a shell script for repeated use, and I really appreciate you contributing this.
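For anyone wanting a reusable version, a script along those lines might look like the sketch below. It is an illustration, not the script mentioned above: the function name `words_changed` and its argument order are assumptions, it walks `HEAD` rather than `master`, and it skips a root commit explicitly instead of dropping the last line with `sed`. It leaves out the move-commit heuristic and only sums added and deleted words.

```shell
# words_changed SINCE [PATHSPEC...]
# Sum the added and deleted words over all commits newer than SINCE,
# optionally restricted to the given pathspecs (e.g. '*.tex').
words_changed() {
    local since=$1; shift
    local added=0 deleted=0 sha a d
    for sha in $(git rev-list --since="$since" HEAD); do
        # A root commit has no parent to diff against, so skip it.
        git rev-parse --verify --quiet "$sha~1" >/dev/null 2>&1 || continue
        # "|| true" keeps the pipeline happy when grep finds no matching lines.
        a=$(git diff --word-diff=porcelain "$sha~1..$sha" -- "$@" | { grep -e '^+[^+]' || true; } | wc -w)
        d=$(git diff --word-diff=porcelain "$sha~1..$sha" -- "$@" | { grep -e '^-[^-]' || true; } | wc -w)
        added=$((added + a))
        deleted=$((deleted + d))
    done
    echo "added: $added, deleted: $deleted"
}
```

Called as `words_changed "10 days ago" '*.tex'` from inside the repository, it should ignore changes to .bib files; with no pathspec it covers the whole repository.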
