@atrisovic
Last active March 3, 2022 15:35
Rewriting git history for data_requests

What happened

Beneficiary ID numbers were committed to a private GitHub repository, in the following directory:

data_requests/request_projects/medicaid_duplicate_check_2019_09_27

The files were created or changed in the following 3 commits:

89e3955  update medicaid duplicate check  (izahn, Sep 28, 2019)
f4a209d  include markdown version         (izahn, Sep 27, 2019)
43f92e9  medicaid check                   (izahn, Sep 27, 2019)

Several branches also contain these commits, so the sensitive files have to be removed from every one of them:

git branch --contains 89e3955a
  dec2019_medicaid_platform_cvd
  dec2019_pdez_cvd_merge
  feb2020_burrows_fl_county_hosps
  feb2020_nih_pm_maps
  jan2020_ashkan_aggregate_health
  jan2020_medicaid_dual_admissions
  jan2020_windows_file_conversion
* master
  nov2019_check_cms_crosswalk
  nov2019_epa_percentiles
  nov2019_seasonal_temperature

We need to check whether medicaid_check.md differs between master and any of these branches, or whether it is identical everywhere:

# The pathspec is relative to the current directory.
for var in dec2019_medicaid_platform_cvd dec2019_pdez_cvd_merge feb2020_burrows_fl_county_hosps \
    feb2020_nih_pm_maps jan2020_ashkan_aggregate_health jan2020_medicaid_dual_admissions \
    jan2020_windows_file_conversion nov2019_check_cms_crosswalk nov2019_epa_percentiles \
    nov2019_seasonal_temperature
do
    # No diff output means the file is identical on master and $var
    git diff "master..$var" -- medicaid_check.md
    echo "Check for $var"
done

All branches contain the same directory with the same files, which is good.
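
The loop above compares a single file. A minimal sketch of a check that covers the whole directory is to print the tree object ID of the directory on each branch, since identical IDs mean identical contents. The function name dir_tree_ids is hypothetical and not part of the original procedure; the directory and branch names are passed as arguments.

```shell
# Print "<tree id>  <branch>" for a directory on each given branch.
# Identical IDs mean the directory contents are identical.
dir_tree_ids() {
    dir=$1; shift
    for branch in "$@"; do
        # <rev>:<path> names the tree (or blob) stored at that path in <rev>
        id=$(git rev-parse --verify --quiet "$branch:$dir" 2>/dev/null) || id="(missing)"
        printf '%s  %s\n' "$id" "$branch"
    done
}
```

For example, run from the repository root, `dir_tree_ids request_projects/medicaid_duplicate_check_2019_09_27 master nov2019_epa_percentiles | cut -d' ' -f1 | sort -u` prints a single line when the directory is the same on both branches.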

Strategy

Remove the subdirectory data_requests/request_projects/medicaid_duplicate_check_2019_09_27, which contains the sensitive data, from every branch and from the history. On the master branch, the subdirectory and its files are also removed, but are then reintroduced with the sensitive data redacted.

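The official Git documentation recommends git filter-repo over git filter-branch for rewriting history, so we use it to remove the directory from the repository and its history. The path is given relative to the repository root: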

git filter-repo --path request_projects/medicaid_duplicate_check_2019_09_27/ --invert-paths
Parsed 402 commits
New history written in 1.54 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
HEAD is now at 463fa36 aggregate kate weinberger data
Enumerating objects: 2844, done.
Counting objects: 100% (2844/2844), done.
Delta compression using up to 8 threads
Compressing objects: 100% (1327/1327), done.
Writing objects: 100% (2844/2844), done.
Total 2844 (delta 1417), reused 2835 (delta 1408)
Completely finished after 2.90 seconds.
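
One way to confirm that the rewrite worked is to ask git log, across all refs, for any commit that still touches the removed path. This is a sketch rather than the check used in the original procedure, and the function name path_absent_from_history is hypothetical.

```shell
# Exit 0 when no commit reachable from any ref touches the given path.
path_absent_from_history() {
    # --all walks every ref; the pathspec limits output to commits touching it
    [ -z "$(git log --all --oneline -- "$1" 2>/dev/null)" ]
}
```

For example, `path_absent_from_history request_projects/medicaid_duplicate_check_2019_09_27 && echo "clean"` prints "clean" only if the directory no longer appears anywhere in the history of the local clone.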

The directory data_requests/request_projects/medicaid_duplicate_check_2019_09_27 and its contents are now gone from the repository and its history. We force-push the rewritten branches to the remote:

for var in dec2019_pdez_cvd_merge dec2019_medicaid_platform_cvd feb2020_burrows_fl_county_hosps \
    feb2020_nih_pm_maps jan2020_ashkan_aggregate_health jan2020_medicaid_dual_admissions \
    jan2020_windows_file_conversion nov2019_check_cms_crosswalk nov2019_epa_percentiles \
    nov2019_seasonal_temperature
do
    git checkout "$var"
    # Force is required because the rewritten history no longer matches the remote
    git push -f origin "$var"
    echo "Pushed $var"
done

...
Switched to branch 'nov2019_epa_percentiles'
Total 0 (delta 0), reused 0 (delta 0)
To https://github.com/NSAPH/data_requests
 + 001be69...914dc44 nov2019_epa_percentiles -> nov2019_epa_percentiles (forced update)
Pushed nov2019_epa_percentiles
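
Checking out each branch is not strictly needed for pushing it. As a sketch of an alternative, an explicit refspec pushes a local branch regardless of what is currently checked out. The function name force_push_branches is hypothetical. Also note that git filter-repo removes the origin remote by default after rewriting, so the remote may need to be added again with git remote add before any push.

```shell
# Force-push the named local branches to a remote without checking them out.
force_push_branches() {
    remote=$1; shift
    for branch in "$@"; do
        # The explicit refspec pushes the local branch independent of HEAD
        git push --quiet --force "$remote" "refs/heads/$branch:refs/heads/$branch" || return 1
        echo "Pushed $branch"
    done
}
```

For example, `force_push_branches origin nov2019_epa_percentiles nov2019_seasonal_temperature` pushes both branches and stops at the first failure.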

Now we reintroduce the files with the sensitive data redacted and commit them on top of the master branch:

git add medicaid_check.md 
git add medicaid_check.Rmd 
git commit --author="Ista Zahn <izahn@hsph.harvard.edu>" -m "add medicaid check"
git push -f origin master

Finally, we delete the local clone, which is run from its parent directory:

rm -rf data_requests/

Edit: The same problem occurred in the following directory, and we repeated the procedure for it:

data_requests/request_projects/exp_covar_health_merge_2016_april2019/results/duplicate_qid/