Beneficiary ID numbers were shared in a private GitHub repository, in the following directory:
data_requests/request_projects/medicaid_duplicate_check_2019_09_27
The files were created or changed in the following 3 commits:
update medicaid duplicate check
izahn committed on Sep 28, 2019
89e3955
include markdown version
izahn committed on Sep 27, 2019
f4a209d
medicaid check
izahn committed on Sep 27, 2019
43f92e9
These files are also present in several branches, which is a problem because each of them has to be cleaned:
git branch --contains 89e3955a
dec2019_medicaid_platform_cvd
dec2019_pdez_cvd_merge
feb2020_burrows_fl_county_hosps
feb2020_nih_pm_maps
jan2020_ashkan_aggregate_health
jan2020_medicaid_dual_admissions
jan2020_windows_file_conversion
* master
nov2019_check_cms_crosswalk
nov2019_epa_percentiles
nov2019_seasonal_temperature
We need to check whether medicaid_check.md differs between master and any of these branches, or whether all copies are identical:
# Compare each branch against master; an empty diff means the file is identical.
for var in dec2019_medicaid_platform_cvd dec2019_pdez_cvd_merge feb2020_burrows_fl_county_hosps \
    feb2020_nih_pm_maps jan2020_ashkan_aggregate_health jan2020_medicaid_dual_admissions \
    jan2020_windows_file_conversion nov2019_check_cms_crosswalk nov2019_epa_percentiles \
    nov2019_seasonal_temperature
do
    git diff master.."$var" -- medicaid_check.md
    echo "Check for $var"
done
All branches contain the same directory with the same files, which is good.
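The same check can be made less noisy by listing only the branches whose copy differs. The following is a minimal sketch, not the procedure that was actually run: the function name `differing_branches` and its arguments are hypothetical, and it assumes it is called from inside the repository. It relies on `git for-each-ref --contains` to enumerate the branches and on the exit status of `git diff --quiet`.

```shell
#!/bin/sh
# Sketch (hypothetical helper): print every local branch that contains COMMIT
# and whose version of PATHSPEC differs from the one on BASE.
# Usage: differing_branches BASE PATHSPEC COMMIT
differing_branches() {
    base=$1
    pathspec=$2
    commit=$3
    # List local branches that contain the commit, one short name per line.
    git for-each-ref --contains "$commit" --format='%(refname:short)' refs/heads/ |
    while read -r branch; do
        # Skip the base branch itself.
        if [ "$branch" = "$base" ]; then
            continue
        fi
        # git diff --quiet exits non-zero when the two trees differ for PATHSPEC.
        if ! git diff --quiet "$base" "$branch" -- "$pathspec"; then
            echo "$branch"
        fi
    done
}
```

Called as `differing_branches master medicaid_check.md 89e3955a`, it prints nothing when every branch carries the same file as master.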
Remove the subdirectory data_requests/request_projects/medicaid_duplicate_check_2019_09_27, which contains sensitive data, from all branches other than master. On the master branch, the subdirectory and its files are also removed, but are then reintroduced with censored data.
The official Git documentation (the git filter-branch manual page) recommends git filter-repo for rewriting history. We use it to remove the directory from the repository and its history:
git filter-repo --path request_projects/medicaid_duplicate_check_2019_09_27/ --invert-paths
Parsed 402 commits
New history written in 1.54 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
HEAD is now at 463fa36 aggregate kate weinberger data
Enumerating objects: 2844, done.
Counting objects: 100% (2844/2844), done.
Delta compression using up to 8 threads
Compressing objects: 100% (1327/1327), done.
Writing objects: 100% (2844/2844), done.
Total 2844 (delta 1417), reused 2835 (delta 1408)
Completely finished after 2.90 seconds.
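One way to confirm the result of the rewrite is to ask git whether any commit still touches the removed path. This is a minimal sketch under assumptions: the helper name `path_in_history` is hypothetical, it must be run inside the repository, and it only inspects commits reachable from local refs.

```shell
#!/bin/sh
# Sketch (hypothetical helper): succeed (exit 0) if at least one commit
# reachable from any ref touches PATHSPEC, fail (exit 1) otherwise.
# Usage: path_in_history PATHSPEC
path_in_history() {
    # Ask for at most one matching commit hash; empty output means no commit matched.
    [ -n "$(git log --all -n 1 --format=%H -- "$1")" ]
}
```

After the rewrite, `path_in_history request_projects/medicaid_duplicate_check_2019_09_27/` would be expected to fail, meaning no remaining commit references the directory.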
The directory data_requests/request_projects/medicaid_duplicate_check_2019_09_27 and its contents are now gone from the repository and its history. We force-push the rewritten branches to the remote:
# Force-push each rewritten branch so the remote history is replaced.
for var in dec2019_pdez_cvd_merge dec2019_medicaid_platform_cvd feb2020_burrows_fl_county_hosps \
    feb2020_nih_pm_maps jan2020_ashkan_aggregate_health jan2020_medicaid_dual_admissions \
    jan2020_windows_file_conversion nov2019_check_cms_crosswalk nov2019_epa_percentiles \
    nov2019_seasonal_temperature
do
    git checkout "$var"
    git push -f origin "$var"
    echo "Pushed $var"
done
...
Switched to branch 'nov2019_epa_percentiles'
Total 0 (delta 0), reused 0 (delta 0)
To https://github.com/NSAPH/data_requests
+ 001be69...914dc44 nov2019_epa_percentiles -> nov2019_epa_percentiles (forced update)
Pushed nov2019_epa_percentiles
Now we reintroduce the files in censored form and commit them on top of the master branch:
git add medicaid_check.md
git add medicaid_check.Rmd
git commit --author="Ista Zahn <izahn@hsph.harvard.edu>" -m "add medicaid check"
git push -f origin master
Now we delete everything locally:
rm -rf data_requests/
Edit: The same happened in the following directory, and we repeated the procedure:
data_requests/request_projects/exp_covar_health_merge_2016_april2019/results/duplicate_qid/