@R-ohit-B-isht
Last active May 15, 2024 16:14

Playbook: Cleaning Git Repositories with BFG

Overview

This playbook outlines the steps to use BFG Repo-Cleaner (BFG) to rewrite Git repository history and remove sensitive data, such as passwords, credentials, or other private information that was accidentally committed.

What's Needed From User

  • URL of the Git repository to be cleaned
  • Name of the main branch (e.g., master, main, etc.)
  • (Optional) List of sensitive strings or patterns to be removed (e.g., API keys, passwords)

Procedure

  1. Ask for Sensitive Data

    • Ask the user if they have a list of sensitive strings or patterns that need to be removed from the repository.
    • If the user does not provide a list, proceed to the next step with a general cleanup.
  2. Clone the Repository

    • Clone the full repository to the local machine with git clone <repo_url>. Do not pass the --mirror flag.
  3. Install the BFG Tool

    • Install Java and jq: sudo apt -y install default-jre-headless jq
    • Look up the latest BFG version and export it as a shell variable: latestVersion=$(curl -s "https://search.maven.org/solrsearch/select?q=a:bfg" | jq -r '.response.docs[0].latestVersion'); export latestVersion
    • Download that release and save it as bfg.jar, the file name used in step 5: wget -O bfg.jar https://repo1.maven.org/maven2/com/madgag/bfg/$latestVersion/bfg-$latestVersion.jar
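
The download in step 3 can be wrapped in two small helpers, shown here as a minimal sketch. The function names bfg_jar_url and fetch_bfg are hypothetical and not part of BFG; the URLs are the ones given above. The jar is saved under the fixed name bfg.jar so that the java -jar bfg.jar command in step 5 finds it.

```shell
# Sketch: resolve and download the BFG jar under a fixed local name.
# bfg_jar_url and fetch_bfg are hypothetical helper names, not part of BFG.

# Build the Maven Central download URL for a given BFG version.
bfg_jar_url() {
  local version="$1"
  printf 'https://repo1.maven.org/maven2/com/madgag/bfg/%s/bfg-%s.jar\n' "$version" "$version"
}

# Look up the latest version and save the jar as bfg.jar in the current directory.
fetch_bfg() {
  local version
  version=$(curl -s "https://search.maven.org/solrsearch/select?q=a:bfg" \
    | jq -r '.response.docs[0].latestVersion')
  # Stop early if the lookup returned nothing usable.
  if [ -z "$version" ] || [ "$version" = "null" ]; then
    echo "could not determine latest BFG version" >&2
    return 1
  fi
  wget -O bfg.jar "$(bfg_jar_url "$version")"
}

# Usage (requires network access): fetch_bfg
```

Checking for an empty or "null" version avoids requesting a malformed URL when the search endpoint is unreachable or returns an unexpected document.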
  4. Identify Sensitive Data and create replace.txt

    • If the user provided sensitive strings or patterns in the initial prompt, search the full history for them: git log --all --full-history --pretty=format:"%H" | xargs git show --pretty=format:"" --stdin | grep -E '<user_provided_pattern>'
    • If the user did not provide any, search the full history for common credential keywords instead: git log --all --full-history --pretty=format:"%H" | xargs git show --pretty=format:"" --stdin | grep -Ei '(AWS|amazonaws|access|secret|password|passwd|api|token|credential|auth|oauth|bearer|encryption|client|private|cert|ssl|ssh|jwt|key|username|user|uname|email|mail|database|db|connection|conn|url|endpoint|config|cfg)'
    • Note down only the actual secret values (the key or password itself), not the names of the variables that hold them.
    • Create replace.txt containing the sensitive strings, one per line: echo -e '<sensitive_credential_string>' > replace.txt
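
As a hedged illustration of what replace.txt can look like with more than one entry: per the BFG documentation, each line is treated as a literal string and replaced with ***REMOVED*** unless it carries a regex: or glob: prefix, and ==> sets a different replacement. The secret values below are made-up placeholders, not real credentials.

```shell
# Sketch: write replace.txt with one expression per line.
# All values are invented placeholders for illustration only.
printf '%s\n' \
  'example-password-123' \
  'EXAMPLE-ACCESS-KEY-ID==>***KEY-REMOVED***' \
  'regex:token=[A-Za-z0-9]{16}' \
  > replace.txt

# Show what BFG will read.
cat replace.txt
```

printf '%s\n' is used here instead of echo -e because it writes each argument verbatim, so a secret that happens to contain a backslash is not altered on its way into the file.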
  5. Remove Sensitive Patterns with BFG command

    • Replace sensitive strings in the repository: java -jar bfg.jar --replace-text replace.txt
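
A small preflight wrapper can catch missing inputs before BFG is started. This is only a sketch: run_bfg is a hypothetical name, and the java invocation is the one from step 5 with the repository path passed explicitly as the final argument.

```shell
# Sketch: refuse to run BFG unless its inputs are in place.
# run_bfg is a hypothetical wrapper name, not part of BFG.
run_bfg() {
  local repo="${1:-.}" jar="${2:-bfg.jar}" rules="${3:-replace.txt}"
  if [ ! -d "$repo/.git" ]; then
    echo "not a Git working copy: $repo" >&2
    return 1
  fi
  if [ ! -f "$jar" ]; then
    echo "missing BFG jar: $jar" >&2
    return 1
  fi
  # -s is true only for a file that exists and is not empty.
  if [ ! -s "$rules" ]; then
    echo "missing or empty rules file: $rules" >&2
    return 1
  fi
  java -jar "$jar" --replace-text "$rules" "$repo"
}

# Usage: run_bfg /path/to/clone bfg.jar replace.txt
```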
  6. Verify the Change after running BFG command

    • For each entry in replace.txt, verify that it no longer appears anywhere in the history by re-running the search from step 4: git log --all --full-history --pretty=format:"%H" | xargs git show --pretty=format:"" --stdin | grep -E '<user_provided_pattern>'. No output means the string is gone.
    • If any sensitive data remains, re-run BFG with a corrected pattern.
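
The verification can be looped over replace.txt, sketched below under a few assumptions: check_history is a hypothetical helper, it handles only literal entries (lines prefixed with regex: or glob: are skipped), and it must be run from inside the clone. A hit may also come from the latest commit, whose contents BFG leaves untouched by default.

```shell
# Sketch: check every literal entry of replace.txt against the full history.
# check_history is a hypothetical helper name; run it from inside the clone.
check_history() {
  local rules="${1:-replace.txt}" leaks=0 n=0 entry
  while IFS= read -r entry || [ -n "$entry" ]; do
    n=$((n + 1))
    # Drop any '==>replacement' suffix so only the search string remains.
    entry="${entry%%==>*}"
    # Skip blank lines and non-literal expressions.
    case "$entry" in
      ''|regex:*|glob:*) continue ;;
    esac
    if git log --all --full-history --pretty=format:"%H" \
        | xargs git show --pretty=format:"" \
        | grep -qF -e "$entry"; then
      # Report the line number rather than the secret itself.
      echo "entry on line $n is still present in history" >&2
      leaks=$((leaks + 1))
    fi
  done < "$rules"
  [ "$leaks" -eq 0 ]
}

# Usage: check_history ../replace.txt && echo "history is clean"
```

The function returns a non-zero status when any entry is still found, so it can gate the later steps of the playbook.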
  7. Update the Repository

    • Move any files created for the cleanup (for example bfg.jar and replace.txt) outside the repository and make sure they are never committed.
    • After running BFG, expire the reflog and garbage-collect so the old objects are actually discarded: git reflog expire --expire=now --all && git gc --prune=now --aggressive
    • Create a new branch from the cleaned-up working tree: git checkout --orphan <new_cleaned_branch_name>
    • Commit on the new branch with the message "Cleaned sensitive data".
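
The four actions of step 7 can be chained as in the sketch below. finalize_cleanup and its two arguments (the new branch name and a scratch directory outside the repository) are hypothetical; the Git commands are the ones listed in this step. It assumes it is run from the root of the clone.

```shell
# Sketch: finish the cleanup from the root of the clone.
# finalize_cleanup is a hypothetical helper name.
finalize_cleanup() {
  local branch="$1" scratch="$2" f
  mkdir -p "$scratch"
  # Move helper files out of the repository so they cannot be committed.
  for f in bfg.jar replace.txt; do
    if [ -e "$f" ]; then
      mv "$f" "$scratch/"
    fi
  done
  # Drop reflog entries and prune the objects BFG made unreachable.
  git reflog expire --expire=now --all && git gc --prune=now --aggressive
  # Start a branch with no parent; the index still holds the current files.
  git checkout --orphan "$branch"
  git commit -q -m "Cleaned sensitive data"
}

# Usage: finalize_cleanup cleaned-main ../bfg-scratch
```

No git add is issued before the commit: after git checkout --orphan the index already contains the tracked files, and staging everything again could pick up stray untracked files.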
  8. Share the Clean Repository

    • Create a ZIP archive of the cleaned repository and share it via the messaging interface.

Specifications

  • The sensitive data has been removed from the repository history.
  • The cleaned repository is available for further use or sharing.
  • The original repository remains unchanged, and the cleaned repository can be used as a replacement.

Advice and Pointers

  • Use only the BFG tool for cleaning sensitive data from the repository.
  • By default BFG does not alter the contents of the latest commit, so sensitive values that are still used in the codebase at the latest commit may remain after running BFG. Ignore these.
  • Clone the repository only once in the beginning.
  • Do general cleanup if not provided with a list of sensitive strings or patterns in the initial prompt.
  • Always create a fresh clone of the repository when using the BFG tool to avoid potential conflicts or issues.
  • Never push anything to GitHub.
  • Always keep a backup of the original repository before running the BFG tool.