How to generate secure and reusable scrubbed database dumps

For a while I have wanted to set up reusable scrubbed database dumps like I had at a former employer. While I never learned the specifics of that setup, I believe the gist of it is the same. We take the following steps:

  1. Create scrubbed/pruned database within internal network
  2. Upload database to private yet externally accessible location (in our case S3)
  3. Download and install uploaded scrubbed/pruned database for development
    • Guarantees accurate yet non-identifiable information

Side note: We also store multiple versions and use a hardcoded file (i.e. latest-dump) as a pointer to the dump to use.
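
For illustration, that pointer is just a tiny S3 object whose body is the filename of the newest dump. A minimal sketch of how it resolves, assuming AWS CLI access to the bucket (the bucket name is hypothetical; the timestamped filename format comes from upload_db_dump.py below):

# Peek at the pointer (prints something like 20150101.120000.000000000.sql.gz)
aws s3 cp s3://my-dump-bucket/latest-dump -

# Fetch the dump it points at (stored gzip-encoded) and unpack it
latest="$(aws s3 cp s3://my-dump-bucket/latest-dump -)"
aws s3 cp "s3://my-dump-bucket/$latest" - | gunzip > scrubbed-db.sql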

The corresponding scripts are run like this:

# Within the internal network
# Create a locally scrubbed copy of our database
target_db="scrubbed-db"
./create_scrubbed_db.sh "$target_db"

# Upload the scrubbed database to S3
python upload_db_dump.py "$target_db"

# As a developer
# Download and use latest database from S3
# DEV: This script is custom to our setup (alembic and schema.sql) but the ideas/setup should be reusable
./copy_latest_dump.sh "development-db"
copy_latest_dump.sh

#!/usr/bin/env bash
# Exit on first error
set -e

# If there is no target db, complain and leave
target_db="$1"
if test "$target_db" = ""; then
  echo "\`target_db\` was not specified. Please specify the local name of the database you would like to copy the scrubbed data into." 1>&2
  echo "Usage: $0 [target_db]" 1>&2
  exit 1
fi

# Define local constant
dump_filename="scrubbed-db.sql"

# Output commands as they occur
set -x

# Drop the existing database
dropdb "$target_db" || true

# Create a new one based on `schema.sql`
# DEV: We use `schema.sql` to guarantee our test db and development db are in alignment on schema,
#   while our production and development db are in alignment on data (so they need to have the same schema)
createdb "$target_db"
psql "$target_db" < schema.sql

# If we don't have a local copy of the dump, download one
if ! test -f "$dump_filename"; then
  # DEV: We must use `gunzip` since S3 will always send back gzip encoded content
  # DEV: Output a marker SQL comment for knowing where our dump came from
  dump_url="$(python get_dump_url.py)"
  echo "-- Dumped from: $dump_url" > "$dump_filename"
  curl "$dump_url" | gunzip >> "$dump_filename"
else
  set +x
  echo "Using existing snapshot on disk. If you would like to download a new one, please run:" 1>&2
  echo "rm \"$dump_filename\"" 1>&2
  set -x
fi

# Hackishly determine the dump's alembic revision and upgrade to it
# DEV: We must partially upgrade so the schemas line up for the data we are about to load
#   (we might be in development and be a few migrations/columns ahead of production)
# Inside $dump_filename:
#   COPY alembic_version (version_num) FROM stdin;
#   abcdef123456
alembic_version="$(grep 'COPY alembic_version (version_num) FROM stdin;' "$dump_filename" -A 1 | tail -n 1)"
alembic -x db=development upgrade "$alembic_version"

# DEV: We need to drop the version row to prevent double inserts when the dump is loaded
psql "$target_db" -c "DELETE FROM alembic_version;"

# Install our db
cat "$dump_filename" | psql "$target_db"

# Run alembic to migrate to the latest version
alembic -x db=development upgrade head
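
A quick sanity check after the copy script finishes could look like this (the development-db name and -x db=development flag mirror the usage above; the check itself is only a suggestion, not part of the original scripts):

# The development copy should now report the latest (head) migration
alembic -x db=development current
# And `alembic_version` should hold exactly one row again after the delete/upgrade dance
psql "development-db" -c "SELECT * FROM alembic_version;"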
create_scrubbed_db.sh

#!/usr/bin/env bash
# Exit on first error
set -e

# If there is no target db, complain and leave
target_db="$1"
if test "$target_db" = ""; then
  echo "\`target_db\` was not specified. Please specify the local name of the database you would like to copy data into." 1>&2
  echo "Usage: $0 [target_db]" 1>&2
  exit 1
fi

# Output commands as they occur
set -x

# Resolve the URI for our source db (e.g. read from config files)
source_db="postgres://user:password@localhost:5432/dbname"
echo "$source_db"

# Drop the existing database
dropdb "$target_db" || true

# Create a new one with no schema
createdb "$target_db"

# For development, dump to disk and load from disk
# DEV: If there is any data that should be excluded, use `--exclude-table-data`
dump_command="pg_dump"
if test "$DEVELOPMENT" = "TRUE"; then
  # If we don't have a local copy of the dump, create one from the source db
  if ! test -f dump.sql; then
    $dump_command "$source_db" > dump.sql
  fi
  # Install our db
  cat dump.sql | psql "$target_db"
# Otherwise, pipe straight into the target db
else
  $dump_command "$source_db" | psql "$target_db"
fi

# DEV: If you have any scrubbing to perform, please use your custom scrubbing script here
# e.g. python scrub.py "$target_db"
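
The scrubbing itself is left to a project-specific step. As a rough sketch of what the pruning and anonymization could look like inside the script above (the table and column names are made up; `--exclude-table-data` is the real `pg_dump` flag referenced in the comments, which keeps table definitions but skips their rows):

# Prune bulky or sensitive tables entirely at dump time
dump_command="pg_dump --exclude-table-data=audit_logs --exclude-table-data=password_resets"

# Then anonymize whatever personal data remains in the copied database
psql "$target_db" <<'SQL'
UPDATE users
SET email = 'user' || id || '@example.com',
    full_name = 'User ' || id;
SQL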
get_dump_url.py

# Load in dependencies
import argparse
from boto.s3.connection import S3Connection
from boto.s3.key import Key

S3_BUCKET = '{{S3_BUCKET}}'
S3_ACCESS_KEY_ID = '{{S3_ACCESS_KEY_ID}}'
S3_SECRET_ACCESS_KEY = '{{S3_SECRET_ACCESS_KEY}}'
LATEST_FILENAME = 'latest-dump'
URL_TTL = 60 * 1  # 1 minute


# Define our script
def main(name=None):
    """
    Get a URL for downloading a database dump from S3

    :param str name: Name of snapshot to download
    """
    # Rename our CLI arguments to better variables
    source_filename = name

    # Generate our S3 connections
    s3_source_conn = S3Connection(
        S3_ACCESS_KEY_ID,
        S3_SECRET_ACCESS_KEY,
    )
    s3_source_bucket = s3_source_conn.get_bucket(S3_BUCKET)

    # If we don't have a file name yet, resolve it from S3
    # http://boto.readthedocs.org/en/latest/ref/s3.html#boto.s3.key.Key.get_contents_as_string
    if source_filename is None:
        s3_latest_key = Key(bucket=s3_source_bucket, name=LATEST_FILENAME)
        source_filename = s3_latest_key.get_contents_as_string()

    # Generate our S3 target
    s3_source_key = Key(bucket=s3_source_bucket, name=source_filename)
    print(s3_source_key.generate_url(expires_in=URL_TTL))


if __name__ == '__main__':
    # Set up our arguments
    parser = argparse.ArgumentParser()
    parser.add_argument('name', nargs='?', help='Optional name of the dump to use')

    # Parse them and run it through `main`
    args = parser.parse_args()
    main(**args.__dict__)
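
get_dump_url.py takes either no argument (resolve the newest dump through the latest-dump pointer) or an explicit dump name; the dated filename below is hypothetical but follows the format upload_db_dump.py generates:

# Presigned URL for the newest dump
python get_dump_url.py

# Presigned URL for a specific, older dump
python get_dump_url.py "20150101.120000.000000000.sql.gz"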
upload_db_dump.py

# Load in dependencies
import argparse
import datetime
import subprocess
import sys

from boto.s3.connection import S3Connection
from boto.s3.key import Key

S3_BUCKET = '{{S3_BUCKET}}'
S3_ACCESS_KEY_ID = '{{S3_ACCESS_KEY_ID}}'
S3_SECRET_ACCESS_KEY = '{{S3_SECRET_ACCESS_KEY}}'
LATEST_FILENAME = 'latest-dump'


def upload_progress_cb(bytes_transmitted, bytes_total):
    """Helper to output progress of S3 upload"""
    print('Bytes uploaded: {bytes_transmitted}/{bytes_total}'
          .format(bytes_transmitted=bytes_transmitted, bytes_total=bytes_total))


# Define our script
def main(db_conninfo):
    """
    Upload database dump to S3

    # DEV: We wanted to write this script via `bash` but `boto's s3put` doesn't support stdin

    :param str db_conninfo: Connection string to use for connecting to PostgreSQL
        e.g. `postgresql://localhost:5432/mydb`
    """
    # Create an S3 key to upload our dump into
    # Use same format as our releases (except with `000` right padding for same precision)
    target_filename = datetime.datetime.utcnow().strftime('%Y%m%d.%H%M%S.%f000.sql.gz')
    s3_target_conn = S3Connection(
        S3_ACCESS_KEY_ID,
        S3_SECRET_ACCESS_KEY,
    )
    s3_target_bucket = s3_target_conn.get_bucket(S3_BUCKET)
    s3_target_key = Key(bucket=s3_target_bucket, name=target_filename)
    s3_latest_key = Key(bucket=s3_target_bucket, name=LATEST_FILENAME)

    # Create a child process to dump our database info
    # DEV: We only dump data so we can still use `schema.sql` when loading our dumps
    # DEV: We pipe `stdin/stderr` to this program's `stdin/stderr` to hear any issues from `pg_dump`
    #   https://docs.python.org/2/library/subprocess.html
    # DEV: We pipe to `gzip` to minimize amount of data being loaded into Python's memory
    print('Buffering `pg_dump --data-only` into gzip')
    pg_dump_child = subprocess.Popen(['pg_dump', '--data-only', db_conninfo],
                                     stdin=None, stdout=subprocess.PIPE, stderr=sys.stderr)
    gzip_child = subprocess.Popen(['gzip'],
                                  stdin=pg_dump_child.stdout, stdout=subprocess.PIPE, stderr=sys.stderr)
    gzip_content = gzip_child.stdout.read()

    # Wait for `pg_dump` to terminate and verify it was a good exit
    exit_code = pg_dump_child.wait()
    if exit_code != 0:
        raise RuntimeError('Expected exit code "0" from `pg_dump` but it was "{code}"'.format(code=exit_code))

    # Wait for `gzip` to terminate and verify it was a good exit
    exit_code = gzip_child.wait()
    if exit_code != 0:
        raise RuntimeError('Expected exit code "0" from `gzip` but it was "{code}"'.format(code=exit_code))

    # Notify user of content we generated
    print('Content buffered and gzipped ({bytes} bytes)'.format(bytes=len(gzip_content)))

    # Upload the dump to S3
    # http://boto.readthedocs.org/en/latest/ref/s3.html#boto.s3.key.Key.set_contents_from_file
    print('Uploading data to S3 "{bucket}/{name}"'.format(bucket=S3_BUCKET, name=target_filename))
    s3_target_key.set_contents_from_string(gzip_content, headers={
        'Content-Type': 'text/plain',
        'Content-Encoding': 'gzip',
    }, cb=upload_progress_cb)
    print('Upload complete!')

    # Update the latest dump file on S3 to point to our new file
    print('Marking "{bucket}/{name}" as latest dump via "{bucket}/{latest_name}"'
          .format(bucket=S3_BUCKET, name=target_filename, latest_name=LATEST_FILENAME))
    s3_latest_key.set_contents_from_string(target_filename)
    print('Marking complete!')


if __name__ == '__main__':
    # Set up our arguments
    parser = argparse.ArgumentParser()
    parser.add_argument(
        'db_conninfo', help='Connection URI for source PostgreSQL (e.g. `postgresql://localhost:5432/mydb`)')

    # Parse them and run it through `main`
    args = parser.parse_args()
    main(**args.__dict__)
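
The {{S3_BUCKET}}, {{S3_ACCESS_KEY_ID}}, and {{S3_SECRET_ACCESS_KEY}} values are template placeholders to fill in before running. A typical invocation against the scrubbed copy created earlier would be:

# `pg_dump` accepts a plain database name or a full connection URI as `db_conninfo`
python upload_db_dump.py "scrubbed-db"
# python upload_db_dump.py "postgresql://localhost:5432/scrubbed-db"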