How to generate secure and reusable scrubbed database dumps

For a while I have wanted to set up reusable scrubbed database dumps like I had at a former employer. While I never learned the specifics of that setup, I believe the gist of it is the same. We take the following steps:

  1. Create scrubbed/pruned database within internal network
  2. Upload database to private yet externally accessible location (in our case S3)
  3. Download and install uploaded scrubbed/pruned database for development
    • Guarantees accurate yet non-identifiable information

Side note: We also store multiple versions and use a hardcoded file (i.e. latest-dump) as a pointer to the dump to use.
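
For illustration, that pointer is just a tiny S3 object whose body is the filename of the newest dump. A minimal sketch of how it resolves, assuming AWS CLI access to the bucket (the bucket name is hypothetical; the timestamped filename format comes from upload_db_dump.py below):

# Peek at the pointer (prints something like 20150101.120000.000000000.sql.gz)
aws s3 cp s3://my-dump-bucket/latest-dump -

# Fetch the dump it points at (stored gzip-encoded) and unpack it
latest="$(aws s3 cp s3://my-dump-bucket/latest-dump -)"
aws s3 cp "s3://my-dump-bucket/$latest" - | gunzip > scrubbed-db.sql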

The corresponding scripts are run like this:

# Within the internal network
# Create a locally scrubbed copy of our database
target_db="scrubbed-db"
./create_scrubbed_db.sh "$target_db"

# Upload the scrubbed database to S3
python upload_db_dump.py "$target_db"

# As a developer
# Download and use latest database from S3
# DEV: This script is custom to our setup (alembic and schema.sql) but the ideas/setup should be reusable
./copy_latest_dump.sh "development-db"
copy_latest_dump.sh

#!/usr/bin/env bash
# Exit on first error
set -e

# If there is no target db, complain and leave
target_db="$1"
if test "$target_db" = ""; then
  echo "\`target_db\` was not specified. Please specify the local name of the database you would like to copy the scrubbed data into." 1>&2
  echo "Usage: $0 [target_db]" 1>&2
  exit 1
fi

# Define local constant
dump_filename="scrubbed-db.sql"

# Output commands as they occur
set -x

# Drop the existing database
dropdb "$target_db" || true

# Create a new one based on `schema.sql`
# DEV: We use `schema.sql` to guarantee our test db and development db are in alignment on schema,
#   while our production and development db are in alignment on data (so they need to have the same schema)
createdb "$target_db"
psql "$target_db" < schema.sql

# If we don't have a local copy of the dump, download one
if ! test -f "$dump_filename"; then
  # DEV: We must use `gunzip` since S3 will always send back gzip encoded content
  # DEV: Output a marker SQL comment for knowing where our dump came from
  dump_url="$(python get_dump_url.py)"
  echo "-- Dumped from: $dump_url" > "$dump_filename"
  curl "$dump_url" | gunzip >> "$dump_filename"
else
  set +x
  echo "Using existing snapshot on disk. If you would like to download a new one, please run:" 1>&2
  echo "rm \"$dump_filename\"" 1>&2
  set -x
fi

# Hackishly determine the dump's alembic revision and upgrade to it
# DEV: We must partially upgrade so the schemas line up for the data we are about to load
#   (we might be in development and be a few migrations/columns ahead of production)
# Inside $dump_filename:
#   COPY alembic_version (version_num) FROM stdin;
#   abcdef123456
alembic_version="$(grep 'COPY alembic_version (version_num) FROM stdin;' "$dump_filename" -A 1 | tail -n 1)"
alembic -x db=development upgrade "$alembic_version"

# DEV: We need to drop the version row to prevent double inserts when the dump is loaded
psql "$target_db" -c "DELETE FROM alembic_version;"

# Install our db
cat "$dump_filename" | psql "$target_db"

# Run alembic to migrate to the latest version
alembic -x db=development upgrade head
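
A quick sanity check after the copy script finishes could look like this (the development-db name and -x db=development flag mirror the usage above; the check itself is only a suggestion, not part of the original scripts):

# The development copy should now report the latest (head) migration
alembic -x db=development current
# And `alembic_version` should hold exactly one row again after the delete/upgrade dance
psql "development-db" -c "SELECT * FROM alembic_version;"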
create_scrubbed_db.sh

#!/usr/bin/env bash
# Exit on first error
set -e

# If there is no target db, complain and leave
target_db="$1"
if test "$target_db" = ""; then
  echo "\`target_db\` was not specified. Please specify the local name of the database you would like to copy data into." 1>&2
  echo "Usage: $0 [target_db]" 1>&2
  exit 1
fi

# Output commands as they occur
set -x

# Resolve the URI for our source db (e.g. read from config files)
source_db="postgres://user:password@localhost:5432/dbname"
echo "$source_db"

# Drop the existing database
dropdb "$target_db" || true

# Create a new one with no schema
createdb "$target_db"

# For development, dump to disk and load from disk
# DEV: If there is any data that should be excluded, use `--exclude-table-data`
dump_command="pg_dump"
if test "$DEVELOPMENT" = "TRUE"; then
  # If we don't have a local copy of the dump, create one from the source db
  if ! test -f dump.sql; then
    $dump_command "$source_db" > dump.sql
  fi
  # Install our db
  cat dump.sql | psql "$target_db"
# Otherwise, pipe straight into the target db
else
  $dump_command "$source_db" | psql "$target_db"
fi

# DEV: If you have any scrubbing to perform, please use your custom scrubbing script here
# e.g. python scrub.py "$target_db"
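
The scrubbing itself is left to a project-specific step. As a rough sketch of what the pruning and anonymization could look like inside the script above (the table and column names are made up; `--exclude-table-data` is the real `pg_dump` flag referenced in the comments, which keeps table definitions but skips their rows):

# Prune bulky or sensitive tables entirely at dump time
dump_command="pg_dump --exclude-table-data=audit_logs --exclude-table-data=password_resets"

# Then anonymize whatever personal data remains in the copied database
psql "$target_db" <<'SQL'
UPDATE users
SET email = 'user' || id || '@example.com',
    full_name = 'User ' || id;
SQL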
get_dump_url.py

# Load in dependencies
import argparse
from boto.s3.connection import S3Connection
from boto.s3.key import Key

S3_BUCKET = '{{S3_BUCKET}}'
S3_ACCESS_KEY_ID = '{{S3_ACCESS_KEY_ID}}'
S3_SECRET_ACCESS_KEY = '{{S3_SECRET_ACCESS_KEY}}'
LATEST_FILENAME = 'latest-dump'
URL_TTL = 60 * 1  # 1 minute


# Define our script
def main(name=None):
    """
    Get a URL for downloading a database dump from S3

    :param str name: Name of snapshot to download
    """
    # Rename our CLI arguments to better variables
    source_filename = name

    # Generate our S3 connections
    s3_source_conn = S3Connection(
        S3_ACCESS_KEY_ID,
        S3_SECRET_ACCESS_KEY,
    )
    s3_source_bucket = s3_source_conn.get_bucket(S3_BUCKET)

    # If we don't have a file name yet, resolve it from S3
    # http://boto.readthedocs.org/en/latest/ref/s3.html#boto.s3.key.Key.get_contents_as_string
    if source_filename is None:
        s3_latest_key = Key(bucket=s3_source_bucket, name=LATEST_FILENAME)
        source_filename = s3_latest_key.get_contents_as_string()

    # Generate our S3 target
    s3_source_key = Key(bucket=s3_source_bucket, name=source_filename)
    print(s3_source_key.generate_url(expires_in=URL_TTL))


if __name__ == '__main__':
    # Set up our arguments
    parser = argparse.ArgumentParser()
    parser.add_argument('name', nargs='?', help='Optional name of the dump to use')

    # Parse them and run it through `main`
    args = parser.parse_args()
    main(**args.__dict__)
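
get_dump_url.py takes either no argument (resolve the newest dump through the latest-dump pointer) or an explicit dump name; the dated filename below is hypothetical but follows the format upload_db_dump.py generates:

# Presigned URL for the newest dump
python get_dump_url.py

# Presigned URL for a specific, older dump
python get_dump_url.py "20150101.120000.000000000.sql.gz"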
upload_db_dump.py

# Load in dependencies
import argparse
import datetime
import subprocess
import sys

from boto.s3.connection import S3Connection
from boto.s3.key import Key

S3_BUCKET = '{{S3_BUCKET}}'
S3_ACCESS_KEY_ID = '{{S3_ACCESS_KEY_ID}}'
S3_SECRET_ACCESS_KEY = '{{S3_SECRET_ACCESS_KEY}}'
LATEST_FILENAME = 'latest-dump'


def upload_progress_cb(bytes_transmitted, bytes_total):
    """Helper to output progress of S3 upload"""
    print('Bytes uploaded: {bytes_transmitted}/{bytes_total}'
          .format(bytes_transmitted=bytes_transmitted, bytes_total=bytes_total))


# Define our script
def main(db_conninfo):
    """
    Upload database dump to S3

    # DEV: We wanted to write this script via `bash` but `boto's s3put` doesn't support stdin

    :param str db_conninfo: Connection string to use for connecting to PostgreSQL
        e.g. `postgresql://localhost:5432/mydb`
    """
    # Create an S3 key to upload our dump into
    # Use same format as our releases (except with `000` right padding for same precision)
    target_filename = datetime.datetime.utcnow().strftime('%Y%m%d.%H%M%S.%f000.sql.gz')
    s3_target_conn = S3Connection(
        S3_ACCESS_KEY_ID,
        S3_SECRET_ACCESS_KEY,
    )
    s3_target_bucket = s3_target_conn.get_bucket(S3_BUCKET)
    s3_target_key = Key(bucket=s3_target_bucket, name=target_filename)
    s3_latest_key = Key(bucket=s3_target_bucket, name=LATEST_FILENAME)

    # Create a child process to dump our database info
    # DEV: We only dump data so we can still use `schema.sql` when loading our dumps
    # DEV: We pipe `stdin/stderr` to this program's `stdin/stderr` to hear any issues from `pg_dump`
    #   https://docs.python.org/2/library/subprocess.html
    # DEV: We pipe to `gzip` to minimize amount of data being loaded into Python's memory
    print('Buffering `pg_dump --data-only` into gzip')
    pg_dump_child = subprocess.Popen(['pg_dump', '--data-only', db_conninfo],
                                     stdin=None, stdout=subprocess.PIPE, stderr=sys.stderr)
    gzip_child = subprocess.Popen(['gzip'],
                                  stdin=pg_dump_child.stdout, stdout=subprocess.PIPE, stderr=sys.stderr)
    gzip_content = gzip_child.stdout.read()

    # Wait for `pg_dump` to terminate and verify it was a good exit
    exit_code = pg_dump_child.wait()
    if exit_code != 0:
        raise RuntimeError('Expected exit code "0" from `pg_dump` but it was "{code}"'.format(code=exit_code))

    # Wait for `gzip` to terminate and verify it was a good exit
    exit_code = gzip_child.wait()
    if exit_code != 0:
        raise RuntimeError('Expected exit code "0" from `gzip` but it was "{code}"'.format(code=exit_code))

    # Notify user of content we generated
    print('Content buffered and gzipped ({bytes} bytes)'.format(bytes=len(gzip_content)))

    # Upload the dump to S3
    # http://boto.readthedocs.org/en/latest/ref/s3.html#boto.s3.key.Key.set_contents_from_file
    print('Uploading data to S3 "{bucket}/{name}"'.format(bucket=S3_BUCKET, name=target_filename))
    s3_target_key.set_contents_from_string(gzip_content, headers={
        'Content-Type': 'text/plain',
        'Content-Encoding': 'gzip',
    }, cb=upload_progress_cb)
    print('Upload complete!')

    # Update the latest dump file on S3 to point to our new file
    print('Marking "{bucket}/{name}" as latest dump via "{bucket}/{latest_name}"'
          .format(bucket=S3_BUCKET, name=target_filename, latest_name=LATEST_FILENAME))
    s3_latest_key.set_contents_from_string(target_filename)
    print('Marking complete!')


if __name__ == '__main__':
    # Set up our arguments
    parser = argparse.ArgumentParser()
    parser.add_argument(
        'db_conninfo', help='Connection URI for source PostgreSQL (e.g. `postgresql://localhost:5432/mydb`)')

    # Parse them and run it through `main`
    args = parser.parse_args()
    main(**args.__dict__)
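
The {{S3_BUCKET}}, {{S3_ACCESS_KEY_ID}}, and {{S3_SECRET_ACCESS_KEY}} values are template placeholders to fill in before running. A typical invocation against the scrubbed copy created earlier would be:

# `pg_dump` accepts a plain database name or a full connection URI as `db_conninfo`
python upload_db_dump.py "scrubbed-db"
# python upload_db_dump.py "postgresql://localhost:5432/scrubbed-db"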