#!/bin/sh
#
# Google Cloud Platform Big Query script to copy data from one region to another
#
# Disclaimer: This is not an officially supported Google product.
# The code can serve as a baseline but is not meant for production use.
#
# Based on https://gist.github.com/dutchiechris/1da2f8786107a635e7e6f5a8c50e84dc with a few improvements (the main one: support for large BigQuery tables that cannot be exported to a single JSON file).
#
# Windows 10 has a Bash shell feature (Windows Subsystem for Linux). Once enabled and installed, it lets you run shell scripts on the Windows platform, though it is a completely different environment.
# Do not forget to remove all \r symbols (carriage returns) from the script file first.
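# A minimal sketch of how to do that (the filename copy_bq_region.sh is a placeholder, and -i behaves
# differently on BSD/macOS sed):
#   sed -i 's/\r$//' copy_bq_region.sh
# or, if dos2unix is installed:
#   dos2unix copy_bq_region.sh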
#
# BigQuery can only export data to a Cloud Storage bucket in the same region as the dataset, so we export the data to a bucket in the source region, copy the files to a bucket in the destination region, and then import the data there.
# The destination dataset and bucket must already exist in the target region (see the sketch after the step list below).
#
# 1. The script will read the table names in the source dataset and, for each table, do two things: export the table schema to the local TMPDIR and export the table data to the source bucket
# 2. Data will be copied from SOURCE_BUCKET to DEST_BUCKET between regions
# 3. For each table name in the SOURCE_DATASET the script will import data to the DEST_DATASET
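#
# Before running the script, the destination dataset and bucket must already exist in the target region.
# A minimal sketch of creating them, assuming the target region is EU (the location and names are placeholders):
#   bq --location=EU mk --dataset <DEST-PROJECT-ID>:<DEST-DATASET-NAME>
#   gsutil mb -l EU -p <DEST-PROJECT-ID> gs://<DEST-BUCKET-NAME>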
#
# If you are migrating Google Analytics (GMP) 360 BigQuery data, an additional step will be required, since the data move can take some time (days) and you don't want to interrupt your data flow.
# Instead of moving data directly to the destination dataset, set DEST_DATASET to a different name. Once the migration is complete, you can validate data quality
# and "manually" move the new tables (created while the script was running). Next, unlink GA from BigQuery and delete SOURCE_DATASET (the one you have copied to DEST_DATASET).
# Re-create SOURCE_DATASET in the desired region and link GA and BigQuery again. This ensures that any new data the GA-to-BQ export produces is sent to the correct region.
# The only thing left is to copy the tables from DEST_DATASET to the new SOURCE_DATASET. This code is at the end of the script and is commented out.
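# A hedged sketch of the delete/re-create step described above (the EU location is only an example;
# linking GA to BigQuery is done in the GA admin interface, not from the command line):
#   bq rm -r -f -d <SRC-PROJECT-ID>:<SRC-DATASET-NAME>
#   bq --location=EU mk --dataset <SRC-PROJECT-ID>:<SRC-DATASET-NAME>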
#
###
### Configuration:
###
# Project that owns source dataset
#SOURCE_PROJECT=<PROJECT-NAME>
# Project that owns destination dataset
#DEST_PROJECT=<PROJECT-NAME>
# Source bucket and dataset must be in the same region
SOURCE_DATASET=<SRC-PROJECT-ID>:<SRC-DATASET-NAME>
SOURCE_BUCKET=gs://<SRC-BUCKET-NAME>
# Destination bucket and dataset must be in the same region
DEST_DATASET=<DEST-PROJECT-ID>:<DEST-DATASET-NAME>
DEST_BUCKET=gs://<DEST-BUCKET-NAME>
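# For illustration only, a filled-in configuration might look like this (all names are placeholders):
#   SOURCE_DATASET=my-project:analytics_123456789
#   SOURCE_BUCKET=gs://my-project-bq-migration-us
#   DEST_DATASET=my-project:analytics_123456789_eu
#   DEST_BUCKET=gs://my-project-bq-migration-eu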
## DO NOT EDIT BELOW
###
### Main script:
###
TMPDIR=$(mktemp -d)
echo "Using temporary folder: $TMPDIR"
### Extract table data and save table schema locally in the temporary folder
for f in `bq ls --max_results=1000000 $SOURCE_DATASET | grep TABLE | awk '{print $1}'`; do
  echo "**** Saving schema from $f to $TMPDIR/schema-$f.json"
  bq show --format=prettyjson "$SOURCE_DATASET.$f" | jq '.schema.fields' > $TMPDIR/schema-$f.json
  echo "**** Extracting $f to $SOURCE_BUCKET/$f*.json"
  # The wildcard lets BigQuery shard large tables into multiple export files
  bq extract --destination_format=NEWLINE_DELIMITED_JSON "$SOURCE_DATASET.$f" $SOURCE_BUCKET/$f*.json
done
### Copy
echo "**** Copying data from $SOURCE_BUCKET/*.json to $DEST_BUCKET/"
gsutil -m cp $SOURCE_BUCKET/*.json $DEST_BUCKET/
### Load
#https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json#bigquery_load_table_gcs_json-cli
for f in `bq ls --max_results=1000000 $SOURCE_DATASET | grep TABLE | awk '{print $1}'`; do
  echo "**** Loading $DEST_BUCKET/$f*.json into $DEST_DATASET.$f with schema $TMPDIR/schema-$f.json"
  bq load --sync --replace --source_format=NEWLINE_DELIMITED_JSON "$DEST_DATASET.$f" $DEST_BUCKET/$f*.json $TMPDIR/schema-$f.json
done
rm -rf $TMPDIR
# Optional: copy all tables between datasets in the same region (see the GA 360 note above). Uncomment to use.
#
#for f in `bq ls --max_results=100000 $DEST_DATASET | grep TABLE | awk '{print $1}'`; do
# echo "**** Copying $f"
# bq cp -f -n $DEST_DATASET.$f $SOURCE_DATASET.$f
#done
echo "**** Completed. Well done! do not forget to delete buckets"