Copy a MySQL table to BigQuery.

Import a MySQL table or database into a BigQuery dataset.

Usage:

~/bin/mysql_table_to_big_query.sh bucket_name schema_name table_name

~/bin/mysql_schema_to_big_query.sh bucket_name [schema_name]
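
For example, to import a hypothetical table orders from a schema shopdb via a bucket named my-transfer-bucket (all three names are placeholders):

~/bin/mysql_table_to_big_query.sh my-transfer-bucket shopdb orders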

Description

mysql_table_to_big_query.sh exports the table from MySQL to CSV and exports the table structure to JSON and SQL files. The files are then uploaded to a folder of an existing Cloud Storage bucket.

These files are then imported into BigQuery.
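
Given the upload_folder variable in the script, the intermediate files land under an UPLOAD/ folder in the bucket, roughly like this (bucket and table names are illustrative, and the split suffix depends on the run time and file size):

gs://my-transfer-bucket/UPLOAD/orders.json
gs://my-transfer-bucket/UPLOAD/orders.sql
gs://my-transfer-bucket/UPLOAD/orders2001150930shopdbaa.gz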

A BigQuery dataset is created in the same project (if it does not already exist) with the name {SCHEMA_NAME}_{DATE}.
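
For example, importing from a schema named shopdb (a placeholder name) on 15 January 2020 would create, if needed, a dataset named shopdb_2020_01_15, matching the _%Y_%m_%d date suffix used in the script.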

If the table has a column with a DATE or TIMESTAMP datatype, the BigQuery table will be partitioned on it.

mysql_schema_to_big_query.sh extracts a list of all tables from the MySQL schema and calls mysql_table_to_big_query.sh for each one.

It creates log files in the local directory for each table. These can be used to review the import errors, then deleted.
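
A minimal sketch of what such a wrapper can look like (an illustration, not the exact script; it assumes mysql can list the tables for the schema and that mysql_table_to_big_query.sh is installed in ~/bin):

#!/bin/bash
# Illustrative wrapper: list every table in the schema and import each one,
# writing a per-table log file to the current directory.
BUCKET_NAME=$1
TABLE_SCHEMA=$2
for t in $(mysql "$TABLE_SCHEMA" -Bse 'show tables'); do
  ~/bin/mysql_table_to_big_query.sh "$BUCKET_NAME" "$TABLE_SCHEMA" "$t" > "${t}_import.log" 2>&1
done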

Attribution

The script is based on work by Shantanuo (shantanuo/mysql_to_big_query.sh). Main differences from the original: this version requires you to create a Google Cloud Storage bucket first, which gives some control over the storage class and also allows tables with underscores (which are not allowed in bucket names), and there are some tweaks to the logging. BigQuery partitioning was added from the fork https://gist.github.com/apenney/2d78fc1fb734bf3f58d5adbd1dfdd461

Dependencies

bash,

MySQL with authenticated access to specified schema.

Google Cloud SDK installed with shell configured for a GCP project.

A bucket belonging to the GCP project with write permissions. Select an appropriate storage class and region. If the bucket is not used for anything else, consider adding a lifecycle rule to delete the intermediary files after a couple of days (see the example rule below); they can't be deleted immediately without affecting the bq load.
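
For example, a rule like the following can be applied with gsutil lifecycle set (a sketch; the bucket name is a placeholder and the age should be tuned to how long your loads take):

cat > lifecycle.json << 'EOF'
{
  "rule": [
    {"action": {"type": "Delete"}, "condition": {"age": 2}}
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://my-transfer-bucket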

The local drive must have adequate space to hold the table in uncompressed CSV format.

mysql_schema_to_big_query.sh expects to find mysql_table_to_big_query.sh in ~/bin.

Installation

  1. Use wget to download these scripts.
  2. Copy both scripts to ~/bin and chmod them to be executable.
  3. Install gsutil and bigquery (or the Google Cloud SDK) and configure gsutil for your GCP project.
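
For example (the raw gist URLs are placeholders to be substituted):

wget <raw-gist-url>/mysql_table_to_big_query.sh
wget <raw-gist-url>/mysql_schema_to_big_query.sh
mkdir -p ~/bin
cp mysql_table_to_big_query.sh mysql_schema_to_big_query.sh ~/bin/
chmod +x ~/bin/mysql_table_to_big_query.sh ~/bin/mysql_schema_to_big_query.sh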

Limitations

Cannot import tables containing blobs.

Does not delete the intermediate files in the storage bucket. These must remain until the BQ load job completes.

To check the status of a load job, use: bq ls -j

Tested on Ubuntu.

Subject to all the limitations for importing CSV data into BigQuery.

If your date or timestamp data is not in the correct format, import it as STRING by modifying the SELECT statement that creates the schema in json_query.txt.

The dataset location is set by the default rules for the project. Check that the location of the dataset matches the bucket's.
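
One way to check both locations (the dataset and bucket names are placeholders):

bq show --format=prettyjson shopdb_2020_01_15 | grep location
gsutil ls -L -b gs://my-transfer-bucket | grep -i "location constraint"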

#!/bin/bash
BUCKET_NAME=$1
TABLE_SCHEMA=$2
TABLE_NAME=$3
PROJECT_ID=`gcloud config list --format 'value(core.project)' 2>/dev/null`
UIDX=`date '+_%Y_%m_%d'`
DATASET="$TABLE_SCHEMA$UIDX"
upload_folder="$BUCKET_NAME/UPLOAD/"
echo "Import $TABLE_SCHEMA.$TABLE_NAME from MySQL to BigQuery $PROJECT_ID:$DATASET.$TABLE_NAME via bucket $upload_folder"
# uncomment set -x for more printouts
#set -x
mytime=`date '+%y%m%d%H%M'`
hostname=`hostname | tr 'A-Z' 'a-z'`
file_prefix="$TABLE_NAME$mytime$TABLE_SCHEMA"
echo $file_prefix
splitat="4000000000"
bulkfiles=200
maxbad=300
# Cleanup function called on exit
function cleanup_files() {
rm -f ${TABLE_NAME}.*
rm -f ${file_prefix}*
rm -f blob_query.txt json_query.txt
}
# make sure bucket, schema and table names are supplied
if [ $# -ne 3 ];then
echo "Copy a table from MySQL schema to existing bucket in current gcp project then import to BigQuery dataset $DATASET."
echo "usage: $0 BUCKET_NAME SCHEMA_NAME TABLE_NAME"
exit 1
fi
echo "Checking that table doesn't contain blob data."
cat > blob_query.txt << heredoc
select sum(IF((DATA_TYPE LIKE '%blob%'),1, 0)) from INFORMATION_SCHEMA.columns where TABLE_SCHEMA = '$TABLE_SCHEMA' AND TABLE_NAME = '$TABLE_NAME';
heredoc
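# NOTE: $mysqlargs is not set anywhere in this script. It is expected to come
# from the environment and hold MySQL connection options (e.g. -h/-u/-p or
# --defaults-extra-file); if unset, mysql falls back to its default credentials.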
mycount=`mysql $mysqlargs -Bs < blob_query.txt`
if [ $mycount -ne 0 ]; then
echo "Cannot import table $TABLE_NAME. Blob column found."
exit 3
fi
echo "Creating JSON schema from mysql table structure"
##
## DATE and TIMESTAMP format
## If the DATE field is not formatted as 'YYYY-MM-DD' or the TIMESTAMP field is not formatted as 'YYYY-MM-DD HH:MM:SS' then import these as STRING and use a query to transform them. You will lose partitioning. E.g.
#select CONCAT('{"name": "', COLUMN_NAME, '","type":"', IF(DATA_TYPE like "%date%", "STRING",IF(DATA_TYPE like "%int%", "INTEGER",IF(DATA_TYPE = "decimal","FLOAT","STRING"))) , '"},') as json from information_schema.columns where TABLE_SCHEMA = '$TABLE_SCHEMA' AND TABLE_NAME = '$TABLE_NAME';
##
cat > json_query.txt << heredoc
select CONCAT('{"name": "', COLUMN_NAME, '","type":"', IF(DATA_TYPE like "%date%", "TIMESTAMP",IF(DATA_TYPE like "%int%", "INTEGER",IF(DATA_TYPE = "decimal","FLOAT","STRING"))) , '"},') as json from information_schema.columns where TABLE_SCHEMA = '$TABLE_SCHEMA' AND TABLE_NAME = '$TABLE_NAME';
heredoc
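# Build the BigQuery JSON schema: open the array, emit one {"name": ..., "type": ...}
# object per column (date-like types -> TIMESTAMP, int -> INTEGER, decimal -> FLOAT,
# everything else -> STRING), then append an extra "hostname" STRING column that
# corresponds to the extra field the sed step below adds to every CSV row.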
echo '[' > $TABLE_NAME.json
mysql -Bs < json_query.txt | sed '$s/,$//' >> $TABLE_NAME.json
mysql $TABLE_SCHEMA -Bse"show create table $TABLE_NAME\G" > $TABLE_NAME.sql
echo ', {"name": "hostname","type":"STRING"} ]' >> $TABLE_NAME.json
echo "Uploading schemas: $TABLE_NAME.json, $TABLE_NAME.sql to cloud: $upload_folder"
gsutil cp $TABLE_NAME.json gs://$upload_folder
gsutil cp $TABLE_NAME.sql gs://$upload_folder
echo "Exporting table: $TABLE_SCHEMA.$TABLE_NAME to CSV-like file: $TABLE_NAME.txt1"
mysql $TABLE_SCHEMA --quick -Bse"select * from $TABLE_NAME" > $TABLE_NAME.txt1
FILESIZE=$(stat -c%s "$TABLE_NAME.txt1")
if [ -s "$TABLE_NAME.txt1" ]
then
echo "Converting file $TABLE_NAME.txt1 $FILESIZE to CSV format for Big Query"
else
echo "$TABLE_NAME.txt1 is empty or does not exist! Exiting."
exit 3
fi
echo 'tr'
tr -d "\r" < $TABLE_NAME.txt1 > $TABLE_NAME.txt
echo 'sed1'
sed -i "s/$/\t$TABLE_SCHEMA/" $TABLE_NAME.txt
echo 'sed2'
# strip any remaining carriage returns (the original used a literal Ctrl-V Ctrl-M sequence here)
sed -i 's/\r//g' $TABLE_NAME.txt
echo "Spliting large file $TABLE_NAME.txt to files with prefix $file_prefix"
split -C $splitat $TABLE_NAME.txt $file_prefix
#echo "Loop and upload $file_prefix to google cloud"
for file in `ls $file_prefix*`
do
# big query does not seem to like double quotes and NULL
sed -i -e 's/\"//g' -e's/NULL//g' $file
gzip $file
echo "Uploading CSV dumpfile: $file.gz to google cloud: $upload_folder"
gsutil cp $file.gz gs://$upload_folder
if [ $? -ne 0 ];then
echo "$file could not be uploaded to cloud"
cleanup_files
exit 3
fi
#rm -f $file.gz
done
#check if data set exists and create if not
bq_safe_mk() {
dataset=$1
exists=$(bq ls -d | grep -w $dataset)
if [ -n "$exists" ]; then
echo "BigQuery Dataset $dataset already exists"
else
echo "Creating BigQuery Dataset $dataset"
bq mk $dataset
fi
}
# import data to big query
bq_safe_mk $DATASET
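# List the uploaded .gz chunks, group them $bulkfiles at a time with xargs,
# and join each group with commas so bq load receives several URIs per job.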
#echo "Check what's matching in the bucketi: $upload_folder$file_prefix"
filelist=`gsutil ls gs://$upload_folder$file_prefix*.gz | xargs -n$bulkfiles | tr ' ', ','`
echo "Loading files $filelist into Big Query"
for mylist in $filelist
do
# Check if we have a TIMESTAMP or DATE column, if we do then we can use partitioning
BQ_ARGS="--nosync -F \t --job_id=$file --max_bad_record=$maxbad"
if grep "TIMESTAMP" "$TABLE_NAME.json"; then
FIELD=$(grep "TIMESTAMP" "$TABLE_NAME.json" | tail -1 | cut -d'"' -f4)
echo "Partitioning Table based on TIMESTAMP column $FIELD"
BQ_ARGS+=" --time_partitioning_type=DAY --time_partitioning_field=${FIELD}"
elif grep "DATE" "$TABLE_NAME.json"; then
FIELD=$(grep "DATE" "$TABLE_NAME.json" | tail -1 | cut -d'"' -f4)
echo "Partitioning Table based on DATE column $FIELD"
BQ_ARGS+=" --time_partitioning_type=DAY --time_partitioning_field=${FIELD}"
fi
#bq load --nosync -F '\t' --job_id="$file" --max_bad_record=$maxbad $DATASET.$TABLE_NAME $mylist $TABLE_NAME.json
if ! bq load $BQ_ARGS "${DATASET}.${TABLE_NAME}" "$mylist" "${TABLE_NAME}.json"; then
echo "ERROR: bq load failed for $file, check file exists in cloud."
cleanup_files
exit 2
fi
done
echo "$0 completed. Cleaning up"
cleanup_files
exit
# Install google utilities (legacy standalone install)
wget http://commondatastorage.googleapis.com/pub/gsutil.tar.gz
tar xfz gsutil.tar.gz -C $HOME
# add gsutil to the PATH, e.g. by appending this line to ~/.bashrc:
export PATH=${PATH}:$HOME/gsutil
cd $HOME/gsutil
python setup.py install
gsutil config
# install the bq command-line tool
sudo sh
easy_install bigquery
bq init
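
# These days the simpler route is usually to install the Google Cloud SDK,
# which bundles both gsutil and bq (a sketch using Google's documented
# interactive installer):
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init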