TPT to GCS script

TPT + gsutil Teradata export Script

This script orchestrates running a TPT base template with many different substitutions (e.g. one per table or partition) and uploading the exported files to Google Cloud Storage.

Prerequisites

The script assumes the following are available on the VM:

  • Teradata Parallel Transporter (tbuild)
  • Google Cloud SDK (gcloud / gsutil)
  • Enough disk space to hold the largest export.
  • A TPT job base template with @var_* placeholders to be replaced by the values in each substitution file.
  • The user running the script has permissions on all log / checkpoint dirs specified in .twbcfg.ini (e.g. root).
  • TPT job variables files on GCS for each table in the dataset you'd like to export from Teradata and upload to GCS. Each of these should accept the following variables:
    • @var_userid: Passes the username given to this script along to each TPT invocation.
    • @var_password: Passes the password given to this script along to each TPT invocation.
    • @var_directorypath: Set dynamically by this script for each TPT export.
    • @var_outfile: Name of the file to use when exporting to the directory path.

An example template file and substitutions files can be found in examples/.
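
Putting the pieces together, for each substitutions file the script effectively runs a tbuild command like the following (a sketch with illustrative values; the first four job variables are injected by the script, the rest come from the substitutions file):

tbuild -h 128M -f base.tpt -C \
  -u "var_userid='dbc', var_password='dbc', var_directorypath='/path/to/local/tmp-dir', var_outfile='lineitem.dat', var_tdpid='localhost', var_exportquery='select * from tpch.lineitem', ..."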

Controlling Output Filenames

The output file names are derived from --bucket and --destination-prefix as well as the substitution file path itself: for each substitutions file found under the substitutions path, the --sub-files-path prefix is replaced with the bucket and destination prefix.

For example, if the script is invoked like so:

# Notably no output prefix
sudo ./tpt2gcs.sh run -v \
  -t gs://dannyteradata/tptwrapper/example_base_template.tpt \
  -u dbc \
  -pw dbc \
  -b output-bucket \
  -s  gs://dannyteradata/tptwrapper/subs/ \
  -e truncate \
  -l projects/dannyteradata/logs/tpt2gcs-script

And the subs path prefix contains substitutions files like this:

gsutil ls gs://dannyteradata/tptwrapper/subs/**
gs://dannyteradata/tptwrapper/subs/tpch/nations.txt  # full tpch.nations table
gs://dannyteradata/tptwrapper/subs/foo/bar/$20200101.txt  # single partition of foo.bar table

The exported data files would look like:

gsutil ls gs://output-bucket/**
gs://output-bucket/tpch/nations/nations.dat
gs://output-bucket/tpch/nations/_SUCCESS
gs://output-bucket/foo/bar/$20200101/$20200101.dat
gs://output-bucket/foo/bar/$20200101/_SUCCESS
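
Under the hood, this mapping is just a prefix substitution plus an extension trim, roughly equivalent to the following shell logic (variable names mirror the script's flags):

# e.g. sub_file=gs://dannyteradata/tptwrapper/subs/tpch/nations.txt
sub_path="${sub_file/$SUBSTITUTIONS_PATH}"                        # -> tpch/nations.txt
sub_path="${sub_path%.*}"                                         # -> tpch/nations
gcs_dest_path="gs://${BUCKET}/${DESTINATION_PREFIX}${sub_path}"   # -> gs://output-bucket/tpch/nations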

Performance

Storage

Exports from Teradata are bound by Teradata resources / FastExport throughput. For this reason, the disk-performance gain of Local SSDs is not worth the restrictions that come with them. Instead, it's recommended to use persistent disks, which can mount a larger total disk volume (64 TB compared to 9 TB).

Machine Type

We'd suggest no smaller machine type than n1-standard-8 to optimize networking throughput.
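
For reference, a VM along these lines could be created with something like the following (instance name, project, zone, and disk sizes are illustrative, not part of this gist):

gcloud compute instances create teradata-export-vm \
  --project=my-project \
  --zone=us-central1-c \
  --machine-type=n1-standard-8 \
  --boot-disk-size=200GB \
  --create-disk=size=10TB,type=pd-standard,auto-delete=yes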

Behavior

  • Runs a single export per substitutions file (e.g. per partition).
  • The VM must have enough disk space for the largest partition of the table.
  • Handles the password securely:
    • Decrypt the Teradata password file from GCS with Cloud KMS (see the sketch after this list).
  • The script runs on a GCP VM, so it uses Application Default Credentials (ADC) for access to the GCS / KMS APIs.
    • The VM's identity needs roles/storage.objectAdmin on the bucket.
  • Copies locally exported files to GCS.
  • Writes a success file (so the BQ loader function picks up this partition).
  • Cleans up the local disk.
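
A minimal sketch of the credential handling and IAM setup described above (bucket, key ring, key, and service account names are illustrative, not part of the script):

# Fetch and decrypt the Teradata password with Cloud KMS
gsutil cp gs://my-bucket/secrets/td_password.enc .
gcloud kms decrypt \
  --location=us-central1 --keyring=my-keyring --key=td-password-key \
  --ciphertext-file=td_password.enc --plaintext-file=td_password.txt

# Grant the VM's service account object admin on the destination bucket
gsutil iam ch serviceAccount:exporter@my-project.iam.gserviceaccount.com:roles/storage.objectAdmin gs://output-bucket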

Running the example

From this directory locally.

BUCKET=dannyteradata
# Copy Example Configuration Files
gsutil cp examples/example_base_template.tpt gs://${BUCKET}/tptwrapper/example_base_template.tpt
gsutil cp examples/nation.txt gs://${BUCKET}/tptwrapper/subs/
gsutil cp examples/lineitem.txt gs://${BUCKET}/tptwrapper/subs/
gsutil cp tpt2gcs.sh gs://${BUCKET}/tptwrapper/

SSH to the VM with Teradata tbuild installed.

VM_NAME="teradata"
PROJECT="dannysand"
gcloud beta compute ssh --zone "us-central1-c" "${VM_NAME}" --project "$PROJECT"

From the GCP VM with connectivity to Teradata (or the TD instance itself).

gsutil cp gs://dannyteradata/tptwrapper/tpt2gcs.sh .
chmod +x tpt2gcs.sh
sudo ./tpt2gcs.sh run -v \
  -t gs://dannyteradata/tptwrapper/example_base_template.tpt \
  -u dbc \
  -pw dbc \
  -b dannyteradata \
  -s  gs://dannyteradata/tptwrapper/subs/ \
  -e truncate \
  -p tptwrapper/exports/tdch/ \
  -l projects/dannyteradata/logs/tpt2gcs-script

Usage

Usage:
  tpt2gcs.sh [run|help] \
    [-h] [-v] [-e skip | fail | truncate] [-p <DESTINATION_PREFIX>] \
    -t gs://<BUCKET>/<BASE_TPT_TEMPLATE_PATH> \
    -u <USERNAME> \
    -pw <PASSWORD> \
    -b <BUCKET> \
    -s  gs://<BUCKET>/<SUBSTITUTIONS_PATH>

Wrapper to manage many TPT exports to Google Cloud Storage.
The script is designed to work when placed on the PATH, so you can run it from any location.
Flags:
-h, --help
        Shows this help message.
-t, --tpt-template gs://<BUCKET>/<TPT_TEMPLATE_PATH>
        Required. Base TPT template defining variable sub_files.
        This script assumes that this template uses the following substitution
        variables:
          var_userid: Pass along the username passed to this script to each TPT invocation.
          var_password: Pass along the password passed to this script to each TPT invocation.
          var_directorypath: Set dynamically by this script for each TPT export.
          var_outfile: Set dynamically by this script for each TPT export.
-u, --username <USERNAME>
        Required. Teradata Username.
-pw, --password <PASSWORD>
        Required. Teradata Password.
-b, --bucket <GCS_BUCKET>
        Required. Destination Google Cloud Storage Bucket.
-s  --sub-files-path gs://<BUCKET>/<SCRIPTS_PATH>
        Required. Path to files of comma-delimited var_foo='bar',var_baz='qux',..
        with which to run the template tpt script.
        This script will essentially run the following command for each
        sub_files file in this path:
        tbuild -h 128M -f <template-script> -C  -u '<sub_files-file-contents>'
-e  --gcs-dir-exists-behavior
        Optional. What to do if the target GCS directory is not empty.
        One of 'fail', 'truncate', 'skip'. Default is 'skip'.
-p  --destination-prefix
        Optional. Prefix to upload GCS files to. This should usually be the destination
        dataset name in BigQuery. It should include a trailing slash if desired.
-l  --cloud-log-name
        Optional. Google Cloud Logging Name.
        The format can be: projects/[PROJECT_ID]/logs/[LOG_ID]
        See https://cloud.google.com/logging/docs/reference/v2/rest/v2/LogEntry
        for other possible naming conventions.
-v, --verbose
        Add even more verbosity when running the script.

These are the supported commands, used in various situations:
help
        Print help
run
        Run a TPT for every sub_files file in the GCS sub-files-path.

Teradata Chunking Script

There is a chunk.sh convenience script for chunking the historical data in a Teradata table based on a timestamp column (e.g. created_ts).

Example Usage

sudo ./chunk.sh run -v \
  -td localhost \
  -u dbc \
  -pw dbc \
  -t tpch.lineitem \
  -c l_commitdate \
  -g DAY \
  -o chunks.csv
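
The output file (chunks.csv above) then contains one export query per chunk, pipe-separated from the chunk number, roughly like the following (chunk boundaries are illustrative):

1|SELECT * FROM tpch.lineitem WHERE l_commitdate BETWEEN '1992-01-02 00:00:00' AND '1992-03-15 00:00:00';
2|SELECT * FROM tpch.lineitem WHERE l_commitdate BETWEEN '1992-03-16 00:00:00' AND '1992-05-20 00:00:00';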

Usage


Usage:
  chunk.sh [run|help] \
    [-h] [-v] \
    [-c <CHUNKING_COL>] \
    [-g YEAR | MONTH | DAY | HOUR ] [-r <ROWS_PER_CHUNK>] \
    -td <TDPID> \
    -u <USERNAME> \
    -pw <PASSWORD> \
    -t <TABLE> \
    -lwm <LOW WATERMARK> \
    -hwm <HIGH WATERMARK> \
    -o <OUTPUT FILE>

Helper utility for chunking the history of a teradata TABLE into manageable size
chunks for exports. The output of the script is a query defining each batch.
It will split data into sub-chunks based on a date or timestamp column
(e.g. daily or hourly) then it will aggregate sub-chunks into chunks to reach an
approximate row count chunk size threshold (e.g. 100,000 rows). This is purely
an optimization as to not have excess overhead of too many export jobs in the
case that there are many sparse partitions (e.g. early in the history of
application logs).
The script is designed to work when placed on the PATH, so you can run it from any location.
Flags:
-h, --help
        Shows this help message.
-td, --TDPID
        Required. Teradata instance ID.
-u, --username <USERNAME>
        Required. Teradata Username.
-pw, --password <PASSWORD>
        Required. Teradata Password.
-t, --TABLE
        Required. ID of TABLE to chunk.
-c, --chunking-col
        Column name of timestamp on which to perform chunking.
        Default: "cre_ts"
-g, --granularity
        subchunk granularity. Must be one of {YEAR, MONTH, DAY, HOUR}.
        Default: "HOUR"
-r, --rows-per-chunk
        target number of rows per chunk.
        Default: "100,000"
-lwm, --low-watermark
        timestamp where export time range should start.
        Default: "1970-01-01 00:00:00"
-hwm, --high-watermark
        timestamp where export time range should end.
        Default: the current UTC timestamp (evaluated when the script starts).

-l  --cloud-log-name
        Optional. Google Cloud Logging Name.
        The format can be: projects/[PROJECT_ID]/logs/[LOG_ID]
        See https://cloud.google.com/logging/docs/reference/v2/rest/v2/LogEntry
        for other possible naming conventions.
-v, --verbose
        Add even more verbosity when running the script.

These are the supported commands, used in various situations:
help
        Print help
run
        Run the chunking query and write one export query per chunk to the output file.
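
chunk.sh
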
#!/usr/bin/env bash
# Copyright 2020 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
#
# This software is provided as-is,
# without warranty or representation for any use or purpose.
# Your use of it is subject to your agreement with Google.
set -eo pipefail
################################################################################
# GLOBAL DEFAULTS #
################################################################################
APP_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
APP_CACHE_DIR_TEMPLATE=".chunk-cache-dir-XXXXXXXX"
APP_CACHE_DIR=$(mktemp -d -t "$APP_CACHE_DIR_TEMPLATE" -p "$APP_DIR")
trap 'rm -rf "${APP_CACHE_DIR}"' EXIT
CMDNAME="$(basename -- "$0")"
BTEQ="bteq" # tbuild binary with universal options
GSUTIL="gsutil -q"
GCLOUD="gcloud -q"
export GRANULARITY="HOUR"
export CHUNKING_COL="cre_ts"
export ROWS_PER_CHUNK="100000"
export LWM="1970-01-01 00:00:00"
# shellcheck disable=SC2155
export HWM=$(date --utc +"%Y-%m-%d %H:%M:%S") # default high watermark: now (UTC); 4-digit year so Teradata accepts the timestamp literal
export VERBOSE="false"
################################################################################
# UTILITY FUNCTIONS #
################################################################################
function g_log() {
local severity="$1"
local message="$2"
if [[ -n "${LOG_NAME}" ]]; then
gcloud logging write "${LOG_NAME}" --severity="$severity" "$message"
fi
}
# error log wrapper
function log_and_fail() {
echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')]: ERROR $*" >&2
g_log "ERROR" "$*"
return 1
}
# verbose log wrapper
function debug_log() {
if [[ ${VERBOSE} == "true" ]]; then
echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')]: DEBUG $*" >&2
fi
g_log "DEBUG" "$*"
}
# verbose log wrapper
function info_log() {
echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')]: INFO $*" >&2
g_log "INFO" "$*"
}
# gcs copy
function g_cp() {
CMD="$GSUTIL cp ${1} ${2}"
debug_log "Running ${CMD}"
${CMD} || log_and_fail "Unable to execute ${CMD}"
}
# Run the teradata query to perform chunking logic
function run_chunk_query_to_file(){
if [[ $GRANULARITY =~ YEAR|MONTH ]]; then
sub_chunk_expression="CAST(TRUNC(CAST($CHUNKING_COL AS DATE), '$GRANULARITY') AS TIMESTAMP(6))"
elif [ "$GRANULARITY" == "DAY" ]; then
sub_chunk_expression="CAST(TRUNC(CAST($CHUNKING_COL AS DATE), 'DD') AS TIMESTAMP(6))"
elif [ "$GRANULARITY" == "HOUR" ]; then
export ts=$CHUNKING_COL
sub_chunk_expression="$ts - (EXTRACT(MINUTE FROM $ts) * INTERVAL '1' MINUTE + EXTRACT(SECOND FROM $ts) * INTERVAL '1' SECOND)"
else
log_and_fail "granularity must be YEAR, MONTH, DAY or HOUR."
fi
debug_log "sub chunk expression = $sub_chunk_expression"
$BTEQ <<EOF
.LOGON $TDPID/$USERNAME,$PASSWORD;
.EXPORT RESET;
.EXPORT REPORT FILE=$OUTPUT_FILE;
.SET SEPARATOR '|';
.SET WIDTH 254;
WITH
rows_per_part AS (
SELECT
$sub_chunk_expression AS part,
COUNT(*) AS row_count
FROM
$TABLE
WHERE
$CHUNKING_COL BETWEEN '$LWM' AND '$HWM'
GROUP BY
part),
rows_per_part_cum_sum AS (
SELECT
part,
row_count,
SUM(row_count) OVER(ORDER BY part ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_row_count
FROM
rows_per_part),
chunk_nums AS (
SELECT
part,
row_count,
running_row_count,
CAST(FLOOR(running_row_count / (.70 * $ROWS_PER_CHUNK)) AS INTEGER) AS chunk_num
FROM
rows_per_part_cum_sum),
chunks AS (
SELECT
chunk_num,
MIN(part) AS min_timestamp,
MAX(part) AS max_timestamp,
COUNT(*) AS parts_per_chunk,
SUM(row_count) AS ROWS_PER_CHUNK
FROM
chunk_nums
GROUP BY chunk_num)
SELECT
chunk_num (TITLE ''),
'SELECT * FROM $TABLE WHERE $CHUNKING_COL BETWEEN ''' || TO_CHAR(min_timestamp, 'YYYY-MM-DD HH24:MI:SS') || ''' AND ''' || TO_CHAR(max_timestamp, 'YYYY-MM-DD HH24:MI:SS') || ''';' (TITLE '')
FROM
chunks
ORDER BY
min_timestamp ASC,
max_timestamp ASC;
.LOGOFF;
.EXIT
EOF
# remove useless header from output file
sed -i '1d' "$OUTPUT_FILE"
}
usage() {
cat << EOF
Usage:
${CMDNAME} [run|help] \\
[-h] [-v] \\
[-c <CHUNKING_COL>] \\
[-g YEAR | MONTH | DAY | HOUR ] [-r <ROWS_PER_CHUNK>] \\
-td <TDPID> \\
-u <USERNAME> \\
-pw <PASSWORD> \\
-t <TABLE> \\
-lwm <LOW WATERMARK> \\
-hwm <HIGH WATERMARK> \\
-o <OUTPUT FILE>
Helper utility for chunking the history of a teradata TABLE into manageable size
chunks for exports. The output of the script is a query defining each batch.
It will split data into sub-chunks based on a date or timestamp column
(e.g. daily or hourly) then it will aggregate sub-chunks into chunks to reach an
approximate row count chunk size threshold (e.g. 100,000 rows). This is purely
an optimization as to not have excess overhead of too many export jobs in the
case that there are many sparse partitions (e.g. early in the history of
application logs).
The script is designed to work when placed on the PATH, so you can run it from any location.
Flags:
-h, --help
Shows this help message.
-td, --TDPID
Required. Teradata instance ID.
-u, --username <USERNAME>
Required. Teradata Username.
-pw, --password <PASSWORD>
Required. Teradata Password.
-t, --TABLE
Required. ID of TABLE to chunk.
-c, --chunking-col
Column name of timestamp on which to perform chunking.
Default: "cre_ts"
-g, --granularity
subchunk granularity. Must be one of {YEAR, MONTH, DAY, HOUR}.
Default: "HOUR"
-r, --rows-per-chunk
target number of rows per chunk.
Default: "100,000"
-lwm, --low-watermark
timestamp where export time range should start.
Default: "1970-01-01 00:00:00"
-hwm, --high-watermark
timestamp where export time range should end.
Default: $(date --utc +"%Y-%m-%d %H:%M:%S")
-l --cloud-log-name
Optional. Google Cloud Logging Name.
The format can be: projects/[PROJECT_ID]/logs/[LOG_ID]
See https://cloud.google.com/logging/docs/reference/v2/rest/v2/LogEntry
for other possible naming conventions.
-v, --verbose
Add even more verbosity when running the script.
These are the supported commands, used in various situations:
help
Print help
run
Run the chunking query and write one export query per chunk to the output file.
EOF
echo
}
################################################################################
# ENTRYPOINT CONTROL LOOP #
################################################################################
if [[ "$#" -eq 0 ]]; then
echo "You must provide at least one command."
usage
exit 1
fi
CMD=$1
shift
if [ "${CMD}" == "help" ]
then
usage
exit 0
fi
if [ "${CMD}" != "run" ]
then
usage
exit 1
fi
################################################################################
# CLI ARGUMENT PARSING #
################################################################################
if [[ "$#" -eq 0 ]]; then
echo "must set required options."
usage
exit 1
else
# Parse Flags.
while (( "$#" ))
do
case "${1}" in
-h|--help)
usage;
exit 0 ;;
-t|--TABLE)
export TABLE="${2}";
shift 2 ;;
-td|--TDPID)
export TDPID="${2}";
shift 2 ;;
-u|--username)
export USERNAME="${2}";
shift 2 ;;
-pw|--password)
export PASSWORD="${2}";
shift 2 ;;
-c|--chunking-col)
export CHUNKING_COL="${2}";
shift 2 ;;
-g|--granularity)
export GRANULARITY="${2}";
shift 2 ;;
-r|--rows-per-chunk)
export ROWS_PER_CHUNK="${2}";
shift 2 ;;
-lwm|--low-watermark)
export LWM="${2}";
shift 2 ;;
-hwm|--high-watermark)
export HWM="${2}";
shift 2 ;;
-o|--output-file)
export OUTPUT_FILE="${2}";
shift 2 ;;
-l|--cloud-log-name)
export LOG_NAME="${2}";
shift 2 ;;
-v|--verbose)
export VERBOSE="true";
export GSUTIL="gsutil"
export GCLOUD="gcloud"
echo "Verbosity turned on" >&2
shift ;;
*)
usage
echo "ERROR: Unknown argument ${1}"
exit 1
;;
esac
done
# Check required Args
if [ -z ${TDPID+x} ]
then
echo "tdpid is required"
usage
exit 1
fi
if [ -z ${TABLE+x} ]
then
echo "table is required"
usage
exit 1
fi
if [ -z ${USERNAME+x} ]
then
echo "username is required"
usage
exit 1
fi
if [ -z ${PASSWORD+x} ]
then
echo "password is required"
usage
exit 1
fi
if [ -z ${OUTPUT_FILE+x} ]
then
echo "output-file is required"
usage
exit 1
fi
# run the main program
run_chunk_query_to_file
debug_log "created chunks: $(cat "${OUTPUT_FILE}")"
fi
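
Example base TPT template (example_base_template.tpt)
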
USING CHARACTER SET @var_utf
DEFINE JOB tpt_export_template
(
APPLY TO OPERATOR ($FILE_WRITER()[@var_instances]
ATTR (FileName=@var_outfile,Format=@var_format, TextDelimiterhex=@var_delimiter_value, DirectoryPath=@var_directorypath)
)
SELECT * FROM OPERATOR($EXPORT()[@var_exp_instances]
ATTR (TdpId=@var_tdpid, UserName=@var_userid, UserPassword=@var_password,QueryBandSessInfo=@var_qb,
SelectStmt=@var_exportquery,SpoolMode=@var_spoolmode,BlockSize=@var_blocksize,dateform=@var_dateform,
MaxSessions=@var_maxsessions, MaxDecimalDigits = 38)
);
);
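
Example substitutions file (lineitem.txt)
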
var_tdpid='localhost',
var_exportquery='select * from tpch.lineitem',
var_delimiter_value='10',
var_dateform='ANSIDATE',
var_utf='UTF8',
var_qb='UtilityDataSize=large;',
var_maxsessions=1,
var_instances=1,
var_exp_instances=1,
var_blocksize=1048472,
var_spoolmode='Nospool',
var_format='DELIMITED'
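
tpt2gcs.sh
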
#!/usr/bin/env bash
# Copyright 2020 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
#
# This software is provided as-is,
# without warranty or representation for any use or purpose.
# Your use of it is subject to your agreement with Google.
set -eo pipefail
################################################################################
# GLOBAL DEFAULTS #
################################################################################
APP_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
APP_CACHE_DIR_TEMPLATE=".tpt2gcs-cache-dir-XXXXXXXX"
APP_CACHE_DIR=$(mktemp -d -t "$APP_CACHE_DIR_TEMPLATE" -p "$APP_DIR")
trap 'rm -rf "${APP_CACHE_DIR}"' EXIT
CMDNAME="$(basename -- "$0")"
TBUILD="tbuild" # tbuild binary with universal options
GSUTIL="gsutil -q"
GCLOUD="gcloud -q"
SUCCESS_FILENAME="_SUCCESS"
export EXISTS_BEHAVIOR="skip"
export VERBOSE="false"
export DESTINATION_PREFIX=""
################################################################################
# UTILITY FUNCTIONS #
################################################################################
function g_log() {
local severity="$1"
local message="$2"
if [[ -n "${LOG_NAME}" ]]; then
gcloud logging write "${LOG_NAME}" --severity="$severity" "$message"
fi
}
# error log wrapper
function log_and_fail() {
echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')]: ERROR $*" >&2
g_log "ERROR" "$*"
return 1
}
# verbose log wrapper
function debug_log() {
if [[ ${VERBOSE} == "true" ]]; then
echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')]: DEBUG $*" >&2
fi
g_log "DEBUG" "$*"
}
# verbose log wrapper
function info_log() {
echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')]: INFO $*" >&2
g_log "INFO" "$*"
}
# gcs copy
function g_cp() {
CMD="$GSUTIL cp ${1} ${2}"
debug_log "Running ${CMD}"
${CMD} || log_and_fail "Unable to execute ${CMD}"
}
# gcs recursive copy
function g_cp_r() {
CMD="$GSUTIL -m cp -r ${1} ${2}"
debug_log "Running ${CMD}"
${CMD} || log_and_fail "Unable to execute ${CMD}"
}
# gcs recursive delete
function g_rm_r() {
CMD="$GSUTIL -m rm -r ${1}"
debug_log "Running ${CMD}"
${CMD} || log_and_fail "Unable to execute ${CMD}"
}
# gcs list objects at prefix
function g_ls() {
CMD="$GSUTIL ls ${1}"
debug_log "Running ${CMD}"
${CMD}
}
# gcs cat
function g_cat() {
CMD="$GSUTIL cat ${1}"
debug_log "Running ${CMD}"
${CMD} || log_and_fail "Unable to execute ${CMD}"
}
# create empty file on gcs
function g_touch(){
local success_file
success_file=$(mktemp)
# shellcheck disable=SC2064
trap "rm -f $success_file" RETURN
g_cp "$success_file" "${1}"
}
# Read / write files from app cache dir
function save_to_file() {
# shellcheck disable=SC2005
echo "$(eval echo "\$$1")" > "${APP_CACHE_DIR}/.$1"
}
function read_from_file() {
cat "${APP_CACHE_DIR}/.$1" 2>/dev/null || true
}
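# Run tbuild with the downloaded base template ($1) and the per-table job variables string ($2)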
function tbuild_with_job_vars() {
$TBUILD "-h" "128M" "-f" "$1" "-C" "-u" "$2" || log_and_fail "Failed to run $TBUILD -u '$1'"
}
function gcs_push(){
local tmp_dir="$1"
local gcs_dest_path="$2"
# Push locally exported data to GCS.
debug_log "copying $tmp_dir to $gcs_dest_path"
g_cp_r "$tmp_dir" "$gcs_dest_path"
# Add success file to indicate entire batch was successfully pushed
debug_log "pushing $SUCCESS_FILENAME to GCS"
g_touch "$gcs_dest_path/$SUCCESS_FILENAME"
info_log "Successfully ran $sub_filename pushing $SUCCESS_FILENAME to GCS."
# Clean up locally exported files.
debug_log "cleaning temp dir $tmp_dir"
rm -rf "$tmp_dir"
}
################################################################################
# Core Run TPT Jobs Logic #
################################################################################
function run(){
debug_log "APP_CACHE_DIR=${APP_CACHE_DIR}"
local base_tpt_script_local="${APP_CACHE_DIR}/base.tpt"
g_cp "$TPT_TEMPLATE" "$base_tpt_script_local"
# TODO Loop over list of table names and generate a dynamic fast export script.
# Loop over Substitution files
local sub_files
sub_files=$(g_ls "$SUBSTITUTIONS_PATH**")
debug_log "SUBSTITUTIONS_FILES: ${sub_files}"
local tmp_dir
local subs
for sub_file in $sub_files
do
sub_path="${sub_file/$SUBSTITUTIONS_PATH}"
sub_path="${sub_path%.*}" # trim file extension
sub_filename=$(basename -- "$sub_file")
sub_filename="${sub_filename%.*}" # trim file extension
tmp_dir=$(mktemp -d "${sub_filename}-XXXXXXXX" -p "${APP_CACHE_DIR}")
subs=$(g_cat "$sub_file")
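# Map the substitution file's path under --sub-files-path to its destination under gs://<bucket>/<destination-prefix>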
gcs_dest_path="${sub_file/$SUBSTITUTIONS_PATH/gs:\/\/$BUCKET\/$DESTINATION_PREFIX}"
set +e
objs_at_dest=$(g_ls "$gcs_dest_path")
ls_exit_code="$?"
set -e
if [ $ls_exit_code -ne 0 ]
then
debug_log "no objects in dest path $gcs_dest_path. continuing."
else
case $EXISTS_BEHAVIOR in
skip)
continue
;;
fail)
log_and_fail "found unexpected objects in dest path $gcs_dest_path. $objs_at_dest"
;;
truncate)
debug_log "deleting existing objects in dest path $gcs_dest_path. $objs_at_dest"
g_rm_r "$gcs_dest_path"
;;
esac
fi
debug_log "Running TPT for $sub_filename to temp dir $tmp_dir with substitutions $subs"
tbuild_with_job_vars "$base_tpt_script_local" "var_userid='$USERNAME', var_password='$PASSWORD', var_directorypath='$tmp_dir', var_outfile='$sub_filename.dat', ${subs//$'\n'/}"
gcs_push "$tmp_dir" "$gcs_dest_path"
done
}
usage() {
cat << EOF
Usage:
${CMDNAME} [run|help] \\
[-h] [-v] [-e skip | fail | truncate] [-p <DESTINATION_PREFIX>] \\
-t gs://<BUCKET>/<BASE_TPT_TEMPLATE_PATH> \\
-u <USERNAME> \\
-pw <PASSWORD> \\
-b <BUCKET> \\
-s gs://<BUCKET>/<SUBSTITUTIONS_PATH>
Wrapper to manage many TPT exports to Google Cloud Storage.
The script is designed to work when placed on the PATH, so you can run it from any location.
Flags:
-h, --help
Shows this help message.
-t, --tpt-template gs://<BUCKET>/<TPT_TEMPLATE_PATH>
Required. Base TPT template defining variable sub_files.
This script assumes that this template uses the following substitution
variables:
var_userid: Pass along the username passed to this script to each TPT invocation.
var_password: Pass along the password passed to this script to each TPT invocation.
var_directorypath: Set dynamically by this script for each TPT export.
var_outfile: Set dynamically by this script for each TPT export.
-u, --username <USERNAME>
Required. Teradata Username.
-pw, --password <PASSWORD>
Required. Teradata Password.
-b, --bucket <GCS_BUCKET>
Required. Destination Google Cloud Storage Bucket.
-s --sub-files-path gs://<BUCKET>/<SCRIPTS_PATH>
Required. Path to files of comma-delimited var_foo='bar',var_baz='qux',..
with which to run the template tpt script.
This script will essentially run the following command for each
sub_files file in this path:
tbuild -h 128M -f <template-script> -C -u '<sub_files-file-contents>'
-e --gcs-dir-exists-behavior
Optional. What to do if the target GCS directory is not empty.
One of 'fail', 'truncate', 'skip'. Default is 'skip'.
-p --destination-prefix
Optional. Prefix to upload GCS files to. This should usually be the destination
dataset name in BigQuery. It should include a trailing slash if desired.
-l --cloud-log-name
Optional. Google Cloud Logging Name.
The format can be: projects/[PROJECT_ID]/logs/[LOG_ID]
See https://cloud.google.com/logging/docs/reference/v2/rest/v2/LogEntry
for other possible naming conventions.
-v, --verbose
Add even more verbosity when running the script.
These are the supported commands, used in various situations:
help
Print help
run
Run a TPT for every sub_files file in the GCS sub-files-path.
EOF
echo
}
################################################################################
# ENTRYPOINT CONTROL LOOP #
################################################################################
if [[ "$#" -eq 0 ]]; then
echo "You must provide at least one command."
usage
exit 1
fi
CMD=$1
shift
if [ "${CMD}" == "help" ]
then
usage
exit 0
fi
if [ "${CMD}" != "run" ]
then
usage
exit 1
fi
################################################################################
# CLI ARGUMENT PARSING #
################################################################################
if [[ "$#" -eq 0 ]]; then
echo "must set required options."
usage
exit 1
else
# Parse Flags.
while (( "$#" ))
do
case "${1}" in
-h|--help)
usage;
exit 0 ;;
-t|--tpt-template)
export TPT_TEMPLATE="${2}";
shift 2 ;;
-u|--username)
export USERNAME="${2}";
shift 2 ;;
-pw|--password)
export PASSWORD="${2}";
shift 2 ;;
-b|--bucket)
export BUCKET="${2}";
shift 2 ;;
-s|--sub-files-path)
export SUBSTITUTIONS_PATH="${2}";
shift 2 ;;
-e|--gcs-dir-exists-behavior)
export EXISTS_BEHAVIOR="${2}";
shift 2 ;;
-p|--destination-prefix)
export DESTINATION_PREFIX="${2}";
shift 2 ;;
-l|--cloud-log-name)
export LOG_NAME="${2}";
shift 2 ;;
-v|--verbose)
export VERBOSE="true";
export GSUTIL="gsutil"
export GCLOUD="gcloud"
echo "Verbosity turned on" >&2
shift ;;
*)
usage
echo "ERROR: Unknown argument ${1}"
exit 1
;;
esac
done
# Check required Args
if [ -z ${TPT_TEMPLATE+x} ]
then
echo "tpt-template is required"
usage
exit 1
fi
if [ -z ${USERNAME+x} ]
then
echo "username is required"
usage
exit 1
fi
if [ -z ${PASSWORD+x} ]
then
echo "password is required"
usage
exit 1
fi
if [ -z ${BUCKET+x} ]
then
echo "bucket is required"
usage
exit 1
fi
if [ -z ${SUBSTITUTIONS_PATH+x} ]
then
echo "substitutions-path is required"
usage
exit 1
fi
# run the main program
run
fi