This script orchestrates running a TPT base template with many different substitutions (e.g. for each table / partition) and uploading the exported files to Google Cloud Storage.
The script assumes the following are installed / available on the VM:
- Teradata Build Tools (tbuild)
- Google Cloud SDK (gcloud / gsutil)
- Enough disk space to support the largest export.
- A TPT job base template with @var_* placeholders to be replaced by the values in each substitution file.
- The user running the script has permissions on all logs / checkpoint dirs specified in .twbcfg.ini (e.g. root).
- TPT job variables files on GCS for each table in the dataset you'd like to export from TD and upload to GCS.

These should each accept the following variables:
- @var_username: Pass along the UN passed to this script to each TPT invocation.
- @var_password: Pass along the PW passed to this script to each TPT invocation.
- @var_directorypath: Set dynamically by this script for each TPT export.
- @var_outfile: Name of the file to use when exporting to directorypath.
An example template file and substitution files can be found in examples/.
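For orientation, the contents of a substitution file might look like the following. This is only a sketch based on the comma-delimited var_foo='bar' format described under the --sub-files-path flag below; the files under examples/ are the authoritative reference, and var_tdpid / var_selectstmt here are illustrative names.
# Hypothetical contents of a substitution file such as subs/tpch/nations.txt
# (variable names beyond those listed above are illustrative):
var_tdpid='localhost',var_selectstmt='SELECT * FROM tpch.nations;'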
The output file names are derived from the --bucket and --destination-prefix flags as well as the substitution file path itself: for each substitution file found under the substitutions path, the --sub-files-path prefix is replaced with the bucket and destination prefix.
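A minimal sketch of how that mapping could be computed in bash, using illustrative variable names rather than the script's actual internals:
# Derive the destination path for one substitution file (illustrative only)
SUB_FILES_PATH="gs://dannyteradata/tptwrapper/subs/"
BUCKET="output-bucket"
DESTINATION_PREFIX=""                         # empty, as in the example below
sub_file="gs://dannyteradata/tptwrapper/subs/tpch/nations.txt"

relative="${sub_file#${SUB_FILES_PATH}}"      # tpch/nations.txt
table_path="${relative%.txt}"                 # tpch/nations
dest="gs://${BUCKET}/${DESTINATION_PREFIX}${table_path}/"
echo "${dest}"                                # gs://output-bucket/tpch/nations/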
For example if the script is invoked like so:
# Notably no output prefix
sudo ./tpt2gcs.sh run -v \
-t gs://dannyteradata/tptwrapper/example_base_template.tpt \
-u dbc \
-pw dbc \
-b output-bucket \
-s gs://dannyteradata/tptwrapper/subs/ \
-e truncate \
-l projects/dannyteradata/logs/tpt2gcs-script
And the subs path prefix contains substitution files like this:
gsutil ls gs://dannyteradata/tptwrapper/subs/**
gs://dannyteradata/tptwrapper/subs/tpch/nations.txt # full tpch.nations table
gs://dannyteradata/tptwrapper/subs/foo/bar/$20200101.txt # single partition of foo.bar table
The exported data files would look like:
gsutil ls gs://output-bucket/**
gs://output-bucket/tpch/nations/nations.dat
gs://output-bucket/tpch/nations/_SUCCESS
gs://output-bucket/foo/bar/$20200101/$20200101.dat
gs://output-bucket/foo/bar/$20200101/_SUCCESS
Exports from Teradata are bound by Teradata resources / FastExport throughput. For this reason, the gain in disk performance from Local SSDs is not worth the restrictions that come with them. Instead, it's recommended to use persistent disks, which can mount a larger total disk volume (64 TB as compared to 9 TB).
We suggest a machine type no smaller than n1-standard-8 to optimize networking throughput.
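One way to provision such a VM (a sketch only; the instance name, zone, and disk size are placeholders to adjust for your environment, and the attached disk still needs to be formatted and mounted before use):
# Create an n1-standard-8 VM with an attached persistent disk for export staging
gcloud compute instances create teradata-export \
  --project dannysand \
  --zone us-central1-c \
  --machine-type n1-standard-8 \
  --create-disk=size=2TB,type=pd-ssd,auto-delete=yes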
- Should do a single export per partition.
- The VM must have enough disk space for the largest partition on the table.
- Handle the password securely:
  - Decrypt the Teradata password file from GCS with KMS.
- The script runs in a GCP VM, so it uses ADC for access to the GCS / KMS APIs:
  - needs roles/storage.objectAdmin on the bucket.
  - needs
- Copy locally exported files to GCS (see the sketch after this list).
- Write a success file (so the BQ loader function picks up this partition).
- Clean up the local disk.
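A hedged sketch of the bucket-level role grant and the per-export upload / success-marker / cleanup sequence described above; the service account, local paths, and variable names are placeholders:
# Grant the VM's service account object access on the destination bucket
gsutil iam ch \
  serviceAccount:tpt-export@my-project.iam.gserviceaccount.com:roles/storage.objectAdmin \
  gs://output-bucket

# Upload the exported files, write the _SUCCESS marker, then reclaim local disk
gsutil -m cp "${LOCAL_EXPORT_DIR}"/*.dat "gs://${BUCKET}/${DEST_PATH}/"
touch /tmp/_SUCCESS && gsutil cp /tmp/_SUCCESS "gs://${BUCKET}/${DEST_PATH}/_SUCCESS"
rm -rf "${LOCAL_EXPORT_DIR:?}"/*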
From this directory locally.
BUCKET=dannyteradata
# Copy Example Configuration Files
gsutil cp examples/example_base_template.tpt gs://${BUCKET}/tptwrapper/example_base_template.tpt
gsutil cp examples/nation.txt gs://${BUCKET}/tptwrapper/subs/
gsutil cp examples/lineitem.txt gs://${BUCKET}/tptwrapper/subs/
gsutil cp tpt2gcs.sh gs://${BUCKET}/tptwrapper/
SSH to the VM with Teradata tbuild installed.
VM_NAME="teradata"
PROJECT="dannysand"
gcloud beta compute ssh --zone "us-central1-c" "${VM_NAME}" --project "$PROJECT"
From the GCP VM with connectivity to Teradata (or the TD instance itself).
gsutil cp gs://dannyteradata/tptwrapper/tpt2gcs.sh .
chmod +x tpt2gcs.sh
sudo ./tpt2gcs.sh run -v \
-t gs://dannyteradata/tptwrapper/example_base_template.tpt \
-u dbc \
-pw dbc \
-b dannyteradata \
-s gs://dannyteradata/tptwrapper/subs/ \
-e truncate \
-p tptwrapper/exports/tdch/ \
-l projects/dannyteradata/logs/tpt2gcs-script
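Because the invocation above passes -l, the script's log entries can be inspected with Cloud Logging, for example (assuming the script has written entries under that log name):
# Read recent entries written under the log name passed via -l
gcloud logging read 'logName="projects/dannyteradata/logs/tpt2gcs-script"' \
  --project dannyteradata \
  --limit 20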
Usage:
tpt2gcs.sh [run|help] \
[-h] [-v] [-e skip | fail | truncate] [-p <DESTINATION_PREFIX>] \
-t gs://<BUCKET>/<BASE_TPT_TEMPLATE_PATH> \
-u <USERNAME> \
-pw <PASSWORD> \
-b <BUCKET> \
-s gs://<BUCKET>/<SUBSTITUTIONS_PATH>
Wrapper to manage many TPT exports to Google Cloud Storage.
The script works correctly when added to the PATH variable, allowing you to run it
from any location.
Flags:
-h, --help
Shows this help message.
-t, --tpt-template gs://<BUCKET>/<TPT_TEMPLATE_PATH>
Required. Base TPT template defining variable sub_files.
This script assumes that this template uses the following substitution
variables:
var_username: Pass along the UN passed to this script to each TPT invocation.
var_password: Pass along the PW passed to this script to each TPT invocation.
var_directorypath: Set dynamically by this script for each TPT export.
var_outfile: Set dynamically by this script for each TPT export.
-u, --username <USERNAME>
Required. Teradata Username.
-pw, --password <PASSWORD>
Required. Teradata Password.
-b, --bucket <GCS_BUCKET>
Required. Destination Google Cloud Storage Bucket.
-s, --sub-files-path gs://<BUCKET>/<SCRIPTS_PATH>
Required. Path to files of comma-delimited var_foo='bar',var_baz='qux',..
with which to run the template tpt script.
This script will essentially run the following command for each
sub_files file in this path:
tbuild -h 128M -f <template-script> -C -u '<sub_files-file-contents>'
-e, --gcs-dir-exists-behavior
Optional. What to do if the target GCS directory is not empty.
One of 'fail', 'truncate', 'skip'. Default is 'skip'.
-p, --destination-prefix
Optional. Prefix to upload GCS files to. This should usually be the destination
dataset name in BigQuery. This should contain a trailing slash if desired.
-l, --cloud-log-name
Optional. Google Cloud Logging Name.
The format can be: projects/[PROJECT_ID]/logs/[LOG_ID]
See https://cloud.google.com/logging/docs/reference/v2/rest/v2/LogEntry
for other possible naming conventions.
-v, --verbose
Add even more verbosity when running the script.
These are the supported commands, used in various situations:
help
Print help
run
Run a TPT for every sub_files file in the GCS sub-files-path.
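Putting the flags together, the wrapper's per-file behaviour amounts to roughly the following loop. This is a hedged sketch with illustrative local paths and variable values, not the script's actual implementation:
# For each substitution file, run the base template with its variables appended
for sub_file in $(gsutil ls "gs://dannyteradata/tptwrapper/subs/**"); do
  gsutil cp "${sub_file}" /tmp/sub_vars.txt
  # var_directorypath and var_outfile are supplied dynamically per export
  tbuild -h 128M -f /tmp/example_base_template.tpt -C \
    -u "$(cat /tmp/sub_vars.txt),var_directorypath='/data/export/nations',var_outfile='nations.dat'"
done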
There is a chunk.sh convenience script for chunking the historical data in a Teradata table based on some timestamp column (e.g. created_ts).
sudo ./chunk.sh run -v \
-td localhost \
-u dbc \
-pw dbc \
-t tpch.lineitem \
-c l_commitdate \
-g DAY \
-o chunks.csv
Usage:
chunk.sh [run|help] \
[-h] [-v] \
[-c <CHUNKING_COL>] \
[-g YEAR | MONTH | DAY | HOUR ] [-r <ROWS_PER_CHUNK>] \
-td <TDPID> \
-u <USERNAME> \
-pw <PASSWORD> \
-t <TABLE> \
-lwm <LOW WATERMARK> \
-hwm <HIGH WATERMARK> \
-o <OUTPUT FILE>
Helper utility for chunking the history of a Teradata TABLE into manageably sized
chunks for export. The output of the script is a query defining each batch.
It will split data into sub-chunks based on a date or timestamp column
(e.g. daily or hourly), then aggregate sub-chunks into chunks to reach an
approximate row-count threshold (e.g. 100,000 rows). This is purely
an optimization so as not to incur the overhead of too many export jobs in the
case that there are many sparse partitions (e.g. early in the history of
application logs).
The script works correctly when added to the PATH variable, allowing you to run it
from any location.
Flags:
-h, --help
Shows this help message.
-td, --TDPID
Required. Teradata instance ID.
-u, --username <USERNAME>
Required. Teradata Username.
-pw, --password <PASSWORD>
Required. Teradata Password.
-t, --TABLE
Required. ID of TABLE to chunk.
-c, --chunking-col
Column name of timestamp on which to perform chunking.
Default: "cre_ts"
-g, --granularity
Sub-chunk granularity. Must be one of {YEAR, MONTH, DAY, HOUR}.
Default: "HOUR"
-r, --rows-per-chunk
Target number of rows per chunk.
Default: "100,000"
-lwm, --low-watermark
Timestamp where the export time range should start.
Default: "1970-01-01 00:00:00"
-hwm, --high-watermark
Timestamp where the export time range should end.
Default: 20-11-25 01:16:17
-l, --cloud-log-name
Optional. Google Cloud Logging Name.
The format can be: projects/[PROJECT_ID]/logs/[LOG_ID]
See https://cloud.google.com/logging/docs/reference/v2/rest/v2/LogEntry
for other possible naming conventions.
-v, --verbose
Add even more verbosity when running the script.
These are the supported commands, used in various situations:
help
Print help
run
Chunk the TABLE's history and write a query defining each batch to the output file.
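To make the sub-chunk / chunk aggregation described in the help above more concrete, the per-day row counts that get rolled up into ~100,000-row chunks could be inspected with a query like the one below. This assumes the standard bteq client is available; the table, column, and logon values mirror the earlier example and are illustrative only:
# Per-day row counts for tpch.lineitem on l_commitdate (DAY granularity)
bteq <<'EOF'
.LOGON localhost/dbc,dbc
SELECT CAST(l_commitdate AS DATE) AS sub_chunk,
       COUNT(*) AS row_cnt
FROM tpch.lineitem
GROUP BY 1
ORDER BY 1;
.LOGOFF
.QUIT
EOF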