This script orchestrates running a TPT base template with many different substitutions (e.g. for each table / partition) and uploading the exported files to Google Cloud Storage.
The script assumes the following are installed / available on the VM:
- Teradata Build Tools (tbuild)
- Google Cloud SDK (gcloud / gsutil)
- Enough disk space to support the largest export.
- A TPT job base template with @var_* placeholders to be replaced by the values in each substitution file.
- The user running the script has permissions on all logs / checkpoint dirs specified in .twbcfg.ini (e.g. root).
- TPT job variables files on GCS for each table in the dataset you'd like to export from TD and upload to GCS.

These should each accept the following variables:
- @var_username: Pass along the UN passed to this script to each TPT invocation.
- @var_password: Pass along the PW passed to this script to each TPT invocation.
- @var_directorypath: Set dynamically by this script for each TPT export.
- @var_outfile: Name of the file to use when exporting to directorypath.
An example template file and substitution files can be found in examples/.
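For orientation, the contents of a substitution file might look like the following. This is only a sketch based on the comma-delimited var_foo='bar' format described under the --sub-files-path flag below; the files under examples/ are the authoritative reference, and var_tdpid / var_selectstmt here are illustrative names.
# Hypothetical contents of a substitution file such as subs/tpch/nations.txt
# (variable names beyond those listed above are illustrative):
var_tdpid='localhost',var_selectstmt='SELECT * FROM tpch.nations;'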
The output file names are derived from the --bucket and --destination-prefix flags as well as the substitution file path itself: for each substitution file found under the substitutions path, the --sub-files-path prefix is replaced with the bucket and destination prefix.
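A minimal sketch of how that mapping could be computed in bash, using illustrative variable names rather than the script's actual internals:
# Derive the destination path for one substitution file (illustrative only)
SUB_FILES_PATH="gs://dannyteradata/tptwrapper/subs/"
BUCKET="output-bucket"
DESTINATION_PREFIX=""                         # empty, as in the example below
sub_file="gs://dannyteradata/tptwrapper/subs/tpch/nations.txt"

relative="${sub_file#${SUB_FILES_PATH}}"      # tpch/nations.txt
table_path="${relative%.txt}"                 # tpch/nations
dest="gs://${BUCKET}/${DESTINATION_PREFIX}${table_path}/"
echo "${dest}"                                # gs://output-bucket/tpch/nations/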
For example if the script is invoked like so:
# Notably no output prefix
sudo ./tpt2gcs.sh run -v \
-t gs://dannyteradata/tptwrapper/example_base_template.tpt \
-u dbc \
-pw dbc \
-b output-bucket \
-s gs://dannyteradata/tptwrapper/subs/ \
-e truncate \
-l projects/dannyteradata/logs/tpt2gcs-script
And the subs path prefix contains substitution files like this:
gsutil ls gs://dannyteradata/tptwrapper/subs/**
gs://dannyteradata/tptwrapper/subs/tpch/nations.txt # full tpch.nations table
gs://dannyteradata/tptwrapper/subs/foo/bar/$20200101.txt # single partition of foo.bar table
The exported data files would look like:
gsutil ls gs://output-bucket/**
gs://output-bucket/tpch/nations/nations.dat
gs://output-bucket/tpch/nations/_SUCCESS
gs://output-bucket/foo/bar/$20200101/$20200101.dat
gs://output-bucket/foo/bar/$20200101/_SUCCESS
Exports from Teradata are bound by Teradata resources / FastExport throughput. For this reason, the gain in disk performance from Local SSDs is not worth the restrictions that come with them. Instead, it's recommended to use persistent disks, which can mount a larger total disk volume (64 TB as compared to 9 TB).
We suggest a machine type no smaller than n1-standard-8 to optimize networking throughput.
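One way to provision such a VM (a sketch only; the instance name, zone, and disk size are placeholders to adjust for your environment, and the attached disk still needs to be formatted and mounted before use):
# Create an n1-standard-8 VM with an attached persistent disk for export staging
gcloud compute instances create teradata-export \
  --project dannysand \
  --zone us-central1-c \
  --machine-type n1-standard-8 \
  --create-disk=size=2TB,type=pd-ssd,auto-delete=yes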
- Should do a single export per partition.
- The VM must have enough disk space for the largest partition on the table.
- Handle the password securely:
  - Decrypt the Teradata password file from GCS with KMS.
- The script runs in a GCP VM, so it uses ADC for access to the GCS / KMS APIs:
  - needs roles/storage.objectAdmin on the bucket.
  - needs
- Copy locally exported files to GCS (see the sketch after this list).
- Write a success file (so the BQ loader function picks up this partition).
- Clean up the local disk.
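A hedged sketch of the bucket-level role grant and the per-export upload / success-marker / cleanup sequence described above; the service account, local paths, and variable names are placeholders:
# Grant the VM's service account object access on the destination bucket
gsutil iam ch \
  serviceAccount:tpt-export@my-project.iam.gserviceaccount.com:roles/storage.objectAdmin \
  gs://output-bucket

# Upload the exported files, write the _SUCCESS marker, then reclaim local disk
gsutil -m cp "${LOCAL_EXPORT_DIR}"/*.dat "gs://${BUCKET}/${DEST_PATH}/"
touch /tmp/_SUCCESS && gsutil cp /tmp/_SUCCESS "gs://${BUCKET}/${DEST_PATH}/_SUCCESS"
rm -rf "${LOCAL_EXPORT_DIR:?}"/*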
From this directory locally.
BUCKET=dannyteradata
# Copy Example Configuration Files
gsutil cp examples/example_base_template.tpt gs://${BUCKET}/tptwrapper/example_base_template.tpt
gsutil cp examples/nation.txt gs://${BUCKET}/tptwrapper/subs/
gsutil cp examples/lineitem.txt gs://${BUCKET}/tptwrapper/subs/
gsutil cp tpt2gcs.sh gs://${BUCKET}/tptwrapper/
SSH to the VM with Teradata tbuild installed.
VM_NAME="teradata"
PROJECT="dannysand"
gcloud beta compute ssh --zone "us-central1-c" "${VM_NAME}" --project "$PROJECT"
From the GCP VM with connectivity to Teradata (or the TD instance itself).
gsutil cp gs://dannyteradata/tptwrapper/tpt2gcs.sh .
chmod +x tpt2gcs.sh
sudo ./tpt2gcs.sh run -v \
-t gs://dannyteradata/tptwrapper/example_base_template.tpt \
-u dbc \
-pw dbc \
-b dannyteradata \
-s gs://dannyteradata/tptwrapper/subs/ \
-e truncate \
-p tptwrapper/exports/tdch/ \
-l projects/dannyteradata/logs/tpt2gcs-script
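Because the invocation above passes -l, the script's log entries can be inspected with Cloud Logging, for example (assuming the script has written entries under that log name):
# Read recent entries written under the log name passed via -l
gcloud logging read 'logName="projects/dannyteradata/logs/tpt2gcs-script"' \
  --project dannyteradata \
  --limit 20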
Usage:
tpt2gcs.sh [run|help] \
[-h] [-v] [-e skip | fail | truncate] [-p <DESTINATION_PREFIX>] \
-t gs://<BUCKET>/<BASE_TPT_TEMPLATE_PATH> \
-u <USERNAME> \
-pw <PASSWORD> \
-b <BUCKET> \
-s gs://<BUCKET>/<SUBSTITUTIONS_PATH>
Wrapper to manage many TPT exports to Google Cloud Storage.
The script works correctly when added to the PATH variable, allowing you to run it
from any location.
Flags:
-h, --help
Shows this help message.
-t, --tpt-template gs://<BUCKET>/<TPT_TEMPLATE_PATH>
Required. Base TPT template defining variable sub_files.
This script assumes that this template uses the following substitution
variables:
var_username: Pass along the UN passed to this script to each TPT invocation.
var_password: Pass along the PW passed to this script to each TPT invocation.
var_directorypath: Set dynamically by this script for each TPT export.
var_outfile: Set dynamically by this script for each TPT export.
-u, --username <USERNAME>
Required. Teradata Username.
-pw, --password <PASSWORD>
Required. Teradata Password.
-b, --bucket <GCS_BUCKET>
Required. Destination Google Cloud Storage Bucket.
-s, --sub-files-path gs://<BUCKET>/<SCRIPTS_PATH>
Required. Path to files of comma-delimited var_foo='bar',var_baz='qux',..
with which to run the template tpt script.
This script will essentially run the following command for each
sub_files file in this path:
tbuild -h 128M -f <template-script> -C -u '<sub_files-file-contents>'
-e, --gcs-dir-exists-behavior
Optional. What to do if the target GCS directory is not empty.
One of 'fail', 'truncate', 'skip'. Default is 'skip'.
-p, --destination-prefix
Optional. Prefix to upload GCS files to. This should usually be the destination
dataset name in BigQuery. This should contain a trailing slash if desired.
-l, --cloud-log-name
Optional. Google Cloud Logging Name.
The format can be: projects/[PROJECT_ID]/logs/[LOG_ID]
See https://cloud.google.com/logging/docs/reference/v2/rest/v2/LogEntry
for other possible naming conventions.
-v, --verbose
Add even more verbosity when running the script.
These are the supported commands, used in various situations:
help
Print help
run
Run a TPT for every sub_files file in the GCS sub-files-path.
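Putting the flags together, the wrapper's per-file behaviour amounts to roughly the following loop. This is a hedged sketch with illustrative local paths and variable values, not the script's actual implementation:
# For each substitution file, run the base template with its variables appended
for sub_file in $(gsutil ls "gs://dannyteradata/tptwrapper/subs/**"); do
  gsutil cp "${sub_file}" /tmp/sub_vars.txt
  # var_directorypath and var_outfile are supplied dynamically per export
  tbuild -h 128M -f /tmp/example_base_template.tpt -C \
    -u "$(cat /tmp/sub_vars.txt),var_directorypath='/data/export/nations',var_outfile='nations.dat'"
done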
There is a chunk.sh convenience script for chunking the historical data in a Teradata table based on some timestamp column (e.g. created_ts).
sudo ./chunk.sh run -v \
-td localhost \
-u dbc \
-pw dbc \
-t tpch.lineitem \
-c l_commitdate \
-g DAY \
-o chunks.csv
Usage:
chunk.sh [run|help] \
[-h] [-v] \
[-c <CHUNKING_COL>] \
[-g YEAR | MONTH | DAY | HOUR ] [-r <ROWS_PER_CHUNK>] \
-td <TDPID> \
-u <USERNAME> \
-pw <PASSWORD> \
-t <TABLE> \
-lwm <LOW WATERMARK> \
-hwm <HIGH WATERMARK> \
-o <OUTPUT FILE>
Helper utility for chunking the history of a Teradata TABLE into manageably sized
chunks for export. The output of the script is a query defining each batch.
It will split data into sub-chunks based on a date or timestamp column
(e.g. daily or hourly), then aggregate sub-chunks into chunks to reach an
approximate row-count threshold (e.g. 100,000 rows). This is purely
an optimization so as not to incur the overhead of too many export jobs in the
case that there are many sparse partitions (e.g. early in the history of
application logs).
The script works correctly when added to the PATH variable, allowing you to run it
from any location.
Flags:
-h, --help
Shows this help message.
-td, --TDPID
Required. Teradata instance ID.
-u, --username <USERNAME>
Required. Teradata Username.
-pw, --password <PASSWORD>
Required. Teradata Password.
-t, --TABLE
Required. ID of TABLE to chunk.
-c, --chunking-col
Column name of timestamp on which to perform chunking.
Default: "cre_ts"
-g, --granularity
Sub-chunk granularity. Must be one of {YEAR, MONTH, DAY, HOUR}.
Default: "HOUR"
-r, --rows-per-chunk
Target number of rows per chunk.
Default: "100,000"
-lwm, --low-watermark
Timestamp where the export time range should start.
Default: "1970-01-01 00:00:00"
-hwm, --high-watermark
Timestamp where the export time range should end.
Default: 20-11-25 01:16:17
-l, --cloud-log-name
Optional. Google Cloud Logging Name.
The format can be: projects/[PROJECT_ID]/logs/[LOG_ID]
See https://cloud.google.com/logging/docs/reference/v2/rest/v2/LogEntry
for other possible naming conventions.
-v, --verbose
Add even more verbosity when running the script.
These are the supported commands, used in various situations:
help
Print help
run
Chunk the TABLE's history and write a query defining each batch to the output file.
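To make the sub-chunk / chunk aggregation described in the help above more concrete, the per-day row counts that get rolled up into ~100,000-row chunks could be inspected with a query like the one below. This assumes the standard bteq client is available; the table, column, and logon values mirror the earlier example and are illustrative only:
# Per-day row counts for tpch.lineitem on l_commitdate (DAY granularity)
bteq <<'EOF'
.LOGON localhost/dbc,dbc
SELECT CAST(l_commitdate AS DATE) AS sub_chunk,
       COUNT(*) AS row_cnt
FROM tpch.lineitem
GROUP BY 1
ORDER BY 1;
.LOGOFF
.QUIT
EOF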