Teradata Notes

##Teradata TDCH and Parallel Transporter with MapR

##Environment

Running on Peep.local: 192.168.1.26

###Tasks

  • Load sample data into Teradata database
  • Run queries to ensure it's in there
  • Move data using TDCH
  • Move data using TD Parallel transporter

###log-synth schema

Here's the schema I'm using for log-synth:

[
  {"name": "id", "class": "id"},
  {"name": "full_name", "class": "name", "type": "first_last"},
  {"name": "address", "class": "address"},
  {"name": "state", "class": "state"}
]

To generate 1000 rows:

$ ./synth -count 1000 -schema users-schema.json -format CSV -quote OPTIMISTIC > users.csv
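A quick sanity check on the generated file (plain head/wc, nothing log-synth specific):

$ wc -l users.csv    # expect 1000 data rows (plus a header line, if one is emitted)
$ head -3 users.csv  # eyeball the field order: id, full_name, address, state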

Teradata notes

express user guide: http://developer.teradata.com/database/articles/teradata-express-14-0-for-vmware-user-guide

Web interface: http://192.168.1.26/c (u/p: admin/teradata)

Use bteq for interactive SQL:

.logon 127.0.0.1/dbc
Password: dbc
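For example, a minimal interactive check (uses the dbc/dbc login above; the COUNT only works once the users table exists and is loaded):

$ bteq
.logon 127.0.0.1/dbc
Password: dbc
HELP DATABASE vmtest;
SELECT COUNT(*) FROM vmtest.users;
.quit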

####Create user table

CREATE TABLE vmtest.users,
NO FALLBACK, 
NO BEFORE JOURNAL, 
NO AFTER JOURNAL,
CHECKSUM = DEFAULT ( 
id INTEGER NOT NULL,
name VARCHAR(128) NOT NULL,
address VARCHAR(256) NOT NULL,
state VARCHAR(64) NOT NULL
) PRIMARY INDEX ( id );

script:

bteq<<EOF
.LOGON 127.0.0.1/dbc,dbc;
DROP TABLE vmtest.users;
CREATE TABLE vmtest.users,
NO FALLBACK, 
NO BEFORE JOURNAL, 
NO AFTER JOURNAL,
CHECKSUM = DEFAULT ( 
id INTEGER NOT NULL,
name VARCHAR(128) NOT NULL,
address VARCHAR(256) NOT NULL,
state VARCHAR(64) NOT NULL
) PRIMARY INDEX ( id );
EOF

####tdload

/opt/teradata/client/14.10/tbuild/bin/tdload -f ~/users.csv -u vmtest -p vmtest -h 127.0.0.1 -t vmtest.users

tdload creates a handful of auxiliary tables (error, log, and UV tables) for each job; when a job errors out, drop all of them before retrying:

#!/bin/sh
bteq<<EOF
.LOGON 127.0.0.1/dbc,dbc;
drop table vmtest.users;
drop table vmtest.users_ET;
drop table vmtest.users_Log;
drop table vmtest.users_UV;
EOF

Show database information: help database <dbname>;

Drop all tables associated with the erroring table.

Other things to check when a load fails (a combined reset sketch follows this list):

  • Ensure that there are no 'continue files' in /opt/teradata/client/14.10/tbuild/checkpoint/root
  • Make sure that string fields aren't quoted in the input data
  • Specify a format for date fields, e.g. MM/DD/YYYY
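A sketch that rolls the cleanup steps above into one reset script (table names and checkpoint path are the ones used in these notes):

#!/bin/sh
# Reset a failed tdload run: drop the work/error tables and clear any continue files
bteq<<EOF
.LOGON 127.0.0.1/dbc,dbc;
drop table vmtest.users_ET;
drop table vmtest.users_Log;
drop table vmtest.users_UV;
EOF
rm -f /opt/teradata/client/14.10/tbuild/checkpoint/root/*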

####Connectivity from Teradata to MapR

Ensure the NFS mount can be seen from Teradata:

# showmount -e 192.168.1.32
Export list for 192.168.1.32:
/mapr               *
/mapr/demo.mapr.com *

Mount the NFS share:

$ mount -o hard,rw,nolock,intr 192.168.1.32:/mapr /mapr
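To make the mount persistent across reboots, an /etc/fstab entry along these lines should work (same options as the manual mount above):

# /etc/fstab
192.168.1.32:/mapr  /mapr  nfs  hard,rw,nolock,intr  0 0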

####TDCH

TDCH Tutorial

Using teradata connector 1.3.2 (downloaded the HW one): download here

Install it on the MapR nodes; after the RPM install, use this script to move files to the correct locations:

#!/bin/bash
# Locate the TDCH install directory, then copy its jars and config templates into the MapR Hadoop install
TDCH_INSTALL=$(ls -ld /usr/lib/tdch/* | awk '{print $NF}');
if [[ ! -d $TDCH_INSTALL ]]
then
	echo "TDCH not installed!";
	exit 1;
fi

HADOOP_INSTALL=$(ls -ld /opt/mapr/hadoop/hadoop-* | awk '{print $NF}');

if [[ $? -ne 0 ]]
then
	echo "Couldn't find hadoop install!";
	exit 1;
fi

echo "Copying files into ${HADOOP_INSTALL}/lib..."
cp $TDCH_INSTALL/lib/*.jar $HADOOP_INSTALL/lib/
chmod 644 $HADOOP_INSTALL/lib/tdgssconfig.jar
chmod 644 $HADOOP_INSTALL/lib/teradata-connector-1.3.2.jar
chmod 644 $HADOOP_INSTALL/lib/terajdbc4.jar


echo "Copying config files into ${HADOOP_INSTALL}/conf..."
cp $TDCH_INSTALL/conf/teradata-export-properties.xml.template $HADOOP_INSTALL/conf/teradata-export-properties.xml
chown mapr:root $HADOOP_INSTALL/conf/teradata-export-properties.xml
chmod 644 $HADOOP_INSTALL/conf/teradata-export-properties.xml

cp $TDCH_INSTALL/conf/teradata-import-properties.xml.template $HADOOP_INSTALL/conf/teradata-import-properties.xml
chown mapr:root $HADOOP_INSTALL/conf/teradata-import-properties.xml
chmod 644 $HADOOP_INSTALL/conf/teradata-import-properties.xml
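To verify the install (with HADOOP_INSTALL set as in the script above), the jars should be in place and the import tool should resolve on the classpath; it prints its usage when run with no arguments:

ls -l $HADOOP_INSTALL/lib/teradata-connector-1.3.2.jar $HADOOP_INSTALL/lib/terajdbc4.jar $HADOOP_INSTALL/lib/tdgssconfig.jar
hadoop com.teradata.hadoop.tool.TeradataImportTool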

Edit /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml and add this property inside the <configuration> element:

<property>
 <name>mapred.child.java.opts</name>
 <value>-Djava.security.egd=file:/dev/./urandom</value>
</property>

Copy table to local directory with TDCH:

$ hadoop com.teradata.hadoop.tool.TeradataImportTool -url jdbc:teradata://192.168.1.26/database=vmtest -username vmtest -password vmtest -classname com.teradata.jdbc.TeraDriver -fileformat textfile -jobtype hdfs -method split.by.hash -targetpaths /user/mapr/import_sample -nummappers 1 -sourcetable users
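Check the output before moving it into the Hive directory (standard hadoop fs commands):

$ hadoop fs -ls /user/mapr/import_sample
$ hadoop fs -cat /user/mapr/import_sample/part-m-00000 | head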

Move resulting file to a hive dir:

$ mv /mapr/demo.mapr.com/user/mapr/import_sample/part-m-00000 /mapr/demo.mapr.com/user/mapr/hive/users.tsv

Create external Hive table

$ hive
hive> create external table users (
id INT,
name STRING,
address STRING,
state STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/mapr/hive/users.tsv';

Query to ensure the data is there:

hive> select * from users limit 10;
OK
469	Marsha Sterling	2594 Lazy Goose Glade	pa
0	Robert Davis	45551 Windy Flower Swale	ny
938	Mae Graham	76831 Little Embers Journey	nj
265	Sandra Beaudoin	48189 Easy Forest Acres	tn
61	Dean Long	29964 Fallen Pony Falls	tx
734	Ronnie Freed	95785 Gentle Oak Woods	wa
326	James Reyes	2768 Pied Bird Lagoon	al
530	Kenneth Wolbert	48542 Square Rabbit Dale	oh
795	Kenneth Chamblee	32702 Turning Pony Gate	pa
999	Heather Gamez	88053 Pied Treasure Pass	fl
Time taken: 0.398 seconds, Fetched: 10 row(s)
hive>

Transferring a table from Teradata to Hive directly

TDCH needs to be told where the Hive and HCatalog jars (and their dependencies) live on MapR:

Script:

#!/bin/bash

export HIVE_HOME=/opt/mapr/hive/hive-0.13
export HCAT_HOME=$HIVE_HOME/hcatalog

export LIB_JARS=$HCAT_HOME/share/hcatalog/hive-hcatalog-core-0.13.0-mapr-1405.jar,$HIVE_HOME/lib/hive-metastore-0.13.0-mapr-1405.jar,$HIVE_HOME/lib/libthrift-0.9.0.jar,$HIVE_HOME/lib/hive-exec-0.13.0-mapr-1405.jar,$HIVE_HOME/lib/libfb303-0.9.0.jar,$HIVE_HOME/lib/jdo-api-3.0.1.jar,$HIVE_HOME/lib/hive-cli-0.13.0-mapr-1405.jar

export HADOOP_CLASSPATH=$HCAT_HOME/share/hcatalog/hive-hcatalog-core-0.13.0-mapr-1405.jar:$HIVE_HOME/lib/hive-metastore-0.13.0-mapr-1405.jar:$HIVE_HOME/lib/libthrift-0.9.0.jar:$HIVE_HOME/lib/hive-exec-0.13.0-mapr-1405.jar:$HIVE_HOME/lib/libfb303-0.9.0.jar:$HIVE_HOME/lib/jdo-api-3.0.1.jar:$HIVE_HOME/conf:$HADOOP_HOME/conf:$HADOOP_HOME/lib/slf4j-api-1.6.1.jar:$HIVE_HOME/lib/antlr-runtime-3.4.jar:$HIVE_HOME/lib/datanucleus-core-3.2.10.jar:$HIVE_HOME/lib/datanucleus-rdbms-3.2.9.jar:$HADOOP_HOME/lib/mysql-connector-java-5.1.25-bin.jar:$HIVE_HOME/lib/hive-cli-0.13.0-mapr-1405.jar


hadoop com.teradata.hadoop.tool.TeradataImportTool \
-libjars $LIB_JARS \
-url jdbc:teradata://192.168.1.26/database=vmtest \
-username vmtest \
-password vmtest \
-jobtype hive \
-fileformat rcfile \
-sourcetable users \
-sourcefieldnames "id,name,address,state" \
-nummappers 1 \
-targetdatabase default \
-targettable users_tdch \
-targettableschema "id int, name string, address string, state string" \
-targetfieldnames "id,name,address,state"
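A quick check that the direct import created and populated the Hive table (users_tdch is the target table name from the script above):

$ hive -e "describe users_tdch; select count(*) from users_tdch;"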
Running the tool with no arguments prints its usage:

[root@maprdemo hadoop-0.20.2]# hadoop com.teradata.hadoop.tool.TeradataImportTool
14/09/30 06:16:20 INFO tool.ConnectorImportTool: ConnectorImportTool starts at 1412082980438
hadoop jar teradata-hadoop-connector.jar
com.teradata.connector.common.tool.ConnectorImportTool
[-conf <conf file>] (optional)
[-jobtype <job type>] (values: hdfs, hive, and hcat, default is hdfs)
[-fileformat <file format>] (values: sequencefile, textfile, avrofile, orcfile and rcfile, default is textfile)
[-classname <classname>] (optional)
[-url <url>] (optional)
[-username <username>] (optional)
[-password <password>] (optional)
[-batchsize <batchsize>] (optional, default value is 10000)
[-accesslock <true|false>] (optional, default value is false)
[-queryband <queryband>] (optional)
[-targetpaths <path>] (optional, applicable for hdfs and hive jobs)]
[-sourcetable <tablename>] (optional, use -sourcetable or -sourcequery but not both)
[-sourceconditions <conditions>] (optional, use with -sourcetable option)
[-sourcefieldnames <fieldnames>] (optional, comma delimited format)
[-sourcerecordschema <record schema>] (optional, comma delimited format)
[-targetrecordschema <record schema>] (optional, comma delimited format)
[-sourcequery <query>] (optional, use either -sourcetable or -sourcequery but not both)
[-sourcecountquery <countquery>] (optional, use with -sourcequery option)
[-targetdatabase <database>] (optional)
[-targettable <table>] (optional)
[-targetfieldnames <fields>] (optional, comma separated format
[-targettableschema <schema>] (optional, comma separated format
[-targetpartitionschema <schema>] (optional, comma separated format, used with -targettableschema only
[-separator <separator>] (optional,used to separate fields in text)
[-lineseparator <lineseparator>] (optional, used to separate different lines, only useful for hive and hcat job)
[-enclosedby <enclosed-by-character> (optional, used to enclose text, only useful for hdfs job)]
[-escapedby <escaped-by-character> (optional, used to escape special characters, only useful for hdfs job)]
[-nullstring <string>] (optional, a string to replace null value of string type)
[-nullnonstring <string>] (optional, a string to replace null value of non-string type, only useful for hdfs job)
[-method <method>] (optional import method, values: split.by.partition, split.by.hash, split.by.value and split.by.amp only for Teradata version 14.10.00.02, default is split.by.hash)
[-nummappers <num>] (optional, default is 2)
[-splitbycolumn <column name>] (optional for split.by.hash and split.by.value methods)
[-forcestage <true>] (optional force to use stage table, default is false)
[-stagetablename <table name>] (optional, stage table name should be 20 characters or less)
[-stagedatabase <database>] (optional)
[-numpartitionsinstaging <num>] (optional)
[-hiveconf <target path>] (optional, required for hive and hcat jobs launched on non-name nodes)]
[-usexview <true|false>] (optional, default is true)
[-avroschema <avro schema>] (optional, an inline Avro schema definition)
[-avroschemafile <path>] (optional, a file path for Avro schema definition)
[-debugoption <debug option value>] (optional, a debug option to close some stages of the whole hadoop job)
[-sourcedateformat <date format>] (optional, a default date format for all converters that convert string to date)
[-targetdateformat <date format>] (optional, a default date format for all converters that convert date to string )
[-sourcetimeformat <time format>] (optional, a default time format for all converters that convert string to time)
[-targettimeformat <time format>] (optional, a default time format for all converters that convert time to string )
[-sourcetimestampformat <timestamp format>] (optional, a default timestamp format for all converters that convert string to timestamp)
[-targettimestampformat <timestamp format>] (optional, a default timestamp format for all converters that convert timestamp to string )
[-sourcetimezoneid <timezone id>] (optional, a default input timezone id for all converters that convert timestamp to another timestamp
[-targettimezoneid <timezone id>] (optional, a default output timezone id for all converters that convert timestamp to another timestamp
[-h|help] (optional)