Hadoop
Prerequisites (passwordless ssh to localhost):
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
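Before moving on, it can help to confirm the key setup actually took. A small sketch; the check_key_perms helper is illustrative, not part of Hadoop:

```shell
#!/bin/sh
# Passwordless ssh silently fails if authorized_keys is group/world
# readable-writable in some sshd configs. This helper checks a file's
# permission bits; pass it ~/.ssh/authorized_keys.
check_key_perms() {
    # GNU stat first (-c), BSD/macOS stat (-f) as fallback
    perms=$(stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1" 2>/dev/null)
    if [ "$perms" = "600" ]; then
        echo "OK: $1 has mode 600"
    else
        echo "WARN: $1 has mode $perms (expected 600)"
    fi
}

# Then a one-shot connectivity test; BatchMode fails fast instead of
# prompting for a password if key auth is broken:
#   ssh -o BatchMode=yes localhost true && echo "passwordless ssh works"
```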
Edit two files:
- /etc/hadoop/core-site.xml as:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
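To double-check which filesystem URI ended up in the config, you can pull a property value back out of one of these XML files. A rough grep/sed helper that assumes the one-property-per-line layout above (for the authoritative answer, Hadoop ships `hdfs getconf -confKey fs.defaultFS`):

```shell
#!/bin/sh
# Print the <value> for a given property <name> in a Hadoop-style XML
# config file. Crude text matching, not a real XML parser -- fine for
# hand-written configs like the ones in these notes.
get_hadoop_conf() {
    name=$1; file=$2
    grep -A1 "<name>$name</name>" "$file" \
        | sed -n 's/.*<value>\(.*\)<\/value>.*/\1/p'
}
```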
- /etc/hadoop/hdfs-site.xml as:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
(Note: javax.jdo.option.ConnectionURL, the Derby metastore connection string, is a Hive setting, not an HDFS one. It belongs in Hive's conf/hive-site.xml, covered in the Hive section below.)
You may need to format the namenode the first time: hdfs namenode -format
Start the HDFS server:
- sbin/start-dfs.sh
Hive
- Set HADOOP_HOME in conf/hive-env.sh
Init Derby Schema:
schematool -dbType derby -initSchema
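schematool reads the Derby connection string from Hive's conf/hive-site.xml (the same javax.jdo.option.ConnectionURL property that appears earlier in these notes). If you want it explicit rather than relying on Hive's embedded-Derby defaults, a minimal sketch:

```xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
</configuration>
```

A relative Derby path like metastore_db resolves against the directory you launch from, so run hive and schematool from the same working directory or the metastore will appear empty.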
CREATE EXTERNAL TABLE presto_test(
column1 string,
column2 string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION
'hdfs://localhost:9000/Users/christine.le/hadoop/test_data';
Or put the DDL in a file and run:
hive -f ddl.hql
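The table above expects comma-delimited rows with two string columns, so a matching sample file can be generated locally before copying it into the table's LOCATION (the rows here are made up for illustration):

```shell
#!/bin/sh
# Generate a sample CSV matching the presto_test schema: two string
# columns, comma-delimited, no header row (Hive text tables treat every
# line as data unless skip.header.line.count is set on the table).
make_sample_csv() {
    out=$1
    printf '%s\n' \
        'alpha,one' \
        'beta,two' \
        'gamma,three' > "$out"
}
```

Because the table is EXTERNAL, any file dropped under its LOCATION directory on HDFS shows up in query results without an explicit load step.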
- Start the metastore service:
hive --service metastore
Helpful command to check Derby DB:
schematool -dbType derby -info
Load data: https://stackoverflow.com/questions/19320611/hadoop-hive-loading-data-from-csv-on-a-local-machine
There are a couple of ways to load data:
- Load data from hive:
hive> load data local inpath '/Users/christine.le/hadoop/test_data/athena_test.csv' OVERWRITE INTO TABLE presto_test;
Loading data to table default.presto_test
- Load data from HDFS:
hdfs dfs -copyFromLocal /Users/christine.le/hadoop/test_data/athena_test.csv hdfs://localhost:9000/Users/christine.le/hadoop/test_data/athena_test.csv
Check if data loaded:
presto> select * from default.presto_test;
Presto
- etc/catalog/hive.properties:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://127.0.0.1:9083
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
Start the Presto server:
bin/launcher run
In Presto CLI:
./presto --server localhost:8080 --catalog hive
select * from default.presto_test;
Troubleshooting:
If you get connection errors, or "ssh: Could not resolve hostname c02w214khv2j: nodename nor servname provided, or not known", restart HDFS with a fresh namenode format (note that reformatting wipes any data already in HDFS):
stop-dfs.sh
hdfs namenode -format
start-dfs.sh
Occasionally, the datanode daemon does not get started. When this happens, either:
- Run: hdfs namenode -format
- Or delete the hadoop subdirectories under your /tmp directory
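The /tmp cleanup can be scripted; this sketch takes the temp directory as an argument so it is easy to test against a scratch dir first (by default Hadoop's local data lives under /tmp/hadoop-<username> unless hadoop.tmp.dir says otherwise):

```shell
#!/bin/sh
# Remove Hadoop scratch dirs (e.g. /tmp/hadoop-christine.le) under a
# given temp directory. NOTE: this wipes the local HDFS block storage,
# so any data in your single-node HDFS is gone afterwards -- you will
# need to re-run `hdfs namenode -format` and re-load data.
clean_hadoop_tmp() {
    tmpdir=${1:-/tmp}
    for d in "$tmpdir"/hadoop-*; do
        [ -e "$d" ] || continue   # glob matched nothing
        echo "removing $d"
        rm -rf "$d"
    done
}
```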
Tips:
- Use jps to make sure datanode and namenode daemons are up
- Reference log files in hadoop/logs
- Run the Presto CLI with --debug for more detail: ./presto --server localhost:8080 --catalog hive --debug
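The jps check from the tips above can be wrapped so it is obvious at a glance which daemon is missing. The parsing is plain grep over jps-style output, so it can be exercised on canned text too (daemon names as jps prints them: NameNode, DataNode, SecondaryNameNode):

```shell
#!/bin/sh
# Given jps-style output ("<pid> <ClassName>" per line) on stdin, report
# which of the expected HDFS daemons are missing. Typical use:
#   jps | report_missing_daemons
report_missing_daemons() {
    input=$(cat)
    missing=""
    for daemon in NameNode DataNode; do
        # -w keeps "NameNode" from matching "SecondaryNameNode"
        echo "$input" | grep -qw "$daemon" || missing="$missing $daemon"
    done
    if [ -z "$missing" ]; then
        echo "all HDFS daemons up"
    else
        echo "missing:$missing"
    fi
}
```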
Misc handy commands to remember:
sudo killall -HUP mDNSResponder