@christine-le (last active February 27, 2019)
HDFS, Hive, and Presto Setups

Hadoop

Prerequisites (set up passwordless SSH to localhost):

  $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  $ chmod 0600 ~/.ssh/authorized_keys
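Before moving on, it is worth confirming that key-based login actually works; a small sanity-check sketch (not in the original notes):

```shell
# Sanity check (sketch): BatchMode makes ssh fail fast instead of prompting,
# so a failure here means the key setup above didn't take.
if ssh -o BatchMode=yes -o ConnectTimeout=5 localhost 'echo ssh ok' 2>/dev/null; then
  SSH_STATUS="OK"
else
  SSH_STATUS="FAILED (re-check ~/.ssh permissions)"
fi
echo "passwordless ssh: $SSH_STATUS"
```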

Edit two files:

  • /etc/hadoop/core-site.xml as:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
  • /etc/hadoop/hdfs-site.xml as:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
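Note: javax.jdo.option.ConnectionURL is a Hive metastore setting rather than an HDFS one. This setup evidently works with it in hdfs-site.xml, but its conventional home is Hive's own conf/hive-site.xml, e.g. (sketch, same values as above):

```xml
<!-- conf/hive-site.xml (sketch): the usual home for the metastore JDBC setting -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
</configuration>
```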

You may need to format the namenode the first time: hdfs namenode -format

Start the HDFS server:

  • sbin/start-dfs.sh

Hive

  • Set $HADOOP_HOME in conf/hive-env.sh

Init Derby Schema:

schematool -dbType derby -initSchema

Create a test table (either directly in the Hive CLI):

CREATE EXTERNAL TABLE presto_test(
  column1 string, 
  column2 string)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
LOCATION
  'hdfs://localhost:9000/Users/christine.le/hadoop/test_data';

OR:

hive -f ddl.hql 
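The ddl.hql file isn't shown in these notes; presumably it just holds the CREATE EXTERNAL TABLE statement from above, written out like this:

```shell
# Write the CREATE TABLE statement from above into ddl.hql so it can be
# replayed with `hive -f ddl.hql` (table name and path taken from these notes).
cat > ddl.hql <<'EOF'
CREATE EXTERNAL TABLE presto_test(
  column1 string,
  column2 string)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
LOCATION
  'hdfs://localhost:9000/Users/christine.le/hadoop/test_data';
EOF
echo "wrote ddl.hql"
```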

Start the metastore service:

hive --service metastore

Helpful command to check Derby DB:

schematool -dbType derby -info

Load data: https://stackoverflow.com/questions/19320611/hadoop-hive-loading-data-from-csv-on-a-local-machine
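The contents of athena_test.csv never appear in these notes; any comma-delimited file with two columns matches the presto_test schema. A made-up example (values invented, written under /tmp here for illustration; the notes themselves use /Users/christine.le/hadoop/test_data):

```shell
# Generate a tiny two-column CSV matching the presto_test schema
# (invented values; path changed to /tmp purely for illustration).
mkdir -p /tmp/test_data
printf 'a1,b1\na2,b2\na3,b3\n' > /tmp/test_data/athena_test.csv
cat /tmp/test_data/athena_test.csv
```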

hdfs dfs -copyFromLocal /Users/christine.le/hadoop/test_data/athena_test.csv hdfs://localhost:9000/Users/christine.le/hadoop/test_data/athena_test.csv

There are a couple of ways to load data:

  1. Load data from Hive:
hive> load data local inpath '/Users/christine.le/hadoop/test_data/athena_test.csv' OVERWRITE INTO TABLE presto_test;
Loading data to table default.presto_test
  2. Copy the file into HDFS directly:
hdfs dfs -copyFromLocal /Users/christine.le/hadoop/test_data/athena_test.csv hdfs://localhost:9000/Users/christine.le/hadoop/test_data/athena_test.csv

Check that the data loaded (the file should be in HDFS and the rows queryable):

hdfs dfs -ls hdfs://localhost:9000/Users/christine.le/hadoop/test_data

presto> select * from default.presto_test;

Presto

  • etc/catalog/hive.properties (under the Presto install directory):
connector.name=hive-hadoop2
hive.metastore.uri=thrift://127.0.0.1:9083
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml

Start the Presto server: bin/launcher run

In Presto CLI: ./presto --server localhost:8080 --catalog hive

select * from default.presto_test;
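For scripted checks, the Presto CLI can also run a single statement via its --execute flag (a standard presto-cli option); the sketch below assembles such a command using the same server and catalog as above:

```shell
# Assemble a non-interactive Presto query command (sketch; --execute is a
# standard presto-cli option, server/catalog values match the setup above).
PRESTO="./presto --server localhost:8080 --catalog hive --schema default"
QUERY="select count(*) from presto_test"
echo "$PRESTO --execute '$QUERY'"
```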


Troubleshooting:

If you get connection errors or "ssh: Could not resolve hostname c02w214khv2j: nodename nor servname provided, or not known", reset HDFS:

stop-dfs.sh
hdfs namenode -format
start-dfs.sh

Occasionally, the datanode daemon does not start. When this happens, either:

  1. Run hdfs namenode -format, OR
  2. Delete the hadoop subdirectories under /tmp (hadoop.tmp.dir defaults to /tmp/hadoop-<username>)

Tips:

  • Use jps to make sure datanode and namenode daemons are up
  • Check the log files under hadoop/logs
  • For more query detail, run the Presto CLI with --debug: ./presto --server localhost:8080 --catalog hive --debug

Misc handy commands to remember:

  • sudo killall -HUP mDNSResponder