HDFS test
Make sure that the Google Cloud Storage (GCS) connector is defined in the Hadoop CLASSPATH, as described in the blog.
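If the connector is not yet wired up, a minimal sketch looks like this; the jar path is the one used later in this walkthrough, and the core-site.xml keys are the standard GCS connector properties:

# put the connector jar on the Hadoop classpath (jar path assumed from this walkthrough)
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/tmp/gcs-connector-latest-hadoop2.jar
# core-site.xml must also register the gs:// filesystem, e.g.:
#   fs.gs.impl = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
#   fs.gs.project.id = <your GCP project id>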
[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -ls gs://hivetest/
15/06/28 21:15:32 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.0-hadoop2
15/06/28 21:15:33 WARN gcs.GoogleHadoopFileSystemBase: No working directory configured, using default: 'gs://hivetest/'
Found 3 items
drwx------ - hdfs hdfs 0 2015-06-28 15:29 gs://hivetest/ns
drwx------ - hdfs hdfs 0 2015-06-28 12:44 gs://hivetest/test
drwx------ - hdfs hdfs 0 2015-06-28 15:30 gs://hivetest/tmp
[hdfs@hdpgcp-1-1435537523061 ~]$
Hive test | |
bash-4.1# su - hive
[hive@hdpgcptest-1-1435590069329 ~]$ hive
hive> create table testns ( info string) location 'gs://hivetest/testns';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found)
hive>
To avoid the above error, copy the GCS connector into the hive-client lib directory on all nodes:
cp /tmp/gcs-connector-latest-hadoop2.jar /usr/hdp/current/hive-client/lib | |
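To repeat that copy on every node, a simple loop works; node1 through node3 are placeholder hostnames for your cluster:

for host in node1 node2 node3; do
  scp /tmp/gcs-connector-latest-hadoop2.jar $host:/usr/hdp/current/hive-client/lib/
done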
Let’s run the following Apache Hive test.
Data Set: http://seanlahman.com/files/database/lahman591-csv.zip | |
We are writing to gs://hivetest | |
hive> create table batting (col_value STRING) location 'gs://hivetest/batting'; | |
OK | |
Time taken: 1.518 seconds | |
Run the following command to verify the location 'gs://hivetest/batting':
hive> show create table batting; | |
OK | |
CREATE TABLE `batting`( | |
`col_value` string) | |
ROW FORMAT SERDE | |
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' | |
STORED AS INPUTFORMAT | |
'org.apache.hadoop.mapred.TextInputFormat' | |
OUTPUTFORMAT | |
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' | |
LOCATION | |
'gs://hivetest/batting' | |
TBLPROPERTIES ( | |
'transient_lastDdlTime'='1435766262') | |
Time taken: 0.981 seconds, Fetched: 12 row(s) | |
hive> select count(1) from batting; | |
Upload Batting.csv to the table location, 'gs://hivetest/batting'.
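A sketch of that upload, assuming lahman591-csv.zip was downloaded to the current directory:

# extract just Batting.csv from the archive
unzip lahman591-csv.zip Batting.csv
# copy it into the table's bucket location
hdfs dfs -put Batting.csv gs://hivetest/batting/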
hive> drop table batting; | |
You will notice that Batting.csv is deleted from storage, since batting was a managed (internal) table.
With an external table, Batting.csv would not be removed from the storage bucket, as the sketch below illustrates.
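A minimal external-table variant, assuming the same bucket path (the table name batting_ext is illustrative):

hive -e "create external table batting_ext (col_value string) location 'gs://hivetest/batting'"
# dropping batting_ext removes only the metastore entry; the files under gs://hivetest/batting stay in place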
If you want to test MapReduce through Hive:
hive> add jar /usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar; | |
Added [/usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar] to class path | |
Added resources: [/usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar] | |
hive> select count(1) from batting; | |
Query ID = hive_20150702095454_c17ae70f-b77e-4599-87e6-022d9bb9a00d | |
Total jobs = 1 | |
Launching Job 1 out of 1 | |
Number of reduce tasks determined at compile time: 1 | |
In order to change the average load for a reducer (in bytes): | |
set hive.exec.reducers.bytes.per.reducer=<number> | |
In order to limit the maximum number of reducers: | |
set hive.exec.reducers.max=<number> | |
In order to set a constant number of reducers: | |
set mapreduce.job.reduces=<number> | |
Starting Job = job_1435841827745_0003, Tracking URL = http://hdpgcptest-1-1435590069329.node.dc1.consul:8088/proxy/application_1435841827745_0003/ | |
Kill Command = /usr/hdp/2.2.6.0-2800/hadoop/bin/hadoop job -kill job_1435841827745_0003 | |
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1 | |
2015-07-02 09:54:33,468 Stage-1 map = 0%, reduce = 0% | |
2015-07-02 09:54:42,947 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.2 sec | |
2015-07-02 09:54:51,719 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.6 sec | |
MapReduce Total cumulative CPU time: 4 seconds 600 msec | |
Ended Job = job_1435841827745_0003 | |
MapReduce Jobs Launched: | |
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.6 sec HDFS Read: 187 HDFS Write: 6 SUCCESS | |
Total MapReduce CPU Time Spent: 4 seconds 600 msec | |
OK | |
95196 | |
Time taken: 29.855 seconds, Fetched: 1 row(s) | |
hive> | |
Spark SQL test
First, copy the GCS connector jar into the spark-historyserver lib directory and add it to the Spark classpath, to avoid “Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found”:
export SPARK_CLASSPATH=/usr/hdp/current/spark-historyserver/lib/gcs-connector-latest-hadoop2.jar | |
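Alternatively, the connector can be attached per session with spark-shell's --jars flag (same jar path as above):

spark-shell --jars /usr/hdp/current/spark-historyserver/lib/gcs-connector-latest-hadoop2.jar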
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) | |
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@140dcdc5 | |
scala> | |
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS batting ( col_value STRING) location 'gs://hivetest/batting' ") | |
scala> sqlContext.sql("select count(*) from batting").collect().foreach(println) | |
15/07/01 15:38:42 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 187 bytes | |
15/07/01 15:38:42 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 286 ms on hdpgcptest-2-1435590069361.node.dc1.consul (1/1) | |
15/07/01 15:38:42 INFO YarnClientClusterScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool | |
15/07/01 15:38:42 INFO DAGScheduler: Stage 1 (collect at SparkPlan.scala:84) finished in 0.295 s | |
[95196]
15/07/01 15:38:42 INFO DAGScheduler: Job 0 finished: collect at SparkPlan.scala:84, took 8.872396 s