HDFS test
Make sure the Google Cloud Storage (GCS) connector is on the Hadoop CLASSPATH, as described in the blog.
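For example (a sketch, assuming the connector jar sits at /tmp/gcs-connector-latest-hadoop2.jar as it does later in this gist, and <your-project-id> is a placeholder):

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/tmp/gcs-connector-latest-hadoop2.jar

# core-site.xml must also register the gs:// filesystem:
#   fs.gs.impl = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
#   fs.gs.project.id = <your-project-id>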
[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -ls gs://hivetest/
15/06/28 21:15:32 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.0-hadoop2
15/06/28 21:15:33 WARN gcs.GoogleHadoopFileSystemBase: No working directory configured, using default: 'gs://hivetest/'
Found 3 items
drwx------ - hdfs hdfs 0 2015-06-28 15:29 gs://hivetest/ns
drwx------ - hdfs hdfs 0 2015-06-28 12:44 gs://hivetest/test
drwx------ - hdfs hdfs 0 2015-06-28 15:30 gs://hivetest/tmp
[hdfs@hdpgcp-1-1435537523061 ~]$
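As a quick sanity check, you can also round-trip a small file through the bucket (a sketch; the target path is arbitrary):

[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -put /etc/hosts gs://hivetest/tmp/hosts
[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -cat gs://hivetest/tmp/hosts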
Hive test
bash-4.1# su - hive
[hive@hdpgcptest-1-1435590069329 ~]$ hive
hive> create table testns ( info string) location 'gs://hivetest/testns';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found)
hive>
To avoid the above error, copy the GCS connector into the hive-client lib directory on all nodes:
cp /tmp/gcs-connector-latest-hadoop2.jar /usr/hdp/current/hive-client/lib
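If the nodes are reachable over SSH, one way to push the jar everywhere is a simple loop (a sketch; hosts.txt is a hypothetical file listing your cluster nodes):

for host in $(cat hosts.txt); do
  scp /tmp/gcs-connector-latest-hadoop2.jar $host:/usr/hdp/current/hive-client/lib/
done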
Let’s run the following Apache Hive test.
Data set: http://seanlahman.com/files/database/lahman591-csv.zip
We are writing to gs://hivetest.
hive> create table batting (col_value STRING) location 'gs://hivetest/batting';
OK
Time taken: 1.518 seconds
Run the following command to verify the table location, 'gs://hivetest/batting':
hive> show create table batting;
OK
CREATE TABLE `batting`(
  `col_value` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'gs://hivetest/batting'
TBLPROPERTIES (
  'transient_lastDdlTime'='1435766262')
Time taken: 0.981 seconds, Fetched: 12 row(s)
hive> select count(1) from batting;
Upload Batting.csv to the table location, gs://hivetest/batting.
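One way to do the upload (a sketch, assuming the extracted Batting.csv is in the current directory; since gs:// is a registered filesystem, plain hdfs dfs commands work against the bucket):

[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -put Batting.csv gs://hivetest/batting/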
hive> drop table batting;
You will notice that Batting.csv is deleted from the bucket, because batting was a managed (internal) table.
With an external table, Batting.csv would not be removed from the storage bucket.
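For reference, the external variant would look like this (a sketch; batting_ext is a hypothetical name, same schema and location):

hive> create external table batting_ext (col_value STRING) location 'gs://hivetest/batting';

Dropping batting_ext removes only the metastore entry; the files under gs://hivetest/batting stay in the bucket.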
If you want to test MapReduce through Hive, add the connector jar to the session first:
hive> add jar /usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar;
Added [/usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar] to class path
Added resources: [/usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar]
hive> select count(1) from batting;
Query ID = hive_20150702095454_c17ae70f-b77e-4599-87e6-022d9bb9a00d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1435841827745_0003, Tracking URL = http://hdpgcptest-1-1435590069329.node.dc1.consul:8088/proxy/application_1435841827745_0003/
Kill Command = /usr/hdp/2.2.6.0-2800/hadoop/bin/hadoop job -kill job_1435841827745_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-07-02 09:54:33,468 Stage-1 map = 0%, reduce = 0%
2015-07-02 09:54:42,947 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.2 sec
2015-07-02 09:54:51,719 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.6 sec
MapReduce Total cumulative CPU time: 4 seconds 600 msec
Ended Job = job_1435841827745_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.6 sec HDFS Read: 187 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 600 msec
OK
95196
Time taken: 29.855 seconds, Fetched: 1 row(s)
hive>
Spark SQL
First, copy the GCS connector into the Spark classpath (here, under spark-historyserver) to avoid “Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found”:
export SPARK_CLASSPATH=/usr/hdp/current/spark-historyserver/lib/gcs-connector-latest-hadoop2.jar
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@140dcdc5
scala>
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS batting ( col_value STRING) location 'gs://hivetest/batting' ")
scala> sqlContext.sql("select count(*) from batting").collect().foreach(println)
15/07/01 15:38:42 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 187 bytes
15/07/01 15:38:42 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 286 ms on hdpgcptest-2-1435590069361.node.dc1.consul (1/1)
15/07/01 15:38:42 INFO YarnClientClusterScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/07/01 15:38:42 INFO DAGScheduler: Stage 1 (collect at SparkPlan.scala:84) finished in 0.295 s
[95196]
15/07/01 15:38:42 INFO DAGScheduler: Job 0 finished: collect at SparkPlan.scala:84, took 8.872396 s
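With the connector on the classpath, the same data can also be read from the shell without going through Hive (a sketch using the table's path; the line count should match the Hive result above):

scala> sc.textFile("gs://hivetest/batting").count()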