HDFS test
Make sure the Google Cloud Storage (GCS) connector is on the Hadoop CLASSPATH, as described in the blog.
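For example (a sketch, assuming the connector jar sits at /tmp/gcs-connector-latest-hadoop2.jar as it does later in this gist, and <your-project-id> is a placeholder):

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/tmp/gcs-connector-latest-hadoop2.jar

# core-site.xml must also register the gs:// filesystem:
#   fs.gs.impl = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
#   fs.gs.project.id = <your-project-id>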
[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -ls gs://hivetest/
15/06/28 21:15:32 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.0-hadoop2
15/06/28 21:15:33 WARN gcs.GoogleHadoopFileSystemBase: No working directory configured, using default: 'gs://hivetest/'
Found 3 items
drwx------ - hdfs hdfs 0 2015-06-28 15:29 gs://hivetest/ns
drwx------ - hdfs hdfs 0 2015-06-28 12:44 gs://hivetest/test
drwx------ - hdfs hdfs 0 2015-06-28 15:30 gs://hivetest/tmp
[hdfs@hdpgcp-1-1435537523061 ~]$
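As a quick sanity check, you can also round-trip a small file through the bucket (a sketch; the target path is arbitrary):

[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -put /etc/hosts gs://hivetest/tmp/hosts
[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -cat gs://hivetest/tmp/hosts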
Hive test
bash-4.1# su - hive
[hive@hdpgcptest-1-1435590069329 ~]$ hive
hive> create table testns ( info string) location 'gs://hivetest/testns';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found)
hive>
To avoid the above error, copy the GCS connector into the hive-client lib directory on all nodes:
cp /tmp/gcs-connector-latest-hadoop2.jar /usr/hdp/current/hive-client/lib
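If the nodes are reachable over SSH, one way to push the jar everywhere is a simple loop (a sketch; hosts.txt is a hypothetical file listing your cluster nodes):

for host in $(cat hosts.txt); do
  scp /tmp/gcs-connector-latest-hadoop2.jar $host:/usr/hdp/current/hive-client/lib/
done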
Let’s run the following Apache Hive test.
Data set: http://seanlahman.com/files/database/lahman591-csv.zip
We are writing to gs://hivetest.
hive> create table batting (col_value STRING) location 'gs://hivetest/batting';
OK
Time taken: 1.518 seconds
Run the following command to verify the table location, 'gs://hivetest/batting':
hive> show create table batting;
OK
CREATE TABLE `batting`(
  `col_value` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'gs://hivetest/batting'
TBLPROPERTIES (
  'transient_lastDdlTime'='1435766262')
Time taken: 0.981 seconds, Fetched: 12 row(s)
hive> select count(1) from batting;
Upload Batting.csv to the table location, gs://hivetest/batting.
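One way to do the upload (a sketch, assuming the extracted Batting.csv is in the current directory; since gs:// is a registered filesystem, plain hdfs dfs commands work against the bucket):

[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -put Batting.csv gs://hivetest/batting/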
hive> drop table batting;
You will notice that Batting.csv is deleted from the bucket, because batting was a managed (internal) table.
With an external table, Batting.csv would not be removed from the storage bucket.
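For reference, the external variant would look like this (a sketch; batting_ext is a hypothetical name, same schema and location):

hive> create external table batting_ext (col_value STRING) location 'gs://hivetest/batting';

Dropping batting_ext removes only the metastore entry; the files under gs://hivetest/batting stay in the bucket.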
If you want to test MapReduce through Hive, add the connector jar to the session first:
hive> add jar /usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar;
Added [/usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar] to class path
Added resources: [/usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar]
hive> select count(1) from batting;
Query ID = hive_20150702095454_c17ae70f-b77e-4599-87e6-022d9bb9a00d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1435841827745_0003, Tracking URL = http://hdpgcptest-1-1435590069329.node.dc1.consul:8088/proxy/application_1435841827745_0003/
Kill Command = /usr/hdp/2.2.6.0-2800/hadoop/bin/hadoop job -kill job_1435841827745_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-07-02 09:54:33,468 Stage-1 map = 0%, reduce = 0%
2015-07-02 09:54:42,947 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.2 sec
2015-07-02 09:54:51,719 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.6 sec
MapReduce Total cumulative CPU time: 4 seconds 600 msec
Ended Job = job_1435841827745_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.6 sec HDFS Read: 187 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 600 msec
OK
95196
Time taken: 29.855 seconds, Fetched: 1 row(s)
hive>
Spark SQL
First, copy the GCS connector into the Spark classpath (here, under spark-historyserver) to avoid “Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found”:
export SPARK_CLASSPATH=/usr/hdp/current/spark-historyserver/lib/gcs-connector-latest-hadoop2.jar
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@140dcdc5
scala>
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS batting ( col_value STRING) location 'gs://hivetest/batting' ")
scala> sqlContext.sql("select count(*) from batting").collect().foreach(println)
15/07/01 15:38:42 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 187 bytes
15/07/01 15:38:42 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 286 ms on hdpgcptest-2-1435590069361.node.dc1.consul (1/1)
15/07/01 15:38:42 INFO YarnClientClusterScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/07/01 15:38:42 INFO DAGScheduler: Stage 1 (collect at SparkPlan.scala:84) finished in 0.295 s
[95196]
15/07/01 15:38:42 INFO DAGScheduler: Job 0 finished: collect at SparkPlan.scala:84, took 8.872396 s
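With the connector on the classpath, the same data can also be read from the shell without going through Hive (a sketch using the table's path; the line count should match the Hive result above):

scala> sc.textFile("gs://hivetest/batting").count()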