@vinodnerella
Created October 30, 2017 15:36
To break an import into multiple independent map tasks, Sqoop first retrieves the minimum and maximum values of the column given in the --split-by parameter, which can itself take a long time on large tables. The walkthrough below shows a subtler consequence of this mechanism: rows whose split-by column is NULL fall outside every split range and are silently dropped from the import.
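The bounding query Sqoop issues has the shape below (the concrete query for this run appears in the job log further down). SQL's MIN() and MAX() ignore NULLs, so NULL values in col1 play no part in the split boundaries:

SELECT MIN(col1), MAX(col1)
FROM (select * from int_splitby WHERE (1 = 1)) AS t1;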
mysql -u retail_dba -h nn01.itversity.com -pitversity
use retail_export;
mysql> create table int_splitby (col1 integer, col2 char(10));
Query OK, 0 rows affected (0.34 sec)
mysql> select * from int_splitby;
Empty set (0.00 sec)
mysql> INSERT INTO int_splitby values(1,"One");
Query OK, 1 row affected (0.05 sec)
mysql> select * from int_splitby;
+------+------+
| col1 | col2 |
+------+------+
|    1 | One  |
+------+------+
1 row in set (0.01 sec)
mysql> INSERT INTO int_splitby values(1,"One");
Query OK, 1 row affected (0.06 sec)
mysql> INSERT INTO int_splitby values(NULL,"One");
Query OK, 1 row affected (0.04 sec)
mysql> INSERT INTO int_splitby values(NULL,NULL);
Query OK, 1 row affected (0.10 sec)
mysql> INSERT INTO int_splitby values(3,"Three");
Query OK, 1 row affected (0.17 sec)
mysql> INSERT INTO int_splitby values(4,"Four");
Query OK, 1 row affected (0.05 sec)
mysql> select * from int_splitby;
+------+-------+
| col1 | col2  |
+------+-------+
|    1 | One   |
|    1 | One   |
| NULL | One   |
| NULL | NULL  |
|    3 | Three |
|    4 | Four  |
+------+-------+
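Two of the six rows now have NULL in col1, the column the import below splits on. As a quick check (not part of the original session), the at-risk rows can be counted directly:

mysql> SELECT COUNT(*) FROM int_splitby WHERE col1 IS NULL;
-- expected: 2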
[nerellavinod@gw01 ~]$ sqoop import \
  --connect "jdbc:mysql://nn01.itversity.com:3306/retail_export" \
  --username retail_dba -password itversity \
  --query 'select * from int_splitby WHERE $CONDITIONS' \
  --hive-import --hive-table INT_SPLIT_1 \
  --split-by col1 \
  --target-dir /user/nerellavinod/sqoopimport/
Warning: /usr/hdp/2.5.0.0-1245/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
17/10/30 11:30:16 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.5.0.0-1245
17/10/30 11:30:16 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
17/10/30 11:30:16 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
17/10/30 11:30:16 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
17/10/30 11:30:17 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
17/10/30 11:30:17 INFO tool.CodeGenTool: Beginning code generation
17/10/30 11:30:18 INFO manager.SqlManager: Executing SQL statement: select * from int_splitby WHERE (1 = 0)
17/10/30 11:30:18 INFO manager.SqlManager: Executing SQL statement: select * from int_splitby WHERE (1 = 0)
17/10/30 11:30:18 INFO manager.SqlManager: Executing SQL statement: select * from int_splitby WHERE (1 = 0)
17/10/30 11:30:18 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.5.0.0-1245/hadoop-mapreduce
Note: /tmp/sqoop-nerellavinod/compile/875e42d1f9de6ced335702fe025a4b18/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/10/30 11:30:20 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-nerellavinod/compile/875e42d1f9de6ced335702fe025a4b18/QueryResult.jar
17/10/30 11:30:20 INFO mapreduce.ImportJobBase: Beginning query import.
17/10/30 11:30:23 INFO impl.TimelineClientImpl: Timeline service address: http://rm01.itversity.com:8188/ws/v1/timeline/
17/10/30 11:30:23 INFO client.RMProxy: Connecting to ResourceManager at rm01.itversity.com/172.16.1.106:8050
17/10/30 11:30:23 INFO client.AHSProxy: Connecting to Application History server at rm01.itversity.com/172.16.1.106:10200
17/10/30 11:30:32 INFO db.DBInputFormat: Using read commited transaction isolation
17/10/30 11:30:32 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(col1), MAX(col1) FROM (select * from int_splitby WHERE (1 = 1) ) AS t1
17/10/30 11:30:32 INFO db.IntegerSplitter: Split size: 0; Num splits: 4 from: 1 to: 4
17/10/30 11:30:32 INFO mapreduce.JobSubmitter: number of splits:4
17/10/30 11:30:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1507687444776_5127
17/10/30 11:30:33 INFO impl.YarnClientImpl: Submitted application application_1507687444776_5127
17/10/30 11:30:33 INFO mapreduce.Job: The url to track the job: http://rm01.itversity.com:8088/proxy/application_1507687444776_5127/
17/10/30 11:30:33 INFO mapreduce.Job: Running job: job_1507687444776_5127
17/10/30 11:30:39 INFO mapreduce.Job: Job job_1507687444776_5127 running in uber mode : false
17/10/30 11:30:39 INFO mapreduce.Job: map 0% reduce 0%
17/10/30 11:30:44 INFO mapreduce.Job: map 100% reduce 0%
17/10/30 11:30:45 INFO mapreduce.Job: Job job_1507687444776_5127 completed successfully
17/10/30 11:30:45 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=640796
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=393
		HDFS: Number of bytes written=27
		HDFS: Number of read operations=16
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=8
	Job Counters
		Launched map tasks=4
		Other local map tasks=4
		Total time spent by all maps in occupied slots (ms)=24298
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=12149
		Total vcore-milliseconds taken by all map tasks=12149
		Total megabyte-milliseconds taken by all map tasks=24881152
	Map-Reduce Framework
		Map input records=4
		Map output records=4
		Input split bytes=393
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=178
		CPU time spent (ms)=3700
		Physical memory (bytes) snapshot=994680832
		Virtual memory (bytes) snapshot=14829936640
		Total committed heap usage (bytes)=915931136
	File Input Format Counters
		Bytes Read=0
	File Output Format Counters
		Bytes Written=27
17/10/30 11:30:45 INFO mapreduce.ImportJobBase: Transferred 27 bytes in 22.8895 seconds (1.1796 bytes/sec)
17/10/30 11:30:45 INFO mapreduce.ImportJobBase: Retrieved 4 records.
17/10/30 11:30:45 INFO mapreduce.ImportJobBase: Publishing Hive/Hcat import job data to Listeners
17/10/30 11:30:45 INFO manager.SqlManager: Executing SQL statement: select * from int_splitby WHERE (1 = 0)
17/10/30 11:30:45 INFO manager.SqlManager: Executing SQL statement: select * from int_splitby WHERE (1 = 0)
17/10/30 11:30:45 INFO hive.HiveImport: Loading uploaded data into Hive
Logging initialized using configuration in jar:file:/usr/hdp/2.5.0.0-1245/hive/lib/hive-common-1.2.1000.2.5.0.0-1245.jar!/hive-log4j.properties
OK
Time taken: 1.84 seconds
Loading data to table default.int_split_1
Table default.int_split_1 stats: [numFiles=4, numRows=0, totalSize=27, rawDataSize=0]
OK
Time taken: 0.403 seconds
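Only 4 of the 6 rows were retrieved. With min = 1, max = 4 and 4 splits, each mapper substitutes a range predicate on col1 for $CONDITIONS, roughly of the form below (an illustration of how Sqoop's data-driven splits behave, not copied from the log). NULL never satisfies a >= or < comparison, so the two NULL rows match no split and are dropped without any warning:

select * from int_splitby WHERE ( col1 >= 1 ) AND ( col1 < 2 )
select * from int_splitby WHERE ( col1 >= 2 ) AND ( col1 < 3 )
select * from int_splitby WHERE ( col1 >= 3 ) AND ( col1 < 4 )
select * from int_splitby WHERE ( col1 >= 4 ) AND ( col1 <= 4 )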
[nerellavinod@gw01 ~]$ hive
Logging initialized using configuration in file:/etc/hive/2.5.0.0-1245/0/hive-log4j.properties
hive (default)> select * from int_split_1;
OK
1 One
1 One
3 Three
4 Four
Time taken: 2.232 seconds, Fetched: 4 row(s)
hive (default)>
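The two NULL rows never made it into Hive. One common way to handle this (a sketch, not part of the original run; the target directory name here is made up) is to split on a NOT NULL column instead, or to pick up the NULL rows in a second import with a single mapper, since -m 1 needs no --split-by at all:

[nerellavinod@gw01 ~]$ sqoop import \
  --connect "jdbc:mysql://nn01.itversity.com:3306/retail_export" \
  --username retail_dba -P \
  --query 'select * from int_splitby WHERE col1 IS NULL AND $CONDITIONS' \
  --hive-import --hive-table INT_SPLIT_1 \
  -m 1 \
  --target-dir /user/nerellavinod/sqoopimport_nulls/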