@vinodnerella
Created October 30, 2017 15:36
To break an import into multiple independent map tasks, Sqoop first retrieves the minimum and maximum values of the column given in the --split-by parameter, which can itself take a long time on large tables. The walkthrough below shows a subtler consequence of this mechanism: rows whose split-by column is NULL fall outside every split range and are silently dropped from the import.
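The bounding query Sqoop issues has the shape below (the concrete query for this run appears in the job log further down). SQL's MIN() and MAX() ignore NULLs, so NULL values in col1 play no part in the split boundaries:

SELECT MIN(col1), MAX(col1)
FROM (select * from int_splitby WHERE (1 = 1)) AS t1;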
mysql -u retail_dba -h nn01.itversity.com -pitversity
use retail_export;
mysql> create table int_splitby (col1 integer, col2 char(10));
Query OK, 0 rows affected (0.34 sec)
mysql> select * from int_splitby;
Empty set (0.00 sec)
mysql> INSERT INTO int_splitby values(1,"One");
Query OK, 1 row affected (0.05 sec)
mysql> select * from int_splitby;
+------+------+
| col1 | col2 |
+------+------+
|    1 | One  |
+------+------+
1 row in set (0.01 sec)
mysql> INSERT INTO int_splitby values(1,"One");
Query OK, 1 row affected (0.06 sec)
mysql> INSERT INTO int_splitby values(NULL,"One");
Query OK, 1 row affected (0.04 sec)
mysql> INSERT INTO int_splitby values(NULL,NULL);
Query OK, 1 row affected (0.10 sec)
mysql> INSERT INTO int_splitby values(3,"Three");
Query OK, 1 row affected (0.17 sec)
mysql> INSERT INTO int_splitby values(4,"Four");
Query OK, 1 row affected (0.05 sec)
mysql> select * from int_splitby;
+------+-------+
| col1 | col2  |
+------+-------+
|    1 | One   |
|    1 | One   |
| NULL | One   |
| NULL | NULL  |
|    3 | Three |
|    4 | Four  |
+------+-------+
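Two of the six rows now have NULL in col1, the column the import below splits on. As a quick check (not part of the original session), the at-risk rows can be counted directly:

mysql> SELECT COUNT(*) FROM int_splitby WHERE col1 IS NULL;
-- expected: 2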
[nerellavinod@gw01 ~]$ sqoop import \
  --connect "jdbc:mysql://nn01.itversity.com:3306/retail_export" \
  --username retail_dba -password itversity \
  --query 'select * from int_splitby WHERE $CONDITIONS' \
  --hive-import --hive-table INT_SPLIT_1 \
  --split-by col1 \
  --target-dir /user/nerellavinod/sqoopimport/
Warning: /usr/hdp/2.5.0.0-1245/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
17/10/30 11:30:16 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.5.0.0-1245
17/10/30 11:30:16 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
17/10/30 11:30:16 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
17/10/30 11:30:16 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
17/10/30 11:30:17 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
17/10/30 11:30:17 INFO tool.CodeGenTool: Beginning code generation
17/10/30 11:30:18 INFO manager.SqlManager: Executing SQL statement: select * from int_splitby WHERE (1 = 0)
17/10/30 11:30:18 INFO manager.SqlManager: Executing SQL statement: select * from int_splitby WHERE (1 = 0)
17/10/30 11:30:18 INFO manager.SqlManager: Executing SQL statement: select * from int_splitby WHERE (1 = 0)
17/10/30 11:30:18 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.5.0.0-1245/hadoop-mapreduce
Note: /tmp/sqoop-nerellavinod/compile/875e42d1f9de6ced335702fe025a4b18/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/10/30 11:30:20 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-nerellavinod/compile/875e42d1f9de6ced335702fe025a4b18/QueryResult.jar
17/10/30 11:30:20 INFO mapreduce.ImportJobBase: Beginning query import.
17/10/30 11:30:23 INFO impl.TimelineClientImpl: Timeline service address: http://rm01.itversity.com:8188/ws/v1/timeline/
17/10/30 11:30:23 INFO client.RMProxy: Connecting to ResourceManager at rm01.itversity.com/172.16.1.106:8050
17/10/30 11:30:23 INFO client.AHSProxy: Connecting to Application History server at rm01.itversity.com/172.16.1.106:10200
17/10/30 11:30:32 INFO db.DBInputFormat: Using read commited transaction isolation
17/10/30 11:30:32 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(col1), MAX(col1) FROM (select * from int_splitby WHERE (1 = 1) ) AS t1
17/10/30 11:30:32 INFO db.IntegerSplitter: Split size: 0; Num splits: 4 from: 1 to: 4
17/10/30 11:30:32 INFO mapreduce.JobSubmitter: number of splits:4
17/10/30 11:30:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1507687444776_5127
17/10/30 11:30:33 INFO impl.YarnClientImpl: Submitted application application_1507687444776_5127
17/10/30 11:30:33 INFO mapreduce.Job: The url to track the job: http://rm01.itversity.com:8088/proxy/application_1507687444776_5127/
17/10/30 11:30:33 INFO mapreduce.Job: Running job: job_1507687444776_5127
17/10/30 11:30:39 INFO mapreduce.Job: Job job_1507687444776_5127 running in uber mode : false
17/10/30 11:30:39 INFO mapreduce.Job: map 0% reduce 0%
17/10/30 11:30:44 INFO mapreduce.Job: map 100% reduce 0%
17/10/30 11:30:45 INFO mapreduce.Job: Job job_1507687444776_5127 completed successfully
17/10/30 11:30:45 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=640796
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=393
		HDFS: Number of bytes written=27
		HDFS: Number of read operations=16
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=8
	Job Counters
		Launched map tasks=4
		Other local map tasks=4
		Total time spent by all maps in occupied slots (ms)=24298
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=12149
		Total vcore-milliseconds taken by all map tasks=12149
		Total megabyte-milliseconds taken by all map tasks=24881152
	Map-Reduce Framework
		Map input records=4
		Map output records=4
		Input split bytes=393
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=178
		CPU time spent (ms)=3700
		Physical memory (bytes) snapshot=994680832
		Virtual memory (bytes) snapshot=14829936640
		Total committed heap usage (bytes)=915931136
	File Input Format Counters
		Bytes Read=0
	File Output Format Counters
		Bytes Written=27
17/10/30 11:30:45 INFO mapreduce.ImportJobBase: Transferred 27 bytes in 22.8895 seconds (1.1796 bytes/sec)
17/10/30 11:30:45 INFO mapreduce.ImportJobBase: Retrieved 4 records.
17/10/30 11:30:45 INFO mapreduce.ImportJobBase: Publishing Hive/Hcat import job data to Listeners
17/10/30 11:30:45 INFO manager.SqlManager: Executing SQL statement: select * from int_splitby WHERE (1 = 0)
17/10/30 11:30:45 INFO manager.SqlManager: Executing SQL statement: select * from int_splitby WHERE (1 = 0)
17/10/30 11:30:45 INFO hive.HiveImport: Loading uploaded data into Hive
Logging initialized using configuration in jar:file:/usr/hdp/2.5.0.0-1245/hive/lib/hive-common-1.2.1000.2.5.0.0-1245.jar!/hive-log4j.properties
OK
Time taken: 1.84 seconds
Loading data to table default.int_split_1
Table default.int_split_1 stats: [numFiles=4, numRows=0, totalSize=27, rawDataSize=0]
OK
Time taken: 0.403 seconds
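Only 4 of the 6 rows were retrieved. With min = 1, max = 4 and 4 splits, each mapper substitutes a range predicate on col1 for $CONDITIONS, roughly of the form below (an illustration of how Sqoop's data-driven splits behave, not copied from the log). NULL never satisfies a >= or < comparison, so the two NULL rows match no split and are dropped without any warning:

select * from int_splitby WHERE ( col1 >= 1 ) AND ( col1 < 2 )
select * from int_splitby WHERE ( col1 >= 2 ) AND ( col1 < 3 )
select * from int_splitby WHERE ( col1 >= 3 ) AND ( col1 < 4 )
select * from int_splitby WHERE ( col1 >= 4 ) AND ( col1 <= 4 )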
[nerellavinod@gw01 ~]$ hive
Logging initialized using configuration in file:/etc/hive/2.5.0.0-1245/0/hive-log4j.properties
hive (default)> select * from int_split_1;
OK
1 One
1 One
3 Three
4 Four
Time taken: 2.232 seconds, Fetched: 4 row(s)
hive (default)>
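The two NULL rows never made it into Hive. One common way to handle this (a sketch, not part of the original run; the target directory name here is made up) is to split on a NOT NULL column instead, or to pick up the NULL rows in a second import with a single mapper, since -m 1 needs no --split-by at all:

[nerellavinod@gw01 ~]$ sqoop import \
  --connect "jdbc:mysql://nn01.itversity.com:3306/retail_export" \
  --username retail_dba -P \
  --query 'select * from int_splitby WHERE col1 IS NULL AND $CONDITIONS' \
  --hive-import --hive-table INT_SPLIT_1 \
  -m 1 \
  --target-dir /user/nerellavinod/sqoopimport_nulls/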