split generation in tez
2017-02-16 15:56:48,725 [INFO] [InputInitializer {Map 1} #0] |dag.RootInputInitializerManager|: Starting InputInitializer for Input: sample_07 on vertex vertex_1486830296338_0025_1_00 [Map 1]
invoke org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator#initialize
2017-02-16 15:56:48,729 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: InputInitializer {Map 1} #0 | initialize realInputFormatName : org.apache.hadoop.hive.ql.io.HiveInputFormat
2017-02-16 15:56:48,738 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: InputInitializer {Map 1} #0 | initialize inputFormat org.apache.hadoop.hive.ql.io.HiveInputFormat@293c29b7
2017-02-16 15:56:48,738 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: InputInitializer {Map 1} #0 | initialize totalResource : 3072 taskResource : 1024 availableSlots : 3
2017-02-16 15:56:48,738 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: InputInitializer {Map 1} #0 | initialize conf.getLong(MIN_SPLIT_SIZE, 1) 1
2017-02-16 15:56:48,738 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: InputInitializer {Map 1} #0 | initialize blockSize 134217728
2017-02-16 15:56:48,738 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: InputInitializer {Map 1} #0 | initialize minGrouping 16777216
2017-02-16 15:56:48,738 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: The preferred split size is 16777216
2017-02-16 15:56:48,739 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: InputInitializer {Map 1} #0 | initialize waves : 1.7 availableSlots : 3
InputSplit[] splits = inputFormat.getSplits(jobConf, (int) (availableSlots * waves));((int ) 3*1.7)
// HiveInputFormat
2017-02-16 15:56:48,739 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: InputInitializer {Map 1} #0 | getSplits | start with numSplits 5
2017-02-16 15:56:48,740 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: InputInitializer {Map 1} #0 | getSplits dirs.size 1
// Class<? extends InputFormat> inputFormatClass = part.getInputFileFormatClass();
2017-02-16 15:56:48,741 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: InputInitializer {Map 1} #0 | getSplits inputFormatClass class org.apache.hadoop.mapred.TextInputFormat
if (dirs.length != 0) {
LOG.info("Generating splits");
LOG.info(Thread.currentThread().getName()+" | "+Thread.currentThread().getStackTrace()[1].getMethodName()+" numSplits "+numSplits);
LOG.info(Thread.currentThread().getName()+" | "+Thread.currentThread().getStackTrace()[1].getMethodName()+" currentDirs.size() "+currentDirs.size());
LOG.info(Thread.currentThread().getName()+" | "+Thread.currentThread().getStackTrace()[1].getMethodName()+" dirs.length "+dirs.length);
LOG.info(Thread.currentThread().getName()+" | "+Thread.currentThread().getStackTrace()[1].getMethodName()+" currentInputFormatClass "+currentInputFormatClass);
addSplitsForGroup(currentDirs, currentTableScan, newjob,
getInputFormatFromCache(currentInputFormatClass, job),
currentInputFormatClass, currentDirs.size()*(numSplits / dirs.length),
currentTable, result);
}
2017-02-16 15:56:48,774 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: InputInitializer {Map 1} #0 | addSplitsForGroup conf.setInputFormat class org.apache.hadoop.mapred.TextInputFormat
2017-02-16 15:56:48,774 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: InputInitializer {Map 1} #0 | addSplitsForGroup splits 5
// invoke FileInputFormat
InputSplit[] iss = inputFormat.getSplits(conf, splits);
for (InputSplit is : iss) {
result.add(new HiveInputSplit(is, inputFormatClass.getName()));
}
2017-02-16 15:56:48,861 [DEBUG] [InputInitializer {Map 1} #0] |mapred.FileInputFormat|: Total # of splits generated by getSplits: 5, TimeTaken: 85
2017-02-16 15:56:48,862 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: number of splits 5
// HiveSplitGenerator
2017-02-16 15:56:48,862 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: Number of input splits: 5. 3 available slots, 1.7 waves. Input format is: org.apache.hadoop.hive.ql.io.HiveInputFormat
// then grouping logic get called in HiveSplitGenerator
2017-02-16 15:56:48,891 [DEBUG] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Adding split hdfs://rk253.openstack:8020/apps/hive/warehouse/sample_07/sample1.csv to src new group? true
2017-02-16 15:56:48,891 [DEBUG] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Adding split hdfs://rk253.openstack:8020/apps/hive/warehouse/sample_07/sample1.csv to src new group? false
2017-02-16 15:56:48,891 [DEBUG] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Adding split hdfs://rk253.openstack:8020/apps/hive/warehouse/sample_07/sample1.csv to src new group? false
2017-02-16 15:56:48,891 [DEBUG] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Adding split hdfs://rk253.openstack:8020/apps/hive/warehouse/sample_07/sample1.csv to src new group? false
2017-02-16 15:56:48,892 [DEBUG] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Adding split hdfs://rk253.openstack:8020/apps/hive/warehouse/sample_07/sample1.csv to src new group? false
2017-02-16 15:56:48,892 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: # Src groups for split generation: 2
2017-02-16 15:56:48,893 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Estimated number of tasks: 5 for bucket 1
org.apache.hadoop.mapred.split.TezMapredSplitsGrouper#getGroupedSplits(org.apache.hadoop.conf.Configuration, org.apache.hadoop.mapred.InputSplit[], int, java.lang.String, org.apache.hadoop.mapred.split.SplitSizeEstimator)
based on tez.grouping.split-count
2017-02-16 15:56:48,894 [INFO] [InputInitializer {Map 1} #0] |split.TezMapredSplitsGrouper|: Grouping splits in Tez
2017-02-16 15:56:48,894 [INFO] [InputInitializer {Map 1} #0] |split.TezMapredSplitsGrouper|: Using original number of splits: 5 desired splits: 5
2017-02-16 15:56:48,897 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Original split size is 5 grouped split size is 5, for bucket: 1
2017-02-16 15:56:48,899 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: Number of grouped splits: 5