Skip to content

Instantly share code, notes, and snippets.

@rajkrrsingh
Last active January 31, 2018 17:48
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save rajkrrsingh/ba003f66e278a6905257d9826ef5483e to your computer and use it in GitHub Desktop.
Save rajkrrsingh/ba003f66e278a6905257d9826ef5483e to your computer and use it in GitHub Desktop.
how tez initial paralleism work (split calculation)

split generation in tez

2017-02-16 15:56:48,725 [INFO] [InputInitializer {Map 1} #0] |dag.RootInputInitializerManager|: Starting InputInitializer for Input: sample_07 on vertex vertex_1486830296338_0025_1_00 [Map 1]

invoke org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator#initialize

2017-02-16 15:56:48,729 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: InputInitializer {Map 1} #0 | initialize realInputFormatName : org.apache.hadoop.hive.ql.io.HiveInputFormat

2017-02-16 15:56:48,738 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: InputInitializer {Map 1} #0 | initialize inputFormat org.apache.hadoop.hive.ql.io.HiveInputFormat@293c29b7

2017-02-16 15:56:48,738 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: InputInitializer {Map 1} #0 | initialize totalResource : 3072 taskResource : 1024 availableSlots : 3

2017-02-16 15:56:48,738 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: InputInitializer {Map 1} #0 | initialize conf.getLong(MIN_SPLIT_SIZE, 1) 1
2017-02-16 15:56:48,738 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: InputInitializer {Map 1} #0 | initialize blockSize 134217728
2017-02-16 15:56:48,738 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: InputInitializer {Map 1} #0 | initialize minGrouping 16777216
2017-02-16 15:56:48,738 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: The preferred split size is 16777216

2017-02-16 15:56:48,739 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: InputInitializer {Map 1} #0 | initialize waves : 1.7 availableSlots : 3

InputSplit[] splits = inputFormat.getSplits(jobConf, (int) (availableSlots * waves));((int ) 3*1.7)

// HiveInputFormat


2017-02-16 15:56:48,739 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: InputInitializer {Map 1} #0 | getSplits | start with numSplits 5 

2017-02-16 15:56:48,740 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: InputInitializer {Map 1} #0 | getSplits dirs.size 1

// Class<? extends InputFormat> inputFormatClass = part.getInputFileFormatClass();

2017-02-16 15:56:48,741 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: InputInitializer {Map 1} #0 | getSplits inputFormatClass class org.apache.hadoop.mapred.TextInputFormat
if (dirs.length != 0) {
      LOG.info("Generating splits");
      LOG.info(Thread.currentThread().getName()+" | "+Thread.currentThread().getStackTrace()[1].getMethodName()+" numSplits "+numSplits);
      LOG.info(Thread.currentThread().getName()+" | "+Thread.currentThread().getStackTrace()[1].getMethodName()+" currentDirs.size() "+currentDirs.size());
      LOG.info(Thread.currentThread().getName()+" | "+Thread.currentThread().getStackTrace()[1].getMethodName()+" dirs.length "+dirs.length);
      LOG.info(Thread.currentThread().getName()+" | "+Thread.currentThread().getStackTrace()[1].getMethodName()+" currentInputFormatClass "+currentInputFormatClass);
      addSplitsForGroup(currentDirs, currentTableScan, newjob,
          getInputFormatFromCache(currentInputFormatClass, job),
          currentInputFormatClass, currentDirs.size()*(numSplits / dirs.length),
          currentTable, result);
    }


    2017-02-16 15:56:48,774 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: InputInitializer {Map 1} #0 | addSplitsForGroup conf.setInputFormat class org.apache.hadoop.mapred.TextInputFormat
    2017-02-16 15:56:48,774 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: InputInitializer {Map 1} #0 | addSplitsForGroup splits 5
// invoke FileInputFormat
    InputSplit[] iss = inputFormat.getSplits(conf, splits);
    for (InputSplit is : iss) {
      result.add(new HiveInputSplit(is, inputFormatClass.getName()));
    }

    2017-02-16 15:56:48,861 [DEBUG] [InputInitializer {Map 1} #0] |mapred.FileInputFormat|: Total # of splits generated by getSplits: 5, TimeTaken: 85
2017-02-16 15:56:48,862 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: number of splits 5

// HiveSplitGenerator    
2017-02-16 15:56:48,862 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: Number of input splits: 5. 3 available slots, 1.7 waves. Input format is: org.apache.hadoop.hive.ql.io.HiveInputFormat


// then grouping logic get called in HiveSplitGenerator

2017-02-16 15:56:48,891 [DEBUG] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Adding split hdfs://rk253.openstack:8020/apps/hive/warehouse/sample_07/sample1.csv to src new group? true
2017-02-16 15:56:48,891 [DEBUG] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Adding split hdfs://rk253.openstack:8020/apps/hive/warehouse/sample_07/sample1.csv to src new group? false
2017-02-16 15:56:48,891 [DEBUG] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Adding split hdfs://rk253.openstack:8020/apps/hive/warehouse/sample_07/sample1.csv to src new group? false
2017-02-16 15:56:48,891 [DEBUG] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Adding split hdfs://rk253.openstack:8020/apps/hive/warehouse/sample_07/sample1.csv to src new group? false
2017-02-16 15:56:48,892 [DEBUG] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Adding split hdfs://rk253.openstack:8020/apps/hive/warehouse/sample_07/sample1.csv to src new group? false
2017-02-16 15:56:48,892 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: # Src groups for split generation: 2
2017-02-16 15:56:48,893 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Estimated number of tasks: 5 for bucket 1

org.apache.hadoop.mapred.split.TezMapredSplitsGrouper#getGroupedSplits(org.apache.hadoop.conf.Configuration, org.apache.hadoop.mapred.InputSplit[], int, java.lang.String, org.apache.hadoop.mapred.split.SplitSizeEstimator)
based on tez.grouping.split-count
2017-02-16 15:56:48,894 [INFO] [InputInitializer {Map 1} #0] |split.TezMapredSplitsGrouper|: Grouping splits in Tez
2017-02-16 15:56:48,894 [INFO] [InputInitializer {Map 1} #0] |split.TezMapredSplitsGrouper|: Using original number of splits: 5 desired splits: 5
2017-02-16 15:56:48,897 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Original split size is 5 grouped split size is 5, for bucket: 1
2017-02-16 15:56:48,899 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: Number of grouped splits: 5
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment