yenchenlin/a.md

## a.md

      
    Raw
  

              a.md
            
          
    #Assignment 1
Final Output

3780008173
3780008179
3780008201
3780008213
3780008233
3780008251
3780008269
3780008273
3780008291
3780008329

Explain how you design your mapper and reducer

Background

在 Hapdoop 系統中，為了要能夠 serialize data ，因此原本的 Java types 都必須 implement Writable 這個 interface。舉例來說，若要以 Long 當作 input / output 的 < key, value > type 時，必須用 LongWritable 這個 wrapper type 代替。但就本質而言，兩者存的數字是同一個。
Mapper

  public static class TokenizerMapper
  		// 1
       extends Mapper<Object, Text, NullWritable, LongWritable>{
    // 2
    private TreeMap<Long, String> numbers = new TreeMap<Long, String>();

定義 Mapper output 的 < key, value > type 為 <NullWritable , LongWritable>
在 Mapper 中宣告一個 type 為 TreeMap < Long, String > 的變數 numbers ，用來找出 local top 10。numbers以讀進來的數字為 key ，value 則填入空字串 " "。選擇用 TreeMap 是因為他會依照 key 的大小排列 data，因此可以讓我們在尋找 local top 10 的過程中方便的刪掉比較小的數字

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      // 1
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        // 2
        Long number = Long.parseLong(itr.nextToken());
        // 3
        numbers.put(number, " ");
        // 4
        if (numbers.size() > N)
          numbers.remove(numbers.firstKey());
      }
    }

利用 StringTokenizer 幫助我們 parse input data
將 parse 出來的數字轉為 Long type
將這個數字當作 key， " " 當作 value 存入剛剛定義的 numbers 中
如果 numbers 裡面存超過 10 筆資料，就把第一筆（key 最小的資料）刪除

    //1
    protected void cleanup(Context context) throws IOException, InterruptedException {
      for ( Long l : numbers.keySet() ) {
        //2
        context.write(NullWritable.get(), new LongWritable(l));
      } 
    }
  }

因為我們要等 Mapper 掃過每筆資料以後才 output local top 10，所以不能把 write outut 的 code 寫在 map function 裡面 ; 這裡利用 cleanup function 是 mapper 最後 call  的 function 的特性，在此做 output 的動作。
將 numbers 裡的資料寫到 output。注意 這邊把所有存在 numbers 的數字都 output 到 NullWritable.get() 這個 key，如此一來全部的 local top 10 都會存在 Reducer 的同一個 key 上。

Reducer

  // 1
  public static class IntSumReducer
       extends Reducer<NullWritable, LongWritable, NullWritable, LongWritable> {
    // 2
    private TreeMap<Long, String> numbers = new TreeMap<Long, String>();
    public void reduce(NullWritable key, Iterable<LongWritable> values, Context context
                       ) throws IOException, InterruptedException {
      
      for (LongWritable value : values) {
        // 3
        numbers.put(value.get(), " ");
        // 4
        if (numbers.size() > N)
          numbers.remove(numbers.firstKey());
      }
      // 5
      for (Long l : numbers.keySet()) {
        context.write(NullWritable.get(), new LongWritable(l));
      }
    }
  }

定義 Reducer input / output 的 < key, value > 形式為 <NullWritable , LongWritable>
在 Reducer 中宣告一個 type 為 TreeMap < Long, String > 的變數 numbers ，用來找出 global top 10。numbers以讀進來的數字為 key ，value 則填入空字串 " "。選擇用 TreeMap 是因為他會依照 key 的大小排列 data，因此可以讓我們方便的刪掉比較小的數字
將每個 mapper 傳過來的數字當作 key，" " 當作 value 存入剛剛定義的 numbers 中
如果 numbers 裡面存超過 10 筆資料，就把第一筆（key 最小的資料）刪除
因為 Reducer 只有一個 input key (NullWritable.get())，iterate 完 values 以後便可直接將 numbers 內尚存的每個 key 寫到 output，如此即可找出 global top 10