Example code

cd /home/training/udacity_training
[training@localhost udacity_training]$ head -50 data/purchases.txt | ./code/mapper.py | sort | ./code/reducer.py 
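Locally, head | mapper | sort | reducer simulates what Hadoop Streaming will do at scale: the sort step stands in for the shuffle phase between map and reduce. For reference, the mapper/reducer pair from the course looks roughly like this (a sketch assuming purchases.txt lines are date\ttime\tstore name\titem\tcost\tpayment; a reconstruction, not the literal course files):

#!/usr/bin/python
# mapper.py (sketch): emit store name and cost for each purchase,
# separated by a tab.
import sys

for line in sys.stdin:
    data = line.strip().split("\t")
    if len(data) == 6:
        date, time, store, item, cost, payment = data
        print "{0}\t{1}".format(store, cost)

#!/usr/bin/python
# reducer.py (sketch): input arrives sorted by key, so a change of key
# means the previous store's total is complete and can be printed.
import sys

salesTotal = 0
oldKey = None

for line in sys.stdin:
    data_mapped = line.strip().split("\t")
    if len(data_mapped) != 2:
        # something has gone wrong; skip this line
        continue

    thisKey, thisSale = data_mapped

    if oldKey and oldKey != thisKey:
        print oldKey, "\t", salesTotal
        salesTotal = 0

    oldKey = thisKey
    salesTotal += float(thisSale)

if oldKey != None:
    print oldKey, "\t", salesTotal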

[training@localhost udacity_training]$ pwd
/home/training/udacity_training
[training@localhost udacity_training]$ hadoop fs -put data/purchases.txt myinput

[training@localhost udacity_training]$ hadoop fs -ls
Found 1 items
-rw-r--r--   1 training supergroup  211312924 2018-11-01 02:10 myinput

[training@localhost udacity_training]$ hs code/mapper.py code/reducer.py myinput output
packageJobJar: [code/mapper.py, code/reducer.py, /tmp/hadoop-training/hadoop-unjar8756022019060858567/] [] /tmp/streamjob3443365309394671497.jar tmpDir=null
18/11/01 02:14:43 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
18/11/01 02:14:43 WARN snappy.LoadSnappy: Snappy native library is available
18/11/01 02:14:43 INFO snappy.LoadSnappy: Snappy native library loaded
18/11/01 02:14:43 INFO mapred.FileInputFormat: Total input paths to process : 1
18/11/01 02:14:44 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop-hdfs/cache/training/mapred/local]
18/11/01 02:14:44 INFO streaming.StreamJob: Running job: job_201811010053_0004
18/11/01 02:14:44 INFO streaming.StreamJob: To kill this job, run:
18/11/01 02:14:44 INFO streaming.StreamJob: UNDEF/bin/hadoop job  -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201811010053_0004
18/11/01 02:14:44 INFO streaming.StreamJob: Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201811010053_0004
18/11/01 02:14:45 INFO streaming.StreamJob:  map 0%  reduce 0%
18/11/01 02:14:55 INFO streaming.StreamJob:  map 23%  reduce 0%
18/11/01 02:14:58 INFO streaming.StreamJob:  map 33%  reduce 0%
18/11/01 02:15:01 INFO streaming.StreamJob:  map 46%  reduce 0%
18/11/01 02:15:05 INFO streaming.StreamJob:  map 50%  reduce 0%
18/11/01 02:15:13 INFO streaming.StreamJob:  map 75%  reduce 0%
18/11/01 02:15:15 INFO streaming.StreamJob:  map 89%  reduce 25%
18/11/01 02:15:18 INFO streaming.StreamJob:  map 100%  reduce 25%
18/11/01 02:15:24 INFO streaming.StreamJob:  map 100%  reduce 75%
18/11/01 02:15:27 INFO streaming.StreamJob:  map 100%  reduce 86%
18/11/01 02:15:30 INFO streaming.StreamJob:  map 100%  reduce 97%
18/11/01 02:15:32 INFO streaming.StreamJob:  map 100%  reduce 100%
18/11/01 02:15:33 INFO streaming.StreamJob: Job complete: job_201811010053_0004
18/11/01 02:15:33 INFO streaming.StreamJob: Output: output
[training@localhost udacity_training]$ 
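Note: hs is not a standard Hadoop command; it is a helper defined in the Udacity training VM that wraps Hadoop Streaming. Presumably it expands to something like the following (the exact streaming-jar path is an assumption):

hs() { hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming*.jar -mapper $1 -reducer $2 -file $1 -file $2 -input $3 -output $4; }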


[training@localhost udacity_training]$ hadoop fs -ls
Found 2 items
-rw-r--r--   1 training supergroup  211312924 2018-11-01 02:10 myinput
drwxr-xr-x   - training supergroup          0 2018-11-01 02:15 output
[training@localhost udacity_training]$ 

[training@localhost udacity_training]$ hadoop fs -ls output
Found 3 items
-rw-r--r--   1 training supergroup          0 2018-11-01 02:15 output/_SUCCESS
drwxr-xr-x   - training supergroup          0 2018-11-01 02:14 output/_logs
-rw-r--r--   1 training supergroup       2296 2018-11-01 02:15 output/part-00000
[training@localhost udacity_training]$ 


[training@localhost udacity_training]$ cd data/
[training@localhost data]$ hadoop fs -get output/part-00000 mylocalfile.txt
[training@localhost data]$ ls
access_log.gz  mylocalfile.txt  purchases.txt
[training@localhost data]$ 

Quiz sales per category

1 - Change the mapper program to emit item (the category) and cost:

  print "{0}\t{1}".format(item, cost)

2 - Run a new MapReduce job:

  hs code/mapper_category.py code/reducer.py myinput output_category

3 - Extract the result:

hadoop fs -get output_category/part-00000 data/result_category.txt

4 - Filter for the categories in the question:

[training@localhost data]$ cat result_category.txt | grep -E "Toys|Electronics"
Consumer Electronics 	57452374.13
Toys 	57463477.11
[training@localhost data]$ 

Quiz Highest Sale

1 - Change the mapper to emit store and cost:

  print "{0}\t{1}".format(store, cost)

2 - Change the reducer to keep the maximum sale:

if salesTotal < float(thisSale):
    salesTotal = float(thisSale)
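
In context, reducer_max.py would look roughly like the course reducer with the running sum replaced by a running maximum (a sketch):

#!/usr/bin/python
# reducer_max.py (sketch): same key-change structure as reducer.py,
# but keep the largest single sale per store instead of the sum.
import sys

salesTotal = 0
oldKey = None

for line in sys.stdin:
    data_mapped = line.strip().split("\t")
    if len(data_mapped) != 2:
        continue

    thisKey, thisSale = data_mapped

    if oldKey and oldKey != thisKey:
        print oldKey, "\t", salesTotal
        salesTotal = 0

    oldKey = thisKey
    if salesTotal < float(thisSale):
        salesTotal = float(thisSale)

if oldKey != None:
    print oldKey, "\t", salesTotal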

3 - Run a new MapReduce job:

hs code/mapper_store.py code/reducer_max.py myinput output_max_store

4 - Extract the result:

hadoop fs -get output_max_store/part-00000 data/result_max_store.txt

5 - Filter for the stores in the question:

[training@localhost data]$ cat result_max_store.txt | grep -E "Reno|Toledo|Chandler"
Chandler        499.98
Reno    499.99
Toledo  499.98
[training@localhost data]$

Cross-check against the raw file with awk (cost is field 5):

cat purchases.txt | grep Reno | awk -F $'\t' '{print $5}' | sort -g | tail -1
499.99

cat purchases.txt | grep Chandler | awk -F $'\t' '{print $5}' | sort -g  | tail -1
499.98

cat purchases.txt | grep Toledo | awk -F $'\t' '{print $5}' | sort -g  | tail -1
499.98

Quiz total sales

1 - Change the mapper to emit date and cost:

print "{0}\t{1}".format(date, cost)

2 - Change the reducer to count records and accumulate values:

  thisKey, thisSale = data_mapped
  count += 1
  salesTotal += float(thisSale)

print count, "\t", "%.2f" % salesTotal
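
Put together, reducer_sales.py would look roughly like this (a sketch; it ignores the key and prints a single count/total line at the end):

#!/usr/bin/python
# reducer_sales.py (sketch): count every record and accumulate every
# sale, printing one line at the very end.
import sys

salesTotal = 0
count = 0

for line in sys.stdin:
    data_mapped = line.strip().split("\t")
    if len(data_mapped) != 2:
        continue

    thisKey, thisSale = data_mapped
    count += 1
    salesTotal += float(thisSale)

print count, "\t", "%.2f" % salesTotal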

3 - Run the job and get the values:

hs code/mapper_sales.py code/reducer_sales.py myinput output_sales2
hadoop fs -get output_sales2/part-00000 data/result_sales.txt

4 - Cat the result file:

[training@localhost data]$ cat result_sales.txt 
4138476 	1034457953.26
[training@localhost data]$ 

X - Example with shell:

[training@localhost data]$ wc -l purchases.txt
4138476 purchases.txt
[training@localhost data]$

[training@localhost data]$ cat purchases.txt | awk -F $'\t' '{print $5}'| paste -sd+ | bc
1034457953.26
[training@localhost data]$

Part II

The logfile is in Common Log Format:

10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/js/lowpro.js HTTP/1.1" 200 10469

%h %l %u %t "%r" %>s %b

Where:

%h is the IP address of the client
%l is the identity of the client, or "-" if it's unavailable
%u is the username of the client, or "-" if it's unavailable
%t is the time the server finished processing the request, in the format [day/month/year:hour:minute:second zone]
%r is the request line from the client (in double quotes). It contains the method, path, query string, and protocol of the request.
%>s is the status code that the server sends back to the client. You will mostly see status codes 200 (OK - the request has succeeded), 304 (Not Modified) and 404 (Not Found). See W3C.org for more information on status codes.
%b is the size of the object returned to the client, in bytes. It will be "-" in the case of status code 304.

1 - Extract the log from access_log.gz:

gzip -d access_log.gz 

[training@localhost data]$ ls -lth access_log 
-rw-rw-r-- 1 training training 482M Dec 18  2012 access_log
[training@localhost data]$ 

[training@localhost data]$ head access_log; echo " - " ;tail access_log;
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET / HTTP/1.1" 403 202
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET /favicon.ico HTTP/1.1" 404 209
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET / HTTP/1.1" 200 9157
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/js/lowpro.js HTTP/1.1" 200 10469
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/css/reset.css HTTP/1.1" 200 1014
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/css/960.css HTTP/1.1" 200 6206
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/css/the-associates.css HTTP/1.1" 200 15779
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/js/the-associates.js HTTP/1.1" 200 4492
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/js/lightbox.js HTTP/1.1" 200 25960
10.223.157.186 - - [15/Jul/2009:15:50:36 -0700] "GET /assets/img/search-button.gif HTTP/1.1" 200 168
 - 
10.190.174.142 - - [03/Dec/2011:13:28:06 -0800] "GET /images/filmpics/0000/2229/GOEMON-NUKI-000163.jpg HTTP/1.1" 200 184976
10.190.174.142 - - [03/Dec/2011:13:28:08 -0800] "GET /assets/js/javascript_combined.js HTTP/1.1" 200 20404
10.190.174.142 - - [03/Dec/2011:13:28:09 -0800] "GET /assets/img/home-logo.png HTTP/1.1" 200 3892
10.190.174.142 - - [03/Dec/2011:13:28:09 -0800] "GET /images/filmmediablock/360/019.jpg HTTP/1.1" 200 74446
10.190.174.142 - - [03/Dec/2011:13:28:10 -0800] "GET /images/filmmediablock/360/g_still_04.jpg HTTP/1.1" 200 761555
10.190.174.142 - - [03/Dec/2011:13:28:09 -0800] "GET /images/filmmediablock/360/07082218.jpg HTTP/1.1" 200 154609
10.190.174.142 - - [03/Dec/2011:13:28:10 -0800] "GET /images/filmpics/0000/2229/GOEMON-NUKI-000163.jpg HTTP/1.1" 200 184976
10.190.174.142 - - [03/Dec/2011:13:28:11 -0800] "GET /images/filmmediablock/360/GOEMON-NUKI-000163.jpg HTTP/1.1" 200 60117
10.190.174.142 - - [03/Dec/2011:13:28:10 -0800] "GET /images/filmmediablock/360/Chacha.jpg HTTP/1.1" 200 109379
10.190.174.142 - - [03/Dec/2011:13:28:11 -0800] "GET /images/filmmediablock/360/GOEMON-NUKI-000159.jpg HTTP/1.1" 200 161657
[training@localhost data]$ 
Fields: host  identity  username  timeRequest  request  statusCode  sizeRequest
timeRequest = [day/month/year:hour:minute:second zone]
sizeRequest is "-" in the case of status code 304 (Not Modified)

Quiz Hits to page

1 - Change the mapper to parse the access_log format:

[training@localhost code]$ cat mapper_acess_log.py 
#!/usr/bin/python

# Format of each line is (Common Log Format):
# host identity username [time zone] "method page protocol" statusCode sizeRequest
#
# We want the page and the status code
# We need to write them out to standard output, separated by a tab

import sys
import re

for line in sys.stdin:
    data = re.compile(r'^(\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+)').split(line.strip())
    # split() on an anchored 10-group pattern yields 12 elements:
    # the empty string before the match, the 10 groups, and the remainder
    if len(data) == 12:
        # vazio1/vazio2 ("vazio" is Portuguese for "empty") hold the
        # surrounding empty strings produced by split()
        vazio1, host, identity, username, time, zone, operationHttp, page, typeHttp, statusCode, sizeRequest, vazio2 = data
        print "{0}\t{1}".format(page, statusCode)

[training@localhost code]$ 


[training@localhost data]$ head access_log  |  ../code/mapper_acess_log.py 
/	403
/favicon.ico	404
/	200
/assets/js/lowpro.js	200
/assets/css/reset.css	200
/assets/css/960.css	200
/assets/css/the-associates.css	200
/assets/js/the-associates.js	200
/assets/js/lightbox.js	200
/assets/img/search-button.gif	200
[training@localhost data]$ 

2 - Run the MapReduce job, counting hits per page:

hadoop fs -rm -r output*
hadoop fs -rm -r myinput
hadoop fs -put data/access_log myinput
hs code/mapper_access_log.py code/reducer_access_log.py myinput output_access_log
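
reducer_access_log.py is presumably the course reducer adapted to count mapped lines per key rather than sum values (a sketch):

#!/usr/bin/python
# reducer_access_log.py (sketch): count the number of mapped lines per
# key (hits per page); the status-code value is ignored.
import sys

count = 0
oldKey = None

for line in sys.stdin:
    data_mapped = line.strip().split("\t")
    if len(data_mapped) != 2:
        continue

    thisKey, value = data_mapped

    if oldKey and oldKey != thisKey:
        print oldKey, "\t", count
        count = 0

    oldKey = thisKey
    count += 1

if oldKey != None:
    print oldKey, "\t", count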

3 - Extract the result and filter by "/assets/js/the-associates.js":

hadoop fs -get output_access_log/part-00000 data/result_access_log.txt
[training@localhost udacity_training]$ cat data/result_access_log.txt | grep "/assets/js/the-associates.js"
/assets/js/the-associates.js 	2456
[training@localhost udacity_training]$ 

Quiz hits from IP

1 - Change the mapper to emit the host (client IP) instead of the page, as sketched below:
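
A sketch of mapper_access_log_ip.py, assuming it differs from mapper_access_log.py only in which field it emits:

#!/usr/bin/python
# mapper_access_log_ip.py (sketch): emit the host (client IP) so the
# reducer counts hits per IP instead of hits per page.
import sys
import re

for line in sys.stdin:
    data = re.compile(r'^(\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+)').split(line.strip())
    if len(data) == 12:
        vazio1, host, identity, username, time, zone, operationHttp, page, typeHttp, statusCode, sizeRequest, vazio2 = data
        print "{0}\t{1}".format(host, statusCode)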

2 - Run the job and extract the result:

hs code/mapper_access_log_ip.py code/reducer_access_log.py myinput output_access_log_ip
hadoop fs -get output_access_log_ip/part-00000 data/result_access_log_ip.txt
grep "10.99.99.186" data/result_access_log_ip.txt

[training@localhost udacity_training]$ grep "10.99.99.186" data/result_access_log_ip.txt
10.99.99.186 	6
[training@localhost udacity_training]$ 

[training@localhost data]$ grep "10.99.99.186" access_log | wc -l
6
[training@localhost data]$ 

Quiz most popular page

1 - Change the mapper to extract the path from the URLs:

...
from urlparse import urlparse
...
        try:
            # index 2 of the tuple returned by urlparse is the path
            parts = urlparse(page)
            path = 2
            print "{0}\t{1}".format(parts[path], statusCode)
        except:
            print "{0}\t{1}".format(page, statusCode)

2 - Run the job and extract the result:

hadoop fs -rm -r output*
hs code/mapper_access_log_page.py code/reducer_access_log.py myinput output_access_log
hadoop fs -get output_access_log/part-00000 data/result_access_log_page.txt
cat data/result_access_log_page.txt | sort -r -g -k 2| head -5


[training@localhost udacity_training]$ cat data/result_access_log_page.txt | sort -r -g -k 2| head -1
/displaytitle.php 	263781
[training@localhost udacity_training]$

[training@localhost udacity_training]$ cat data/result_access_log_page.txt | sort -r -g -k 2| head -5
/displaytitle.php 	263781
/downloadSingle.php 	141482
/assets/css/combined.css 	117352
/assets/js/javascript_combined.js 	106979
/ 	100503
[training@localhost udacity_training]$ 

/assets/css/combined.css is correct??? wtf.

  [training@localhost udacity_training]$ hadoop fs -rmr output*
  rmr: DEPRECATED: Please use 'rm -r' instead.
  Deleted output
  Deleted output_acess_log
  Deleted output_acess_log2
  Deleted output_acess_log3
  Deleted output_category
  Deleted output_max_store
  Deleted output_sales
  Deleted output_sales2

  hadoop fs -rm -r myinput
  rm -rf ~/udacity_training/data/result_access_log.txt

  # run the steps sequentially; backgrounding each one with & would let
  # sort and the reducer start before their input files are complete
  cat access_log | ../code/mapper_access_log.py > map.log
  cat map.log | sort > map_sort.log
  cat map_sort.log | ../code/reducer_access_log.py > data.log
  grep "/assets/js/the-associates.js" data.log




  from urlparse import urlparse
  uri = 'http://www.the-associates.co.uk/displaytitle.php?id=537'
  parts = urlparse(uri)  # parts[2] == '/displaytitle.php'
  uri = '/assets/js/lowpro.js'
  parts = urlparse(uri)  # parts[2] == '/assets/js/lowpro.js'