cd /home/training/udacity_training
[training@localhost udacity_training]$ head -50 data/purchases.txt | ./code/mapper.py | sort | ./code/reducer.py
[training@localhost udacity_training]$ pwd
/home/training/udacity_training
[training@localhost udacity_training]$ hadoop fs -put data/purchases.txt myinput
[training@localhost udacity_training]$ hadoop fs -ls
Found 1 items
-rw-r--r-- 1 training supergroup 211312924 2018-11-01 02:10 myinput
[training@localhost udacity_training]$ hs code/mapper.py code/reducer.py myinput output
packageJobJar: [code/mapper.py, code/reducer.py, /tmp/hadoop-training/hadoop-unjar8756022019060858567/] [] /tmp/streamjob3443365309394671497.jar tmpDir=null
18/11/01 02:14:43 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
18/11/01 02:14:43 WARN snappy.LoadSnappy: Snappy native library is available
18/11/01 02:14:43 INFO snappy.LoadSnappy: Snappy native library loaded
18/11/01 02:14:43 INFO mapred.FileInputFormat: Total input paths to process : 1
18/11/01 02:14:44 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop-hdfs/cache/training/mapred/local]
18/11/01 02:14:44 INFO streaming.StreamJob: Running job: job_201811010053_0004
18/11/01 02:14:44 INFO streaming.StreamJob: To kill this job, run:
18/11/01 02:14:44 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201811010053_0004
18/11/01 02:14:44 INFO streaming.StreamJob: Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201811010053_0004
18/11/01 02:14:45 INFO streaming.StreamJob: map 0% reduce 0%
18/11/01 02:14:55 INFO streaming.StreamJob: map 23% reduce 0%
18/11/01 02:14:58 INFO streaming.StreamJob: map 33% reduce 0%
18/11/01 02:15:01 INFO streaming.StreamJob: map 46% reduce 0%
18/11/01 02:15:05 INFO streaming.StreamJob: map 50% reduce 0%
18/11/01 02:15:13 INFO streaming.StreamJob: map 75% reduce 0%
18/11/01 02:15:15 INFO streaming.StreamJob: map 89% reduce 25%
18/11/01 02:15:18 INFO streaming.StreamJob: map 100% reduce 25%
18/11/01 02:15:24 INFO streaming.StreamJob: map 100% reduce 75%
18/11/01 02:15:27 INFO streaming.StreamJob: map 100% reduce 86%
18/11/01 02:15:30 INFO streaming.StreamJob: map 100% reduce 97%
18/11/01 02:15:32 INFO streaming.StreamJob: map 100% reduce 100%
18/11/01 02:15:33 INFO streaming.StreamJob: Job complete: job_201811010053_0004
18/11/01 02:15:33 INFO streaming.StreamJob: Output: output
[training@localhost udacity_training]$
[training@localhost udacity_training]$ hadoop fs -ls
Found 2 items
-rw-r--r-- 1 training supergroup 211312924 2018-11-01 02:10 myinput
drwxr-xr-x - training supergroup 0 2018-11-01 02:15 output
[training@localhost udacity_training]$
[training@localhost udacity_training]$ hadoop fs -ls output
Found 3 items
-rw-r--r-- 1 training supergroup 0 2018-11-01 02:15 output/_SUCCESS
drwxr-xr-x - training supergroup 0 2018-11-01 02:14 output/_logs
-rw-r--r-- 1 training supergroup 2296 2018-11-01 02:15 output/part-00000
[training@localhost udacity_training]$
[training@localhost udacity_training]$ cd data/
[training@localhost data]$ hadoop fs -get output/part-00000 mylocalfile.txt
[training@localhost data]$ ls
access_log.gz mylocalfile.txt purchases.txt
[training@localhost data]$
1 - Change the mapper to select item (category) and cost:
print "{0}\t{1}".format(item, cost)
2 - Run the new MapReduce job:
hs code/mapper_category.py code/reducer.py myinput output_category
3 - Extract the result:
hadoop fs -get output_category/part-00000 data/result_category.txt
4 - Filter for the question:
[training@localhost data]$ cat result_category.txt | grep -E "Toys|Electronics"
Consumer Electronics 57452374.13
Toys 57463477.11
[training@localhost data]$
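The steps above can also be simulated locally without Hadoop. A minimal Python 3 sketch of the mapper/reducer pair (the course code uses Python 2 print statements), assuming the purchases.txt layout date\ttime\tstore\titem\tcost\tpayment; the sample rows are made up:

```python
# Local simulation of the category-sum MapReduce job.
# Assumes purchases.txt lines: date\ttime\tstore\titem\tcost\tpayment
def mapper(lines):
    for line in lines:
        data = line.strip().split("\t")
        if len(data) == 6:
            date, time, store, item, cost, payment = data
            yield (item, float(cost))

def reducer(pairs):
    # Shortcut: a dict instead of the sorted-stream reducer Hadoop uses
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0.0) + value
    return totals

sample = [
    "2012-01-01\t09:00\tReno\tToys\t100.00\tVisa",
    "2012-01-01\t09:05\tToledo\tToys\t50.00\tCash",
    "2012-01-01\t09:10\tReno\tConsumer Electronics\t200.00\tVisa",
]
print(reducer(mapper(sample)))
# {'Toys': 150.0, 'Consumer Electronics': 200.0}
```

Note the dict is a convenience here; the real streaming reducer gets its input already sorted by key and only keeps one running total at a time.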
1 - Change the mapper to emit store and cost:
print "{0}\t{1}".format(store, cost)
2 - Change the reducer to keep the maximum:
if salesTotal < float(thisSale):
salesTotal = float(thisSale)
3 - Run the new MapReduce job:
hs code/mapper_store.py code/reducer_max.py myinput output_max_store
4 - Extract the result:
hadoop fs -get output_max_store/part-00000 data/result_max_store.txt
5 - Filter for the question:
[training@localhost data]$ cat result_max_store.txt | grep -E "Reno|Toledo|Chandler"
Chandler 499.98
Reno 499.99
Toledo 499.98
[training@localhost data]$
cat purchases.txt | grep Reno | awk -F $'\t' '{print $5}' | sort -g | tail -1
499.99
cat purchases.txt | grep Chandler | awk -F $'\t' '{print $5}' | sort -g | tail -1
499.98
cat purchases.txt | grep Toledo | awk -F $'\t' '{print $5}' | sort -g | tail -1
499.98
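The two-line reducer change above replaces accumulation with a running maximum. A standalone Python 3 sketch of that streaming-reducer logic over sorted key\tvalue lines (the function name and sample data are hypothetical):

```python
def reduce_max(sorted_lines):
    """Emit (key, max value) for each run of equal keys in sorted input."""
    results = []
    old_key, sales_max = None, None
    for line in sorted_lines:
        this_key, this_sale = line.strip().split("\t")
        if old_key is not None and old_key != this_key:
            # Key changed: flush the finished key's maximum
            results.append((old_key, sales_max))
            sales_max = None
        old_key = this_key
        if sales_max is None or sales_max < float(this_sale):
            sales_max = float(this_sale)
    if old_key is not None:
        results.append((old_key, sales_max))
    return results

print(reduce_max(["Chandler\t499.98", "Reno\t10.00", "Reno\t499.99"]))
# [('Chandler', 499.98), ('Reno', 499.99)]
```

The flush-on-key-change pattern is what makes the reducer work on an arbitrarily large sorted stream with constant memory.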
1 - Change the mapper to emit date and cost:
print "{0}\t{1}".format(date, cost)
2 - Change the reducer to accumulate values:
thisKey, thisSale = data_mapped
count += 1
salesTotal += float(thisSale)
print count, "\t", "%.2f" % salesTotal
3 - Run the job and get the values:
hs code/mapper_sales.py code/reducer_sales.py myinput output_sales2
hadoop fs -get output_sales2/part-00000 data/result_sales.txt
4 - cat the result file:
[training@localhost data]$ cat result_sales.txt
4138476 1034457953.26
[training@localhost data]$
X - Verification with shell:
[training@localhost data]$ wc -l purchases.txt
4138476 purchases.txt
[training@localhost data]$
[training@localhost data]$ cat purchases.txt | awk -F $'\t' '{print $5}'| paste -sd+ | bc
1034457953.26
[training@localhost data]$
The logfile is in Common Log Format:
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/js/lowpro.js HTTP/1.1" 200 10469
%h %l %u %t "%r" %>s %b
Where:
%h is the IP address of the client
%l is the identity of the client, or "-" if unavailable
%u is the username of the client, or "-" if unavailable
%t is the time the server finished processing the request. The format is [day/month/year:hour:minute:second zone]
%r is the request line from the client (in double quotes). It contains the method, path, query-string, and protocol of the request.
%>s is the status code the server sends back to the client. You will mostly see status codes 200 (OK - the request has succeeded), 304 (Not Modified) and 404 (Not Found). See W3C.org for more information on status codes.
%b is the size of the object returned to the client, in bytes. It will be "-" in case of status code 304.
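A sketch of parsing one such line in Python 3 with re; the group names are my own, not part of the course code:

```python
import re

# One capture group per CLF field: %h %l %u %t "%r" %>s %b
CLF = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)$'
)

line = ('10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] '
        '"GET /assets/js/lowpro.js HTTP/1.1" 200 10469')
m = CLF.match(line)
print(m.group("status"), m.group("size"))   # 200 10469
method, path, protocol = m.group("request").split()
print(path)                                 # /assets/js/lowpro.js
```

Anchoring the time field on [...] and the request on "..." tolerates the embedded spaces that a plain \S+ split has to reassemble afterwards.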
1 - Extract the log from access_log.gz:
gzip -d access_log.gz
[training@localhost data]$ ls -lth access_log
-rw-rw-r-- 1 training training 482M Dec 18 2012 access_log
[training@localhost data]$
[training@localhost data]$ head access_log; echo " - " ;tail access_log;
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET / HTTP/1.1" 403 202
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET /favicon.ico HTTP/1.1" 404 209
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET / HTTP/1.1" 200 9157
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/js/lowpro.js HTTP/1.1" 200 10469
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/css/reset.css HTTP/1.1" 200 1014
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/css/960.css HTTP/1.1" 200 6206
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/css/the-associates.css HTTP/1.1" 200 15779
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/js/the-associates.js HTTP/1.1" 200 4492
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/js/lightbox.js HTTP/1.1" 200 25960
10.223.157.186 - - [15/Jul/2009:15:50:36 -0700] "GET /assets/img/search-button.gif HTTP/1.1" 200 168
-
10.190.174.142 - - [03/Dec/2011:13:28:06 -0800] "GET /images/filmpics/0000/2229/GOEMON-NUKI-000163.jpg HTTP/1.1" 200 184976
10.190.174.142 - - [03/Dec/2011:13:28:08 -0800] "GET /assets/js/javascript_combined.js HTTP/1.1" 200 20404
10.190.174.142 - - [03/Dec/2011:13:28:09 -0800] "GET /assets/img/home-logo.png HTTP/1.1" 200 3892
10.190.174.142 - - [03/Dec/2011:13:28:09 -0800] "GET /images/filmmediablock/360/019.jpg HTTP/1.1" 200 74446
10.190.174.142 - - [03/Dec/2011:13:28:10 -0800] "GET /images/filmmediablock/360/g_still_04.jpg HTTP/1.1" 200 761555
10.190.174.142 - - [03/Dec/2011:13:28:09 -0800] "GET /images/filmmediablock/360/07082218.jpg HTTP/1.1" 200 154609
10.190.174.142 - - [03/Dec/2011:13:28:10 -0800] "GET /images/filmpics/0000/2229/GOEMON-NUKI-000163.jpg HTTP/1.1" 200 184976
10.190.174.142 - - [03/Dec/2011:13:28:11 -0800] "GET /images/filmmediablock/360/GOEMON-NUKI-000163.jpg HTTP/1.1" 200 60117
10.190.174.142 - - [03/Dec/2011:13:28:10 -0800] "GET /images/filmmediablock/360/Chacha.jpg HTTP/1.1" 200 109379
10.190.174.142 - - [03/Dec/2011:13:28:11 -0800] "GET /images/filmmediablock/360/GOEMON-NUKI-000159.jpg HTTP/1.1" 200 161657
[training@localhost data]$
Fields: host identity username timeRequest request statusCode sizeRequest ("-" in case of status code 304)
timeRequest = [day/month/year:hour:minute:second zone]
status code 304 = Not Modified
1 - Change the mapper to parse the access_log pattern:
[training@localhost code]$ cat mapper_access_log.py
#!/usr/bin/python
# Format of each line is Common Log Format:
# host identity username [time zone] "method page protocol" statusCode sizeRequest
#
# We want the page and the status code,
# written to standard output separated by a tab
import sys
import re

pattern = re.compile(r'^(\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+)')
for line in sys.stdin:
    # split() on a fully matching pattern yields ['', group1..group10, '']
    data = pattern.split(line.strip())
    if len(data) == 12:
        vazio1, host, identity, username, time, zone, operationHttp, page, typeHttp, statusCode, sizeRequest, vazio2 = data
        print "{0}\t{1}".format(page, statusCode)
[training@localhost code]$
[training@localhost data]$ head access_log | ../code/mapper_access_log.py
/ 403
/favicon.ico 404
/ 200
/assets/js/lowpro.js 200
/assets/css/reset.css 200
/assets/css/960.css 200
/assets/css/the-associates.css 200
/assets/js/the-associates.js 200
/assets/js/lightbox.js 200
/assets/img/search-button.gif 200
[training@localhost data]$
2 - Run the MapReduce job, accumulating hits per page:
hadoop fs -rm -r output*
hadoop fs -rm -r myinput
hadoop fs -put data/access_log myinput
hs code/mapper_access_log.py code/reducer_access_log.py myinput output_access_log
3 - Extract the result and filter for "/assets/js/the-associates.js":
hadoop fs -get output_access_log/part-00000 data/result_access_log.txt
[training@localhost udacity_training]$ cat data/result_access_log.txt | grep "/assets/js/the-associates.js"
/assets/js/the-associates.js 2456
[training@localhost udacity_training]$
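A sketch of the counting logic a streaming reducer like reducer_access_log.py would apply over sorted mapper output (the function name and sample data are hypothetical):

```python
def count_hits(sorted_lines):
    """Count lines per key over sorted key\tstatus pairs from the mapper."""
    results = []
    old_key, count = None, 0
    for line in sorted_lines:
        this_key = line.strip().split("\t")[0]
        if old_key is not None and old_key != this_key:
            # Key changed: flush the finished key's count
            results.append((old_key, count))
            count = 0
        old_key = this_key
        count += 1
    if old_key is not None:
        results.append((old_key, count))
    return results

print(count_hits(["/\t200", "/\t403", "/favicon.ico\t404"]))
# [('/', 2), ('/favicon.ico', 1)]
```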
1 - Change the mapper to emit the host:
2 - Run the job and extract the result:
hs code/mapper_access_log_ip.py code/reducer_access_log.py myinput output_access_log_ip
hadoop fs -get output_access_log_ip/part-00000 data/result_access_log_ip.txt
grep "10.99.99.186" data/result_access_log_ip.txt
[training@localhost udacity_training]$ grep "10.99.99.186" data/result_access_log_ip.txt
10.99.99.186 6
[training@localhost udacity_training]$
[training@localhost data]$ grep "10.99.99.186" access_log | wc -l
6
[training@localhost data]$
1 - Change the mapper to extract the path from the URLs:
...
from urlparse import urlparse
...
    try:
        parts = urlparse(page)
        path = 2  # index of the path component in the 6-tuple urlparse returns
        print "{0}\t{1}".format(parts[path], statusCode)
    except:
        print "{0}\t{1}".format(page, statusCode)
2 - Run the job and extract the result:
hadoop fs -rm -r output*
hs code/mapper_access_log_page.py code/reducer_access_log.py myinput output_access_log
hadoop fs -get output_access_log/part-00000 data/result_access_log_page.txt
cat data/result_access_log_page.txt | sort -r -g -k 2| head -5
[training@localhost udacity_training]$ cat data/result_access_log_page.txt | sort -r -g -k 2| head -1
/displaytitle.php 263781
[training@localhost udacity_training]$
[training@localhost udacity_training]$ cat data/result_access_log_page.txt | sort -r -g -k 2| head -5
/displaytitle.php 263781
/downloadSingle.php 141482
/assets/css/combined.css 117352
/assets/js/javascript_combined.js 106979
/ 100503
[training@localhost udacity_training]$
Is /assets/css/combined.css really that high? Surprising; worth double-checking.
[training@localhost udacity_training]$ hadoop fs -rmr output*
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted output
Deleted output_acess_log
Deleted output_acess_log2
Deleted output_acess_log3
Deleted output_category
Deleted output_max_store
Deleted output_sales
Deleted output_sales2
hadoop fs -rm -r myinput
rm -f ~/udacity_training/data/result_access_log.txt
cat access_log | ../code/mapper_access_log.py > map.log
sort map.log > map_sort.log
cat map_sort.log | ../code/reducer_access_log.py > data.log
grep "/assets/js/the-associates.js" data.log
from urlparse import urlparse  # Python 3: from urllib.parse import urlparse
uri = 'http://www.the-associates.co.uk/displaytitle.php?id=537'
parts = urlparse(uri)
print parts[2]  # /displaytitle.php
uri = '/assets/js/lowpro.js'
parts = urlparse(uri)
print parts[2]  # /assets/js/lowpro.js