Compiling ybot.analytics.ga.aggregate
Exception in thread "main" java.lang.NoSuchMethodError: clojure.lang.RT.keyword(Ljava/lang/String;Ljava/lang/String;)Lclojure/lang/Keyword; (util.clj:5)
    at clojure.lang.Compiler$InvokeExpr.eval(Compiler.java:2911)
    at clojure.lang.Compiler.compile1(Compiler.java:5933)
    at clojure.lang.Compiler.compile1(Compiler.java:5923)
    at clojure.lang.Compiler.compile(Compiler.java:5992)
    at clojure.lang.RT.compile(RT.java:368)
    at clojure.lang.RT.load(RT.java:407)
    at clojure.lang.RT.load(RT.java:381)
    at clojure.core$load$fn__4519.invoke(core.clj:4915)
(defn lemmatize-text
  "Apply a lucene tokenizer to cleaned text content as a lazy-seq"
  [page-text]
  (let [reader (java.io.StringReader. page-text)
        analyzer (->
                  (resource-to-temp-file
                   "stanford_nlp_models/bidirectional-distsim-wsj-0-18.tagger"
                   ".tagger")
                  (.getAbsolutePath)
                  (MaxentTagger.)
# MacPorts installs bash_completion in /opt/local
if [ -f /opt/local/etc/bash_completion ]; then
  . /opt/local/etc/bash_completion
  # show * (unstaged) and + (staged) next to the branch name
  export GIT_PS1_SHOWDIRTYSTATE=1
  # show $ when something is stashed
  export GIT_PS1_SHOWSTASHSTATE=1
  # show % when there are untracked files (disabled)
  # export GIT_PS1_SHOWUNTRACKEDFILES=1
  export PS1='\[\e[32m\]λ \w\[\e[36m\]$(__git_ps1 " (%s)") [$(~/.rvm/bin/rvm-prompt i v)]\[\e[0m\]\n\[\e[32m\]→\[\e[0m\] '
(defn tokenize-strings [in-path out-path]
  (let [src (hfs-textline in-path)]
    (?<- (hfs-textline out-path :sinkmode :replace)
         [!line ?token]
         (src !line)
         (tokenize-string !line :> ?token)
         (:distinct false))))
(defmapcatop tokenize-string {:stateful true}
  ([] (load-analyzer StandardAnalyzer/STOP_WORDS_SET))
  ([analyzer text]
     (emit-tokens (tokenize-text analyzer text)))
  ([analyzer] nil))
(defn tokenizer-seq
  "Build a lazy-seq out of a tokenizer with TermAttribute"
  [^TokenStream tokenizer ^TermAttribute term-att]
  (lazy-seq
   (when (.incrementToken tokenizer)
     (cons (.term term-att) (tokenizer-seq tokenizer term-att)))))

(defn load-analyzer [^java.util.Set stopwords]
  (StandardAnalyzer. Version/LUCENE_CURRENT stopwords))
(ns ybot.analytics.edb
  (:use [ybot bootstrap datastores])
  (:use elephantdb.cascalog.core)
  (:import [elephantdb.persistence JavaBerkDB]
           [org.apache.hadoop.io BytesWritable]))

(bootstrap-ybot)

(defn ser-long [val]
  (BytesWritable. (.getBytes (str val))))
(defmapop decode-ybtag [json]
  (let [{:keys [psn co vi pvi si sts ts ln ce sd lo r ua la na np nc
                c_st g_C g_r g_c dma nv]} (json/parse-string json true)]
    [psn co vi pvi si sts ts ln ce sd lo r ua la na np nc
     c_st g_C g_r g_c dma nv]))

(defn glob-ybtag-json [pattern]
  (let [tap (globhfs-textline pattern)]
    (<- [!psn !co !vi !pvi !si !sts !ts !ln !ce !sd !lo !r !ua !la !na !np !nc
         !c_st !g_C !g_r !g_c !dma !nv]
(defn hfs-report
  "Loads the log data from an HDFS path into HBase."
  [path]
  (?<- (hbase-tap "urls" "?url-hash" "urls" "?url" "?crawl-date"
                  "?crawl-time" "?response-code" "?status" "?host")
       [?url-hash ?url ?crawl-date ?crawl-time ?response-code ?status ?host]
       ((hfs-textline path) ?text)
       (fetch-value-hash ?text :url :> ?url-hash)
       (fetch-value ?text :url :> ?url)
       (fetch-value ?text :crawl-date :> ?crawl-date)
(ns ybot.hadoop.pail
  (:use cascalog.api
        [cascalog.io :only (with-fs-tmp)])
  (:import [backtype.cascading.tap PailTap PailTap$PailTapOptions]
           [backtype.hadoop.pail Pail]))

(defn- pail-tap
  [path colls structure]
  (let [seqs (into-array java.util.List colls)
        spec (PailTap/makeSpec nil structure)