@zev
Created May 16, 2012 02:13
Custom Cascalog tap from a db query
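;; Returns the values of record r for the given keys, in key order.
(defn map-values
  [r keys]
  (map #(% r) keys))

;; Eagerly converts a result set (a seq of maps) into a vector of tuple vectors.
;; map-values is defined above because a var must exist before code that uses it compiles.
(defn convert-rs
  [rs keys]
  (vec (doall
         (map #(vec (map-values % keys)) rs))))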
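
;; Assumes `sql` aliases clojure.java.jdbc and `db` is a db-spec defined elsewhere.
(defn users-query
  []
  (sql/with-connection db
    (sql/with-query-results rs ["select id,username from users"]
      ;; rs is a lazy sequence of maps, one per record in the result set,
      ;; so it must be fully realized before the connection closes.
      (convert-rs rs [:id :username]))))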

;; Bind the query result to a var so Cascalog can use the in-memory vector of tuples as a source.
;; Cascading has extra libs for proper db sources, which might help speed up Cascalog runs.
(def users (users-query))
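
;; Count the rows whose username is "foo".
;; ?username has to be bound by the source before it can appear in the output vars.
;; Assumes cascalog.api is referred and cascalog.ops is aliased as `c`.
(?<- (stdout) [?username ?cnt]
     (users ?user_id ?username)
     (= ?username "foo")
     (c/count ?cnt))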
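
To make the shape of the source concrete, here is a minimal sketch of what the conversion step produces, with no database involved. The `sample-rows` data is hypothetical and only mimics the seq of maps that `with-query-results` hands back. The two helpers are repeated so the snippet runs on its own, with `mapv` swapped in for `vec` + `doall` + `map`.

```clojure
;; Hypothetical rows, shaped like a JDBC result set: one map per record.
(def sample-rows
  [{:id 1 :username "foo"}
   {:id 2 :username "bar"}])

(defn map-values
  "Returns the values of record r for the given keys, in key order."
  [r ks]
  (map #(% r) ks))

(defn convert-rs
  "Eagerly converts a seq of maps into a vector of tuple vectors."
  [rs ks]
  (mapv #(vec (map-values % ks)) rs))

(def sample-tuples (convert-rs sample-rows [:id :username]))
;; sample-tuples is [[1 "foo"] [2 "bar"]]
```

A vector of tuple vectors like this is the in-memory form the query above reads from when `users` is used as a source.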

zev commented May 16, 2012

What are the problems with using this type of source versus https://github.com/cwensel/cascading.jdbc/ or https://github.com/cascading/cascading-dbmigrate?
Is it that this query will be run on every mapper, while the others run once and write their results to HFS for the mappers to read in chunks?
