@sunilnandihalli
Created February 28, 2012 04:11
problem with lazy join of two large sorted csv files...
(ns pythia.soj.join
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]
            [clojure.java.shell :as sh]))
(defn write-csv-record [wrtr record]
  (binding [*out* wrtr]
    (println (apply str (interpose \, record)))))
(defn map-seq-to-csv-with-out-using-write-csv [seq-of-maps output-file]
  (let [[all-keys] (with-open [wrtr (io/writer (str output-file ".tmp"))]
                     (reduce (fn [[c-keys keys-to-id-map] m]
                               (let [[n-keys n-keys-to-id-map :as w]
                                     (reduce (fn [[c-keys keys-to-id-map] k]
                                               (if (contains? keys-to-id-map k)
                                                 [c-keys keys-to-id-map]
                                                 [(conj c-keys k)
                                                  (assoc keys-to-id-map k (count c-keys))]))
                                             [c-keys keys-to-id-map] (keys m))
                                     rec ((apply juxt n-keys) m)]
                                 (write-csv-record wrtr rec)
                                 w))
                             [[] {}] seq-of-maps))]
    (sh/sh "sh" :in (str "echo " (apply str (interpose \, (map name all-keys))) " | cat - " output-file ".tmp >" output-file "; rm -vf " output-file ".tmp;"))))
(defn csv-to-map-seq [fname & {:keys [with-header key-map] :or {with-header false}}]
  (let [row-seq (csv/read-csv (io/reader fname))
        all-keys (let [premapped-keys (if with-header (map keyword (first row-seq)) (range 1 1000))]
                   (if-not key-map premapped-keys
                           (map (fn [k] (if (contains? key-map k) (key-map k) k)) premapped-keys)))]
    (map #(zipmap all-keys %)
         (if with-header (rest row-seq) row-seq))))
(defn lazy-join-sorted-map-seqs-with-only-second-map-seq-allowed-to-have-duplicate-fields
  ([s1 s2 f output-generator]
     (lazy-join-sorted-map-seqs-with-only-second-map-seq-allowed-to-have-duplicate-fields s1 s2 f f output-generator))
  ([s1 s2 f1 f2 output-generator]
     (lazy-seq
      (loop [[x & xs :as wx] s1
             [y & [yn :as ys] :as wy] s2]
        ;; Stop as soon as either input is exhausted. Without this check the
        ;; keys of the missing rows are nil, nil compares lower than any other
        ;; value, and the loop never terminates.
        (when (and (seq wx) (seq wy))
          (let [xk (f1 x)
                yk (f2 y)
                ck (compare xk yk)]
            (cond
             (= ck 0) (cons (output-generator x y)
                            (if (and (seq ys) (= (f2 yn) xk))
                              ;; the next row of s2 repeats the key: keep x
                              (lazy-join-sorted-map-seqs-with-only-second-map-seq-allowed-to-have-duplicate-fields wx ys f1 f2 output-generator)
                              (lazy-join-sorted-map-seqs-with-only-second-map-seq-allowed-to-have-duplicate-fields xs ys f1 f2 output-generator)))
             (< ck 0) (recur xs wy)
             :else (recur wx ys))))))))
(defn join-csv-based-on-field-with-only-second-file-allowed-to-have-duplicate-fields [f1 f2 field-key]
  (let [[s1 s2] (map #(csv-to-map-seq % :with-header true) [f1 f2])]
    (lazy-join-sorted-map-seqs-with-only-second-map-seq-allowed-to-have-duplicate-fields s1 s2 field-key merge)))
@sunilnandihalli
Author

join-csv-based-on-field-with-only-second-file-allowed-to-have-duplicate-fields is the main entry point. f1 and f2 are CSV files with headers, both sorted on the field named by field-key...
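
To make the contract above concrete (both inputs sorted on the join key, repeated keys allowed only in the second input), here is a minimal sketch on in-memory maps. The name `merge-join` and the sample data are hypothetical and not part of the gist; this is one possible way to write the merge step, not the author's implementation.

```clojure
(defn merge-join
  "Lazily joins two seqs of maps that are sorted on key-fn.
   Only `right` may contain repeated keys. Hypothetical helper."
  [left right key-fn combine]
  (lazy-seq
   (loop [ls (seq left)
          rs (seq right)]
     ;; stop when either side runs out of rows
     (when (and ls rs)
       (let [l (first ls)
             r (first rs)
             c (compare (key-fn l) (key-fn r))]
         (cond
           (neg? c) (recur (next ls) rs)   ; left key is behind: skip it
           (pos? c) (recur ls (next rs))   ; right key is behind: skip it
           ;; keys match: emit one row, keep the left row for further matches
           :else (cons (combine l r)
                       (merge-join ls (next rs) key-fn combine))))))))

;; made-up sample data, already sorted on :id
(def users  [{:id "1" :name "ann"} {:id "2" :name "bob"} {:id "4" :name "dan"}])
(def orders [{:id "1" :item "pen"} {:id "1" :item "ink"}
             {:id "3" :item "cup"} {:id "4" :item "hat"}])

(def joined (merge-join users orders :id merge))
```

Keys read from a CSV file are strings, so `compare` orders them lexicographically ("10" sorts before "9"). The files need to be sorted in that same order for a join like this to find every match.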

@sunilnandihalli
Author

In my actual code, the return value of the above function is written directly to a file using map-seq-to-csv-with-out-using-write-csv. I just realized that I had forgotten to mention that...
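
As a rough illustration of streaming a lazy sequence of maps to a file one row at a time, here is a sketch that assumes the column set is known up front, which is not the case in the gist (it discovers keys as it goes and prepends the header afterwards through a shell command). `write-maps-as-csv` is a hypothetical name, and like the gist's writer it does no quoting or escaping of field values.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn write-maps-as-csv
  "Writes a header line, then one line per map in rows.
   Hypothetical helper; assumes columns is known in advance."
  [rows columns out-file]
  (with-open [^java.io.Writer w (io/writer out-file)]
    (.write w (str (str/join "," (map name columns)) "\n"))
    ;; doseq consumes rows one by one and does not hold on to the head
    (doseq [row rows]
      (.write w (str (str/join "," (map #(get row % "") columns)) "\n")))))
```

Because the rows are consumed inside `with-open`, the writer is closed only after the whole sequence has been written.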

@halgari

halgari commented Feb 28, 2012

May I be the first to say "Holy long function names, Batman!". Seriously, this is what namespaces are for...
