Created
April 10, 2015 18:13
refactoring some code
(ns dataprep.core
  (:require [clojure.java.io :as io]
            [clojure.string :as s]
            [clojure.data.csv :as csv]
            [semantic-csv.core :as sc]
            [clj-http.lite.client :as http]))

(defn clean-latlon
  "Strips every character except digits, '.', and '-' from a latitude or
  longitude string."
  [angle]
  (s/replace angle #"[^0-9\.\-]" ""))

(defn split-point
  "Takes the :Address value from a record and splits it up. Returns a map
  containing the address and two new latitude and longitude fields."
  [address]
  (let [[addr-part lat-part] (s/split address #"\n\(")
        [lat long] (s/split lat-part #",\s")]
    {:address addr-part
     :latitude (clean-latlon lat)
     :longitude (clean-latlon long)}))

(defn record->restaurant [record]
  (merge (split-point (:Address record))
         {:facility_id (get record (keyword "Facility ID"))
          :name (get record (keyword "Restaurant Name"))
          :zip (get record (keyword "Zip Code"))}))

(defn record->inspection [record]
  {:date (get record (keyword "Inspection Date"))
   :score (get record :Score)
   :facility_id (get record (keyword "Facility ID"))
   :description (get record (keyword "Process Description"))})

(defn read-input-csv [filename]
  (with-open [in-file (io/reader filename)]
    ;; doall forces the lazy rows to be read before the reader is closed
    (doall
     (->> (csv/read-csv in-file)
          (sc/remove-comments)
          (sc/mappify)))))

(defn write-csv [filename dataset]
  (with-open [writer (io/writer filename)]
    (csv/write-csv writer (sc/vectorize dataset))))

(defn split-csv [in restaurant-output inspection-output]
  (let [records (read-input-csv in)
        restaurants (map record->restaurant records)
        inspections (map record->inspection records)]
    (write-csv restaurant-output restaurants)
    (write-csv inspection-output inspections)))
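
A note on the `doall` in `read-input-csv`: `csv/read-csv` returns a lazy sequence, so without `doall` the rows would only be read after `with-open` had already closed the reader. The sketch below shows the same effect with nothing but `line-seq` and an in-memory reader. The names `read-lines-lazy` and `read-lines-eager` are hypothetical and exist only for this illustration.

```clojure
(import '(java.io BufferedReader StringReader))

;; Hypothetical helper: returns the lazy seq without realizing it.
;; line-seq reads the first line eagerly; the rest is read on demand.
(defn read-lines-lazy [text]
  (with-open [rdr (BufferedReader. (StringReader. text))]
    (line-seq rdr)))

;; Hypothetical helper: doall realizes every line while the reader is open.
(defn read-lines-eager [text]
  (with-open [rdr (BufferedReader. (StringReader. text))]
    (doall (line-seq rdr))))

;; Safe: all lines were read inside with-open.
(read-lines-eager "a\nb\nc")

;; Unsafe: walking the lazy seq after the reader is closed throws.
(try
  (doall (read-lines-lazy "a\nb\nc"))
  (catch java.io.IOException e
    (println "reader was already closed:" (.getMessage e))))
```

The same reasoning applies to any lazy pipeline built on a resource opened with `with-open`: realize it inside the form, or do all of the processing there.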
Thanks a lot for this. Regarding your point about mappify's use of keywords, I discovered this today:
dataprep.core=> (doc sc/mappify)
-------------------------
semantic-csv.core/mappify
([rows] [{:keys [keyify header structs], :or {keyify true}, :as opts} rows])
Takes a sequence of row vectors, as commonly produced by csv parsing libraries, and returns
a sequence of maps. By default, the first row vector will be interpreted as a header, and
used as the keys for the maps. However, this and other behaviour are customizable via an
optional `opts` map with the following options:
* `:keyify` - bool; specify whether header/column names should be turned into keywords (default: `true`).
* `:header` - specify the header to use for map keys, preventing first row of data from being consumed as header.
* `:structs` - bool; use structs instead of hash-maps or array-maps, for performance boost (default: `false`).
So I just changed line 38 to (sc/mappify {:keyify false})
and then we can clean up all the keyword usage. Thanks a ton.
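
A rough sketch of what that cleanup could look like, assuming `:keyify false` makes `mappify` return maps keyed by the raw header strings, as the docstring above suggests. Note that `:Address` and `:Score` would then become `"Address"` and `"Score"` as well. The sample record and its values are made up for illustration.

```clojure
;; Hypothetical record, shaped like a row from (sc/mappify {:keyify false} rows).
;; All values here are invented sample data.
(def sample-record
  {"Facility ID"         "0000001"
   "Restaurant Name"     "Example Cafe"
   "Zip Code"            "78701"
   "Inspection Date"     "04/10/2015"
   "Score"               "95"
   "Process Description" "Routine Inspection"})

;; With string keys, the (keyword "...") wrappers are no longer needed.
(defn record->inspection [record]
  {:date        (get record "Inspection Date")
   :score       (get record "Score")
   :facility_id (get record "Facility ID")
   :description (get record "Process Description")})

(record->inspection sample-record)
```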
Here's an initial refactor.

I broke up `split-csv` to extract some of the IO. I think it reads better, but I did have to add the `doall` in `read-input-csv` to make sure all the input was consumed within the `with-open`.

I also split `rebuild-records` into `record->restaurant` and `record->inspection` since there was no interaction there and we wanted to use them separately. After seeing the refactor, there's no reason you couldn't put them back together again using these functions. Note: `(map (fn [row] (rebuild-records row)) _ROWS_)` is the same as `(map rebuild-records _ROWS_)`.

I really dislike the use of keywords by `mappify`. Rather than rewrite it, I've used the `get` form, which makes the code clearer here and lets you move away from the keyword form later. I would definitely do that if I had a few more minutes here.

I changed `split-point` to use destructuring to clarify your intent. This code was the only part that I felt was bad. I still don't love this code. I'd use a regexp here instead of `split`. I also moved the `clean-latlon` call into here so that the data would be a good base for `record->restaurant`.
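
For what it's worth, here is one way the regexp version of `split-point` might look. This is only a sketch: it assumes the `:Address` value is the street address followed by a newline and `(lat, long)`, which is inferred from the two `split` patterns in the gist, not from the real data. The names `point-pattern` and `split-point-re` are hypothetical.

```clojure
;; Assumed input shape: "<address, possibly multi-line>\n(<lat>, <long>)".
;; (?s) lets . match newlines so multi-line addresses are captured whole.
(def point-pattern
  #"(?s)(.*?)\n\((-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?)\)\s*")

(defn split-point-re
  "Returns a map of :address, :latitude and :longitude, or nil when the
  input does not match the assumed shape."
  [address]
  (when-let [[_ addr lat lon] (re-matches point-pattern address)]
    {:address   addr
     :latitude  lat
     :longitude lon}))

;; Invented sample input:
(split-point-re "1 Example St\nAustin, TX\n(30.25, -97.75)")
```

Because the pattern only captures well-formed numbers, a separate `clean-latlon` pass is not needed here, and a malformed address yields `nil` rather than an exception from the second `split`.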