@orb
Created April 10, 2015 18:13
refactoring some code
(ns dataprep.core
  (:require [clojure.java.io :as io]
            [clojure.string :as s]
            [clojure.data.csv :as csv]
            [semantic-csv.core :as sc]
            [clj-http.lite.client :as http]))
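
;; The docstring goes before the argument vector; after it, the string is
;; just an expression that gets evaluated and thrown away.
(defn clean-latlon
  "Strips every character except digits, '.', and '-' from a latitude or
  longitude string."
  [angle]
  (s/replace angle #"[^0-9\.\-]" ""))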

(defn split-point
  "Takes the :Address value from a record and splits it up. Returns a map
  containing the address plus new :latitude and :longitude fields."
  [address]
  (let [[addr-part lat-part] (s/split address #"\n\(")
        ;; named lon rather than long to avoid shadowing clojure.core/long
        [lat lon] (s/split lat-part #",\s")]
    {:address addr-part
     :latitude (clean-latlon lat)
     :longitude (clean-latlon lon)}))

(defn record->restaurant [record]
  (merge (split-point (:Address record))
         {:facility_id (get record (keyword "Facility ID"))
          :name (get record (keyword "Restaurant Name"))
          :zip (get record (keyword "Zip Code"))}))

(defn record->inspection [record]
  {:date (get record (keyword "Inspection Date"))
   :score (get record :Score)
   :facility_id (get record (keyword "Facility ID"))
   :description (get record (keyword "Process Description"))})

(defn read-input-csv [filename]
  (with-open [in-file (io/reader filename)]
    (doall
     (->> (csv/read-csv in-file)
          (sc/remove-comments)
          (sc/mappify)))))

(defn write-csv [filename dataset]
  (with-open [writer (io/writer filename)]
    (csv/write-csv writer (sc/vectorize dataset))))

(defn split-csv [in restaurant-output inspection-output]
  (let [records (read-input-csv in)
        restaurants (map record->restaurant records)
        inspections (map record->inspection records)]
    (write-csv restaurant-output restaurants)
    (write-csv inspection-output inspections)))

@orb

orb commented Apr 10, 2015

Here's an initial refactor.

I broke up split-csv to extract some of the IO. I think it reads better, but I did have to add the doall in read-input-csv to make sure all the input was consumed within the with-open.
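
To make the doall point concrete, here is a minimal sketch that uses an in-memory reader instead of a file. The names lazy-lines and eager-lines are made up for illustration and are not part of the gist; line-seq stands in for the lazy CSV parsing.

```clojure
(import '(java.io BufferedReader StringReader))

;; lazy-lines and eager-lines are hypothetical helpers, not part of the gist.
(defn lazy-lines [text]
  (with-open [rdr (BufferedReader. (StringReader. text))]
    ;; The lazy seq escapes with-open; the reader is closed before
    ;; the rest of the lines have been read.
    (line-seq rdr)))

(defn eager-lines [text]
  (with-open [rdr (BufferedReader. (StringReader. text))]
    ;; doall forces every line while the reader is still open.
    (doall (line-seq rdr))))

(eager-lines "a\nb\nc")
;; Realizing (lazy-lines "a\nb\nc") past its first element throws,
;; because the reader is already closed by then.
```

The same reasoning applies to read-input-csv: without the doall, the records would only be read once split-csv started mapping over them, after the file had been closed.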

I also split rebuild-records into record->restaurant and record->inspection since there was no interaction there and we wanted to use them separately. After seeing the refactor, there's no reason you couldn't put them back together again using these functions. Note: (map (fn [row] (rebuild-records row)) _ROWS_) is the same as (map rebuild-records _ROWS_)

I really dislike the use of keywords by mappify. Rather than rewrite it, I've used the get form, which makes the code clearer here and lets you move away from the keyword form later. I would definitely do that if I had a few more minutes here.
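
A small sketch of the problem with keywords built from header text. The record below is made-up sample data; it only assumes keys shaped like the ones used in the gist.

```clojure
;; Sample record; the values are invented for illustration.
(def record {(keyword "Facility ID") "2-1234"
             :Score                  "95"})

;; The space in the header means there is no way to write this key as a
;; keyword literal, so it has to be built with (keyword ...) and looked up
;; with get.
(get record (keyword "Facility ID"))

;; Using get for the ordinary keys too keeps every lookup in one shape.
(get record :Score)

;; Writing :Facility ID reads as the keyword :Facility followed by the
;; symbol ID, and :Facility alone is not a key in the record.
(:Facility record)
```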

I changed split-point to use destructuring to clarify your intent. This code was the only part that I felt was bad. I still don't love this code. I'd use a regexp here instead of split. I also moved the clean-latlon call into here so that the data would be a good base for record->restaurant.
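
Here is a rough sketch of what the regexp version could look like. It assumes the address field ends with a newline followed by "(lat, long)", which is what the split patterns in split-point imply. The name split-point-re and the sample address are made up for illustration.

```clojure
;; Hypothetical alternative to split-point; not part of the gist.
;; Assumes the input looks like "street and city lines\n(lat, long)".
(defn split-point-re [address]
  ;; (?s) lets . match newlines, so a multi-line address stays in one group.
  (when-let [[_ addr lat lon]
             (re-matches #"(?s)(.*)\n\(\s*(-?[\d.]+),\s*(-?[\d.]+)\s*\)\s*"
                         address)]
    {:address   addr
     :latitude  lat
     :longitude lon}))

;; Invented sample input.
(split-point-re "1234 MAIN ST\nAUSTIN, TX 78701\n(30.2672, -97.7431)")
```

Because the pattern only captures digits, dots, and a leading minus sign, the separate clean-latlon pass is not needed in this version, and input that doesn't match returns nil instead of failing partway through the destructuring.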

@nickmcdonnough

Thanks a lot for this. Regarding your point about the use of keywords by mappify, I discovered this today:

dataprep.core=> (doc sc/mappify)
-------------------------
semantic-csv.core/mappify
([rows] [{:keys [keyify header structs], :or {keyify true}, :as opts} rows])
  Takes a sequence of row vectors, as commonly produced by csv parsing libraries, and returns
  a sequence of maps. By default, the first row vector will be interpreted as a header, and
  used as the keys for the maps. However, this and other behaviour are customizable via an
  optional `opts` map with the following options:

  * `:keyify` - bool; specify whether header/column names should be turned into keywords (default: `true`).
  * `:header` - specify the header to use for map keys, preventing first row of data from being consumed as header.
  * `:structs` - bool; use structs instead of hash-maps or array-maps, for performance boost (default: `false`).

So I just changed line 38, the (sc/mappify) call in read-input-csv, to (sc/mappify {:keyify false}), and then we can clean up all the keyword usage. Thanks a ton.
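
For reference, a sketch of how the two converters might look once the keys are plain strings. It assumes records come back as string-keyed maps, as the :keyify option above describes, and it leaves the address handling out to stay short. The sample record is made up.

```clojure
;; Assumes string-keyed records, as produced with {:keyify false}.
;; Note that :Score and :Address become "Score" and "Address" as well.
(defn record->inspection [record]
  {:date        (get record "Inspection Date")
   :score       (get record "Score")
   :facility_id (get record "Facility ID")
   :description (get record "Process Description")})

;; Address splitting omitted here; only the renamed fields are shown.
(defn record->restaurant-fields [record]
  {:facility_id (get record "Facility ID")
   :name        (get record "Restaurant Name")
   :zip         (get record "Zip Code")})

;; Invented sample record.
(def sample {"Inspection Date"     "04/10/2015"
             "Score"               "95"
             "Facility ID"         "2-1234"
             "Process Description" "Routine Inspection"
             "Restaurant Name"     "Example Cafe"
             "Zip Code"            "78701"})

(record->inspection sample)
(record->restaurant-fields sample)
```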
