@jmindek
Last active August 29, 2015 13:57
Fixing US Census Gazetteer file formatting with Clojure
;; The US Census Bureau publishes a lot of interesting data.
;; Currently, I am interested in latitudes and longitudes for cities.
;; The Gazetteer file provides that for cities in the US.
;; You can grab the 2010 data here - http://www.census.gov/geo/maps-data/data/gazetteer2010.html
;; At first glance the contents of the file look well formatted.
;; However, there are some minor formatting issues, such as whitespace padding around the
;; tab-separated fields and Windows line endings, that will cause problems when importing
;; this data into a data store.
;; Here are two small Clojure functions to fix those issues and output a well-formatted file.
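(ns utils.core
  (:require [clojure.string :as cs]))

(defn fix-census-gazetteer-files
  "Reads in-file and returns its contents as a comma-separated string."
  [in-file]
  (-> (slurp in-file)
      ;; Quote names that contain a comma so the comma is not read as a delimiter.
      (cs/replace #"(\w+,(?:\s[A-Z][a-z]+(?:\sof)?)+)" "\"$1\"")
      ;; Drop "CDP" and lowercase words (such as "city") along with the whitespace before them.
      (cs/replace #"\s+(CDP|[a-z]+)" "")
      ;; Turn each run of tabs, plus any whitespace after it, into a single comma.
      (cs/replace #"\t+(\s?)+" ",")
      ;; Remove whitespace that is followed by a digit (the digit is removed too).
      (cs/replace #"\s+\d" "")
      ;; Strip trailing whitespace and convert CRLF line endings to LF.
      (cs/replace #"(\s?)+\r\n" "\n")))

(defn write-fixed-census-gazetteer-file
  "Writes string to out-file."
  [out-file string]
  (spit out-file string))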
;; Run together to fix the downloaded file and output the well-formatted file.
utils.core> (write-fixed-census-gazetteer-file "/tmp/fix-gazetteer.txt" (fix-census-gazetteer-files "/home/jmindek/Downloads/Gaz_places_national.txt"))
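;; A quick way to sanity-check the replacements without downloading the file is to run
;; them on a small string at the REPL. The sketch below is only an illustration:
;; fix-gazetteer-text is a hypothetical helper (not part of the original functions) that
;; applies the same five replacements to a string, and sample-lines is made-up,
;; simplified input with fewer columns than the real Gazetteer file.
(require '[clojure.string :as cs])

(defn fix-gazetteer-text
  "Hypothetical string-level variant of fix-census-gazetteer-files."
  [text]
  (-> text
      (cs/replace #"(\w+,(?:\s[A-Z][a-z]+(?:\sof)?)+)" "\"$1\"")
      (cs/replace #"\s+(CDP|[a-z]+)" "")
      (cs/replace #"\t+(\s?)+" ",")
      (cs/replace #"\s+\d" "")
      (cs/replace #"(\s?)+\r\n" "\n")))

;; Two assumed rows: tab-separated, CRLF line endings, trailing spaces on the second row.
(def sample-lines
  (str "AL\t0100100\tAbanda CDP\t33.091627\t-85.527029\r\n"
       "AL\t0100124\tAbbeville city\t31.566367\t-85.251300  \r\n"))

(fix-gazetteer-text sample-lines)
;; => "AL,0100100,Abanda,33.091627,-85.527029\nAL,0100124,Abbeville,31.566367,-85.251300\n"
;; Working on a string keeps the check independent of slurp and spit, so each regex can be
;; tried against a suspicious row before the whole file is rewritten.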