@fmw
Created April 30, 2013 17:05
fast way to parse huge XML documents in Clojure
(ns xmltest.core
  (:require [clojure.data.xml :as xml])
  (:import [java.io FileInputStream]
           [java.util.zip GZIPInputStream]))

(defn parse
  "Lazily parses the XML file at filename into a tree of elements.
  The FileInputStream opened here is not closed by this function."
  [filename]
  (xml/parse (FileInputStream. filename)))

(defn parse-gzipped
  "Like parse, but for a gzip-compressed XML file."
  [filename]
  (xml/parse (GZIPInputStream. (FileInputStream. filename))))

(defn get-title-values-from-file
  "Takes a parsed tree (not a filename) and returns a lazy sequence
  with the text of the first <title> child of each child of the root."
  [tree]
  (map (fn [page]
         (->> (filter #(= (:tag %) :title) (:content page))
              (first)
              (:content)
              (apply str)))
       (:content tree)))

(comment
  (require '[xmltest.core :as c])

  ;; big.xml.gz is a gzipped file containing a billion <page> tags,
  ;; with a compressed size of 234M (original is 3.4GB).
  (->> (c/parse-gzipped "/home/fmw/clj/xmltest/big.xml.gz")
       (c/get-title-values-from-file)
       (take 100000))) ;; remove (take 100000) to get the full sequence
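
;; A minimal sketch, not part of the original gist: parse and
;; parse-gzipped never close the FileInputStream they open. One
;; possible alternative is to own the stream with with-open and hand
;; the lazy title sequence to a caller-supplied function. The name
;; with-gzipped-titles and its parameter f are hypothetical. f has to
;; consume everything it needs before it returns (e.g. count, or doall
;; around take), because the stream is closed as soon as f returns.
;; Relies on the xml, FileInputStream and GZIPInputStream references
;; from the ns form at the top of this file.
(defn with-gzipped-titles
  [filename f]
  (with-open [in (GZIPInputStream. (FileInputStream. filename))]
    (f (get-title-values-from-file (xml/parse in)))))

(comment
  ;; Realize the first 100000 titles, then close the file.
  (xmltest.core/with-gzipped-titles "/home/fmw/clj/xmltest/big.xml.gz"
    #(doall (take 100000 %))))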