@fmw
Created April 30, 2013 17:05
quick way to parse huge xml files in Clojure

(ns xmltest.core
  (:require [clojure.data.xml :as xml])
  (:import [java.io FileInputStream]
           [java.util.zip GZIPInputStream]))

(defn parse [filename]
  (xml/parse (FileInputStream. filename)))

(defn parse-gzipped [filename]
  (xml/parse (GZIPInputStream. (FileInputStream. filename))))

(defn get-title-values-from-file
  [tree]
  (map (fn [page]
         (->> (filter #(= (:tag %) :title) (:content page))
              (first)
              (:content)
              (apply str)))
       (:content tree)))

(comment
  (require '[xmltest.core :as c])

  ;; big.xml.gz is a gzipped file containing a billion <page> tags,
  ;; with a compressed size of 234M (original is 3.4GB).
  (->> (c/parse-gzipped "/home/fmw/clj/xmltest/big.xml.gz")
       (c/get-title-values-from-file)
       (take 100000))) ;; remove (take 100000) to get the full sequence

@bluemont

Thanks. clojure.data.xml seems promising.

I'm currently trying out with-open and clojure.java.io:

(ns example.core
  (:require [clojure.data.xml :as xml]
            [clojure.java.io :as io])) 

(with-open [rdr (io/reader filename)]
  (xml/parse rdr))

I've seen some errors like:

XMLStreamException ParseError at [row,col]:[89,245]
Message: Stream closed  com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next (XMLStreamReaderImpl.java:592)

@bluemont

Strangely, your way, using (FileInputStream. filename), worked, but my way, using with-open and io/reader, did not.

@bluemont

Does your code close the file when it is done? My understanding is that with-open handles that.
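
A likely explanation for the "Stream closed" error and the closing question (a sketch, not something the gist author confirmed): xml/parse is lazy, so with-open closes the stream as soon as its body returns, before most of the document has been read. The next attempt to read further then fails with "Stream closed". The bare (FileInputStream. filename) version works because nothing in the gist closes the stream explicitly. One way to get both lazy parsing and a closed file is to realize the values you need inside with-open. In the sketch below, title-of and titles-from-file are hypothetical names, and the :page and :title tags are assumed from the gist's description of the file.

```clojure
(ns example.titles
  (:require [clojure.data.xml :as xml]
            [clojure.java.io :as io]))

(defn title-of
  "Returns the concatenated text of the first <title> child of page."
  [page]
  (->> (:content page)
       (filter #(= (:tag %) :title))
       first
       :content
       (apply str)))

(defn titles-from-file
  "Hypothetical helper: parses filename lazily and realizes at most n
  page titles before with-open closes the underlying stream."
  [filename n]
  (with-open [in (io/input-stream filename)]
    (->> (xml/parse in)
         :content
         ;; keep only <page> elements (assumed tag name); this also
         ;; drops any whitespace-only text nodes between them
         (filter #(= (:tag %) :page))
         (map title-of)
         (take n)
         ;; force realization while the stream is still open
         doall)))
```

For a gzipped file, wrapping the stream as in the gist, (GZIPInputStream. (io/input-stream filename)), should work the same way. Note that doall keeps every realized title in memory, so for the full file it may be preferable to reduce over the titles or write them out inside the with-open body.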
