@bluemont
Created April 30, 2013 21:57
Parsing Wiktionary

`bz2-reader` is borrowed from: http://www.paullegato.com/blog/reading-bzip2-files-clojure/
(ns wiktionary.core
  (:gen-class)
  (:import
    [org.apache.commons.compress.compressors.bzip2 BZip2CompressorInputStream])
  (:require
    [clojure.data.xml :as xml]
    [clojure.java.io :as io]))

(defn bz2-reader
  "Returns a streaming Reader for a compressed bzip2 file."
  [filename]
  (-> filename
      io/file
      io/input-stream
      BZip2CompressorInputStream.
      io/reader))

(defn parse
  "Returns a lazy tree of Element records by parsing the XML in filename."
  [filename]
  (xml/parse (io/reader filename)))

(defn parse-bz2
  "Returns a lazy tree of Element records by parsing the XML from filename,
  which must be in bzip2 format."
  [filename]
  (xml/parse (bz2-reader filename)))
(defn title-from-page
  "Returns the title of a page element as a string: the text content of its
  first :title child, or \"\" if it has none."
  [page]
  (->> (filter #(= (:tag %) :title) (:content page))
       (first)
       (:content)
       (apply str)))

(defn titles-seq
  "Returns a lazy sequence of title strings, one per child element of tree."
  [tree]
  (map title-from-page (:content tree)))
(def xml-file "/Volumes/Extra/wiktionary/wiktionary-en-all.xml")
(def bz2-file "/Volumes/Extra/wiktionary/wiktionary-en-all.xml.bz2")
(comment
  (take 1 (line-seq (io/reader xml-file)))
  ; ("<mediawiki xmlns=\"http://www.mediawiki.org/xml/export-0.8/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd\" version=\"0.8\" xml:lang=\"en\">")

  (last (line-seq (io/reader xml-file)))
  ; "</mediawiki>"

  (count (line-seq (io/reader xml-file)))
  ; 115498403

  (take 1 (line-seq (bz2-reader bz2-file)))
  ; ("<mediawiki xmlns=\"http://www.mediawiki.org/xml/export-0.8/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd\" version=\"0.8\" xml:lang=\"en\">")

  (last (line-seq (bz2-reader bz2-file)))
  ; "  </siteinfo>"
  ; *** Why is this different from above? ***

  (count (line-seq (bz2-reader bz2-file)))
  ; 49

  (last (take 360 (titles-seq (parse xml-file))))
  ; "abode"

  (last (take 52002 (titles-seq (parse xml-file))))
  ; "splendid"

  (last (take 360 (titles-seq (parse-bz2 bz2-file))))
  ; XMLStreamException ParseError at [row,col]:[50,1]
  ; Message: XML document structures must start and end within the same entity. com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next (XMLStreamReaderImpl.java:594)

  (last (take 52002 (titles-seq (parse-bz2 bz2-file))))
  ; XMLStreamException ParseError at [row,col]:[50,1]
  ; Message: XML document structures must start and end within the same entity. com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next (XMLStreamReaderImpl.java:594)
  )
(defproject wiktionary "0.1.0-SNAPSHOT"
  :description "Parse Wiktionary XML Data Dump"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [org.clojure/data.xml "0.0.7"]
                 [org.apache.commons/commons-compress "1.5"]]
  :main wiktionary.core)
@bluemont

To check whether wrapping the stream in a Reader inside bz2-reader was the cause, I tried passing the raw decompressing input stream to the parser instead.

I modified the import:

  (:import
    [java.io FileInputStream]
    [org.apache.commons.compress.compressors.bzip2 BZip2CompressorInputStream])

And added:

(defn bz2-input-stream
  "Returns an input stream for a compressed bzip2 file."
  [filename]
  (-> filename
      FileInputStream.
      BZip2CompressorInputStream.))

(defn parse-bz2-2
  "Returns a lazy tree of Element records by parsing the XML from filename,
   which must be in bzip2 format."
  [filename]
  (xml/parse (bz2-input-stream filename)))

But there is no substantive change:

(last (line-seq (io/reader (bz2-input-stream bz2-file))))
; "  </siteinfo>"
; *** Why is this different from above? ***

(last (take 360 (titles-seq (parse-bz2-2 bz2-file))))
; XMLStreamException ParseError at [row,col]:[50,1]
; Message: XML document structures must start and end within the same entity.  com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next (XMLStreamReaderImpl.java:594)

@bluemont

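This is the fix. Note the use of the [second form of the BZip2CompressorInputStream constructor](http://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/compressors/bzip2/BZip2CompressorInputStream.html#BZip2CompressorInputStream%28java.io.InputStream,%20boolean%29), where decompressConcatenated is set to true. The one-argument constructor stops after the first bzip2 stream in its input, which would explain why only 49 lines (ending at `</siteinfo>`) were read: the dump file evidently consists of several concatenated bzip2 streams.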

(defn bz2-reader
  "Returns a streaming Reader for a compressed bzip2 file."
  [filename]
  (-> filename
      io/file
      io/input-stream
      (BZip2CompressorInputStream. true)
      io/reader))
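
The behaviour can be reproduced without the Wiktionary dump. The following is a minimal sketch, assuming the same commons-compress dependency as in the project file; the namespace `wiktionary.concat-demo` and the helpers `bz2-bytes`, `concat-bytes`, `decompress`, and `data` are hypothetical names introduced only for this illustration. It builds a byte array holding two bzip2 streams back to back and decompresses it with `decompressConcatenated` set to false and to true.

```clojure
(ns wiktionary.concat-demo
  (:import
    [java.io ByteArrayInputStream ByteArrayOutputStream]
    [org.apache.commons.compress.compressors.bzip2
     BZip2CompressorInputStream BZip2CompressorOutputStream]))

;; All names in this namespace are illustrative, not part of the gist.

(defn bz2-bytes
  "Compresses the string s into one complete bzip2 stream (a byte array)."
  [^String s]
  (let [baos (ByteArrayOutputStream.)
        bs   (.getBytes s "UTF-8")]
    ;; Closing the compressor writes the end-of-stream marker.
    (with-open [out (BZip2CompressorOutputStream. baos)]
      (.write out bs 0 (alength bs)))
    (.toByteArray baos)))

(defn concat-bytes
  "Concatenates byte arrays into a single byte array."
  [& arrays]
  (let [baos (ByteArrayOutputStream.)]
    (doseq [^bytes a arrays]
      (.write baos a 0 (alength a)))
    (.toByteArray baos)))

(defn decompress
  "Decompresses data to a string. When concatenated? is false, decoding
  stops at the end of the first bzip2 stream."
  [^bytes data concatenated?]
  (with-open [in (BZip2CompressorInputStream.
                   (ByteArrayInputStream. data)
                   (boolean concatenated?))]
    (slurp in :encoding "UTF-8")))

;; Two independent bzip2 streams, back to back.
(def data
  (concat-bytes (bz2-bytes "first\n") (bz2-bytes "second\n")))

(comment
  (decompress data false)
  ; "first\n"
  (decompress data true)
  ; "first\nsecond\n"
  )
```

With the flag set to false the second stream is silently dropped, which mirrors the truncated XML that the parser rejected at row 50.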
