Parsing Wiktionary `bz2-reader` borrowed from: http://www.paullegato.com/blog/reading-bzip2-files-clojure/
(ns wiktionary.core
  (:gen-class)
  (:import
   [org.apache.commons.compress.compressors.bzip2 BZip2CompressorInputStream])
  (:require
   [clojure.data.xml :as xml]
   [clojure.java.io :as io]))

(defn bz2-reader
  "Returns a streaming Reader for a compressed bzip2 file."
  [filename]
  (-> filename
      io/file
      io/input-stream
      BZip2CompressorInputStream.
      io/reader))

(defn parse
  "Returns a lazy tree of Element records by parsing the XML in filename."
  [filename]
  (xml/parse (io/reader filename)))

(defn parse-bz2
  "Returns a lazy tree of Element records by parsing the XML from filename,
  which must be in bzip2 format."
  [filename]
  (xml/parse (bz2-reader filename)))

(defn title-from-page
  "Returns the title of a page element as a single string."
  [page]
  (->> (filter #(= (:tag %) :title) (:content page))
       (first)
       (:content)
       (apply str)))

(defn titles-seq
  "Returns a lazy sequence of title strings, one per child element of tree."
  [tree]
  (map title-from-page (:content tree)))

(def xml-file "/Volumes/Extra/wiktionary/wiktionary-en-all.xml")
(def bz2-file "/Volumes/Extra/wiktionary/wiktionary-en-all.xml.bz2")

(comment
  (take 1 (line-seq (io/reader xml-file)))
  ; ("<mediawiki xmlns=\"http://www.mediawiki.org/xml/export-0.8/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd\" version=\"0.8\" xml:lang=\"en\">")

  (last (line-seq (io/reader xml-file)))
  ; "</mediawiki>"

  (count (line-seq (io/reader xml-file)))
  ; 115498403

  (take 1 (line-seq (bz2-reader bz2-file)))
  ; ("<mediawiki xmlns=\"http://www.mediawiki.org/xml/export-0.8/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd\" version=\"0.8\" xml:lang=\"en\">")

  (last (line-seq (bz2-reader bz2-file)))
  ; " </siteinfo>"
  ; *** Why is this different from above? ***

  (count (line-seq (bz2-reader bz2-file)))
  ; 49

  (last (take 360 (titles-seq (parse xml-file))))
  ; "abode"

  (last (take 52002 (titles-seq (parse xml-file))))
  ; "splendid"

  (last (take 360 (titles-seq (parse-bz2 bz2-file))))
  ; XMLStreamException ParseError at [row,col]:[50,1]
  ; Message: XML document structures must start and end within the same entity. com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next (XMLStreamReaderImpl.java:594)

  (last (take 52002 (titles-seq (parse-bz2 bz2-file))))
  ; XMLStreamException ParseError at [row,col]:[50,1]
  ; Message: XML document structures must start and end within the same entity. com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next (XMLStreamReaderImpl.java:594)
  )

(defproject wiktionary "0.1.0-SNAPSHOT"
  :description "Parse Wiktionary XML Data Dump"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [org.clojure/data.xml "0.0.7"]
                 [org.apache.commons/commons-compress "1.5"]]
  :main wiktionary.core)

This is the fix. Note the use of the [second form of the BZip2CompressorInputStream constructor](http://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/compressors/bzip2/BZip2CompressorInputStream.html#BZip2CompressorInputStream%28java.io.InputStream,%20boolean%29), where `decompressConcatenated` is set to `true`:

(defn bz2-reader
  "Returns a streaming Reader for a compressed bzip2 file."
  [filename]
  (-> filename
      io/file
      io/input-stream
      (BZip2CompressorInputStream. true)
      io/reader))
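
The symptoms in the original file (49 lines read, output ending at `</siteinfo>`) are consistent with a dump made of several bzip2 streams concatenated back to back, although the gist does not confirm how the file was produced. The sketch below reproduces that behaviour in memory with two tiny streams. The namespace `wiktionary.concat-demo` and the helpers `bzip2-bytes`, `concat-bytes`, and `decompress` are hypothetical names introduced only for this illustration; the only library classes assumed are `BZip2CompressorInputStream` and `BZip2CompressorOutputStream` from commons-compress.

```clojure
(ns wiktionary.concat-demo
  (:import
   [java.io ByteArrayInputStream ByteArrayOutputStream]
   [org.apache.commons.compress.compressors.bzip2
    BZip2CompressorInputStream BZip2CompressorOutputStream]))

(defn bzip2-bytes
  "Compresses string s into one complete bzip2 stream (hypothetical helper)."
  [^String s]
  (let [out  (ByteArrayOutputStream.)
        data (.getBytes s "UTF-8")]
    ;; Closing the compressor finishes the stream and writes its trailer.
    (with-open [bz (BZip2CompressorOutputStream. out)]
      (.write bz data 0 (alength data)))
    (.toByteArray out)))

(defn concat-bytes
  "Joins byte arrays end to end, like `cat a.bz2 b.bz2` (hypothetical helper)."
  [& arrays]
  (let [out (ByteArrayOutputStream.)]
    (doseq [^bytes a arrays]
      (.write out a 0 (alength a)))
    (.toByteArray out)))

(defn decompress
  "Decompresses data to a string. When concatenated? is false, reading stops
  at the end of the first bzip2 stream (hypothetical helper)."
  [^bytes data concatenated?]
  (with-open [in (BZip2CompressorInputStream.
                  (ByteArrayInputStream. data)
                  (boolean concatenated?))]
    (slurp in :encoding "UTF-8")))

(def two-streams
  (concat-bytes (bzip2-bytes "first\n") (bzip2-bytes "second\n")))

(decompress two-streams false) ; only the first stream is read
(decompress two-streams true)  ; both streams are read
```

With `false`, which matches the one-argument constructor used in the original `bz2-reader`, only the text from the first stream comes back, so an XML parser sees a document that ends early.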

Just to see if it was the Reader aspect of `bz2-reader`, I tried the following. I modified the import:
(:import [java.io FileInputStream] [org.apache.commons.compress.compressors.bzip2 BZip2CompressorInputStream])
And added:
But there is no substantive change: