parsing wikipedia dumps in clojure

(ns wikiparse.core
  (:require [clojure.java.io :as io]
            [clojure.data.xml :as xml]
            [clojure.zip :refer [xml-zip]]
            [clojure.data.zip.xml :refer [xml-> xml1-> text]])
  (:import [org.apache.commons.compress.compressors.bzip2 BZip2CompressorInputStream])
  (:gen-class :main true))
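
;; A hedged sketch (not part of the original gist) of the Leiningen project
;; the ns form above implies; the coordinates are the real artifacts, but the
;; version numbers are placeholders rather than the ones the author used.
(comment
  (defproject wikiparse "0.1.0-SNAPSHOT"
    :dependencies [[org.clojure/clojure "1.10.1"]
                   [org.clojure/data.xml "0.0.8"]
                   [org.clojure/data.zip "1.0.0"]
                   [org.apache.commons/commons-compress "1.21"]]
    :main wikiparse.core))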

(defn bz2-reader
  "Returns a streaming Reader for the given compressed BZip2
  file. Use within (with-open)."
  [filename]
  (-> filename io/file io/input-stream BZip2CompressorInputStream. io/reader))
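
;; Hedged REPL sketch (not in the original gist): sanity-check the reader by
;; pulling the first line of the compressed dump; the filename matches the
;; wikifile def further down.
(comment
  (with-open [rdr (bz2-reader "enwiki-latest-pages-articles.xml.bz2")]
    (.readLine rdr)))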

(defn process-music-artist-page
  "Process a wikipedia page, printing the title if it's a musical artist."
  [page]
  (let [z (xml-zip page)
        title (xml1-> z :title text)
        page-text (xml1-> z :revision :text text)]
    ;; (when ...) rather than a one-armed (if ...); the original also wrapped
    ;; re-find in a zero-argument anonymous function, which was redundant.
    ;; Guarding against a nil page-text avoids an NPE on pages with no text.
    (when (and page-text
               (re-find #"\{\{Infobox musical artist" page-text))
      (println title))))
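
;; Hedged REPL sketch (not in the original gist): exercise the page processor
;; on a hand-built element via clojure.data.xml/sexp-as-element, so the zipper
;; navigation can be tested without the full dump.
(comment
  (process-music-artist-page
   (xml/sexp-as-element
    [:page
     [:title "Example Artist"]
     [:revision
      [:text "{{Infobox musical artist\n| name = Example Artist\n}}"]]])))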

(defn wiki-music-artists
  "Parse up to [max] pages from a wikipedia dump, printing those that are musical artists."
  [filename max]
  (with-open [rdr (bz2-reader filename)]
    (dorun (->> (xml/parse rdr)
                :content
                (filter #(= :page (:tag %)))
                (take max)
                (map process-music-artist-page)))))
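
;; Hedged REPL sketch: scan just the first 100 pages instead of the whole dump.
(comment
  (wiki-music-artists "enwiki-latest-pages-articles.xml.bz2" 100))
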
(def wikifile "enwiki-latest-pages-articles.xml.bz2")

(defn -main
  [& args]
  (wiki-music-artists wikifile 100000000))

@kornysietsma (Author) commented:

Note you can get a torrent of the wikipedia dump at http://burnbit.com/torrent/246958/enwiki_latest_pages_articles_xml_bz2 - it's 9 GB bzipped, or 42 GB if you unzip it (which is why the code above works directly on the bzipped version!)

The earlier version of this, which only looked at page titles, took 94 minutes to parse the 42 GB XML file on my MacBook Pro.

This version takes 115 minutes, presumably due to the extra effort of running regular expressions over the text of every page. Peak memory use is around 800 MB.
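
For comparison, the earlier title-only pass mentioned above would have looked roughly like this (a hedged reconstruction, not the author's actual earlier code; the title regex is a placeholder):

(defn process-page-title
  ;; Hypothetical sketch: match against the page title only, skipping the
  ;; per-page body regex that the current version pays for.
  [page]
  (let [z (xml-zip page)
        title (xml1-> z :title text)]
    (when (and title (re-find #"music" title)) ; placeholder pattern
      (println title))))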
