Here's the assignment:
Download this raw page view statistics dump from Wikipedia (360 MB uncompressed):
http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-10/pagecounts-20141029-230000.gz
Write a simple script in your favourite programming language that:
- Gets all page views for the English Wikipedia (these lines are prefixed by "en ")
- Limits those articles to the ones with at least 500 views
- Sorts them by number of views, highest first, and prints the first ten articles
- Measures the time this takes and prints it as well
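As an illustration of the steps above, here is a minimal Python sketch. It is not one of the versions in this repository. It assumes each line of the dump has the space-separated layout `project page_title view_count bytes`, and that the path to the downloaded file (gzipped or not) is passed as the first command-line argument. The function name `top_articles` and its parameters are made up for this example.

```python
import gzip
import sys
import time


def top_articles(lines, min_views=500, limit=10):
    """Return up to `limit` (title, views) pairs for English Wikipedia pages.

    Assumes each line looks like: "<project> <page_title> <view_count> <bytes>".
    """
    pages = []
    for line in lines:
        # The trailing space keeps out other projects such as "en.b".
        if not line.startswith("en "):
            continue
        fields = line.split(" ")
        if len(fields) < 3:
            continue  # skip malformed lines
        try:
            views = int(fields[2])
        except ValueError:
            continue  # skip lines whose count is not a number
        if views >= min_views:
            pages.append((fields[1], views))
    # Highest view count first.
    pages.sort(key=lambda page: page[1], reverse=True)
    return pages[:limit]


def main(path):
    start = time.perf_counter()
    # Read the .gz directly, or a plain file if it was already decompressed.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as dump:
        top = top_articles(dump)
    elapsed = time.perf_counter() - start
    print(f"Query took {elapsed:.2f} seconds")
    for title, views in top:
        print(f"{title} ({views})")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Note that the timer here covers reading (and, for a .gz file, decompressing) the dump as well as filtering and sorting, so timings are only comparable between versions that measure the same span.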
Right now we've got versions in JavaScript (Node.js), PHP, Go, Python, Ruby, Bash (awk/sed/grep), Groovy and Java, the latter in both Java 8 functional style and 'old school' style.
The Bash, Groovy and Java versions were written by @breun, the Ruby version was written by @tieleman, the others by yours truly.
Some measurements on my machine (2011 MacBook Pro, no SSD), each given as the average followed by the individual runs:
- Go: 2.36s (2.31, 2.42, 2.36, 2.33, 2.36)
- Java (old school): 4.77s, or 2.66s when the first measurement is left out (13.15, 2.59, 2.48, 3.08, 2.58, 2.58)
- Node.js: 7.10s (7.56, 7.18, 7.01, 6.89, 6.88)
- Groovy: 8.14s (9.16, 7.67, 8.60, 7.87, 7.40)
- PHP: 8.42s (8.54, 8.31, 8.47, 8.41, 8.39)
- Ruby: 8.85s (9.3, 9.38, 8.6, 8.37, 8.61)
- Python: 9.26s (8.35, 8.54, 10.43, 9.35, 9.62)
- Bash: 12.34s (12.62, 12.22, 12.78, 12.80, 11.29)
Your output should look like this:
Query took 7.56 seconds
Main_Page (394296)
Malware (51666)
Loan (45440)
Special:HideBanners (40771)
Y%C5%AB_Kobayashi (34596)
Special:Search (18672)
Glutamate_flavoring (17508)
Online_shopping (16310)
Chang_and_Eng_Bunker (14956)
Dance_Moms (8928)