Here's the assignment:
Download this raw statistics dump from Wikipedia (360 MB unzipped):
http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-10/pagecounts-20141029-230000.gz
Write a simple script in your favourite programming language that:
- Gets all page views for the English Wikipedia (these lines are prefixed by "en ")
- Limits those articles to the ones with at least 500 views
- Sorts them by number of views, highest first, and prints the first ten articles
- Measures the time this takes and prints it out as well
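The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not one of the versions measured below. It assumes each line of the dump is space-separated as `project title views bytes`, and the names `top_pages` and `report` are made up for this example.

```python
import time


def top_pages(lines, prefix="en ", min_views=500, limit=10):
    """Return the `limit` most-viewed (title, views) pairs for one project."""
    pages = []
    for line in lines:
        # Keep only the English Wikipedia; "en.b", "en.q" etc. do not match "en ".
        if not line.startswith(prefix):
            continue
        parts = line.split(" ")
        if len(parts) < 3:
            continue  # skip malformed lines
        try:
            views = int(parts[2])
        except ValueError:
            continue  # skip lines whose view count is not a number
        if views >= min_views:
            pages.append((parts[1], views))
    # Highest view counts first.
    pages.sort(key=lambda page: page[1], reverse=True)
    return pages[:limit]


def report(path):
    """Time the query on the unzipped dump at `path` and print the result."""
    start = time.perf_counter()
    # Assumption: undecodable bytes can be replaced, since titles are percent-encoded.
    with open(path, encoding="utf-8", errors="replace") as dump:
        pages = top_pages(dump)
    elapsed = time.perf_counter() - start
    print(f"Query took {elapsed:.2f} seconds")
    for title, views in pages:
        print(f"{title} ({views})")
```

Calling `report("pagecounts-20141029-230000")` on the unzipped file should print the timing line followed by the top ten articles. Filtering before sorting keeps the sort small, since only a small fraction of lines reach 500 views.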
Right now we've got versions in JavaScript (Node.js), PHP, Go, Python, Ruby, Bash (awk/sed/grep), Groovy and Java (in both Java 8 functional style and 'old school' style).
The Bash, Groovy and Java versions were written by @breun, the Ruby version was written by @tieleman, the others by yours truly.
Some measurements on my machine (2011 MacBook Pro, no SSD):
- C: 1.63s (1.58, 1.73, 1.59, 1.62, 1.63)
- Go: 2.36s (2.31, 2.42, 2.36, 2.33, 2.36)
- Java (old school): 4.77s, or 2.66s when the first measurement is excluded (13.15, 2.59, 2.48, 3.08, 2.58, 2.58)
- Groovy: 4.33s (4.16, 4.27, 4.55, 4.42, 4.27)
- Node.js: 7.10s (7.56, 7.18, 7.01, 6.89, 6.88)
- PHP: 7.44s (7.25, 7.35, 7.28, 7.37, 7.97)
- Python: 7.45s (6.59, 7.28, 6.81, 8.99, 7.59)
- Ruby: 8.85s (9.3, 9.38, 8.6, 8.37, 8.61)
- Bash: 12.34s (12.62, 12.22, 12.78, 12.80, 11.29)
- Lua: 22.81s (24.08, 22.65, 22.11, 21.53, 23.70)
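As a small sketch of how a warm-up run can be left out of an average, the snippet below recomputes the Java figure that excludes the first measurement, using the run times listed above. The helper name `mean_without_warmup` is made up for this example.

```python
from statistics import mean

# Java (old school) run times in seconds, copied from the list above.
java_runs = [13.15, 2.59, 2.48, 3.08, 2.58, 2.58]


def mean_without_warmup(runs, warmup=1):
    """Average the runs after dropping the first `warmup` measurements."""
    return mean(runs[warmup:])


print(f"{mean_without_warmup(java_runs):.2f}s")  # prints "2.66s"
```

The first Java run is much slower than the rest, which is why it is worth reporting the average both ways.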
Your output should look like this:
Query took 7.56 seconds
Main_Page (394296)
Malware (51666)
Loan (45440)
Special:HideBanners (40771)
Y%C5%AB_Kobayashi (34596)
Special:Search (18672)
Glutamate_flavoring (17508)
Online_shopping (16310)
Chang_and_Eng_Bunker (14956)
Dance_Moms (8928)
Here is my quick-and-dirty version in C. Note that line length is limited by memory.
On my system:
- My version: 530-550 ms
- c-rack's version: 560-600 ms
- Go version: 3150-3280 ms
Compile with: