@brianckeegan
Last active June 19, 2021 13:31
Top Wikipedia stories in 2014
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup, element
import re
from urllib.request import urlopen  # Python 3 replacement for urllib2.urlopen
# Read the HTML from the webpage on Wikipedia stats and convert to soup
soup = BeautifulSoup(urlopen('http://stats.wikimedia.org/EN/TablesWikipediaEN.htm').read())
# Look for all the bold tags whose text contains 2014
_p = soup.findAll('b',text=re.compile('2014'))
# Keep only the parents of those tags that have exactly 152 children, corresponding to the top-25 lists
_p2014 = [t.parent for t in _p if len(t.parent) == 152]
# Get the text out of the children tags as a list of lists
parsed = [[t.text for t in list(p.children) if type(t) != element.NavigableString] for p in _p2014]
# Convert to a dictionary keyed by month abbreviation with values as the list of text fields
parsed = {month[0].split(u'\xa0')[0]:month[1:] for month in parsed}
# Group each month's flat list into (rank, editors, article) triples using
# strided slices and zip, converting rank and editors to integers
parsed = {k:[{'rank':int(a),'editors':int(b),'article':c}
             for a,b,c in zip(v[0::3],v[1::3],v[2::3])]
          for k,v in parsed.items()}
# Convert each month into a DataFrame with month information in the index
# and then concat all the dfs together, sorting on those with the most editors
ranked = pd.concat([pd.DataFrame(parsed[i],index=[i]*len(parsed[i])) for i in parsed.keys()]).sort_values('editors',ascending=False).reset_index()
# rename the reset index to something meaningful
ranked.rename(columns={'index':'month'},inplace=True)
# Group the articles by name, compute aggregate statistics
# Rank on the number of months in the top 25, then on the total number of editors
top_articles = ranked.groupby('article').agg({'month':len,'editors':np.sum,'rank':np.min}).sort_values(['month','editors'],ascending=False)
top_articles
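
The reshaping and aggregation steps above can be tried without scraping anything. The following is a minimal sketch in plain Python, not the gist's pandas pipeline: the month data is made up, the helper names `to_records` and `summarize` are hypothetical, and the `groupby` step is replaced with a dictionary of running totals. It assumes an article appears at most once per month, as it would in a top-25 list.

```python
from collections import defaultdict

# Made-up flattened fields per month (NOT real Wikipedia data), in the
# same rank / editors / article order that the scraper produces.
sample = {
    "Jan": ["1", "120", "Article A", "2", "80", "Article B"],
    "Feb": ["1", "95", "Article B", "2", "60", "Article C"],
}

def to_records(fields):
    """Group a flat list into rank/editors/article records via strided slices."""
    return [
        {"rank": int(r), "editors": int(e), "article": a}
        for r, e, a in zip(fields[0::3], fields[1::3], fields[2::3])
    ]

def summarize(parsed):
    """Per article: months listed, summed editors, best (lowest) rank."""
    totals = defaultdict(lambda: {"months": 0, "editors": 0, "best_rank": None})
    for month, fields in parsed.items():
        for rec in to_records(fields):
            t = totals[rec["article"]]
            t["months"] += 1
            t["editors"] += rec["editors"]
            if t["best_rank"] is None or rec["rank"] < t["best_rank"]:
                t["best_rank"] = rec["rank"]
    # Sort on months first, then total editors, both descending
    return sorted(
        totals.items(),
        key=lambda kv: (kv[1]["months"], kv[1]["editors"]),
        reverse=True,
    )

summary = summarize(sample)
for article, stats in summary:
    print(article, stats)
```

Sorting on the `(months, editors)` tuple mirrors the two-column descending sort in the script, so articles that stay in the lists longest come first and ties are broken by editor count.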
@brianckeegan

| Article | Total editors across months | Minimum rank | Total months in Top 25 |
| --- | --- | --- | --- |
| Deaths in 2014 | 1848 | 1 | 10 |
| Islamic State of Iraq and the Levant | 752 | 2 | 5 |
| Malaysia Airlines Flight 370 | 1147 | 1 | 4 |
| Ebola virus epidemic in West Africa | 758 | 1 | 4 |
| Ukraine | 416 | 7 | 4 |
| Frozen (2013 film) | 344 | 7 | 4 |
| 2014 Israel-Gaza conflict | 675 | 2 | 3 |
| 2014 pro-Russian unrest in Ukraine | 314 | 8 | 3 |
| War in Donbass | 273 | 12 | 3 |
| Malaysia Airlines Flight 17 | 755 | 1 | 2 |
| 2014 FIFA World Cup | 488 | 1 | 2 |
| 2014 Crimean crisis | 476 | 2 | 2 |
| Ebola virus disease | 296 | 6 | 2 |
| 2014 Russian military intervention in Ukraine | 294 | 3 | 2 |
| 2014 Winter Olympics | 275 | 2 | 2 |
| 2014 FIFA World Cup squads | 261 | 6 | 2 |
| SummerSlam (2014) | 250 | 8 | 2 |
| 2014 Ukrainian revolution | 240 | 4 | 2 |
| 2014 Hong Kong protests | 236 | 5 | 2 |
| Super Bowl XLVIII | 233 | 4 | 2 |
| Eurovision Song Contest 2014 | 231 | 4 | 2 |
| Indian general election, 2014 | 210 | 8 | 2 |
| Gamergate controversy | 201 | 9 | 2 |
| Narendra Modi | 187 | 9 | 2 |
| Transformers: Age of Extinction | 185 | 15 | 2 |
| Big Brother 16 (U.S.) | 181 | 15 | 2 |
| Euromaidan | 179 | 10 | 2 |
| Kick (2014 film) | 177 | 14 | 2 |
| The Amazing Spider-Man 2 | 177 | 12 | 2 |
| 2014 military intervention against the Islamic State of Iraq and the Levant | 165 | 10 | 2 |
| FIFA 15 | 158 | 12 | 2 |
| Nash Grier | 156 | 19 | 2 |
| Bitcoin | 130 | 21 | 2 |
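
The results above are ordered by months in the top 25 first and total editors second. A rough sketch of re-ranking them another way, using six rows copied by hand from the results (the tuple layout and the variable names `by_editors` and `per_month` are just for illustration):

```python
# (article, total editors across months, minimum rank, months in top 25)
# Six rows copied from the results above; the rest are omitted for brevity.
rows = [
    ("Deaths in 2014", 1848, 1, 10),
    ("Islamic State of Iraq and the Levant", 752, 2, 5),
    ("Malaysia Airlines Flight 370", 1147, 1, 4),
    ("Ebola virus epidemic in West Africa", 758, 1, 4),
    ("Malaysia Airlines Flight 17", 755, 1, 2),
    ("2014 FIFA World Cup", 488, 1, 2),
]

# Rank on total editors alone, ignoring how many months the article was listed
by_editors = sorted(rows, key=lambda r: r[1], reverse=True)

# Rank on mean editors per listed month, which favors short, intense bursts
per_month = sorted(rows, key=lambda r: r[1] / r[3], reverse=True)

for name, editors, _, months in per_month:
    print(f"{name}: {editors / months:.1f} editors per listed month")
```

On this subset, ordering by editors per listed month puts the two Malaysia Airlines articles ahead of "Deaths in 2014", which leads the original ordering mainly because it is listed in 10 months.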
