@mrflip
Created January 22, 2009 03:36
== An elementary as hell solution to stephenfry's #L challenge: ==
-- Get the twitter scrape from infochimps.org
-- http://blog.infochimps.org/2008/12/29/massive-scrape-of-twitters-friend-graph/
-- (not available yet, but sometime soon)
--
-- Then run this pig (http://wiki.apache.org/pig/) code
-- to extract all twitter screen names with three or more L's
-- The match isn't case insensitive (only lowercase l counts), but who cares.
--
-- Assumes a relation Users with a screen_name field has already been LOADed from the scrape.
Ells_1 = FOREACH Users GENERATE screen_name;
Ells_2 = FILTER Ells_1 BY screen_name MATCHES '.*l.*l.*l.*';
STORE Ells_2 INTO 'foo/ells' ;
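--
-- For anyone without a Hadoop cluster handy, the same filter can be sketched
-- in plain ruby. This is only an illustration, not part of the pipeline above:
-- three_ells? is a hypothetical helper, "LLull" is a made-up uppercase variant
-- of a name from the output, and unlike the Pig MATCHES it downcases first, so
-- capital L's count too.

```ruby
# Hypothetical helper: true when a screen name has three or more l/L characters.
def three_ells?(screen_name)
  screen_name.downcase.count("l") >= 3
end

# Two matching names and one non-match; "LLull" is invented to show the case handling.
names = %w[lll_lll LLull stephenfry]
puts names.select { |name| three_ells?(name) }
```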
# ====== SHELL CODE ========
# Whip up a ruby-from-command-line one-off to find long screen_names with a high fraction of L's.
# Output columns: fraction of L's, count of L's, name length, name.
hadoop dfs -cat foo/ells/part\* | \
ruby -ne '$_ = $_.downcase.chomp  # chomp! would return nil on a final line with no newline
ls = $_.count("l")
lf = ls.to_f/$_.length
puts "%7.3f\t%7d\t%7d\t%s" % [lf, ls, $_.length, $_]
' | sort -rn | head -n 50
# The output:
#
# 1.000 6 6 llllll
# 1.000 5 5 lllll
# 1.000 14 14 llllllllllllll
# 1.000 10 10 llllllllll
# 0.867 13 15 illllllllllllli
# 0.867 13 15 hulllllllllllll
# 0.857 6 7 lll_lll
# 0.833 5 6 dlllll
# 0.800 8 10 alllllllly
# 0.800 4 5 llull
# 0.800 4 5 lllel
# 0.800 4 5 llill
# 0.800 12 15 lllvlllclllvlll
# 0.800 12 15 lll8lll8lll8lll
# 0.778 7 9 halllllll
# 0.750 6 8 llilllil
# The third through sixth users' names add up to exactly 50 L's (14 + 10 + 13 + 13)
#
# Compose a message with no other L's except the #L hashtag:
#
# @stephenfry - @llllllllllllll have you met @llllllllll @IlllllllllllllI
# and @hulllllllllllll? Your orthographic unity = easy #L project win.
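#
# A quick sanity check on that tally, as a rough ruby sketch. The names are
# rebuilt from the L counts in the output listing rather than typed out by
# hand, and ell_count is a hypothetical helper that ignores the #L hashtag.

```ruby
# Rebuild the four screen names from their L counts in the output listing.
names = ["l" * 14, "l" * 10, "I" + "l" * 13 + "I", "hu" + "l" * 13]

message = "@stephenfry - @#{names[0]} have you met @#{names[1]} @#{names[2]} " \
          "and @#{names[3]}? Your orthographic unity = easy #L project win."

# Hypothetical helper: count l/L characters after dropping the #L hashtag itself.
def ell_count(text)
  text.gsub("#L", "").downcase.count("l")
end

puts ell_count(message)
```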