{
"metadata": {
"name": "reddit.top.demo"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": "Reddit top 2,500,000 + Google BigQuery HOWTO"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "https://github.com/umbrae/reddit-top-2.5-million\n\nThis is a dataset of top posts from reddit. It contains the top 1,000 all-time posts from the top 2,500 subreddits, so 2.5 million posts in total. The top subreddits were determined by subscriber count and are located in the manifest file within.\n\nFirst I downloaded all the .csv and ran an ETL: Merge all files, and add the filename (subreddit) as a column. For that, I used a little bit of Python."
},
{
"cell_type": "code",
"collapsed": false,
"input": "\"\"\"Small util, reads many csv files, outputs combination.\"\"\"\n\n\nimport csv\nimport sys\nimport re\n\n\ninput_files = sys.argv[1:]\n\nwriter = csv.writer(sys.stdout)\n\nfor ifile in input_files:\n afile = open(ifile)\n file_csv = csv.reader(afile)\n\n file_csv.next()\n for j, row in enumerate(file_csv):\n row = [(int(x) if re.match(r\"[-+]?\\d+$\", x) is not None else x) for x in row]\n sub = row[9].split('/')[4]\n writer.writerow([j, sub] + row[0:9])\n\n \n# Usage: python etl-reddit.py data/*.csv > big.csv",
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Then we can upload all the data to GCS and tell Google BigQuery to ingest it. Voila, we have a new BigQuery dataset!"
},
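{
"cell_type": "markdown",
"metadata": {},
"source": "(The exact upload and load commands aren't shown in the original notebook; below is a minimal sketch using the `gsutil` and `bq` command-line tools. The bucket name is a placeholder, and the schema is an assumption based on the columns written by the ETL script above and the column names used in the queries below, such as `ranked` and `subr`.)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Sketch only: the bucket name, dataset name, and schema below are assumptions.\n# Copy the merged CSV to Google Cloud Storage:\n!gsutil cp big.csv gs://my-reddit-bucket/big.csv\n# Create the dataset if it doesn't exist yet:\n!bq mk reddit\n# Tell BigQuery to ingest the file (schema inferred from the ETL output above):\n!bq load --source_format=CSV reddit.full gs://my-reddit-bucket/big.csv ranked:integer,subr:string,created_utc:integer,score:integer,domain:string,id:string,title:string,author:string,ups:integer,downs:integer,num_comments:integer",
"language": "python",
"metadata": {},
"outputs": []
},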
{
"cell_type": "markdown",
"metadata": {},
"source": "* Table ID\t**bigquery-samples:reddit.full**\n* Table Size\t**1.35 GB**\n* Number of Rows\t**2,267,098**\n\n![](http://i.imgur.com/Bh1jfX0.png)\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "I also wanted to do 'most popular word' analysis. For that, I ran a similar Python ETL. Now I take each title, and split them in \nmany rows, each one containing one word."
},
{
"cell_type": "code",
"collapsed": false,
"input": "\"\"\"Small util, reads many csv files, outputs combination.\"\"\"\n\nimport csv\nimport sys\nimport re\n\n\ninput_files = sys.argv[1:]\n\nwriter = csv.writer(sys.stdout)\n\nfor ifile in input_files:\n afile = open(ifile)\n file_csv = csv.reader(afile)\n\n file_csv.next()\n for j, row in enumerate(file_csv):\n row = [(int(x) if re.match(r\"[-+]?\\d+$\", x) is not None else x) for x in row]\n words = str(row[4]).split()\n sub = row[9].split('/')[4]\n for i, word in enumerate(words):\n row[4] = word\n writer.writerow([j, sub, i] + row[0:9]])\n \n",
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": "* Table ID\t**bigquery-samples:reddit.words**\n* Table Size\t**2.54 GB**\n* Number of Rows\t**23,231,049**\n\n![](http://i.imgur.com/HTIwAqO.png)\n\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Let's look at what are the most popular words:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "#bigquery imports\nimport bq\nimport datetime\nimport pandas as pd\nimport tabulate\nclient = bq.Client.Get()",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 28
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Most popular words\nquery = \"\"\"\nSELECT TOP(LOWER(title)) word, COUNT(*) FROM [bigquery-samples:reddit.words]\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\ndata[0:20]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "\rWaiting on bqjob_r660ffdce05955c96_00000140f5d63731_1 ... (0s) Current status: DONE "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n"
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": "[[u'the', u'734957'],\n [u'a', u'554310'],\n [u'to', u'487780'],\n [u'of', u'421421'],\n [u'i', u'377087'],\n [u'in', u'333310'],\n [u'and', u'323729'],\n [u'my', u'277087'],\n [u'for', u'264098'],\n [u'is', u'218547'],\n [u'this', u'211411'],\n [u'on', u'200500'],\n [u'-', u'196407'],\n [u'you', u'161390'],\n [u'from', u'141592'],\n [u'with', u'139100'],\n [u'it', u'111743'],\n [u'what', u'101908'],\n [u'at', u'99177'],\n [u'that', u'97618']]"
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": "No surprise here, the most popular words are the stop words.\n\nWondering: How can we know what words are really important?\n \nHow about if we look at the ranking of their posts. Stop words will get an average score, and less frequent, most interesting words will rise up.\n"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Most intersting words\nquery = \"\"\"\nSELECT lower(title) word, AVG(score), COUNT(*)\nFROM [bigquery-samples:reddit.words]\nGROUP BY 1\nORDER BY 2 DESC\nLIMIT 10\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\ndata[0:20]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "\rWaiting on bqjob_rd1e3d36de9a3e03_00000140f5e32e59_3 ... (0s) Current status: DONE "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n"
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": "[[u'leduff.', u'17369.0', u'1'],\n [u\"slippin'.\", u'14148.0', u'1'],\n [u'then...wait', u'12640.0', u'1'],\n [u'adoptions.', u'10813.0', u'1'],\n [u'comcast\"', u'9486.0', u'1'],\n [u'hypponen,', u'8797.0', u'1'],\n [u'u.s.-based...', u'8797.0', u'1'],\n [u'f-secure.', u'8797.0', u'1'],\n [u'baboon?', u'8454.0', u'1'],\n [u'$800m+', u'8259.0', u'1']]"
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Those are intersting words, but they only occur once (that's what the COUNT is for). We want words that occur more than once. Also there are many non word characters. Let's clean that with a massive scale REGEX too. Also we'll look at the AVG(ranked) instead of AVG(score), so we normalize between subreddits (some are much more popular than others, but we care about words performance in context of their audience). We'll also count the number of different title's where the word is, to prevent the same word being repeated in one title create an advantage."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Most intersting words\nquery = \"\"\"\nSELECT REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';:-]?[a-z0-9]+)\") word, AVG(ranked), COUNT(DISTINCT title) c\nFROM [bigquery-samples:reddit.words]\nGROUP BY 1\nHAVING c > 80\nORDER BY 2 \nLIMIT 10\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\ndata[0:20]\n\n",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "\rWaiting on bqjob_r7e8e155d6243d669_00000140f606cba3_9 ... (0s) Current status: RUNNING"
},
{
"output_type": "stream",
"stream": "stdout",
"text": " \rWaiting on bqjob_r7e8e155d6243d669_00000140f606cba3_9 ... (1s) Current status: RUNNING"
},
{
"output_type": "stream",
"stream": "stdout",
"text": " \rWaiting on bqjob_r7e8e155d6243d669_00000140f606cba3_9 ... (2s) Current status: RUNNING"
},
{
"output_type": "stream",
"stream": "stdout",
"text": " \rWaiting on bqjob_r7e8e155d6243d669_00000140f606cba3_9 ... (4s) Current status: RUNNING"
},
{
"output_type": "stream",
"stream": "stdout",
"text": " \rWaiting on bqjob_r7e8e155d6243d669_00000140f606cba3_9 ... (5s) Current status: RUNNING"
},
{
"output_type": "stream",
"stream": "stdout",
"text": " \rWaiting on bqjob_r7e8e155d6243d669_00000140f606cba3_9 ... (6s) Current status: RUNNING"
},
{
"output_type": "stream",
"stream": "stdout",
"text": " \rWaiting on bqjob_r7e8e155d6243d669_00000140f606cba3_9 ... (7s) Current status: RUNNING"
},
{
"output_type": "stream",
"stream": "stdout",
"text": " \rWaiting on bqjob_r7e8e155d6243d669_00000140f606cba3_9 ... (8s) Current status: RUNNING"
},
{
"output_type": "stream",
"stream": "stdout",
"text": " \rWaiting on bqjob_r7e8e155d6243d669_00000140f606cba3_9 ... (8s) Current status: DONE "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n"
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 12,
"text": "[[u'gifs', u'364.5930026572188', u'130'],\n [u'upvote', u'366.83476314807905', u'98'],\n [u'aww', u'391.3242208857299', u'115'],\n [u'gif', u'395.13135126632056', u'186'],\n [u'adviceanimals', u'404.69756097560975', u'83'],\n [u'funny', u'411.80598318593735', u'274'],\n [u'x-post', u'413.86464328374416', u'258'],\n [u'meta', u'416.90042372881356', u'132'],\n [u'pics', u'417.17278106508877', u'362'],\n [u'trees', u'422.6736870167955', u'161']]"
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now we get interesting words! If you want a high voted post in reddit, try adding the word \"upvote\" to it. It workd for the 98 posts with it.\n\nWhat are the worse words? (Much lower than average ranking)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Most intersting words\nquery = \"\"\"\nSELECT REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';:-]?[a-z0-9]+)\") word, AVG(ranked), COUNT(DISTINCT title) c\nFROM [bigquery-samples:reddit.words]\nGROUP BY 1\nHAVING c > 80\nORDER BY 2 DESC\nLIMIT 10\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\ndata[0:20]\n",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "\rWaiting on bqjob_r221dcc47a5b267b7_00000140f606f7ea_10 ... (0s) Current status: DONE "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n"
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 13,
"text": "[[u'question', u'580.0574687903573', u'231'],\n [u'newbie', u'561.7951263537906', u'95'],\n [u'suggestions', u'560.2172002510986', u'121'],\n [u'opinions', u'550.2013385387619', u'89'],\n [u'360', u'548.688947368421', u'112'],\n [u'ps3', u'547.0086602139583', u'148'],\n [u'questions', u'545.2264497507216', u'131'],\n [u'pvp', u'542.1625899280575', u'94'],\n [u'help', u'542.0859129889144', u'425'],\n [u'advice', u'540.4483061260313', u'256']]"
}
],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Don't say 'newbie' or 'question'! These are top tiles, but rated lower than average.\n\nNow let's segment our audience by subreddit."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Most intersting words, by subr\nquery = \"\"\"\nSELECT subr, REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';-]?[a-z0-9]+)\") word, AVG(ranked), COUNT(title) c\nFROM [bigquery-samples:reddit.words]\nWHERE subr > 'p'\nGROUP EACH BY 1, 2\nHAVING c > 5 and word IS NOT null\nORDER BY 1, 3\nLIMIT 10\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\ndata[0:20]\n",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "\rWaiting on bqjob_r76d045e63bebe4c2_00000140f60f3a07_13 ... (0s) Current status: DONE "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n"
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 16,
"text": "[[u'paintball', u'right', u'201.75', u'8'],\n [u'paintball', u'posted', u'228.5', u'6'],\n [u'paintball', u'doing', u'238.5', u'6'],\n [u'paintball', u'every', u'286.0', u'9'],\n [u'paintball', u'down', u'295.1111111111111', u'9'],\n [u'paintball', u'share', u'297.3', u'10'],\n [u'paintball', u'their', u'297.57142857142856', u'7'],\n [u'paintball', u'fixed', u'303.2857142857143', u'7'],\n [u'paintball', u'shot', u'308.3636363636364', u'11'],\n [u'paintball', u\"don't\", u'314.4166666666667', u'12']]"
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": "So the topwords for r/paintball are \"right\", \"posted\", and \"doing\". Those got highly ranked, and mentioned more than 5 times.\n\nBut we have too much data here, can we get only the 3 top words for each subreddit?\n"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Most 3 intersting words, for each of the 2500 subr.\nquery = \"\"\"\nSELECT subr, GROUP_CONCAT(word), AVG(ranked) ranked, SUM(c) c FROM (\nSELECT subr, word, ranked, c, ROW_NUMBER() OVER(PARTITION BY subr ORDER BY ranked) rowrank\nFROM(\nSELECT subr, REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';-]?[a-z0-9]+)\") word, AVG(ranked) ranked, COUNT(title) c\nFROM [bigquery-samples:reddit.words]\nGROUP EACH BY 1, 2\nHAVING c > 5 and word IS NOT null\nORDER BY 1, 3\n)\n)\nWHERE rowrank <4\nGROUP BY subr\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 3000)\n\nprint tabulate.tabulate(data[0:10])\nprint tabulate.tabulate(data[100:105])\nprint tabulate.tabulate(data[400:405])\nprint tabulate.tabulate(data[1000:1005])\nprint tabulate.tabulate(data[2000:2005])\n\n",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "\rWaiting on bqjob_r6e49858b9b4b4a38_00000140f6182c36_25 ... (0s) Current status: DONE "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n----------------- ----------------------- ------- --\n0x10c eldrone,screenshot,face 143.36 27\n2007scape upvote,please,any 287.125 33\n24hoursupport our,task,made 172.667 21\n30ROCK post,funny,party 244.96 23\n3DS cards,should,street 251.067 31\n3Dprinting after,pretty,hours 200.825 24\n3FrameMovies american,order,reverse 136.762 21\n3amjokes she,fell,got 199.778 20\n45thworldproblems see,daros,locatio 115.557 27\n49ers downvote,upvote,who's 185.429 21\n----------------- ----------------------- ------- --"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n----------------- ------------------- ------- --\nAskScienceFiction park,toy,wasn't 268.222 18\nAskSocialScience seem,iama,new 184.769 23\nAskWomen found,can't,black 194.344 29\nAsk_Politics secret,issue,please 150.196 22\nAssistance saw,upvote,lucas 174.46 35\n----------------- ------------------- ------- --\n--------------- ------------------------ -------- --\nGames youtube,amazon,different 217.796 22\nGamesTheMovie uncharted,season,halo 51.9983 26\nGaryJohnson appear,paul's,penn 214.246 19\nGaymersGoneMild date,science,ago 207.286 21\nGearsOfWar saw,thought,came 201.318 39\n--------------- ------------------------ -------- --\n------------------- ----------------- -------- --\nUltralight was,going,like 152.698 19\nUnexpected when,get,dog 278.083 23\nUnity3D star,fox,water 180.142 25\nUniversityofReddit even,very,hey 172.678 22\nUnresolvedMysteries was,found,mystery 71.1221 32\n------------------- ----------------- -------- --\n------------- -------------------- ------- --\nparrots share,eat,people 267.944 18\npathofexile thank,show,wraeclast 272.522 21\npatientgamers bastion,humble,iii 178.938 29\npcgaming generation,every,guy 154.444 18\npcmasterrace once,fixed,war 222.909 21\n------------- -------------------- ------- --\n"
}
],
"prompt_number": 29
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Most popular bigrams [http://nbviewer.ipython.org/6512989](http://nbviewer.ipython.org/6512989)"
}
],
"metadata": {}
}
]
}