Last active: December 23, 2015 23:09
{ | |
"metadata": { | |
"name": "bigquery discovering data" | |
}, | |
"nbformat": 3, | |
"nbformat_minor": 0, | |
"worksheets": [ | |
{ | |
"cells": [ | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "import bq\nimport datetime\nimport pandas as pd\nimport tabulate\nclient = bq.Client.Get()", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 1 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "We'll play with the [bigquery-samples:reddit.full] dataset.\n\nFirst question is: How big is this dataset?" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = \"\"\"\nSELECT COUNT(*) c FROM [bigquery-samples:reddit.full]\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])\n", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r6927b004530b87f9_0000014161ef1857_2 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n c\n-------\n2267098" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 3 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "2,267,098 rows! Each representing one submission. What columns do we have?" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "schema = client.GetTableSchema({'projectId': 'bigquery-samples', 'datasetId': 'reddit', 'tableId': 'words'})\nprint tabulate.tabulate([[x['name'], x['type']] for x in schema['fields']])\n\n", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "------------ ---------\nranked INTEGER\nsubr STRING\nwordpos INTEGER\ncreated_utc TIMESTAMP\nscore INTEGER\ndomain STRING\nid STRING\ntitle STRING\nauthor STRING\nups INTEGER\ndowns INTEGER\nnum_comments INTEGER\ntotal INTEGER\n------------ ---------\n" | |
} | |
], | |
"prompt_number": 4 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Let's explore this dataset.\n\nCOUNT(DISTINCT x) is a fast, approximate counter of distinct values.\n\nTip: If you need a precise count, you can raise the approximation threshold with a second argument, COUNT(DISTINCT x, n), or use GROUP EACH BY in a subquery." | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = \"\"\"\nSELECT COUNT(DISTINCT subr) c FROM [bigquery-samples:reddit.full]\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r72978a08a67354ae_0000014161ef2ebf_3 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n c\n----\n2526" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 5 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "COUNT(DISTINCT) reports 2,526 different 'subr' (subreddits) in this dataset, but remember that's an approximation. Can we get a precise count?" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = \"\"\"\nSELECT COUNT(DISTINCT subr, 5000) c FROM [bigquery-samples:reddit.full]\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])\n\n\nquery = \"\"\"\nSELECT COUNT(*) c FROM (SELECT subr FROM [bigquery-samples:reddit.full] GROUP BY 1)\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])\n", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_rfee8101457737db_0000014161ef398c_4 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n c\n----\n2499" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n\rWaiting on bqjob_r672b541ac2d01a42_0000014161ef3e7e_5 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n c\n----\n2499" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 6 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Another cool, fast, approximate function is TOP(). With it we can quickly find the most repeated values in a column." | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = \"\"\"\nSELECT TOP(author) author, COUNT(*) c FROM [bigquery-samples:reddit.full]\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_ra0906b7992e6d7e_0000014161ef4db5_6 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nauthor c\n------------------ -----\n 77143\nanutensil 4119\natticus138 2866\ngst 1850\nmepper 1482\ndavidreiss666 1436\nNickster79 1324\nkirbyfood 1262\nAschebescher 1255\nsoupyhands 1160\nMind_Virus 1143\ncoconutwaters 1118\nFenrirIII 1039\ntone_is_everything 1005\nrising_threads_bot 999\nPoliticBot 998\nserendipitybot 996\nmarquis_of_chaos 983\ntabledresser 954\ncongressbot 947" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 7 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "So among the dataset's 2,267,098 top posts, this tells us that some users have authored a great many of them.\n\nCheck them out on http://www.reddit.com/user/anutensil, http://www.reddit.com/user/mepper, http://www.reddit.com/user/davidreiss666, etc.\n\nWe also see that many posts have no author. Maybe they come from deleted accounts?" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "What are the top-voted submissions?\n\nLEFT(x, 80) truncates the title so the table doesn't overflow." | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = \"\"\"\nSELECT subr, score, LEFT(title, 80)\nFROM [bigquery-samples:reddit.full]\nORDER BY 2 DESC\nLIMIT 10\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r7aa1cec801843bc3_0000014161ef5be0_7 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nsubr score f0_\n------ ------- --------------------------------------------------------------------------------\npics 21868 test post please ignore\nfunny 21304 The Bus Knight\nfunny 18793 I live in the same valley as Adam West. I decided to look him up in the phone bo\nvideos 17369 This guy is a reporter on Fox 2 here in Detroit. His name is Charlie LeDuff. He\nfunny 14883 After searching FB for people with the same name as me, I'd replicate their prof\nIAmA 14749 I am Barack Obama, President of the United States -- AMA\npics 14568 I sent Tom Hanks a 1934 Smith Corona typewriter with a typed invitation to come\nfunny 14148 Babe caught me slippin'.\nfunny 12962 I hate my job...\nfunny 12942 When I found out I could upvote by pressing 'A'" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 8 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Funny how a post titled \"please ignore\" is the highest-scored post of all time.\n\nWhich subreddits get the most votes?" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = \"\"\"\nSELECT subr, SUM(score) sum_score\nFROM [bigquery-samples:reddit.full]\nGROUP BY 1\nORDER BY 2 DESC\nLIMIT 20\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r392994d37bd93528_0000014161ef69fa_8 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nsubr sum_score\n----------------- -----------\npics 3752388\nfunny 3713684\nvideos 3201899\npolitics 3157350\ntechnology 2894626\nworldnews 2876818\naww 2857179\nWTF 2665757\ntodayilearned 2636914\ngaming 2633408\ngifs 2615749\nAdviceAnimals 2614179\nmildlyinteresting 2487454\n4chan 2443942\nreactiongifs 2320331\nscience 2320319\ncringepics 2320307\natheism 2254504\ntrees 2239064\nAskReddit 2198059" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 9 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Interesting. Now, what is the #1 post for each subreddit? That calls for window functions; let's do a simple example first:" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = \"\"\"\nSELECT subr, RANK() OVER(PARTITION BY subr ORDER BY score desc) rank, LEFT(title, 40) title, DATE(created_utc) date\nFROM [bigquery-samples:reddit.full]\nWHERE subr = 'pics'\nLIMIT 20\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r1b63d608e75fe8b_0000014161ef76d8_9 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nsubr rank title date\n------ ------ ---------------------------------------- ----------\npics 1 test post please ignore 2009-07-18\npics 2 I sent Tom Hanks a 1934 Smith Corona typ 2012-10-09\npics 3 My friend calls him \"Mr Ridiculously Pho 2012-04-03\npics 4 The only way my 3 day-old daughter would 2012-12-28\npics 5 Photo I took of the Chicago skyline from 2012-07-13\npics 6 This is called humanity. 2011-07-22\npics 7 The frog that has lived in my shower for 2012-12-28\npics 8 Came across an overgrown cemetery in the 2013-03-12\npics 9 Tried taking a picture of a sink drainin 2012-11-12\npics 10 Poster ad for the Canadian Paralympics 2012-05-04\npics 11 NY Cop giving a brand new pair boots to 2012-11-28\npics 12 I'm 7 foot. For Halloween I went as a no 2012-04-11\npics 13 I wish I was a dog just so I could wear 2012-08-09\npics 14 Dropped my camera while taking a multish 2012-12-28\npics 15 Totally worth battling traffic, the cold 2012-06-07\npics 16 Photo of a North Korean family being del 2012-08-19\npics 17 Pipe Cleaner Wolf 2013-06-07\npics 18 Just discovered Lost Egyptian city found 2013-06-08\npics 19 Kinetic light sculptures created by Brit 2013-04-08\npics 20 My friend and his daughter dressed as a 2012-10-28" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 10 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "So RANK() adds a column with each post's position according to the given ordering. Now we can use an outer query to keep only the posts with rank = 1." | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = \"\"\"\nSELECT subr, title, date, score FROM (\n SELECT subr, RANK() OVER(PARTITION BY subr ORDER BY score desc) rank, LEFT(title, 40) title, DATE(created_utc) date, score\n FROM [bigquery-samples:reddit.full]\n)\nWHERE rank = 1\nLIMIT 20\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r2f14af0a937fa494_0000014161ef8276_10 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nsubr title date score\n----------------- ---------------------------------------- ---------- -------\n0x10c The reason for slow 2012-08-16 430\n2007scape What Rank 235,661 cooking looks like 2013-03-09 463\n24hoursupport My reddit account of 3+ years (and other 2013-02-08 69\n30ROCK No big deal. Just a picture of Tina Fey 2013-02-01 1874\n3DS Quick question. Don't upvote please. 2013-02-04 575\n3Dprinting We 3D Printed someone's face 2013-07-17 259\n3FrameMovies [3FM] The Dark Knight Trilogy 2012-09-27 744\n3amjokes Why was the soldier pinned down? 2013-05-15 518\n45thworldproblems [PR_PHECY: 10b/2m/1o] The atomic chorus 2012-12-11 141\n49ers Update: The downvote button is now a Rav 2013-01-22 868\n4chan It's my gf 2012-12-28 7237\n4x4 While I was eating breakfast at my hotel 2013-07-20 217\n52weeksofcooking Official Weekly Challenge List 2011-11-29 108\n90daysgoal My motivation 2011-10-07 65\n90sHipHop The Pharcyde - Passin' Me By 2013-01-23 113\n90scartoons Grandpa Lou doesn't mess around 2012-11-17 208\n911truth Most Convincing Explanation for Why 9/11 2011-11-21 111\nABraThatFits What I picture every time I visit this s 2013-06-25 649\nADHD The most accurate representation of my d 2013-03-04 743\nAFL I am Mitch Robinson from the Carlton Foo 2013-08-15 308" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 11 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Can we look at only the top 20 subreddits? Easy with another subquery and the IN keyword:" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = \"\"\"\nSELECT subr, title, date, score FROM (\n SELECT subr, RANK() OVER(PARTITION BY subr ORDER BY score desc) rank, LEFT(title, 40) title, DATE(created_utc) date, score\n FROM [bigquery-samples:reddit.full]\n WHERE subr IN (\n SELECT subr FROM (\n SELECT subr, SUM(score) sum_score\n FROM [bigquery-samples:reddit.full]\n GROUP BY 1\n ORDER BY 2 DESC\n LIMIT 20))\n)\nWHERE rank = 1\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r5d6091dd80d946b5_0000014161f5776f_30 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nsubr title date score\n----------------- ---------------------------------------- ---------- -------\n4chan It's my gf 2012-12-28 7237\nAdviceAnimals Every time I talk on the phone 2012-11-28 7866\nAskReddit Throwaway time! What's your secret that 2012-05-01 12914\nWTF HOLY SHIT THE ONION 2012-09-13 9409\natheism Hurt me good r/atheism, $.50 to Doctors 2011-12-03 6303\naww My local humane society posts pictures o 2012-09-16 10813\ncringepics Text Me. 2012-12-29 10502\nfunny The Bus Knight 2012-07-28 21304\ngaming I've been playing the same game of Civil 2012-06-12 12060\ngifs 24 Hz sound + 24 fps camera + water = ma 2013-03-12 7095\nmildlyinteresting Buried a friend knee deep in sand. When 2013-04-09 7353\npics test post please ignore 2009-07-18 21868\npolitics Obama wins the Presidency! 2008-11-05 8535\nreactiongifs When the computer freezes while watching 2012-12-27 7951\nscience Mars Science Laboratory Curiosity has la 2012-08-06 5785\ntechnology \"Dear Google Fiber: Please, please, plea 2012-11-28 9486\ntodayilearned TIL that after needing 13 liters of bloo 2013-01-11 8198\ntrees Can't believe I met [10] Guy today 2012-07-01 9759\nvideos This guy is a reporter on Fox 2 here in 2012-10-16 17369\nworldnews \"What we have... is... concrete proof of 2013-06-08 8797" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 31 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Instead of writing the same long sub-query every time, it's more efficient to save the results of the top-subreddits query to a temporary table.\n" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = \"\"\"\n SELECT subr, SUM(score) sum_score\n FROM [bigquery-samples:reddit.full]\n GROUP BY 1\n ORDER BY 2 DESC\n LIMIT 20\n\"\"\"\nfields, data = client.ReadSchemaAndRows(\n client.Query(query, destination_table='fh-bigquery:reddit.top20', write_disposition='WRITE_TRUNCATE')['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r1332d487b3a199af_0000014161f6fe7a_34 ... (0s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r1332d487b3a199af_0000014161f6fe7a_34 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nsubr sum_score\n----------------- -----------\npics 3752388\nfunny 3713684\nvideos 3201899\npolitics 3157350\ntechnology 2894626\nworldnews 2876818\naww 2857179\nWTF 2665757\ntodayilearned 2636914\ngaming 2633408\ngifs 2615749\nAdviceAnimals 2614179\nmildlyinteresting 2487454\n4chan 2443942\nreactiongifs 2320331\nscience 2320319\ncringepics 2320307\natheism 2254504\ntrees 2239064\nAskReddit 2198059" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 35 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Now the previous query gets simpler:" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = \"\"\"\nSELECT subr, title, date, score FROM (\n SELECT subr, RANK() OVER(PARTITION BY subr ORDER BY score desc) rank, LEFT(title, 40) title, DATE(created_utc) date, score\n FROM [bigquery-samples:reddit.full]\n WHERE subr IN (\n SELECT subr FROM [fh-bigquery:reddit.top20])\n)\nWHERE rank = 1\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r2d934f3c94d36a4_0000014161f71c34_35 ... (0s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r2d934f3c94d36a4_0000014161f71c34_35 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nsubr title date score\n----------------- ---------------------------------------- ---------- -------\n4chan It's my gf 2012-12-28 7237\nAdviceAnimals Every time I talk on the phone 2012-11-28 7866\nAskReddit Throwaway time! What's your secret that 2012-05-01 12914\nWTF HOLY SHIT THE ONION 2012-09-13 9409\natheism Hurt me good r/atheism, $.50 to Doctors 2011-12-03 6303\naww My local humane society posts pictures o 2012-09-16 10813\ncringepics Text Me. 2012-12-29 10502\nfunny The Bus Knight 2012-07-28 21304\ngaming I've been playing the same game of Civil 2012-06-12 12060\ngifs 24 Hz sound + 24 fps camera + water = ma 2013-03-12 7095\nmildlyinteresting Buried a friend knee deep in sand. When 2013-04-09 7353\npics test post please ignore 2009-07-18 21868\npolitics Obama wins the Presidency! 2008-11-05 8535\nreactiongifs When the computer freezes while watching 2012-12-27 7951\nscience Mars Science Laboratory Curiosity has la 2012-08-06 5785\ntechnology \"Dear Google Fiber: Please, please, plea 2012-11-28 9486\ntodayilearned TIL that after needing 13 liters of bloo 2013-01-11 8198\ntrees Can't believe I met [10] Guy today 2012-07-01 9759\nvideos This guy is a reporter on Fox 2 here in 2012-10-16 17369\nworldnews \"What we have... is... concrete proof of 2013-06-08 8797" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 36 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "The IN keyword is equivalent to doing a JOIN:" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = \"\"\"\nSELECT a.subr, title, date, score FROM (\n SELECT a.subr, RANK() OVER(PARTITION BY a.subr ORDER BY score desc) rank, LEFT(title, 40) title, DATE(created_utc) date, score\n FROM [bigquery-samples:reddit.full] a\n JOIN [fh-bigquery:reddit.top20] b\n ON a.subr = b.subr\n)\nWHERE rank = 1\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_re5ab3e58cec13c3_0000014161f5be38_33 ... (0s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_re5ab3e58cec13c3_0000014161f5be38_33 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\na_subr title date score\n----------------- ---------------------------------------- ---------- -------\n4chan It's my gf 2012-12-28 7237\nAdviceAnimals Every time I talk on the phone 2012-11-28 7866\nAskReddit Throwaway time! What's your secret that 2012-05-01 12914\nWTF HOLY SHIT THE ONION 2012-09-13 9409\natheism Hurt me good r/atheism, $.50 to Doctors 2011-12-03 6303\naww My local humane society posts pictures o 2012-09-16 10813\ncringepics Text Me. 2012-12-29 10502\nfunny The Bus Knight 2012-07-28 21304\ngaming I've been playing the same game of Civil 2012-06-12 12060\ngifs 24 Hz sound + 24 fps camera + water = ma 2013-03-12 7095\nmildlyinteresting Buried a friend knee deep in sand. When 2013-04-09 7353\npics test post please ignore 2009-07-18 21868\npolitics Obama wins the Presidency! 2008-11-05 8535\nreactiongifs When the computer freezes while watching 2012-12-27 7951\nscience Mars Science Laboratory Curiosity has la 2012-08-06 5785\ntechnology \"Dear Google Fiber: Please, please, plea 2012-11-28 9486\ntodayilearned TIL that after needing 13 liters of bloo 2013-01-11 8198\ntrees Can't believe I met [10] Guy today 2012-07-01 9759\nvideos This guy is a reporter on Fox 2 here in 2012-10-16 17369\nworldnews \"What we have... is... concrete proof of 2013-06-08 8797" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 34 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Now let's play with words. The table [bigquery-samples:reddit.words] has a row for each word in a title, making it easier to analyze individual words.\n\nCan we count how many times a word appeared in a top title each day?" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = \"\"\"\nSELECT title, DATE(created_utc), COUNT(*)\nFROM [bigquery-samples:reddit.words]\nGROUP BY 1, 2\nLIMIT 10\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r4cda068db9995646_0000014161efcb11_15 ... (0s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r4cda068db9995646_0000014161efcb11_15 ... (1s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r4cda068db9995646_0000014161efcb11_15 ... (2s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r4cda068db9995646_0000014161efcb11_15 ... (3s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r4cda068db9995646_0000014161efcb11_15 ... (4s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r4cda068db9995646_0000014161efcb11_15 ... (5s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r4cda068db9995646_0000014161efcb11_15 ... (6s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r4cda068db9995646_0000014161efcb11_15 ... (8s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r4cda068db9995646_0000014161efcb11_15 ... (9s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r4cda068db9995646_0000014161efcb11_15 ... (10s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r4cda068db9995646_0000014161efcb11_15 ... (10s) Current status: DONE " | |
}, | |
{ | |
"ename": "BigqueryInvalidQueryError", | |
"evalue": "Error processing job 'data-sensing-lab:bqjob_r4cda068db9995646_0000014161efcb11_15': Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead. For more details, please see https://developers.google.com/bigquery/docs/query-reference#groupby", | |
"output_type": "pyerr", | |
"traceback": [ | |
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mBigqueryInvalidQueryError\u001b[0m Traceback (most recent call last)", | |
"\u001b[1;32m<ipython-input-16-aec099f5b94e>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[0mLIMIT\u001b[0m \u001b[1;36m10\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 6\u001b[0m \"\"\"\n\u001b[1;32m----> 7\u001b[1;33m \u001b[0mfields\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdata\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mclient\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mReadSchemaAndRows\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mclient\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mQuery\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mquery\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'configuration'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'query'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'destinationTable'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmax_rows\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;36m1000\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 8\u001b[0m \u001b[1;32mprint\u001b[0m \u001b[0mtabulate\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mtabulate\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mheaders\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mx\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'name'\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mx\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mfields\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", | |
"\u001b[1;32m/usr/local/lib/python2.7/dist-packages/bigquery-2.0.14-py2.7.egg/bigquery_client.pyc\u001b[0m in \u001b[0;36mQuery\u001b[1;34m(self, query, destination_table, create_disposition, write_disposition, priority, preserve_nulls, allow_large_results, dry_run, use_cache, **kwds)\u001b[0m\n\u001b[0;32m 1678\u001b[0m \u001b[0mrequest\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m{\u001b[0m\u001b[1;34m'query'\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0mquery_config\u001b[0m\u001b[1;33m}\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1679\u001b[0m \u001b[0m_ApplyParameters\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mrequest\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdry_run\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdry_run\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1680\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mExecuteJob\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mrequest\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1681\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1682\u001b[0m def Load(self, destination_table_reference, source,\n", | |
"\u001b[1;32m/usr/local/lib/python2.7/dist-packages/bigquery-2.0.14-py2.7.egg/bigquery_client.pyc\u001b[0m in \u001b[0;36mExecuteJob\u001b[1;34m(self, configuration, sync, project_id, upload_file, job_id)\u001b[0m\n\u001b[0;32m 1450\u001b[0m job = self.RunJobSynchronously(\n\u001b[0;32m 1451\u001b[0m \u001b[0mconfiguration\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mproject_id\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mproject_id\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mupload_file\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mupload_file\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1452\u001b[1;33m job_id=job_id)\n\u001b[0m\u001b[0;32m 1453\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1454\u001b[0m job = self.StartJob(\n", | |
"\u001b[1;32m/usr/local/lib/python2.7/dist-packages/bigquery-2.0.14-py2.7.egg/bigquery_client.pyc\u001b[0m in \u001b[0;36mRunJobSynchronously\u001b[1;34m(self, configuration, project_id, upload_file, job_id)\u001b[0m\n\u001b[0;32m 1439\u001b[0m \u001b[0mjob_reference\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mBigqueryClient\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mConstructObjectReference\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1440\u001b[0m \u001b[0mresult\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mWaitJob\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mjob_reference\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1441\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mRaiseIfJobError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1442\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1443\u001b[0m def ExecuteJob(self, configuration, sync=None,\n", | |
"\u001b[1;32m/usr/local/lib/python2.7/dist-packages/bigquery-2.0.14-py2.7.egg/bigquery_client.pyc\u001b[0m in \u001b[0;36mRaiseIfJobError\u001b[1;34m(job)\u001b[0m\n\u001b[0;32m 816\u001b[0m raise BigqueryError.Create(\n\u001b[0;32m 817\u001b[0m \u001b[0merror\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0merror\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0merror_ls\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 818\u001b[1;33m job_ref=BigqueryClient.ConstructObjectReference(job))\n\u001b[0m\u001b[0;32m 819\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mjob\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 820\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", | |
"\u001b[1;31mBigqueryInvalidQueryError\u001b[0m: Error processing job 'data-sensing-lab:bqjob_r4cda068db9995646_0000014161efcb11_15': Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead. For more details, please see https://developers.google.com/bigquery/docs/query-reference#groupby" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 16 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "It fails! There are too many distinct groups for GROUP BY to handle. But there is a workaround: GROUP EACH BY. It introduces some limitations on queries, but allows much larger groupings." | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = \"\"\"\nSELECT title, DATE(created_utc) date, COUNT(*) c\nFROM [bigquery-samples:reddit.words]\nGROUP EACH BY 1, 2\nHAVING c > 2\nORDER BY 2 DESC, 3\nLIMIT 10\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r5b50db9b3c30c23e_0000014161f023db_16 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\ntitle date c\n------- ---------- ---\ni 2013-08-20 3\n2 2013-08-20 3\nwhy 2013-08-20 3\ngo 2013-08-20 3\nMy 2013-08-20 3\nA 2013-08-20 3\nIf 2013-08-20 3\nwho 2013-08-20 3\nThis 2013-08-20 3\nit's 2013-08-20 3" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 17 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "As we count words, a problem appears: they are not normalized, so capitalization and punctuation create separate entries for the same word. Luckily, BigQuery can also run regular expressions over massive datasets with ease:" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = \"\"\"\nSELECT REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';-]?[a-z0-9]+)\") word, DATE(created_utc) date, COUNT(*) c\nFROM [bigquery-samples:reddit.words]\nWHERE REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';-]?[a-z0-9]+)\") != ''\nGROUP EACH BY 1, 2\nHAVING c > 2\nORDER BY 2 DESC, 3\nLIMIT 10\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r3a31e6a53d706d7_0000014161f02dd3_17 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nword date c\n---------- ---------- ---\nyear 2013-08-20 3\nstill 2013-08-20 3\nhad 2013-08-20 3\nsql 2013-08-20 3\nmap 2013-08-20 3\nexperience 2013-08-20 3\nhere 2013-08-20 3\nreally 2013-08-20 3\nnow 2013-08-20 3\nlong 2013-08-20 3" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 18 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "(The somewhat elaborate regular expression allows punctuation that traditionally appears inside words, such as apostrophes and hyphens.)\n\nLet's look at the timeline of the word 'sql' in reddit's top posts:" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = \"\"\"\nSELECT REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';-]?[a-z0-9]+)\") word, DATE(created_utc) date, COUNT(*) c\nFROM [bigquery-samples:reddit.words]\nWHERE REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';-]?[a-z0-9]+)\") != ''\nGROUP EACH BY 1, 2\nHAVING word = 'sql'\nORDER BY 2 DESC\nLIMIT 30\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r4dd564868c41fdda_0000014161f03b3c_18 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nword date c\n------ ---------- ---\nsql 2013-08-20 3\nsql 2013-08-19 1\nsql 2013-08-17 1\nsql 2013-08-16 3\nsql 2013-08-15 1\nsql 2013-08-14 4\nsql 2013-08-13 1\nsql 2013-08-12 3\nsql 2013-08-11 1\nsql 2013-08-09 4\nsql 2013-08-08 2\nsql 2013-08-07 1\nsql 2013-08-06 3\nsql 2013-08-05 1\nsql 2013-08-04 2\nsql 2013-07-31 4\nsql 2013-07-30 2\nsql 2013-07-29 2\nsql 2013-07-28 1\nsql 2013-07-25 1\nsql 2013-07-23 1\nsql 2013-07-19 2\nsql 2013-07-18 1\nsql 2013-07-16 3\nsql 2013-07-14 1\nsql 2013-07-13 3\nsql 2013-07-12 2\nsql 2013-07-11 3\nsql 2013-07-10 3\nsql 2013-07-09 1" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 19 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Now let's try something fun: can we find which other words have behaved like 'sql' within the top posts?\n\nFirst, we'll join this table to itself:" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = \"\"\"\nSELECT a.word, b.word, a.date, a.c, b.c \nFROM (\n SELECT REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';-]?[a-z0-9]+)\") word, DATE(created_utc) date, COUNT(*) c\n FROM [bigquery-samples:reddit.words]\n WHERE REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';-]?[a-z0-9]+)\") != ''\n GROUP EACH BY 1, 2\n HAVING word = 'sql'\n) a\nJOIN (\n SELECT REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';-]?[a-z0-9]+)\") word, DATE(created_utc) date, COUNT(*) c\n FROM [bigquery-samples:reddit.words]\n WHERE REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';-]?[a-z0-9]+)\") != ''\n GROUP EACH BY 1, 2\n) b\nON a.date=b.date\nLIMIT 30\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r5b325e72a15a0bfd_0000014161f0448d_19 ... (0s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r5b325e72a15a0bfd_0000014161f0448d_19 ... (157s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r5b325e72a15a0bfd_0000014161f0448d_19 ... (158s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r5b325e72a15a0bfd_0000014161f0448d_19 ... (159s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r5b325e72a15a0bfd_0000014161f0448d_19 ... (160s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r5b325e72a15a0bfd_0000014161f0448d_19 ... (161s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r5b325e72a15a0bfd_0000014161f0448d_19 ... (162s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r5b325e72a15a0bfd_0000014161f0448d_19 ... (163s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r5b325e72a15a0bfd_0000014161f0448d_19 ... (164s) Current status: RUNNING" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " \rWaiting on bqjob_r5b325e72a15a0bfd_0000014161f0448d_19 ... (164s) Current status: DONE " | |
}, | |
{ | |
"ename": "BigqueryInvalidQueryError", | |
"evalue": "Error processing job 'data-sensing-lab:bqjob_r5b325e72a15a0bfd_0000014161f0448d_19': Table too large for JOIN. Consider using JOIN EACH. For more details, https://developers.google.com/bigquery/docs/query-reference#joins", | |
"output_type": "pyerr", | |
"traceback": [ | |
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mBigqueryInvalidQueryError\u001b[0m Traceback (most recent call last)", | |
"\u001b[1;32m<ipython-input-20-729781de8f2e>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 17\u001b[0m \u001b[0mLIMIT\u001b[0m \u001b[1;36m30\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 18\u001b[0m \"\"\"\n\u001b[1;32m---> 19\u001b[1;33m \u001b[0mfields\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdata\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mclient\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mReadSchemaAndRows\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mclient\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mQuery\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mquery\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'configuration'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'query'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'destinationTable'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmax_rows\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;36m1000\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 20\u001b[0m \u001b[1;32mprint\u001b[0m \u001b[0mtabulate\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mtabulate\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mheaders\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mx\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'name'\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mx\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mfields\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", | |
"\u001b[1;32m/usr/local/lib/python2.7/dist-packages/bigquery-2.0.14-py2.7.egg/bigquery_client.pyc\u001b[0m in \u001b[0;36mQuery\u001b[1;34m(self, query, destination_table, create_disposition, write_disposition, priority, preserve_nulls, allow_large_results, dry_run, use_cache, **kwds)\u001b[0m\n\u001b[0;32m 1678\u001b[0m \u001b[0mrequest\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m{\u001b[0m\u001b[1;34m'query'\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0mquery_config\u001b[0m\u001b[1;33m}\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1679\u001b[0m \u001b[0m_ApplyParameters\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mrequest\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdry_run\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdry_run\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1680\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mExecuteJob\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mrequest\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1681\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1682\u001b[0m def Load(self, destination_table_reference, source,\n", | |
"\u001b[1;32m/usr/local/lib/python2.7/dist-packages/bigquery-2.0.14-py2.7.egg/bigquery_client.pyc\u001b[0m in \u001b[0;36mExecuteJob\u001b[1;34m(self, configuration, sync, project_id, upload_file, job_id)\u001b[0m\n\u001b[0;32m 1450\u001b[0m job = self.RunJobSynchronously(\n\u001b[0;32m 1451\u001b[0m \u001b[0mconfiguration\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mproject_id\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mproject_id\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mupload_file\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mupload_file\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1452\u001b[1;33m job_id=job_id)\n\u001b[0m\u001b[0;32m 1453\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1454\u001b[0m job = self.StartJob(\n", | |
"\u001b[1;32m/usr/local/lib/python2.7/dist-packages/bigquery-2.0.14-py2.7.egg/bigquery_client.pyc\u001b[0m in \u001b[0;36mRunJobSynchronously\u001b[1;34m(self, configuration, project_id, upload_file, job_id)\u001b[0m\n\u001b[0;32m 1439\u001b[0m \u001b[0mjob_reference\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mBigqueryClient\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mConstructObjectReference\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1440\u001b[0m \u001b[0mresult\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mWaitJob\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mjob_reference\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1441\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mRaiseIfJobError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1442\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1443\u001b[0m def ExecuteJob(self, configuration, sync=None,\n", | |
"\u001b[1;32m/usr/local/lib/python2.7/dist-packages/bigquery-2.0.14-py2.7.egg/bigquery_client.pyc\u001b[0m in \u001b[0;36mRaiseIfJobError\u001b[1;34m(job)\u001b[0m\n\u001b[0;32m 816\u001b[0m raise BigqueryError.Create(\n\u001b[0;32m 817\u001b[0m \u001b[0merror\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0merror\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0merror_ls\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 818\u001b[1;33m job_ref=BigqueryClient.ConstructObjectReference(job))\n\u001b[0m\u001b[0;32m 819\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mjob\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 820\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", | |
"\u001b[1;31mBigqueryInvalidQueryError\u001b[0m: Error processing job 'data-sensing-lab:bqjob_r5b325e72a15a0bfd_0000014161f0448d_19': Table too large for JOIN. Consider using JOIN EACH. For more details, https://developers.google.com/bigquery/docs/query-reference#joins" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 20 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "That didn't work. We can only JOIN if the second table is small enough. Luckily, we can use JOIN EACH for arbitrarily big tables.\n" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = \"\"\"\nSELECT a.word, b.word, a.date, a.c, b.c \nFROM (\n SELECT REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';-]?[a-z0-9]+)\") word, DATE(created_utc) date, COUNT(*) c\n FROM [bigquery-samples:reddit.words]\n WHERE REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';-]?[a-z0-9]+)\") != ''\n GROUP EACH BY 1, 2\n HAVING word = 'sql'\n LIMIT 10\n) a\nJOIN EACH (\n SELECT REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';-]?[a-z0-9]+)\") word, DATE(created_utc) date, COUNT(*) c\n FROM [bigquery-samples:reddit.words]\n WHERE REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';-]?[a-z0-9]+)\") != ''\n GROUP EACH BY 1, 2\n) b\nON a.date=b.date\nWHERE a.word != b.word\nAND b.word > 'a'\nORDER BY 2, 3 DESC\nLIMIT 30\n\"\"\"\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r668abdc1c6cb4bc1_0000014161f35c27_20 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\na_word b_word a_date a_c b_c\n-------- --------- ---------- ----- -----\nsql a-18 2013-06-11 2 1\nsql a-basin 2012-07-03 2 1\nsql a-listers 2012-11-15 2 1\nsql a-rod 2012-10-08 3 1\nsql a-z 2013-06-11 2 1\nsql a10 2013-02-04 2 1\nsql a1c 2013-06-11 2 1\nsql a1c 2012-11-15 2 1\nsql a6m5 2012-11-15 2 1\nsql aaa 2013-02-04 2 1\nsql aaa 2011-12-27 1 1\nsql aaaand 2012-07-03 2 1\nsql aaaawwww 2012-11-15 2 1\nsql aaah 2012-10-08 3 1\nsql aaahh 2013-06-11 2 1\nsql aalegends 2012-10-08 3 2\nsql aan 2013-06-11 2 1\nsql aapa 2012-10-08 3 1\nsql aaron 2013-06-11 2 1\nsql aaron 2013-02-04 2 1\nsql aaron 2012-10-08 3 1\nsql aarseth 2012-03-26 1 1\nsql ab-soul 2013-02-04 2 1\nsql ab-soul 2012-11-15 2 1\nsql abandon 2013-06-11 2 1\nsql abandon 2013-02-04 2 1\nsql abandon 2012-10-08 3 1\nsql abandoned 2013-06-11 2 2\nsql abandoned 2013-02-04 2 8\nsql abandoned 2012-11-15 2 6" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 21 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "You might be wondering what I'm doing here: Creating timelines of how many times a word has been mentioned, and putting their count side by side with a JOIN. Once I have that, I can look for correlations: Words that have behaved in the same way." | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "base_query = \"\"\"\nSELECT a.word, b.word, CORR(a.c, b.c) corr, COUNT(*) c\nFROM (\n SELECT REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';-]?[a-z0-9]+)\") word, UTC_USEC_TO_MONTH(TIMESTAMP_TO_USEC(created_utc)) date, COUNT(*) c\n FROM [bigquery-samples:reddit.words]\n WHERE REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';-]?[a-z0-9]+)\") = '%s'\n GROUP EACH BY 1, 2\n) a\nJOIN EACH (\n SELECT REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';-]?[a-z0-9]+)\") word, UTC_USEC_TO_MONTH(TIMESTAMP_TO_USEC(created_utc)) date, COUNT(*) c\n FROM [bigquery-samples:reddit.words]\n WHERE REGEXP_EXTRACT(LOWER(title), \"([a-z0-9][a-z-0-9]+[';-]?[a-z0-9]+)\") != ''\n GROUP EACH BY 1, 2\n) b\nON a.date=b.date\nWHERE a.word != b.word\nGROUP BY 1, 2\nHAVING c > 30\nORDER BY 3 DESC\nLIMIT 5\n\"\"\"\nquery = base_query % 'sql'\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r5b346ff48c1cfff4_0000014161f366c2_21 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\na_word b_word corr c\n-------- -------- -------- ---\nsql server 0.962023 53\nsql david 0.959003 56\nsql policy 0.956993 55\nsql reality 0.95693 53\nsql across 0.95687 56" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 22 | |
}, | |
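{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "As a side note, pandas was imported at the top but hasn't been used yet. As a sketch (not part of the original analysis), the rows returned by any of these queries can be wrapped in a DataFrame for local exploration:" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "# Sketch: wrap the last query's rows in a pandas DataFrame.\n# Assumes `fields` and `data` from the previous cell are still in scope.\ndf = pd.DataFrame(data, columns=[x['name'] for x in fields])\nprint df.head()", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |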
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = base_query % 'religion'\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r2ee255a470a0edbb_0000014161f36d12_22 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\na_word b_word corr c\n-------- -------- -------- ---\nreligion brain 0.978323 61\nreligion maps 0.972462 50\nreligion network 0.970603 58\nreligion society 0.969936 56\nreligion human 0.969745 61" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 23 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = base_query % 'holidays'\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r56e55315cff890ef_0000014161f37461_23 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\na_word b_word corr c\n-------- --------- -------- ---\nholidays christmas 0.994448 31\nholidays santa 0.990515 33\nholidays holiday 0.968293 33\nholidays gifts 0.942386 31\nholidays presents 0.859437 31" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 24 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = base_query % 'pizza'\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r19f4aff7d827e811_0000014161f37b6d_24 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\na_word b_word corr c\n-------- -------- -------- ---\npizza great 0.988512 48\npizza office 0.986625 47\npizza food 0.986477 48\npizza making 0.98625 48\npizza better 0.986143 47" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 25 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = base_query % 'crazy'\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r6c83abe5755ac7dc_0000014161f38240_25 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\na_word b_word corr c\n-------- -------- -------- ---\ncrazy coming 0.995498 62\ncrazy girl 0.995198 62\ncrazy used 0.995093 60\ncrazy john 0.99475 61\ncrazy fire 0.994627 57" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 26 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = base_query % 'election'\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r1c170cbe5d0407c2_0000014161f38800_26 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\na_word b_word corr c\n-------- --------- -------- ---\nelection gaza 0.937955 40\nelection polling 0.912526 34\nelection nov 0.911871 37\nelection palestine 0.907705 31\nelection november 0.896156 46" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 27 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = base_query % 'homework'\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r1f297722592a09b0_0000014161f38d5f_27 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\na_word b_word corr c\n-------- -------- -------- ---\nhomework dorm 0.917931 31\nhomework lecture 0.905694 40\nhomework class 0.904165 41\nhomework college 0.903107 41\nhomework campus 0.898893 36" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 28 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = base_query % 'neighbor'\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r404b8034f90b36b_0000014161f3932f_28 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\na_word b_word corr c\n-------- --------- -------- ---\nneighbor dog 0.987744 40\nneighbor apartment 0.985745 37\nneighbor town 0.984668 41\nneighbor tried 0.98371 41\nneighbor edge 0.983477 39" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 29 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "query = base_query % 'volleyball'\nfields, data = client.ReadSchemaAndRows(client.Query(query)['configuration']['query']['destinationTable'], max_rows = 1000)\nprint tabulate.tabulate(data, headers=[x['name'] for x in fields])", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\rWaiting on bqjob_r2847fa7a71dd67b4_0000014161f398db_29 ... (0s) Current status: DONE " | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\na_word b_word corr c\n---------- -------- -------- ---\nvolleyball olympic 0.843229 31\nvolleyball olympics 0.835825 33\nvolleyball athletes 0.805992 33\nvolleyball rises 0.803989 32\nvolleyball ceremony 0.770325 33" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\n" | |
} | |
], | |
"prompt_number": 30 | |
}, | |
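{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "The last several cells repeat the same three lines with a different word each time. A small helper function (not in the original notebook, just a sketch over the same client calls) captures the pattern:" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "# Hypothetical helper wrapping the query-and-print pattern used above.\ndef correlated_words(word):\n    query = base_query % word\n    table = client.Query(query)['configuration']['query']['destinationTable']\n    fields, data = client.ReadSchemaAndRows(table, max_rows = 1000)\n    print tabulate.tabulate(data, headers=[x['name'] for x in fields])\n\ncorrelated_words('sql')", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |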
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Note that by the nature of this dataset and looking for correlations by month by month, the best results come up when trying seasonal words that repeat year after year. " | |
} | |
], | |
"metadata": {} | |
} | |
] | |
} |