Created March 30, 2017 23:54
Extracting Frequency Information from Google Search
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Extracting Frequency Information from Google Search Results\n",
"## May 2013\n",
"\n",
"This is a simple script for extracting frequency information for **bigrams**; it can easily be modified for longer n-gram sequences.\n",
"\n",
"We start by reading a tab-delimited .csv file of two-word sequences into a Python list."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import csv\n",
"\n",
"# Read the tab-delimited file; each row holds one two-word sequence.\n",
"search_pairs = []\n",
"with open('pairs.csv', 'rU') as f:\n",
"    for row in csv.reader(f, dialect=csv.excel_tab):\n",
"        # Join cells with a space so the two words can be split apart later.\n",
"        search_pairs.append(' '.join(row))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our queries go through Google's Custom Search JSON API, so we first need to specify a user key. The API also requires a search engine ID (`cx`), and the URL needs a `%s` slot into which each search string is substituted."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import urllib\n",
"import json\n",
"\n",
"# URL template: the %s slot is filled with the search string for each query.\n",
"# cx is the ID of the Custom Search engine to query.\n",
"query = \"https://www.googleapis.com/customsearch/v1?key=[user_key_goes_here]&cx=[search_engine_id_goes_here]&q=%s\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once that's done, we need to build the search string for each query. In this case, we want the frequency of the exact phrase \"word<sub>1</sub> + word<sub>2</sub>\".\n",
"\n",
"The following loop runs through our bigram list, building a dictionary that maps each bigram to the number of **Google hits** it returns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"google_hits = {}\n",
"for pair in search_pairs:\n",
"    # Wrap the bigram in double quotes so it is searched as an exact phrase,\n",
"    # then URL-encode it (quotes become %22, the space becomes '+').\n",
"    search_string = urllib.quote_plus('\"' + ' '.join(pair.split()) + '\"')\n",
"    results = urllib.urlopen(query % search_string)\n",
"    json_res = json.loads(results.read())\n",
"    google_hits[pair] = int(json_res['searchInformation']['totalResults'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we can print the results, one bigram and its hit count per line, and inspect them:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"for pair, hits in google_hits.items():\n",
"    print(pair + '\\t' + str(hits))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
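
The reading, query-building, and parsing steps in the notebook above can also be sketched as small standalone helpers, which makes them checkable without calling the API. This is a minimal sketch rather than the notebook's own code: the names `read_bigrams`, `build_query_url`, and `parse_total_results` are hypothetical, the `cx` parameter assumes you have a Custom Search engine ID, and the response shape is assumed from the `searchInformation` / `totalResults` fields the notebook reads.

```python
import csv
import json

try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2, which the notebook targets

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def read_bigrams(lines):
    """Turn tab-delimited lines into space-separated bigram strings (hypothetical helper)."""
    rows = csv.reader(lines, dialect=csv.excel_tab)
    # Skip empty rows; join the cells of each row with a single space.
    return [' '.join(row) for row in rows if row]


def build_query_url(api_key, engine_id, bigram):
    """Build a URL that searches for the bigram as an exact, quoted phrase (hypothetical helper)."""
    phrase = '"' + ' '.join(bigram.split()) + '"'
    # A list of tuples keeps the parameter order stable; urlencode escapes
    # the double quotes as %22 and the space as '+'.
    params = [("key", api_key), ("cx", engine_id), ("q", phrase)]
    return API_ENDPOINT + "?" + urlencode(params)


def parse_total_results(response_text):
    """Extract the hit count from a JSON response body (assumed response shape)."""
    data = json.loads(response_text)
    return int(data["searchInformation"]["totalResults"])


# Example with placeholder credentials; no network request is made here.
bigrams = read_bigrams(["strong\ttea", "powerful\tcomputer"])
example_url = build_query_url("KEY", "ENGINE", bigrams[0])
```

Keeping URL construction separate from the network call means the encoding of the quoted phrase can be inspected on its own before any API quota is spent.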