@pkipsy
Created March 30, 2017 23:54
Extracting Frequency Information from Google Search
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Extracting Frequency Information from Google Search Results\n",
"## May 2013\n",
"\n",
"This is a simple script for extracting frequency information for **bigrams** and can easily be modified for larger n-gram sequences. \n",
"\n",
"We start by reading a .csv file of two-word sequences into a Python list."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
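"import csv\n",
"\n",
"# Read the tab-delimited file; each row holds one two-word sequence.\n",
"pairs = csv.reader(open('pairs.csv', 'rU'), dialect=csv.excel_tab)\n",
"search_pairs = []\n",
"for row in pairs:\n",
"    # Join cells with a space so words stored in separate columns stay separate.\n",
"    string_pair = ' '.join(row)\n",
"    search_pairs.append(string_pair)"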
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our queries will be sent through Google's Custom Search API, so we first need to specify a user key."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import urllib\n",
"import json\n",
"\n",
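"# Replace the bracketed placeholders with your own values. The API also expects\n",
"# a search engine ID (cx); %s is filled in with each search string below.\n",
"query = \"https://www.googleapis.com/customsearch/v1?key=[user_key_goes_here]&cx=[search_engine_id_goes_here]&q=%s\""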
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once that's done, we need to write the search string we'll be using for our queries. In this case, we want to know the frequency of \"word<sub>1</sub> + word<sub>2</sub>\". \n",
"\n",
"The following loop runs through our bigram list, building a dictionary that pairs each search string with the number of **Google hits** it returns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
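"google_hits = {}\n",
"for pair in search_pairs:\n",
"    # Wrap the bigram in double quotes for an exact-phrase match, then\n",
"    # URL-encode it (spaces become '+', quotes become %22).\n",
"    search_string = urllib.quote_plus('\"%s\"' % ' '.join(pair.split()))\n",
"    results = urllib.urlopen(query % search_string)\n",
"    json_res = json.loads(results.read())\n",
"    # totalResults is returned as a string, so convert it to an integer.\n",
"    google_hits[pair] = int(json_res['searchInformation']['totalResults'])"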
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we can visually inspect those results:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"for pair in google_hits:\n",
"    print pair + '\\t' + str(google_hits[pair])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
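
The query-building and response-parsing steps in the notebook above can also be sketched without touching the network. This is only an illustration under stated assumptions: it targets Python 3 (the notebook itself uses a Python 2 kernel), the helper names `build_query_url` and `parse_total_results` are hypothetical, the `KEY` and `ENGINE` values are placeholders, and the sample response body is made up to mirror the `searchInformation.totalResults` field the notebook reads.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_query_url(pair, api_key, engine_id):
    """Return a request URL that searches for `pair` as an exact phrase."""
    # Collapse runs of whitespace, then wrap the bigram in double quotes.
    phrase = '"%s"' % " ".join(pair.split())
    params = {
        "key": api_key,   # placeholder: your own API key
        "cx": engine_id,  # placeholder: your own search engine ID
        "q": phrase,
    }
    # urlencode percent-encodes the quotes and turns spaces into '+'.
    return API_ENDPOINT + "?" + urlencode(params)


def parse_total_results(response_text):
    """Extract the hit count from a JSON response body as an integer."""
    data = json.loads(response_text)
    return int(data["searchInformation"]["totalResults"])


# Made-up inputs, used only to show the shape of the data.
sample_url = build_query_url("strong  tea", "KEY", "ENGINE")
sample_body = '{"searchInformation": {"totalResults": "12300"}}'
print(sample_url)
print(parse_total_results(sample_body))
```

Keeping URL construction and response parsing in separate helpers makes each step easy to check on its own before any real requests are sent.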