pkipsy/bigram-extraction.ipynb

## bigram-extraction.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Extracting Frequency Information from Google Search Results\n",
    "## May 2013\n",
    "\n",
    "This is a simple script for extracting frequency information for **bigrams**; it can easily be modified for larger n-gram sequences.\n",
    "\n",
    "We start by reading a tab-delimited .csv file of two-word sequences into a Python list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import csv\n",
    "\n",
    "# pairs.csv is read as tab-delimited; each row holds one two-word sequence\n",
    "search_pairs = []\n",
    "with open('pairs.csv', 'rU') as f:\n",
    "    for row in csv.reader(f, dialect=csv.excel_tab):\n",
    "        # Join with a space so words stored in separate columns stay separate\n",
    "        search_pairs.append(' '.join(row))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our queries go through Google's Custom Search JSON API, so we first need to supply an API key."
   ]
  },
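  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import urllib\n",
    "import json\n",
    "\n",
    "# Replace the bracketed placeholders with your own API key and search engine ID (the API's cx parameter).\n",
    "# The trailing %s is filled in with the search string for each query.\n",
    "query = \"https://www.googleapis.com/customsearch/v1?key=[user_key_goes_here]&cx=[search_engine_id_goes_here]&q=%s\""
   ]
  },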
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once that's done, we need to build the search string for each query. In this case, we want to know the frequency of \"word<sub>1</sub> + word<sub>2</sub>\".\n",
    "\n",
    "The following loop runs through our bigram list, building a dictionary that pairs each bigram with the number of **Google hits** its search string returns."
   ]
  },
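  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "google_hits = {}\n",
    "for pair in search_pairs:\n",
    "    # Wrap the bigram in double quotes for an exact-phrase match, then URL-encode it\n",
    "    # (quote_plus encodes the quotes and turns the space into '+')\n",
    "    search_string = urllib.quote_plus('\"' + ' '.join(pair.split()) + '\"')\n",
    "    results = urllib.urlopen(query % search_string)\n",
    "    json_res = json.loads(results.read())\n",
    "    # totalResults is the API's estimated hit count; store it as an integer\n",
    "    google_hits[pair] = int(json_res['searchInformation']['totalResults'])"
   ]
  },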
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can visually inspect those results:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Print one tab-separated line per bigram: the pair, then its hit count\n",
    "for pair in google_hits:\n",
    "    print(pair + '\\t' + str(google_hits[pair]))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
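
The query-building and hit-counting steps in the notebook can also be sketched as two small helpers that run without touching the network. This is only an illustration under stated assumptions: `build_query_url`, `parse_hits`, the placeholder key and engine ID, and the sample response body are hypothetical and not part of the original notebook. The `cx` parameter is assumed to carry a search engine ID alongside the key.

```python
import json

try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2, as used by the notebook

API_URL = "https://www.googleapis.com/customsearch/v1"


def build_query_url(api_key, engine_id, bigram):
    """Return a request URL that searches for the bigram as an exact phrase."""
    # Collapse whitespace and wrap in double quotes for an exact-phrase match.
    phrase = '"%s"' % " ".join(bigram.split())
    # urlencode percent-encodes the quotes and turns the space into '+'.
    params = [("key", api_key), ("cx", engine_id), ("q", phrase)]
    return API_URL + "?" + urlencode(params)


def parse_hits(response_text):
    """Extract the hit count from a JSON response body as an integer."""
    data = json.loads(response_text)
    # Same lookup as in the notebook: searchInformation -> totalResults.
    return int(data["searchInformation"]["totalResults"])


# Hypothetical key, engine ID, and response body, for illustration only.
url = build_query_url("MY_KEY", "MY_ENGINE_ID", "strong  tea")
sample_response = '{"searchInformation": {"totalResults": "12300"}}'
hits = parse_hits(sample_response)
```

In a live run, the URL would be passed to `urllib.urlopen` (Python 2) or `urllib.request.urlopen` (Python 3), and the response body handed to `parse_hits`. Keeping URL construction and parsing separate makes both steps easy to check before spending any API quota.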