Skip to content

Instantly share code, notes, and snippets.

@trevormunoz
Created August 19, 2013 02:38
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save trevormunoz/6265360 to your computer and use it in GitHub Desktop.
Save trevormunoz/6265360 to your computer and use it in GitHub Desktop.
An iPython notebook demonstrating the use of the Python client library to interact with Open Refine.
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": "Programmatic Use of Open Refine to Facet and Cluster Names of Dishes"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "## Programmatic Use of Open Refine to Facet and Cluster Names of 'Dishes' from NYPL's *What's on the menu?*\nTrevor Mu\u00f1oz\n\n18 August 2013\n"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from google.refine import refine, facet\nimport pystache",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": "This client library assumes that Open Refine is installed and that the Refine server application is running. To create a connection to the server:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "server = refine.RefineServer()\ngrefine = refine.Refine(server)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now that we have a connection to the server from our application we can make calls that replicate the functionality available in the standard graphical user interface (GUI). For instance, I can list the projects I have in my copy of Refine (ymmv). In this case, I have just one project at the moment, containing the 8/1/2013 data release from NYPL. The list_projects command returns a JSON response containing metadata about my project(s). To do any work, I need the project identifier (2310205155087) so I can open the relevant project:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "grefine.list_projects()",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 3,
"text": "{u'2310205155087': {u'created': u'2013-08-16T20:45:49Z',\n u'customMetadata': {},\n u'modified': u'2013-08-16T20:56:56Z',\n u'name': u'2013_08_01_07_05_00_data'}}"
}
],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": "nypl_dishes = grefine.open_project('2310205155087')",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now I can issue commands to facet the values in the 'name' column. To demonstrate that the failure I experienced using the standard GUI is a bottleneck in the frontend code and not the server code that is actually computing the data facets, I'll time the execution of the facet command:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "name_facet = facet.TextFacet('name')\n\n%%timeit facet_response = nypl_dishes.compute_facets(name_facet)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "1 loops, best of 3: 7.83 s per loop\n"
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": "At 7.68 seconds for best time, this command is not amazingly quick but neither is the execution so slow that the whole application should grind to a halt."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "To check that we're getting the same behavior with this command as in the GUI, let's see if the number of calculated facets in the same (370,004):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "facet_response = nypl_dishes.compute_facets(name_facet)\nfacets = facet_response.facets[0]",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": "From this object we can access a dictionary called 'choices' that contains the facets of the data along with their associated counts. The number of keys in this dictionary is the number of facets."
},
{
"cell_type": "code",
"collapsed": false,
"input": "len(facets.choices.keys())",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 7,
"text": "370004"
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now I can display the list of unique values for dish 'name' in descendingly order of their raw count. This is the same information that would appear in one of the boxes of the sidebar of the Refine GUI if it didn't choke. For the sake of space, I'll only output the first 25 here:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "for k in sorted(facets.choices, key=lambda k: facets.choices[k].count, reverse=True)[:25]:\n print facets.choices[k].count, k",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "13 potatoes hashed in cream\n13 cold roast beef\n11 club sandwich\n10 lobster salad\n10 hot roast beef sandwich\n10 american cheese\n10 clams: little necks\n9 celery\n9 american cheese sandwich\n9 strawberry ice cream\n9 potatoes au gratin\n9 french fried potatoes\n9 chocolate ice cream\n9 new lima beans\n9 broiled oysters, celery sauce\n9 roast beef sandwich\n9 caviar, special importation\n9 ice cream\n9 vanilla ice cream\n8 bulgarzoon (scientifically fermented milk), per bottle\n8 milk toast\n8 crab meat salad\n8 cold corned beef\n8 little neck clams\n8 green turtle soup\n"
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In the GUI, the order of operations is to calculate the facets for a column then generate clusters. From the perspective of this little script, the dependency is not clear. In any case, it's now also possible to calculate and inspect the output of Refine's clustering functionality operating on the 'name' column. At first we'll just use the default binning clusterer:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "cluster_response = nypl_dishes.compute_clusters('name')",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Again, what is returned is a JSON response with information about the clusters found in the data. In the GUI, this is turned into a modal overlay where users can select the best value to represent a cluster and then normalize batches of values at the same time."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We can see how many clusters were generated in this first pass:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "len(cluster_response)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 10,
"text": "25712"
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": "And we can inspect the data in the clusters. (For space, only the first five clusters are shown):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "for cluster in cluster_response[:5]:\n print '\\n'\n for line in cluster:\n print(pystache.render('{{count}} \\t {{value}}', line))",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "\n\n5 \t eggs, poached on toast (2)\n4 \t poached eggs on toast (2)\n3 \t 2 eggs, poached on toast\n3 \t eggs poached on toast (2)\n3 \t eggs, poached on toast [2]\n2 \t 2 poached (eggs) on toast\n2 \t 2 poached eggs on toast\n2 \t eggs poached on toast [2]\n2 \t poached eggs (2) on toast\n1 \t (2) poached eggs on toast\n1 \t 2 (eggs) poached on toast\n1 \t 2 eggs poached on toast\n1 \t 2 poached on toast, eggs\n1 \t [eggs] poached on toast (2)\n1 \t eggs (2) poached on toast\n1 \t eggs (2), poached on toast\n1 \t eggs - poached on toast (2)\n1 \t eggs poached (2) on toast\n1 \t eggs poached (2), on toast\n1 \t eggs, 2 poached on toast\n1 \t eggs, poached (2) on toast\n1 \t eggs: 2 poached on toast\n1 \t poached eggs (2), on toast\n1 \t poached eggs on toast, [2]\n1 \t poached eggs, 2 on toast\n\n\n2 \t 2 eggs fried\n2 \t 2 eggs, fried\n2 \t eggs fried (2)\n2 \t eggs, fried (2)\n2 \t fried eggs (2)\n1 \t (2) eggs fried\n1 \t (2) fried eggs\n1 \t 2 eggs: fried\n1 \t 2 fried eggs\n1 \t [2] fried eggs\n1 \t eggs (2) fried\n1 \t eggs (2), fried\n1 \t eggs - fried (2)\n1 \t eggs, (2) fried\n1 \t eggs, 2 fried\n1 \t eggs, fried 2\n1 \t eggs, fried [2]\n1 \t eggs: fried (2)\n1 \t fried (2 eggs)\n1 \t fried 2 [eggs]\n1 \t fried eggs (2 eggs)\n1 \t fried eggs 2\n1 \t fried eggs, (2)\n1 \t fried eggs, 2\n\n\n3 \t boiled eggs (2)\n3 \t eggs boiled (2)\n3 \t eggs, boiled (2)\n2 \t 2 boiled eggs\n2 \t 2 eggs boiled\n2 \t 2 eggs, boiled\n2 \t eggs, boiled [2]\n1 \t (2) boiled eggs\n1 \t (2) eggs boiled\n1 \t 2 eggs: boiled\n1 \t [2] boiled eggs\n1 \t boiled (2 eggs)\n1 \t boiled 2 [eggs]\n1 \t boiled eggs (2 eggs)\n1 \t boiled eggs 2\n1 \t boiled eggs, 2\n1 \t eggs (2) boiled\n1 \t eggs (2), boiled\n1 \t eggs boiled [2]\n1 \t eggs, (2) boiled\n1 \t eggs, 2 boiled\n1 \t eggs, boiled 2\n1 \t eggs: boiled (2)\n\n\n6 \t imported ginger ale, c. & c.\n2 \t c & c imported ginger ale\n2 \t ginger ale, imported c & c\n1 \t "c. & c." ginger ale, imported\n1 \t c. & c. ginger ale, imported\n1 \t c. & c. imported ginger ale\n1 \t ginger ale (c. & c.), imported\n1 \t ginger ale (imported) c. & c.\n1 \t ginger ale imported, c. & c.\n1 \t ginger ale, c & c (imported)\n1 \t ginger ale, c. & c. (imported)\n1 \t ginger ale, c. & c., imported\n1 \t ginger ale, imported, c & c\n1 \t ginger ale, imported, c. & c.\n1 \t imported (c. & c.) ginger ale\n1 \t imported c. & c. ginger ale\n1 \t imported ginger ale (c & c)\n1 \t imported ginger ale (c. & c.)\n1 \t imported ginger ale c & c\n1 \t imported ginger ale c. & c\n1 \t imported ginger ale c. & c.\n1 \t imported ginger ale, c & c\n1 \t imported ginger ale, c. & c\n\n\n7 \t broiled chicken (half)\n3 \t broiled chicken, half\n3 \t half broiled chicken\n2 \t broiled half chicken\n2 \t chicken, broiled, half\n1 \t (half) broiled chicken\n1 \t [broiled] chicken (half)\n1 \t broiled chicken - (half)\n1 \t broiled chicken half\n1 \t broiled chicken, (half)\n1 \t chicken (half) (broiled)\n1 \t chicken (half), broiled\n1 \t chicken broiled (half)\n1 \t chicken broiled half\n1 \t chicken broiled, half\n1 \t chicken half, broiled\n1 \t chicken, broiled (half)\n1 \t chicken, half, broiled\n1 \t half (broiled chicken)\n1 \t half chicken broiled\n1 \t half chicken, broiled\n"
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Directly inspecting the clusters produced by Refine shows that there is great potential for improving the quality of the data by normalizing names of dishes. "
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment