ClueWebService
{
"metadata": {
"name": "",
"signature": "sha256:5b9e8c729b75f3f39b38b375b2127ab036850b6028d2f3aee13136d78b8ff7de"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ClueWebService\n",
"\n",
"This IPython notebooks provides a few examples on how to work with a webservice that gives access to an inverted index of the TREC Category B (first 50 million English pages) subset of the [ClueWeb09 dataset](http://lemurproject.org/clueweb09/).\n",
"\n",
"Let's first get some statistics on this collection."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from urllib2 import urlopen\n",
"\n",
"CLUEWEB = \"http://zookst18.science.uva.nl:8003/\"\n",
"\n",
"print urlopen(CLUEWEB + \"stats\").read()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Repository statistics:\n",
"documents:\t50220423\n",
"unique terms:\t90411636\n",
"total terms:\t40541601698\n",
"fields:\t\ttitle heading \n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"40 billion terms, not bad at all!\n",
"\n",
"OK, now a simple search function:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def search(query):\n",
" reply = urlopen(CLUEWEB + \"term/%s\" % query)\n",
" term, normalized, tf_total, total_terms = reply.readline().split()\n",
" docs = [map(int, line.split()) for line in reply]\n",
" print \"Searched for %s (%s normalized)\" % (term, normalized)\n",
" print \"Matches %s out of %s tokens\" % (tf_total, total_terms)\n",
" print \"Found in %s docs.\" % len(docs)\n",
" print \n",
"\n",
" for i, (docid, tf, length) in enumerate(sorted(docs, key=lambda (docid, tf, length): tf, reverse=True)[:10]):\n",
" print \"%2d. Document %8d matches %4d out of %4d tokens\" % (i+1, docid, tf, length)\n",
" \n",
"search(\"encryption\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Searched for encryption (encryption normalized)\n",
"Matches 521503 out of 40541601698 tokens\n",
"Found in 263019 docs.\n",
"\n",
" 1. Document 2551842 matches 1984 out of 7632 tokens"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
" 2. Document 12029058 matches 212 out of 2117 tokens\n",
" 3. Document 2792728 matches 195 out of 6230 tokens\n",
" 4. Document 12029129 matches 182 out of 2033 tokens\n",
" 5. Document 12029130 matches 182 out of 2028 tokens\n",
" 6. Document 43670536 matches 182 out of 3634 tokens\n",
" 7. Document 43670537 matches 182 out of 3636 tokens\n",
" 8. Document 27095467 matches 178 out of 9136 tokens\n",
" 9. Document 13747567 matches 171 out of 2581 tokens\n",
"10. Document 13747809 matches 171 out of 2589 tokens\n"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hmm, 1984 out of 7632 tokens is 'encryption'? Seems a bit much. Could this be spam?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print urlopen(CLUEWEB + \"spam/2551842\").read()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK, so only 1% of all documents are considered spammier than this one. We should probably not rank it very highly."
]
},
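{
"cell_type": "markdown",
"metadata": {},
"source": [
"Out of curiosity, which page is this? The service also exposes a `/documentname` command (see the usage description printed below). The following unexecuted cell is just a sketch; it assumes `/documentname` takes a document ID in the URL the same way `/spam` does."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print urlopen(CLUEWEB + \"documentname/2551842\").read()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What more can we do?"
]
},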
{
"cell_type": "code",
"collapsed": false,
"input": [
"print urlopen(CLUEWEB).read()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"This web service allows a user to search indexed documents using an indexed term, \n",
"and provides access to general statistics about an index in general.\n",
"\n",
"The webservice is more of less of a \"Swiss-army knife\" for various index functions.\n",
"\n",
"Commands for retrieving data from a repository are as follows:\n",
"\n",
"Command Argument(s) Description\n",
"/ (None) Print this usage description\n",
"/term (/t) Term text Print inverted list for a term\n",
"/termpositions (/tp) Term text Print inverted list for a term, with positions\n",
"/fieldpositions /(fp) Field name Print inverted list for a field, with positions\n",
"/documentname (/dn) Document ID Print the text representation of a document ID \n",
"/documenttext (/dt) Document ID Print the text of a document\n",
"/documentdata (/dd) Document ID Print the full representation of a document\n",
"/documentvector (/dv) Document ID Print the document vector of a document\n",
"/spam (/sp) Document ID Print the spamminess percentile score*\n",
"/stats (/s) (None) Print statistics for the Repository\n",
"\n",
"* The percentile score indicates the percentage of the documents in the corpus that are \"spammier.\" \n",
" That is, the spammiest 1% of the documents have percentile-score=0, the next spammiest have \n",
" percentile-score=1, and so on. The least spammy 1% have percentile-score=99. If you just want to \n",
" label pages as spam or not, label those with percentile-score<70 to be spam, and the rest non-spam.\n"
]
}
],
"prompt_number": 4
}
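,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The spam scores suggest an obvious refinement of `search`: skip documents that the percentile scores label as spam. Below is a minimal, unexecuted sketch. It follows the rule of thumb from the usage description (percentile-score < 70 means spam) and only rescores the 50 highest-tf candidates, since every check costs one extra HTTP request; the threshold and the candidate cutoff are illustrative choices, not part of the service."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def search_nonspam(query, top=10, spam_threshold=70, candidates=50):\n",
"    reply = urlopen(CLUEWEB + \"term/%s\" % query)\n",
"    term, normalized, tf_total, total_terms = reply.readline().split()\n",
"    docs = [map(int, line.split()) for line in reply]\n",
"    print \"Searched for %s (%s normalized), found in %s docs.\" % (term, normalized, len(docs))\n",
"\n",
"    shown = 0\n",
"    # Only fetch spam scores for the highest-tf candidates: one request per document.\n",
"    for docid, tf, length in sorted(docs, key=lambda (docid, tf, length): tf, reverse=True)[:candidates]:\n",
"        score = int(urlopen(CLUEWEB + \"spam/%d\" % docid).read())\n",
"        if score < spam_threshold:\n",
"            continue  # documented rule of thumb: percentile-score < 70 means spam\n",
"        shown += 1\n",
"        print \"%2d. Document %8d (spam percentile %2d) matches %4d out of %4d tokens\" % (shown, docid, score, tf, length)\n",
"        if shown == top:\n",
"            break\n",
"\n",
"search_nonspam(\"encryption\")"
],
"language": "python",
"metadata": {},
"outputs": []
}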
],
"metadata": {}
}
]
}