ClueWebService
{
"metadata": {
"name": "",
"signature": "sha256:5b9e8c729b75f3f39b38b375b2127ab036850b6028d2f3aee13136d78b8ff7de"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ClueWebService\n",
"\n",
"This IPython notebooks provides a few examples on how to work with a webservice that gives access to an inverted index of the TREC Category B (first 50 million English pages) subset of the [ClueWeb09 dataset](http://lemurproject.org/clueweb09/).\n",
"\n",
"Let's first get some statistics on this collection."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from urllib2 import urlopen\n",
"\n",
"CLUEWEB = \"http://zookst18.science.uva.nl:8003/\"\n",
"\n",
"print urlopen(CLUEWEB + \"stats\").read()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Repository statistics:\n",
"documents:\t50220423\n",
"unique terms:\t90411636\n",
"total terms:\t40541601698\n",
"fields:\t\ttitle heading \n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"40 billion terms, not bad at all!\n",
"\n",
"OK, now a simple search function:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def search(query):\n",
" reply = urlopen(CLUEWEB + \"term/%s\" % query)\n",
" term, normalized, tf_total, total_terms = reply.readline().split()\n",
" docs = [map(int, line.split()) for line in reply]\n",
" print \"Searched for %s (%s normalized)\" % (term, normalized)\n",
" print \"Matches %s out of %s tokens\" % (tf_total, total_terms)\n",
" print \"Found in %s docs.\" % len(docs)\n",
" print \n",
"\n",
" for i, (docid, tf, length) in enumerate(sorted(docs, key=lambda (docid, tf, length): tf, reverse=True)[:10]):\n",
" print \"%2d. Document %8d matches %4d out of %4d tokens\" % (i+1, docid, tf, length)\n",
" \n",
"search(\"encryption\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Searched for encryption (encryption normalized)\n",
"Matches 521503 out of 40541601698 tokens\n",
"Found in 263019 docs.\n",
"\n",
" 1. Document 2551842 matches 1984 out of 7632 tokens"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
" 2. Document 12029058 matches 212 out of 2117 tokens\n",
" 3. Document 2792728 matches 195 out of 6230 tokens\n",
" 4. Document 12029129 matches 182 out of 2033 tokens\n",
" 5. Document 12029130 matches 182 out of 2028 tokens\n",
" 6. Document 43670536 matches 182 out of 3634 tokens\n",
" 7. Document 43670537 matches 182 out of 3636 tokens\n",
" 8. Document 27095467 matches 178 out of 9136 tokens\n",
" 9. Document 13747567 matches 171 out of 2581 tokens\n",
"10. Document 13747809 matches 171 out of 2589 tokens\n"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hmm, 1984 out of 7632 tokens is 'encryption'? Seems a bit much. Could this be spam?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print urlopen(CLUEWEB + \"spam/2551842\").read()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK, so only 1% of all documents are considered spammier than this one. We should probably not rank it very highly."
]
},
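{
"cell_type": "markdown",
"metadata": {},
"source": [
"Out of curiosity, which page is this? The service also exposes a `/documentname` command (see the usage description printed below). The following unexecuted cell is just a sketch; it assumes `/documentname` takes a document ID in the URL the same way `/spam` does."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print urlopen(CLUEWEB + \"documentname/2551842\").read()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What more can we do?"
]
},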
{
"cell_type": "code",
"collapsed": false,
"input": [
"print urlopen(CLUEWEB).read()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"This web service allows a user to search indexed documents using an indexed term, \n",
"and provides access to general statistics about an index in general.\n",
"\n",
"The webservice is more of less of a \"Swiss-army knife\" for various index functions.\n",
"\n",
"Commands for retrieving data from a repository are as follows:\n",
"\n",
"Command Argument(s) Description\n",
"/ (None) Print this usage description\n",
"/term (/t) Term text Print inverted list for a term\n",
"/termpositions (/tp) Term text Print inverted list for a term, with positions\n",
"/fieldpositions /(fp) Field name Print inverted list for a field, with positions\n",
"/documentname (/dn) Document ID Print the text representation of a document ID \n",
"/documenttext (/dt) Document ID Print the text of a document\n",
"/documentdata (/dd) Document ID Print the full representation of a document\n",
"/documentvector (/dv) Document ID Print the document vector of a document\n",
"/spam (/sp) Document ID Print the spamminess percentile score*\n",
"/stats (/s) (None) Print statistics for the Repository\n",
"\n",
"* The percentile score indicates the percentage of the documents in the corpus that are \"spammier.\" \n",
" That is, the spammiest 1% of the documents have percentile-score=0, the next spammiest have \n",
" percentile-score=1, and so on. The least spammy 1% have percentile-score=99. If you just want to \n",
" label pages as spam or not, label those with percentile-score<70 to be spam, and the rest non-spam.\n"
]
}
],
"prompt_number": 4
}
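,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The spam scores suggest an obvious refinement of `search`: skip documents that the percentile scores label as spam. Below is a minimal, unexecuted sketch. It follows the rule of thumb from the usage description (percentile-score < 70 means spam) and only rescores the 50 highest-tf candidates, since every check costs one extra HTTP request; the threshold and the candidate cutoff are illustrative choices, not part of the service."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def search_nonspam(query, top=10, spam_threshold=70, candidates=50):\n",
"    reply = urlopen(CLUEWEB + \"term/%s\" % query)\n",
"    term, normalized, tf_total, total_terms = reply.readline().split()\n",
"    docs = [map(int, line.split()) for line in reply]\n",
"    print \"Searched for %s (%s normalized), found in %s docs.\" % (term, normalized, len(docs))\n",
"\n",
"    shown = 0\n",
"    # Only fetch spam scores for the highest-tf candidates: one request per document.\n",
"    for docid, tf, length in sorted(docs, key=lambda (docid, tf, length): tf, reverse=True)[:candidates]:\n",
"        score = int(urlopen(CLUEWEB + \"spam/%d\" % docid).read())\n",
"        if score < spam_threshold:\n",
"            continue  # documented rule of thumb: percentile-score < 70 means spam\n",
"        shown += 1\n",
"        print \"%2d. Document %8d (spam percentile %2d) matches %4d out of %4d tokens\" % (shown, docid, score, tf, length)\n",
"        if shown == top:\n",
"            break\n",
"\n",
"search_nonspam(\"encryption\")"
],
"language": "python",
"metadata": {},
"outputs": []
}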
],
"metadata": {}
}
]
}