Skip to content

Instantly share code, notes, and snippets.

@PraveshKoirala
Created July 24, 2016 13:41
Show Gist options
  • Save PraveshKoirala/a46066c8b062956001d2ab056cb7d488 to your computer and use it in GitHub Desktop.
Save PraveshKoirala/a46066c8b062956001d2ab056cb7d488 to your computer and use it in GitHub Desktop.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Simple search engine\n",
"#### By Pravesh Koirala\n",
"\n",
"A search engine that will index given [toy] documents.. just to show how to do it."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"stop_words = ['and', 'or', 'the', 'a', 'an', 'is', 'some']\n",
"\n",
"doc1 = \"In second module, the input data are study used by ANN simulator to detect the quality of apple. The ANN \\\n",
"simulator program is developed in Matlab Compiler and \\\n",
"Matlab Neural Network Toolbox. It can segregate apple \\\n",
"according to defect, size, color, etc\"\n",
"\n",
"doc2 = \"William Shakespeare was an English poet, playwright, and actor, widely regarded as the greatest \\\n",
"writer in the English language and the world's pre-eminent dramatist.[2] He is often called England's \\\n",
"national poet, and the \\\"Bard of Avon\\\" His extant works, including collaborations, consist of approximately \\\n",
"38 plays,[nb 3] 154 sonnets, two long narrative poems, and a few other verses, some of uncertain authorship. \\\n",
"His plays have been translated into every major living language and are performed more often \\\n",
"than those of any other playwright.[4]\"\n",
"\n",
"doc3 = \"Politicsis the process of making uniform decisions applying to all members of a group. \\\n",
"It also involves the use of power by one person to affect the behavior of another person. \\\n",
"More narrowly, it refers to achieving and exercising positions of governance — \\\n",
"organized control over a human community, particularly a state. Furthermore, \\\n",
"politics is the study or practice of the distribution of power and resources \\\n",
"within a given community (a usually hierarchically organized population) as well as \\\n",
"the interrelationship(s) between communities\"\n",
"\n",
"document = {\n",
" \"doc1\": doc1,\n",
" \"doc2\": doc2,\n",
" \"doc3\": doc3\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import re\n",
"from collections import Counter, defaultdict\n",
"\n",
"# get a set of words, minus the stop words\n",
"def get_words(doc):\n",
" words = re.findall(r'\\w+', doc)\n",
" count_words_wo_stop = {k.lower(): v for k, v in Counter(words).items() if k not in stop_words}\n",
" return count_words_wo_stop\n",
"\n",
"index = None\n",
"# for every word, you'll have.. a list of (doc, count)\n",
"index = defaultdict(list)\n",
"\n",
"def index_words(doc, counter):\n",
" global index\n",
" for w, c in counter.items():\n",
" index[w].append((doc, c))\n",
"\n",
"# extracting words from the document and indexing it.\n",
"for key, doc in document.items():\n",
" word_dic = get_words(doc)\n",
" index_words(key, word_dic)\n",
"\n",
"# sort the word lists in default dict\n",
"for w, L in index.items():\n",
" index[w] = sorted(L)\n",
"\n",
"# Uncomment this to find how the index looks... hint: it is quite messy!\n",
"# print index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now create a search function\n",
"\n",
"The function will work as follows:\n",
"- Remove the stop words from the search term\n",
"- Find all related document set\n",
"- Find the intersection of document set\n",
"- Use TF-IDF to find the most interesting document among the intersected document set.\n"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.38629436112\n",
"set(['doc3', 'doc1'])\n"
]
}
],
"source": [
"import math\n",
"\n",
"total_docs = 3\n",
"\n",
"def get_interesting_docs(search_word):\n",
" return set([doc_id for doc_id, _ in index[search_word]])\n",
"\n",
"def tf_idf(word, document):\n",
" docs_containing_word = len(index[word])\n",
" if docs_containing_word:\n",
" idf = math.log(1 + total_docs / docs_containing_word )\n",
" else:\n",
" idf = 0\n",
" \n",
" total_words_in_doc = next((count for doc, count in index[word] if doc==document), 0)\n",
" \n",
" tf = 1 + math.log(total_words_in_doc) if total_words_in_doc else 0\n",
" \n",
" return tf * idf\n",
"\n",
"def search(search_term):\n",
" words = get_words(search_term)\n",
" search_set = set()\n",
" for w, c in words.items():\n",
" doc_set = get_interesting_docs(w)\n",
" # For first document set, add this as is\n",
" if not search_set:\n",
" search_set = doc_set\n",
" continue\n",
" \n",
" # For other document set, find the intersection\n",
" search_set &= doc_set\n",
" return search_set\n",
"\n",
"print tf_idf(\"second\", \"doc1\")\n",
"\n",
"# TODO: Sort the results by tf-idf\n",
"print search(\"study\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment