Created June 9, 2015 17:16


Using the whoosh Python library, this script indexes text documents and then lets you find documents that are similar to a given one.


Requires Python and whoosh. To install whoosh use:

$ pip install whoosh

Preliminary structures

Notes directory structure

The script accepts the following note directory structure:

$ tree -a -L 1
├── .index
├── archive

All of the notes are contained within the archive directory. Note that the directory .index is created by the script if it does not exist.
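To check which files the script would pick up from this layout, a minimal sketch along these lines may help. It assumes the default settings (an `archive` folder and the `md` extension); the helper name `list_notes` is hypothetical and not part of the script.

```python
import glob
import os


def list_notes(notes_dir="archive", extension="md"):
    """Return the note files matched by '<notes_dir>/*.<extension>', sorted by name."""
    pattern = os.path.join(notes_dir, "*." + extension)
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Run from the directory that contains the archive folder.
    for path in list_notes():
        print(path)
```

Only files directly inside the notes folder match this pattern; notes in subfolders would not be listed.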

Notes form

My notes are structured in the following form:

A reinforcing feedback loop creates more input to a stock the more that is already within it. It enhances whatever direction of change is imposed on it.

For example population growth, company profits, pollution etc..

They exist when a system element has the ability to reproduce itself or to grow at a constant fraction of itself.




The content of the note is all of the text above the markdown horizontal break syntax ----. Below the break is the metadata: links to references, links to the UIDs of other notes, and tags, which are prepended with @. The script currently ignores everything below the horizontal break.
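The split described above can be sketched in a few lines of plain Python. This is an illustration only, not the script's own code: the helper names `split_note` and `extract_tags` are hypothetical, and the metadata layout in the usage example is an assumed one.

```python
def split_note(text):
    """Split a note into (content, metadata) at the first line starting with '----'."""
    content_lines = []
    metadata_lines = []
    below_break = False
    for line in text.splitlines():
        if not below_break and line.strip().startswith("----"):
            below_break = True  # the break line itself belongs to neither part
            continue
        (metadata_lines if below_break else content_lines).append(line)
    return "\n".join(content_lines).strip(), "\n".join(metadata_lines).strip()


def extract_tags(metadata):
    """Return the words in the metadata block that start with '@', without the '@'."""
    return [word[1:] for word in metadata.split()
            if word.startswith("@") and len(word) > 1]


if __name__ == "__main__":
    # Hypothetical note: body, horizontal break, then a UID link and two tags.
    note = "Body of the note.\n\n----\n\n1502141701\n@systems @feedback\n"
    content, metadata = split_note(note)
    print(content)
    print(extract_tags(metadata))
```

A note without a break is treated as all content, with empty metadata.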


Note: The script has only been tested using a notes directory structure similar to that shown in the section Notes directory structure.

There are three variables at the top of the script to be set.

  • notesDir - Set this to the directory that your notes are contained in.
  • notesFileExtension - Set this to the file extension of your notes. The default is md.
  • indexDir - This folder will be created relative to where the script is located and contains all of the indexed data of the notes.

Once these variables are set, change into the directory that contains the script and build the index with the following:

$ python createindex

Now you can find which notes share rare words with a given note:

$ python search <note path>

for example:

$ python search "archive/1502141701 Balancing feedback loops in"
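To give a feel for what "similar rare words" means, here is a toy sketch in plain Python. It is not how the script or whoosh computes similarity (the script delegates that to whoosh's `more_like`); it only illustrates the idea that words shared by few notes count for more than words shared by all of them. The function names and the sample notes are made up for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def rank_similar(notes, target):
    """Rank the other notes by the summed rarity of the words they share with target.

    Rarity is log(number of notes / number of notes containing the word),
    so a word that appears in every note contributes nothing.
    """
    vocab = {name: set(tokenize(text)) for name, text in notes.items()}
    doc_freq = Counter(word for words in vocab.values() for word in words)
    total = len(notes)
    scores = {}
    for name, words in vocab.items():
        if name == target:
            continue
        shared = words & vocab[target]
        scores[name] = sum(math.log(total / doc_freq[word]) for word in shared)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    notes = {
        "reinforcing": "A reinforcing feedback loop creates more input to a stock",
        "balancing": "A balancing feedback loop keeps a stock near a goal",
        "recipe": "A recipe for bread needs flour and a warm oven",
    }
    for name, score in rank_similar(notes, "reinforcing"):
        print(name, round(score, 3))
```

In this example the common word "a" appears in every note and adds nothing to any score, while "feedback", "loop" and "stock" pull the two feedback notes together.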

import glob
import os
import sys

from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in, open_dir

# USER SET PARAMETERS ############
notesDir = "archive"  # folder (relative to the script) that contains your notes
notesFileExtension = "md"  # file extension of the notes to index (txt, md, markdown etc.)
indexDir = ".index"  # whoosh index folder


def createIndex():
    """Create the index that whoosh queries (written for Python 3)."""
    if not os.path.exists(indexDir):
        os.mkdir(indexDir)
    # Only the first schema line survives in this copy of the gist. The path and
    # content fields are reconstructed from add_document() below; their types
    # are assumptions. content is stored so that more_like() can read it back.
    schema = Schema(title=TEXT(stored=True),
                    path=ID(stored=True, unique=True),
                    content=TEXT(stored=True))
    ix = create_in(indexDir, schema)
    writer = ix.writer()
    for filename in glob.glob(notesDir + '/*.' + notesFileExtension):
        noteContent = ""
        with open(filename, 'r', encoding='utf-8') as myfile:
            for line in myfile:
                if '----' in line:
                    break  # everything below the horizontal break is metadata
                if not line.startswith('!'):  # skip lines that start with '!'
                    noteContent += line
        writer.add_document(title=os.path.basename(filename),
                            path=filename,
                            content=noteContent)
    writer.commit()  # without a commit nothing is written to the index
    print("index created")


def searchSimilar(fullfilename):
    """Search for similar documents using a document pathname that
    has already been indexed."""
    ix = open_dir(indexDir)
    with ix.searcher() as searcher:
        filename = os.path.basename(fullfilename)
        docnum = searcher.document_number(path=fullfilename)
        if docnum is None:
            print("This document has not been indexed")
            return
        r = searcher.more_like(docnum, 'content', numterms=20)
        if len(r) > 0:
            suffix = "." + notesFileExtension
            header = "Similar files to '" + filename.replace(suffix, "") + "'"
            print("\n" + header + "\n" + "-" * len(header) + "\n")
            for hit in r:
                print(hit['title'].replace(suffix, ""))
                print(" score: " + str(hit.score) + "\n")
            # key_terms() returns (term, weight) pairs; print only the terms
            print("keywords: " + ", ".join(term for term, _ in r.key_terms('content')))


def printUsage():
    print("usage:")
    print(" python createindex")
    print(" python search <filepath>")


def main():
    if len(sys.argv) > 1 and sys.argv[1] == 'createindex':
        createIndex()
    elif len(sys.argv) > 2 and sys.argv[1] == 'search':
        searchSimilar(sys.argv[2])
    else:
        printUsage()


if __name__ == "__main__":
    main()