emraher/findsimilar.py

## readme.md

      
findsimilar

Using the Whoosh Python library, this script indexes a directory of text notes and then finds the notes that are similar to a given one.

Requirements

Requires Python and Whoosh. To install Whoosh, use:
$ pip install whoosh

Preliminary structures

Notes directory structure

The script accepts the following note directory structure:
$ tree -a -L 1
.
├── .index
├── archive
├── findsimilar.py

All of the notes are contained within the archive directory. Note that the directory .index is created by the script if it does not exist.

Notes form

My notes are structured in the following form:
A reinforcing feedback loop creates more input to a stock the more that is already within it. It enhances whatever direction of change is imposed on it.

For example population growth, company profits, pollution etc..

They exist when a system element has the ability to reproduce itself or to grow at a constant fraction of itself.

----

{Meadows2008t}
{1502141701}

@systems-theory
@feedback-loops

The content of the note is all text above the markdown horizontal break syntax ----. Below it is the metadata: links to references, links to the UIDs of other notes, and tags prefixed with @. The script currently ignores everything below the horizontal break.
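
The split described above can be sketched in a few lines of Python. This is a minimal illustration, not the script itself: `extract_note_content` is a hypothetical helper name, and skipping lines that start with `!` mirrors what the indexing loop in findsimilar.py does.

```python
def extract_note_content(text):
    """Return the note body: every line before the first line containing '----'.

    Lines that start with '!' are left out, as in the script's indexing loop.
    """
    body = []
    for line in text.splitlines(keepends=True):
        if '----' in line:
            break  # everything from here on is metadata
        if not line.startswith('!'):
            body.append(line)
    return ''.join(body)


if __name__ == "__main__":
    note = (
        "A reinforcing feedback loop creates more input to a stock.\n"
        "\n"
        "----\n"
        "\n"
        "{Meadows2008t}\n"
        "@systems-theory\n"
    )
    print(extract_note_content(note))
```

Only the first paragraph of the sample note is returned; the reference and the tag below the break are dropped.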

Usage

Note: The script has only been tested using a notes directory structure similar to that shown in the section Notes directory structure.
Three variables at the top of the script need to be set:

notesDir - The directory that contains your notes, relative to the script.
notesFileExtension - The file extension of the notes to index. The default is md.
indexDir - The folder that holds the indexed note data. It is created relative to where the script is located if it does not exist.
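
As a rough sketch of how these settings fit together: the script selects notes with a glob pattern built from notesDir and notesFileExtension. The helpers `note_pattern` and `list_notes` below are hypothetical names used only for illustration. Because the paths are relative, Python resolves them against the current working directory, which is why the commands are run from the script's folder.

```python
import glob
import os

# Same names and default values as the settings listed above.
notesDir = "archive"
notesFileExtension = "md"
indexDir = ".index"


def note_pattern(notes_dir, extension):
    """Build the glob pattern that selects the note files (hypothetical helper)."""
    return notes_dir + "/*." + extension


def list_notes(notes_dir, extension):
    """Return the matching note paths, sorted for stable output."""
    return sorted(glob.glob(note_pattern(notes_dir, extension)))


if __name__ == "__main__":
    print("pattern:", note_pattern(notesDir, notesFileExtension))
    print("index folder:", os.path.abspath(indexDir))
    for path in list_notes(notesDir, notesFileExtension):
        print(path)
```

Running this from the notes folder lists exactly the files that would be indexed, which is a quick way to check the settings before building the index.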

Once these are set, change into the directory that contains the script and build the index:
$ python findsimilar.py createindex

You can now find the notes that share rare words with a given note:
$ python findsimilar.py search <note path>

For example:
$ python findsimilar.py search "archive/1502141701 Balancing feedback loops in systems.md"
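
The idea of "similar rare words" can be illustrated without Whoosh. The sketch below is not what Whoosh does internally: it swaps in a plain inverse-document-frequency weighting, and `tokenize`, `rare_word_weights` and `similar_notes` are hypothetical names. It only shows why a word shared by few notes counts for more than a word found in every note.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def rare_word_weights(notes):
    """Map each word to log(N / document frequency); rarer words weigh more."""
    doc_freq = Counter()
    for text in notes.values():
        doc_freq.update(set(tokenize(text)))
    total = len(notes)
    return {word: math.log(total / df) for word, df in doc_freq.items()}


def similar_notes(target, notes):
    """Rank the other notes by the summed weight of the words shared with target."""
    weights = rare_word_weights(notes)
    target_words = set(tokenize(notes[target]))
    scores = {}
    for name, text in notes.items():
        if name == target:
            continue
        shared = target_words & set(tokenize(text))
        scores[name] = sum(weights[word] for word in shared)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    notes = {
        "reinforcing.md": "A reinforcing feedback loop adds to a stock",
        "balancing.md": "A balancing feedback loop drains a stock",
        "bread.md": "A recipe for bread",
    }
    for name, score in similar_notes("reinforcing.md", notes):
        print(name, round(score, 3))
```

In this sample the two feedback-loop notes rank closest because they share words that the third note lacks, while the word they all share contributes nothing.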


## findsimilar.py
import glob
import os
import sys

from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in, open_dir


# USER SET PARAMETERS ############

notesDir =              "archive"   # set this to the (relative to the script) folder that contains your notes.
notesFileExtension =    "md"        # set this to the file extension of the notes you want to query (txt, md, markdown etc.)
indexDir =              ".index"    # whoosh index folder

##################################


def createIndex():
    """ Create index for whoosh to be able to query """
    if not os.path.exists(indexDir):
        os.makedirs(indexDir)

    schema = Schema(title=TEXT(stored=True),
                    path=ID(stored=True),
                    content=TEXT(stored=True))

    # Use the configured folder rather than a hard-coded ".index"
    ix = create_in(indexDir, schema)
    writer = ix.writer()

    for filename in glob.glob(notesDir + '/*.' + notesFileExtension):
        noteContent = ""
        with open(filename, 'r', encoding='utf-8') as myfile:
            for line in myfile:
                # Stop at the horizontal break: everything below it is metadata
                if '----' in line:
                    break
                # Skip lines that start with '!'
                if not line.startswith('!'):
                    noteContent += line

        writer.add_document(title=os.path.basename(filename),
                            path=filename,
                            content=noteContent)

    writer.commit()
    print("index created")


def searchSimilar(fullfilename):
    """ Search for similar documents using a document pathname that
        has already been indexed.
    """
    ix = open_dir(indexDir)

    with ix.searcher() as searcher:
        # Drop the directory and the file extension for display
        notename = os.path.splitext(os.path.basename(fullfilename))[0]

        docnum = searcher.document_number(path=fullfilename)
        if docnum is None:
            print("This document has not been indexed")
        else:
            r = searcher.more_like(docnum, 'content', numterms=20)
            if len(r) > 1:
                header = "Similar files to '" + notename + "'"
                print("\n" + header + "\n" + "-" * len(header) + "\n")
                for hit in r:
                    print(os.path.splitext(hit['title'])[0])
                    print(" score: " + str(hit.score) + "\n")

            # key_terms() returns (term, score) pairs; keep only the terms
            print("keywords: " + ", ".join(term for term, _ in r.key_terms('content')))


def printUsage():
    print("usage:")
    print(" python findsimilar.py createindex")
    print(" python findsimilar.py search <filepath>")


def main():
    # Check the argument count first so a missing argument prints the usage
    # text instead of raising IndexError
    if len(sys.argv) >= 2 and sys.argv[1] == 'createindex':
        createIndex()
    elif len(sys.argv) >= 3 and sys.argv[1] == 'search':
        searchSimilar(sys.argv[2])
    else:
        printUsage()
        sys.exit(1)


if __name__ == "__main__":
    main()