@anoxic
Last active December 15, 2015 22:59
Creates and searches an index, returning lines from the original file.
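class Text
  attr_reader :index, :text, :symbol

  def initialize(name, dir = :texts)
    # File.exists? was removed in Ruby 3.2; File.exist? works everywhere.
    raise TextNotFoundError, "Can't find #{name}.txt" unless File.exist? "./#{dir}/#{name}.txt"
    Index.new(name, dir) unless File.exist? "./#{dir}/#{name}.ind"
    @text = File.new("./#{dir}/#{name}.txt")
    @index = File.new("./#{dir}/#{name}.ind")
    @symbol = name
  end

  # Value of the "! delim" flag, or nil if the text does not set one.
  def delim
    @text.rewind # start from the top, whatever was read before
    @text.each { |l| return l[8, 16].strip if l.start_with?('! delim ') }
    nil
  end

  # Value of the "! name" flag, or nil if the text does not set one.
  def name
    @text.rewind
    @text.each { |l| return l[7, 255].strip if l.start_with?('! name ') }
    nil
  end
end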
class Index
  # dir defaults to :texts so that Index.new(:kjv) works, as Text.new(:kjv) does.
  def initialize(name, dir = :texts)
    @name = name
    @dir = dir
    @text = File.new("./#{dir}/#{name}.txt")
    @longversion = "King James Version"
    @indexversion = 1
    @delim = self.delim
    @index = self.compile(self.index)
    self.write
  end

  def put
    print @index
  end

  def write
    File.open("#{@dir}/#{@name}.ind", "w") { |f| f << @index }
  end

  # Value of the "! delim" flag, or nil if the text does not set one.
  def delim
    @text.rewind
    @text.each { |l| return l[8, 10].strip if l.start_with?('! delim ') }
    nil
  end

  # Builds a hash of word => { :occ => " line,word ...", :freq => count }.
  def index
    index = Hash.new { |hash, key| hash[key] = { :occ => "", :freq => 0 } }
    @text.rewind # also resets lineno, so numbering starts at 1 again
    @text.each do |line|
      self.line(line, @text.lineno, index) unless line.start_with?('!', '>', '#')
    end
    index
  end

  # Indexes one verse line: drops the reference, lowercases, strips punctuation.
  def line(line, lineno, index)
    # Escape the delimiter and cut only up to its first occurrence.
    line.slice!(/\A.*?#{Regexp.escape(@delim)} */) if @delim
    line.downcase!
    line.delete!(".,:;()[]{}?!")
    line.split.each_with_index do |word, ind|
      entry = index[word.to_sym]
      entry[:occ] << " #{lineno},#{ind + 1}"
      entry[:freq] += 1
    end
  end

  def compile(index)
    out = "! BibleQE Index: #{@longversion}\n! version #{@indexversion}"
    index.sort_by { |word, _props| word }.each do |word, props|
      out << "\n#{word} #{props[:freq]}#{props[:occ]}"
    end
    out
  end
end
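class Search
  attr_accessor :matches

  def initialize(version, word)
    word = word.downcase # do not mutate the caller's string
    matches = []
    File.foreach("./texts/#{version}.ind") do |line|
      # Plain prefix comparison, so regex metacharacters in the query are harmless.
      next unless line.start_with?("#{word} ")
      # Skip <word> and <frequency>; "12,3".to_i is 12, the line number.
      line.split.drop(2).each { |ref| matches << ref.to_i - 1 }
      break
    end
    @matches = matches
  end
end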
class Result
  def initialize(version, query)
    @text = File.new("./texts/#{version}.txt")
    @words = query.split
    @matches = self.get(version, query)
    @count = @matches.count
  end

  # Zero-based line numbers of the verses that contain every word of the query.
  def get(version, query)
    per_word = query.split.map { |w| Search.new(version, w).matches.uniq }
    per_word.reduce(:&) || [] # intersection; [] for an empty query
  end

  def matches
    return "Nothing to be searched for!" if @words.empty?
    verse = @count == 1 ? "verse" : "verses"
    "Found #{@count} #{verse} matching: #{@words.join(", ")}"
  end

  def show
    @text.rewind # allow show to be called more than once
    verses = @text.readlines
    "\n" + @matches.map { |match| verses.fetch(match) }.join
  end
end
class TextNotFoundError < RuntimeError
end

if __FILE__ == $0
  # Index.new(:kjv)
  result = Result.new(:kjv, ARGV.join(" "))
  puts result.matches
  puts result.show
  # Text.new(:kjv)
end

BibleQE Index File Specification Version 1 R2

Brian Zick <brian@zickzickzick.com>
April 2012 

File format

The index should be plain text, encoded in UTF-8 where possible.

Overview

Each line is the index entry for a single word. It holds space-separated values: the WORD, then its FREQUENCY, and then each OCCURRENCE in the source file. Lines at the top of the file that begin with a ! are a special case: they are comment or flag lines. The only required flag is version.
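
As a rough illustration of the layout described above, one index line could be read back like this. This is only a sketch: `IndexEntry` and `parse_index_line` are hypothetical names introduced here, not part of BibleQE.

```ruby
# Hypothetical container for one parsed index line (not part of BibleQE).
IndexEntry = Struct.new(:word, :frequency, :occurrences)

# Parses "<word> <frequency> <lineno>,<wordno> ..." into an IndexEntry.
# Returns nil for comment/flag lines (starting with "!") and blank lines.
def parse_index_line(line)
  return nil if line.start_with?("!") || line.strip.empty?
  word, freq, *refs = line.split
  occurrences = refs.map { |ref| ref.split(",").map(&:to_i) }
  IndexEntry.new(word, freq.to_i, occurrences)
end

entry = parse_index_line("was 3 1,8 1,18 1,24")
puts entry.word        # was
puts entry.frequency   # 3
p entry.occurrences    # [[1, 8], [1, 18], [1, 24]]
```

Each occurrence comes back as a `[lineno, wordno]` pair, both counted from 1 as the Numbering section requires.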

Numbering

All counting starts at 1.

Comments / Flags

All lines starting with ! are ignored, and can be used for any type of comments needed. This should be most useful in the case of a header.

Flags are special markers BibleQE looks for to tell it to process the file differently. Right now the only flag is version.
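
A reader of the index might check the version flag before parsing anything else. The following is a minimal sketch, assuming the flag line has the form `! version N` as in the Example section; `index_version` is a hypothetical helper, not something BibleQE defines.

```ruby
# Hypothetical helper (not part of BibleQE): returns the number from the
# first "! version N" line, or nil when no such flag line is present.
def index_version(lines)
  lines.each do |line|
    match = line.match(/\A! version (\d+)/)
    return match[1].to_i if match
  end
  nil
end

header = ["! BibleQE Index: NIV test", "! version 1", "at 1 1,10"]
puts index_version(header)   # prints 1
```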

Word

When indexing, a word could be any group of glyphs not containing whitespace. In the current implementation, however, the glyphs .,;:()?! are stripped out: they are considered sentence punctuation, not part of any word.

Index

Each (non-comment) indexing line uses this format: <word> <frequency> <lineno>,<wordno> ...

See the Example section below.

Case-insensitive

Words are made case-insensitive by converting them to lowercase before indexing.
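
The Word and Case-insensitive rules can be sketched together in a few lines. This assumes the punctuation set listed in the Word section; `STRIPPED` and `words_of` are hypothetical names used only for illustration.

```ruby
# Punctuation listed in the Word section; never part of a word.
STRIPPED = ".,;:()?!"

# Hypothetical helper (not part of BibleQE): lowercases a verse, removes
# the stripped glyphs, and splits what is left on whitespace.
def words_of(verse)
  verse.downcase.delete(STRIPPED).split
end

p words_of("Marvel not that I said unto thee, Ye must be born again.")
```

Note that an apostrophe is not in the stripped set, so a word such as `god's` stays intact.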

Example

Given the following verse:

"By faith we understand that the universe was formed at God's command, so that what is seen was not made out of what was visible."

BibleQE generates this index:

! BibleQE Index: NIV test
! version 1
at 1 1,10 
by 1 1,1 
command 1 1,12 
faith 1 1,2 
formed 1 1,9 
god's 1 1,11 
is 1 1,16 
made 1 1,20 
not 1 1,19 
of 1 1,22 
out 1 1,21 
seen 1 1,17 
so 1 1,13 
that 2 1,5 1,14 
the 1 1,6 
understand 1 1,4 
universe 1 1,7 
visible 1 1,25 
was 3 1,8 1,18 1,24 
we 1 1,3 
what 2 1,15 1,23 
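
The word lines of the example above can be reproduced with a short script. This is a sketch under the rules of this specification, not BibleQE's own indexer: `build_index` is a hypothetical helper, it takes verses that already have no reference prefix, and it does not emit the `!` header lines.

```ruby
VERSE = "By faith we understand that the universe was formed at God's command, " \
        "so that what is seen was not made out of what was visible."

# Hypothetical helper: returns sorted index lines for an array of verses,
# one verse per source line.
def build_index(verses)
  index = Hash.new { |hash, word| hash[word] = [] }
  verses.each_with_index do |verse, i|
    verse.downcase.delete(".,;:()?!").split.each_with_index do |word, j|
      index[word] << "#{i + 1},#{j + 1}" # all counting starts at 1
    end
  end
  index.sort.map { |word, refs| "#{word} #{refs.size} #{refs.join(' ')}" }
end

puts build_index([VERSE])
```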
require "./bibleqe"

describe Text do
  it "has a text" do
    kjv = Text.new(:kjv)
    kjv.text.is_a?(File).should == true
  end

  it "has an index" do
    kjv = Text.new(:kjv)
    kjv.index.is_a?(File).should == true
  end

  it "has a delimiter" do
    kjv = Text.new(:kjv)
    kjv.delim.should == "::"
  end

  it "has a name" do
    kjv = Text.new(:kjv)
    kjv.name.should == "King James Version"
  end

  it "has a symbol" do
    kjv = Text.new(:kjv)
    kjv.symbol.should == :kjv
  end
end

describe "the search function" do
  it "finds one word" do
    query = Result.new(:kjv, "Jesus")
    query.matches.should == "Found 5 verses matching: Jesus"
  end

  it "finds another word" do
    query = Result.new(:kjv, "John")
    query.matches.should == "Found 4 verses matching: John"
  end

  it "finds two words" do
    query = Result.new(:kjv, "Jesus came")
    query.matches.should == "Found 2 verses matching: Jesus, came"
  end

  it "finds 0 matches for a word" do
    query = Result.new(:kjv, "sentinel")
    query.matches.should == "Found 0 verses matching: sentinel"
  end

  it "cannot search for empty string" do
    query = Result.new(:kjv, "")
    query.matches.should == "Nothing to be searched for!"
  end
end
# list: print matching verse references
# show: print matching verses
# matches: print number of matches

BibleQE Text (File for indexing or searching) Specification Version 1 R1

Brian Zick <brian@zickzickzick.com>
April 2012 

File format

The text should be plain text, encoded in UTF-8 where possible.

Overview

Each line is considered a single verse, except for lines starting with any of the following characters: ! # >

A reference (or short reference) begins each line, separated from the verse by a delimiter. The delimiter can be set by the text file creator using the delim flag. Anything not occurring in the text can be used: a single Unicode character, or any other character or grouping, like: =>, ::, =, ...
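
Separating the reference from the verse might look like the sketch below. `split_verse` is a hypothetical helper, and the default delimiter `::` is an assumption taken from the example text.

```ruby
# Hypothetical helper (not part of BibleQE): splits a verse line into
# [reference, verse] at the first occurrence of the delimiter.
def split_verse(line, delim = "::")
  reference, verse = line.split(delim, 2)
  [reference.to_s.strip, verse.to_s.strip]
end

ref, verse = split_verse("Jhn 3:7 :: Marvel not that I said unto thee, Ye must be born again.")
puts ref     # Jhn 3:7
puts verse   # Marvel not that I said unto thee, Ye must be born again.
```

Splitting with a limit of 2 keeps any later occurrence of the delimiter inside the verse text, which is why the delimiter only needs to be absent from the reference part for this to work.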

Paragraphs are also allowed.

Comments

Lines starting with ! are comments. They can occur anywhere in the text.

Flags

Some special comments act as setting flags. These are:

  1. version -- text file version, currently version 1
  2. delim -- the delimiter between the reference and the verse, I prefer the double colon - ::
  3. strip -- characters to strip from the text before indexing, right now: .,:;()[]{}?! (Note that this is not available in the present implementation and will be ignored)
  4. name -- The full name of the text (where the filename is the abbreviation)
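
Reading these flags out of a text could be sketched as follows. `KNOWN_FLAGS` and `read_flags` are hypothetical names, and the sketch assumes every flag line has the form `! <flag> <value>`; other comment lines are skipped.

```ruby
# The four flag names listed above.
KNOWN_FLAGS = %w[version delim strip name].freeze

# Hypothetical helper (not part of BibleQE): collects flag lines into a
# Hash such as { version: "1", delim: "::" }. Values are kept as strings.
def read_flags(lines)
  lines.each_with_object({}) do |line, flags|
    match = line.match(/\A! (\w+) (.*)/)
    next unless match && KNOWN_FLAGS.include?(match[1])
    flags[match[1].to_sym] = match[2].strip
  end
end

header = [
  "! BibleQE test text",
  "! version 1",
  "! delim ::",
  "! strip .,:;()[]{}?!",
  "# Example Heading"
]
p read_flags(header)
```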

Paragraphs

You can mark a paragraph in the text by placing the ¶ mark at the beginning of the verse the paragraph begins on.
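
Grouping verses into paragraphs might be done as in the sketch below. It assumes the mark is ¶, as used in the example text, and that the reference has already been removed from each verse; `PARAGRAPH_MARK` and `paragraphs` are hypothetical names.

```ruby
# Assumed paragraph mark, taken from the example text.
PARAGRAPH_MARK = "¶"

# Hypothetical helper (not part of BibleQE): a verse starting with the
# mark opens a new paragraph; the mark itself is dropped from the output.
def paragraphs(verses)
  verses
    .slice_before { |verse| verse.start_with?(PARAGRAPH_MARK) }
    .map { |group| group.map { |verse| verse.delete_prefix(PARAGRAPH_MARK).strip } }
end

p paragraphs(["¶ There was a man", "The same came to Jesus", "¶ After these things"])
```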

Example

! BibleQE test text
! version 1
! delim ::
! strip .,:;()[]{}?!
# Example Heading
Jhn 3:1 :: ¶ There was a man of the Pharisees, named Nicodemus, a ruler of the Jews:
Jhn 3:2 :: The same came to Jesus by night, and said unto him, Rabbi, we know that thou art a teacher come from God: for no man can do these miracles that thou doest, except God be with him.
> Example sidenote.
Jhn 3:3 :: Jesus answered and said unto him, Verily, verily, I say unto thee, Except a man be born again, he cannot see the kingdom of God.
> Note 2.
Jhn 3:4 :: Nicodemus saith unto him, How can a man be born when he is old? can he enter the second time into his mother's womb, and be born?
Jhn 3:5 :: Jesus answered, Verily, verily, I say unto thee, Except a man be born of water and [of] the Spirit, he cannot enter into the kingdom of God.
Jhn 3:6 :: That which is born of the flesh is flesh; and that which is born of the Spirit is spirit.
Jhn 3:7 :: Marvel not that I said unto thee, Ye must be born again.
! version 1
! name King James Version
! delim ::
! strip .,:;()[]{}?!
# Example Heading
Jhn 3:1 :: ¶ There was a man of the Pharisees, named Nicodemus, a ruler of the Jews:
Jhn 3:2 :: The same came to Jesus by night, and said unto him, Rabbi, we know that thou art a teacher come from God: for no man can do these miracles that thou doest, except God be with him.
> Example sidenote.
Jhn 3:3 :: Jesus answered and said unto him, Verily, verily, I say unto thee, Except a man be born again, he cannot see the kingdom of God.
> Note 2.
Jhn 3:4 :: Nicodemus saith unto him, How can a man be born when he is old? can he enter the second time into his mother's womb, and be born?
Jhn 3:5 :: Jesus answered, Verily, verily, I say unto thee, Except a man be born of water and [of] the Spirit, he cannot enter into the kingdom of God.
Jhn 3:6 :: That which is born of the flesh is flesh; and that which is born of the Spirit is spirit.
Jhn 3:7 :: Marvel not that I said unto thee, Ye must be born again.
Jhn 3:8 :: The wind bloweth where it listeth, and thou hearest the sound thereof, but canst not tell whence it cometh, and whither it goeth: so is every one that is born of the Spirit.
Jhn 3:9 :: Nicodemus answered and said unto him, How can these things be?
Jhn 3:10 :: Jesus answered and said unto him, Art thou a master of Israel, and knowest not these things?
Jhn 3:11 :: Verily, verily, I say unto thee, We speak that we do know, and testify that we have seen; and ye receive not our witness.
Jhn 3:12 :: If I have told you earthly things, and ye believe not, how shall ye believe, if I tell you [of] heavenly things?
Jhn 3:13 :: And no man hath ascended up to heaven, but he that came down from heaven, [even] the Son of man which is in heaven.
Jhn 3:14 :: And as Moses lifted up the serpent in the wilderness, even so must the Son of man be lifted up:
Jhn 3:15 :: That whosoever believeth in him should not perish, but have eternal life.
Jhn 3:16 :: For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
Jhn 3:17 :: For God sent not his Son into the world to condemn the world; but that the world through him might be saved.
Jhn 3:18 :: He that believeth on him is not condemned: but he that believeth not is condemned already, because he hath not believed in the name of the only begotten Son of God.
Jhn 3:19 :: And this is the condemnation, that light is come into the world, and men loved darkness rather than light, because their deeds were evil.
Jhn 3:20 :: For every one that doeth evil hateth the light, neither cometh to the light, lest his deeds should be reproved.
Jhn 3:21 :: But he that doeth truth cometh to the light, that his deeds may be made manifest, that they are wrought in God.
Jhn 3:22 :: ¶ After these things came Jesus and his disciples into the land of Judaea; and there he tarried with them, and baptized.
Jhn 3:23 :: And John also was baptizing in Aenon near to Salim, because there was much water there: and they came, and were baptized.
Jhn 3:24 :: For John was not yet cast into prison.
Jhn 3:25 :: Then there arose a question between [some] of John's disciples and the Jews about purifying.
Jhn 3:26 :: And they came unto John, and said unto him, Rabbi, he that was with thee beyond Jordan, to whom thou barest witness, behold, the same baptizeth, and all [men] come to him.
Jhn 3:27 :: John answered and said, A man can receive nothing, except it be given him from heaven.
Jhn 3:28 :: Ye yourselves bear me witness, that I said, I am not the Christ, but that I am sent before him.
Jhn 3:29 :: He that hath the bride is the bridegroom: but the friend of the bridegroom, which standeth and heareth him, rejoiceth greatly because of the bridegroom's voice: this my joy therefore is fulfilled.
Jhn 3:30 :: He must increase, but I [must] decrease.
Jhn 3:31 :: He that cometh from above is above all: he that is of the earth is earthly, and speaketh of the earth: he that cometh from heaven is above all.
Jhn 3:32 :: And what he hath seen and heard, that he testifieth; and no man receiveth his testimony.
Jhn 3:33 :: He that hath received his testimony hath set to his seal that God is true.
Jhn 3:34 :: For he whom God hath sent speaketh the words of God: for God giveth not the Spirit by measure [unto him].
Jhn 3:35 :: The Father loveth the Son, and hath given all things into his hand.
Jhn 3:36 :: He that believeth on the Son hath everlasting life: and he that believeth not the Son shall not see life; but the wrath of God abideth on him.
isa
http://www.scripture4all.org/ISA2_preview/ISA2_preview_6a.html
biblos interlinear
http://interlinearbible.org/genesis/1.htm
agrep
https://en.wikipedia.org/wiki/Agrep
http://www.tgries.de/agrep/
http://www.cs.sunysb.edu/~algorith/files/approximate-pattern-matching.shtml
webglimpse
http://webglimpse.net/
fuzzy matching
https://en.wikipedia.org/wiki/Approximate_string_matching
phusion passenger (for ruby web server)
https://www.phusionpassenger.com/download/#open_source
ruby quickstart
http://guides.rubyonrails.org/getting_started.html
Finished:
[x] Use a multi-line input, with one verse per line.
[x] Create an index (as an array first) capturing word frequencies
[x] In the index, catch the line number + word number (in verse)
[x] Ignore comment lines when creating index
[x] Use a "label" in the source on each line that tells the book, verse, chapter (ignore label)
[x] Basic search functions
[x] Allow sidenotes (marked with >) and headings (marked with #) in the text
[x] Handle the "Found x matches for ..." line
[x] Handle results with no matches
[x] Multi-word search
[x] Allow commandline parameters for search
[x] Full text name set in the text
Towards a web release:
[ ] Add full KJV text
[ ] Web gateway
Later:
[ ] List references
[ ] Allow ignored characters (i.e. .,:;()[]{}?!) to be set in the text (via `strip` flag)
[ ] Allow the File and Index classes to take a text from somewhere other than a file already on the system and create the system files
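
For the `strip` item above, one possible approach is sketched here. `strip_chars` is a hypothetical helper; it uses `Regexp.union` rather than `String#delete` because `-`, `^` and `\` have special meaning in a `String#delete` argument and a user-supplied flag value could contain them.

```ruby
# Hypothetical helper (not part of BibleQE): removes every character that
# appears in the value of a `strip` flag. Regexp.union escapes each
# character, so none of them is treated as a pattern metacharacter.
def strip_chars(verse, strip)
  return verse if strip.empty?
  verse.gsub(Regexp.union(strip.chars), "")
end

puts strip_chars("Jesus answered, Verily, verily, [of] the Spirit.", ".,:;()[]{}?!")
```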
Questions:
[?] When indexing, will it be faster to run `downcase!` + `tr!` on the whole input vs. line by line?
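
One way to answer this question is to measure it with Ruby's standard `Benchmark` module. The sketch below uses synthetic input (an assumption standing in for a real text file), uses `delete` as the current indexer does rather than `tr!`, and skips the step of cutting off the reference; `normalize_per_line` and `normalize_whole` are hypothetical names.

```ruby
require "benchmark"

# Synthetic input: the same verse repeated, standing in for a full text.
LINES = Array.new(20_000) { "Jhn 3:7 :: Marvel not that I said unto thee, Ye must be born again.\n" }
TEXT  = LINES.join

STRIP = ".,:;()[]{}?!"

# Normalizes each line separately, as the indexer does today.
def normalize_per_line(lines)
  lines.map { |line| line.downcase.delete(STRIP) }
end

# Normalizes the whole input in one pass, then splits it back into lines.
def normalize_whole(text)
  text.downcase.delete(STRIP).lines
end

Benchmark.bm(10) do |bm|
  bm.report("per line:") { normalize_per_line(LINES) }
  bm.report("whole:")    { normalize_whole(TEXT) }
end
```

Both variants produce the same lines, so the timings compare like with like. Results will depend on the Ruby version and the size of the text, so it is worth rerunning with the real file.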