@dainiusjocas
Last active March 24, 2021 18:54
Ruby Percolator based on Lucene Grep example

Clone this gist:

git clone https://gist.github.com/dainiusjocas/d6b5757c17a055f498f90030482587e5
cd d6b5757c17a055f498f90030482587e5

Download the binary for your platform from https://github.com/dainiusjocas/lucene-grep/releases/tag/v2021.03.24 into this directory.

Extract the binary, e.g. unzip lmgrep*

Then run the example:

ruby ruby-lmgrep.rb

The output should be similar to:

Given the Percolator dictionary: [{"query"=>"jump"}, {"query"=>"\"quick fox\"~2^3"}]

Checks if the text matches:
Percolator on text 'The quick brown fox jumps over the lazy dog' matches: true, in: 0.0083872s
Percolator on text 'not matching' matches: false, in: 0.000291839s

>>>The matches in are returned<<<
Percolator on text 'The quick brown fox jumps over the lazy dog' matched: '{"line-number"=>3, "line"=>"The quick brown fox jumps over the lazy dog", "score"=>0.6384387910366058, "highlights"=>[{"type"=>"QUERY", "dict-entry-id"=>"1", "meta"=>{}, "score"=>0.13076457, "begin-offset"=>20, "end-offset"=>25, "query"=>"jump"}, {"type"=>"QUERY", "dict-entry-id"=>"2", "meta"=>{}, "score"=>0.5076742, "begin-offset"=>4, "end-offset"=>19, "query"=>"\"quick fox\"~2^3"}]}' in: 0.000528724s
Percolator on text 'not matching' matched: '' in: 0.000288002s
>>>Percolator is closed.<<<

The trick here is that the percolator does stemming: the query "jump" matches the word "jumps" in the text.
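
To see the stemming at work, the highlight offsets from the sample output above can be applied to the input text. This is a minimal sketch, assuming `end-offset` is exclusive (the sample values suggest so); the `highlights` array is copied by hand from that output and trimmed to the fields used here:

```ruby
text = "The quick brown fox jumps over the lazy dog"

# Copied from the sample output above, reduced to the fields used here.
highlights = [
  { "query" => "jump", "begin-offset" => 20, "end-offset" => 25 },
  { "query" => "\"quick fox\"~2^3", "begin-offset" => 4, "end-offset" => 19 }
]

# Slice the text with each highlight's offsets (end-offset treated as exclusive).
matched_fragments = highlights.map do |h|
  [h["query"], text[h["begin-offset"]...h["end-offset"]]]
end

matched_fragments.each do |query, fragment|
  puts "#{query} -> #{fragment}"
end
```

The query "jump" maps to the fragment "jumps", and the phrase query with a slop of 2 maps to "quick brown fox".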

Also, the percolator supports additive scoring of matching query clauses: in the output above, the line score is the sum of the two highlight scores.
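
The additive scoring can be checked against the numbers in the sample output above. A small sketch, with the values copied by hand from that output; the tolerance of `1e-6` is an arbitrary choice, and the tiny difference presumably comes from rounding of the per-highlight scores (an assumption, not something lmgrep documents here):

```ruby
# Values copied from the sample output above.
line_score = 0.6384387910366058
highlight_scores = [0.13076457, 0.5076742]

sum = highlight_scores.sum

# Compare with a tolerance, since the per-highlight scores are printed with fewer digits.
scores_add_up = (sum - line_score).abs < 1e-6

puts "sum of highlight scores: #{sum}"
puts "line score:              #{line_score}"
puts "scores add up:           #{scores_add_up}"
```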

Cheers!

[
  {
    "query": "jump"
  },
  {
    "query": "\"quick fox\"~2^3"
  }
]
require 'open3'
require 'timeout'
require 'json'
class Percolator
  def initialize(dictionary_file_path, lmgrep_path='lmgrep', params='', timeout=1)
    @timeout = timeout
    command = "#{lmgrep_path} --queries-file=#{dictionary_file_path} #{params} --format=json --with-empty-lines --with-details --with-scored-highlights"
    @stdin, @stdout, @stderr, @wait_thr = Open3.popen3(command)
    # prevent leaking file descriptors
    ObjectSpace.define_finalizer(self, Proc.new do
      close
      puts ">>>Percolator is closed.<<<"
    end)
  end

  def close
    @stdin.close
    @stdout.close
    @stderr.close
  end

  # Returns true if the text matches any query, false otherwise
  def matches?(text)
    # If we got any non-blank output then the text matches some query
    !percolate(text).strip.empty?
  end

  # Returns the parsed JSON output of lmgrep, or nil if nothing matched
  def match(text)
    output = percolate(text)
    return nil if output.strip.empty?
    JSON.parse(output)
  end

  private

  # Sends the text to lmgrep and reads one line of its output.
  # lmgrep works like this:
  # - if an input doesn't match any query then an empty line is written to stdout
  # - if there is a match then some JSON output is returned
  # The percolation is timeout bound.
  # If it times out, stop the percolator and
  # - raise an exception
  # - log the output
  # - create an issue here https://github.com/dainiusjocas/lucene-grep/issues
  def percolate(text)
    @stdin.puts text
    output = Timeout.timeout(@timeout) { @stdout.gets }
    # gets returns nil at end of file, i.e. when lmgrep has exited
    raise "lmgrep closed its output" if output.nil?
    output
  rescue Timeout::Error
    puts "Percolation failed on '#{text}'"
    raise "Percolation timed-out!!!"
  end
end

# Times a boolean match check and prints the result
def percolate_with_bench(percolator, text)
  start = Time.now
  matched = percolator.matches? text
  finish = Time.now
  puts "Percolator on text '#{text}' matches: #{matched}, in: #{finish - start}s"
end

# Times a full match (parsed lmgrep output) and prints the result
def percolate_for_match_with_bench(percolator, text)
  start = Time.now
  matched = percolator.match text
  finish = Time.now
  puts "Percolator on text '#{text}' matched: '#{matched}' in: #{finish - start}s"
end

dictionary_file_path = 'queries.json'
percolator = Percolator.new(dictionary_file_path)
puts "Given the Percolator dictionary: #{JSON.parse(File.read(dictionary_file_path))}"
matching_text = "The quick brown fox jumps over the lazy dog"
non_matching_text = "not matching"
puts
puts "Checks if the text matches:"
percolate_with_bench(percolator, matching_text)
percolate_with_bench(percolator, non_matching_text)
puts
puts ">>>The matches in are returned<<<"
percolate_for_match_with_bench(percolator, matching_text)
percolate_for_match_with_bench(percolator, non_matching_text)
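
The script above relies on a finalizer to close the pipes. As an alternative, the sketch below shows one way to close them deterministically. Both `with_closing` and `FakePercolator` are hypothetical names that are not part of the gist; the fake class only stands in for `Percolator` so that the example runs without lmgrep installed:

```ruby
# Hypothetical helper, not part of the original gist: yields the resource and
# always calls #close afterwards, even if the block raises.
def with_closing(resource)
  yield resource
ensure
  resource.close
end

# Stand-in for Percolator so the sketch runs without the lmgrep binary.
class FakePercolator
  attr_reader :closed

  def initialize
    @closed = false
  end

  # Naive substring check; the real class delegates to lmgrep instead.
  def matches?(text)
    text.include?("fox")
  end

  def close
    @closed = true
  end
end

fake = FakePercolator.new
result = with_closing(fake) { |p| p.matches?("The quick brown fox") }
puts "matched: #{result}, closed: #{fake.closed}"
```

With the real class this would presumably look like `with_closing(Percolator.new('queries.json')) { |p| p.matches?(text) }`.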