@nz
Last active December 11, 2015 07:38
Brainstorming modeling ideas for a third generation of Sunspot's architecture.

Sunspot v3

Incomplete brain-dump in progress. Thoughts and feedback welcome via Twitter (@nz_) or IM.

Sunspot is in a state of second-system syndrome. It was constructed with a solid set of highly abstracted constructs to represent its DSL and eventually translate those operations into Solr operations. I hypothesize that Sunspot's design would become more flexible and intuitive if modeled on Solr's own concepts.

As a thought experiment: explaining Sunspot's current design would make one a better programmer, with deeper knowledge of the many techniques and object-oriented design abstractions involved in building a DSL for a Rails application. But that explanation would not necessarily lend itself to a deeper understanding of Solr.

On the other hand, explaining Solr's concepts ought to lend an intuitive understanding of Sunspot's own architecture. Implementation work to support a DSL should be built on top of those concepts in order to interface with the syntax and concepts that current Sunspot users find valuable.

I'll lay out some modeling ideas here that represent certain search (and later, update) concepts within Solr (and Elasticsearch), along with some syntax ideas for usage and how that might all tie together for implementation within the DSL.

My thinking is that this should help Sunspot more accurately represent search concepts, lending itself to an easier intuitive shared understanding of the problem domain.

Particularly large deviations from current Sunspot concepts (e.g., the filter cache and cost behaviors introduced in Solr 3.4) are described more thoroughly.


At the moment, I'm mostly focusing on the abstract modeling of a search request, handling its request and response lifecycle, and exposing those concepts through a minimally enhanced DSL. Subjects not mentioned here are issues I'm generally okay with, though I'll eventually have some thoughts on updates (esp. batching and queueing), low-level Elasticsearch and Solr adapter implementations, ORM adapters (better Mongo support), and pluggable extensibility (Solr Cell).


Modeling

  • Sunspot::Search
    • Request
    • Response
    • Query
    • Filter
    • Facet
    • Highlight
    • Document

Search

Sunspot::Search

Search Request

Sunspot::Search::Request

Instructions on what to search

Query

Sunspot::Search::Query

  • terms
  • parser
  • fields
  • minimum match
Filter

Sunspot::Search::Filter

  • field
  • terms or function
  • cache
  • cost

Filters focus the main keyword search on specific subsets of your index, based on your filter queries. In many cases, your filter queries are executed against the entire index, in parallel with each other and the main query. The results of these pre-filters are then intersected to find the final set of documents. More expensive post-filters can then be applied in sequence against this resulting set of documents.

In Sunspot, the default filter behavior is to run in parallel with the main query, without attempting to cache the results of the filter query. This is a safe default that works well for any filter whose query or value is volatile, or otherwise large and unpredictable.

with(id: current_user.group_ids)
with(:created_at).greater_than(1.week.ago)

Filters which are likely to be reused between many queries may be cached, using a bit of system memory up front to save on CPU and Disk IO time in later queries. Cached filters are run in advance of the main query. Their cache key is the full query string, at times interpolated within Solr (e.g., date and time operations are expanded to per-second precision).

# search documents marked as published
with(state: 'published').cache(true)

# restrict the search to documents with all the selected tags
params[:tags].each do |tag|
  with(tag: tag).cache(true)
end

# find objects created this week, rounded to 24 hours for cacheability
with(created_at: [ '(NOW-1WEEK)/1DAY', '(NOW+1WEEK)/1DAY' ]).cache(true)

A filter that is expensive and unlikely to be reusable can disable caching and specify a cost relative to other filters. These filters will be executed after the main query in the order of ascending cost, and are applied directly to the documents that matched the main query and pre-filters, rather than the entire index.

Specifying a cost will automatically set cache: false.

with(:location).near(current_user.location).cost(1)
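Presumably, these cache and cost controls would be emitted as Solr local params on each fq. Here's a hedged sketch of what the generated parameters might look like (the spatial example assumes a geofilt-style query; the pt and d values are placeholders, not part of the proposal):

```
# with(state: 'published').cache(true)
fq={!cache=true}state:published

# with(:location).near(current_user.location).cost(1)
fq={!geofilt cache=false cost=1 sfield=location pt=45.15,-93.85 d=5}
```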

Instructions on how to prepare the response

Facet
  • field or query
  • other options
Spellcheck
  • dictionary field
  • other options
Sorting & pagination

Search Response

  • query time
  • partial results
  • number of results

Documents

Raw documents returned from Solr and loaded into a Mash.

Results

An ActiveRecord scope: where(id: document_ids)

Facets

A mashup of the requested facets, their values from the response, and an indication of whether a particular facet field and value combination was selected in the query (i.e., present in the filters).
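One plausible way to compute that selected state is to cross-reference the facet's field (and each value) against the request's filter list. A sketch with assumed names (FilterStub and the accessors here are stand-ins, not settled API):

```ruby
# Hypothetical: deriving a facet's selected state from the request's filters.
# FilterStub stands in for whatever filter object exposes field/value.
FilterStub = Struct.new(:field, :value)

class Facet
  def initialize(field, filters)
    @field, @filters = field, filters
  end

  # true when any filter targets this facet's field
  def selected?
    @filters.any? { |f| f.field == @field }
  end

  # true when some filter pins this field to the given value
  def value_selected?(value)
    @filters.any? { |f| f.field == @field && f.value == value }
  end
end

filters = [ FilterStub.new(:category, "Gardening") ]
Facet.new(:category, filters).selected?                    # => true
Facet.new(:category, filters).value_selected?("Gardening") # => true
Facet.new(:category, filters).value_selected?("Cooking")   # => false
```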


Usage examples

search = Article.search do
  query "foo bar"
  facet :category
  page 2
  per_page 10
end

search.facets.each do |facet|
  facet.field
  facet.selected? # => whether this field is in the filter list
  facet.values.each do |facet_value|
    facet_value.to_s
    facet_value.count
    facet_value.selected? # => whether this field and its value are in the filter list
  end
end

search.spellcheck.suggestions.each do |suggestion|
  suggestion.to_s
  suggestion.count
end

search.highlights.each do |hit, field|
end

search.results # => ActiveRecord scope: where(id: [])
search.results.each do |result|
  result # => ActiveRecord object
  search.highlights_for(result) # => cross reference with highlights list
end

# Under the hood stuff

Sunspot.search(*args) == Sunspot::Search.new(*args)
Model.search == Sunspot.search(Model) == Sunspot::Search.new(Model)

search = Sunspot::Search.new(*Searchables)
request = search.request
request.query.terms  = "foo bar"
request.query.parser = "edismax" # default
request.query.fields = [ :title, :body ] # introspect the class
request.execute

search.facets.each do |facet|
  facet.field # => e.g., "category"
  
  # facet values populated from the response object
  facet.values.each do |facet_value|
    facet_value.to_s # => e.g., "Gardening"
    facet_value.count # => e.g., 10
  end
end
###
#
# Here is some pseudo-Ruby example usage to explore the redesign of Sunspot's back-end.
# This is a highly exploded view of what might be going on under the hood of the DSL.
# It's not necessarily intended to be directly used by developers, but mostly invoked
# by the `search` method DSL that we provide now.
#
###
# In Solr, we talk to a Server. Maybe to a load balancer. All we care about is the Index.
# In Elasticsearch, we talk to a Cluster, which may have cluster-level operations.
cluster = Cluster.new("http://localhost:8983/")
# The Index is the fundamental querying endpoint for Solr. It's a primary endpoint
# for Elasticsearch, but not the only one.
index = Index.new("test", cluster)
# A search in Solr must be executed against an index. Really all we care about here
# for Solr is the URL. *Maybe* we get into CoreAdmin operations like aliasing.
# For Elasticsearch, we should be able to create, delete and configure this index.
search = Search.new(index)
# A search is broken down into a Request and a Response. The Search itself
# is an abstraction that blends the two. In general, we will use getters and
# setters on the Search instance itself, which will then interact with its own
# dependents.
#
# request = search.request
# response = search.response
# Represent the main full-text query. We have the query terms, query parser, and
# for dismax-based query parsers, we have the fields and the minimum match.
#
# Generates: q=hello world&qf=title body&defType=edismax&mm=0.5
#
search.query = Query.new(
  terms: "hello world",
  parser: "edismax",
  fields: [ :title, :body ],
  match: 0.5 # 50% of terms are required to match
)
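To make that "Generates:" comment concrete, here's a minimal sketch of how such a Query object might serialize itself into Solr parameters. The method name to_params and the defaults are assumptions, not settled API:

```ruby
# Hypothetical Query#to_params: maps the query model onto Solr's
# q / defType / qf / mm parameters.
class Query
  def initialize(terms:, parser: "edismax", fields: [], match: nil)
    @terms, @parser, @fields, @match = terms, parser, fields, match
  end

  def to_params
    params = { q: @terms, defType: @parser }
    params[:qf] = @fields.join(" ") unless @fields.empty? # qf is space-separated
    params[:mm] = @match unless @match.nil?
    params
  end
end

Query.new(terms: "hello world", fields: [ :title, :body ], match: 0.5).to_params
# => { q: "hello world", defType: "edismax", qf: "title body", mm: 0.5 }
```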
# Represent a filter query. A simple filter query is comprised of the field name,
# and the range, term or function query to apply. Lastly, we have cache and cost controls.
#
# fq=category:sunshine
search.filters << Filter.new(
  :category,   # field name
  "sunshine",  # term, range, or function query
  cache: true  # cache: boolean, or cost: int
)
# fq=created_at:[(NOW-1WEEK)/1DAY TO (NOW+1DAY)/1DAY]
search.filters << Filter.new(
  :created_at,
  [ "(NOW-1WEEK)/1DAY", "(NOW+1DAY)/1DAY" ],
  cache: true
)
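A minimal sketch of how such a Filter might render itself into an fq string, with cache and cost emitted as Solr local params. The to_param name and the local-param emission are assumptions for illustration:

```ruby
# Hypothetical Filter#to_param: renders field + value into an fq string.
class Filter
  def initialize(field, value, cache: nil, cost: nil)
    @field, @value, @cost = field, value, cost
    @cache = cost ? false : cache # specifying a cost implies cache: false
  end

  def to_param
    local = []
    local << "cache=#{@cache}" unless @cache.nil?
    local << "cost=#{@cost}" if @cost
    prefix = local.empty? ? "" : "{!#{local.join(' ')}}"
    body = @value.is_a?(Array) ? "[#{@value.first} TO #{@value.last}]" : @value.to_s
    "#{prefix}#{@field}:#{body}"
  end
end

Filter.new(:category, "sunshine", cache: true).to_param
# => "{!cache=true}category:sunshine"
Filter.new(:created_at, [ "(NOW-1WEEK)/1DAY", "(NOW+1DAY)/1DAY" ], cache: true).to_param
# => "{!cache=true}created_at:[(NOW-1WEEK)/1DAY TO (NOW+1DAY)/1DAY]"
```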
###
# Now we shift from telling Solr what we want it to find,
# and into other data that we want it to respond with, or more generally,
# how we want it to respond.
###
# We need to tell Solr which fields to return in its results. I don't know
# if `field_list` or `select` or something else is the better name here.
# A sensible default is just enough to look up the object in ActiveRecord/Mongoid/etc.
search.field_list = [ :primary_key, :type ]
# Represent a facet query. A facet is comprised of the field whose values
# we want to enumerate, and some more options related to term frequencies and the like.
search.facets << Facet.new(:category)
# Sorting is a bad idea, but we'll let you go ahead and do that.
search.sort 'id desc'
# TCP connect timeout in ms
search.connect_timeout 100
# TCP read timeout in ms
search.read_timeout 10_000
# The amount of time in milliseconds we'll allow this query to run for.
search.time_allowed 1_000
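These knobs presumably split between Solr request parameters and HTTP client options. A hedged sketch of that mapping (timeAllowed is a real Solr parameter; the client option names are assumptions):

```ruby
# Hypothetical mapping of the timeout settings above.
solr_params = {
  timeAllowed: 1_000 # Solr stops collecting after 1s; may return partial results
}
http_options = {
  connect_timeout: 0.1, # 100ms TCP connect timeout, client-side
  read_timeout: 10.0    # 10s read timeout, client-side
}
```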
# Should probably add options to control retries with backoff,
# or just make that some kind of default.
# Include debugging info in the results. Useful somewhere for logging.
# search.debug true
###
# At this point, we should have all we need to execute a reasonably complete search.
###
# Do we need to explicitly execute the search? Or can we lazily invoke this elsewhere?
search.execute
# Implementation detail reminder: we model the request in a Request object,
# and the response in a Response object.
# Return the actual ActiveRecord result objects, based on the document IDs provided.
# Actually, we should prefer to build an ActiveRecord scope here: `where(id: result_ids)`.
# Not sure if that's appropriate for instantiating across many models.
search.results.each do |result|
  # return the highlighted hit(s) for this particular result.
  # need to refresh my memory on current syntax. is this an improvement?
  search.highlights_for(result)
end
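On the multi-model question, one option is to group raw response documents by type before building per-model scopes. A pure-Ruby sketch (the type and primary_key field names are assumptions):

```ruby
# Hypothetical: group response documents by model type, yielding per-model
# id lists that an ORM adapter could feed into where(id: ids).
documents = [
  { type: "Article", primary_key: 3 },
  { type: "Article", primary_key: 7 },
  { type: "Comment", primary_key: 2 }
]

ids_by_model = documents
  .group_by { |doc| doc[:type] }
  .transform_values { |docs| docs.map { |d| d[:primary_key] } }
# => { "Article" => [3, 7], "Comment" => [2] }
# Each entry maps onto a scope like Article.where(id: [3, 7]).
```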
# Inspect how long the query took to execute, in ms
search.qtime
# The number of matching documents. Please bikeshed this to a better color ;)
search.count
# Explore the facet response
search.facets.each do |facet|
  facet.field     # field name
  facet.selected? # does this facet appear in the filter list?
  facet.values.each do |facet_value|
    facet_value.to_s      # term -- is #to_s a code smell? dunno; value.value or value.name is worse.
    facet_value.count     # frequency
    facet_value.selected? # does this particular field:value pair appear in the filter list?
  end
end
# Explore spellcheck results
search.spelling_suggestions.each do |suggestion|
  # etc
end
###
# THE MAIN EVENT:
#
# How do we represent all the above with minimal backward incompatibility in the current DSLs?
# Where can we make backwards incompatible changes to improve usability?
#
# I won't ask for zero backwards incompatibility (though that *would* be nice!)
# so long as the changes are all very clear improvements. (And aren't they all!)
#
###
# Haven't dug into my thoughts on searchable quite yet.
# I like the syntax, but want to think about how we apply the concepts
# to Solr field naming conventions and Elasticsearch mapping ideas.
# Also we define some behaviors in here later used as defaults in searches.
# So how do we keep everything in the right scope relative to where it's needed?
class Article
  searchable do
    text :title
    text :body
    string :category
    string :tags
  end
end
@search = Article.search do
  keywords params[:q]
  facet :category
  highlight :title, with: "<em>$term</em>" # or a proc; or a better sentinel?
end
# facet - backwards compat
# see the smell? what's a "row" in solr?
@search.facet(:category_id).rows.each { |facet| ... }
# facet - new and shiny
@search.facets.each do |facet|
  facet.field     # => category_id
  facet.values    # => [ #<FacetValue:0x00...> ... ]
  facet.selected?
end
# highlight - backwards compat
# I guess I'm eliminating most 'hits' across the board.
# Call search.response.documents for that, I think.
@search.hits.each do |hit|
  hit.highlights(:title).each do |highlight|
    puts highlight.format { |word| "*#{word}*" }
  end
end
# highlight - new and shiny
# optional highlighter proc was defined in the search invocation
@search.results.each do |result|
  @search.highlights_for(result).each do |highlight|
    puts highlight # => "*Hello*, *world*!"
  end
end