Incomplete brain-dump in progress. Thoughts and feedback welcome via Twitter (@nz_) or IM.
Sunspot is in a state of second-system syndrome. It was constructed with a solid set of highly abstracted constructs to represent its DSL and eventually translate those operations into Solr operations. I hypothesize that Sunspot's design would become more flexible and intuitive if modeled on Solr's concepts.
As a thought experiment: Explaining Sunspot's current design would make one a better programmer, with deeper knowledge of many techniques and object-oriented design abstractions involved in building a DSL for a Rails application. This explanation would not necessarily lend itself to a deeper understanding of Solr.
On the other hand, explaining Solr's concepts ought to help lend an intuitive understanding to Sunspot's own architecture. Implementation work to support a DSL should be built on top of those concepts in order to interface with the syntax and concepts that current Sunspot users find valuable.
I'll lay out some modeling ideas here that represent certain Search (and later, Update) concepts within Solr (and Elasticsearch), along with some syntax ideas for their usage and how that might all tie together for implementation within the DSL.
My thinking is that this should help Sunspot more accurately represent search concepts, lending itself to an easier, more intuitive shared understanding of the problem domain.
Particularly large deviations from current Sunspot concepts (e.g., with filter behaviors introduced in Solr 3.4) are more thoroughly described.
At the moment, I'm mostly focusing on the abstract modeling of a search request, handling its request and response lifecycle, and exposing those concepts through a minimally enhanced DSL. Subjects not mentioned here are issues I'm generally okay with, though I'll eventually have some thoughts on updates (esp. batching and queueing), low-level Elasticsearch and Solr adapter implementations, ORM adapters (better mongo support), and pluggable extensibility (Solr Cell).
- Sunspot::Search
- Request
- Response
- Query
- Filter
- Facet
- Highlight
- Document
Sunspot::Search
Sunspot::Search::Request
Sunspot::Search::Query
- terms
- parser
- fields
- minimum match
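A minimal sketch of how these attributes might serialize into Solr's edismax parameters (the `Query` class and `to_params` method here are illustrative assumptions, not existing Sunspot API):

```ruby
# Hypothetical sketch: a Query value object that renders itself as
# Solr edismax request parameters. Names are illustrative only.
class Query
  attr_accessor :terms, :parser, :fields, :minimum_match

  def initialize
    @parser = "edismax" # mirror Solr's extended dismax parser by default
  end

  # Translate the abstract query into the request params Solr expects
  def to_params
    params = { q: terms, defType: parser }
    params[:qf] = fields.join(" ") if fields        # query fields
    params[:mm] = minimum_match if minimum_match    # minimum should match
    params
  end
end

query = Query.new
query.terms = "foo bar"
query.fields = [:title, :body]
query.minimum_match = "75%"
query.to_params
# => { q: "foo bar", defType: "edismax", qf: "title body", mm: "75%" }
```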
Sunspot::Search::Filter
- field
- terms or function
- cache
- cost
Filters focus the main keyword search onto specific subsets of your index. In many cases, your filter queries are executed against the entire index, in parallel with each other and the main query. The results of these pre-filters are then intersected to find the final set of documents. More expensive post-filters can then be applied in sequence against this resulting set of documents.
In Sunspot, the default filter behavior is to run in parallel with the main query, without attempting to cache the results. This is a safe default that works well for any filter whose query or value is volatile, or otherwise large and unpredictable.
with(id: current_user.group_ids)
with(:created_at).greater_than(1.week.ago)
Filters which are likely to be reused across many queries may be cached, using a bit of system memory up front to save on CPU and disk I/O in later queries. Cached filters are run in advance of the main query. Their cache key is the full filter query string, which Solr at times interpolates before caching (e.g., date and time operations are expanded to per-second precision, which can defeat caching unless the values are rounded).
# search documents marked as published
with(state: 'published').cache(true)
# restrict the search to documents with all the selected tags
params[:tags].each do |tag|
with(tag: tag).cache(true)
end
# find objects created this week, rounded to 24 hours for cacheability
with(created_at: [ 'NOW/DAY-7DAYS', 'NOW/DAY+1DAY' ]).cache(true)
A filter that is expensive and unlikely to be reusable can disable caching and specify a cost relative to other filters. These filters are executed after the main query, in ascending order of cost, and are applied directly to the documents that matched the main query and pre-filters, rather than to the entire index. Specifying a cost will automatically set cache: false.
with(:location).near(current_user.location).cost(1)
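To tie this back to Solr itself: the cache and cost settings above could conceivably be rendered as local params on each fq (filter query) parameter. A rough sketch, where the `Filter` class and `to_fq` method are assumptions for illustration:

```ruby
# Hypothetical sketch: rendering a filter's cache and cost settings as
# local params on a Solr fq parameter, e.g. fq={!cache=false cost=1}...
class Filter
  def initialize(query_fragment)
    @query_fragment = query_fragment
    @cached = false # mirror the proposed Sunspot default: no caching
    @cost = nil
  end

  def cache(flag)
    @cached = flag
    self
  end

  def cost(value)
    @cost = value
    @cached = false # specifying a cost automatically sets cache: false
    self
  end

  def to_fq
    locals = []
    locals << "cache=false" unless @cached
    locals << "cost=#{@cost}" if @cost
    prefix = locals.empty? ? "" : "{!#{locals.join(' ')}}"
    "#{prefix}#{@query_fragment}"
  end
end

Filter.new("state:published").cache(true).to_fq
# => "state:published"
Filter.new("location:[* TO *]").cost(1).to_fq
# => "{!cache=false cost=1}location:[* TO *]"
```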
Sunspot::Search::Facet
- field or query
- other options
Sunspot::Search::Spellcheck
- dictionary field
- other options
Sunspot::Search::Response
- query time
- partial results
- number of results
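These attributes map directly onto fields Solr already returns with every response: `QTime` and `partialResults` in the `responseHeader`, and `numFound` on the result set. A sketch of a thin wrapper, where all the Ruby names are assumptions:

```ruby
# Hypothetical sketch: wrapping the raw Solr JSON response to expose
# query time, the partial-results flag, and the total hit count.
class Response
  def initialize(raw)
    @raw = raw
  end

  def query_time
    @raw["responseHeader"]["QTime"] # milliseconds spent inside Solr
  end

  def partial_results?
    !!@raw["responseHeader"]["partialResults"] # e.g. set when timeAllowed expires
  end

  def total
    @raw["response"]["numFound"]
  end
end

raw = {
  "responseHeader" => { "status" => 0, "QTime" => 12 },
  "response" => { "numFound" => 42, "start" => 0, "docs" => [] }
}
response = Response.new(raw)
response.query_time       # => 12
response.partial_results? # => false
response.total            # => 42
```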
Raw documents returned from Solr and loaded into a Mash.
An ActiveRecord scope: where(id: document_ids)
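One wrinkle with `where(id: document_ids)`: the database generally returns rows in its own order, not Solr's relevance order, so the search object likely needs to reorder the loaded records. A sketch of that reordering, with plain hashes standing in for ActiveRecord objects:

```ruby
# Hypothetical sketch: restore Solr's relevance ordering after loading
# records from the database, which returns them in arbitrary order.
document_ids = [3, 1, 2]                    # ids in Solr score order
records = [{ id: 1 }, { id: 2 }, { id: 3 }] # what the database handed back

# Index the records by id, then walk the Solr ordering.
by_id = records.each_with_object({}) { |record, hash| hash[record[:id]] = record }
ordered = document_ids.map { |id| by_id[id] }

ordered.map { |record| record[:id] }
# => [3, 1, 2]
```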
A mashup of the requested facets, their values from the response, and an indication of whether a particular facet field and value combination was selected in the query (i.e., present in the filters).
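Computing that selected? flag could be as simple as checking each facet field/value pair against the set of active filters. A sketch, where `FacetValue` and the pair-based filter representation are assumptions:

```ruby
require 'set'

# Hypothetical sketch: a facet value that knows whether it is currently
# applied as a filter in the query that produced it.
class FacetValue
  attr_reader :count

  def initialize(field, value, count, active_filters)
    @field = field
    @value = value
    @count = count
    @active_filters = active_filters # Set of [field, value] pairs
  end

  def to_s
    @value
  end

  def selected?
    @active_filters.include?([@field, @value])
  end
end

active_filters = Set[[:category, "Gardening"]]
gardening = FacetValue.new(:category, "Gardening", 10, active_filters)
cooking   = FacetValue.new(:category, "Cooking", 4, active_filters)
gardening.selected? # => true
cooking.selected?   # => false
```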
search = Article.search do
query "foo bar"
facet :category
page 2
per_page 10
end
search.facets.each do |facet|
facet.field
facet.selected? # => whether this field is in the filter list
facet.values.each do |facet_value|
facet_value.to_s
facet_value.count
facet_value.selected? # => whether this field and its value are in the filter list
end
end
search.spellcheck.suggestions.each do |suggestion|
suggestion.to_s
suggestion.count
end
search.highlights.each do |hit, field|
end
search.results # => ActiveRecord scope: where(id: [])
search.results.each do |result|
result # => ActiveRecord object
search.highlights_for(result) # => cross reference with highlights list
end
# Under the hood stuff
Sunspot.search(*args) == Sunspot::Search.new(*args)
Model.search == Sunspot.search(Model) == Sunspot::Search.new(Model)
search = Sunspot::Search.new(*Searchables)
request = search.request
request.query.terms = "foo bar"
request.query.parser = "edismax" # default
request.query.fields = [ :title, :body ] # introspect the class
request.execute
search.facets.each do |facet|
facet.field # => e.g., "category"
# facet values populated from the response object
facet.values.each do |facet_value|
facet_value.to_s # => e.g., "Gardening"
facet_value.count # => e.g., 10
end
end