All examples use the Stretcher Ruby gem
- An Information Retrieval (IR) System
- A way to search your data in terms of natural language, and so much more
- A distributed version of Lucene with a JSON API
- A fancy clustered, eventually consistent database
Lucene is an information retrieval library providing full-text indexing and search. Elasticsearch provides a RESTish HTTP interface, clustering support, and other tools on top of it.
- Data is stored in an index, similar to an SQL DB
- Each index can store multiple types, similar to an SQL table
- Items inside the index are documents that have a type
- All documents are nested JSON data
- Strongly typed schema
# Set up our server
server = Stretcher::Server.new('http://localhost:9200')
# Create the index with its schema
server.index(:foo).create(mappings: {
  tweet: {
    properties: {
      text: {type: 'string',
             analyzer: 'snowball'}}}}) rescue nil
words = %w(Many dogs dog cat cats candles candleizer abscond rightly candlestick monkey monkeypulley deft deftly)
words.each.with_index { |w, idx|
  server.index(:foo).type(:tweet).put(idx + 1, {text: w})
}
- The document is a simple JSON hash: {"text": "word"}
- Each document has a unique ID
- We use put; Elasticsearch has a RESTish HTTP API (the raw request is shown below)
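Under the hood, that put call is a single HTTP request, roughly:
PUT /foo/tweet/1
{"text": "Many"}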
# A simple search
server.index(:foo).search(query: {match: {text: "abscond"}}).results.map(&:text)
=> ["abscond"]
- Our query is actually a JSON object (shown raw below)
- Our response is also JSON!
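Stripped of the gem, the same search is one HTTP call:
POST /foo/tweet/_search
{"query": {"match": {"text": "abscond"}}}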
Analysis is the process whereby words are transformed into tokens. The Snowball analyzer, for instance, turns English words into tokens based on their stems.
server.analyze("deft", analyzer: :snowball).tokens.map(&:token)
=> ["deft"]
server.analyze("deftly", analyzer: :snowball).tokens.map(&:token)
=> ["deft"]
server.analyze("deftness", analyzer: :snowball).tokens.map(&:token)
=> ["deft"]
server.analyze("candle", analyzer: :snowball).tokens.map(&:token)
=> ["candl"]
server.analyze("candlestick", analyzer: :snowball).tokens.map(&:token)
=> ["candlestick"]
# Will match deft and deftly
server.index(:foo).search(query: {match: {text: "deft"}}).results.map(&:text)
# => ["deft", "deftly"]
# Will match candles (same stem, "candl"), but not candlestick
server.index(:foo).search(query: {match: {text: "candle"}}).results.map(&:text)
# => ["candles"]
# NGram
server.analyze("news", tokenizer: "ngram", filter: "lowercase").tokens.map(&:token)
# => ["n", "e", "w", "s", "ne", "ew", "ws"]
# Stop word
server.analyze("The quick brown fox jumps over the lazy dog.", analyzer: :stop).tokens.map(&:token)
#=> ["quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
# Path Hierarchy
server.analyze("/var/lib/racoons", tokenizer: :path_hierarchy).tokens.map(&:token)
# => ["/var", "/var/lib", "/var/lib/racoons"]
# Create the index
server.index(:users).create(
  settings: {
    analysis: {
      analyzer: {
        my_ngram: {type: "custom", tokenizer: "ngram", filter: 'lowercase'}}}},
  mappings: {
    user: {
      properties: {
        name: {type: :string, analyzer: :my_ngram}}}})
# Store some fake data
users = %w(bender fry lela hubert cubert hermes calculon)
users.each_with_index { |name, i| server.index(:users).type(:user).put(i, {name: name}) }
# Our analyzer in action
server.index(:users).analyze("hubert", analyzer: :my_ngram).tokens.map(&:token)
# => ["h", "u", "b", "e", "r", "t", "hu", "ub", "be", "er", "rt"]
# Some queries
# Exact
server.index(:users).search(query: {match: {name: "Hubert"}}).results.map(&:name)
=> ["hubert", "cubert", "bender", "hermes", "fry", "calculon", "lela"]
# A Mis-spelled query
server.index(:users).search(query: {match: {name: "Calclulon"}}).results.map(&:name)
=> ["calculon", "lela", "cubert", "bender", "hubert"]
# Individual docs can be boosted
server.index(:users).type(:user).put(1000, {name: "boiler", "_boost" => 1_000_000})
server.index(:users).search(query: {match: {name: "bender"}}).results.map(&:name)
# Wha?
# => ["boiler", "bender", "hermes", "cubert", "hubert", "calculon", "fry", "lela"]
server.index(:users).search(query: {match: {name: "lela"}}).results.map(&:name)
# Sweet Zombie Jesus! The enormous _boost multiplies boiler's score past even exact matches
# => ["boiler", "lela", "calculon", "bender", "hermes", "cubert", "hubert"]
Elasticsearch can report counts of common terms in documents. These counts, often seen as the filter lists on the left-hand side of websites, are 'facets'. Below we create a bands index, map genre as not_analyzed so facets count whole genre names, bulk-load some documents, and query the facets.
POST /bands
PUT /bands/band/_mapping
{"band":{"properties":{"name":{"type":"string"},"genre":{"type":"string","index":"not_analyzed"}}}}
POST /_bulk
{"index": {"_index": "bands", "_type": "band", "_id": 1}}
{"name": "Stone Roses", "genre": "madchester"}
{"index": {"_index": "bands", "_type": "band", "_id": 2}}
{"name": "Aphex Twin", "genre": "IDM"}
{"index": {"_index": "bands", "_type": "band", "_id": 4}}
{"name": "Boards of Canada", "genre": "IDM"}
{"index": {"_index": "bands", "_type": "band", "_id": 5}}
{"name": "Mogwai", "genre": "Post Rock"}
{"index": {"_index": "bands", "_type": "band", "_id": 6}}
{"name": "Godspeed", "genre": "Post Rock"}
{"index": {"_index": "bands", "_type": "band", "_id": 7}}
{"name": "Harry Belafonte", "genre": "Calypso"}
// Perform a search
POST /bands/band/_search
{"size": 0, "facets":{"bands":{"terms":{"field":"genre"}}}}
// A more specific search
POST /bands/band/_search
{"size": 5, "query": {"match": {"name": "Harry"}}, "facets":{"bands":{"terms":{"field":"genre"}}}}
- Generally use an RDBMS (SQL) as the primary store
- Elasticsearch data should stay consistent with RDBMS transactions
- Elasticsearch data can be rebuilt from the RDBMS at any time
- ActiveRecord objects do not necessarily map 1:1 with ES objects
- ES should fail gracefully whenever possible. If ES dies, your app should degrade, not stop.
after_save do
  es_client.put(self.id, self.as_json)
end
Bad because:
- If another after_save hook fails and rolls back the transaction, the Elasticsearch write is not rolled back
- If ES goes down, your app goes down
- Even if you handle ES going down, you have to figure out which records need re-indexing when it comes back up
after_save do
  # Add to an RDBMS-backed queue of objects needing indexing
  IndexRequest.create(self)
end
Good because:
- Processed in background
- Transaction safe
- If ES dies, our queue backs up
- BONUS: Efficient bulk update now possible (see the sketch below)
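A minimal sketch of a worker draining that queue, assuming an ActiveRecord-backed IndexRequest model and Stretcher's bulk_index; the model and field names here are illustrative:
# Hypothetical background job: drain IndexRequest in batches and bulk-index
class IndexWorker
  def self.perform(server)
    IndexRequest.find_in_batches(batch_size: 500) do |batch|
      docs = batch.map do |req|
        user = User.find(req.record_id)
        {"_type" => "user", "_id" => user.id, "name" => user.name}
      end
      server.index(:users).bulk_index(docs) # one HTTP call for the whole batch
      IndexRequest.where(id: batch.map(&:id)).delete_all
    end
  end
end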
- Indexes are rebuilt without using the queue
- Multiple DelayedJob workers run mod-sharded queries over the table (see the sketch below)
- High-speed, parallel re-imports possible
- New content still flows through the queue
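A sketch of the mod-sharding idea: worker i of N claims the rows where id % N == i. The model and fields are again illustrative:
# Hypothetical: each of num_workers processes runs this with its own shard number
def reindex_shard(server, shard, num_workers)
  User.where("id % ? = ?", num_workers, shard).find_in_batches do |batch|
    docs = batch.map { |u| {"_type" => "user", "_id" => u.id, "name" => u.name} }
    server.index(:users).bulk_index(docs)
  end
end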
- No (good) way in ES to change a field's type
- Deleting and rebuilding may leave the site inoperable for too long
- Allow N indexes per model
- All indexes are updated in real time; the IndexRequest queue centralizes requests
- A batch job runs in the background, retroactively adding new records
- When the new index has caught up, point queries at it and delete the old one (a sketch of the dual-write step follows)
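A minimal sketch of the dual-write step, with hypothetical index names:
# While the new index catches up, every real-time update goes to all active indexes
ACTIVE_INDEXES = [:users_v1, :users_v2] # old schema, new schema

def index_user(server, user)
  ACTIVE_INDEXES.each do |idx|
    server.index(idx).type(:user).put(user.id, {name: user.name})
  end
end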
- Requires the ability to map models to indexes 1:n (we implemented m:n)
- Simultaneous bulk range and real-time indexing
- Fast enough bulk operations that you don't take ∞ time
- Some queries will be large and programmatically generated
- Our largest query is over 100 lines of expanded JSON
- Sometimes we need to run A/B tests between queries
- Each query gets its own class (see the sketch below)
- Plenty of space for DRY helpers within classes
- When running A/B tests, subclass for variations
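A minimal sketch of the query-class pattern; the class names and query bodies are hypothetical:
# Base query class: builds the JSON body programmatically
class UserNameQuery
  def initialize(term)
    @term = term
  end

  def to_query
    {query: {match: {name: @term}}}
  end
end

# A/B variant: subclass and override only the piece under test
class UserNameQueryStrict < UserNameQuery
  def to_query
    {query: {match: {name: {query: @term, operator: "and"}}}}
  end
end

server.index(:users).search(UserNameQuery.new("bender").to_query)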
- All queries run across all shards in the cluster
- Shards are allocated automatically to nodes and rebalanced
- A query sent to any node will work; the actual work is routed to the proper shards and nodes
- Shards are rack aware
- Indexes have a configurable number of replicas, set this based on your failure tolerance
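Shard and replica counts are per-index settings; a minimal sketch at creation time:
# 5 primary shards, each with 2 replica copies
server.index(:bands).create(settings: {number_of_shards: 5, number_of_replicas: 2})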
- Elasticsearch is easy to set up!
- It's just a Java jar; all you need is Java installed
- A .deb package is available
- Clustering just works...
- On a LAN, nodes will find each other and figure everything out
- On EC2, install the EC2 plugin and they will find each other
- There is no built-in security, but putting an nginx proxy in front works well
- This talk: http://bit.ly/142wv13
- http://www.elasticsearch.org/
- http://exploringelasticsearch.com (my free book on elasticsearch)
- https://github.com/PoseBiz/stretcher
- Paramedic Cluster Monitoring tool: https://github.com/karmi/elasticsearch-paramedic