# elasticsearch Crash Course!
By Andrew Cholakian

What is elasticsearch?


A way to search... things
A way to search your data in terms of natural language, and so much more
A distributed version of Lucene with a JSON API.
A fancy clustered database

What is Lucene?

A software library providing full-text indexing and search. elasticsearch provides an HTTP interface, clustering support, and other tools on top of it.

Modeling Data


Data is stored in an index, similar to an SQL DB
Each index can store multiple types, similar to an SQL table
Items inside the index are documents that have a type
Specifying attributes for a type is optional
All data is sent as JSON, and can have an arbitrary depth
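
To make the model concrete, here is a minimal sketch in plain Ruby with no elasticsearch server involved. The nested field names (user, location, entities) are hypothetical and exist only to show arbitrary depth; the index and type names are borrowed from the slides that follow.

```ruby
require 'json'

# A hypothetical "tweet" document; the nested fields are illustrative only.
tweet = {
  text: "Many dogs",
  user: {name: "fry", location: {city: "New New York"}},
  entities: {hashtags: ["dogs"]}
}

# Documents travel to and from elasticsearch as JSON.
json = JSON.generate(tweet)

# Rough SQL analogy: index ~ database, type ~ table, document ~ row.
address = {index: "foo", type: "tweet", id: 1}
path = "/#{address[:index]}/#{address[:type]}/#{address[:id]}"
```

Here `path` ends up as `/foo/tweet/1`, which is where a document with that index, type, and ID lives in the HTTP API.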

Creating a Schema

require 'stretcher'

# Set up our server
server = Stretcher::Server.new('http://localhost:9200')
# Create the index with its schema
# (rescue nil swallows any error, for example if the index already exists)
server.index(:foo).create(mappings: {
  tweet: {
    properties: {
      text: {type: 'string', analyzer: 'snowball'}
    }
  }
}) rescue nil

Create some fake data

words = %w(Many dogs dog cat cats candles candleizer abscond rightly candlestick monkey monkeypulley deft deftly)
id = 0
words.each {|w|
  id+=1
  server.index(:foo).type(:tweet).put(id, {text: w })
}

The document is a simple JSON hash: {"text": "word" }
Each document has a unique ID
We use put because elasticsearch has a RESTish API: documents are stored with an HTTP PUT
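
As a rough illustration of the RESTish shape, the sketch below builds (but never sends) the kind of HTTP request that storing a document boils down to. It uses only Ruby's standard library; it is an assumption about the wire format, not a description of Stretcher's internals.

```ruby
require 'json'
require 'net/http'

# Build, without sending, a request that would store document 1
# of type "tweet" in index "foo".
request = Net::HTTP::Put.new("/foo/tweet/1", "Content-Type" => "application/json")
request.body = JSON.generate({text: "abscond"})

summary = "#{request.method} #{request.path}"
```

`summary` reads `PUT /foo/tweet/1`, and `request.body` holds the document as a JSON string.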

And Perform a Search!

# A simple search
server.index(:foo).search(query: {match: {text: "abscond"}}).results.map(&:text)
# => ["abscond"]

Our query is actually a JSON object
Our response is also JSON!
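
A small sketch of both halves, using only Ruby's JSON library. The response below is a hand-written, abbreviated sample (the `_id` and `_score` values are made up), not output captured from a server.

```ruby
require 'json'

# The Ruby hash passed to search is serialized to JSON before it is sent.
query = {query: {match: {text: "abscond"}}}
body = JSON.generate(query)

# Hand-written sample response; the values are illustrative only.
sample_response = <<~RESPONSE
  {"hits": {"total": 1, "hits": [
    {"_index": "foo", "_type": "tweet", "_id": "8", "_score": 1.0,
     "_source": {"text": "abscond"}}
  ]}}
RESPONSE

# Pull the stored documents back out of the response.
texts = JSON.parse(sample_response)["hits"]["hits"].map { |hit| hit["_source"]["text"] }
```

The stored document comes back under `_source`, which is roughly what a client library unwraps when it exposes results as objects.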

What is Analysis?

Analysis is the process whereby words are transformed into tokens.
The Snowball analyzer, for instance, turns English words into tokens based on their stems.
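
To get a feel for what an analyzer does, here is a toy pipeline in plain Ruby: split into words, lowercase, then strip a suffix. The suffix list is an assumption made up for this sketch; the real Snowball stemmer uses far more elaborate rules and will give different tokens for many words.

```ruby
# Toy suffix list, NOT the Snowball algorithm.
TOY_SUFFIXES = %w(ness ly s).freeze

# Strip the first matching suffix, keeping at least three leading characters.
def toy_stem(word)
  suffix = TOY_SUFFIXES.find { |s| word.end_with?(s) && word.length - s.length >= 3 }
  suffix ? word[0, word.length - suffix.length] : word
end

# Analysis pipeline: split into words, lowercase, then stem each word.
def toy_analyze(text)
  text.downcase.scan(/[a-z0-9]+/).map { |word| toy_stem(word) }
end
```

With these rules, `toy_analyze("Deft deftly deftness")` yields three copies of the token `deft`, which is the property that lets related words match each other at search time.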

Analysis Using the API

server.analyze("deft", analyzer: :snowball).tokens.map(&:token)
# => ["deft"]
server.analyze("deftly", analyzer: :snowball).tokens.map(&:token)
# => ["deft"]
server.analyze("deftness", analyzer: :snowball).tokens.map(&:token)
# => ["deft"]
server.analyze("candle", analyzer: :snowball).tokens.map(&:token)
# => ["candl"]
server.analyze("candlestick", analyzer: :snowball).tokens.map(&:token)
# => ["candlestick"]

Analysis in Action

# Will match deft and deftly
server.index(:foo).search(query: {match: {text: "deft"}}).results.map(&:text)
# => ["deft", "deftly"]
# Will match candles, but not candlestick
server.index(:foo).search(query: {match: {text: "candle"}}).results.map(&:text)
# => ["candles"]
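
Why does this work? Both the stored text and the query go through the same analyzer, and the lookup happens on the resulting tokens. The sketch below mimics that with a toy inverted index in plain Ruby; the one-line stemmer is a stand-in invented for this example, not Snowball.

```ruby
# Toy stemmer: lowercases and strips a trailing "ly" or "s". NOT Snowball.
def toy_stem(word)
  word.downcase.sub(/(ly|s)\z/, "")
end

words = %w(dogs dog deft deftly candles candlestick)

# Inverted index: stemmed term => original documents containing it.
index = Hash.new { |hash, term| hash[term] = [] }
words.each { |word| index[toy_stem(word)] << word }

# The query is analyzed the same way before the lookup.
def toy_search(index, query)
  index.fetch(toy_stem(query), [])
end
```

Searching this toy index for `deft` returns both `deft` and `deftly`, while `candle` finds `candles` but not `candlestick`, because `candlestick` is stored under its own, different token.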

More kinds of Analysis

# NGram
server.analyze("news", tokenizer: "ngram", filter: "lowercase").tokens.map(&:token)
# => ["n", "e", "w", "s", "ne", "ew", "ws"]

# Stop word
server.analyze("The quick brown fox jumps over the lazy dog.", analyzer: :stop).tokens.map(&:token)
# => ["quick", "brown", "fox", "jumps", "over", "lazy", "dog"]

# Path Hierarchy
server.analyze("/var/lib/racoons", tokenizer: :path_hierarchy).tokens.map(&:token)
# => ["/var", "/var/lib", "/var/lib/racoons"]
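
None of these tokenizers is magic. The plain-Ruby sketch below approximates all three so the outputs above are easier to reason about. The gram sizes (1 to 2), the short stop-word list, and the assumption of an absolute path are choices made for this example; the real implementations have more options and may differ in detail.

```ruby
# Character n-grams, shortest first.
def ngrams(text, min: 1, max: 2)
  chars = text.downcase.chars
  (min..max).flat_map { |n| chars.each_cons(n).map(&:join) }
end

# A deliberately short stop-word list for illustration.
STOP_WORDS = %w(a an and in of the to).freeze

def remove_stop_words(text)
  text.downcase.scan(/[a-z]+/).reject { |word| STOP_WORDS.include?(word) }
end

# Every ancestor of an absolute path, shortest first.
def path_hierarchy(path, delimiter: "/")
  parts = path.split(delimiter).reject(&:empty?)
  (1..parts.length).map { |n| delimiter + parts.first(n).join(delimiter) }
end
```

For the inputs used on this slide, these helpers reproduce the same token lists, for example `ngrams("news")` gives `["n", "e", "w", "s", "ne", "ew", "ws"]`.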

Searching With An NGram

# Create the index
server.index(:users).create(
  settings: {
    analysis: {
      analyzer: {
        my_ngram: {type: "custom", tokenizer: "ngram", filter: "lowercase"}
      }
    }
  },
  mappings: {
    user: {
      properties: {
        name: {type: :string, analyzer: :my_ngram}
      }
    }
  }
)

# Store some fake data
users = %w(bender fry lela hubert cubert hermes calculon)
users.each_with_index {|name,i| server.index(:users).type(:user).put(i, {name: name}) }

# Our analyzer in action
server.index(:users).analyze("hubert", analyzer: :my_ngram).tokens.map(&:token)
# => ["h", "u", "b", "e", "r", "t", "hu", "ub", "be", "er", "rt"]

# Some queries

# Exact
server.index(:users).search(query: {match: {name: "Hubert"}}).results.map(&:name)
# => ["hubert", "cubert", "bender", "hermes", "fry", "calculon", "lela"]

# A misspelled query
server.index(:users).search(query: {match: {name: "Calclulon"}}).results.map(&:name)
# => ["calculon", "lela", "cubert", "bender", "hubert"]
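
The result lists make more sense once you count shared grams. The sketch below is a simplification in plain Ruby: it treats a name as a match if it shares at least one 1- or 2-character gram with the query, and ranks by the number of shared grams. Real relevance scoring is more involved, so the exact order can differ from what the server returns.

```ruby
# Unique character grams of length 1 and 2, lowercased.
def grams(text)
  chars = text.downcase.chars
  (1..2).flat_map { |n| chars.each_cons(n).map(&:join) }.uniq
end

users = %w(bender fry lela hubert cubert hermes calculon)

# Names sharing at least one gram with the query, most shared grams first.
def overlapping(users, query)
  wanted = grams(query)
  users.map { |name| [name, (grams(name) & wanted).size] }
       .reject { |_name, shared| shared.zero? }
       .sort_by { |name, shared| [-shared, name] }
end
```

Under this model every user shares at least one gram with "Hubert", which is why all seven come back, while "Calclulon" shares nothing with fry or hermes, so those two drop out and calculon ranks first.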

Boosting

# Individual docs can be boosted
server.index(:users).type(:user).put(1000, {name: "boiler", "_boost" => 1_000_000})

server.index(:users).search(query: {match: {name: "bender"}}).results.map(&:name)
# Wha?
# => ["boiler", "bender", "hermes", "cubert", "hubert", "calculon", "fry", "lela"]

server.index(:users).search(query: {match: {name: "lela"}}).results.map(&:name)
# Sweet Zombie Jesus!
# => ["boiler", "lela", "calculon", "bender", "hermes", "cubert", "hubert"]
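
One way to picture what happened: treat the boost as a plain multiplier on each document's score. That is an assumption made for this sketch (real scoring has more moving parts), and the shared-gram score is a toy stand-in as well, but it is enough to show why a weak match with a huge boost beats an exact match.

```ruby
# Unique character grams of length 1 and 2, lowercased.
def grams(text)
  chars = text.downcase.chars
  (1..2).flat_map { |n| chars.each_cons(n).map(&:join) }.uniq
end

# Hypothetical boost table; every other document defaults to 1.
BOOSTS = {"boiler" => 1_000_000}.freeze

# Toy score: shared grams multiplied by the document's boost.
def toy_score(name, query)
  (grams(name) & grams(query)).size * BOOSTS.fetch(name, 1)
end

def toy_rank(names, query)
  names.select { |name| toy_score(name, query) > 0 }
       .sort_by { |name| -toy_score(name, query) }
end

names = %w(bender fry lela hubert cubert hermes calculon boiler)
```

`toy_rank(names, "bender")` puts boiler ahead of bender even though boiler shares only four grams with the query, and `toy_rank(names, "lela")` drops fry entirely because it shares no grams at all.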

Faceting

ElasticSearch can report counts of common terms in documents. Frequently seen on the left-hand side of websites, these are 'facets'.

Let's Facet

# Create a mapping for bands, with a 'name' and a 'genre'
server.index(:bands).create(mappings: {
  band: {
    properties: {
      name: {type: :string},
      genre: {type: :string, index: :not_analyzed}
    }
  }
})

# Import some docs
[["Stone Roses", "madchester"], ["Boards of Canada", "IDM"], ["Aphex Twin", "IDM"], ["Mogwai", "Post Rock"], ["Godspeed", "Post Rock"], ["Harry Belafonte", "Calypso"]].
each_with_index {|b,i|
  server.index(:bands).type(:band).put(i, {name: b[0], genre: b[1]})
}

# Perform a search
server.index(:bands).search(facets: {bands: {terms: {field: :genre}}}).facets.bands.terms.map {|f| [f[:term], f[:count]]}
# => [["Post Rock", 2], ["IDM", 2], ["madchester", 1], ["Calypso", 1]]

# A more specific search
server.index(:bands).search(query: {match: {name: "Boards"}}, facets: {bands: {terms: {field: :genre}}}).facets.bands.terms.map {|f| [f[:term], f[:count]]}
# => [["IDM", 1]]
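
Conceptually, a terms facet is a group-and-count over the matching documents. The plain-Ruby sketch below does the same thing in memory. Each genre is counted as one whole term, which mirrors the not_analyzed mapping above (otherwise "Post Rock" would be split into two terms). Ties are broken alphabetically here, an arbitrary choice for this example.

```ruby
bands = [
  ["Stone Roses", "madchester"], ["Boards of Canada", "IDM"], ["Aphex Twin", "IDM"],
  ["Mogwai", "Post Rock"], ["Godspeed", "Post Rock"], ["Harry Belafonte", "Calypso"]
]

# Count genres as whole terms, most frequent first.
def genre_facet(bands)
  counts = Hash.new(0)
  bands.each { |_name, genre| counts[genre] += 1 }
  counts.sort_by { |genre, count| [-count, genre] }
end

all_genres = genre_facet(bands)

# Restricting the documents first restricts the facet, as with the "Boards" query.
boards_only = genre_facet(bands.select { |name, _genre| name.include?("Boards") })
```

`all_genres` lists IDM and Post Rock with two bands each, and `boards_only` is just `[["IDM", 1]]`.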

Does ElasticSearch Support Clustering?

You're Damn Right it Supports Clustering!


The Clustering Story


All queries run across all shards in the cluster
Shards are allocated automatically to nodes and rebalanced
A query to any node will work; the actual queries will be executed on the proper shard / node
Shards are rack aware
Indexes have a configurable number of replicas; set this based on your failure tolerance
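
A toy model can make the routing idea less abstract. elasticsearch places a document on a shard by hashing its routing value (the ID by default) modulo the number of primary shards. The sketch below imitates that with CRC32 as a stand-in hash; the shard count, the node names, and the round-robin allocation are all assumptions made up for this example.

```ruby
require 'zlib'

NUMBER_OF_SHARDS = 5                        # hypothetical index setting
NODES = %w(node-a node-b node-c).freeze     # hypothetical node names

# Stand-in hash: CRC32 is used only because it ships with Ruby.
def shard_for(doc_id)
  Zlib.crc32(doc_id.to_s) % NUMBER_OF_SHARDS
end

# Toy allocation: shards are dealt out to nodes round-robin.
def node_for(shard)
  NODES[shard % NODES.size]
end

# A search fans out to every shard, so it touches every node holding one.
def nodes_for_search
  (0...NUMBER_OF_SHARDS).map { |shard| node_for(shard) }.uniq
end
```

A lookup by ID only needs `node_for(shard_for(id))`, while a search has to visit everything in `nodes_for_search`. Whichever node receives the request does this bookkeeping for you.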

The Ops Side of elasticsearch


elasticsearch is easy to set up!
Just a Java jar; all you need is Java installed
Has a .deb package available

Clustering just works


Clustering just works...
If on a LAN, the nodes will find each other and figure everything out
If on EC2, install the EC2 plugin and the nodes will find each other
There is no built-in security, but putting an nginx proxy in front works well

Thank You for Listening!

Links


http://www.elasticsearch.org/
Paramedic Cluster Monitoring tool: https://github.com/karmi/elasticsearch-paramedic
This presentation: https://gist.github.com/andrewvc/5022184

This Page Intentionally Left Blank