@andrewvc
Last active November 2, 2020 06:24
Elastic Search Crash Course for LA Hacker News

#elasticsearch Crash Course!

By Andrew Cholakian

What is elasticsearch?

  1. A way to search... things
  2. A way to search your data in terms of natural language, and so much more
  3. A distributed version of lucene with a JSON API.
  4. A fancy clustered database

What is lucene?

A software library providing full-text indexing and search. elasticsearch provides an HTTP interface, clustering support, and other tools on top of it.

Modeling Data

  • Data is stored in an index, similar to an SQL DB
  • Each index can store multiple types, similar to an SQL table
  • Items inside the index are documents that have a type
  • Specifying attributes for a type is optional
  • All data is sent as JSON, and can have an arbitrary depth
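To make the last bullet concrete, here is a minimal sketch of a nested document serialized to JSON. The `user` and `geo` fields are hypothetical, chosen only to show nesting; the examples in this crash course only ever use a flat `text` field.

```ruby
require 'json'

# Hypothetical tweet document; only "text" is used elsewhere in this crash course.
tweet = {
  text: "abscond",
  user: { name: "bender", geo: { city: "New New York" } }
}

# This JSON string is what would travel over the wire
json = JSON.generate(tweet)
puts json

# Parsing gives back the same nested structure, with string keys
parsed = JSON.parse(json)
puts parsed["user"]["geo"]["city"]
```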

Creating a Schema

# Setup our server
server = Stretcher::Server.new('http://localhost:9200')
# Create the index with its schema
server.index(:foo).create(mappings: {
                  tweet: {
                    properties: {
                      text: {type: 'string', 
                      analyzer: 'snowball'}}}}) rescue nil

Create some fake data

words = %w(Many dogs dog cat cats candles candleizer abscond rightly candlestick monkey monkeypulley deft deftly)
id = 0
words.each {|w|
  id+=1
  server.index(:foo).type(:tweet).put(id, {text: w })
}
  • The document is a simple JSON hash: {"text": "word" }
  • Each document has a unique ID
  • We use put because elasticsearch has a RESTish API
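For reference, a put like the one above presumably boils down to an HTTP PUT against `/index/type/id` with the document as the JSON body. The sketch here only builds the request object with Ruby's standard library and does not send it. That Stretcher issues exactly this request is an assumption, and `index_request` is a hypothetical helper name.

```ruby
require 'net/http'
require 'json'

# Hypothetical helper: build (but do not send) the request for one document
def index_request(index, type, id, doc)
  req = Net::HTTP::Put.new("/#{index}/#{type}/#{id}")
  req["Content-Type"] = "application/json"
  req.body = JSON.generate(doc)
  req
end

req = index_request(:foo, :tweet, 1, { text: "Many" })
puts "#{req.method} #{req.path}"
puts req.body
```

To actually send it, something like `Net::HTTP.start("localhost", 9200) { |http| http.request(req) }` should work against a local node.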

And Perform a Search!

# A simple search
server.index(:foo).search(query: {match: {text: "abscond"}}).results.map(&:text)
# => ["abscond"]
  • our query is actually a JSON object
  • our response is also JSON!
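As a rough illustration of both bullets, this sketch prints the query as JSON and pulls the matching text out of a search response. The response here is hand-written to resemble the usual `hits.hits[]._source` layout; it was not captured from a server, and the `_score` value is made up.

```ruby
require 'json'

# The same query hash as above, serialized the way it goes over HTTP
query = { query: { match: { text: "abscond" } } }
puts JSON.generate(query)

# Hand-written sample response (an assumption, not real server output)
raw_response = <<~RESPONSE
  {
    "hits": {
      "total": 1,
      "hits": [
        { "_index": "foo", "_type": "tweet", "_id": "8",
          "_score": 1.0, "_source": { "text": "abscond" } }
      ]
    }
  }
RESPONSE

# Each hit carries the original document under "_source"
texts = JSON.parse(raw_response)["hits"]["hits"].map { |hit| hit["_source"]["text"] }
p texts
```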

What is Analysis?

Analysis is the process whereby words are transformed into tokens. The Snowball analyzer, for instance, turns English words into tokens based on their stems.
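To get a feel for the idea without a server, here is a toy analyzer in plain Ruby. It is deliberately naive and is not the Snowball algorithm: it lowercases, splits on non-letters, and strips a few hard-coded suffixes. The names `toy_stem` and `toy_analyze` and the suffix list are assumptions made up for this sketch.

```ruby
# Hypothetical suffix list; real stemmers use far more careful rules
SUFFIXES = %w[ness ly s].freeze

# Strip the first matching suffix, but keep at least three characters of stem
def toy_stem(word)
  suffix = SUFFIXES.find { |s| word.end_with?(s) && word.length - s.length >= 3 }
  suffix ? word[0...-suffix.length] : word
end

# Text in, list of tokens out
def toy_analyze(text)
  text.downcase.scan(/[a-z]+/).map { |word| toy_stem(word) }
end

p toy_analyze("Deft deftly deftness")
p toy_analyze("dogs dog")
```

The point is only that several surface forms collapse into one token, which is what makes them searchable as the same word.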

An Analyzer in Action

Analysis Using the API

server.analyze("deft", analyzer: :snowball).tokens.map(&:token)
# => ["deft"]
server.analyze("deftly", analyzer: :snowball).tokens.map(&:token)
# => ["deft"]
server.analyze("deftness", analyzer: :snowball).tokens.map(&:token)
# => ["deft"]
server.analyze("candle", analyzer: :snowball).tokens.map(&:token)
# => ["candl"]
server.analyze("candlestick", analyzer: :snowball).tokens.map(&:token)
# => ["candlestick"]

Analysis in Action

# Will match deft and deftly
server.index(:foo).search(query: {match: {text: "deft"}}).results.map(&:text)
# => ["deft", "deftly"]
# Will match candles, but not candlestick
server.index(:foo).search(query: {match: {text: "candle"}}).results.map(&:text)
# => ["candles"]
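Why these searches behave this way can be sketched with a tiny inverted index: documents and queries go through the same analyzer, and a match is just a shared token. This is a simplified model, not how Lucene is implemented; `ANALYZE`, `build_index`, and `search` are hypothetical names, and the analyzer only strips a trailing "ly".

```ruby
# Hypothetical toy analyzer: lowercase, split into words, strip a trailing "ly"
ANALYZE = ->(text) { text.downcase.scan(/[a-z]+/).map { |w| w.sub(/ly\z/, "") } }

# Map each token to the ids of the documents that contain it
def build_index(docs)
  index = Hash.new { |hash, key| hash[key] = [] }
  docs.each do |id, text|
    ANALYZE.call(text).each { |token| index[token] << id }
  end
  index
end

# Analyze the query the same way, then look its tokens up in the index
def search(index, docs, query)
  ids = ANALYZE.call(query).flat_map { |token| index.fetch(token, []) }.uniq
  ids.map { |id| docs[id] }
end

docs = { 1 => "deft", 2 => "deftly", 3 => "candlestick" }
index = build_index(docs)
p search(index, docs, "Deftly")
p search(index, docs, "candle")
```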

More kinds of Analysis

# NGram
server.analyze("news", tokenizer: "ngram", filter: "lowercase").tokens.map(&:token)
# =>  ["n", "e", "w", "s", "ne", "ew", "ws"]

# Stop word
server.analyze("The quick brown fox jumps over the lazy dog.", analyzer: :stop).tokens.map(&:token)
#=> ["quick", "brown", "fox", "jumps", "over", "lazy", "dog"]

# Path Hierarchy
server.analyze("/var/lib/racoons", tokenizer: :path_hierarchy).tokens.map(&:token)
# => ["/var", "/var/lib", "/var/lib/racoons"]
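Two of these tokenizers are simple enough to approximate in a few lines of plain Ruby, which can help when reasoning about what ends up in the index. This is a sketch under assumptions: the gram sizes of 1 to 2 are inferred from the output shown above, the path version assumes a leading delimiter, and `ngrams` and `path_hierarchy` are hypothetical helper names.

```ruby
# All substrings of length min..max, shortest first, after lowercasing
def ngrams(text, min: 1, max: 2)
  chars = text.downcase.chars
  (min..max).flat_map { |n| chars.each_cons(n).map(&:join) }
end

# Every ancestor path of a delimited path, shortest first
def path_hierarchy(path, delimiter: "/")
  parts = path.split(delimiter).reject(&:empty?)
  parts.each_index.map { |i| delimiter + parts[0..i].join(delimiter) }
end

p ngrams("news")
p path_hierarchy("/var/lib/racoons")
```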

Searching With an NGram

# Create the index
server.index(:users).create(
  settings: {
    analysis: {
      analyzer: {
        my_ngram: {type: "custom", tokenizer: "ngram", filter: "lowercase"}}}},
  mappings: {
    user: {
      properties: {
        name: {type: :string, analyzer: :my_ngram}}}})

# Store some fake data
users = %w(bender fry lela hubert cubert hermes calculon)
users.each_with_index {|name,i| server.index(:users).type(:user).put(i, {name: name}) }

# Our analyzer in action
server.index(:users).analyze("hubert", analyzer: :my_ngram).tokens.map(&:token)
# => ["h", "u", "b", "e", "r", "t", "hu", "ub", "be", "er", "rt"]

# Some queries

# Exact
server.index(:users).search(query: {match: {name: "Hubert"}}).results.map(&:name)
# => ["hubert", "cubert", "bender", "hermes", "fry", "calculon", "lela"]

# A misspelled query
server.index(:users).search(query: {match: {name: "Calclulon"}}).results.map(&:name)
# => ["calculon", "lela", "cubert", "bender", "hubert"]
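The result sets above look odd until you consider that, with 1 to 2 character grams, a document matches as soon as it shares a single gram with the query. The sketch here models only that matching rule (any shared gram counts) and ignores scoring entirely; that the match query ORs its tokens together is an assumption, and `ngram_set` and `matching_users` are hypothetical names.

```ruby
require 'set'

# Unique 1- and 2-character grams of a lowercased string
def ngram_set(text, min: 1, max: 2)
  chars = text.downcase.chars
  (min..max).flat_map { |n| chars.each_cons(n).map(&:join) }.to_set
end

USERS = %w(bender fry lela hubert cubert hermes calculon).freeze

# A user matches when it shares at least one gram with the query
def matching_users(query)
  query_grams = ngram_set(query)
  USERS.select { |name| (ngram_set(name) & query_grams).any? }
end

p matching_users("Hubert")
p matching_users("Calclulon")
```

Under this model "Hubert" matches every user, while "Calclulon" misses only fry and hermes, which share no letters with it.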

Boosting

# Individual docs can be boosted
server.index(:users).type(:user).put(1000, {name: "boiler", "_boost" => 1_000_000})

server.index(:users).search(query: {match: {name: "bender"}}).results.map(&:name)
# Wha?
# => ["boiler", "bender", "hermes", "cubert", "hubert", "calculon", "fry", "lela"]

server.index(:users).search(query: {match: {name: "lela"}}).results.map(&:name)
# Sweet Zombie Jesus!
# => ["boiler", "lela", "calculon", "bender", "hermes", "cubert", "hubert"]
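A toy model shows why "boiler" wins both searches: it shares a few grams with almost any query, and a huge boost swamps everything else. The score used here (shared grams multiplied by the boost) is an assumption for illustration only and is not Lucene's scoring formula; `Doc`, `grams`, and `toy_score` are hypothetical names.

```ruby
Doc = Struct.new(:name, :boost)

# Unique 1- and 2-character grams of a lowercased string
def grams(text)
  chars = text.downcase.chars
  (1..2).flat_map { |n| chars.each_cons(n).map(&:join) }.uniq
end

# Toy score: number of shared grams times the document boost
def toy_score(doc, query)
  (grams(doc.name) & grams(query)).size * doc.boost
end

docs = [Doc.new("lela", 1), Doc.new("bender", 1), Doc.new("boiler", 1_000_000)]
ranked = docs.sort_by { |doc| -toy_score(doc, "lela") }.map(&:name)
p ranked
```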

Faceting

ElasticSearch can report counts of common terms in documents. These counts are 'facets', frequently seen on the left-hand side of websites.

Facets on Amazon

Let's Facet

# Create a mapping for bands, with a 'name' and a 'genre'
server.index(:bands).create(mappings: {band: {properties: {name: {type: :string}, genre: {type: :string, index: :not_analyzed} }}})

# Import some docs
[["Stone Roses", "madchester"], ["Boards of Canada", "IDM"], ["Aphex Twin", "IDM"], ["Mogwai", "Post Rock"], ["Godspeed", "Post Rock"], ["Harry Belafonte", "Calypso"]].
each_with_index {|b,i|
  server.index(:bands).type(:band).put(i, {name: b[0], genre: b[1]})
}

# Perform a search
server.index(:bands).search(facets: {bands: {terms: {field: :genre}}}).facets.bands.terms.map {|f| [f[:term], f[:count]]}
# => [["Post Rock", 2], ["IDM", 2], ["madchester", 1], ["Calypso", 1]]

# A more specific search
server.index(:bands).search(query: {match: {name: "Boards"}}, facets: {bands: {terms: {field: :genre}}}).facets.bands.terms.map {|f| [f[:term], f[:count]]}
# => [["IDM", 1]]
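Conceptually, a terms facet is a group-and-count over the documents that match the query. The plain-Ruby sketch here mimics that for the same bands, counting each genre as one whole term, which is what `index: :not_analyzed` is for (otherwise "Post Rock" would be split into separate words). `genre_facet` is a hypothetical helper, and breaking ties alphabetically is my own choice, so the order may differ from what the server returns.

```ruby
BANDS = [
  ["Stone Roses", "madchester"], ["Boards of Canada", "IDM"],
  ["Aphex Twin", "IDM"], ["Mogwai", "Post Rock"],
  ["Godspeed", "Post Rock"], ["Harry Belafonte", "Calypso"]
].freeze

# Count bands per genre, most common first (ties broken alphabetically)
def genre_facet(bands)
  counts = bands.each_with_object(Hash.new(0)) { |(_name, genre), acc| acc[genre] += 1 }
  counts.sort_by { |term, count| [-count, term] }
end

# Facet over everything
p genre_facet(BANDS)

# Facet over only the bands whose name contains "Boards"
p genre_facet(BANDS.select { |name, _genre| name.include?("Boards") })
```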

Does ElasticSearch Support Clustering?

You're Damn Right it Supports Clustering!

ES Clustering

The Clustering Story

  • All queries run across all shards in the cluster
  • Shards are allocated automatically to nodes and rebalanced
  • A query to any node will work; the actual queries will be executed on the proper shard / node
  • Shards are rack aware
  • Indexes have a configurable number of replicas; set this based on your failure tolerance
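The general idea behind spreading documents over shards can be sketched as hashing the document id and taking it modulo the number of shards. This is a simplified model: `Zlib.crc32` is a stand-in hash chosen for the sketch rather than the function elasticsearch uses, the shard and replica counts are arbitrary, and `shard_for` and `total_shard_copies` are hypothetical names.

```ruby
require 'zlib'

# Pick a shard for a document id (stand-in hash, see the note above)
def shard_for(doc_id, number_of_shards)
  Zlib.crc32(doc_id.to_s) % number_of_shards
end

# One primary copy plus `replicas` extra copies of every shard
def total_shard_copies(number_of_shards, replicas)
  number_of_shards * (1 + replicas)
end

(1..5).each { |id| puts "doc #{id} -> shard #{shard_for(id, 5)}" }
puts total_shard_copies(5, 1)
```

With one replica every shard exists twice, so losing a single node should not lose data as long as the two copies live on different nodes.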

The Ops Side of elasticsearch

  • elasticsearch is easy to set up!
  • Just a Java jar; all you need is Java installed
  • Has a .deb package available

Clustering just works

  • Clustering just works...
  • Nodes on the same LAN will find each other and figure everything out
  • On EC2, install the EC2 plugin and the nodes will find each other
  • There is no built-in security, but putting nginx in front as a proxy works well

Thank You for Listening!

Links

This Page Intentionally Left Blank
