GET http://localhost:9200/:index/:type/:id
PUT http://localhost:9200/:index/:type/(:id)
# All resources
DELETE http://localhost:9200/_all
# An index
DELETE http://localhost:9200/:index
# A type
DELETE http://localhost:9200/:index/:type
# An item
DELETE http://localhost:9200/:index/:type/:id
When you insert data into Elasticsearch, it uses dynamic detection to determine the data type of each field.
GET http://localhost:9200/:index/:type/_mapping
- If you are adding fields, there is no need to reindex.
- If you need to change an existing field, you must reindex your data:
- Create a new index with the new mapping (see PUT mapping below)
- Pull in documents from the old index using a scrolled search and index them to the new index using the bulk API. Note: make sure that you include search_type=scan in your search request. This disables sorting and makes "deep paging" efficient.
- Update index alias.
- Delete the old index
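The two request bodies needed by the reindex steps above can be sketched in Python. This is only an illustration: the helper names `hits_to_bulk_body` and `alias_swap_body` and the index and alias names are hypothetical, and the hits are assumed to have the usual `_type`, `_id` and `_source` keys returned by a scrolled search. Sending the requests (to `/_bulk` and `/_aliases`) is left out.

```python
import json


def hits_to_bulk_body(hits, new_index):
    """Turn a page of scroll hits into a newline-delimited bulk API body."""
    lines = []
    for hit in hits:
        # Action line: index the document into the new index, keeping type and id.
        action = {"index": {"_index": new_index, "_type": hit["_type"], "_id": hit["_id"]}}
        lines.append(json.dumps(action))
        # Source line: the document itself, unchanged.
        lines.append(json.dumps(hit["_source"]))
    # Every line of a bulk body, including the last one, ends with a newline.
    return "".join(line + "\n" for line in lines)


def alias_swap_body(alias, old_index, new_index):
    """Body for POST /_aliases that moves an alias from the old index to the new one."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    page = [{"_index": "tweets_v1", "_type": "tweet", "_id": "1", "_source": {"text": "hi"}}]
    print(hits_to_bulk_body(page, "tweets_v2"), end="")
    print(json.dumps(alias_swap_body("tweets", "tweets_v1", "tweets_v2")))
```

Putting both alias actions in one request means searches against the alias never see a moment where it points at no index.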
PUT http://localhost:9200/:index -d '{
"mappings": {
"tweet": {
"properties": {
…
}
}
}
}'
The hits object gives you the top 10 hits that matched the query. The score represents how well each result matched the query.
# Entire database
GET http://localhost:9200/_search
# One index
GET http://localhost:9200/:index/_search
# Multiple indices
GET http://localhost:9200/:index,:index/_search
# Wildcards
GET http://localhost:9200/hub*/_search
# Page 1
GET http://localhost:9200/_search?size=5&from=0
# Page 2
GET http://localhost:9200/_search?size=5&from=5
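The relationship between a page number and the size and from parameters above can be written down as a small helper. This is a sketch only; the function name `search_page_url` and its defaults are made up for illustration.

```python
from urllib.parse import urlencode


def search_page_url(page, size=5, base="http://localhost:9200/_search"):
    """Build the search URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # `from` is the number of hits to skip, so page 1 starts at offset 0.
    params = {"size": size, "from": (page - 1) * size}
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(search_page_url(1))
    print(search_page_url(2))
```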
The go-to query when you need to run a query on any one field. Main use is for full-text searches.
{
"query": {
"match": {
"post_title": {
"query": "cancer research"
}
}
}
}
{
"query": {
"match": {
"post_title": {
"query": "cancer research",
"operator": "AND"
}
}
}
}
{
"query": {
"match": {
"post_title": {
"query": "cancer research",
"minimum_should_match": "75%"
}
}
}
}
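To get a feel for what a percentage in minimum_should_match means, the sketch below converts it into a required number of terms. It assumes the documented behaviour that the number computed from a positive percentage is rounded down, and it additionally assumes that at least one term always has to match. The function name `required_terms` is hypothetical.

```python
def required_terms(term_count, percent):
    """Number of query terms that must match for a given percentage."""
    # Integer arithmetic rounds down, e.g. 75% of 3 terms is 2.25 -> 2.
    needed = (term_count * percent) // 100
    # Assumption: a match query always needs at least one matching term.
    return max(1, needed)


if __name__ == "__main__":
    for count in (2, 3, 4):
        print(count, "terms at 75% ->", required_terms(count, 75), "required")
```

Under these assumptions, "cancer research" with 75% behaves like the default OR query, because 75% of two terms rounds down to one.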
The simplest multi-field query to deal with is the one where we can map search terms to specific fields.
{
"query": {
"bool": {
"should": [
{ "match": { "title": "War and Peace" }},
{ "match": { "author": "Leo Tolstoy" }}
]
}
}
}
The bool query takes a more-matches-is-better approach, so the score from each match clause will be added together to provide the final score for each document. Queries at the same level have the same weight.
{
"query": {
"bool": {
"should": [
{ "match": { "title": "War and Peace" }},
{ "match": { "author": "Leo Tolstoy" }},
{ "bool": {
"should": [
{ "match": { "translator": "Constance Garnett" }},
{ "match": { "translator": "Louise Maude" }}
]
}}
]
}
}
}
The above query also queries for specific translators, but because it's on a lower level than the title and author queries, it doesn't contribute as much to the overall score of documents.
To further increase the importance of the title and author queries, we can boost their scores. Boost levels between 1 and 10 are reasonable. Beyond that, higher values have little additional effect.
{
"query": {
"bool": {
"should": [
{ "match": {
"title": {
"query": "War and Peace",
"boost": 2
}}},
{ "match": {
"author": {
"query": "Leo Tolstoy",
"boost": 2
}}},
{ "bool": {
"should": [
{ "match": { "translator": "Constance Garnett" }},
{ "match": { "translator": "Louise Maude" }}
]
}}
]
}
}
}
This strategy is best when the query is likely to be found in a single field.
When searching for words that represent a concept, such as "cancer research," the words mean more together than they do individually. Documents should have as many words in the query in the SAME field and the score should come from the best matching field.
{
"query": {
"bool": {
"should": [
{ "match": { "title": "Brown fox" }},
{ "match": { "body": "Brown fox" }}
]
}
}
}
The bool query calculates the score like this:
- Run both queries in the should clause
- Add scores together
- Multiply the total by the number of matching clauses
- Divide by the total number of clauses (2)
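The arithmetic of these steps can be worked through in a few lines of Python. This is a simplified model, not Elasticsearch's actual scoring code: it assumes the classic coordination factor, where the summed score is also scaled by the number of matching clauses before dividing by the total number of clauses. The function name and the example scores are invented.

```python
def bool_should_score(clause_scores):
    """Combine should-clause scores; None marks a clause that did not match."""
    matching = [score for score in clause_scores if score is not None]
    if not matching:
        return 0.0
    # Sum the matching scores, scale by how many clauses matched,
    # then divide by the total number of clauses.
    return sum(matching) * len(matching) / len(clause_scores)


if __name__ == "__main__":
    # Hypothetical document A: "brown" appears in both title and body.
    print(bool_should_score([0.8, 0.8]))
    # Hypothetical document B: "brown fox" appears only in the body.
    print(bool_should_score([None, 1.5]))
```

With these made-up numbers, document A outscores document B even though only B contains both words in one field.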
This has the potential to give irrelevant results because if "brown" or "fox" is NOT found in one of the fields, it seriously affects the relevance results. The "title" and "body" fields are competing with each other.
What if we used the score from the best-matching field as the overall score for the query? This would give preference to a single field that contains both of the words we are looking for, rather than preference to the same word repeated in different fields.
The dis_max query returns documents that match any of these queries and returns the score of the best matching query.
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "Brown fox" }},
{ "match": { "body": "Brown fox" }}
]
}
}
}
Sometimes you may need to employ a tie breaking strategy if one word is found in each field -- this would result in every document having fields with equal scores.
Adding a tie breaker allows you to take the score from the other matching clauses into account. The score of each other matching clause is multiplied by the tie breaker (0.3 here) and added to the score of the best matching clause. With a tie breaker, all matching clauses count, but the best matching clause counts the most. Keep the tie breaker between 0.1 and 0.4.
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "Quick pets" }},
{ "match": { "body": "Quick pets" }}
],
"tie_breaker": 0.3
}
}
}
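The effect of tie_breaker on the final score can be sketched as plain arithmetic. This is a simplified model with invented scores, and the function name `dis_max_score` is hypothetical.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the other matching scores."""
    matching = [score for score in clause_scores if score is not None]
    if not matching:
        return 0.0
    best = max(matching)
    # The best clause counts in full; every other matching clause is scaled down.
    others = sum(matching) - best
    return best + tie_breaker * others


if __name__ == "__main__":
    scores = [1.0, 0.5]  # made-up title and body scores
    print(dis_max_score(scores))        # pure dis_max: only the best clause counts
    print(dis_max_score(scores, 0.3))   # the body score contributes 30%
    print(dis_max_score(scores, 1.0))   # behaves like a plain sum of the clauses
```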
You can use the multi_match query to write the same query more concisely.
{
"multi_match": {
"query": "Quick brown fox",
"type": "best_fields",
"fields": [ "title", "body" ],
"tie_breaker": 0.3,
"minimum_should_match": "30%"
}
}
{
"multi_match": {
"query": "Quick brown fox",
"type": "best_fields",
"fields": [ "*_title", "body" ],
"tie_breaker": 0.3,
"minimum_should_match": "30%"
}
}
{
"multi_match": {
"query": "Quick brown fox",
"type": "best_fields",
"fields": [ "*_title", "body^2" ],
"tie_breaker": 0.3,
"minimum_should_match": "30%"
}
}
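To make the link between multi_match with type best_fields and dis_max explicit, the sketch below expands one into the other. It is an illustration only: the function name is hypothetical, and wildcard field names such as `*_title` are passed through unchanged because expanding them requires the index mapping.

```python
import json


def multi_match_as_dis_max(query, fields, tie_breaker=0.0, minimum_should_match=None):
    """Expand a best_fields multi_match into the equivalent dis_max body."""
    queries = []
    for field in fields:
        # "body^2" means: field "body" with a boost of 2.
        name, _, boost = field.partition("^")
        clause = {"query": query}
        if minimum_should_match is not None:
            clause["minimum_should_match"] = minimum_should_match
        if boost:
            clause["boost"] = float(boost)
        queries.append({"match": {name: clause}})
    return {"dis_max": {"queries": queries, "tie_breaker": tie_breaker}}


if __name__ == "__main__":
    body = multi_match_as_dis_max("Quick brown fox", ["title", "body^2"], 0.3, "30%")
    print(json.dumps(body, indent=2))
```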
The most_fields type is designed to find the most fields matching any words, rather than the most matching words across all fields. You cannot use the minimum_should_match parameter to reduce the long tail of less relevant results. Term frequencies differ between fields and can interfere with each other to produce badly ordered results. The approach is field-centric instead of term-centric.
A common technique for fine-tuning relevance is to index the same data into multiple fields, each with their own analysis chain.
The main field may contain words in the stemmed form and synonyms. It is used to match as many documents as possible.
The same text could then be indexed into other fields to provide more precise matching. One field may contain the unstemmed version, another removes accent marks, and another may use shingles to provide information about word proximity.
These other fields act as signals to increase the relevance score of each matching document. The more fields that match the better.
"mappings": {
"my_type": {
"properties": {
"title": {
"type": "string",
"analyzer": "english",
"fields": {
"std": {
"type": "string",
"analyzer": "standard"
}
}
}
}
}
}
The title field is stemmed by the english analyzer, while the title.std uses the standard analyzer, so it is not stemmed.
{
"query": {
"multi_match": {
"query": "jumping rabbits",
"type": "most_fields",
"fields": [ "title", "title.std" ]
}
}
}
The query checks against both the stemmed and unstemmed fields and combines the scores from all matching fields. So if a document contains the exact words from the query, it will rank higher than a document that matches only the stemmed versions.
The cross_fields strategy is best if the query is likely to be found across multiple fields (for example, an address). It takes a term-centric approach.
For some entities, the identifying information is spread across multiple fields, each of which contains just part of the whole (first name field, last name field). In this case, we want to find as many words as possible in any of the listed fields.
{
"query": {
"multi_match": {
"query": "peter smith",
"type": "cross_fields",
"operator": "and",
"fields": [ "first_name", "last_name" ]
}
}
}
The _all field indexes the values from all other fields as one big string. You can create custom _all fields to get the same effect with a chosen set of fields. For example, combining a first_name and last_name field into one field:
{
"mappings": {
"person": {
"properties": {
"first_name": {
"type": "string",
"copy_to": "full_name"
},
"last_name": {
"type": "string",
"copy_to": "full_name"
},
"full_name": {
"type": "string"
}
}
}
}
}
Phrase queries give higher relevance to documents that contain the query words close together, but they require all terms to be present.
{
"query": {
"match_phrase": {
"title": "quick brown fox"
}
}
}
OR
"match": {
"title": {
"query": "quick brown fox",
"type": "phrase"
}
}
What match_phrase does:
- Analyzes the query string to produce a list of terms
- Searches for all the terms, but only keeps documents which contain all of the search terms in the same positions, relative to each other.
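The two steps above can be imitated in plain Python to show what "same positions, relative to each other" means. This is a toy model under loud assumptions: analysis is reduced to lowercasing and splitting on whitespace, there is no slop, and the function name `matches_phrase` is made up.

```python
def matches_phrase(text, phrase):
    """True if the phrase terms appear consecutively and in order in the text."""
    # Stand-in for analysis: lowercase and split on whitespace.
    tokens = text.lower().split()
    terms = phrase.lower().split()
    if not terms:
        return False
    width = len(terms)
    # Slide a window over the token positions and compare it to the terms.
    return any(tokens[i:i + width] == terms for i in range(len(tokens) - width + 1))


if __name__ == "__main__":
    print(matches_phrase("The quick brown fox jumps", "quick brown fox"))
    print(matches_phrase("The quick fox is brown", "quick brown fox"))
```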
To be less strict about positioning (if we want "quick fox" to match), we can introduce the "slop" parameter, which tells the query how far apart terms are allowed to be while still considering it a match.
{
"query": {
"match_phrase": {
"title": {
"query": "quick brown fox",
"slop": 1
}
}
}
}
If you give a higher slop value, say 50, the query will still return documents where the words aren't close together, but it will give a higher score to documents where the words are closer together.
{
"properties": {
"names": {
"type": "string",
"position_offset_gap": 100
}
}
}
Since proximity queries exclude results that do not contain all terms, we can implement the proximity query as a signal: one of potentially many queries, each of which contributes to the overall score for each document (most fields).
{
"query": {
"bool": {
"must": {
"match": {
"title": {
"query": "quick brown fox",
"minimum_should_match": "30%"
}
}
},
"should": {
"match_phrase": {
"title": {
"query": "quick brown fox",
"slop": 50
}
}
}
}
}
}
This query uses the match_phrase to help with relevance, while the match query is used to determine which documents are returned.
Phrase and proximity queries are expensive. Some ways to help with query time:
A simple match query will already have ranked documents which contain all search terms near the top of the list. Really, we just want to rerank the top results to give an extra relevance bump to documents that also match the phrase query. Taking the above query, let's just rescore the top results:
{
"query": {
"match": {
"title": {
"query": "quick brown fox",
"minimum_should_match": "30%"
}
}
},
"rescore": {
"window_size": 50,
"query": {
"rescore_query": {
"match_phrase": {
"title": {
"query": "quick brown fox",
"slop": 50
}
}
}
}
}
}
window_size is the number of top results to rescore, per shard.
Shingles group adjacent words together (2, 3, 4, or more words) to preserve the meaning between words. They can be a good alternative to match_phrase queries because they are much quicker.
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"my_shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 2,
"output_unigrams": false
}
},
"analyzer": {
"my_shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"my_shingle_filter"
]
}
}
}
}
}
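To see roughly which tokens an analyzer configured like the one above would emit, the sketch below builds two-word shingles in Python. It is an approximation: the standard tokenizer is stood in for by a simple `\w+` regular expression, and the function name `bigram_shingles` is hypothetical.

```python
import re


def bigram_shingles(text):
    """Lowercase the text, split it into words, and join each adjacent pair."""
    # Rough stand-in for the standard tokenizer plus the lowercase filter.
    tokens = re.findall(r"\w+", text.lower())
    # min and max shingle size of 2, no unigrams: adjacent pairs only.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


if __name__ == "__main__":
    print(bigram_shingles("The hungry alligator ate Sue"))
```

A document whose shingles overlap with those of the query shares word order with it, which is the signal a phrase query would otherwise provide.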
{
"my_type": {
"properties": {
"title": {
"type": "string",
"fields": {
"shingles": {
"type": "string",
"analyzer": "my_shingle_analyzer"
}
}
}
}
}
}
{
"query": {
"bool": {
"must": {
"match": {
"title": "the hungry alligator ate sue"
}
},
"should": {
"match": {
"title.shingles": "the hungry alligator ate sue"
}
}
}
}
}