@polyfractal
Created January 15, 2013 21:56
#create a test index with shingle mapping
curl -XPUT localhost:9200/test -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_shingle": {
            "tokenizer": "standard",
            "filter": ["standard", "lowercase", "filter_stop", "filter_shingle"]
          }
        },
        "filter": {
          "filter_shingle": {
            "type": "shingle",
            "max_shingle_size": 5,
            "min_shingle_size": 2,
            "output_unigrams": "true"
          },
          "filter_stop": {
            "type": "stop",
            "enable_position_increments": "false"
          }
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "title": {
          "search_analyzer": "analyzer_shingle",
          "index_analyzer": "analyzer_shingle",
          "type": "string"
        }
      }
    }
  }
}'
#Add some docs to the index
curl -XPOST localhost:9200/test/product/1 -d '{"title" : "Sample product title for shingles"}'
curl -XPOST localhost:9200/test/product/2 -d '{"title" : "Another title"}'
curl -XPOST localhost:9200/test/product/3 -d '{"title" : "Shingles is a viral disease"}'
#Analyze API to check out shingling
curl -XGET 'localhost:9200/test/_analyze?analyzer=analyzer_shingle&pretty' -d 'Test text to see shingles' | grep token
#Sample search
curl -XGET 'localhost:9200/test/product/_search?q=title:product+title&pretty'
#this one won't return anything, because of the stop filter
curl -XGET 'localhost:9200/test/product/_search?q=title:is+a&pretty'
#while this one will, because we emit unigrams
curl -XGET 'localhost:9200/test/product/_search?q=title:is+a+viral&pretty'
@tapit69

tapit69 commented Mar 5, 2014

Another great post that has helped me so much. Thank you!

@jarib

jarib commented Nov 8, 2014

Any tips for how to deal with this?

IndexCreationException[[hdo-transcripts] failed to create index]; nested: ElasticsearchIllegalArgumentException[[enable_position_increments: false] is not supported anymore as of Lucene 4.4 as it can create broken token streams. Please fix your analysis chain or use an older compatibility version (<=4.3) but beware that it might cause unexpected behavior.];

@lu3do

lu3do commented Nov 13, 2014

Yeah, I'm running into the same problem as @jarib with
ES v1.4.0 / Lucene v4.10.2.

@sanjay1688

{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_shingle": {
            "tokenizer": "standard",
            "filter": [
              "standard",
              "lowercase",
              "stop",
              "filter_shingle"
            ]
          }
        },
        "filter": {
          "filter_shingle": {
            "type": "shingle",
            "max_shingle_size": 5,
            "min_shingle_size": 2,
            "output_unigrams": "true"
          }
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "title": {
          "search_analyzer": "analyzer_shingle",
          "index_analyzer": "analyzer_shingle",
          "type": "string"
        }
      }
    }
  }
}

@yehosef

yehosef commented Jan 22, 2015

Just take out the enable_position_increments reference. It works fine without it for the purposes of testing.
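
Concretely, the stop filter definition without that option would look like this (a minimal sketch; with no stopwords specified, the filter falls back to its default English stopword list):

```json
"filter_stop": {
    "type": "stop"
}
```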

@plcstevens

Corrected for the latest version of Elasticsearch: Version: 1.6.0, Build: cdd3ac4/2015-06-09T13:36:34Z

curl -XPUT 127.0.0.1:9200/test -d '{
   "settings":{
      "index":{
         "analysis":{
            "analyzer":{
               "analyzer_shingle":{
                  "tokenizer":"standard",
                  "filter":["standard", "lowercase", "filter_stop", "filter_shingle"]
               }
            },
            "filter":{
               "filter_shingle":{
                  "type":"shingle",
                  "max_shingle_size":5,
                  "min_shingle_size":2,
                  "output_unigrams":"true"
               },
               "filter_stop":{
                  "type":"stop"
               }
            }
         }
      }
   },
   "mappings":{
      "product":{
         "properties":{
            "title":{
               "search_analyzer":"analyzer_shingle",
               "index_analyzer":"analyzer_shingle",
               "type":"string"
            }
         }
      }
   }
}'

@krinkere

krinkere commented Aug 3, 2015

Good stuff! Thank you

@pranipat

pranipat commented Jul 21, 2016

curl -XGET 'localhost:9200/test/product/_search?q=title:product+title&pretty'

Here I just want to get only the top 10 records. How do I achieve that? Also, can I get only results whose score is greater than 0.9? Are those kinds of conditional searches possible? And what if the title we are searching is very big, say 400 words?

@chjose

chjose commented Jul 25, 2016

Hi Pranipat, you can limit the number of results using the 'size' parameter in your query. The minimum required score can be provided in the 'min_score' field.

I didn't get your last question. Elasticsearch is very fast, but if your titles are around 400 words, your index size will increase considerably. If you could detail the analyzers and filters you use, I could give you the exact reason you are looking for.

{
    "from" : 0,
    "size" : 10,
    "min_score": 0.9,
    "query": {
    .....
    }
}

Thank you!!!!
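
For example, combining those parameters with a match query on the title field from this gist might look like the following (a sketch only; it assumes the test index created above and a pre-5.x ES that accepts a body on GET without a Content-Type header):

```sh
curl -XGET 'localhost:9200/test/product/_search?pretty' -d '{
  "from": 0,
  "size": 10,
  "min_score": 0.9,
  "query": {
    "match": { "title": "product title" }
  }
}'
```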

@ctindel

ctindel commented Oct 17, 2016

index_analyzer was removed, so you can change it to "analyzer", and in this case, since it's the same as the search_analyzer, you can also drop the search_analyzer from the mapping. It also needs to be changed from the string to the text mapping type.

      "properties" : {
        "title" : { 
          "type": "text", 
          "analyzer":"analyzer_shingle"
        }
      }

@omercadocoss

omercadocoss commented Oct 19, 2017

Working gist for Elasticsearch v5.6:

curl -XPUT localhost:9200/test -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_shingle": {
            "tokenizer": "standard",
            "filter": [
              "standard",
              "lowercase",
              "filter_stop",
              "filter_shingle"
            ]
          }
        },
        "filter": {
          "filter_shingle": {
            "type": "shingle",
            "max_shingle_size": 5,
            "min_shingle_size": 2,
            "output_unigrams": "true"
          },
          "filter_stop": {
            "type": "stop"
          }
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "title": {
          "search_analyzer": "analyzer_shingle",
          "analyzer": "analyzer_shingle",
          "type": "text"
        }
      }
    }
  }
}'

@renatalucia

renatalucia commented Jan 9, 2018

I use Elasticsearch 6.1 and I have an issue when querying the shingle field with fuzziness.

The following query works as expected:

GET test/product/_search 
{
  "query": {
    "match": {
      "title": {
        "query": "Shingles is",
        "fuzziness": "AUTO"
      }
    }
  }
}

I would expect that replacing "Shingles is" with "Shingle is" would return the same result due to the fuzziness; however, the result is empty. Any help?

@pcsanwald

pcsanwald commented Jul 2, 2018

Version of the gist updated for 6.3:

curl -XPUT localhost:9200/test -H 'Content-Type: application/json' -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_shingle": {
            "tokenizer": "standard",
            "filter": [
              "standard",
              "lowercase",
              "filter_stop",
              "filter_shingle"
            ]
          }
        },
        "filter": {
          "filter_shingle": {
            "type": "shingle",
            "max_shingle_size": 5,
            "min_shingle_size": 2,
            "output_unigrams": "true"
          },
          "filter_stop": {
            "type": "stop"
          }
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "title": {
          "search_analyzer": "analyzer_shingle",
          "analyzer": "analyzer_shingle",
          "type": "text"
        }
      }
    }
  }
}'

#Add some docs to the index
curl -H 'Content-Type: application/json' -XPOST localhost:9200/test/product/1 -d '{"title" : "Sample product title for shingles"}'
curl -H 'Content-Type: application/json' -XPOST localhost:9200/test/product/2 -d '{"title" : "Another title"}'
curl -H 'Content-Type: application/json' -XPOST localhost:9200/test/product/3 -d '{"title" : "Shingles is a viral disease"}'

#Analyze API to check out shingling
curl -H 'Content-Type: application/json' -XGET 'localhost:9200/test/_analyze?pretty' -d '{ "analyzer": "analyzer_shingle", "text":"Test text to see shingles"}'

#Sample search
curl -XGET 'localhost:9200/test/product/_search?q=title:product+title&pretty'

#this one won't return anything, because of the stop filter
curl -XGET 'localhost:9200/test/product/_search?q=title:is+a&pretty'

#while this one will, because we emit unigrams
curl -XGET 'localhost:9200/test/product/_search?q=title:is+a+viral&pretty'

@kamlesh28

Hi, I am trying to create an index with a custom analyzer, as given below. When we push data to this index using a spark-submit job, the job fails with a nodes-not-available exception:

Index mapping:

{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_shingle": {
            "tokenizer": "standard",
            "filter": [
              "english_possessive_stemmer",
              "lowercase",
              "english_stop",
              "filter_stop",
              "english_stemmer",
              "filter_shingle"
            ]
          }
        },
        "filter": {
          "filter_shingle": {
            "type": "shingle",
            "max_shingle_size": 4,
            "min_shingle_size": 2,
            "output_unigrams": "true"
          },
          "english_stemmer": {
            "type": "stemmer",
            "language": "english"
          },
          "english_possessive_stemmer": {
            "type": "stemmer",
            "language": "possessive_english"
          },
          "english_stop": {
            "type": "stop",
            "stopwords": "english"
          },
          "filter_stop": {
            "type": "stop",
            "stopwords": [
              "it", "its", "itself", "they", "them", "their", "theirs", "themselves",
              "what", "which", "who", "whom", "this", "that", "these", "those",
              "am", "is", "are", "was", "were", "be", "been", "being",
              "have", "has", "had", "having", "do", "does", "did", "doing",
              "a", "an", "the", "and", "but", "if", "or", "because", "as",
              "until", "while", "of", "at", "by", "for", "with", "about",
              "between", "into", "through", "during", "before", "after",
              "above", "below", "to", "from", "up", "down", "in", "out",
              "on", "off", "over", "under", "again", "then", "once",
              "here", "there", "when", "where", "why", "how",
              "all", "any", "both", "each", "few", "more", "most", "other",
              "some", "such", "no", "nor", "not", "only", "same", "so",
              "than", "too", "very", "s", "t", "can", "will", "just",
              "don", "should", "now",
              "apparatus", "embodiments", "technique", "operation", "operations"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "-indexName1": {
      "properties": {
        "cpc": {
          "type": "text",
          "analyzer": "standard"
        },
        "definition": {
          "type": "text",
          "search_analyzer": "analyzer_shingle",
          "analyzer": "analyzer_shingle"
        }
      }
    }
  }
}

Exception:
ERROR Executor: Exception in task 8.0 in stage 0.0 (TID 2)
org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[]]
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:149)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:461)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:445)
at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:186)
at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:222)
at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:244)
at org.elasticsearch.hadoop.rest.RestRepository.doWriteToIndex(RestRepository.java:198)
at org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:161)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:67)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:107)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:107)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

@eugeneRover

ES 7.5.2
1st error:

{"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}

and 2nd, after adding -H 'Content-Type: application/json' :

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"enable_position_increments is not supported anymore. Please fix your analysis chain"}],"type":"illegal_argument_exception","reason":"enable_position_increments is not supported anymore. Please fix your analysis chain"},"status":400}

@kapilraghuwanshi

This one works with the latest ES version 7:

PUT /moviesdb
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "analysis": {
        "analyzer": {
          "analyzer_shingle": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "filter_stop",
              "filter_shingle"
            ]
          }
        },
        "filter": {
          "filter_shingle": {
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 4,
            "output_unigrams": "true"
          },
          "filter_stop": {
            "type": "stop"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "analyzer": "analyzer_shingle",
        "type": "text"
      }
    }
  }
}

You can simply copy it; I went through 3-4 posts to get rid of all my errors.

@lancewalton

I have the same question as renatalucia. Can we do fuzzy searches with this? How?
