@lgueye
Created February 7, 2012 14:45
Elasticsearch: dealing with case and accents
# delete index (will print an error if 'my_index' doesn't exist, you can safely ignore it)
curl -XDELETE 'http://localhost:9200/my_index'
# create index with its settings
curl -XPOST 'http://localhost:9200/my_index' -d '{
"index.analysis.analyzer.default.type":"custom",
"index.analysis.analyzer.default.tokenizer":"standard",
"index.analysis.analyzer.default.filter.0":"lowercase",
"index.analysis.analyzer.default.filter.1":"asciifolding"
}'
# check index analyzer behaviour
# the lowercase and asciifolding filters are applied at index time,
# so 2 tokens are produced: 'ingenieur' and 'java'
# (--data-urlencode percent-encodes the accented character in the query string)
curl -G 'localhost:9200/my_index/_analyze' --data-urlencode 'text=Ingénieur Java'
# add data
curl -XPUT 'http://localhost:9200/my_index/my_type/1' -d '{"reference":"ADV-REF-00000001", "title":"Ingénieur Java"}'
curl -XPUT 'http://localhost:9200/my_index/my_type/2' -d '{"reference":"ADV-REF-00000002", "title":"Conservateur documentaliste"}'
curl -XPUT 'http://localhost:9200/my_index/my_type/3' -d '{"reference":"ADV-REF-00000003", "title":"Technicien qualité validation H/F"}'
curl -XPUT 'http://localhost:9200/my_index/my_type/4' -d '{"reference":"ADV-REF-00000004", "title":"Valet de chambre"}'
curl -XPUT 'http://localhost:9200/my_index/my_type/5' -d '{"reference":"ADV-REF-00000005", "title":"Ingénieur PHP"}'
# search data
# the queries below should all return the same results (2 hits)
# the URLs are quoted so the shell does not expand '*', and --data-urlencode
# percent-encodes the accented characters (é is sent as %C3%A9)
curl -G 'http://localhost:9200/my_index/my_type/_search' --data-urlencode 'q=Ingénieur*'
curl -G 'http://localhost:9200/my_index/my_type/_search' --data-urlencode 'q=ingénieur*'
curl -G 'http://localhost:9200/my_index/my_type/_search' --data-urlencode 'q=ingenieur*'
curl -G 'http://localhost:9200/my_index/my_type/_search' --data-urlencode 'q=Ingén*'
curl -G 'http://localhost:9200/my_index/my_type/_search' --data-urlencode 'q=ingén*'
curl -G 'http://localhost:9200/my_index/my_type/_search' --data-urlencode 'q=ingen*'
@klerisson

Did you manage to sort out this problem? I'm facing the same issue: no hits at all.

@lgueye
Author

lgueye commented Apr 30, 2012

Hi,

Yes, the key is to percent-encode the accented characters. Instead of "curl http://localhost:9200/my_index/my_type/_search?q=Ingén*" use "curl http://localhost:9200/my_index/my_type/_search?q=Ing%C3%A9n*"
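
To avoid encoding by hand, the query value can be percent-encoded in the shell before it goes into the URL. The following is only a minimal sketch and not part of the original gist: `urlencode` is a hypothetical helper name, and it assumes the argument is UTF-8 text. It encodes every byte outside the RFC 3986 unreserved set, so `*` is sent as `%2A`, which should be decoded back to `*` on the server side.

```shell
# urlencode STRING: percent-encode STRING byte by byte (hypothetical helper).
urlencode() {
  # od dumps the argument as hex bytes; tr puts one byte per line.
  printf '%s' "$1" | od -An -v -tx1 | tr -s ' ' '\n' | while read -r hex; do
    case "$hex" in
      "") ;;  # skip the empty lines left by od's padding
      2d|2e|5f|7e|3[0-9]|4[1-9a-f]|5[0-9a]|6[1-9a-f]|7[0-9a])
        # unreserved byte (- . _ ~ 0-9 A-Z a-z): emit it unchanged
        printf "\\$(printf '%03o' "0x$hex")" ;;
      *)
        # any other byte: emit %XX with upper-case hex digits
        printf '%%%s' "$(printf '%s' "$hex" | tr 'abcdef' 'ABCDEF')" ;;
    esac
  done
  echo
}

urlencode 'Ingén*'   # prints Ing%C3%A9n%2A
```

The result can be spliced into the request, for example `curl "http://localhost:9200/my_index/my_type/_search?q=$(urlencode 'Ingén*')"`.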

Cheers

@klederson

The problem is the opposite: I also need to get results for "Ingen". I tried it like this:

curl -XGET 'http://172.16.181.128:9200/sandbox/tests/_search' -d '{
  "query" : {
    "text" : {
      "user" : {
        "query" : "ingen",
        "type" : "boolean",
        "operator" : "AND",
        "fuzziness" : "0.5"
      }
    }
  }
}'

And it works, but because the match is approximate, it stops working when the words differ too much, so this does not solve the accent problem in general. Does anyone know how to simply index while ignoring accents?

@alexol91

You can try the asciifolding token filter:

analysis-asciifolding

Or replace the accented characters with the ? wildcard. For example, to find "camión":
{ "query": { "query_string": { "analyze_wildcard": true, "query": "cami?n" } } }
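
The wildcard pattern can also be derived from the search term instead of being typed by hand. A rough sketch, assuming UTF-8 input (`accent_wildcard` is a made-up helper name, not an Elasticsearch feature): every non-ASCII character is replaced with a single `?`, relying on the fact that a multi-byte UTF-8 character is one lead byte (0xC0 to 0xFF) followed by continuation bytes (0x80 to 0xBF).

```shell
# accent_wildcard TERM: replace each non-ASCII character of TERM with "?"
# (hypothetical helper; assumes TERM is valid UTF-8).
accent_wildcard() {
  printf '%s\n' "$1" |
    LC_ALL=C tr -d '\200-\277' |   # drop UTF-8 continuation bytes
    LC_ALL=C tr '\300-\377' '?'    # turn each remaining lead byte into one "?"
}

accent_wildcard 'camión'   # prints cami?n
```

The output would go into the "query" field of the query_string request above. Note that this matches any character in that position, not only accented variants of the same letter.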
