@lcpriest
Created September 6, 2016 14:04
client = ElasticWrapper::Session.new.client
# term filter with a lowercase value: matches the indexed token
body = { query: { bool: { filter: { term: { name: 'lachlan' } } } } }
client.search(index: 'test', type: 'contacts', body: body)
=>
[{
"_index"=>"test",
"_type"=>"contact",
"_id"=>"1",
"_score"=>1.0,
"_source"=>{
"id"=>2157,
"name"=>"Lachlan Priest"
}
}]
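# same filter with a capitalized value: no indexed token matches
body = { query: { bool: { filter: { term: { name: 'Lachlan' } } } } }
client.search(index: 'test', type: 'contacts', body: body)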
=>
[]
@lcpriest
Author

lcpriest commented Sep 6, 2016

From what I can see, when using the term syntax, the query input isn't analyzed at all, so there is no fuzzy, full-text style matching (this is good). Unfortunately, when this type was indexed, I believe the analyzer tokenized my string by splitting on whitespace and downcasing.

This would normally be fine if fuzzy find was on. As I am being quite explicit in my request, it is actually detrimental: I can't find my record unless my input follows the tokenizer rules.

I don't want to have to sanitize all inputs to follow the same rules as the analyzer, as that would allow things like 'Priest Lachlan' to find 'Lachlan Priest', and that doesn't fit the use case I want.
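
The behaviour described above can be reproduced with a toy model. This is only a sketch in plain Ruby, not Elasticsearch itself: it assumes the analyzer does nothing more than split on whitespace and downcase, and the helper names (analyze, build_index, term_lookup, match_lookup) are made up for illustration.

```ruby
# Toy model of an analyzed string field. NOT Elasticsearch itself.
# Assumption: the analyzer splits on whitespace and downcases each token.
def analyze(text)
  text.split(/\s+/).reject(&:empty?).map(&:downcase)
end

# Build an inverted index: token => ids of the documents containing it.
def build_index(docs)
  index = Hash.new { |hash, key| hash[key] = [] }
  docs.each do |id, name|
    analyze(name).each { |token| index[token] << id }
  end
  index
end

# term-style lookup: the input is compared to the stored tokens as-is.
def term_lookup(index, term)
  index.fetch(term, [])
end

# match-style lookup: the input is analyzed first, then any token may hit.
def match_lookup(index, text)
  analyze(text).flat_map { |token| index.fetch(token, []) }.uniq
end

INDEX = build_index({ 1 => 'Lachlan Priest' })

p term_lookup(INDEX, 'lachlan')          # => [1], the exact token exists
p term_lookup(INDEX, 'Lachlan')          # => [], no token has a capital L
p match_lookup(INDEX, 'Priest Lachlan')  # => [1], word order is ignored
```

The last call is the case I want to avoid: once the input goes through the same analysis as the document, 'Priest Lachlan' and 'Lachlan Priest' become indistinguishable.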

My questions are:
a) What exactly do the tokenizers provide when not using full-text search?
b) Is it a bad idea to disable them?
c) Is it possible to do so without having to manually create the mappings?

@fvbock

fvbock commented Sep 6, 2016

a) not sure whether you are talking about analyzers or tokenizers. tokenizers only make sense if you want to find documents by a partial token of a field. if you don't, you just have a map lookup using string keys (and then maybe something like redis might be a better fit, but i am sure there are other reasons you are on es...)

analyzers define how your stuff gets indexed (eg. HELP, help, Help would all be indexed as help if you have an analyzer that uses the "lowercase" filter). the same thing gets applied to the query text for analyzed queries such as match (but not term), and then you get the matches as expected.

NB: this is an example from an ES 1.3 mapping, so there might be some changes:

          "code_query_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "keyword_synonyms",
              "cjk_width",
              "icu_folding",
              "lowercase"
            ]
          }

in the mapping then you define to use that analyzer for a specific field

        "code" : {
          "type" : "string",
          "index": "analyzed",
          "analyzer": "code_query_analyzer"
        }

and then in a query:

                {
                  "prefix": {
                    "code": {
                      "prefix": "%s",
                      "analyzer": "code_query_analyzer",
                      "boost": 2.5
                    }
                  }
                },

that %s is part of a format string, not es syntax.

b) kind of the same as a): if you only want exact-match string (phrase) lookups you might want to look somewhere else. but phrase matching might get you close enough.

c) no. which analyzer a field uses is defined in the mapping... and generally speaking, from my experience dynamic mappings seem like a bad idea in most cases.

hope that answers some of your questions.

here is an old but good blogpost about multilingual indexing that also talks about analyzers, etc. might help to understand: https://gibrown.com/2013/05/01/three-principles-for-multilingal-indexing-in-elasticsearch/

cheers
_f

@lcpriest
Author

lcpriest commented Sep 8, 2016

Thanks! This was essentially what I was looking for. I've ended up using the keyword tokenizer (though I might just drop this, as it's essentially the same as not_analyzed) and a lowercase filter to allow for case-insensitive search.
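
For anyone landing here later, a rough sketch of what that setup might look like as an index body. Treat it as an illustration under assumptions rather than my exact config: it uses the ES 2.x-era "string" field type mentioned earlier in the thread, the analyzer name exact_lowercase and the helper exact_name_query are made up, and the type name contacts is taken from the search call at the top.

```ruby
require 'json'

# Index body sketch (ES 2.x-era syntax, matching the thread's "string" type).
# `exact_lowercase` is a hypothetical analyzer name chosen for this example.
INDEX_BODY = {
  settings: {
    analysis: {
      analyzer: {
        exact_lowercase: {
          type: 'custom',
          tokenizer: 'keyword',  # the whole value stays a single token
          filter: ['lowercase']  # that token is downcased at index time
        }
      }
    }
  },
  mappings: {
    contacts: {
      properties: {
        name: { type: 'string', analyzer: 'exact_lowercase' }
      }
    }
  }
}.freeze

# term queries are not analyzed, so the caller downcases the input itself.
def exact_name_query(name)
  { query: { bool: { filter: { term: { name: name.downcase } } } } }
end

puts JSON.pretty_generate(INDEX_BODY)
puts JSON.generate(exact_name_query('Lachlan Priest'))
```

Assuming the wrapped client is a standard elasticsearch-ruby client, the body would be passed as client.indices.create(index: 'test', body: INDEX_BODY). With this analyzer 'Lachlan Priest' is stored as the single token 'lachlan priest', so 'Priest Lachlan' no longer matches. One caveat: Ruby's downcase and the lowercase filter may not agree on every non-ASCII character, so a match query (which runs the input through the field's analyzer) could be the safer choice there.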

I ended up doing a static mapping for as many fields as I know in advance, and will just look into it again every month or so to see whether any recurring fields show up in mappings that I haven't defined yet.

Thanks for the post - I think the best piece of advice that someone gave was that the ES docs are not really reference docs, they are more like a tutorial. They are much better if you read them like a book and follow the pages; after that everything became a lot clearer.

Again, thank you, this has been really helpful.
