Elasticsearch wildcard search vs fuzzy search
Introduction
Fuzzy search is often confused with wildcard search, so it is important to distinguish the two first. A fuzzy search looks for the specified term and for anything that differs from it by no more than a given number of edits (the fuzziness). For example, in the data set {"Martin", "Sipho", "Daniel", "Pieter", "Peter"}, searching for "Peter" also returns "Pieter" because it is close enough. A wildcard search returns every result that contains the given phrase or fragment. For example, searching for "ter" as a fragment returns {"Pieter", "Peter"}, whilst searching for "Peter" returns only "Peter".
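The difference can be sketched in plain Python, outside Elasticsearch. This is only an illustration of the two ideas on the name list above: `edit_distance`, `fuzzy_search` and `wildcard_search` are hypothetical helper names, the fuzzy side uses a textbook Levenshtein distance, and the wildcard side uses the standard library's `fnmatch` patterns as a stand-in for wildcard syntax.

```python
from fnmatch import fnmatchcase

NAMES = ["Martin", "Sipho", "Daniel", "Pieter", "Peter"]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute (or keep when equal)
        prev = curr
    return prev[-1]


def fuzzy_search(term, names, max_edits=1):
    """Keep names within max_edits of the search term."""
    return [n for n in names if edit_distance(term, n) <= max_edits]


def wildcard_search(pattern, names):
    """Keep names matching a pattern where * stands for any run of characters."""
    return [n for n in names if fnmatchcase(n, pattern)]


print(fuzzy_search("Peter", NAMES))      # "Pieter" is one insert away from "Peter"
print(wildcard_search("*ter*", NAMES))   # anything containing "ter"
print(wildcard_search("Peter", NAMES))   # no wildcard characters: exact match only
```

The fuzzy search tolerates a small spelling difference in the whole term, while the wildcard search demands that the literal fragment is present.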
Fuzzy Search in Elasticsearch
A simple example of a fuzzy query:
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "trackingNumber": {
              "query": "234329",
              "fuzziness": "AUTO"
            }
          }
        }
      ]
    }
  }
}
This results in:
{
  "took": 99, "timed_out": false,
  "_shards": {
    "total": 5, "successful": 5, "failed": 0
  },
  "hits": {
    "total": 4, "max_score": 3.0165722,
    "hits": [
      {
        "_id": "1251", "_score": 3.0165722,
        "_source": { "trackingNumber": "234323", "id": 1251 }
      },
      {
        "_id": "1241", "_score": 2.9706888,
        "_source": { "trackingNumber": "234324", "id": 1241 }
      },
      {
        "_id": "1314", "_score": 2.4132576,
        "_source": { "trackingNumber": "23432999", "id": 1314 }
      },
      {
        "_id": "1289", "_score": 2.4012454,
        "_source": { "trackingNumber": "324324", "id": 1289 }
      }
    ]
  }
}
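To see why these four tracking numbers come back, it may help to compute their edit distance from the query "234329". The sketch below is an approximation in plain Python, not Elasticsearch's own implementation: `osa_distance` and `auto_max_edits` are hypothetical names, the distance counts a swap of two adjacent characters as one edit (as Elasticsearch does by default), and the thresholds are the documented `AUTO` rule of zero edits for terms of 1 or 2 characters, one edit for 3 to 5, and two edits for longer terms.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance where an adjacent transposition counts as a single edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swap of neighbours
    return d[-1][-1]


def auto_max_edits(term: str) -> int:
    """Edits allowed by fuzziness AUTO for a term of this length."""
    if len(term) <= 2:
        return 0
    return 1 if len(term) <= 5 else 2


QUERY = "234329"
HITS = ["234323", "234324", "23432999", "324324"]
distances = {hit: osa_distance(QUERY, hit) for hit in HITS}

print(auto_max_edits(QUERY))  # allowed edits for a 6-character term
print(distances)              # every hit is within that limit
```

The last hit, "324324", only qualifies because swapping the leading "23" to "32" counts as one edit; with substitutions alone it would need three.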
The query can be extended to control how "fuzzy" the search is (fuzziness), how many term variations the query may expand to (max_expansions), and how many leading characters must match exactly (prefix_length). These options can be used to refine the results or to improve the performance of the query.
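As a sketch of those options, the request body can be built as a Python dictionary and serialised to JSON. `fuzzy_match_query` is a hypothetical helper, and the values 1, 3 and 20 are illustrative assumptions rather than recommendations; `fuzziness`, `prefix_length` and `max_expansions` are the match query parameter names.

```python
import json


def fuzzy_match_query(field, text, fuzziness="AUTO", prefix_length=0, max_expansions=50):
    """Build a match query body with the fuzzy tuning options set explicitly."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "fuzziness": fuzziness,            # maximum edits allowed
                    "prefix_length": prefix_length,    # leading characters that must match exactly
                    "max_expansions": max_expansions,  # cap on term variations tried
                }
            }
        }
    }


# Illustrative values only: at most one edit, first three characters fixed.
body = fuzzy_match_query("trackingNumber", "234329",
                         fuzziness=1, prefix_length=3, max_expansions=20)
print(json.dumps(body, indent=2))
```

With a prefix length of 3, a value such as "324324" should no longer match, because its first three characters differ from the query. A longer prefix also tends to make the query cheaper, since fewer candidate terms have to be examined.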
Wildcard search
Here is a simple example:
{
  "query": {
    "bool": {
      "should": [
        {
          "wildcard": {
            "trackingNumber": {
              "value": "*23432*"
            }
          }
        }
      ]
    }
  }
}
This results in:
{
  "took": 55, "timed_out": false,
  "_shards": {
    "total": 5, "successful": 5, "failed": 0
  },
  "hits": {
    "total": 5, "max_score": 1,
    "hits": [
      {
        "_id": "1314", "_score": 1,
        "_source": { "trackingNumber": "23432999", "id": 1314 }
      },
      {
        "_id": "1265", "_score": 1,
        "_source": { "trackingNumber": "234324888", "id": 1265 }
      },
      {
        "_id": "1254", "_score": 1,
        "_source": { "trackingNumber": "23432432", "id": 1254 }
      },
      {
        "_id": "1251", "_score": 1,
        "_source": { "trackingNumber": "234323", "id": 1251 }
      },
      {
        "_id": "1241", "_score": 1,
        "_source": { "trackingNumber": "234324", "id": 1241 }
      }
    ]
  }
}
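The same selection can be reproduced locally as a rough check. The sketch below applies the pattern to the six tracking numbers that appear in the two result sets in this note, using the standard library's `fnmatch` as a stand-in for the wildcard query. That is an assumption for illustration only: `fnmatch` shares the `*` and `?` wildcards but also understands `[...]` character sets, and `wildcard_filter` is a hypothetical name.

```python
from fnmatch import fnmatchcase

# Tracking numbers seen in the fuzzy and wildcard results in this note.
TRACKING_NUMBERS = ["234323", "234324", "23432999", "324324", "234324888", "23432432"]


def wildcard_filter(pattern, values):
    """Keep values matching the pattern; * matches any run of characters, ? exactly one."""
    return [v for v in values if fnmatchcase(v, pattern)]


matches = wildcard_filter("*23432*", TRACKING_NUMBERS)
print(matches)
```

"324324" is dropped here because it never contains the literal sequence "23432", even though the fuzzy query accepted it. "234324888" and "23432432" are kept, even though they are too many edits away from "234329" for the fuzzy query.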
This search is much stricter: a value only matches if it contains the exact character sequence, and every hit gets the same score of 1. It is important to note that it has issues with values that contain whitespace. Either strip the whitespace into a separate field when building the model, or mark the searched field in the mapping as not analysed.
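Both workarounds can be sketched as follows. The field name `trackingNumberCompact` and the helpers `compact` and `build_document` are hypothetical. The mapping assumes Elasticsearch 7 or later, where a `keyword` field is stored as a single unanalysed term and mappings carry no type name. Versions 5 and 6 nest `properties` under a type name, and 2.x used `"index": "not_analyzed"` on a `string` field instead.

```python
def compact(value: str) -> str:
    """Remove all whitespace so a wildcard pattern never has to span a gap."""
    return "".join(value.split())


def build_document(doc_id: int, tracking_number: str) -> dict:
    """Index the original value plus a whitespace-free copy for wildcard searches."""
    return {
        "id": doc_id,
        "trackingNumber": tracking_number,
        "trackingNumberCompact": compact(tracking_number),  # hypothetical extra field
    }


# Assumed ES 7+ mapping: keyword fields are not analysed, so each value is one term.
MAPPING = {
    "mappings": {
        "properties": {
            "trackingNumber": {"type": "keyword"},
            "trackingNumberCompact": {"type": "keyword"},
        }
    }
}

print(build_document(1251, "234 323"))
```

Storing a compact copy keeps the original value intact for display while giving the wildcard query a single unbroken term to match against.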
Conclusion
It is also important to note that, by default, Elasticsearch searches on whole words and requires a 100% match on each word. That does not work well for values such as account numbers. Both fuzzy and wildcard search will return results in that case; which one to use depends on the requirement.
Honourable Mention: Nico Botha
References
https://www.elastic.co/guide/en/elasticsearch/guide/current/fuzzy-match-query.html
https://liliendahl.com/2012/02/13/wildcard-search-versus-fuzzy-search/
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html