- LIKE, ILIKE
- fuzzystrmatch
- trigram
- tsvector/tsquery
- URLs.
- Categories.
- Remove generic, non-brand words such as "cafe" and "pizza".
- LIKE, ILIKE
- Map brand_name to factual_name (uncleaned). 21 matches.
- On the cleaned strings:
  1. Run LIKE with % wildcards from brand name to factual name.
  2. If the brand name is a single word, run a regex match on the factual name only if there is a space before and/or after the word.
- *Problem: regex substring matching will cause more false positives than it is worth. Plus, full text search already does this for us.*
- fuzzystrmatch
- *Result: Costs are 1 for insertion, 3 for deletion, and 5 for substitution. For example, to get from 'dominos' to 'dominos pizza', it is cheaper to insert ' pizza' than to substitute or delete characters. Total matches: 1832.*
- Run levenshtein between the cleaned factual name and the cleaned brand name. Calculate the confidence as levenshtein distance / length of the longer string * 100 (a lower value means a closer match).
- trigram
- Compute the trigram similarity between the cleaned factual name and the cleaned brand name. Record matches with similarity >= 0.4.
- tsvector/tsquery
- *Result: The theory is that this is perhaps the best coverage that we can obtain with guaranteed false positives. 587 matches.*
- Run tsvector on the uncleaned factual name and tsquery on the uncleaned brand names. Store all matches. Also read up on the ranking.
- Start from the table created by step 4 (tsvector/tsquery).
- Compute levenshtein between the matched strings using the 1, 3, 5 cost rule. Ignore the pairs with large scores.
- Use category to match and get a better subset.
- Use URLs to get an even better subset.
- Check the clear-string function.
- Create a new search term table with the new algorithm.
- Create a new column in the factual table with the new algorithm.
- Incorporate all the ideas in the script.
- Compute levenshtein on the table generated by the tsvector/tsquery match.
- Use category mapping.
- Use URL mapping.
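
The fuzzystrmatch and trigram steps above can be sketched outside the database to sanity-check thresholds on a few name pairs. This is a minimal Python sketch, not the actual PostgreSQL implementation: the function names (`weighted_levenshtein`, `distance_score`, `trigram_similarity`) are hypothetical, the brand name is assumed to be the source string and the factual name the target, and the trigram code only approximates pg_trgm (lowercase, alphanumeric words, two leading spaces and one trailing space of padding), so scores may differ slightly from what the database returns.

```python
import re


def weighted_levenshtein(source, target, ins_cost=1, del_cost=3, sub_cost=5):
    """Edit distance from source to target with per-operation costs (1, 3, 5 rule)."""
    # prev is the previous row of the DP table; row 0 is the cost of
    # building target[:j] from an empty string using insertions only.
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_ch in enumerate(source, start=1):
        curr = [i * del_cost]  # delete the first i source characters
        for j, t_ch in enumerate(target, start=1):
            curr.append(min(
                prev[j] + del_cost,                               # delete s_ch
                curr[j - 1] + ins_cost,                           # insert t_ch
                prev[j - 1] + (0 if s_ch == t_ch else sub_cost),  # keep or substitute
            ))
        prev = curr
    return prev[-1]


def distance_score(source, target):
    """Distance / length of the longer string * 100; lower means closer."""
    longer = max(len(source), len(target))
    if longer == 0:
        return 0.0
    return weighted_levenshtein(source, target) / longer * 100


def trigrams(text):
    """Approximate pg_trgm trigram extraction (an assumption, see lead-in)."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        for k in range(len(padded) - 2):
            grams.add(padded[k:k + 3])
    return grams


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams in either string."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    brand, factual = "dominos", "dominos pizza"
    print(weighted_levenshtein(brand, factual))  # six insertions at cost 1
    print(round(distance_score(brand, factual), 2))
    print(trigram_similarity(brand, factual) >= 0.4)
```

Two things worth noting about the 1, 3, 5 costs in this sketch: the distance is asymmetric (going from the longer name to the shorter one pays for deletions at 3 each), and a substitution at cost 5 is never chosen because a deletion plus an insertion costs 4. Because of the weights, the score can also exceed 100.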