geohacker/string.md

## string.md

      
### Strategies


1. LIKE, ILIKE
2. fuzzystrmatch
3. trigram
4. tsvector/tsquery
5. URLs
6. Categories
7. Removing generic, non-brand words such as cafe, pizza, etc.
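
As a rough illustration of the last strategy, the sketch below lowercases a name, strips punctuation, and drops generic words. The function name `clean_name` and the `GENERIC_WORDS` set are hypothetical; only "cafe" and "pizza" come from these notes, and a real word list would presumably be longer.

```python
import re

# Illustrative list only: the notes name "cafe" and "pizza" as examples.
GENERIC_WORDS = {"cafe", "pizza"}

def clean_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop generic non-brand words."""
    # Remove apostrophes first so "Domino's" becomes "dominos", not "domino s".
    lowered = name.lower().replace("'", "")
    # Turn every remaining non-alphanumeric character into a space, then tokenise.
    tokens = re.sub(r"[^a-z0-9\s]", " ", lowered).split()
    kept = [t for t in tokens if t not in GENERIC_WORDS]
    # If every token was generic, keep them all so the name never becomes empty.
    return " ".join(kept or tokens)
```

The fallback on the last line is a design choice, not something the notes specify: it keeps a name such as "Pizza Cafe" from collapsing to an empty string.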

### Ideas


LIKE, ILIKE
Map brand_name to factual_name (uncleaned). 21 matches.
On the cleaned strings:
* *Problem: regex substring matching will cause more false positives than we are bargaining for. Plus, full text search already does this for us.*
1. Run LIKE with `%` wildcards from brand name to factual name.
2. If the brand name is a single word, run a regex match on the factual name only if there is a space before and/or after the word.
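
The two rules above can be sketched in Python as follows. This is a stand-in for the SQL, not the script itself: `like_match` approximates `factual ILIKE '%brand%'` with a case-insensitive substring test, and `word_match` reads "a space before and/or after" as "the brand stands alone as a word", which is one possible interpretation. All function names are hypothetical.

```python
import re

def like_match(brand: str, factual: str) -> bool:
    """Approximate `factual ILIKE '%' || brand || '%'` as a substring test."""
    return brand.lower() in factual.lower()

def word_match(brand: str, factual: str) -> bool:
    """True when the brand is delimited by whitespace or the string ends."""
    pattern = r"(?:^|\s)" + re.escape(brand.lower()) + r"(?:\s|$)"
    return re.search(pattern, factual.lower()) is not None

def brand_matches(brand: str, factual: str) -> bool:
    """Single-word brands need a whole-word hit; longer brands use LIKE."""
    if len(brand.split()) == 1:
        return word_match(brand, factual)
    return like_match(brand, factual)
```

With this reading, "subway" matches "subway sandwiches" but not "subways", which is the kind of false positive the plain substring match would let through.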
fuzzystrmatch

*Result: costs of 1 for insertion, 3 for deletion, and 5 for substitution. For example, to get from 'dominos' to 'dominos pizza', it is better to insert 'pizza' than to substitute or delete characters. Total matches: 1832.*

Run levenshtein between the cleaned factual name and the cleaned brand name. Calculate the confidence as `(levenshtein / length of longer string) * 100`.
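
To make the cost rule concrete, here is a minimal Python sketch of a weighted edit distance with the same argument order as fuzzystrmatch's `levenshtein(source, target, ins_cost, del_cost, sub_cost)`, plus the confidence formula above. It is a plain dynamic-programming stand-in, not the extension's implementation, and the function names are hypothetical.

```python
def weighted_levenshtein(source, target, ins_cost=1, del_cost=3, sub_cost=5):
    """Minimum cost of turning `source` into `target` with per-operation costs."""
    # prev[j] = cost of building target[:j] from an empty source (insertions only).
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_ch in enumerate(source, start=1):
        # curr[0] = cost of reducing source[:i] to an empty target (deletions only).
        curr = [i * del_cost]
        for j, t_ch in enumerate(target, start=1):
            curr.append(min(
                prev[j] + del_cost,                              # delete s_ch
                curr[j - 1] + ins_cost,                          # insert t_ch
                prev[j - 1] + (0 if s_ch == t_ch else sub_cost), # keep or substitute
            ))
        prev = curr
    return prev[-1]

def confidence(source, target, **costs):
    """Distance as a percentage of the longer string's length (lower is closer)."""
    longer = max(len(source), len(target))
    if longer == 0:
        return 0.0
    return weighted_levenshtein(source, target, **costs) / longer * 100
```

With the 1, 3, 5 costs, going from 'dominos' to 'dominos pizza' costs 6 (six insertions), while the reverse direction costs 18 (six deletions), so the order of the arguments matters. Note also that a substitution (5) costs more than a deletion plus an insertion (4), so the minimum-cost path never actually substitutes.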
trigram
Run trigram similarity between the cleaned factual name and the cleaned brand name. Record matches with similarity >= 0.4.
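
For intuition, trigram similarity can be approximated in Python as below. The sketch assumes the padding described in the pg_trgm documentation (each word gets two spaces in front and one behind) and scores the overlap as shared trigrams divided by all distinct trigrams; it is an approximation for illustration, and the function names are hypothetical.

```python
import re

def trigrams(text):
    """Set of 3-character windows over each padded, lowercased word."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def is_match(a, b, threshold=0.4):
    """Apply the 0.4 cut-off used in these notes."""
    return similarity(a, b) >= threshold
```

Under these assumptions, 'dominos' against 'dominos pizza' shares 8 of 14 distinct trigrams, about 0.57, which clears the 0.4 threshold.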
tsvector/tsquery

*Result: the theory is that this is perhaps the best coverage that we can obtain with guaranteed false positives. 587 matches.*

Run tsvector on the uncleaned factual name and tsquery on the uncleaned brand names. Store all matches. Also read up on the ranking.
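
The matching step can be pictured with the simplified Python sketch below: the factual name becomes a set of tokens and the brand name becomes an AND query over its tokens. PostgreSQL's text search also applies stemming and stop-word removal according to the configured dictionary, which this sketch deliberately leaves out, so its results will differ from the real thing. The function names are hypothetical.

```python
import re

def to_lexemes(text):
    """Lowercased alphanumeric tokens; no stemming or stop words (assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def ts_match(factual_name, brand_name):
    """True when every token of the brand name occurs in the factual name."""
    query = to_lexemes(brand_name)
    # An empty query matches nothing, so blank brand names are never stored.
    return bool(query) and query <= to_lexemes(factual_name)
```

Because matching is per token rather than per substring, "pizza hut" matches "Pizza Hut Express" regardless of word order, while a partial token such as "hut" inside "shuttle" does not count.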
Start from the table created by 4 (tsvector/tsquery).
1. Find levenshtein, using the 1, 3, 5 cost rule, between the strings that are matched. Ignore the ones with large scores.
2. Use category to match and get a better subset.
3. Use URLs to get an even better subset.
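
The narrowing steps above might look like the following sketch. Everything here is an assumption made for illustration: each candidate is a dict with hypothetical keys (`distance`, `brand_category`, `factual_category`, `brand_url`, `factual_url`), `max_distance` stands in for the unspecified "large scores" cut-off, and `category_map` is a hypothetical lookup from factual categories to brand categories.

```python
from urllib.parse import urlparse

def domain(url):
    """Host part of a URL, lowercased, without a leading 'www.'."""
    # urlparse only fills netloc when it sees '//', so add it if the scheme is missing.
    netloc = urlparse(url if "//" in url else "//" + url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc

def refine(candidates, max_distance, category_map):
    """Return three progressively narrower lists of candidate matches."""
    # Step 1: ignore pairs whose weighted levenshtein score is too large.
    close = [c for c in candidates if c["distance"] <= max_distance]
    # Step 2: keep pairs whose factual category maps onto the brand category.
    same_category = [
        c for c in close
        if category_map.get(c["factual_category"]) == c["brand_category"]
    ]
    # Step 3: keep pairs where both URLs exist and point at the same domain.
    same_url = [
        c for c in same_category
        if c.get("factual_url") and c.get("brand_url")
        and domain(c["factual_url"]) == domain(c["brand_url"])
    ]
    return close, same_category, same_url
```

Returning all three lists, rather than only the last, keeps the wider subsets available for comparing coverage against precision.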

### To-do


* Check the clear-string function.
* Create a new search term table with the new algorithm.
* Create a new column in the factual table with the new algorithm.
* Incorporate all the ideas in the script.
* Find levenshtein for the table generated by tsvector/tsquery.
* Use category mapping.
* Use URL mapping.