Skip to content

Instantly share code, notes, and snippets.

@geohacker
Last active August 29, 2015 13:56
Show Gist options
  • Save geohacker/8873655 to your computer and use it in GitHub Desktop.
Save geohacker/8873655 to your computer and use it in GitHub Desktop.
String Matching Jugaad.

Strategies

  1. LIKE, ILIKE
  2. fuzzystrmatch
  3. trigram
  4. tsvector/query
  5. URLs.
  6. Categories.
  7. Removing generic but non brand names like cafe, pizza etc..

Ideas

  1. LIKE, ILIKE
  2. Map brand_name to factual_name (uncleaned). 21 matches.
  3. On the cleaned strings - * * Problem - regex substrings will cause more false positives than what we buying for. Plus full text search does this for us* 1. Run LIKE with % from brand name to factual name. 2. If brand name is a single word, then run a regex match on factual name only if there is a space before and/or after.
  4. fuzzystrmatch -
      • Result - Costs for insertion 1, deletion 3, substitution 5. For example, to get from 'dominos' and 'dominos pizza', it is better to insert 'pizza' than substitute or delete characters. Total matches - 1832*
  5. Run levenshtein between cleaned factual name and cleaned brand name. Calculate the confidence as follows - levenshtein/length of longer string*100.
  6. trigram
  7. Run trigram between cleaned factual name and cleaned brand name. Record matches with similarity >=0.4.
  8. tsvector/tsquery -
    • Result - The theory is that this is perhaps the best coverage that we can obtain with guaranteed false positives. 587 matches.*
  1. Run tsvector on uncleaned factual name and tsquery on uncleaned brand names. Store all matches. Also readup about the ranking.
  2. Start from the table created by 4.
  3. Find levenshtein between the strings that are matched with the 1, 3, 5 rule. Ignore the ones with large scores.
  4. Use category to match and get a better subset.
  5. Use URLs to get an even better subset.

To-do

  1. Check the clear-string function.
  2. Create a new search term table with the new algorithm.
  3. Create a new column in the factual table with the new algorithm.
  4. Incorporate all the ideas in the script.
  5. Find levenshtein of table generated by ts.
  6. Use category mapping.
  7. Use URL mapping.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment