@fatih
Created March 19, 2023 17:44
Poly File Layer
Idea: a single file whose content represents a set of identical or similar files.
* Files can be templated. Files generated by the template are identical in the PFL (Poly File Layer).
* Files that share common lines can be linked and marked as a Poly File.
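The "common lines" linking could be prototyped with a line-level Jaccard check over hashed lines. A minimal sketch, assuming SHA-1 line hashing and a set-based similarity (both are my own illustrative choices, not part of the idea above):

```python
import hashlib

def line_hashes(text: str) -> set[str]:
    """Hash each line so shared lines can be compared cheaply."""
    return {hashlib.sha1(line.encode()).hexdigest() for line in text.splitlines()}

def common_line_ratio(a: str, b: str) -> float:
    """Jaccard similarity over line hashes: shared unique lines / all unique lines.
    Two files above some threshold could be candidates for a Poly File link."""
    ha, hb = line_hashes(a), line_hashes(b)
    if not ha and not hb:
        return 1.0
    return len(ha & hb) / len(ha | hb)
```

A threshold (say 0.8) on this ratio would decide whether two files get linked as a Poly File.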
Prior Art
* [ ] Read the papers:
* Extracting a Unified Directory Tree to Compare Similar Software Products: https://sel.ist.osaka-u.ac.jp/lab-db/betuzuri/archive/1012/1012.pdf
* Fast Search in Hamming Space with Multi-Index Hashing: http://www.cs.toronto.edu/~norouzi/research/papers/multi_index_hashing.pdf
Finding Similar Items:
* http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf
* http://www.mmds.org
* RWS-Diff: Flexible and Efficient Change Detection in Hierarchical Data: https://db.in.tum.de/~finis/papers/RWS-Diff.pdf
Notes
Locality-Sensitive Hashing: https://en.wikipedia.org/wiki/Locality-sensitive_hashing
https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134
https://medium.com/engineering-brainly/locality-sensitive-hashing-explained-304eb39291e4
https://security.stackexchange.com/questions/44743/is-there-a-hash-algorithm-that-can-identify-similar-files-or-strings
https://softwareengineering.stackexchange.com/a/266741
Data deduplication is also often called "record linkage", so you may want to also use that as a search term when researching this.
There is an article on the Eventbrite engineering blog that explains how to greatly reduce the number of file comparisons by using Multi-Index Locality-Sensitive Hashing. In short, you create a special kind of hash value such that similar documents get hash values that are close together. You can then compare potentially similar documents byte for byte, because the set of candidates to compare against is much smaller.
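A rough sketch of the candidate-reduction trick described above, using MinHash signatures split into bands (the function names, hash counts, and band sizes are illustrative assumptions, not the Eventbrite implementation):

```python
import hashlib
from collections import defaultdict

def minhash_signature(tokens, num_hashes=16):
    """MinHash signature: for each seeded hash function, keep the minimum
    hash over the document's tokens. Similar token sets tend to agree on
    many positions of the signature."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16) for t in tokens)
        for seed in range(num_hashes)
    )

def lsh_buckets(docs, num_hashes=16, bands=4):
    """Group documents whose signatures collide in at least one band.
    Only documents sharing a bucket need a byte-for-byte comparison."""
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for name, tokens in docs.items():
        sig = minhash_signature(tokens, num_hashes)
        for b in range(bands):
            band = sig[b * rows:(b + 1) * rows]
            buckets[(b, band)].append(name)
    return buckets
```

Identical documents always land in the same buckets; documents with no tokens in common will, with overwhelming probability, never share one, so the expensive exact comparison is skipped for them.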
String similarity
TODO:
* [ ] Jaro-Winkler Similarity (watch on YouTube)
* [ ] Ratcliff-Obershelp Similarity
Types:
* Edit distance based
* Hamming distance: computed by overlaying one string on the other and counting the positions where they differ.
* Con: arrow vs arow has a normalization score of only 0.4
* Levenshtein distance: computed by finding the minimum number of edits needed to transform one string into the other.
* Jaro-Winkler: computed by giving high scores to two strings if they contain the same characters within a certain distance of one another, AND the order of the matching characters is the same.
* Pro: gives a better outcome for arrow vs arow; the normalization score is 0.96
* Token based
* Jaccard index: the formula is the number of common tokens divided by the total number of unique tokens, i.e. numerator/denominator is intersection (common tokens) / union (unique tokens).
* Sorensen-Dice: the logic is to find the common tokens, multiply by 2, and divide by the total number of tokens when both sets are combined.
* Sequence based
* Ratcliff-Obershelp similarity
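The measures listed above can be sketched in a few lines of Python. A minimal reference sketch; for the sequence-based case, the standard library's `difflib.SequenceMatcher` is used as a stand-in, since its ratio is based on the Ratcliff-Obershelp approach:

```python
from difflib import SequenceMatcher

def hamming(a: str, b: str) -> int:
    """Count positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete, substitute)
    turning a into b, via the classic rolling-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def jaccard(t1: set, t2: set) -> float:
    """Common tokens / unique tokens: |intersection| / |union|."""
    return len(t1 & t2) / len(t1 | t2)

def dice(t1: set, t2: set) -> float:
    """Sorensen-Dice: 2 * |intersection| / (|t1| + |t2|)."""
    return 2 * len(t1 & t2) / (len(t1) + len(t2))

def sequence_similarity(a: str, b: str) -> float:
    """Ratcliff-Obershelp-style ratio via difflib."""
    return SequenceMatcher(None, a, b).ratio()
```

For example, `levenshtein("arrow", "arow")` is 1 (one deleted "r"), which illustrates why edit-distance normalization treats that pair as very similar even though the raw Hamming comparison cannot handle the unequal lengths at all.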
https://itnext.io/string-similarity-the-basic-know-your-algorithms-guide-3de3d7346227