@fatih
Created March 19, 2023 17:44
Poly File Layer
Idea: a single file whose content represents a set of identical or similar files.
* Files can be templated. Files generated by the template are identical in the PFL (Poly File Layer).
* Files that share common lines can be linked and marked as a Poly File.
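The "common lines" linking could be prototyped with a line-level Jaccard check over hashed lines. A minimal sketch, assuming SHA-1 line hashing and a set-based similarity (both are my own illustrative choices, not part of the idea above):

```python
import hashlib

def line_hashes(text: str) -> set[str]:
    """Hash each line so shared lines can be compared cheaply."""
    return {hashlib.sha1(line.encode()).hexdigest() for line in text.splitlines()}

def common_line_ratio(a: str, b: str) -> float:
    """Jaccard similarity over line hashes: shared unique lines / all unique lines.
    Two files above some threshold could be candidates for a Poly File link."""
    ha, hb = line_hashes(a), line_hashes(b)
    if not ha and not hb:
        return 1.0
    return len(ha & hb) / len(ha | hb)
```

A threshold (say 0.8) on this ratio would decide whether two files get linked as a Poly File.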
Prior Art
* [ ] Read the papers:
* Extracting a Unified Directory Tree to Compare Similar Software Products: https://sel.ist.osaka-u.ac.jp/lab-db/betuzuri/archive/1012/1012.pdf
* Fast Search in Hamming Space with Multi-Index Hashing: http://www.cs.toronto.edu/~norouzi/research/papers/multi_index_hashing.pdf
Finding Similar Items:
* http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf
* http://www.mmds.org
* RWS-Diff: Flexible and Efficient Change Detection in Hierarchical Data: https://db.in.tum.de/~finis/papers/RWS-Diff.pdf
Notes
Locality-Sensitive Hashing: https://en.wikipedia.org/wiki/Locality-sensitive_hashing
https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134
https://medium.com/engineering-brainly/locality-sensitive-hashing-explained-304eb39291e4
https://security.stackexchange.com/questions/44743/is-there-a-hash-algorithm-that-can-identify-similar-files-or-strings
https://softwareengineering.stackexchange.com/a/266741
Data deduplication is also often called "record linkage", so you may want to also use that as a search term when researching this.
There is an article on the Eventbrite engineering blog that explains how to greatly reduce the number of file comparisons by using Multi-Index Locality-Sensitive Hashing. In short, you create a special kind of hash value such that similar documents get hash values that are close together. You can then compare potentially similar documents byte for byte, because the set of candidates to compare against is much smaller.
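A rough sketch of the candidate-reduction trick described above, using MinHash signatures split into bands (the function names, hash counts, and band sizes are illustrative assumptions, not the Eventbrite implementation):

```python
import hashlib
from collections import defaultdict

def minhash_signature(tokens, num_hashes=16):
    """MinHash signature: for each seeded hash function, keep the minimum
    hash over the document's tokens. Similar token sets tend to agree on
    many positions of the signature."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16) for t in tokens)
        for seed in range(num_hashes)
    )

def lsh_buckets(docs, num_hashes=16, bands=4):
    """Group documents whose signatures collide in at least one band.
    Only documents sharing a bucket need a byte-for-byte comparison."""
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for name, tokens in docs.items():
        sig = minhash_signature(tokens, num_hashes)
        for b in range(bands):
            band = sig[b * rows:(b + 1) * rows]
            buckets[(b, band)].append(name)
    return buckets
```

Identical documents always land in the same buckets; documents with no tokens in common will, with overwhelming probability, never share one, so the expensive exact comparison is skipped for them.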
String similarity
TODO:
* [ ] Jaro-Winkler Similarity (watch on YouTube)
* [ ] Ratcliff-Obershelp Similarity
Types:
* Edit distance based
* Hamming distance: computed by overlaying one string on the other and counting the positions where they differ.
* Con: arrow vs arow has a normalization score of only 0.4
* Levenshtein distance: computed by finding the minimum number of edits needed to transform one string into the other.
* Jaro-Winkler: computed by giving high scores to two strings if they contain the same characters within a certain distance of one another, AND the order of the matching characters is the same.
* Pro: gives a better outcome for arrow vs arow; the normalization score is 0.96
* Token based
* Jaccard index: the formula is the number of common tokens divided by the total number of unique tokens, i.e. numerator/denominator is intersection (common tokens) / union (unique tokens).
* Sorensen-Dice: the logic is to find the common tokens, multiply by 2, and divide by the total number of tokens when both sets are combined.
* Sequence based
* Ratcliff-Obershelp similarity
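The measures listed above can be sketched in a few lines of Python. A minimal reference sketch; for the sequence-based case, the standard library's `difflib.SequenceMatcher` is used as a stand-in, since its ratio is based on the Ratcliff-Obershelp approach:

```python
from difflib import SequenceMatcher

def hamming(a: str, b: str) -> int:
    """Count positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete, substitute)
    turning a into b, via the classic rolling-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def jaccard(t1: set, t2: set) -> float:
    """Common tokens / unique tokens: |intersection| / |union|."""
    return len(t1 & t2) / len(t1 | t2)

def dice(t1: set, t2: set) -> float:
    """Sorensen-Dice: 2 * |intersection| / (|t1| + |t2|)."""
    return 2 * len(t1 & t2) / (len(t1) + len(t2))

def sequence_similarity(a: str, b: str) -> float:
    """Ratcliff-Obershelp-style ratio via difflib."""
    return SequenceMatcher(None, a, b).ratio()
```

For example, `levenshtein("arrow", "arow")` is 1 (one deleted "r"), which illustrates why edit-distance normalization treats that pair as very similar even though the raw Hamming comparison cannot handle the unequal lengths at all.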
https://itnext.io/string-similarity-the-basic-know-your-algorithms-guide-3de3d7346227