JellyFish Notebook
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"JellyFish"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Author: Rory Creedon\n",
"\n",
"Date: 28 November 2013\n",
"\n",
"Purpose: To examine the possibility of using the python JellyFish library for complex \"fuzzy\" string matching."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**CONTENTS**:\n",
"\n",
"+ **Introduction** - Outlining the Problem\n",
"+ **JellyFish and It's Algorithms** - Brief introduction to JellyFish\n",
"+ **Levenshtein Distance** - Description, Calculation, and Examples\n",
"+ **Damerau Levenshtein Distance** - Description, Calculation, and Examples\n",
"+ **Jaro Distance** - Description, Calculation, and Examples\n",
"+ **Jaro-Winkler Distance** - Description, Calculation, and Examples\n",
"+ **Match Rating Approach Comparison** - Brief Description\n",
"+ **Hamming Distance** - Brief Description\n",
"+ **Experimenting with the Measures**\n",
"+ **Conclusion**"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"import numpy as np\n",
"import datetime\n",
"import os\n",
"from pandas import DataFrame\n",
"from numpy import nan as NA\n",
"from IPython.core.display import HTML\n",
"from IPython.core.display import Image\n",
"from IPython.display import Math\n",
"from IPython.display import Latex\n",
"import collections\n",
"import jellyfish as jf\n",
"import re"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With regard to style names in production data we face two distinct problems:\n",
"\n",
"1. Style names not being clean within factory. <br />\n",
"</p> </p> The fact of the matter is that often the string values of the style names are not clean meaning that styles that in reality are the same are not recognised as such. This is clearly problematic as we want to be able to ;identify those styles that are produced on multiple lines in order to sensibly create the running days values, &nbsp;&nbsp;&nbsp;to do across line analysis of the same styles etc. etc. \n",
"\n",
"2. Matching styles across factory. <br />\n",
"</p> </p> Eventually we will want to be able to match styles across factories in order to identify those styles that are produced in multiple factories. The reasons for this are pretty obvious. \n",
"\n",
"At this point, the notebook will mostly relate to the former problem, as it is most salient at this stage of the data cleaning. \n",
"\n",
"<h />"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**DISCLAIMER** I am not an expert in this type of computer science. Much of this notebook is based upon Wikipedia entries. I suggest you do your own research, and use this notebook as a jumping off point."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"JellyFish and Its Algorithms"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Jellyfish is a python library for doing approximate and phonetic matching of strings. It allows for string comparison using the following algorithms:\n",
"\n",
"+ Levenshtein Distance\n",
"+ Damerau-Levenshtein Distance\n",
"+ Jaro Distance\n",
"+ Jaro-Winkler Distance\n",
"+ Match Rating Approach Comparison\n",
"+ Hamming Distance\n",
"\n",
"All of the above are metrics for measuring the difference between two sequences. Each of these will not be briefly introduced and an example given:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"<br />"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###**Levenshtein Distance**<br />\n",
"The Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to transform one string into another. The lower the score, the lower the number of edits. If the score is 0, then no edits are needed so the strings are exactly the same. \n",
"\n",
"For example, the Levenshtein distance between \"kitten\" and \"sitting\" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:\n",
"\n",
"+ Edit 1: kitten \u2192 sitten (substitution of \"s\" for \"k\")\n",
"+ Edit 2: sitten \u2192 sittin (substitution of \"i\" for \"e\")\n",
"+ Edit 3: sittin \u2192 sitting (insertion of \"g\" at the end)\n",
"\n",
"The measure is most useful when looking to match short strings with longer strings. It takes a lot of computational power to do this with long strings and may not therefore be totally appropriate. \n",
"\n",
"For more information about the measure see\n",
"\n",
"http://en.wikipedia.org/wiki/Levenshtein_distance\n",
"\n",
"Now some examples follow:"
]
},
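{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reference for how the number is actually computed, here is a minimal pure-Python dynamic-programming sketch. This is my own illustration of the definition above, not the JellyFish implementation:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def levenshtein(s1, s2):\n",
"    # Classic Wagner-Fischer dynamic programme: prev[j] holds the distance\n",
"    # between the first i-1 characters of s1 and the first j characters of s2\n",
"    prev = range(len(s2) + 1)\n",
"    for i, c1 in enumerate(s1, 1):\n",
"        curr = [i]\n",
"        for j, c2 in enumerate(s2, 1):\n",
"            cost = 0 if c1 == c2 else 1\n",
"            curr.append(min(prev[j] + 1,          # deletion\n",
"                            curr[j - 1] + 1,      # insertion\n",
"                            prev[j - 1] + cost))  # substitution\n",
"        prev = curr\n",
"    return prev[-1]\n",
"\n",
"levenshtein('kitten', 'sitting')  # 3, matching the worked example above"
],
"language": "python",
"metadata": {},
"outputs": []
},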
{
"cell_type": "code",
"collapsed": false,
"input": [
"A = 'Rory'\n",
"B = 'Ronnie'\n",
"\n",
"jf.levenshtein_distance(A, B)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 2,
"text": [
"4"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the following demonstrates, the algorithm also considers case, which is a strong argument for converting all strings to the same case. This is a pretty standard way of working with strings so will come as no surprise."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"A = 'Rory'\n",
"B = 'rory'\n",
"jf.levenshtein_distance(A, B)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": [
"1"
]
}
],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Now the levenshtein score of 0 means the strings are an exact match. \n",
"jf.levenshtein_distance(A.lower(), B.lower())"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": [
"0"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"<br />"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###**Damerau Levenshtein Distance**<br />\n",
"This measure is very similar to the Levenshtein distance, in that it is a measure of the minimum number of edits needed to transform one string into another. The permissible 'edits' in the Levenshtein distance are insertions, deletion and substitution whereas in the Damerau Levenshtein distance the transposition of two adjacent characters is also allowed. Damerau claimed that these four edits correspond to 80% of human spelling errors. \n",
"\n",
"As with the Levenshtein distance a score of zero indicates and exact match etc.\n",
"\n",
"This measure may be an improvement on the Levenshtein distance as using the Damerau Levenshtein Distance strings where two letters are simply the wrong way around will have a lower score (indicating a better match) than they would under the Levenshtein measure. \n",
"\n",
"A simple example will suffice:\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"jf.levenshtein_distance('jellyfihs', 'jellyfish')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 5,
"text": [
"2"
]
}
],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"jf.damerau_levenshtein_distance('jellyfihs', 'jellyfish')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": [
"1"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, the **Levenshtein** distance works like this:\n",
"\n",
"+ Edit 1: jellyfihs \u2192 jellyfiss (substitution of \"s\" for \"h\")\n",
"+ Edit 2: jellyfiss \u2192 jellyfish (substitution of \"h\" for \"s\")\n",
"\n",
"This is only one way it could happen. For instance 'h' could have been deleted, and then inserted (and so on), but the minimum number of edits is always 2\n",
"\n",
"The **Damerau Levenshtein** distance works like this:\n",
"\n",
"+ Edit 1: jellyfihs \u2192 jellyfish (transposition of 's' and 'h' adjacent characters)\n",
"\n",
"The measure may therefore be a better one in that it recognises strings as being closer than the Levenshtein measure in cases where there has been a simple mix up of characters. \n",
"\n",
"\n",
"**NB** I have observed some odd behaviour of the jf.damerau_levenshtein_distance algorithm. The odd behaviour may be related to me misunderstanding the nature of the measure, or it may be a problem with coding of the library. I have written to the developer for clarification. If you are interested, then see the following, if not then skip to the next measure below.\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"jf.damerau_levenshtein_distance('jellyifhs', 'jellyfish')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": [
"2"
]
}
],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"jf.damerau_levenshtein_distance('ifhs', 'fish')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 8,
"text": [
"3"
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I find the above output very odd for the reason that the scores should be the same as the edits required in both instances are exactly the same:\n",
"\n",
"In the first example:\n",
"\n",
"+ Edit 1: jellyifhs \u2192 jellyfihs (transpose adjacent characters 'i' and 'f')\n",
"+ Edit 2: jellyfihs \u2192 jellyfish (transpose adjacent characters 'h' and 's')\n",
"\n",
"In the second example:\n",
"\n",
"+ Edit 1: ifhs \u2192 fihs (transpose adjacent characters 'i' and 'f')\n",
"+ Edit 2: fihs \u2192 fish (transpose adjacent characters 'h' and 's')\n",
"\n",
"Why the outputs are different remains to be determined.\n",
"\n",
"**Update** It appears from looking at the source code that in some cases the function is returning the OSA measure not the Damerau Levenshtein measure. The developer is now aware, and has been contacted. use this measure with caution. \n",
"\n",
"More information on the measure can be found at http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance"
]
},
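{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here is a minimal textbook sketch of the optimal string alignment (OSA) variant mentioned in the update above. This is my own illustration, not the JellyFish source, and is useful mainly for checking the library's output against the definition:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def osa(s1, s2):\n",
"    # Optimal string alignment: Levenshtein edits plus transposition of two\n",
"    # adjacent characters, with no substring edited more than once\n",
"    d = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]\n",
"    for i in range(len(s1) + 1):\n",
"        d[i][0] = i\n",
"    for j in range(len(s2) + 1):\n",
"        d[0][j] = j\n",
"    for i in range(1, len(s1) + 1):\n",
"        for j in range(1, len(s2) + 1):\n",
"            cost = 0 if s1[i - 1] == s2[j - 1] else 1\n",
"            d[i][j] = min(d[i - 1][j] + 1,         # deletion\n",
"                          d[i][j - 1] + 1,         # insertion\n",
"                          d[i - 1][j - 1] + cost)  # substitution\n",
"            if (i > 1 and j > 1 and s1[i - 1] == s2[j - 2]\n",
"                    and s1[i - 2] == s2[j - 1]):\n",
"                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition\n",
"    return d[len(s1)][len(s2)]\n",
"\n",
"# two disjoint adjacent transpositions in each case, so this sketch gives 2 for both\n",
"osa('jellyifhs', 'jellyfish'), osa('ifhs', 'fish')"
],
"language": "python",
"metadata": {},
"outputs": []
},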
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br />"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###**Jaro Distance**<br />\n",
"The Jaro distance is a measure that considers the number of matching characters in both strings being compared, and also the number of transpositions which is defined as the number of matching characters (in a different sequence order) divided by two. The measure returns a score between 0 and 1, 0 being no match whatsoever (as defined in the calculation) and 1 being a perfect match. \n",
"\n",
"Beware that this measure will ignore matching characters that are more than a certain distance from each other. This could either be a good thing (to ignore spurious matches) or a bad thing (ignoring correct matches). In any event, it is important to be aware of this, so it is explained in detail below.\n",
"\n",
"It is calculated as follows:\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Latex(r\"\"\"\\begin{eqnarray}\n",
"d_j = \\left\\{\n",
" \\begin{array}{1 1}\n",
" 0 & \\quad \\text{if $m$ = 0 <}\\\\\n",
" \\\\\n",
" \\frac{1}{3} \\bigg(\\frac{m}{|s_1|} + \\frac{m}{|s_2|} + \\frac{m - t}{m} & \\quad \\text{otherwise}\n",
" \\end{array} \\right.\n",
"\\end{eqnarray}\"\"\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"latex": [
"\\begin{eqnarray}\n",
"d_j = \\left\\{\n",
" \\begin{array}{1 1}\n",
" 0 & \\quad \\text{if $m$ = 0 <}\\\\\n",
" \\\\\n",
" \\frac{1}{3} \\bigg(\\frac{m}{|s_1|} + \\frac{m}{|s_2|} + \\frac{m - t}{m} & \\quad \\text{otherwise}\n",
" \\end{array} \\right.\n",
"\\end{eqnarray}"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 9,
"text": [
"<IPython.core.display.Latex at 0x3cff9b0>"
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ **$m$** is the number of matching characters\n",
"+ **$t$** is half the number of transpositions.\n",
"+ **$|s_1|$** is the length of the first string to be matched\n",
"+ **$|s_2|$** is the length of the second string to be matched\n",
"\n",
"If the number of matching characters is 0, then the measure equals 0\n",
"\n",
"A character in each string is only considered to be matching if it is the same and obeys the following rule as to distance:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Latex(r\"\"\"\\begin{eqnarray}\n",
"\\frac{max(|s_1|,|s_2|)}{2} -1\n",
"\\end{eqnarray}\"\"\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"latex": [
"\\begin{eqnarray}\n",
"\\frac{max(|s_1|,|s_2|)}{2} -1\n",
"\\end{eqnarray}"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 10,
"text": [
"<IPython.core.display.Latex at 0x3cff9d0>"
]
}
],
"prompt_number": 10
},
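{
"cell_type": "markdown",
"metadata": {},
"source": [
"Putting those definitions together, here is a minimal pure-Python sketch of the Jaro calculation. Again this is my own illustration of the formulas above, not the JellyFish source:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def jaro(s1, s2):\n",
"    # Matching window from the rule above: floor(max(|s1|,|s2|)/2) - 1\n",
"    window = max(max(len(s1), len(s2)) // 2 - 1, 0)\n",
"    matched1 = [False] * len(s1)\n",
"    matched2 = [False] * len(s2)\n",
"    m = 0\n",
"    for i, c in enumerate(s1):  # count the matching characters\n",
"        for j in range(max(0, i - window), min(len(s2), i + window + 1)):\n",
"            if not matched2[j] and s2[j] == c:\n",
"                matched1[i] = matched2[j] = True\n",
"                m += 1\n",
"                break\n",
"    if m == 0:\n",
"        return 0.0\n",
"    # t is half the number of matching characters that are out of order\n",
"    out_of_order, k = 0, 0\n",
"    for i, c in enumerate(s1):\n",
"        if matched1[i]:\n",
"            while not matched2[k]:\n",
"                k += 1\n",
"            if c != s2[k]:\n",
"                out_of_order += 1\n",
"            k += 1\n",
"    t = out_of_order / 2.0\n",
"    return (float(m) / len(s1) + float(m) / len(s2) + (m - t) / m) / 3.0\n",
"\n",
"jaro('Poverty', 'Poervty')  # 0.92857... by this sketch (cf. the library's 0.95238 below)"
],
"language": "python",
"metadata": {},
"outputs": []
},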
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Observe the following:\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"S1 = 'AzzzzB'\n",
"S2 = 'BxxA'\n",
"jf.jaro_winkler(S1, S2)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 11,
"text": [
"0.0"
]
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although the characters A and B appear in both strings, m = 0 because they are farther in distance from each other than:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Latex(r\"\"\"\\begin{eqnarray}\n",
"\\frac{max(|6|,|4|)}{2} -1 = 2\n",
"\\end{eqnarray}\"\"\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"latex": [
"\\begin{eqnarray}\n",
"\\frac{max(|6|,|4|)}{2} -1 = 2\n",
"\\end{eqnarray}"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 12,
"text": [
"<IPython.core.display.Latex at 0x6713d30>"
]
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Transpositions are also important as already mentioned. The number of transpositions is calculated as \"The number of matching (but different sequence order) characters divided by 2\". Observe the following:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"S3 = 'Poverty'\n",
"S4 = 'Poervty'\n",
"jf.jaro_distance(S3, S4)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 13,
"text": [
"0.9523809523809524"
]
}
],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sadly it looks like this is also being calculated incorrectly. To my mind the calculation should be as follows:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Latex(r\"\"\"\\begin{eqnarray}\n",
"\\Bigg(\\frac{7}{7} + \\frac{7}{7} + \\frac{7- \\frac{3}{2}}{7}\\Bigg) = 0.9285714\n",
"\\end{eqnarray}\"\"\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"latex": [
"\\begin{eqnarray}\n",
"\\Bigg(\\frac{7}{7} + \\frac{7}{7} + \\frac{7- \\frac{3}{2}}{7}\\Bigg) = 0.9285714\n",
"\\end{eqnarray}"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 14,
"text": [
"<IPython.core.display.Latex at 0x6713e10>"
]
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whereas is appears that it is being calcualted as follows:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Latex(r\"\"\"\\begin{eqnarray}\n",
"\\Bigg(\\frac{7}{7} + \\frac{7}{7} + \\frac{7- \\frac{2}{2}}{7}\\Bigg) = 0.9523809\n",
"\\end{eqnarray}\"\"\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"latex": [
"\\begin{eqnarray}\n",
"\\Bigg(\\frac{7}{7} + \\frac{7}{7} + \\frac{7- \\frac{2}{2}}{7}\\Bigg) = 0.9523809\n",
"\\end{eqnarray}"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 15,
"text": [
"<IPython.core.display.Latex at 0x6713d10>"
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The critical difference is that I calculate the number of matching (but different sequence order) characters as 3, those being:\n",
"\n",
"+ e\n",
"+ r\n",
"+ v\n",
"\n",
"Whereas it appears that JellyFish thinks there are only two.\n",
"\n",
"Again, I have raised this issue with the developer. It may resolved in future, but for now use this measure with caution. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br />"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###**Jaro-Winkler Distance**\n",
"The Jaro-Winkler Distance measure builds upon the Jaro measure, but uses a prefix scale which gives more favorable ratings to strings that match from the beginning for a set prefix length. This will become clear when the calculation is viewed:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Latex(r\"\"\"\\begin{eqnarray}\n",
"d_w = d_j + (\\ell p(1 - d_j))\n",
"\\end{eqnarray}\"\"\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"latex": [
"\\begin{eqnarray}\n",
"d_w = d_j + (\\ell p(1 - d_j))\n",
"\\end{eqnarray}"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 16,
"text": [
"<IPython.core.display.Latex at 0x6713fd0>"
]
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The expression is made up of the following:\n",
"\n",
"+ **$d_w$** is the Jarrow-Winkler distance\n",
"+ **$d_j$** is the Jaro distance for strings s1 and s2\n",
"+ **$\\ell$** is the length of the common prefix up to a maximum of 4 characters\n",
"+ **$p$** is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. It should not exceed 0.25, otherwise the distance can become larger than 1. The standard value for this constant in Winkler's work is 0.1\n",
" \n",
" "
]
},
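{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, building on the jaro function above and assuming the standard values of $p$ = 0.1 and a four-character prefix cap:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def jaro_winkler(s1, s2, p=0.1):\n",
"    dj = jaro(s1, s2)  # Jaro distance from the sketch above\n",
"    ell = 0            # length of the common prefix, capped at 4 characters\n",
"    for a, b in zip(s1[:4], s2[:4]):\n",
"        if a != b:\n",
"            break\n",
"        ell += 1\n",
"    return dj + ell * p * (1 - dj)"
],
"language": "python",
"metadata": {},
"outputs": []
},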
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The key takeaway is that strings that begin with the same prefixes score more highly...\n",
"\n",
"Observe the following:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"S5 = 'Innovaiosn'\n",
"S6 = 'Innovations'\n",
"jf.jaro_distance(S5, S6)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 17,
"text": [
"0.9363636363636364"
]
}
],
"prompt_number": 17
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"jf.jaro_winkler(S5, S6)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 18,
"text": [
"0.9618181818181818"
]
}
],
"prompt_number": 18
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although it is not stated anywhere in the jellyfish documentation, it is clear that the value of $p$ is 0.1, this is because rearranging the following to solve for $\\ell$:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Latex(r\"\"\"\\begin{eqnarray}\n",
"d_w = d_j + (\\ell p(1 - d_j))\\\\\n",
"\\\\\n",
"\\\\\n",
"\\ell = \\frac{d_w - d_j}{1 - d_j} * \\frac{1}{4}\\\\\n",
"\\\\\n",
"\\\\\n",
"0.1 = \\frac{0.96182-0.936364}{1-0.936364} * \\frac{1}{4}\n",
"\n",
"\\end{eqnarray}\"\"\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"latex": [
"\\begin{eqnarray}\n",
"d_w = d_j + (\\ell p(1 - d_j))\\\\\n",
"\\\\\n",
"\\\\\n",
"\\ell = \\frac{d_w - d_j}{1 - d_j} * \\frac{1}{4}\\\\\n",
"\\\\\n",
"\\\\\n",
"0.1 = \\frac{0.96182-0.936364}{1-0.936364} * \\frac{1}{4}\n",
"\n",
"\\end{eqnarray}"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 19,
"text": [
"<IPython.core.display.Latex at 0x6713f50>"
]
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In some implementations of Jaro-Winkler, the prefix bonus is only added when the compared strings have a Jaro distance above a set \"boost threshold\". In the work of the developer himself, this threshold was set at 0.7. In other words, if the Jaro measure is less than 0.7, then even if the prefixes of the string are the same up to four characters, the prefix bonus will not be applied. Then the calculation looks like this:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Latex(r\"\"\"\\begin{eqnarray}\n",
"d_w = \\left\\{\n",
" \\begin{array}{1 1}\n",
" d_j & \\quad \\text{if $d_j$ < $b_t$}\\\\\n",
" d_j + (\\ell p(1 - d_j)) & \\quad \\text{otherwise}\n",
" \\end{array} \\right.\n",
"\\end{eqnarray}\"\"\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"latex": [
"\\begin{eqnarray}\n",
"d_w = \\left\\{\n",
" \\begin{array}{1 1}\n",
" d_j & \\quad \\text{if $d_j$ < $b_t$}\\\\\n",
" d_j + (\\ell p(1 - d_j)) & \\quad \\text{otherwise}\n",
" \\end{array} \\right.\n",
"\\end{eqnarray}"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 20,
"text": [
"<IPython.core.display.Latex at 0x6727030>"
]
}
],
"prompt_number": 20
},
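{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch, that variant just adds a guard to the earlier jaro_winkler illustration (the 0.7 default for $b_t$ is an assumption here; whether JellyFish applies it is tested next):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def jaro_winkler_bt(s1, s2, p=0.1, bt=0.7):\n",
"    dj = jaro(s1, s2)\n",
"    if dj < bt:  # below the boost threshold: no prefix bonus\n",
"        return dj\n",
"    ell = 0\n",
"    for a, b in zip(s1[:4], s2[:4]):\n",
"        if a != b:\n",
"            break\n",
"        ell += 1\n",
"    return dj + ell * p * (1 - dj)"
],
"language": "python",
"metadata": {},
"outputs": []
},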
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, as there is no documentation to JellyFish it is hard to know if this is being applied, and if so, at what level the $b_t$ value is set to trigger the prefix bonus. However, a bit of easy experimentation can demonstrate, firstly I compare strings that have a Jaro score just below 0.7, and then strings that have a score just above 0.7"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"S7 = 'ABCDqwerty'\n",
"S8 = 'ABCDpoiuyt'\n",
"jf.jaro_distance(S7, S8)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 21,
"text": [
"0.6777777777777777"
]
}
],
"prompt_number": 21
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"jf.jaro_winkler(S7, S8)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 22,
"text": [
"0.6777777777777777"
]
}
],
"prompt_number": 22
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"S9 = 'ABCDqwerty'\n",
"S10 = 'ABCDpoiuty'\n",
"jf.jaro_distance(S9, S10)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 23,
"text": [
"0.7333333333333334"
]
}
],
"prompt_number": 23
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"jf.jaro_winkler(S9, S10)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 24,
"text": [
"0.8400000000000001"
]
}
],
"prompt_number": 24
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above output indicates that indeed, there implementation of the Jaro-Winkler measure in JellyFish uses a threshold of 0.7 for $b_t$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br />"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###**Match Rating Approach Comparison**\n",
"This technique is based upon a phonetic algorithm. As such I don\u2019t see much use for this function in the work I am doing, and will therefore neglect to explore it further. I may return to this at a later date. For information about the measure see:\n",
"\n",
"http://en.wikipedia.org/wiki/Match_rating_approach"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br />"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###**Hamming Distance**\n",
"\"In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In another way, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other.\"\n",
"\n",
"In actual fact this measure has the glory of being the only one of these that is technically a 'metric' as unlike the others presented here it satisfies the rules of triangle inequality.\n",
"\n",
"However, it is also the most useless for our purpose as it can only operate on strings of the same length, and additionally it effectively only permits edits that are substitutions. Therefore the Levenshtein distance is already a much subtler measure, and the remaining measures looked at even more so. \n",
"\n",
"Therefore, this measure will not be explored."
]
},
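{
"cell_type": "markdown",
"metadata": {},
"source": [
"That said, for completeness, the calculation itself is a one-liner for equal-length strings (a sketch of the textbook definition quoted above; JellyFish exposes it as jf.hamming_distance):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def hamming(s1, s2):\n",
"    # count the positions at which two equal-length strings differ\n",
"    assert len(s1) == len(s2), 'Hamming distance needs equal-length strings'\n",
"    return sum(c1 != c2 for c1, c2 in zip(s1, s2))\n",
"\n",
"hamming('karolin', 'kathrin')  # 3 (the classic textbook example)"
],
"language": "python",
"metadata": {},
"outputs": []
},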
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br />"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Experimenting with the Algorithms"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to experiment with the algorithms I am going to read in some style data from a factory and play with it, to see how/if they might be useful to us."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# This reads the data from a CSV file and stores it in a DataFrame object which has been named \"DF\"\n",
"Styles = pd.Series(pd.read_csv(r'C:\\Users\\rcreedon\\Dropbox\\GIZSupervisor\\DATA\\Production_Data\\STP_Data\\Data_Sets\\Wave1\\1005\\1005_all_merged.csv')['style'].unique())"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 25
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Styles"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 26,
"text": [
"0 Denim Jacket\n",
"1 Denim Vest\n",
"2 Jonas Long\n",
"3 2618 (Longer Denim)\n",
"4 Lilen Trs\n",
"5 Ride\n",
"6 Tye Dye Jean\n",
"7 Jonas Short\n",
"8 Jacket (2002)\n",
"9 Dobby Short\n",
"10 Parker Chino\n",
"11 550 (Pant)\n",
"12 Ride (510)\n",
"13 Chambrey(117)\n",
"14 Space Treggins\n",
"15 Jhonny Boxer\n",
"16 New skinny\n",
"17 Clint\n",
"18 Todd Jkt\n",
"19 Jacket (2001)\n",
"20 Back to School\n",
"21 Reidun Parka\n",
"22 Mixmatch\n",
"23 Reidun Parka(Lining)\n",
"24 Parko Chino\n",
"25 Camden Jkt\n",
"26 NaN\n",
"27 CB32405510\n",
"28 Bomber Jkt\n",
"29 5 Pkt Cord\n",
"30 Shairt-615\n",
"31 Shairt-537\n",
"32 Shairt-564\n",
"33 Love Trs.(340)\n",
"34 1ZB-153T(Pant)\n",
"35 1ZB-153T(JKT)\n",
"36 Jacket(S.wear)\n",
"37 Jacket(Bk to School)Boys\n",
"38 Jacket(Bk to School)Girls\n",
"39 Jacket-253KE\n",
"40 14011876\n",
"dtype: object"
]
}
],
"prompt_number": 26
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Styles = Styles[pd.notnull(Styles)]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 27
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to work with this data could be to create a dictionary with an entry for each string, and then iterate through the remaining strings, and enter those in a list item if they meet some threshold of each returned distance score.\n"
]
},
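{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pattern can be written as a small helper, sketched here for reference (below I simply write the loops out for each measure):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def match_styles(styles, score, is_match):\n",
"    # build {style: [other styles judged a match]} for any scoring function\n",
"    matches = {s: [] for s in styles}\n",
"    for s in matches:\n",
"        for other in styles:\n",
"            if s != other and is_match(score(s, other)):\n",
"                matches[s].append(other)\n",
"    return matches\n",
"\n",
"# e.g. match_styles(Styles, jf.jaro_distance, lambda d: d > 0.75)"
],
"language": "python",
"metadata": {},
"outputs": []
},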
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first thing that should always be done is to strip away all white spaces and non-alphanumeric characters and convert the case to lower (or upper). This can be achieved in the following manner:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"delchars = ''.join(c for c in map(chr, range(256)) if not c.isalnum())\n",
"for row in Styles.index:\n",
" Styles[row] = Styles[row].translate(None, delchars).lower()\n",
"Styles"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 28,
"text": [
"0 denimjacket\n",
"1 denimvest\n",
"2 jonaslong\n",
"3 2618longerdenim\n",
"4 lilentrs\n",
"5 ride\n",
"6 tyedyejean\n",
"7 jonasshort\n",
"8 jacket2002\n",
"9 dobbyshort\n",
"10 parkerchino\n",
"11 550pant\n",
"12 ride510\n",
"13 chambrey117\n",
"14 spacetreggins\n",
"15 jhonnyboxer\n",
"16 newskinny\n",
"17 clint\n",
"18 toddjkt\n",
"19 jacket2001\n",
"20 backtoschool\n",
"21 reidunparka\n",
"22 mixmatch\n",
"23 reidunparkalining\n",
"24 parkochino\n",
"25 camdenjkt\n",
"27 cb32405510\n",
"28 bomberjkt\n",
"29 5pktcord\n",
"30 shairt615\n",
"31 shairt537\n",
"32 shairt564\n",
"33 lovetrs340\n",
"34 1zb153tpant\n",
"35 1zb153tjkt\n",
"36 jacketswear\n",
"37 jacketbktoschoolboys\n",
"38 jacketbktoschoolgirls\n",
"39 jacket253ke\n",
"40 14011876\n",
"dtype: object"
]
}
],
"prompt_number": 28
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"####Levenshtein Distance\n",
"Firstly I turn to the Levenshtein distance"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"LMatches1 = {_style:[] for _style in Styles}\n",
"for _style in LMatches1.keys():\n",
" for row in Styles.index:\n",
" if _style != Styles[row] and jf.levenshtein_distance(_style, Styles[row]) < 5:\n",
" LMatches1[_style].append(Styles[row])\n",
" "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 29
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"LMatches1"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 30,
"text": [
"{'14011876': [],\n",
" '1zb153tjkt': ['1zb153tpant'],\n",
" '1zb153tpant': ['1zb153tjkt'],\n",
" '2618longerdenim': [],\n",
" '550pant': [],\n",
" '5pktcord': [],\n",
" 'backtoschool': [],\n",
" 'bomberjkt': ['camdenjkt'],\n",
" 'camdenjkt': ['bomberjkt'],\n",
" 'cb32405510': [],\n",
" 'chambrey117': [],\n",
" 'clint': ['ride'],\n",
" 'denimjacket': [],\n",
" 'denimvest': [],\n",
" 'dobbyshort': ['jonasshort'],\n",
" 'jacket2001': ['jacket2002', 'jacket253ke'],\n",
" 'jacket2002': ['jacket2001', 'jacket253ke'],\n",
" 'jacket253ke': ['jacket2002', 'jacket2001'],\n",
" 'jacketbktoschoolboys': ['jacketbktoschoolgirls'],\n",
" 'jacketbktoschoolgirls': ['jacketbktoschoolboys'],\n",
" 'jacketswear': [],\n",
" 'jhonnyboxer': [],\n",
" 'jonaslong': ['jonasshort'],\n",
" 'jonasshort': ['jonaslong', 'dobbyshort'],\n",
" 'lilentrs': [],\n",
" 'lovetrs340': [],\n",
" 'mixmatch': [],\n",
" 'newskinny': [],\n",
" 'parkerchino': ['parkochino'],\n",
" 'parkochino': ['parkerchino'],\n",
" 'reidunparka': [],\n",
" 'reidunparkalining': [],\n",
" 'ride': ['ride510', 'clint'],\n",
" 'ride510': ['ride'],\n",
" 'shairt537': ['shairt615', 'shairt564'],\n",
" 'shairt564': ['shairt615', 'shairt537'],\n",
" 'shairt615': ['shairt537', 'shairt564'],\n",
" 'spacetreggins': [],\n",
" 'toddjkt': [],\n",
" 'tyedyejean': []}"
]
}
],
"prompt_number": 30
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above has matched styles based upon whether the levenshtein score is less than 5. What does this tell me?\n",
"\n",
"Well firstly, it seems like to make the score useful, it should be normalized by some measure of the length of the string. For example, it seems preety clear that for a style with a name like 'ride', it will always be possible to match the string to any other string of the same length. \n",
"\n",
"This is why it is stated in the Wikipedia entry that it is best for comparing strings of similar length. One strategy could be to match those strings that have fewer edits than some proportion of the average length of the two strings being matched.\n",
"\n",
"For example if the condition was that only for scores of less than 0.35 times the average length of the strings would be considered matched, then this would mean that there would only be a match if fewer than 3.5 edits were needed on two strings of average length 10.\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"LMatches2 = {_style:[] for _style in Styles}\n",
"for _style in LMatches2.keys():\n",
" for row in Styles.index:\n",
" if _style != Styles[row] and jf.levenshtein_distance(_style, Styles[row]) < \\\n",
" ((len(_style) + len(Styles[row])/2)*0.35):\n",
" LMatches2[_style].append(Styles[row])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 31
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"LMatches2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 32,
"text": [
"{'14011876': [],\n",
" '1zb153tjkt': ['1zb153tpant'],\n",
" '1zb153tpant': ['1zb153tjkt'],\n",
" '2618longerdenim': [],\n",
" '550pant': [],\n",
" '5pktcord': [],\n",
" 'backtoschool': [],\n",
" 'bomberjkt': ['camdenjkt'],\n",
" 'camdenjkt': ['bomberjkt'],\n",
" 'cb32405510': [],\n",
" 'chambrey117': [],\n",
" 'clint': [],\n",
" 'denimjacket': ['denimvest'],\n",
" 'denimvest': [],\n",
" 'dobbyshort': ['jonasshort'],\n",
" 'jacket2001': ['jacket2002', 'jacketswear', 'jacket253ke'],\n",
" 'jacket2002': ['jacket2001', 'jacketswear', 'jacket253ke'],\n",
" 'jacket253ke': ['jacket2002', 'jacket2001', 'jacketswear'],\n",
" 'jacketbktoschoolboys': ['backtoschool', 'jacketbktoschoolgirls'],\n",
" 'jacketbktoschoolgirls': ['jacketbktoschoolboys'],\n",
" 'jacketswear': ['jacket2002', 'jacket2001', 'jacket253ke'],\n",
" 'jhonnyboxer': [],\n",
" 'jonaslong': ['jonasshort'],\n",
" 'jonasshort': ['jonaslong', 'dobbyshort'],\n",
" 'lilentrs': [],\n",
" 'lovetrs340': [],\n",
" 'mixmatch': [],\n",
" 'newskinny': [],\n",
" 'parkerchino': ['parkochino'],\n",
" 'parkochino': ['parkerchino'],\n",
" 'reidunparka': ['reidunparkalining'],\n",
" 'reidunparkalining': ['reidunparka'],\n",
" 'ride': [],\n",
" 'ride510': ['ride'],\n",
" 'shairt537': ['shairt615', 'shairt564'],\n",
" 'shairt564': ['shairt615', 'shairt537'],\n",
" 'shairt615': ['shairt537', 'shairt564'],\n",
" 'spacetreggins': [],\n",
" 'toddjkt': [],\n",
" 'tyedyejean': []}"
]
}
],
"prompt_number": 32
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I'm not sure I am approaching this in the most scientific way, I am simply playing with the numbers. If the multiplier is too large, then too many matches are made, too small and some are missed. For example, I would hope that 'Ride' and 'Ride510' would be matched, but in order to get there a multiplier of 0.5 is needed, and that creates too many matches. \n",
"\n",
"I will return here later. Let's now look to see how the second measure holds up"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"####Damerau Levenshtein Distance Distance\n",
"Now I turn to the Damerau Levenshtein distance"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DLMatches1 = {_style:[] for _style in Styles}\n",
"for _style in DLMatches1.keys():\n",
" for row in Styles.index:\n",
" if _style != Styles[row] and jf.damerau_levenshtein_distance(_style, Styles[row]) < 5:\n",
" DLMatches1[_style].append(Styles[row])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 33
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DLMatches1"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 34,
"text": [
"{'14011876': [],\n",
" '1zb153tjkt': ['1zb153tpant'],\n",
" '1zb153tpant': ['1zb153tjkt'],\n",
" '2618longerdenim': [],\n",
" '550pant': [],\n",
" '5pktcord': [],\n",
" 'backtoschool': [],\n",
" 'bomberjkt': ['camdenjkt'],\n",
" 'camdenjkt': ['bomberjkt'],\n",
" 'cb32405510': [],\n",
" 'chambrey117': [],\n",
" 'clint': ['ride'],\n",
" 'denimjacket': [],\n",
" 'denimvest': [],\n",
" 'dobbyshort': ['jonasshort'],\n",
" 'jacket2001': ['jacket2002', 'jacket253ke'],\n",
" 'jacket2002': ['jacket2001', 'jacket253ke'],\n",
" 'jacket253ke': ['jacket2002', 'jacket2001'],\n",
" 'jacketbktoschoolboys': ['jacketbktoschoolgirls'],\n",
" 'jacketbktoschoolgirls': ['jacketbktoschoolboys'],\n",
" 'jacketswear': [],\n",
" 'jhonnyboxer': [],\n",
" 'jonaslong': ['jonasshort'],\n",
" 'jonasshort': ['jonaslong', 'dobbyshort'],\n",
" 'lilentrs': [],\n",
" 'lovetrs340': [],\n",
" 'mixmatch': [],\n",
" 'newskinny': [],\n",
" 'parkerchino': ['parkochino'],\n",
" 'parkochino': ['parkerchino'],\n",
" 'reidunparka': [],\n",
" 'reidunparkalining': [],\n",
" 'ride': ['ride510', 'clint'],\n",
" 'ride510': ['ride'],\n",
" 'shairt537': ['shairt615', 'shairt564'],\n",
" 'shairt564': ['shairt615', 'shairt537'],\n",
" 'shairt615': ['shairt537', 'shairt564'],\n",
" 'spacetreggins': [],\n",
" 'toddjkt': [],\n",
" 'tyedyejean': []}"
]
}
],
"prompt_number": 34
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, the same problems with the short strings seems to have happened. But what, if anything is the difference between the output when using the two measures?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for key in DLMatches1:\n",
" if LMatches1[key] != DLMatches1[key]:\n",
" print 'This was the Leven result ' + key + \" \" + str(LMatches1[key])\n",
" print 'This was the D - Leven result ' + key + \" \" + str(DLMatches1[key])\n",
" "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 35
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In fact they were identical.\n",
"\n",
"Now looking at the more nuanced measure matching based upon a proportion of the average string length:\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DLMatches2 = {_style:[] for _style in Styles}\n",
"for _style in DLMatches2.keys():\n",
" for row in Styles.index:\n",
" if _style != Styles[row] and jf.damerau_levenshtein_distance(_style, Styles[row]) < \\\n",
" ((len(_style) + len(Styles[row])/2)*0.35):\n",
" DLMatches2[_style].append(Styles[row])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 36
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DLMatches2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 37,
"text": [
"{'14011876': [],\n",
" '1zb153tjkt': ['1zb153tpant'],\n",
" '1zb153tpant': ['1zb153tjkt'],\n",
" '2618longerdenim': [],\n",
" '550pant': [],\n",
" '5pktcord': [],\n",
" 'backtoschool': [],\n",
" 'bomberjkt': ['camdenjkt'],\n",
" 'camdenjkt': ['bomberjkt'],\n",
" 'cb32405510': [],\n",
" 'chambrey117': [],\n",
" 'clint': [],\n",
" 'denimjacket': ['denimvest'],\n",
" 'denimvest': [],\n",
" 'dobbyshort': ['jonasshort'],\n",
" 'jacket2001': ['jacket2002', 'jacketswear', 'jacket253ke'],\n",
" 'jacket2002': ['jacket2001', 'jacketswear', 'jacket253ke'],\n",
" 'jacket253ke': ['jacket2002', 'jacket2001', 'jacketswear'],\n",
" 'jacketbktoschoolboys': ['backtoschool', 'jacketbktoschoolgirls'],\n",
" 'jacketbktoschoolgirls': ['jacketbktoschoolboys'],\n",
" 'jacketswear': ['jacket2002', 'jacket2001', 'jacket253ke'],\n",
" 'jhonnyboxer': [],\n",
" 'jonaslong': ['jonasshort'],\n",
" 'jonasshort': ['jonaslong', 'dobbyshort'],\n",
" 'lilentrs': [],\n",
" 'lovetrs340': [],\n",
" 'mixmatch': [],\n",
" 'newskinny': [],\n",
" 'parkerchino': ['parkochino'],\n",
" 'parkochino': ['parkerchino'],\n",
" 'reidunparka': ['reidunparkalining'],\n",
" 'reidunparkalining': ['reidunparka'],\n",
" 'ride': [],\n",
" 'ride510': ['ride'],\n",
" 'shairt537': ['shairt615', 'shairt564'],\n",
" 'shairt564': ['shairt615', 'shairt537'],\n",
" 'shairt615': ['shairt537', 'shairt564'],\n",
" 'spacetreggins': [],\n",
" 'toddjkt': [],\n",
" 'tyedyejean': []}"
]
}
],
"prompt_number": 37
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, this is just trial and error, but let's see if the matches are different when the same multiplier is used (0.35):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for key in DLMatches2:\n",
" if LMatches2[key] != DLMatches2[key]:\n",
" print 'This was the Leven result ' + key + \" \" + str(LMatches2[key])\n",
" print 'This was the D - Leven result ' + key + \" \" + str(DLMatches2[key])\n",
" print"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 38
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, no difference. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"####Jaro Distance\n",
"Now I turn to the Jaro distance.\n",
"\n",
"For the Jaro distance again some arbitrary score has be chosen as the cutoff"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"JMatches1 = {_style:[] for _style in Styles}\n",
"for _style in JMatches1.keys():\n",
" for row in Styles.index:\n",
" if _style != Styles[row] and jf.jaro_distance(_style, Styles[row]) > 0.75:\n",
" JMatches1[_style].append(Styles[row])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 39
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"JMatches1"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 40,
"text": [
"{'14011876': [],\n",
" '1zb153tjkt': ['1zb153tpant'],\n",
" '1zb153tpant': ['1zb153tjkt'],\n",
" '2618longerdenim': [],\n",
" '550pant': [],\n",
" '5pktcord': [],\n",
" 'backtoschool': ['jacketbktoschoolboys', 'jacketbktoschoolgirls'],\n",
" 'bomberjkt': [],\n",
" 'camdenjkt': [],\n",
" 'cb32405510': [],\n",
" 'chambrey117': [],\n",
" 'clint': ['lilentrs'],\n",
" 'denimjacket': ['denimvest'],\n",
" 'denimvest': ['denimjacket'],\n",
" 'dobbyshort': [],\n",
" 'jacket2001': ['jacket2002', 'jacket253ke'],\n",
" 'jacket2002': ['jacket2001', 'jacket253ke'],\n",
" 'jacket253ke': ['jacket2002', 'jacket2001', 'jacketswear'],\n",
" 'jacketbktoschoolboys': ['backtoschool', 'jacketbktoschoolgirls'],\n",
" 'jacketbktoschoolgirls': ['backtoschool', 'jacketbktoschoolboys'],\n",
" 'jacketswear': ['jacket253ke'],\n",
" 'jhonnyboxer': [],\n",
" 'jonaslong': ['jonasshort'],\n",
" 'jonasshort': ['jonaslong'],\n",
" 'lilentrs': ['clint'],\n",
" 'lovetrs340': [],\n",
" 'mixmatch': [],\n",
" 'newskinny': [],\n",
" 'parkerchino': ['parkochino'],\n",
" 'parkochino': ['parkerchino'],\n",
" 'reidunparka': ['reidunparkalining'],\n",
" 'reidunparkalining': ['reidunparka'],\n",
" 'ride': ['ride510'],\n",
" 'ride510': ['ride'],\n",
" 'shairt537': ['shairt615', 'shairt564'],\n",
" 'shairt564': ['shairt615', 'shairt537'],\n",
" 'shairt615': ['shairt537', 'shairt564'],\n",
" 'spacetreggins': [],\n",
" 'toddjkt': [],\n",
" 'tyedyejean': []}"
]
}
],
"prompt_number": 40
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, this is where my non-scientific approach breaks down. I have arbitrarily chosen a figure of greater than 0.75. However, it seems to me like the results are a lot closer to what I would hope. In fact they match very closely the questions I actually sent to this factory about their style names, that I did simply by eyeballing the data. \n",
"\n",
"let's compare:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for key in JMatches1:\n",
" if DLMatches2[key] != JMatches1[key]:\n",
" print 'This was the D- Leven result ' + key + \" \" + str(DLMatches2[key])\n",
" print 'This was the Jaro result ' + key + \" \" + str(JMatches1[key])\n",
" print"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"This was the D- Leven result lilentrs []\n",
"This was the Jaro result lilentrs ['clint']\n",
"\n",
"This was the D- Leven result clint []\n",
"This was the Jaro result clint ['lilentrs']\n",
"\n",
"This was the D- Leven result denimvest []\n",
"This was the Jaro result denimvest ['denimjacket']\n",
"\n",
"This was the D- Leven result jacketswear ['jacket2002', 'jacket2001', 'jacket253ke']\n",
"This was the Jaro result jacketswear ['jacket253ke']\n",
"\n",
"This was the D- Leven result camdenjkt ['bomberjkt']\n",
"This was the Jaro result camdenjkt []\n",
"\n",
"This was the D- Leven result backtoschool []\n",
"This was the Jaro result backtoschool ['jacketbktoschoolboys', 'jacketbktoschoolgirls']\n",
"\n",
"This was the D- Leven result jonasshort ['jonaslong', 'dobbyshort']\n",
"This was the Jaro result jonasshort ['jonaslong']\n",
"\n",
"This was the D- Leven result jacketbktoschoolgirls ['jacketbktoschoolboys']\n",
"This was the Jaro result jacketbktoschoolgirls ['backtoschool', 'jacketbktoschoolboys']\n",
"\n",
"This was the D- Leven result jacket2002 ['jacket2001', 'jacketswear', 'jacket253ke']\n",
"This was the Jaro result jacket2002 ['jacket2001', 'jacket253ke']\n",
"\n",
"This was the D- Leven result bomberjkt ['camdenjkt']\n",
"This was the Jaro result bomberjkt []\n",
"\n",
"This was the D- Leven result ride []\n",
"This was the Jaro result ride ['ride510']\n",
"\n",
"This was the D- Leven result jacket2001 ['jacket2002', 'jacketswear', 'jacket253ke']\n",
"This was the Jaro result jacket2001 ['jacket2002', 'jacket253ke']\n",
"\n",
"This was the D- Leven result dobbyshort ['jonasshort']\n",
"This was the Jaro result dobbyshort []\n",
"\n"
]
}
],
"prompt_number": 41
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It seems to me as thought the Jaro result is actually much more in line with what I would consider to be actual matches. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"####Jaro -Winkler Distance\n",
"Now I turn to the Jaro Winkler distance."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"JWMatches1 = {_style:[] for _style in Styles}\n",
"for _style in JWMatches1.keys():\n",
" for row in Styles.index:\n",
" if _style != Styles[row] and jf.jaro_winkler(_style, Styles[row]) > 0.75:\n",
" JWMatches1[_style].append(Styles[row])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 42
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"JWMatches1"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 43,
"text": [
"{'14011876': [],\n",
" '1zb153tjkt': ['1zb153tpant'],\n",
" '1zb153tpant': ['1zb153tjkt'],\n",
" '2618longerdenim': [],\n",
" '550pant': [],\n",
" '5pktcord': [],\n",
" 'backtoschool': ['jacketbktoschoolboys', 'jacketbktoschoolgirls'],\n",
" 'bomberjkt': [],\n",
" 'camdenjkt': [],\n",
" 'cb32405510': [],\n",
" 'chambrey117': [],\n",
" 'clint': ['lilentrs'],\n",
" 'denimjacket': ['denimvest'],\n",
" 'denimvest': ['denimjacket'],\n",
" 'dobbyshort': [],\n",
" 'jacket2001': ['jacket2002', 'jacketswear', 'jacket253ke'],\n",
" 'jacket2002': ['jacket2001', 'jacketswear', 'jacket253ke'],\n",
" 'jacket253ke': ['jacket2002', 'jacket2001', 'jacketswear'],\n",
" 'jacketbktoschoolboys': ['backtoschool', 'jacketbktoschoolgirls'],\n",
" 'jacketbktoschoolgirls': ['backtoschool',\n",
" 'jacketswear',\n",
" 'jacketbktoschoolboys'],\n",
" 'jacketswear': ['jacket2002',\n",
" 'jacket2001',\n",
" 'jacketbktoschoolgirls',\n",
" 'jacket253ke'],\n",
" 'jhonnyboxer': [],\n",
" 'jonaslong': ['jonasshort'],\n",
" 'jonasshort': ['jonaslong'],\n",
" 'lilentrs': ['clint'],\n",
" 'lovetrs340': [],\n",
" 'mixmatch': [],\n",
" 'newskinny': [],\n",
" 'parkerchino': ['parkochino'],\n",
" 'parkochino': ['parkerchino'],\n",
" 'reidunparka': ['reidunparkalining'],\n",
" 'reidunparkalining': ['reidunparka'],\n",
" 'ride': ['ride510'],\n",
" 'ride510': ['ride'],\n",
" 'shairt537': ['shairt615', 'shairt564'],\n",
" 'shairt564': ['shairt615', 'shairt537'],\n",
" 'shairt615': ['shairt537', 'shairt564'],\n",
" 'spacetreggins': [],\n",
" 'toddjkt': [],\n",
" 'tyedyejean': []}"
]
}
],
"prompt_number": 43
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It seems to me that as the J-W score gives a boost for those matches with a similar prefix, that, the threshold score for matching could be pushed a little higher, to eliminate more spurious matches. But first let's see how it compares with the Jaro measure:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for key in JWMatches1:\n",
" if JMatches1[key] != JWMatches1[key]:\n",
" print 'This was the Jaro result ' + key + \" \" + str(JMatches1[key])\n",
" print 'This was the Jaro-Winkler result ' + key + \" \" + str(JWMatches1[key])\n",
" print"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"This was the Jaro result jacketswear ['jacket253ke']\n",
"This was the Jaro-Winkler result jacketswear ['jacket2002', 'jacket2001', 'jacketbktoschoolgirls', 'jacket253ke']\n",
"\n",
"This was the Jaro result jacketbktoschoolgirls ['backtoschool', 'jacketbktoschoolboys']\n",
"This was the Jaro-Winkler result jacketbktoschoolgirls ['backtoschool', 'jacketswear', 'jacketbktoschoolboys']\n",
"\n",
"This was the Jaro result jacket2002 ['jacket2001', 'jacket253ke']\n",
"This was the Jaro-Winkler result jacket2002 ['jacket2001', 'jacketswear', 'jacket253ke']\n",
"\n",
"This was the Jaro result jacket2001 ['jacket2002', 'jacket253ke']\n",
"This was the Jaro-Winkler result jacket2001 ['jacket2002', 'jacketswear', 'jacket253ke']\n",
"\n"
]
}
],
"prompt_number": 44
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The prefix boost is associated with J-W measure is pushing styles with the same prefix higher. As the prefixes are likely to be common when working with styles, perhaps this is a disadvantage of that method. \n",
"\n",
"One fix, might be to push the threshold higher. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"JWMatches2 = {_style:[] for _style in Styles}\n",
"for _style in JWMatches2.keys():\n",
" for row in Styles.index:\n",
" if _style != Styles[row] and jf.jaro_winkler(_style, Styles[row]) > 0.8:\n",
" JWMatches2[_style].append(Styles[row])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 45
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"JWMatches2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 46,
"text": [
"{'14011876': [],\n",
" '1zb153tjkt': ['1zb153tpant'],\n",
" '1zb153tpant': ['1zb153tjkt'],\n",
" '2618longerdenim': [],\n",
" '550pant': [],\n",
" '5pktcord': [],\n",
" 'backtoschool': ['jacketbktoschoolboys', 'jacketbktoschoolgirls'],\n",
" 'bomberjkt': [],\n",
" 'camdenjkt': [],\n",
" 'cb32405510': [],\n",
" 'chambrey117': [],\n",
" 'clint': [],\n",
" 'denimjacket': ['denimvest'],\n",
" 'denimvest': ['denimjacket'],\n",
" 'dobbyshort': [],\n",
" 'jacket2001': ['jacket2002', 'jacketswear', 'jacket253ke'],\n",
" 'jacket2002': ['jacket2001', 'jacketswear', 'jacket253ke'],\n",
" 'jacket253ke': ['jacket2002', 'jacket2001', 'jacketswear'],\n",
" 'jacketbktoschoolboys': ['backtoschool', 'jacketbktoschoolgirls'],\n",
" 'jacketbktoschoolgirls': ['backtoschool',\n",
" 'jacketswear',\n",
" 'jacketbktoschoolboys'],\n",
" 'jacketswear': ['jacket2002',\n",
" 'jacket2001',\n",
" 'jacketbktoschoolgirls',\n",
" 'jacket253ke'],\n",
" 'jhonnyboxer': [],\n",
" 'jonaslong': ['jonasshort'],\n",
" 'jonasshort': ['jonaslong'],\n",
" 'lilentrs': [],\n",
" 'lovetrs340': [],\n",
" 'mixmatch': [],\n",
" 'newskinny': [],\n",
" 'parkerchino': ['parkochino'],\n",
" 'parkochino': ['parkerchino'],\n",
" 'reidunparka': ['reidunparkalining'],\n",
" 'reidunparkalining': ['reidunparka'],\n",
" 'ride': ['ride510'],\n",
" 'ride510': ['ride'],\n",
" 'shairt537': ['shairt615', 'shairt564'],\n",
" 'shairt564': ['shairt615', 'shairt537'],\n",
" 'shairt615': ['shairt537', 'shairt564'],\n",
" 'spacetreggins': [],\n",
" 'toddjkt': [],\n",
" 'tyedyejean': []}"
]
}
],
"prompt_number": 46
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for key in JWMatches2:\n",
" if JMatches1[key] != JWMatches2[key]:\n",
" print 'This was the Jaro result ' + key + \" \" + str(JMatches1[key])\n",
" print 'This was the Jaro-Winkler (0.8) result ' + key + \" \" + str(JWMatches2[key])\n",
" print"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"This was the Jaro result lilentrs ['clint']\n",
"This was the Jaro-Winkler (0.8) result lilentrs []\n",
"\n",
"This was the Jaro result clint ['lilentrs']\n",
"This was the Jaro-Winkler (0.8) result clint []\n",
"\n",
"This was the Jaro result jacketswear ['jacket253ke']\n",
"This was the Jaro-Winkler (0.8) result jacketswear ['jacket2002', 'jacket2001', 'jacketbktoschoolgirls', 'jacket253ke']\n",
"\n",
"This was the Jaro result jacketbktoschoolgirls ['backtoschool', 'jacketbktoschoolboys']\n",
"This was the Jaro-Winkler (0.8) result jacketbktoschoolgirls ['backtoschool', 'jacketswear', 'jacketbktoschoolboys']\n",
"\n",
"This was the Jaro result jacket2002 ['jacket2001', 'jacket253ke']\n",
"This was the Jaro-Winkler (0.8) result jacket2002 ['jacket2001', 'jacketswear', 'jacket253ke']\n",
"\n",
"This was the Jaro result jacket2001 ['jacket2002', 'jacket253ke']\n",
"This was the Jaro-Winkler (0.8) result jacket2001 ['jacket2002', 'jacketswear', 'jacket253ke']\n",
"\n"
]
}
],
"prompt_number": 47
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In fact no, increasing even 0.05 on the J-W cutoff is not enough to get rid of the effect of the prefix, and it has the deleterious effect of removing matches for 1ZB153TJKT and 1ZB153TPant."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br />"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Conclusions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is too soon to say exactly which of the measures will be most useful, and to what extent they can help. If you are reading this notebook as a non-colleague, then the usefulness of the measures will depend upon what type of strings you are analyzing, and I daresay that the measures I did not consider could be useful.\n",
"\n",
"For my purposes it seems like the Jaro measure will be the most promising, although of course, in reality I will use all four measures and compare their output to get a sense of where my strings might be matching. \n",
"\n",
"It is possible that I may return to this notebook and add further thoughts on experimentation at a later date. \n",
"\n",
"One thing that these algorithms may not help with is where style names are incorporated into much larger strings. To identify those as matches some sort of algorithm based upon the matching of sequences may be needed. More in this to follow. Possibly I will write my own.\n"
]
}
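,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As one possible starting point for that follow-up (a suggestion on my part, not something tested above), the standard library's difflib can already score how much of a short style name appears as a contiguous block inside a longer string:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import difflib\n",
"\n",
"def substring_score(needle, haystack):\n",
"    # fraction of `needle` covered by its longest common block with `haystack`\n",
"    m = difflib.SequenceMatcher(None, needle, haystack)\n",
"    block = m.find_longest_match(0, len(needle), 0, len(haystack))\n",
"    return block.size / float(len(needle))\n",
"\n",
"substring_score('ride510', 'ride510repeatorder')  # 1.0 (hypothetical longer string)"
],
"language": "python",
"metadata": {},
"outputs": []
}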
],
"metadata": {}
}
]
}