{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Developing A Fuzzy String Strategy Pt. 1 - Stripping Generic Components"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Image(r\"C:\\Users\\rcreedon\\Dropbox\\Rory Notes\\Notes\\JellyFish\\StringStrategy\\generic-drugs.png\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"png": "/9j/4AAQSkZJRgABAgAAZABkAAD/7AARRHVja3kAAQAEAAAAPAAA/+4ADkFkb2JlAGTAAAAAAf/b\nAIQABgQEBAUEBgUFBgkGBQYJCwgGBggLDAoKCwoKDBAMDAwMDAwQDA4PEA8ODBMTFBQTExwbGxsc\nHx8fHx8fHx8fHwEHBwcNDA0YEBAYGhURFRofHx8fHx8fHx8fHx8fHx8fHx8fHx8fHx8fHx8fHx8f\nHx8fHx8fHx8fHx8fHx8fHx8f/8AAEQgA+gEiAwERAAIRAQMRAf/EAKMAAQACAwEBAQAAAAAAAAAA\nAAACAwEEBQYHCAEBAQEBAQEBAAAAAAAAAAAAAAECAwQFBhAAAQMCBQEFBQUGAwYHAQAAAQARAgME\nITFBEgVRYXGBIgaRsTITB6HBQmIU0VJykiMV8KJD4cIzUyRE8YKyNFQlFggRAQEAAgEEAAUDBAEF\nAAAAAAABEQIDITESBEFRYSIFMhMUcYGRoVKxQiMzFf/aAAwDAQACEQMRAD8A/VKAgICAgICAgICA\ngICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICA\ngICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICA\ngICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICA\ngICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIKbm8tLWmal\nzWhQpjEzqSEQB3lWTLO281624ef5D6j+krIF7w15AttowlP2FhE+1bnFtXl29/inxy4Nz9Y+PjNr\nawq1Ig/FOQi/cBuXSevXl2/K6ztFMfrPS3MeInt1Irh/YYBP2Pqn/wBaf8f9ty3+sHDTkBXs69CJ\nLGbxmB7ipeCt6/ldPjK71p6/9IXRjGnyVMSllGoJwbTEyAC53j2erX3eK/8Ac7dC7tbgPQrU6o60\n5CQ+wrOHo12l7Vao0ICAgICAgICAgICAgICAgICAgICAgICAgICAg5nNepOH4alvvq4hIjyUY+ap\nLuiFrXW1x5vY04591fOOa+qHL3s5U+MiLG3xAkWlVI6vlHwXo14ZO743N+T3v6ftjx1/e17ur8y7\nrTrzxedSRkz9HXWTDwb8t271pynCWQZu1VyzhUZAnI9irOUt+LZg9en7VJC1mGRJBAyCVdal8w4k\ntgWwCYZuVtvd3VCZna1Z0ZguZU5mPuKljppy2dq7/G/UL1ZYTiBefqKEcTC4Am/ZuPm+1YvHrXs4\n/f5Nfjn+r2fDfV3jK22nytCVpUIxqwedN+74guW3Dfg+lxflNb+qYe147luM5Kl82xuadxDU05CR\nHeBiFxutnd9HTl13mdbltqNiAgICAgICAgICAgICAgICAgICAgICAg8X6z+oNvxfzbDjTGtyQDVJ\nZxpHt6y7F24+PPd8/wBv3Zp9uv6v+j5JcX91cVp17mqa9aoXnVkdxJK9OI+Fd7tc3uplWIGBYnJG\nLYqNUFnyGZGfgiZiJ3HIebTp4qysXqjAdS59ytSszmIt3+YjPwUa6JiRzk2GgRi9ETOQxZydFrCT\nbKUJTMcce4KVcJGWJGI1bsVZgJACO5n1UwvktoXFxa1Y16FSVKpEvTnAsYnq4WbHo05MdY9Zw/1Q\n9RWG2ndSjyNKIxjMNU/nH3rntwyvbx/k+TXv90e74T6leneSanWnKxuDhsrhoE9kw8W/iZcduGx9\nPi/I8e/e4r1VOpTqwjUpzE6cg8ZxIII7CFye6WXskiiAgICAgICAgICAgICAgICAgICDxP1F9bf2\nmh/bLCf/ANncRecx/o0zgZYfiOntXXj0z1rwe77XhPGfqv8Ap8clUwcl5S+LqfFenL4Gyr5odiw6\nt2K92VcqgnEy/CcMOxVmxkS8rksAFXKysjcA7mTDLqiyJyIAMTiMD3qLGbanUuJ/KtaU69Yu0KYM\nz9iW4bnHdu0dWn6T58x+ZWoU7WmfxXVaFMjwJdZ846z1N/j0/qlH0xX3GP8AdePjI4ECqT9yef0P\n4uv/AC1/yzX9Kc7SpSr06VO6px+KVpUFUgdwxU841fU3k6df6OQfLOQ2kTDCW5wRjqF07vJnHTCw\niR8pLgEHJMpIk40iw1f7lGrhEEBg2WJVQBO4gMAcXTCupxXqXmuJlusrqdIZ/KJ3QLflKzdJe7tw\n+1vx37a9xw31hp+WnzVoYaG5t8Q/bTOX8y47cHyfW4fys7bz/D3fFc/w/K0xOwuoVusAWmO+JaX2\nLhtrZ3fU4+fTf9Ny6Cy6iAgICAgICAgICAgICAgICDU5fkqPGcXdchWI+XbU5VCDg5Awj4nBXWZu\nGN95rrbfg/Nt1y1zyV3c390TK4uKhnInDVm7gF7cYmH5nm2u21rVlWOUvsRxyx8yRLDVVmpRnE4l\ngTojNlTjLEA4kdOiqVfRpV69WFKlA1a0y1OEcX9nRS9F047tcR35cPwvDxEuamb2/I3R46lJqcen\nzZ/cFz8rt2e28XHxfq+7b5Ne49V8mafyrL5fHW3/ACbaIh7SPMfatTSOW3t7dp9v9HOqQ5GcDcyp\nVqlJ3lXMZSi/8S3mR57xbbTKiJicQQSfix+5XLlNcO56Vt+Uq8nCdhP5VKljcVJERgIau+a58mMd\nXu9TXa75nTDPq274245UmziJxiNtWtHASkDiR1CcetkPc5Ndt+jjCcCThgcHK6SPFt9GRIHPDuLh\nMJJ0yzEl8MXGD4MFF8oyTHAAtHoq1AZCJkBGJ0Dkv2plmxPySJiD5j0RmQpzqW9UVaVSdGoMqsJG\nJfsbFHbW2dZXruF+p/qGwEYXRjyNvgGq+SqB2TiP/UCuW3DK+hw/k+TXv90e+4b6ieneR206lX9H\nclnpV2Ac6CQw9y8+3FY+tw+/x7/SvTRlGURKJEonIjELm9rKAgICAgICAgICAgICAg+f/Wjl6ln6\nZo2lL4r6vGNQHWlAGcv8wiu3DOuXzfyfJjjx86+HCqGjGLs52v2Beh8OREVOhJfMKsrISGAY7nyH\nvUylysE2fEZgEHNlWMrDOJLRwD+UJhJs9F6J5LjrG/qfrHpirHZTrEfBLt7Cs8mtse30ubXTbqs5\nb0rzsrqrdWm3kreoTMV6ExKZfHzRd1Nd58V5vV2ttn3RzDwnOw/4nH16Y1lKLMfFb8o8/wDH3+Tv\n8RaetbCkakJxsbEfF+tqRFIv+QufYFzt1r18XHy6TOfGfU5CHD8oPlWFhLkeWOFW4s4mjbiT5sXw\n70mY3yePJ011zt852c+fp/8ARy28vyFPj5SH/AjGVWoRmMA0ftWvPPZ57680/XcImz9JON17fVC3\nxQpwiO9jJWeTnf2fnt/hbV9N0q1pO64e8N5Rp
41aFWGyrEDHTBPPFxWv4021u2lzhxJOI+Vh7l0y\n8kqQ3HNy/VMs2JjbGWIZ8go12iRIyY9jJhJWIymCWxxxOSqbZZjEFyR3SfVCWpSESDj39HSLWASQ\n5juIzGbjtSxZXa4H1lz/AA02tq8jbRb/AKOsN1MDoBnHwIWNuObPVw+7ycXa9H0fgvqjwt9spcgP\n0FxL8Ui9In+LTxXm34bOz7Pr/lNN+m323/T2VOrTqwE6chOEsYyiXB8QuL6UsvZJFEBAQEBAQEBA\nQEBB8l+v85Rs+FiMROtVcfwxH7V34fi+V+U7avj5qguTkMB3dF1w+R2iQmREAOND1ZWM1KNfy7g5\n0YZphipCTkvrmOnctZSa4WwmHcEFs5dnTvUXbC6Mn1LOy044xVtG4rUpNTqSgdDEkF/BK3JXSs/U\nF3aCRMY3Nf8A0q1fdM0zrtiTt+xSyV24+XbX6t7hY1vUXMQhyV3KQIMts5YyI/DHQOs7fbOjrwf+\nXk++r+f5fkrOZ42hQnxdhQlthSpxMN7filIfE6mms7r7XLvL4yeOqPH+pOYuYC1uLM8xbYAUqtIz\nkB+WYDhXbWM8XPyXpZ5RmvZekK5Mv1FfhqsT/VtasDVj2iGo8VM7N3Xh2+etdCMLypxsrPhrf+2c\nTL/3HKX0tk6w7HxbsCkxnN610st08dPt0+d+LVhaejrEf9TdT5GuRlCBFGJ7iYmXtV+6uOn7Gne+\nVa1Kx9O3VUwo8lVtqs5f0xVpD5TnKPlOAVuYzrpxb3Gcf2avJcTf8ZX+XcwB3h6dWJwkOo6LWu0r\nl7PBeO4rRJA0Lah8Ft5okNz44fsTDV2ZluJAAALNtPvRGYOxBiScsUMpNEDy4Dp0CYTyYIEpeU7t\nZDtStzaJEDEPu6ghTC2unw/qfm+FMZcfdSjByZUJvOmcf3Th45rO3HL3d/X9rfjv219G9PfVPjbz\nbQ5WIsrjD+qCTRJPacYrz78FnZ9rg/J67dN+l/09vSq0qtMVKUxOEg8ZRLgjvC4Ppyy9kkUQEBAQ\nEBAQEBB8w+vth8z03x/IAP8AoLsfMPSnVhKJ/wAwiu3D3fM/J6Z0l+VfDBPc5JfB8dV3fEwngTIl\nxiCHyVSVP5jAtEuMpuGIUTCUanwnMHrmgnCZIxbDFz25JVkTFaUWMpBz7T3LUuWbFkaoYybIP2no\nhKv+dFmJ2yGgGCMrIV9khKMtshpq/YrEuXYoer+dpw+XG8nKAA8tRpt/MCs3SO2vsck+Lat+f5G8\nqGHIcvXs7fEn5McS/wCECG3Tql1nwjWvs237trIl/e+OsS/E2W+4P/f3rVapbWMT5IpNPmX2tZ+i\nf3rlXvKchfVBK6uJ3NWeEQXl7IrWJOzjd9t7m9XRtvTPJTpfPvJQ4y0If51yWkR+Wl8RWfP5Ov8A\nGvfb7Z9XR4q3sadV+C4+tzF5T/764GyhCXWMMvaVja/N6uGaz/1zyvzqjlbada9lU9QcxGheDO2o\n05VJRGe38MBhlirrflHPm4+ueTbqoFr6UPkNzfTBbz7abduC19zjJw/Uu/T0adnK/wCMuzeWMP8A\niRnHbUh/EMsEm/XFa5PVnh563McndEAOcI4OMMF0eIiJbzgGJLEF8AhstIhg4Da9T3o5wkMQ2OGK\nNYSiBhucPg/YlhlibOHxB8w8M0i61ECIJzkZBp9QFaTbq6/Aeq+a4IxFhWP6fE/pqnmpyx6HLwZc\nttJe72cHt78d6Xo+oemPqNxPMSjbXA/RXxwFOZeEj+WXb0K82/FY+9635DTk6XpXrVye8QEBAQEB\nAQEHlvqfxH929CcvaAEzFA1oAZ7qXn/3VvjuNnn9rXPHX5Zs7rfCMpkb5xG7VmXpz1fn942t46YZ\nhVyDIRcA55g5eCsqzVMyiHMQ4wzUymEnaWRPQoJ7pOJHGUciRgtRmpxqEfEfbgguEyCDqc2yx1TL\nnYu3gRLM8deqRpZGUTLcJZYABXKLoTEDtdwcQmWNtFoJOOhzL6dEqazDrWPP3HH2/wAqzt6NKuM7\nvYJVW7JSdljxz3enX2LrMTH9fi3fTlj/AH3kqlTka8q84DeKEpeaqf3QSclOS4nR29Tj15N873Kr\nmua5k1zb1aVbj6NuWo2lOJpwiBkSQ27vV0kcvY5OW7Y7SfBsWPqDk76mLW7488xbgZTpSlOIH7tQ\nBwl1k+jfHy8t6bTziudj6NnIVhdXVgxapYmPzJH8sJftTOx48Nues+jo1aHKXnHQtbahT4LggX33\nc2qVdd0gfNInoFJiX51131231x+jj+rXp2fpG0g36z9bcEN86dOZpxf92Ab7UvlWJeDSYzmtOfE8\nNOJlb8zGdbHbb/p6m6R0jERdXys7xx/Y49uuu3X+ladxxt9aTpi+oyt4VA9MSbeR12u48V0m0vZ5\n9+K6Y8ovuqnBigKdna1alxnK6uJ6jSMIYe1ZkuerfLvx+ONZ/doxMXxbHMaLdryyMSBbHT4WwVbh\nBznidSotIwMTIv5SSyVM1gkxY7gJaHIg9VJGpl7v0Z9TLmznCw5uRq2bxhC8OM6e7Abv3o/aFx5O\nGXrH2PT/ACN1xrv2+b6zGUZREokSjIPGQxBB1C8j78rKAgICAgICCFelTrUZ0akd1OpEwnE5GMgx\nCJZno/FfNWlXief5HjZ/FZXVSiJjURkWI8F67XwNtMXDNG9FRhIvLSR9yZcfDDbhIGOLBstUrNq8\nEZ7gRkRoX17FXJh4jAlmGGviqLxIRkBhli5UTATPIgSAbbI49vuWmcLIS24wi8Ti5LuplcJwqbjE\nHJy4OSsrNytjKQLBgRl07lc5SrYVidM0xhnK6E2k4YkYt+1VjK2FUbnk/YrhrK63qzpy3wmYEHCQ\nLMfepSV1oeqfUMIfLjdTMB8EagEyP5gVnx1d9fZ5J0y2KHMV+QenynLXNC2Af5dCLmbaMNsQpdcd\no1pzXb9e1wkOd4+xEhw1jGhPF72v/WrHtD4RTwz3L7U1/RMfW92hEcvzlyTShXv65znmI98j5Yhb\nzNXKacnJfnW9/ZuNsWPLX26t/wDBsv6kj2Sn8I8Fjyt7Os4NNf1XN+UdSjHkLehus6Nt6csphxc1\nzuuqkTqMDP2ALPf6vTPLWdMcc/25U6XpoVZVLu8u+Qrj4qkAKcCTmRuJLLf3PPf2s9bdkRL0hI7a\nlC+oBsasKsJN27TH71fHZz/c4c4xsjynp+px1Ond0q/6iyuG+TUlHZNs8R3Jpv5dGva9accm0vSu\nZjm/gei6PFJgYAs/sQtN3a2mKzhfJiUBrtlIYvm3cqvnQbxjIuPw9qlay+mfSv1XVnM8Feyx2mdg\nSX8sfipuf5h4rzc2nxj7v4z2s/Zf7PpS877IgICAgICDEjgg/LH154s8d9Qa9eMdtLkaNOuNN0gN\nkm/lXo1udXxvZ0xyX6vnYqg4l+g7Ew5roXVWmYuRKP3J5JeOVvW3JwkQzP8AYt5cLxYb1OtvIMSG\nIxLj
NMud1WU5wOJxJzBzLdFrDntUzKIOZEtC+XeFYylSIjhjF8ipVqyU4y8ssupSMpxk+cuuYx71\nZMM+WVtOqzsXbNaYtTjsbeA75kFLWlwqggRdh9rpKzhOlPzbpEbtWyWjo2o3DxO7F/as+JdkgQDk\nRE6g4qs9VlORcFhtByP3plZHTufUHL3Vv8kzjRto4ChRanDxjFnPeszWOvJz77TGejs+nLSI4mty\nHGxhdcxF9tKbbofmjE5rG969ez2+npJx269d3AubTnbq7nK5trqrcyLykac5S932ZLtNpI8O/HyW\n9ZbWx/8Anecp0J3Fe2/T0duJqyjCTPltJdZ85V/j8kmbMNS3qRoVoVdsKvyy4jMCUS3UHNavV59L\n43K/leYv+UrfMui4hhSpx8sYjpGKa6zV05efblua0wZRk3X/AAUuHGZjEo7B2duCTYxhAznkAfzY\nKrIjTJ/FiRoVNnSRI+XEYAaO+JUTVKF9cWdend20jG4t5wrU5DMSpnd7sFix103utlneP0dxt7C+\n461vaYaFzShWiOgnESbwdeKzFfsNNvLWX5thRoQEBAQEEZ5IPhf/APTHFGVhxPLwABoVZW1eeuyo\nN0ftifau3Fe7we7p2r4JsbBmWrXiZkCSC+RdlI15ALkh9pVSraVatTIIk4AxByVyxiOhQ5KJIBDN\n/gqzZx34m9C6pS+KTRODAYrWXLwsXU9u1oydji5ZarntnK0kGOpI00WZUSpzlhLEgPEuc3VZSjMP\ntGQB3OGWolidKqIYAfbgluVwsFUGTN8WUtEkZsWipEYNlqM0MLRMSAlkRkrlitiNUMC+ILEBz4qp\nVtOtiAYkyGQ7OqlWRfDZIPNzg8Wb4u1IWr7YThLfCUqc84mJIJxywW2POx1KXLczPbE3dwaRO2Y+\nZLb0ZiVnwjV5+T/lf8qpivcQqVJzwpkuaksSBhr0STCZtappONs2lNj8PTPBaYsUTDHaDh1zVSME\njHE4aHJ1DujKXv8AYoYRNQAkjp7VMOvRUSQQ2L4Noq1Oic6kQcQCehwxUkZRgS7McddA6VX2z6Tc\nn+u9F20SQZ2k6lvIajZMs/gV4uWYr9P+P38uKfR7Fc3uEBAQEBBGeSDwX1b4OXNeiuTsoD+sKfza\nOAPnpneM+5b47ivP7WudL9Or8nyifhOBHxBsl22j4+uyJG3uWcN5RkcX117UixmMtoLhxqFV6JCp\ntiN2n2KJhOlXqRxjJuvitJdY3aPJGIaQAEcCmXK8ToW1/RlFozM+zJbjhvx2NuFw4J2jaADF8SyY\nc/HDILzcF3wBKCYJkCCGb4joqysiRgzkHXqeqJVgfBiyqZWRmJAh/MDimTGV0JyBwAZ8R1WssWL4\nSjhIE73L5/CdEyZXU6kIkjFtO5Ew26NZ/K4ZsJjMdgVlYuroWRNetB57RI4xOXerlfFt06YqiP8A\nSj80vERlhAAYDPqjUiuUIwnGntEa8ixgAdob8yM7RzrkDZMABiCJEdVcszVrylgxIkSGkdA33pak\nnVEVQPLtz00Pist7RggZjMdVMrrEd0GJ1GPTJI1YrIE5bmBAVySYZqVoAYAjDHtI08FCR9K+gt6Z\n23N2pkCIXEK0Yg/8yJBP+QLzc/wfe/FW42j6uuD6wgICAgIMSyQcrk6e6EgQ4IIIQfln6l+j6vBc\n9XrUYEWF3LfSIcsdR3hembZj4nNw3j2x8Pg8TIl8ccC5UwywSwOLjuTJJliMoyD4xHbqphb0JRDZ\n/sViZZi7nHIZ5BLRmM5FgXI6nAqNJ/OEZhvKHzGYVjOG7Q5CpEbfijjjqrNmLxyuhb8hRlBtwBGG\n0j71rLhvx1twqwliDiGbFby43VZ8yTFhke5gplPHKYriPb0C1Kx4LaVaIeWfaO1ZsXCyMzucNjkX\nWnOxdGrHVIXVbGY/E4A1KZWRsU57agLhwzEZEK5ZsdChdy3mTNu0zZMmG/Su6RjEGmZVMWO5tyZX\nCcq1QfLlLExJMwPKAfw96uUsc66IlKWLiWJ0H/ijMtjUnUn0Y6FUmszlWC4O4AnVsAo3t1T3ExwO\nJzA171Caq8Q4ObMAqWrBIbCDho2vtWUtaF9XjCJjE6tH+Lora3xzNj3/ANB7kQ5nlKEiTKpRgegB\nicm8V5+bs+1+O2+6z6PtsS6877CSAgICAgINS7pbolB4j1V6ftORtalvdUhVpS0OYPUHQqy4Z30m\n0xXwD1l9Pbzias69tE1rJyTMZx/jAy78u5dJu+dy+vdfrHjalOYk0htY4ArbzzowYnQOMMElS9Ql\nhg4bMdEZkHDZeAzWa3IRkD3MhYEx3OXDdA7rUZZGBO4M+IDMG6oJ/MIwdn1TJ41s0b2pBmkSBgYl\nXLF0l7ujb8rAkCoW0fRXLjtxfJuUpUpyeMh2exby5bSxeKkThgW/H9xSMJwmx6Nk2qtZsWxqsQ43\nDHDVZWRIXETUDOZOHByA1WozV0KkRKTfCDgRg4TK46L6dzUiSCNo0OpHatM4bdG5mJAu+3yiXemU\n7NiPKVKMjGkYynmZZkA4HApKXq1pXEpyJk7ak9FbWPFgVd4OkNJYZ6LNWWRATZspnUnLv71VykJd\nrvkcvch5ozk+ZxBd8z0VjNqi4uoU4EPuPsKlreumXKq3RqSGzAuzdFi16dOPD3f0ZvBT9ZxGRuba\nrCXV4bZf7q5ck6Pd6Nxyv0DSk4XnfcXOgICAgICCM4uEHMv7IVInBB5Pl+Ffc0XdB8y9TfTiwu98\n6MBb1TiREeQnuGXeFub4ebl9bXbrOlfM+a9KclxcpfNh5NJ6EdhyJ+1bly8PJxba93FwcOwY5Hr2\nquWEZk74kBpAtjl4qxYnLEBgfFREIAROBxHVXK1mRO1ncAYl0TAX66A+1CIyFQSwLjB1cqtjLyGQ\nwbxxUywst7irTk+7a+TK5LJXSo8xVBebeH7FZs5XhjoUuRoVAGPmJ71uVw24rG1Tq7onH/zdFK5p\n72Yjzdn3rUSxkzMZbiQYnNyrbElXwlIhhLcNQMFnKJxrbTiMZDJ1pLFsaxYxGEmY9W0Rlg1pE7SR\n0I6K1YkJUsCRiNSMB3KJYzGoCAxcddEyuEzVjEPuADeYKs+DTueQo0yRIk4aaFS10nG5lS5nXnjj\nHpks2vTpriIuBiAzag/tWctZes+lt2afrSxIiRKRnTc5GMokkj2LO/Z6PVt/dj9J2s3C8z7zbdBl\nAQEBAQEEZwEgg0bqxjUBwQef5Dg4yciKDy3K+nIThOFSmJwkGlGQcEdoRLMvnHqT6Z0aolOy/pzD\nkU5Pt1+E5hb13eXk9WXrq+e8l6c5TjZmFxSlHEiIzJ/hbCXguucvFvpdb1cwQ2y2kFx+9oUZtDKI\nLdrFGcZYqRl8Qw7T07UalYBMWM8xkOqUqW53JDjrkiYBKM9H1GimErOGuQLv29iqQJ0MsXxRrCcK\nsokNl3olbVHkKsTiQYjNyrNnPbjjqW3J0pFpTMCcsMB4rUrz7
8db8LqJ8rgv8I+9by4+Kw1Qfgxf\nAl1JUsZFWTkYA69i0zhdGUduTNjtzZZMEZFzsAMn9uquSRidzFpE5xxke9RvxUT5ehEM+1g8u99F\nbVnHWlW5aUw0CQH+JlPJucWFBm5Micy58VnLtIlGcgcMHyfH3pksZlVlnIY5Mqni9j9LaZn6vtKj\ng/LhM+yBy9qxydnr9OZ5I/RlhNwF5n2nSf3IJoCAgICAgIMEOgqqUIyGSDnXfGQmDgg4F/wMZO0U\nHmeU9N06kJU6tIVKcvihIAg+BRLMvnnP/TK2qmU7T+lLSEsR4SzC3N3l5PVl7PB8r6b5CxqmFWgY\nsfJEhyQM2l8MlubPFvx7a93Iq05ReJG09C+HetSuaG2UR5cT29EyZCJhjp+7gisjcZeZgdP9qZSs\nFz5ZDxAwRTZ5XdjqmUylgWfXqojJwPYcirFvVKBlAvEsemauU8V1O4rwLRkNoOL5nuKZZukbUL+r\nm7HoAwVy5+C2PKVMYt3lXyLwxYOSk2oLMcVnyY/bVTvassQ7Ee5ay1rxRWbmrKWMz2Iv7YD+LNy2\nKjWE3kSWiOrqs1ZGpIjD2lMMs/MBO2QdmDn3qYbi2JBO4API4MNEMvon0ms5Hla12RhRp7Qe2Z09\ni58l6Pb6GudrX3TjJPELi+q637EFqAgICAgICAgIMGIKCipbRkMkHOuuLhMHBBwr7gYl2ig81ynp\nunUpyp1KUalOWEoSAIPgUSzLwHN/TW3mZTsz8qR/05vKHgfiC3N3m5PV1vbo8Jy3pbkePJNakacI\n/jLmBfpMeV+wsrNni39fbXu4tSlKEg8DGWgkCCe5bjkrMcB25nqqlp5o9+XYVGsjkDTtdMJAywY5\nDs6pgNAAxL4BVMsbhuwzGfejS+ngSG7WOSRmraZkzt3gK1MMkfETl0RJRjHHFuqKkJ7iC2WL9ESs\niQIDY7sygkNxlJj8RGGi0wsiQPLrkFCxIEviD0bRuxKzhOLgg/i1VhG9QoykYgPKUs217AplfF9n\n9BcNLjuOgJxatWPzJg5gaD2LhvtmvterxeGnXu+lcXAiIWHpdZvcgsQEBAQEBAQEBAQEGDEFBRVt\noyGSDn3XFwmDgg4V7wMS/lQcC/8AT+BG1x0QeJ5v6ecfcCXy6f6eTv5B5CR+XL2LU2rz8nra7fR4\nHmfQ3K2W+UaRqUsxKkDIDw+Iexbm0ePk9bbXt1edrW9WmPPHyxyIy8Fp5mtKkJSzx7FprPQixBBz\nGuilZYIPYRqMvtQylGIIYHHVFytpiRA6j/DKsVa0i5j00RQbgBuLBjgiVJniCC7hw6isGJJ8pBwV\niZZjUORw6q4MLNzEvHb11cdVImEogsWxA6dVUX0oy27Y+Z/sTA2aFEyaMgWGgWh7z0V6YnOrC+uo\nNTjjSgRiToSuPJs9vqevm+V7PrHE22IwXF9R62ypbYhBvMgygICAgICAgICAgICAgwYgoKaltGWi\nDRuOMhMHBBxrzg4yfyoODfenzi0UHkua9EWF3KU6tHbWP+rDyy6Y9fFWWxy5OHXbvHhOX+nF3SJl\nb/1oZeVoVPESaJ9oW5u8m/p2duryNzxVza1JUqtKUZnHZIGMmyxjILecvHtpdb1a3yRi4I/KQzKo\nyKeIHie7xVTCyECMGx16d6mShhOLgHGWff1RWGkYk+AJRKyKRY4v+70BVyJiABbLDEDN1Eqcaehc\n4IkThQkR2DRWVavhQO7d1AjtOOHYFpmt60s5SmIRg8j2KD2/p70eDKFxeRaIYxpak9vQLntyfJ7+\nD1M9dn0PjrH4QIsBgAFxfSkes4yy2gFkHeow2gILEBAQEBAQEBAQEBAQEBAQEGDEFBXOhGWYQalf\nj4SGSDk3fCwk/lQcS99PjFooPN8r6Wt7iBhcUY1I5tIe5E21l7vFct9Og0pWk2lmIVnkH7JjzD2F\nbm7y7+pL2eSvvTHIWc2rUZxOYmBvgW13Rfb4stzZ4+Tg21+DT/tlwISkIGcRrHH3LTzdVRtjA/Cx\n17FcL1QlbSbAORiAckXCUbWe1iH3ZD/aolZjbMSXwPlZVVlOy3NIYAn4T2dEyzK3qFhXqECFKRI6\nD7XRrDu8V6RvbjbPbtpk4SJ/x9il3kdeP199/pHteH9L2lptlsFSqMpMwHcFy23tfQ4vV106969P\nZ8dIkYLD0vScdxu1sEHoLegIhBsgMgICAgICAgICAgICAgICAgICAgEAoISpAoNarZwkMkHOueIh\nJ8EHHu+CBfyoOJd8BifKg4N96QsaxepbR3AuJRGyT/xRYq5Y249b3jj3Hoe23mUSWP4ZiMx7SN32\nrU3rht6mt7dGjV9DZiIg3TzR925X9xz/AIf1Qh6FwG8gEH96RHuCnmz/AAr81tL0FS/FUY9QD95V\n/cWel9XQtvRNhAgy3TIzGAH2B/tUvJXSenp8Xas/TlnRIMKI3MznH3rN2rtrw669o7Nvxci2Cjq6\n1pxGWCDt2nGCLYIOrRtxAZINgBkGUBAQEBAQEBAQEBAQEBAQEBAQEBAQGQRMAUFVS2jLMINOtx0J\naINCvw8Dog0K3Bj91Bqz4P8AKgpPBl/hQSjwp6ILqfC45IN2hw4GiDoUOLiGwQdCjZxiMkGzGkAg\nmyAgICAgICAgICAgICAgICAgICAgICAgICAgMEETTBQVyt4nRBWbSB0QQNjDogCyh0QTFpAaILI0\nIjRBYIAIJMgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg//9k=\n",
"prompt_number": 100,
"text": [
"<IPython.core.display.Image at 0x65559b0>"
]
}
],
"prompt_number": 100
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Author: Rory Creedon\n",
"\n",
"Date: 29 November 2013\n",
"\n",
"Purpose: Building upon a previous notebook (JellyFish) which described different method for matching fuzzy strings, this notebook explores the possibility for developing a concrete strategy for dealing with fuzzy strings. The particular focus is on developing a strategy for cleaning style names. \n",
"\n",
"As ever this notebook has been developed for my work, and so the issues may not be totally intelligible to all readers (if there are any). However, I take pains to explain what I am doing, and so a careful reading will hlep those facing similar issues. \n",
"\n",
"Comments/Questions - rcreedon@poverty-action.org\n",
"\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"import numpy as np\n",
"import datetime\n",
"import os\n",
"from pandas import DataFrame\n",
"from numpy import nan as NA\n",
"from IPython.core.display import HTML\n",
"from IPython.core.display import Image\n",
"from IPython.display import Math\n",
"from IPython.display import Latex\n",
"import collections\n",
"import jellyfish as jf\n",
"import re\n",
"%qtconsole"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 67
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In a previous notebook (JellyFish) (link: http://nbviewer.ipython.org/gist/MajorGressingham/7691723 ) I explore the possibility of using four different algorithms for string matching. These were:\n",
"\n",
"+ **Levenshtein Distance**\n",
"+ **Damerau-Levenshtein Distance**\n",
"+ **Jaro Distance**\n",
"+ **Jaro-Winkler Distance**\n",
"\n",
"\n",
"The Damerau-Levenshtein Distance was somewhat effective (and in testing with style strings indistinguishable from the Levenshtein Distance). The problem with that measure is that it is absolute with regard to the length of the strings being matched. Therefore setting up a matching decision rule is difficult. The solution that was reached was to compare the D-L score with the average length of the strings being matched multiplied by some arbitrary value. It was found that setting that arbitrary multiplier at 0.35 (in the data that was being tested) created the most favourable (or least undesirable results). \n",
"\n",
"The Jaro distance measure seemed more promising. In the data it was trialed with a decision rule of accepting scores greater than 0.75 seemed to calibrate the matches at about the right level. In fact it generated results that actually were extremely close to the matches that had been generated manually by sight. \n",
"\n",
"The Jaro-Winkler measure was less desirable for the reason of the prefix booster. As many styles have the same prefix (e.g. 'jacket') then the prefix booster component of the J-W measure meant that spurious matches were being made. In response it was thought that the decision rule value could be raised as compensation, but that was not sucessful, as the spurious matches did not disappear, and good matches did. \n",
"\n",
"In all of this it should be understood that all of these 'lessons learned' are really only appropriate for the data that are being worked with. What works best will totally depend upon what type of string is being dealt with.\n",
"\n",
"Having thought a little about how to best use these measures with particular reference to style names, I believe that one possibly furtive strategy is to compare strings that have 'generic' components removed. \n",
"\n",
"This proposed method is the subject of the remainder of this noteook\n",
"\n",
"\n"
]
},
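{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch, the two calibrated decision rules described above can be wrapped as helper functions for reuse (the names `dl_match` and `jaro_match` are just placeholders, and 0.35 / 0.75 are the calibration values from the previous notebook, not universal constants):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Sketches of the two decision rules discussed above. The 0.35 multiplier\n",
"# and the 0.75 cutoff are the values calibrated in the JellyFish notebook.\n",
"def dl_match(s1, s2, multiplier=0.35):\n",
"    avg_len = (len(s1) + len(s2)) / 2.0\n",
"    return jf.damerau_levenshtein_distance(s1, s2) < avg_len * multiplier\n",
"\n",
"def jaro_match(s1, s2, threshold=0.75):\n",
"    return jf.jaro_distance(s1, s2) > threshold"
],
"language": "python",
"metadata": {},
"outputs": []
},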
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br />"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Stripping Out Generic Components"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Many of the style strings that we work with have component that describes the item (e.g. Jacket, Short, Pant). Sometimes the component is just a single word, sometimes it is multiple words (e.g. Boys round neck T shirt). These sub-strings I refer to as *Generic Components*. If you are working with names it could be Mr. Mrs. Ms. If you are working with addresses it could be country, county, other types of sub-string. If working with sports teams it could be city names. \n",
"\n",
"The style strings will also typically contain some reference, or code, or buyer name, or other indicator, that is truly the identifying part of the style. These sub-strings I refer to as *Identifying Components*. Unfortunately these identifying components can be located anywhere within a sea of generic components. If they were always the first sub-string, they could be easily accessed. However, in fact with a bit of simple coding and some manual work I believe that the generic components can be stripped away to reveal the Identifying Components. \n",
"\n",
"It seems to me that the algorithms for working with fuzzy strings will be much more successful, if they are deployed only upon the Identifying Components. It is almost certainly obvious to the reader why this should be so, but just to make the point absolutely clear observe the following:\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Create two string with generic components as a prefix\n",
"S1 = 'THIS IS A STANDARD PREFIX 1234'\n",
"S2 = 'THIS IS A STANDARD PREFIX 4567'\n",
"\n",
"# Create the same strings but stripping away the prefix\n",
"S3 = '1234'\n",
"S4 = '4567'\n"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 68
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above is a stylised simulation of the generic components issue. Now how do the measures perform when we submit the full strings (S1, S2) for comparison, versus submitting the strings that have had the generic components stripped away (S3, S4)?\n",
"\n",
"Turning first to the Damerau-Levenshtein distance. Recall that in previous work I set the decision rule as follows:\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Latex(r\"\"\"\\begin{eqnarray}\n",
"M = \\left\\{\n",
" \\begin{array}{1 1}\n",
" \\text{Matched} & \\quad \\text{if } d_l < {\\frac{|s1| + |s2|}{2}} * 0.35\\\\\n",
" \\text{Not Matched} & \\quad \\text{otherwise}\n",
" \\end{array} \\right.\n",
"\\end{eqnarray}\"\"\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"latex": [
"\\begin{eqnarray}\n",
"M = \\left\\{\n",
" \\begin{array}{1 1}\n",
" \\text{Matched} & \\quad \\text{if } d_l < {\\frac{|s1| + |s2|}{2}} * 0.35\\\\\n",
" \\text{Not Matched} & \\quad \\text{otherwise}\n",
" \\end{array} \\right.\n",
"\\end{eqnarray}"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 69,
"text": [
"<IPython.core.display.Latex at 0x6555d90>"
]
}
],
"prompt_number": 69
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So lets see how the measure performs on the above created strings with reference to the above decision rule:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# With S1 & S2\n",
"jf.damerau_levenshtein_distance(S1, S2) < ((len(S1) + len(S2))/2)*0.35"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 70,
"text": [
"True"
]
}
],
"prompt_number": 70
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# With S3 & S4\n",
"jf.damerau_levenshtein_distance(S3, S4) < ((len(S3) + len(S4))/2)*0.35"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 71,
"text": [
"False"
]
}
],
"prompt_number": 71
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Clearly, the decisions on the match come out differently. The D-L score in both cases is 4, but in the S1, S2 example the decision condition is $4 < 10.5$ whereas in the second example the decision condition is $4 < 1.4$, putely becasue the generic prefix had been stripped away, thus greatly reducing the length of the string."
]
},
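{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the raw numbers behind those two conditions visible:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# D-L distance alongside each threshold: 4 vs 10.5 for the full strings,\n",
"# 4 vs 1.4 for the stripped strings.\n",
"print jf.damerau_levenshtein_distance(S1, S2), ((len(S1) + len(S2))/2)*0.35\n",
"print jf.damerau_levenshtein_distance(S3, S4), ((len(S3) + len(S4))/2)*0.35"
],
"language": "python",
"metadata": {},
"outputs": []
},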
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course the same thing will be true of the Jaro distance. Observe the following:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"jf.jaro_distance(S1, S2)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 72,
"text": [
"0.9333333333333332"
]
}
],
"prompt_number": 72
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"jf.jaro_distance(S3, S4)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 73,
"text": [
"0.0"
]
}
],
"prompt_number": 73
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is actually a very extreme case, whereby the first example appears very closely matched, but the second example is not at all matched. **NB** If you are surpised that it is 0.0 given that '4' appears in both 'S3' and 'S4' then I suggest you read the JellyFish notebook that explains the measures. This is the correct behaviour for this measure."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the context of the style strings without stripping out the generic components S1 and S2 would match, which I think would lead to a waste of time on behalf of the person cleaning the data. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An additional consideration is that the jaro-winkler measure will possibly be more useful with the generic components stripped out. Recall that in the previous notebook, spurious matches were made when two styles both began with jacket. If jacket is successfuly identified as generic and stripped away, then it once again makes sense to apply the prefix boost that is part of that measure. "
]
},
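{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the prefix-boost point concrete, here is a small invented example (these style names are hypothetical, not drawn from the data):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Two hypothetical styles sharing only the generic prefix 'jacket'. The\n",
"# prefix boost inflates the J-W score even though the identifying parts\n",
"# ('1234' vs '4567') share nothing.\n",
"print jf.jaro_winkler('jacket 1234', 'jacket 4567')\n",
"\n",
"# With the generic prefix stripped away there is no spurious boost.\n",
"print jf.jaro_winkler('1234', '4567')"
],
"language": "python",
"metadata": {},
"outputs": []
},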
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br />"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Identifying Generic Components"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This part is concerned with how to strip out generic components. To make the exercise more real, let's read in some data to look at. The 1006 data was particuarly nasty to deal with."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Read in the data\n",
"DF = pd.read_csv(r'C:\\Users\\rcreedon\\Dropbox\\Rory Notes\\Notes\\JellyFish\\ExampleData.csv')\n",
"DF"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<pre>\n",
"&lt;class 'pandas.core.frame.DataFrame'&gt;\n",
"Int64Index: 2385 entries, 0 to 2384\n",
"Data columns (total 3 columns):\n",
"Unnamed: 0 2385 non-null values\n",
"style 2385 non-null values\n",
"smv 2385 non-null values\n",
"dtypes: float64(1), int64(1), object(1)\n",
"</pre>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 74,
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 2385 entries, 0 to 2384\n",
"Data columns (total 3 columns):\n",
"Unnamed: 0 2385 non-null values\n",
"style 2385 non-null values\n",
"smv 2385 non-null values\n",
"dtypes: float64(1), int64(1), object(1)"
]
}
],
"prompt_number": 74
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As can be seen this is daily data (although dates and all other variables have been removed). There are two columns of data, 'style' and 'smv'. SMV will not be used here.\n",
"\n",
"As the below output shows there are 438 unique styles in the data, and some of those are displayed to the reader. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print len(DF.style.unique())\n",
"# Display part of the data\n",
"DF.style.unique()[:50]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"454\n"
]
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 75,
"text": [
"array(['Boys s slv r nk tee (TS-334)', 'Clock house tanktop',\n",
" 'Boys basic tanktop (354140)', 'Vito v nk s slv tee (805914)',\n",
" 'PJ Smurf tee (Top)Ta', 'Max tee (826581)',\n",
" 'Nightwear BTM (PJ-2916)', 'Mens l slv tee (933-005)',\n",
" 'Basic shorts (174490)', 'Nightwear top hanger loop (BS-2923)',\n",
" 'Boys s slv mock layer tee (SR-189)', 'Marks shorts (226175)',\n",
" 'Nightwear top with hanger loop (BS-2923)',\n",
" 'Mens s slv polo (313606)',\n",
" 'Nightwear top with hanger loop (BS-2916)',\n",
" 'Boys basic tanktop (904240)', 'Shark tank top (595551)',\n",
" 'Mens s slv tee (931-005)', 'Boys s slv r nk tee (344-001)',\n",
" 'Nightwear BTM (PJ-2921)', 'Basic shorts (776611)',\n",
" 'Mens s slv placket tee (453-001)', 'Nightwear BTM (SH-2913)',\n",
" 'Mens s slv polo (313606)-XL Team', 'Marks shorts (226176)',\n",
" 'Nightwear top with hanger loop (BS-2911)',\n",
" 'Mens s slv tee (UWC-931-005)',\n",
" 'Mens s slv polo with bon pocket (313606)',\n",
" 'Boys s slv tee (TS-903)', 'Ray tank top (177111)',\n",
" 'Basic shorts (776622)', 'Boys slv less tee (DLP-350)-Stripe',\n",
" 'Basic shorts (667570)', 'Vito v nk s slv tee (845720)',\n",
" 'Boys s slv placket tee (TS-912)', 'Marks shorts (226172)',\n",
" 'Boys slv less tee (SG-356)', 'San fransisco tank top (693170)',\n",
" 'Micky mouse s slv tee (455650)',\n",
" 'Boys slv less tee (DLP-350)-Solid', 'Boys slv less tee (320-001)',\n",
" 'VIP tank top (561580)', 'PJ Smurf tee (Top)Tb',\n",
" 'Boys s slv r nk tee (TS-354)', 'Basic shorts (179324)',\n",
" 'Mens s slv polo with pocket (320764)',\n",
" 'I love to party s slv tee (693170)',\n",
" 'Boys s slv r nk tee (TS-921)',\n",
" 'Boys s slv mock layer tee (TS-357)', 'Ray tank top (538810)'], dtype=object)"
]
}
],
"prompt_number": 75
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the style data (above), it is quite clearly a disaster! But the question is how to pull out the generic data. As a first step, let's at least convert all of the characters to uppercase:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DF.style = pd.Series([DF.style[row].lower() for row in DF.index],)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 76
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a second preliminary let's strip out multiple white spaces:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DF.style = pd.Series([\" \".join(DF.style[row].split()) for row in DF.index])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 77
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And remove any leading/trailing whitespace:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DF.style = pd.Series([DF.style[row].strip() for row in DF.index])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 78
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is probably a smart move to remove special characters also, let's see. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DF.style = pd.Series([re.sub('[^A-Za-z0-9 ' ']+', '', DF.style[row]) for row in DF.index])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 79
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DF.style.unique()[:50]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 80,
"text": [
"array(['boys s slv r nk tee ts334', 'clock house tanktop',\n",
" 'boys basic tanktop 354140', 'vito v nk s slv tee 805914',\n",
" 'pj smurf tee topta', 'max tee 826581', 'nightwear btm pj2916',\n",
" 'mens l slv tee 933005', 'basic shorts 174490',\n",
" 'nightwear top hanger loop bs2923',\n",
" 'boys s slv mock layer tee sr189', 'marks shorts 226175',\n",
" 'nightwear top with hanger loop bs2923', 'mens s slv polo 313606',\n",
" 'nightwear top with hanger loop bs2916',\n",
" 'boys basic tanktop 904240', 'shark tank top 595551',\n",
" 'mens s slv tee 931005', 'boys s slv r nk tee 344001',\n",
" 'nightwear btm pj2921', 'basic shorts 776611',\n",
" 'mens s slv placket tee 453001', 'nightwear btm sh2913',\n",
" 'mens s slv polo 313606xl team', 'marks shorts 226176',\n",
" 'nightwear top with hanger loop bs2911', 'mens s slv tee uwc931005',\n",
" 'mens s slv polo with bon pocket 313606', 'boys s slv tee ts903',\n",
" 'ray tank top 177111', 'basic shorts 776622',\n",
" 'boys slv less tee dlp350stripe', 'basic shorts 667570',\n",
" 'vito v nk s slv tee 845720', 'boys s slv placket tee ts912',\n",
" 'marks shorts 226172', 'boys slv less tee sg356',\n",
" 'san fransisco tank top 693170', 'micky mouse s slv tee 455650',\n",
" 'boys slv less tee dlp350solid', 'boys slv less tee 320001',\n",
" 'vip tank top 561580', 'pj smurf tee toptb',\n",
" 'boys s slv r nk tee ts354', 'basic shorts 179324',\n",
" 'mens s slv polo with pocket 320764',\n",
" 'i love to party s slv tee 693170', 'boys s slv r nk tee ts921',\n",
" 'boys s slv mock layer tee ts357', 'ray tank top 538810'], dtype=object)"
]
}
],
"prompt_number": 80
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Already the data look somewhat less daunting. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to think about finding the generic data would be to create a dictionary of every word (sequence) found in each string with the corresponding value being a count of how many instances of that word are found in all strings. Those words with a higher count are more likely to be generic.\n",
"\n",
"Of course, there is no need to implement this on the entire data set, only on the unuqie instances of the style variable:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Create series of unqiue instances of the style variable:\n",
"Styles = pd.Series(DF.style.unique())\n",
"DFUn = DataFrame(Styles, columns = ['style'], index = xrange(len(Styles)))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 81
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"WordDict = {}\n",
"for row in Styles.index:\n",
" for word in Styles[row].split(' '):\n",
" if word not in WordDict:\n",
" WordDict[word] = 1\n",
" else:\n",
" WordDict[word] += 1\n",
"# WordDict "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 82
},
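{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, `collections.Counter` (the collections module is already imported above) builds the same word-count mapping more concisely, and `most_common` gives a quick ranked view; a sketch:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Equivalent word counting with collections.Counter; the most frequent\n",
"# words are the likeliest generic components.\n",
"WordCount = collections.Counter(word for style in Styles for word in style.split(' '))\n",
"WordCount.most_common(10)"
],
"language": "python",
"metadata": {},
"outputs": []
},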
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Without wishing to get ahead of myself, I think this could be a very useful strategy. The third line of output above reveals that '124417' appears three times. It could be that this is the true style identifier, and (recalling that each style that was iterated through is unique) it could be that it appears in three unique style names which in reality are unique only because of the inconsistent arrangement of generic components."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that the dictionary is created, it needs to be decided at what level the 'counter' threshold for each should be set as a decision rules as to whether the word it relates to is generic. Let's experiment with three values, 15, 10, and 8"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for key, value in WordDict.iteritems():\n",
" if value >= 15:\n",
" print key, value"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"tee 47\n",
"lsivts 17\n",
"msp 18\n",
"mens 16\n",
"lslvts 56\n",
"sslvts 59\n",
"top 134\n",
"ssivts 72\n",
"ronny 25\n",
"s 31\n",
"polo 36\n",
"tank 35\n",
"slv 34\n",
"bottom 63\n"
]
}
],
"prompt_number": 83
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Setting the threshold at 15, seems to identify words that are clearly generic. sub-strings like 'tank', 'polo', 'bottom' etc. clearly relate to generic item names, and are probably not unique. \n",
"\n",
"Others like 'lslvts' are also generic. If you look at the original data you will understand that this realtes to 'long sleeve t shirt'\n",
"\n",
"Others like 'ronny' are less obviously generic, as they probably relate to the style name. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for key, value in WordDict.iteritems():\n",
" if value >= 10:\n",
" print key, value"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 12\n",
"large 11\n",
"tee 47\n",
"192 11\n",
"lsivts 17\n",
"msp 18\n",
"mens 16\n",
"lslvts 56\n",
"nightwear 10\n",
"cocoon 13\n",
"pj2916 10\n",
"sslvts 59\n",
"top 134\n",
"ssivts 72\n",
"shorts 10\n",
"ronny 25\n",
"s 31\n",
"polo 36\n",
"basic 13\n",
"sr542 10\n",
"tank 35\n",
"slv 34\n",
"long 10\n",
"boys 14\n",
"bennyj 11\n",
"bottom 63\n"
]
}
],
"prompt_number": 84
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The mix of obviosuly generic, and questionably generic is now longer. On the one hand we catch words like 'shorts', but on the other hand its not clear that 'sr541' does not in fact relate to some unqiue part of the style. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for key, value in WordDict.iteritems():\n",
" if value >= 8:\n",
" print key, value"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 12\n",
"large 11\n",
"tee 47\n",
"192 11\n",
"lsivts 17\n",
"msp 18\n",
"pj0405 8\n",
"mens 16\n",
"nightware 8\n",
"lslvts 56\n",
"nightwear 10\n",
"l 8\n",
"cocoon 13\n",
"pj2916 10\n",
"pj2912 8\n",
"sslvts 59\n",
"coocon 8\n",
"top 134\n",
"ssivts 72\n",
"shorts 10\n",
"ronny 25\n",
"192194 9\n",
"randy 8\n",
"art 8\n",
"s 31\n",
"polo 36\n",
"basic 13\n",
"pj 9\n",
"sr542 10\n",
"tank 35\n",
"slv 34\n",
"topb 8\n",
"long 10\n",
"705 8\n",
"boys 14\n",
"bennyj 11\n",
"bottom 63\n"
]
}
],
"prompt_number": 85
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At this cutoff, again the mix of generic and non-generic is further expanded.\n",
"\n",
"I would suggest that the best method would be to get a good understanding of the data and how shorthand is used in general, then set the threshold low, and choose only those sub-strings that are clearly generic to be stripped out. "
]
},
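{
"cell_type": "markdown",
"metadata": {},
"source": [
"One small aid in choosing that threshold (a sketch, not essential to the workflow) is to sort the dictionary by count, so all the candidate cutoffs can be eyeballed in a single pass:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Sort the word counts in descending order; reading down the list shows\n",
"# where clearly generic words give way to possible identifiers.\n",
"for count, word in sorted(((v, k) for k, v in WordDict.iteritems()), reverse=True)[:25]:\n",
"    print word, count"
],
"language": "python",
"metadata": {},
"outputs": []
},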
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before proceeding, it might be worth looking and what the real effect of this stripping could be. I looked at the unqiue style data and randomly picked out two stings:\n",
"\n",
"+ 'uwc932011 898899 lslvts sslvts'\n",
"+ 'uwc932011 898 lslvts'\n",
"\n",
"As identifed in the WordDict, lslvts and sslvts are generic components that could be stripped out\n",
"\n",
"We can't know for sure that these strings in fact identify the same style, but even to look at them it seems clear that they should be identified as potentially matching to merit further investigation. Would the match in there *unstripped* format?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Unstripped of generic components\n",
"S5 = '705 580057 ssivts'\n",
"S6 = 'uwc932011 899 sslvts'\n",
"\n",
"# Stripped of generic components\n",
"S7 = 'uwc932011 898899'\n",
"S8 = 'uwc932011 898'"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 86
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"jf.damerau_levenshtein_distance(S5, S6) < ((len(S5) + len(S6))/2)*0.35"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 87,
"text": [
"False"
]
}
],
"prompt_number": 87
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"jf.jaro_distance(S5, S6)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 88,
"text": [
"0.6598039215686274"
]
}
],
"prompt_number": 88
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unstripped they would both fail to be matched given the decision rules mentioned above. Now, will the stripped versions match?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"jf.damerau_levenshtein_distance(S7, S8) < ((len(S7) + len(S8))/2)*0.35"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 89,
"text": [
"True"
]
}
],
"prompt_number": 89
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"jf.jaro_distance(S7, S8)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 90,
"text": [
"0.9375"
]
}
],
"prompt_number": 90
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"jf.jaro_winkler(S7, S8)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 91,
"text": [
"0.95625"
]
}
],
"prompt_number": 91
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Stripped of the generic componentents both would match, and the jaro_distance match is very strong. Note, that now that the generic components have been stripped it makes sense to use the J-W measure (see above), and that match comes out even stronger due to the prefix bonus. "
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Stripping out Generic Components"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first task is to use the output of the WordDict, to create a list of sub strings identified as generic. There is no science to this, it should just be based on common sense. Clearly it is not desirable to strip out genuine style identifiers, or potential codes. As many generic components as possible should be included without adding danger that genuine identifiers will be stripped.\n",
"\n",
"In some senses it boils down to how bad you think the data are. If you think it unlikely that one style could be being produced under 8 different names, then is actually makes sense to take every word in the WordDict with values of 8 and above. If you think that it is feasible that the same style is produced under 8 different monikers, then this would be a risky strategy. \n",
"\n",
"The following to me seems like a sensible list:\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"GenericList = ['bottom', 'boys', 'long', 'topb', 'slv', 'tank', 'pj', 'basic', 'polo', 's', 'shorts', 'ssivts', \\\n",
" 'top', 'sslvts', 'l', 'mens', 'nightware', 'lslvts', 'nightwear', 'msp', 'lsivts', 'tee', 'large']"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 92
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One question is as to whether to stip out any occurance of the above words in the string, or only those that are identified as separate sub-strings (i.e. surrounded by white space). If this fromer is done than the string '1845ssivts large' will become '1845' (desirable) whereas if the latter is done then it becomes '1845ssivts' (an improvement but less desirable). \n",
"\n",
"However it might be less desiable to remove the 's' 'l' and 'pj' generic characters unless they are separate substrings, as it is possible that they will form part of the indentifier component. \n",
"\n",
"Luckily with different regular expressions this can easily be achieved. To make the code easier to read, separate those substrings where all instances will be removed, from those that will only be removed if they are surrounded by whitespace.\n",
"\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"GenericRemoveAll = ['bottom', 'boys', 'long', 'topb', 'tank', 'basic', 'polo', 'shorts', 'ssivts', \\\n",
" 'top', 'sslvts', 'mens', 'nightware', 'lslvts', 'nightwear', 'msp', 'lsivts', 'tee', 'large', 'slv']\n",
"\n",
"GenericWhiteSpace = ['l', 'pj', 's']"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 93
},
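{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is the small demonstration promised above of the two regex flavours, using the '1845ssivts large' example:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Substring removal strips 'ssivts' even when it is glued to the identifier.\n",
"print re.sub('ssivts', '', re.sub('large', '', '1845ssivts large')).strip()    # -> 1845\n",
"\n",
"# Whole-word removal (\\b boundaries) only strips free-standing tokens.\n",
"print re.sub(r'\\bssivts\\b', '', re.sub(r'\\blarge\\b', '', '1845ssivts large')).strip()    # -> 1845ssivts"
],
"language": "python",
"metadata": {},
"outputs": []
},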
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following will demonstrate how the process of stripping out the generic components works. The print statements will output the value of the style as the list of words is iterated through and their matches removed from the style string. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DemoStyles = Styles.copy()\n",
"for row in DemoStyles.index[:10]:\n",
" # Set value of string\n",
" StringVal = DemoStyles[row]\n",
" print StringVal\n",
" \n",
" # Modify the value of the string by iteratively stripping out the sub-strings in GenericRemoveAll\n",
" for word in GenericRemoveAll:\n",
" RegEx1 = re.compile('' + word)\n",
" StringVal = re.sub(RegEx1, '', StringVal)\n",
" print StringVal\n",
" \n",
" # Modify the value of the string by iteratively stripping out the sub-strings in GenericRemoveAll\n",
" for word in GenericWhiteSpace:\n",
" RegEx2 = re.compile(r'\\b' + word + r'\\b')\n",
" StringVal = re.sub(RegEx2, '', StringVal)\n",
" print StringVal\n",
"\n",
" "
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"boys s slv r nk tee ts334\n",
"boys s slv r nk tee ts334\n",
" s slv r nk tee ts334\n",
" s slv r nk tee ts334\n",
" s slv r nk tee ts334\n",
" s slv r nk tee ts334\n",
" s slv r nk tee ts334\n",
" s slv r nk tee ts334\n",
" s slv r nk tee ts334\n",
" s slv r nk tee ts334\n",
" s slv r nk tee ts334\n",
" s slv r nk tee ts334\n",
" s slv r nk tee ts334\n",
" s slv r nk tee ts334\n",
" s slv r nk tee ts334\n",
" s slv r nk tee ts334\n",
" s slv r nk tee ts334\n",
" s slv r nk tee ts334\n",
" s slv r nk ts334\n",
" s slv r nk ts334\n",
" s r nk ts334\n",
" s r nk ts334\n",
" s r nk ts334\n",
" r nk ts334\n",
"clock house tanktop\n",
"clock house tanktop\n",
"clock house tanktop\n",
"clock house tanktop\n",
"clock house tanktop\n",
"clock house top\n",
"clock house top\n",
"clock house top\n",
"clock house top\n",
"clock house top\n",
"clock house \n",
"clock house \n",
"clock house \n",
"clock house \n",
"clock house \n",
"clock house \n",
"clock house \n",
"clock house \n",
"clock house \n",
"clock house \n",
"clock house \n",
"clock house \n",
"clock house \n",
"clock house \n",
"boys basic tanktop 354140\n",
"boys basic tanktop 354140\n",
" basic tanktop 354140\n",
" basic tanktop 354140\n",
" basic tanktop 354140\n",
" basic top 354140\n",
" top 354140\n",
" top 354140\n",
" top 354140\n",
" top 354140\n",
" 354140\n",
" 354140\n",
" 354140\n",
" 354140\n",
" 354140\n",
" 354140\n",
" 354140\n",
" 354140\n",
" 354140\n",
" 354140\n",
" 354140\n",
" 354140\n",
" 354140\n",
" 354140\n",
"vito v nk s slv tee 805914\n",
"vito v nk s slv tee 805914\n",
"vito v nk s slv tee 805914\n",
"vito v nk s slv tee 805914\n",
"vito v nk s slv tee 805914\n",
"vito v nk s slv tee 805914\n",
"vito v nk s slv tee 805914\n",
"vito v nk s slv tee 805914\n",
"vito v nk s slv tee 805914\n",
"vito v nk s slv tee 805914\n",
"vito v nk s slv tee 805914\n",
"vito v nk s slv tee 805914\n",
"vito v nk s slv tee 805914\n",
"vito v nk s slv tee 805914\n",
"vito v nk s slv tee 805914\n",
"vito v nk s slv tee 805914\n",
"vito v nk s slv tee 805914\n",
"vito v nk s slv tee 805914\n",
"vito v nk s slv 805914\n",
"vito v nk s slv 805914\n",
"vito v nk s 805914\n",
"vito v nk s 805914\n",
"vito v nk s 805914\n",
"vito v nk 805914\n",
"pj smurf tee topta\n",
"pj smurf tee topta\n",
"pj smurf tee topta\n",
"pj smurf tee topta\n",
"pj smurf tee topta\n",
"pj smurf tee topta\n",
"pj smurf tee topta\n",
"pj smurf tee topta\n",
"pj smurf tee topta\n",
"pj smurf tee topta\n",
"pj smurf tee ta\n",
"pj smurf tee ta\n",
"pj smurf tee ta\n",
"pj smurf tee ta\n",
"pj smurf tee ta\n",
"pj smurf tee ta\n",
"pj smurf tee ta\n",
"pj smurf tee ta\n",
"pj smurf ta\n",
"pj smurf ta\n",
"pj smurf ta\n",
"pj smurf ta\n",
" smurf ta\n",
" smurf ta\n",
"max tee 826581\n",
"max tee 826581\n",
"max tee 826581\n",
"max tee 826581\n",
"max tee 826581\n",
"max tee 826581\n",
"max tee 826581\n",
"max tee 826581\n",
"max tee 826581\n",
"max tee 826581\n",
"max tee 826581\n",
"max tee 826581\n",
"max tee 826581\n",
"max tee 826581\n",
"max tee 826581\n",
"max tee 826581\n",
"max tee 826581\n",
"max tee 826581\n",
"max 826581\n",
"max 826581\n",
"max 826581\n",
"max 826581\n",
"max 826581\n",
"max 826581\n",
"nightwear btm pj2916\n",
"nightwear btm pj2916\n",
"nightwear btm pj2916\n",
"nightwear btm pj2916\n",
"nightwear btm pj2916\n",
"nightwear btm pj2916\n",
"nightwear btm pj2916\n",
"nightwear btm pj2916\n",
"nightwear btm pj2916\n",
"nightwear btm pj2916\n",
"nightwear btm pj2916\n",
"nightwear btm pj2916\n",
"nightwear btm pj2916\n",
"nightwear btm pj2916\n",
"nightwear btm pj2916\n",
" btm pj2916\n",
" btm pj2916\n",
" btm pj2916\n",
" btm pj2916\n",
" btm pj2916\n",
" btm pj2916\n",
" btm pj2916\n",
" btm pj2916\n",
" btm pj2916\n",
"mens l slv tee 933005\n",
"mens l slv tee 933005\n",
"mens l slv tee 933005\n",
"mens l slv tee 933005\n",
"mens l slv tee 933005\n",
"mens l slv tee 933005\n",
"mens l slv tee 933005\n",
"mens l slv tee 933005\n",
"mens l slv tee 933005\n",
"mens l slv tee 933005\n",
"mens l slv tee 933005\n",
"mens l slv tee 933005\n",
" l slv tee 933005\n",
" l slv tee 933005\n",
" l slv tee 933005\n",
" l slv tee 933005\n",
" l slv tee 933005\n",
" l slv tee 933005\n",
" l slv 933005\n",
" l slv 933005\n",
" l 933005\n",
" 933005\n",
" 933005\n",
" 933005\n",
"basic shorts 174490\n",
"basic shorts 174490\n",
"basic shorts 174490\n",
"basic shorts 174490\n",
"basic shorts 174490\n",
"basic shorts 174490\n",
" shorts 174490\n",
" shorts 174490\n",
" 174490\n",
" 174490\n",
" 174490\n",
" 174490\n",
" 174490\n",
" 174490\n",
" 174490\n",
" 174490\n",
" 174490\n",
" 174490\n",
" 174490\n",
" 174490\n",
" 174490\n",
" 174490\n",
" 174490\n",
" 174490\n",
"nightwear top hanger loop bs2923\n",
"nightwear top hanger loop bs2923\n",
"nightwear top hanger loop bs2923\n",
"nightwear top hanger loop bs2923\n",
"nightwear top hanger loop bs2923\n",
"nightwear top hanger loop bs2923\n",
"nightwear top hanger loop bs2923\n",
"nightwear top hanger loop bs2923\n",
"nightwear top hanger loop bs2923\n",
"nightwear top hanger loop bs2923\n",
"nightwear hanger loop bs2923\n",
"nightwear hanger loop bs2923\n",
"nightwear hanger loop bs2923\n",
"nightwear hanger loop bs2923\n",
"nightwear hanger loop bs2923\n",
" hanger loop bs2923\n",
" hanger loop bs2923\n",
" hanger loop bs2923\n",
" hanger loop bs2923\n",
" hanger loop bs2923\n",
" hanger loop bs2923\n",
" hanger loop bs2923\n",
" hanger loop bs2923\n",
" hanger loop bs2923\n"
]
}
],
"prompt_number": 94
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below I now do the stripping for real:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"StylesMod = Styles.copy()\n",
"for row in StylesMod.index:\n",
" StringVal = Styles[row]\n",
" \n",
" for word in GenericRemoveAll:\n",
" RegEx1 = re.compile(r'' + word)\n",
" StringVal = re.sub(RegEx1, '', StringVal)\n",
" \n",
" for word in GenericWhiteSpace:\n",
" RegEx2 = re.compile(r'\\b' + word + r'\\b')\n",
" StringVal = re.sub(RegEx2, '', StringVal)\n",
" \n",
" StylesMod[row] = StringVal.strip()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 95
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Coding Update**<br />\n",
"When testing this method, I came across the following transformation:\n",
"\n",
">\"uwc932011 898899 lslvts sslvts\" \u2192 \"uwc932011 898899 lts sts\"<br />\n",
"\n",
"\n",
"The original GenericRemoveAll list lookd like this:\n",
"\n",
" GenericRemoveAll = ['bottom', 'boys', 'long', 'topb', 'slv', 'tank', 'basic', 'polo', 'shorts', 'ssivts', \\\n",
" 'top', 'sslvts', 'mens', 'nightware', 'lslvts', 'nightwear', 'msp', 'lsivts', 'tee', 'large',]\n",
"\n",
"Which seemed odd, as lslvts and sslvts are clearly in the GenericRemoveAll list. However 'slv' is also in the list, and in the iteration this comes first, so 'slv' was removed from 'lslvts' to leave 'lts' and so on. Therefore, either the list has to be composed carefully, so that the iteration order is carefully controlled, or perhaps a different strategy would be better altogether. For example, rather than moving from list to string, it would be better to split the string and then search the list. \n",
"\n",
"I will return to this at a later date"
]
},
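{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a first, minimal sketch of that token-based alternative (the function name `strip_generic_tokens` is a placeholder): split the style into words and drop whole tokens found in the generic list, so 'slv' can no longer eat into 'lslvts'. The trade-off is that, unlike substring removal, this will not catch generic components glued to an identifier (e.g. '1845ssivts'):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def strip_generic_tokens(style, generic_words):\n",
"    # Keep only the tokens that are not flagged as generic. Matching whole\n",
"    # tokens means 'slv' cannot be removed from inside 'lslvts'.\n",
"    return ' '.join(word for word in style.split() if word not in generic_words)\n",
"\n",
"# -> 'uwc932011 898899'\n",
"print strip_generic_tokens('uwc932011 898899 lslvts sslvts',\n",
"                           set(GenericRemoveAll + GenericWhiteSpace))"
],
"language": "python",
"metadata": {},
"outputs": []
},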
{
"cell_type": "code",
"collapsed": false,
"input": [
"DFUn['stripped'] = StylesMod"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 96
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's see how out stripped strings compare with the originals:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"HTML(DFUn[:50].to_html())"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>style</th>\n",
" <th>stripped</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0 </th>\n",
" <td> boys s slv r nk tee ts334</td>\n",
" <td> r nk ts334</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1 </th>\n",
" <td> clock house tanktop</td>\n",
" <td> clock house</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2 </th>\n",
" <td> boys basic tanktop 354140</td>\n",
" <td> 354140</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3 </th>\n",
" <td> vito v nk s slv tee 805914</td>\n",
" <td> vito v nk 805914</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4 </th>\n",
" <td> pj smurf tee topta</td>\n",
" <td> smurf ta</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5 </th>\n",
" <td> max tee 826581</td>\n",
" <td> max 826581</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6 </th>\n",
" <td> nightwear btm pj2916</td>\n",
" <td> btm pj2916</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7 </th>\n",
" <td> mens l slv tee 933005</td>\n",
" <td> 933005</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8 </th>\n",
" <td> basic shorts 174490</td>\n",
" <td> 174490</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9 </th>\n",
" <td> nightwear top hanger loop bs2923</td>\n",
" <td> hanger loop bs2923</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td> boys s slv mock layer tee sr189</td>\n",
" <td> mock layer sr189</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td> marks shorts 226175</td>\n",
" <td> marks 226175</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td> nightwear top with hanger loop bs2923</td>\n",
" <td> with hanger loop bs2923</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td> mens s slv polo 313606</td>\n",
" <td> 313606</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td> nightwear top with hanger loop bs2916</td>\n",
" <td> with hanger loop bs2916</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td> boys basic tanktop 904240</td>\n",
" <td> 904240</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td> shark tank top 595551</td>\n",
" <td> shark 595551</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td> mens s slv tee 931005</td>\n",
" <td> 931005</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td> boys s slv r nk tee 344001</td>\n",
" <td> r nk 344001</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td> nightwear btm pj2921</td>\n",
" <td> btm pj2921</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td> basic shorts 776611</td>\n",
" <td> 776611</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td> mens s slv placket tee 453001</td>\n",
" <td> placket 453001</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td> nightwear btm sh2913</td>\n",
" <td> btm sh2913</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td> mens s slv polo 313606xl team</td>\n",
" <td> 313606xl team</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td> marks shorts 226176</td>\n",
" <td> marks 226176</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td> nightwear top with hanger loop bs2911</td>\n",
" <td> with hanger loop bs2911</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td> mens s slv tee uwc931005</td>\n",
" <td> uwc931005</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td> mens s slv polo with bon pocket 313606</td>\n",
" <td> with bon pocket 313606</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td> boys s slv tee ts903</td>\n",
" <td> ts903</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td> ray tank top 177111</td>\n",
" <td> ray 177111</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30</th>\n",
" <td> basic shorts 776622</td>\n",
" <td> 776622</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31</th>\n",
" <td> boys slv less tee dlp350stripe</td>\n",
" <td> less dlp350stripe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32</th>\n",
" <td> basic shorts 667570</td>\n",
" <td> 667570</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td> vito v nk s slv tee 845720</td>\n",
" <td> vito v nk 845720</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34</th>\n",
" <td> boys s slv placket tee ts912</td>\n",
" <td> placket ts912</td>\n",
" </tr>\n",
" <tr>\n",
" <th>35</th>\n",
" <td> marks shorts 226172</td>\n",
" <td> marks 226172</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36</th>\n",
" <td> boys slv less tee sg356</td>\n",
" <td> less sg356</td>\n",
" </tr>\n",
" <tr>\n",
" <th>37</th>\n",
" <td> san fransisco tank top 693170</td>\n",
" <td> san fransisco 693170</td>\n",
" </tr>\n",
" <tr>\n",
" <th>38</th>\n",
" <td> micky mouse s slv tee 455650</td>\n",
" <td> micky mouse 455650</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39</th>\n",
" <td> boys slv less tee dlp350solid</td>\n",
" <td> less dlp350solid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40</th>\n",
" <td> boys slv less tee 320001</td>\n",
" <td> less 320001</td>\n",
" </tr>\n",
" <tr>\n",
" <th>41</th>\n",
" <td> vip tank top 561580</td>\n",
" <td> vip 561580</td>\n",
" </tr>\n",
" <tr>\n",
" <th>42</th>\n",
" <td> pj smurf tee toptb</td>\n",
" <td> smurf tb</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43</th>\n",
" <td> boys s slv r nk tee ts354</td>\n",
" <td> r nk ts354</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44</th>\n",
" <td> basic shorts 179324</td>\n",
" <td> 179324</td>\n",
" </tr>\n",
" <tr>\n",
" <th>45</th>\n",
" <td> mens s slv polo with pocket 320764</td>\n",
" <td> with pocket 320764</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46</th>\n",
" <td> i love to party s slv tee 693170</td>\n",
" <td> i love to party 693170</td>\n",
" </tr>\n",
" <tr>\n",
" <th>47</th>\n",
" <td> boys s slv r nk tee ts921</td>\n",
" <td> r nk ts921</td>\n",
" </tr>\n",
" <tr>\n",
" <th>48</th>\n",
" <td> boys s slv mock layer tee ts357</td>\n",
" <td> mock layer ts357</td>\n",
" </tr>\n",
" <tr>\n",
" <th>49</th>\n",
" <td> ray tank top 538810</td>\n",
" <td> ray 538810</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 98,
"text": [
"<IPython.core.display.HTML at 0x6577690>"
]
}
],
"prompt_number": 98
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br />"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Stripping out generic components of strings I believe can help in string matching. By stripping out generic components the risk of spurious matches is reduced, and the identification of true matches previously obscured by incompatiable generic components is more strongly assured. However, the appropriateness of this strategy will of cours depend upon the data you are working with. \n",
"\n",
"Initially this workbook then went on to look at how to cluster the strings, but this will now form the basis of the next part of the series. "
]
}
],
"metadata": {}
}
]
}