cirla/Do Androids Dream of Electric Blue.ipynb

## Do Androids Dream of Electric Blue.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Do Androids Dream of Electric Blue?\n",
    "\n",
    "Paint colors always have such fanciful names. Can we teach a computer to take a random color and assign it a name that not only sounds original but fits visually?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Corpus Colorum\n",
    "\n",
    "First, let's gather some paint color information from existing brands.\n",
    "\n",
    "Digging through the sources of [Benjamin Moore's](http://www.benjaminmoore.com/en-us/for-your-home/color-gallery) and [Sherwin-Williams'](http://www.sherwin-williams.com/homeowners/color/find-and-explore-colors/paint-colors-by-family/) color explorers, I was able to find some JSON endpoints that give us names, RGB values, and color families for all of their currently available colors. There's some other information (e.g. \"color collection\", \"goes great with\") that might be fun to play around with, but for now I'm just grabbing this simple information, e.g.\n",
    "\n",
    "```python\n",
    "{\n",
    "    'name': 'sylvan mist',\n",
    "    'rgb': (184, 199, 191),\n",
    "    'family': 'blue',\n",
    "}\n",
    "```\n",
    "\n",
    "This data is pretty static and we don't want to be slamming these APIs unnecessarily, so let's create a decorator to cache the parsed results in a local file for subsequent runs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import functools\n",
    "import os\n",
    "import pickle\n",
    "\n",
    "import requests\n",
    "\n",
    "def local_cache(cache_filename):\n",
    "    \"\"\"\n",
    "    Decorator to cache the results of a function in a local file\n",
    "    and prefer loading from that file, if it exists, over executing\n",
    "    the function again. The cache is keyed only by filename, so it\n",
    "    suits functions that take no arguments.\n",
    "    \"\"\"\n",
    "    \n",
    "    def decorator(f):\n",
    "        @functools.wraps(f)  # keep the wrapped function's name and docstring\n",
    "        def inner(*args, **kwargs):\n",
    "            # check for local cache\n",
    "            if os.path.exists(cache_filename):\n",
    "                with open(cache_filename, 'rb') as cache:\n",
    "                    return pickle.load(cache)\n",
    "\n",
    "            # no cache; call load function\n",
    "            data = f(*args, **kwargs)\n",
    "            \n",
    "            # save a local cache before returning\n",
    "            with open(cache_filename, 'wb') as cache:\n",
    "                pickle.dump(data, cache)\n",
    "            return data\n",
    "        return inner\n",
    "    return decorator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Benjamin Moore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "BM_FAMILIES_URL = \"http://67.222.214.23/bmServices/ColorExplorer/colorexplorer.svc/ColorFamilies_GetAll?locale=en_US\"\n",
    "BM_COLORS_URL_FMT = \"http://67.222.214.23/bmServices/ColorExplorer/colorexplorer.svc/Colors_GetByFilter?locale=en_US&familyCode={familyCode}&collectionCode=&trendCode=\"\n",
    "\n",
    "@local_cache(\"./benjamin-moore.pickle\")\n",
    "def load_benjamin_moore():\n",
    "    return [{\n",
    "            \"name\": c[\"colorName\"].lower(),\n",
    "            \"rgb\": (c[\"RGB\"][\"R\"], c[\"RGB\"][\"G\"], c[\"RGB\"][\"B\"]),\n",
    "            \"family\": f[\"familyName\"].lower(),\n",
    "        }\n",
    "        for f in requests.get(BM_FAMILIES_URL).json()\n",
    "        for c in requests.get(BM_COLORS_URL_FMT.format(familyCode=f[\"familyCode\"])).json()\n",
    "    ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4416"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corpus = []\n",
    "corpus.extend(load_benjamin_moore())\n",
    "len(corpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sherwin-Williams"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import struct\n",
    "\n",
    "SW_COLORS_URL = \"http://www.sherwin-williams.com/homeowners/color/find-and-explore-colors/paint-colors-by-family/json/full/\"\n",
    "\n",
    "# ignore these families as they contain a broad mix of colors\n",
    "SW_EXCLUDE_FAMILIES = {\n",
    "    \"historic-color\",\n",
    "    \"timeless-color\",\n",
    "}\n",
    "\n",
    "@local_cache(\"./sherwin-williams.pickle\")\n",
    "def load_sherwin_williams():\n",
    "    return [{\n",
    "            \"name\": c[\"attributes\"][\"data-search-by\"].split(\"|\")[0].lower(),\n",
    "            \"rgb\": struct.unpack('BBB', bytes.fromhex(c[\"attributes\"][\"data-color-hex\"])),\n",
    "            \"family\": \"white\" if f[\"cleanName\"] == \"white-pastel\" else f[\"cleanName\"],\n",
    "        }\n",
    "        for _, f in requests.get(SW_COLORS_URL).json().items() if f[\"cleanName\"] not in SW_EXCLUDE_FAMILIES\n",
    "        for c in f[\"items\"]\n",
    "    ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5634"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corpus.extend(load_sherwin_williams())\n",
    "len(corpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exploring the Corpus\n",
    "\n",
    "Now that we have some information, let's take a look at it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'family': 'neutral', 'name': 'old prairie', 'rgb': (227, 225, 210)},\n",
       " {'family': 'neutral', 'name': \"marilyn's dress\", 'rgb': (225, 229, 229)},\n",
       " {'family': 'red', 'name': 'wheatberry', 'rgb': (238, 226, 219)},\n",
       " {'family': 'orange', 'name': 'yellow marigold', 'rgb': (235, 168, 50)},\n",
       " {'family': 'neutral', 'name': 'urbane bronze', 'rgb': (84, 80, 74)},\n",
       " {'family': 'neutral', 'name': 'wooded vista', 'rgb': (154, 117, 94)},\n",
       " {'family': 'green', 'name': 'aloe', 'rgb': (172, 202, 188)},\n",
       " {'family': 'orange', 'name': 'abbey brown', 'rgb': (131, 91, 74)},\n",
       " {'family': 'black', 'name': 'wrought iron', 'rgb': (74, 75, 76)},\n",
       " {'family': 'green', 'name': 'calico blue', 'rgb': (75, 90, 81)}]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import random\n",
    "sample = random.sample(corpus, 10)\n",
    "sample"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It would be neat if we could view the colors inline:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from binascii import hexlify\n",
    "\n",
    "from IPython import display\n",
    "\n",
    "def display_color(color):\n",
    "    return display.HTML(\"\"\"\n",
    "        <div>\n",
    "            <p><div style=\"font-weight:bold\">{name}</div> ({family})</p>\n",
    "            <p><svg width=\"50\" height=\"50\" style=\"background: #{rgb_hex}\"></p>\n",
    "            <p>#{rgb_hex}</p>\n",
    "        </div>\n",
    "    \"\"\".format(\n",
    "        name=color[\"name\"].title(),\n",
    "        family=color[\"family\"].title(),\n",
    "        rgb_hex=hexlify(struct.pack('BBB', *color[\"rgb\"])).decode(\"utf-8\")\n",
    "    ))\n",
    "\n",
    "def display_colors(colors):\n",
    "    return display.display(*[display_color(c) for c in colors])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Old Prairie</div> (Neutral)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #e3e1d2\"></p>\n",
       "            <p>#e3e1d2</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Marilyn'S Dress</div> (Neutral)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #e1e5e5\"></p>\n",
       "            <p>#e1e5e5</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Wheatberry</div> (Red)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #eee2db\"></p>\n",
       "            <p>#eee2db</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Yellow Marigold</div> (Orange)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #eba832\"></p>\n",
       "            <p>#eba832</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Urbane Bronze</div> (Neutral)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #54504a\"></p>\n",
       "            <p>#54504a</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display_colors(sample[:5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'black',\n",
       " 'blue',\n",
       " 'brown',\n",
       " 'gray',\n",
       " 'green',\n",
       " 'neutral',\n",
       " 'orange',\n",
       " 'pink',\n",
       " 'purple',\n",
       " 'red',\n",
       " 'white',\n",
       " 'yellow'}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "families = {c[\"family\"] for c in corpus}\n",
    "families"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## All in the Family\n",
    "\n",
    "A good start for assigning a name to a random color would be to first figure out to which family it belongs.\n",
    "\n",
    "We're going to try a few different classifiers, so let's wrap them in a similar interface:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "class ColorFamilyClassifier:\n",
    "    def __init__(self, color_corpus, train_percent=0.8):\n",
    "        data = [\n",
    "            (self.get_features(color), self.get_label(color))\n",
    "            for color in color_corpus\n",
    "        ]\n",
    "        \n",
    "        random.shuffle(data)\n",
    "        \n",
    "        cut_index = int(train_percent * len(data))\n",
    "        self.train_set = data[:cut_index]\n",
    "        self.test_set = data[cut_index:]\n",
    "        \n",
    "        self.init_classifier()\n",
    "    \n",
    "    # Subclasses must override the methods below.\n",
    "    # NotImplementedError is the exception; NotImplemented is a constant\n",
    "    # meant for binary operators and cannot be raised.\n",
    "    def get_features(self, color):\n",
    "        raise NotImplementedError\n",
    "    \n",
    "    def get_label(self, color):\n",
    "        raise NotImplementedError\n",
    "    \n",
    "    def init_classifier(self):\n",
    "        raise NotImplementedError\n",
    "        \n",
    "    def accuracy(self):\n",
    "        raise NotImplementedError\n",
    "    \n",
    "    def classify(self, color):\n",
    "        raise NotImplementedError\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Naïve Bayes\n",
    "\n",
    "Let's start by seeing how much mileage we can get with a Naïve Bayes classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import nltk\n",
    "\n",
    "class NaiveBayesRGBClassifier(ColorFamilyClassifier):\n",
    "    def get_features(self, color):\n",
    "        return dict(zip((\"red\", \"green\", \"blue\"), color[\"rgb\"]))\n",
    "    \n",
    "    def get_label(self, color):\n",
    "        return color[\"family\"]\n",
    "    \n",
    "    def init_classifier(self):\n",
    "        self.classifier = nltk.NaiveBayesClassifier.train(self.train_set)\n",
    "        \n",
    "    def accuracy(self):\n",
    "        return nltk.classify.accuracy(self.classifier, self.test_set)\n",
    "    \n",
    "    def classify(self, color):\n",
    "        return self.classifier.classify(self.get_features(color))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.2209405501330967"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "classifier = NaiveBayesRGBClassifier(corpus)\n",
    "classifier.accuracy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's not very good accuracy. Let's try different colorspaces:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.259094942324756"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import colorsys\n",
    "\n",
    "class NaiveBayesHSVClassifier(NaiveBayesRGBClassifier):\n",
    "    def get_features(self, color):\n",
    "        hsv = colorsys.rgb_to_hsv(*color[\"rgb\"])\n",
    "        return dict(zip((\"hue\", \"saturation\", \"value\"), hsv))\n",
    "\n",
    "classifier = NaiveBayesHSVClassifier(corpus)\n",
    "classifier.accuracy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.2440106477373558"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "class NaiveBayesHLSClassifier(NaiveBayesRGBClassifier):\n",
    "    def get_features(self, color):\n",
    "        hls = colorsys.rgb_to_hls(*color[\"rgb\"])\n",
    "        return dict(zip((\"hue\", \"lightness\", \"saturation\"), hls))\n",
    "    \n",
    "classifier = NaiveBayesHLSClassifier(corpus)\n",
    "classifier.accuracy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.07897071872227152"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import husl\n",
    "\n",
    "class NaiveBayesHUSLClassifier(NaiveBayesRGBClassifier):\n",
    "    def get_features(self, color):\n",
    "        hsl = husl.rgb_to_husl(*color[\"rgb\"])\n",
    "        return dict(zip((\"hue\", \"saturation\", \"lightness\"), hsl))\n",
    "    \n",
    "classifier = NaiveBayesHUSLClassifier(corpus)\n",
    "classifier.accuracy()"
   ]
  },
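  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "None of these are showing very good accuracy. I'm very surprised that HSV isn't significantly better than RGB.\n",
    "\n",
    "A likely culprit is the features themselves: NLTK's Naïve Bayes treats every feature value as a discrete symbol rather than a number, and `colorsys` and `husl` expect RGB components scaled to the 0 to 1 range, not the 0 to 255 values we're passing in.\n",
    "\n",
    "Let's try a different kind of classifier.\n",
    "\n",
    "### k-Nearest Neighbor\n",
    "\n",
    "Let's try k-NN classifiers and tweak the value of k (`n_neighbors`)."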
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "\n",
    "class KNNColorClassifier(ColorFamilyClassifier):\n",
    "    def __init__(self, color_corpus, train_percent=0.8, n_neighbors=5):\n",
    "        families = {c[\"family\"] for c in color_corpus}\n",
    "        \n",
    "        self.family_map = {\n",
    "            f: i for i, f in enumerate(families)\n",
    "        }\n",
    "        \n",
    "        self.reverse_family_map = {v: k for k, v in self.family_map.items()}\n",
    "        \n",
    "        self.n_neighbors = n_neighbors\n",
    "    \n",
    "        super().__init__(color_corpus, train_percent)\n",
    "        \n",
    "    def get_features(self, color):\n",
    "        return color[\"rgb\"]\n",
    "        # RGB gets significantly better accuracies than these others\n",
    "        # return colorsys.rgb_to_hsv(*color[\"rgb\"])\n",
    "        # return colorsys.rgb_to_hls(*color[\"rgb\"])\n",
    "        # return husl.rgb_to_husl(*color[\"rgb\"])\n",
    "    \n",
    "    def get_label(self, color):\n",
    "        return self.family_map[color[\"family\"]]\n",
    "    \n",
    "    def init_classifier(self):\n",
    "        self.classifier = KNeighborsClassifier(n_neighbors=self.n_neighbors)\n",
    "        [features, labels] = zip(*self.train_set)\n",
    "        self.classifier.fit(features, labels)\n",
    "        \n",
    "    def accuracy(self):\n",
    "        [features, labels] = zip(*self.test_set)\n",
    "        return self.classifier.score(features, labels)\n",
    "    \n",
    "    def classify(self, color):\n",
    "        return self.reverse_family_map[\n",
    "            self.classifier.predict(\n",
    "                [self.get_features(color)]\n",
    "            )[0]\n",
    "        ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(1, 0.52173913043478259),\n",
       " (2, 0.58562555456965393),\n",
       " (3, 0.56521739130434778),\n",
       " (4, 0.58385093167701863),\n",
       " (5, 0.5714285714285714),\n",
       " (6, 0.60248447204968947),\n",
       " (7, 0.58917480035492453),\n",
       " (8, 0.5705412599822538),\n",
       " (9, 0.60869565217391308),\n",
       " (10, 0.5732031943212067),\n",
       " (11, 0.60159716060337176),\n",
       " (12, 0.60159716060337176),\n",
       " (13, 0.60780834072759538),\n",
       " (14, 0.61224489795918369),\n",
       " (15, 0.61224489795918369),\n",
       " (16, 0.62821650399290152),\n",
       " (17, 0.60337178349600706),\n",
       " (18, 0.61313220940550128),\n",
       " (19, 0.60869565217391308),\n",
       " (20, 0.59627329192546585)]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "classifiers = {n: KNNColorClassifier(corpus, n_neighbors=n) for n in range(1, 21)}\n",
    "\n",
    "[(n, c.accuracy()) for n, c in classifiers.items()]"
   ]
  },
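  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like `n_neighbors` values of `4` and above all land at roughly 60% accuracy (between 57% and 63% in this run), so we'll stick with `4`. Each classifier shuffles the corpus independently, so the scores come from different train/test splits and the small differences between them are likely mostly noise. Still not great, but much better than the ~25% we were getting with Naïve Bayes. It's not shown above, but using RGB gets far better accuracy than the other colorspaces (or combinations thereof)."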
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "classifier = classifiers[4]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Testing out the Classifier\n",
    "\n",
    "Let's see how far off the classifier is when it's wrong. Is it pretty close (e.g. classifying an orange as a red or a yellow) or way off (e.g. classifying a blue as a pink)?\n",
    "\n",
    "We'll take a look at each family and see what family it's most commonly incorrectly identified as."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "red -> neutral (9 times)\n",
      "pink -> red (242 times)\n",
      "black -> neutral (18 times)\n",
      "green -> neutral (8 times)\n",
      "yellow -> neutral (24 times)\n",
      "blue -> green (16 times)\n",
      "orange -> red (14 times)\n",
      "brown -> neutral (91 times)\n",
      "white -> neutral (12 times)\n",
      "neutral -> blue (10 times)\n",
      "purple -> blue (46 times)\n",
      "gray -> neutral (156 times)\n"
     ]
    }
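   ],
   "source": [
    "import collections\n",
    "import itertools\n",
    "import operator\n",
    "\n",
    "classified = [(c[\"family\"], classifier.classify(c)) for c in corpus]\n",
    "# groupby only merges consecutive items, so sort by true family first;\n",
    "# otherwise later runs of a family overwrite earlier ones in the dict\n",
    "incorrect = {\n",
    "    f: collections.Counter([c[1] for c in g])\n",
    "    for f, g in itertools.groupby(\n",
    "        sorted((x for x in classified if x[0] != x[1]),\n",
    "               key=operator.itemgetter(0)),\n",
    "        key=operator.itemgetter(0))\n",
    "}\n",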
    "\n",
    "for family, counter in incorrect.items():\n",
    "    wrong_family, n = counter.most_common(1)[0]\n",
    "    print(\"{} -> {} ({} times)\".format(\n",
    "        family, wrong_family, n))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It appears that:\n",
    "1. Neutral colors are seriously gumming up the works.\n",
    "2. We're overly eager in identifying pinks as reds (and, to a lesser degree, purples as blues).\n",
    "\n",
    "We could remove neutral colors from the corpus entirely, but this only gets us up to ~70% accuracy and we lose 15% of our corpus data. For now let's just accept this as \"good enough\" and have some fun. Let's generate some random RGB values and take a guess at the family."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def random_color():\n",
    "    return {\n",
    "        \"name\": \"unnamed\",\n",
    "        \"family\": \"unknown\",\n",
    "        \"rgb\": (\n",
    "            random.randint(0, 255),\n",
    "            random.randint(0, 255),\n",
    "            random.randint(0, 255),\n",
    "        )\n",
    "    }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Unnamed</div> (Blue)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #464d98\"></p>\n",
       "            <p>#464d98</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Unnamed</div> (Green)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #3ca779\"></p>\n",
       "            <p>#3ca779</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Unnamed</div> (Green)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #74f3a7\"></p>\n",
       "            <p>#74f3a7</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Unnamed</div> (Yellow)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #e2b446\"></p>\n",
       "            <p>#e2b446</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Unnamed</div> (Purple)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #695d9c\"></p>\n",
       "            <p>#695d9c</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Unnamed</div> (Yellow)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #e4cb57\"></p>\n",
       "            <p>#e4cb57</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Unnamed</div> (Green)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #5f7e4e\"></p>\n",
       "            <p>#5f7e4e</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Unnamed</div> (Red)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #ba2040\"></p>\n",
       "            <p>#ba2040</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Unnamed</div> (Purple)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #6b17a5\"></p>\n",
       "            <p>#6b17a5</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Unnamed</div> (Green)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #53fbb8\"></p>\n",
       "            <p>#53fbb8</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "random_colors = [random_color() for i in range(10)]\n",
    "for c in random_colors:\n",
    "    c[\"family\"] = classifier.classify(c)\n",
    "    \n",
    "display_colors(random_colors)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Desert Rose by Any Other Name\n",
    "\n",
    "Now that we can make a decent guess at the color family for a random RGB value, let's try to build off of existing color names within families to create fun new names for colors we think are in that family.\n",
    "\n",
    "So we don't end up re-creating existing names, we're going to spice it up by adding related words to the corpus for each family.\n",
    "\n",
    "To achieve this, we're going to download the [WordNet](https://wordnet.princeton.edu/) corpus from [NLTK](http://www.nltk.org/) to help us get related words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package wordnet to /Users/tim/nltk_data...\n",
      "[nltk_data]   Package wordnet is already up-to-date!\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nltk.download([\n",
    "    \"wordnet\", # words, their definitions, and relations to other words\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To keep the names more relevant, we're going to segregate by color family. Our approach will be to tokenize the color names within a family and build a conditional frequency distribution of bigrams (i.e. given a random token, which tokens follow it and how often). We'll then use the CFD for a family to generate a color name using a Markov chain technique. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from nltk.corpus import wordnet\n",
    "\n",
    "class ColorNameGenerator(object):\n",
    "    def __init__(self, corpus):\n",
    "        self.family_bigrams = collections.defaultdict(list)\n",
    "        \n",
    "        for color in corpus:\n",
    "            tokens = color[\"name\"].split()\n",
    "            bigrams = nltk.bigrams(tokens)\n",
    "            self.family_bigrams[color[\"family\"]].append(bigrams)\n",
    "        \n",
    "        self.cfds = {\n",
    "            f: nltk.ConditionalFreqDist(itertools.chain(*b))\n",
    "            for f, b in self.family_bigrams.items()\n",
    "        }\n",
    "        \n",
    "    def choose_word(self, word):\n",
    "        base = wordnet.morphy(word)\n",
    "        synsets = wordnet.synsets(base if base else word)\n",
    "        \n",
    "        if not synsets:\n",
    "            return [word]\n",
    "        \n",
    "        syns = itertools.chain(*[synset.lemma_names() for synset in synsets])\n",
    "        return random.choice([word, *syns]).split(\"_\")\n",
    "        \n",
    "    def generate_name(self, family, max_len=5):\n",
    "        name = []\n",
    "        \n",
    "        cfd = self.cfds[family]\n",
    "        word = random.choice(cfd.conditions())\n",
    "        name.extend(self.choose_word(word))\n",
    "        \n",
    "        while word in cfd and len(name) < max_len:\n",
    "            fd = cfd[word]\n",
    "            items = list(fd.items())\n",
    "            random.shuffle(items)\n",
    "        \n",
    "            # weighted pick: draw r from 1..N so each follower is chosen\n",
    "            # with probability count / N (starting at 0 would over-weight\n",
    "            # whichever item comes first)\n",
    "            cum = 0\n",
    "            r = random.randint(1, fd.N())\n",
    "            for i in items:\n",
    "                if r <= cum + i[1]:\n",
    "                    word = i[0]\n",
    "                    name.extend(self.choose_word(word))\n",
    "                    break\n",
    "                cum += i[1]\n",
    "\n",
    "        return \" \".join(name).title()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Bravado Blood-Red River Corpse'"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "color_name_gen = ColorNameGenerator(corpus)\n",
    "color_name_gen.generate_name(\"red\")"
   ]
  },
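  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Putting it All Together\n",
    "\n",
    "Let's go wild and generate some random colors, letting the classifier pick each one's family and the name generator give it a name."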
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def gen_color():\n",
    "    c = random_color()\n",
    "    c[\"family\"] = classifier.classify(c)\n",
    "    c[\"name\"] = color_name_gen.generate_name(c[\"family\"])\n",
    "    return c\n",
    "\n",
    "def gen_colors(n):\n",
    "    colors = [gen_color() for i in range(n)]\n",
    "    return display_colors(colors)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Hang Around Green River Quest</div> (Green)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #27cc26\"></p>\n",
       "            <p>#27cc26</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Blue Jet Stream</div> (Blue)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #5757e0\"></p>\n",
       "            <p>#5757e0</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Of Paradise Peach Cyder</div> (Orange)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #f0685b\"></p>\n",
       "            <p>#f0685b</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Shore Honey Oil Onyx</div> (Green)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #b8cc12\"></p>\n",
       "            <p>#b8cc12</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Shady Lane</div> (Green)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #82fc15\"></p>\n",
       "            <p>#82fc15</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Calendar Method Of Birth Control</div> (Blue)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #1b0ae8\"></p>\n",
       "            <p>#1b0ae8</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Maidenhair Fern</div> (Green)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #50c32f\"></p>\n",
       "            <p>#50c32f</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Weave Pink Ingenuousness</div> (Red)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #876764\"></p>\n",
       "            <p>#876764</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Hellenic Putting Green Grove</div> (Green)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #8eb025\"></p>\n",
       "            <p>#8eb025</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><div style=\"font-weight:bold\">Maturate Wine</div> (Red)</p>\n",
       "            <p><svg width=\"50\" height=\"50\" style=\"background: #c028b0\"></p>\n",
       "            <p>#c028b0</p>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "gen_colors(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Room for Improvement\n",
    "\n",
    "What we've created is pretty entertaining, but there are plenty of things we can do to make it more interesting/useful:\n",
    "\n",
    "1. Improve the accuracy of the color family classifier\n",
    "    1. Investigate accuracy specifically for neutrals, red/pink, and purple/blue\n",
    "    2. Add the [HTML color names](http://www.w3schools.com/colors/colors_groups.asp) to the training set\n",
    "2. Improve the name generator:\n",
    "    1. Find a way to use the RGB value more directly in the name generator instead of broadly relying on the guessed family. A name that fits well with a dark green may be completely irrelevant to a light green.\n",
    "    2. [Pun detection/generation](http://www.aclweb.org/anthology/P15-1070)\n",
    "    3. Use a different approach from bigrams (we have too many unique bigrams with very little branching in the CFD).\n",
    "    4. I tried using NLTK's part-of-speech tagging on the tokens to improve the relevance of the synsets, but the POS tagging doesn't work well on short names (e.g. \"old blue jeans\" is tagged as `[adj, noun, noun]`, whereas in the context of \"She was wearing old blue jeans.\" it is correctly tagged as `[adj, adj, noun]`). I'd need to find instances of the names (or bigrams) used in full sentences in some other corpora and tag them via that context.\n",
    "3. Augment the corpus:\n",
    "    1. Add more paint brands (including nail polish colors)\n",
    "    2. Seed with [humorous color names](https://www.reddit.com/r/funny/comments/39h4bx/i_renamed_some_of_the_paint_colors_at_the/)\n",
    "    3. Use [OpenCV](http://opencv.org/) to detect combinations of tags highly correlated with certain colors on [Flickr](https://www.flickr.com) images"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}