{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Relationship Index\n",
"\n",
"Assume we have few people, `luffy`, `zoro`, `sanji`, `chopper` and `franky` (yup, [One Piece](https://en.wikipedia.org/wiki/One_Piece) characters), who have their preferences about three particular food items, `meat`, `candy` and `soup`. Based on these preferences, we have to find out how similar are the food preferences of any two of them."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# The maximum possible score for any food item is 10.0 and the minimum is 0.0.\n",
"food_preferences = {\n",
" 'luffy': {'meat': 10.0, 'candy': 7.5, 'soup': 6.0},\n",
" 'zoro': {'meat': 5.0, 'candy': 0.5, 'soup': 8.5},\n",
" 'sanji': {'meat': 6.5, 'candy': 2.5, 'soup': 7.5},\n",
" 'chopper': {'meat': 0.5, 'candy': 10.0, 'soup': 6.5},\n",
" 'franky': {'meat': 7.0, 'candy': 4.0, 'soup': 8.0}\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Suppose, we want to guess how much `luffy`'s food preferences are similar to that of `chopper`. We can do a raw guess by scanning over the data. It seems `luffy` has a **much higher** preference for `meat` than `chopper`, but a **slightly lower** preference for `candy`, and **somewhat similar** preference for `soup`.\n",
"\n",
"However, the terms, **much higher**, **slightly lower** and **somewhat similar** doesn't give us a quantifiable data. By this, I mean, if I would like to compare `zoro` and `luffy`'s food similarity with that of `chopper` and `luffy`, I don't have anything to compare with except for plain English words. Thus, we need a quantity to compare data. This is where our _Relationship Index_ comes in. It is just another number. Please don't mind the fancy name :).\n",
"\n",
"We would only have to compare the relationship indexes now. But, a bigger question here would be, how to generate this index for any two persons. Should we just add up the total scores of their individual preferences and check? Or maybe averaging it and then comparing sounds better? Sorry, but none of these random analysis would do justice to all the parameters.\n",
"\n",
"Fortunately, there a couple of methods for this, also called as **Correlation methods**. The most famous is _Pearson's Product-Moment Correlation_."
]
},
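{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before reaching for a proper correlation measure, a quick sketch shows why such naive aggregates fall short. The cell below (purely illustrative, not part of the analysis that follows) compares characters by the average of their scores: `sanji` and `chopper` end up with very similar averages even though they disagree sharply on `meat` and `candy`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Naive approach: compare characters by the average of their food scores.\n",
"# This only illustrates how aggregates hide per-item disagreement.\n",
"for character, preferences in sorted(food_preferences.items()):\n",
"    average = sum(preferences.values()) / len(preferences)\n",
"    print('%s: average score %.2f, individual scores %s' % (\n",
"        character, average, sorted(preferences.items())))"
]
},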
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Pearson's Product-Moment Correlation\n",
"\n",
"This is one of the most commonly used formula for calculating correlation between two data samples (food item scores of any two characters in our case). The formula may look a bit complicated, but then it works!\n",
"\n",
"$$r = \\frac{n \\cdot \\sum_{i=0}^{n}a_i \\cdot b_i - \\sum_{i=0}^{n}a_i \\cdot \\sum_{i=0}^{n}b_i}{\\sqrt{\\sum_{i=0}^{n}a_i^2 - (\\sum_{i=0}^{n}a_i)^2} \\cdot \\sqrt{\\sum_{i=0}^{n}b_i^2 - (\\sum_{i=0}^{n}b_i)^2}}$$\n",
"\n",
"where $r$ is the final correlation index, $a$ and $b$ are the food item preferences of the two characters respectively, $a_i$ and $b_i$ are the $i^{th}$ food item scores and $n$ is the total number of common food items.\n",
"\n",
"Let's write the code for calculation Pearson's similarity index. We will be running these tests for each unique combination of the characters and find their similarity index with respect to each other."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Similarity scores:\n",
"\n",
"Similarity between luffy and chopper: -0.185063\n",
"Similarity between luffy and franky: -0.008651\n",
"Similarity between luffy and sanji: -0.006108\n",
"Similarity between luffy and zoro: -0.078329\n",
"Similarity between chopper and franky: -0.197984\n",
"Similarity between chopper and sanji: -0.315994\n",
"Similarity between chopper and zoro: -0.414149\n",
"Similarity between franky and sanji: 0.167403\n",
"Similarity between franky and zoro: 0.324141\n",
"Similarity between sanji and zoro: 0.478795\n"
]
}
],
"source": [
"import math\n",
"\n",
"from itertools import combinations\n",
"\n",
"# Let's define a function so that we can reuse Pearson's similarity method.\n",
"def pearson_similarity(food_items, character_1, character_2):\n",
" \n",
" # Get all common food items between the two characters. Even though both of them have same\n",
" # food items in common, still we are performing this check as we will be reusing this\n",
" # method later.\n",
" similar_food_items = set(food_items[character_1].keys()) & set(\n",
" food_items[character_2].keys())\n",
"\n",
" if not similar_food_items:\n",
" return 0\n",
"\n",
" count = len(similar_food_items)\n",
" \n",
" # Sum of scores.\n",
" character_1_rating_sum = sum([food_items[character_1][food_item]\n",
" for food_item in similar_food_items])\n",
" character_2_rating_sum = sum([food_items[character_2][food_item]\n",
" for food_item in similar_food_items])\n",
" \n",
" # Sum of squares of scores.\n",
" character_1_rating_sum_square = sum(\n",
" [food_items[character_1][food_item] ** 2 for food_item in similar_food_items])\n",
" character_2_rating_sum_square = sum(\n",
" [food_items[character_2][food_item] ** 2 for food_item in similar_food_items])\n",
" \n",
" # Sum of product of scores.\n",
" critic_rating_sum_product = sum(\n",
" [food_items[character_1][food_item] * food_items[character_2][food_item]\n",
" for food_item in similar_food_items]\n",
" )\n",
"\n",
" numerator = count * critic_rating_sum_product - (character_1_rating_sum * character_2_rating_sum)\n",
" denominator = math.sqrt((character_1_rating_sum_square - character_1_rating_sum ** 2)\n",
" * (character_2_rating_sum_square - character_2_rating_sum ** 2))\n",
"\n",
" if denominator == 0:\n",
" return 0.0\n",
"\n",
" return numerator / denominator\n",
"\n",
"# Printing character scores.\n",
"def __test_pearson_similarity():\n",
" characters = list(food_preferences.keys())\n",
" print('Similarity scores:\\n')\n",
" \n",
" for combination in combinations(characters, 2):\n",
" # Each combination is a tuple of this form => ('luffy', 'zoro')\n",
" print('Similarity between %s and %s: %f' % (combination[0], combination[1],\n",
" pearson_similarity(food_preferences,\n",
" combination[0], \n",
" combination[1])))\n",
"__test_pearson_similarity()"
]
},
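{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on the implementation above (assuming NumPy is available in the environment; it is not used anywhere else in this notebook), we can compare `pearson_similarity` against `numpy.corrcoef`, which computes the same product-moment correlation. The two columns printed below should agree up to floating-point rounding."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Cross-check pearson_similarity() against NumPy's correlation matrix.\n",
"# Assumes NumPy is installed; this cell is purely a verification aid.\n",
"import numpy as np\n",
"\n",
"def numpy_similarity(food_items, character_1, character_2):\n",
"    common = sorted(set(food_items[character_1]) & set(food_items[character_2]))\n",
"    a = [food_items[character_1][item] for item in common]\n",
"    b = [food_items[character_2][item] for item in common]\n",
"    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.\n",
"    return np.corrcoef(a, b)[0, 1]\n",
"\n",
"for pair in combinations(sorted(food_preferences), 2):\n",
"    print('%s and %s: ours=%.6f, numpy=%.6f' % (\n",
"        pair[0], pair[1],\n",
"        pearson_similarity(food_preferences, *pair),\n",
"        numpy_similarity(food_preferences, *pair)))"
]
},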
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explanation\n",
"\n",
"To put it in simple terms, Pearson's coefficient is used to determine whether all the points for the given samples are in a **straight line** and how **close** are these points to this straight line. The coefficient is the slope of this line. A negative coefficient means a negative slope.\n",
"\n",
"* For $r = 1$, all the points of the two samples are in a straight line.\n",
"* For $1 > r > 0$, all the points are not in a straight line, but appear to be in a straight line.\n",
"* For $r = 0$, all the points are scattered and no straight line pattern can be found.\n",
"* For $0 > r > -1$, all the points are not in a straight line, but appear to be in a straight line, but with a negative slope.\n",
"* For $r = -1$, all the points of the two samples are in a straight line, but with a negative slope.\n",
"\n",
"This can be better understood by the figure below. [Source: Wikipedia](https://commons.wikimedia.org/wiki/File:Correlation_coefficient.png#/media/File:Correlation_coefficient.png)\n",
"\n",
"<a href=\"https://commons.wikimedia.org/wiki/File:Correlation_coefficient.png#/media/File:Correlation_coefficient.png\"><img src=\"https://upload.wikimedia.org/wikipedia/commons/3/34/Correlation_coefficient.png\" alt=\"Correlation coefficient.png\" width=\"640\" height=\"353\"></a>\n",
"\n",
"What are interested in is the absolute value of these coefficients which decides the similarity of the items. In our case, it is highest for `sanji` and `zoro` (LOL!) and lowest for `sanji` and `luffy`."
]
}
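{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the bullet points above concrete, here is a tiny hand-made example (the `perfectly_aligned` and `perfectly_opposed` dictionaries below are made up purely for illustration and are not part of the food preference data). In the first pair the two score lists lie exactly on an upward-sloping line, in the second exactly on a downward-sloping line, so `pearson_similarity` should return $1$ and $-1$ respectively."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative-only data: the 'b' scores are an exact linear function of the 'a' scores.\n",
"perfectly_aligned = {\n",
"    'a': {'x': 1.0, 'y': 2.0, 'z': 3.0},\n",
"    'b': {'x': 2.0, 'y': 4.0, 'z': 6.0}  # b = 2 * a, upward-sloping line\n",
"}\n",
"perfectly_opposed = {\n",
"    'a': {'x': 1.0, 'y': 2.0, 'z': 3.0},\n",
"    'b': {'x': 6.0, 'y': 4.0, 'z': 2.0}  # b = 8 - 2 * a, downward-sloping line\n",
"}\n",
"\n",
"print(pearson_similarity(perfectly_aligned, 'a', 'b'))  # expected: 1.0\n",
"print(pearson_similarity(perfectly_opposed, 'a', 'b'))  # expected: -1.0"
]
}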
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}