davified/PanamaPapers.ipynb

## PanamaPapers.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Understanding the Panama Papers using pandas and plotly\n",
    "\n",
    "In this notebook, I use the pandas and plotly libraries to clean, wrangle and visualise the Panama Papers dataset. It is in part to help me learn python, pandas and plotly, and in part to help me satisfy my curiosity about the Panama Papers. Specifically, I hope to answer the following questions: (i) which countries do tax evading companies originate from, (ii) which countries are popular tax havens, and (iii) zooming into Singapore, where do tax evaders live?\n",
    "\n",
    "Everything I learnt about pandas are from two amazing tutorials by Brandon Rhodes (PyCon 2015) https://www.youtube.com/watch?v=5JnMutdy6Fw&list=PLO9pkowc_99YYMFnW9GZkIE93jbip1jZb&index=4 and Jonathan Rocher (SciPy 2015) https://www.youtube.com/watch?v=0CFFTJUZ2dc.\n",
    "\n",
    "\n",
    "In the process, I learnt how to use pandas to:\n",
    "1. Load data\n",
    "2. Clean and format data\n",
    "3. Transform datasets\n",
    "4. Aggregate and summarise data\n",
    "\n",
    "I also learnt how to use the plotly library for data visualisation and how to import and embed HTML files\n",
    "\n",
    "Note:\n",
    "- This is an exercise in pandas (for data cleaning) and plotly (for visualisation). I haven't learnt about data validation at this point, so please take my conclusions with a pinch of salt.\n",
    "- Source: https://offshoreleaks.icij.org/pages/database ICIJ Offshore Leaks Database is licensed under the Open Database License and contents under Creative Commons Attribution-ShareAlike license. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 143,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "plt.style.use('ggplot')\n",
    "import cufflinks as cf\n",
    "import plotly.tools as tls\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from pandas import set_option\n",
    "set_option(\"display.max_rows\", 25)\n",
    "from IPython.core.display import HTML\n",
    "\n",
    "\n",
    "LARGE_FIGSIZE = (16, 12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Describing the data\n",
    "The data is taken from 5 csv files, which contain data on the (i) entities, (ii) intermediaries, (iii) officers, (iv) addresses of officers, and the fifth file appears to be a file which allows Neo4j to map data across csv files using node_id as a unique identifier.\n",
    "\n",
    "More details here: https://offshoreleaks.icij.org/pages/about\n",
    "Note that this contains only a fraction of the Panama Papers leaks from the Panama-based offices of Mossack Fonseca. Those leaks are made up of 2.6 terabytes of data, a total of 11.5 million records, which are not included en masse in the public database. The raw documents have not been published. The Mossack Fonseca leaks include vast amounts of email correspondence, bank account details, identity documents and financial transactions, and not all the names on those files appear in the database. Much of the information is buried in emails, power-of-attorney letters and internal notes of Mossack Fonseca employees and cannot easily be extracted in a systematic manner, according to the  International Consortium of Investigative Journalists (ICIJ).\n",
    "\n",
    ""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 144,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "addresses = pd.read_csv(\"data/Addresses.csv\")\n",
    "all_edges = pd.read_csv(\"data/all_edges.csv\")\n",
    "entities = pd.read_csv(\"data/Entities.csv\")\n",
    "intermediaries = pd.read_csv(\"data/Intermediaries.csv\")\n",
    "officers = pd.read_csv(\"data/Officers.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 145,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>1. entities</th>\n",
       "      <th>2. intermediaries</th>\n",
       "      <th>3. officers</th>\n",
       "      <th>4. addresses</th>\n",
       "      <th>5. all_edges</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>name</td>\n",
       "      <td>name</td>\n",
       "      <td>name</td>\n",
       "      <td>address</td>\n",
       "      <td>node_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>original_name</td>\n",
       "      <td>internal_id</td>\n",
       "      <td>icij_id</td>\n",
       "      <td>icij_id</td>\n",
       "      <td>rel_type</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>former_name</td>\n",
       "      <td>address</td>\n",
       "      <td>valid_until</td>\n",
       "      <td>valid_until</td>\n",
       "      <td>node_2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>jurisdiction</td>\n",
       "      <td>valid_until</td>\n",
       "      <td>country_codes</td>\n",
       "      <td>country_codes</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>jurisdiction_description</td>\n",
       "      <td>country_codes</td>\n",
       "      <td>countries</td>\n",
       "      <td>countries</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>company_type</td>\n",
       "      <td>countries</td>\n",
       "      <td>node_id</td>\n",
       "      <td>node_id</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>address</td>\n",
       "      <td>status</td>\n",
       "      <td>sourceID</td>\n",
       "      <td>sourceID</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>internal_id</td>\n",
       "      <td>node_id</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>incorporation_date</td>\n",
       "      <td>sourceID</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>inactivation_date</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>struck_off_date</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>dorm_date</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>status</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>service_provider</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>ibcRUC</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>country_codes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>countries</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>note</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>valid_until</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>node_id</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 1. entities 2. intermediaries    3. officers   4. addresses  \\\n",
       "0                       name              name           name        address   \n",
       "1              original_name       internal_id        icij_id        icij_id   \n",
       "2                former_name           address    valid_until    valid_until   \n",
       "3               jurisdiction       valid_until  country_codes  country_codes   \n",
       "4   jurisdiction_description     country_codes      countries      countries   \n",
       "5               company_type         countries        node_id        node_id   \n",
       "6                    address            status       sourceID       sourceID   \n",
       "7                internal_id           node_id            NaN            NaN   \n",
       "8         incorporation_date          sourceID            NaN            NaN   \n",
       "9          inactivation_date               NaN            NaN            NaN   \n",
       "10           struck_off_date               NaN            NaN            NaN   \n",
       "11                 dorm_date               NaN            NaN            NaN   \n",
       "12                    status               NaN            NaN            NaN   \n",
       "13          service_provider               NaN            NaN            NaN   \n",
       "14                    ibcRUC               NaN            NaN            NaN   \n",
       "15             country_codes               NaN            NaN            NaN   \n",
       "16                 countries               NaN            NaN            NaN   \n",
       "17                      note               NaN            NaN            NaN   \n",
       "18               valid_until               NaN            NaN            NaN   \n",
       "19                   node_id               NaN            NaN            NaN   \n",
       "\n",
       "   5. all_edges  \n",
       "0        node_1  \n",
       "1      rel_type  \n",
       "2        node_2  \n",
       "3           NaN  \n",
       "4           NaN  \n",
       "5           NaN  \n",
       "6           NaN  \n",
       "7           NaN  \n",
       "8           NaN  \n",
       "9           NaN  \n",
       "10          NaN  \n",
       "11          NaN  \n",
       "12          NaN  \n",
       "13          NaN  \n",
       "14          NaN  \n",
       "15          NaN  \n",
       "16          NaN  \n",
       "17          NaN  \n",
       "18          NaN  \n",
       "19          NaN  "
      ]
     },
     "execution_count": 145,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data = {'1. entities' : pd.Series(entities.columns.values),\n",
    "            '2. intermediaries' : pd.Series(intermediaries.columns.values),\n",
    "            '3. officers': pd.Series(officers.columns.values),\n",
    "            '4. addresses' : pd.Series(addresses.columns.values),\n",
    "            '5. all_edges': pd.Series(all_edges.columns.values)}\n",
    "\n",
    "data_overview = pd.DataFrame(all_data)\n",
    "data_overview.head(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Exploring the entities dataframe\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 146,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(319150, 21)"
      ]
     },
     "execution_count": 146,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "entities.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 147,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 319150 entries, 0 to 319149\n",
      "Data columns (total 21 columns):\n",
      "name                        319146 non-null object\n",
      "original_name               213634 non-null object\n",
      "former_name                 6454 non-null object\n",
      "jurisdiction                319150 non-null object\n",
      "jurisdiction_description    319150 non-null object\n",
      "company_type                103227 non-null object\n",
      "address                     299319 non-null object\n",
      "internal_id                 213634 non-null float64\n",
      "incorporation_date          309736 non-null object\n",
      "inactivation_date           144760 non-null object\n",
      "struck_off_date             157872 non-null object\n",
      "dorm_date                   20207 non-null object\n",
      "status                      310106 non-null object\n",
      "service_provider            319150 non-null object\n",
      "ibcRUC                      255375 non-null object\n",
      "country_codes               318360 non-null object\n",
      "countries                   318360 non-null object\n",
      "note                        8984 non-null object\n",
      "valid_until                 319150 non-null object\n",
      "node_id                     319150 non-null int64\n",
      "sourceID                    319150 non-null object\n",
      "dtypes: float64(1), int64(1), object(19)\n",
      "memory usage: 51.1+ MB\n"
     ]
    }
   ],
   "source": [
    "entities.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 148,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index([u'name', u'original_name', u'former_name', u'jurisdiction',\n",
       "       u'jurisdiction_description', u'company_type', u'address',\n",
       "       u'internal_id', u'incorporation_date', u'inactivation_date',\n",
       "       u'struck_off_date', u'dorm_date', u'status', u'service_provider',\n",
       "       u'ibcRUC', u'country_codes', u'countries', u'note', u'valid_until',\n",
       "       u'node_id', u'sourceID'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 148,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "entities.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 150,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>address</th>\n",
       "      <th>jurisdiction_description</th>\n",
       "      <th>countries</th>\n",
       "      <th>node_id</th>\n",
       "      <th>internal_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ZODIAK LTD</td>\n",
       "      <td>Christabel Corporate Services Limited Christab...</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>Virgin Islands, British;Cyprus</td>\n",
       "      <td>67028</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Zven Ltd.</td>\n",
       "      <td>Alesta Consulting, S.A. INVOICE BY EMAIL ONLY</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>Not identified;Virgin Islands, British</td>\n",
       "      <td>67243</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Anson 11A Ltd.</td>\n",
       "      <td>Express Co Registration Pte Ltd 138 Cecil Stre...</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>Singapore;Virgin Islands, British</td>\n",
       "      <td>67258</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>GIADA LTD</td>\n",
       "      <td>Christabel Corporate Services Limited Christab...</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>Cyprus;Virgin Islands, British</td>\n",
       "      <td>67266</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Scott D. Howard</td>\n",
       "      <td>Scott D. Howard P.O. Box 811 Brunswick ME 0401...</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>United States;Not identified</td>\n",
       "      <td>108050</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              name                                            address  \\\n",
       "0       ZODIAK LTD  Christabel Corporate Services Limited Christab...   \n",
       "1        Zven Ltd.      Alesta Consulting, S.A. INVOICE BY EMAIL ONLY   \n",
       "2   Anson 11A Ltd.  Express Co Registration Pte Ltd 138 Cecil Stre...   \n",
       "3        GIADA LTD  Christabel Corporate Services Limited Christab...   \n",
       "4  Scott D. Howard  Scott D. Howard P.O. Box 811 Brunswick ME 0401...   \n",
       "\n",
       "  jurisdiction_description                               countries  node_id  \\\n",
       "0             Undetermined          Virgin Islands, British;Cyprus    67028   \n",
       "1             Undetermined  Not identified;Virgin Islands, British    67243   \n",
       "2             Undetermined       Singapore;Virgin Islands, British    67258   \n",
       "3             Undetermined          Cyprus;Virgin Islands, British    67266   \n",
       "4             Undetermined            United States;Not identified   108050   \n",
       "\n",
       "   internal_id  \n",
       "0          NaN  \n",
       "1          NaN  \n",
       "2          NaN  \n",
       "3          NaN  \n",
       "4          NaN  "
      ]
     },
     "execution_count": 150,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#from the above, the columns of interest are: name, jurisdiction_description, countries, \n",
    "#node_id and internal_id. The last 3 columns may be useful later on as an index to reference other csv files.\n",
    "\n",
    "entities_subset = entities[[u'name', u'address', u'jurisdiction_description', u'countries', u'node_id',\n",
    "                     u'internal_id']]\n",
    "entities_subset.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 151,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0         False\n",
       "1         False\n",
       "2         False\n",
       "3         False\n",
       "4         False\n",
       "5          True\n",
       "6         False\n",
       "7         False\n",
       "8         False\n",
       "9          True\n",
       "10        False\n",
       "11        False\n",
       "          ...  \n",
       "319138    False\n",
       "319139    False\n",
       "319140    False\n",
       "319141    False\n",
       "319142    False\n",
       "319143    False\n",
       "319144    False\n",
       "319145    False\n",
       "319146    False\n",
       "319147    False\n",
       "319148    False\n",
       "319149    False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 151,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#I'm not sure what jurisdiction and countries mean - which is the source and which is the destination? Let's find\n",
    "#out if they are identical.\n",
    "\n",
    "entities.jurisdiction == entities.country_codes\n",
    "\n",
    "# The output shows that they are not the same. Following some exploratory data analyses (which I've excluded from\n",
    "#here for brevity), I found out that (i) jurisdiction refer to the tax havens, and (ii) the distinction between \n",
    "#source country and tax havens is not so clear cut. For instance, based on the second chart below, you'll see that\n",
    "#several thousands of companies investing into the British Virgin Islands are from British Virgin islands."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                       Country1                 Country2 Country3 Country4\n",
      "0                        Cyprus  Virgin Islands, British      NaN      NaN\n",
      "1       Virgin Islands, British           Not identified      NaN      NaN\n",
      "2       Virgin Islands, British                Singapore      NaN      NaN\n",
      "3       Virgin Islands, British                   Cyprus      NaN      NaN\n",
      "4                Not identified            United States      NaN      NaN\n",
      "5                Not identified                      NaN      NaN      NaN\n",
      "6       Virgin Islands, British       Russian Federation      NaN      NaN\n",
      "7                   Netherlands           Not identified      NaN      NaN\n",
      "...                         ...                      ...      ...      ...\n",
      "319142                  Andorra                      NaN      NaN      NaN\n",
      "319143              Isle of Man                      NaN      NaN      NaN\n",
      "319144                Hong Kong                      NaN      NaN      NaN\n",
      "319145                Hong Kong                      NaN      NaN      NaN\n",
      "319146                Hong Kong                      NaN      NaN      NaN\n",
      "319147                Hong Kong                      NaN      NaN      NaN\n",
      "319148     United Arab Emirates                      NaN      NaN      NaN\n",
      "319149     United Arab Emirates                      NaN      NaN      NaN\n",
      "\n",
      "[318360 rows x 4 columns]\n"
     ]
    }
   ],
   "source": [
    "#The country_codes and countries column sometimes contain more than 1 element. This may lead to undercounting. Let's\n",
    "#fix this.\n",
    "\n",
    "split = lambda x: pd.Series([i for i in reversed(x.split(';'))])\n",
    "countries_split = entities_subset['countries'].dropna().apply(split)\n",
    "\n",
    "countries_split.rename(columns={0:'Country1',1:'Country2',2:'Country3', 3: 'Country4'},inplace=True)\n",
    "countries_split = countries_split[['Country1', 'Country2', 'Country3', 'Country4']]\n",
    "print countries_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 154,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Virgin Islands, British    4\n",
      "Cayman Islands             3\n",
      "Singapore                  2\n",
      "Hong Kong                  2\n",
      "Not identified             2\n",
      "Seychelles                 1\n",
      "United Kingdom             1\n",
      "Curaçao                    1\n",
      "Name: Country3, dtype: int64\n",
      "Virgin Islands, British    1\n",
      "Name: Country4, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print countries_split.Country3.value_counts()\n",
    "print countries_split.Country4.value_counts()\n",
    "\n",
    "# For the sake of simplicity, I am going to drop Country3 and Country4, since countries only appear for a total of\n",
    "#20 times (less than 0.00001% of our dataset) in these 2 columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 153,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Dropping two columns and merging this back to the main dataframe, and then dropping the country_codes column\n",
    "countries_split.drop(['Country3', 'Country4'], axis = 1)\n",
    "entities_subset[['Country1', 'Country2']] = countries_split[['Country1', 'Country2']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 142,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>address</th>\n",
       "      <th>jurisdiction_description</th>\n",
       "      <th>node_id</th>\n",
       "      <th>internal_id</th>\n",
       "      <th>Country2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ZODIAK LTD</td>\n",
       "      <td>Christabel Corporate Services Limited Christab...</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>67028</td>\n",
       "      <td>NaN</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Zven Ltd.</td>\n",
       "      <td>Alesta Consulting, S.A. INVOICE BY EMAIL ONLY</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>67243</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Not identified</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Anson 11A Ltd.</td>\n",
       "      <td>Express Co Registration Pte Ltd 138 Cecil Stre...</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>67258</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Singapore</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>GIADA LTD</td>\n",
       "      <td>Christabel Corporate Services Limited Christab...</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>67266</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Cyprus</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Scott D. Howard</td>\n",
       "      <td>Scott D. Howard P.O. Box 811 Brunswick ME 0401...</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>108050</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United States</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              name                                            address  \\\n",
       "0       ZODIAK LTD  Christabel Corporate Services Limited Christab...   \n",
       "1        Zven Ltd.      Alesta Consulting, S.A. INVOICE BY EMAIL ONLY   \n",
       "2   Anson 11A Ltd.  Express Co Registration Pte Ltd 138 Cecil Stre...   \n",
       "3        GIADA LTD  Christabel Corporate Services Limited Christab...   \n",
       "4  Scott D. Howard  Scott D. Howard P.O. Box 811 Brunswick ME 0401...   \n",
       "\n",
       "  jurisdiction_description  node_id  internal_id                Country2  \n",
       "0             Undetermined    67028          NaN  British Virgin Islands  \n",
       "1             Undetermined    67243          NaN          Not identified  \n",
       "2             Undetermined    67258          NaN               Singapore  \n",
       "3             Undetermined    67266          NaN                  Cyprus  \n",
       "4             Undetermined   108050          NaN           United States  "
      ]
     },
     "execution_count": 142,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "entities_subset = entities_subset.drop('countries', axis = 1)\n",
    "entities_subset.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>address</th>\n",
       "      <th>jurisdiction_description</th>\n",
       "      <th>node_id</th>\n",
       "      <th>internal_id</th>\n",
       "      <th>Country1</th>\n",
       "      <th>Country2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ZODIAK LTD</td>\n",
       "      <td>Christabel Corporate Services Limited Christab...</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>67028</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Cyprus</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Zven Ltd.</td>\n",
       "      <td>Alesta Consulting, S.A. INVOICE BY EMAIL ONLY</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>67243</td>\n",
       "      <td>NaN</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "      <td>Not identified</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Anson 11A Ltd.</td>\n",
       "      <td>Express Co Registration Pte Ltd 138 Cecil Stre...</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>67258</td>\n",
       "      <td>NaN</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "      <td>Singapore</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>GIADA LTD</td>\n",
       "      <td>Christabel Corporate Services Limited Christab...</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>67266</td>\n",
       "      <td>NaN</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "      <td>Cyprus</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Scott D. Howard</td>\n",
       "      <td>Scott D. Howard P.O. Box 811 Brunswick ME 0401...</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>108050</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Not identified</td>\n",
       "      <td>United States</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              name                                            address  \\\n",
       "0       ZODIAK LTD  Christabel Corporate Services Limited Christab...   \n",
       "1        Zven Ltd.      Alesta Consulting, S.A. INVOICE BY EMAIL ONLY   \n",
       "2   Anson 11A Ltd.  Express Co Registration Pte Ltd 138 Cecil Stre...   \n",
       "3        GIADA LTD  Christabel Corporate Services Limited Christab...   \n",
       "4  Scott D. Howard  Scott D. Howard P.O. Box 811 Brunswick ME 0401...   \n",
       "\n",
       "  jurisdiction_description  node_id  internal_id                Country1  \\\n",
       "0             Undetermined    67028          NaN                  Cyprus   \n",
       "1             Undetermined    67243          NaN  British Virgin Islands   \n",
       "2             Undetermined    67258          NaN  British Virgin Islands   \n",
       "3             Undetermined    67266          NaN  British Virgin Islands   \n",
       "4             Undetermined   108050          NaN          Not identified   \n",
       "\n",
       "                 Country2  \n",
       "0  British Virgin Islands  \n",
       "1          Not identified  \n",
       "2               Singapore  \n",
       "3                  Cyprus  \n",
       "4           United States  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Let's move on to harmonise the references to British Virgin Islands (in country1 they are referred to as\n",
    "#Virgin Islands, British).\n",
    "\n",
    "entities_subset = entities_subset.replace(to_replace='Virgin Islands, British', value='British Virgin Islands')\n",
    "entities_subset.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The data is cleaner now, so let's prepare it for visualisation!\n",
    "\n",
    "Based on what's available in the entities_subset dataframe, I think we can visualise the following: (i) top countries with companies found in the Panama Papers, (ii) breakdown of these top countries by destination countries, (iii) the breakdown in source and destination countries (using a heatmap), and (iv) zooming into Singapore, we can use the address column to visualise where these companies are located in Singapore (I will enlist the help of Google Maps for this last visualisation)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Top 20 countries with companies found in the Panama Papers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "British Virgin Islands    69087.0\n",
       "Hong Kong                 51293.0\n",
       "Switzerland               38077.0\n",
       "Not identified            25698.0\n",
       "Panama                    18122.0\n",
       "United Kingdom            17972.0\n",
       "Jersey                    14562.0\n",
       "Russian Federation        11516.0\n",
       "                           ...   \n",
       "Cyprus                     6374.0\n",
       "United States              6254.0\n",
       "Singapore                  5867.0\n",
       "Bahamas                    5021.0\n",
       "Uruguay                    4906.0\n",
       "Isle of Man                4893.0\n",
       "China                      4188.0\n",
       "Monaco                     3168.0\n",
       "Name: total, dtype: float64"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#First, let's build a dataframe for visualisation\n",
    "\n",
    "source_country = entities_subset[['Country1', 'Country2']]\n",
    "rank = source_country.apply(pd.Series.value_counts).fillna(0)\n",
    "rank['total'] = rank.sum(axis = 1)\n",
    "rank = rank.sort_values(by = 'total', ascending = False).total\n",
    "\n",
    "top_20 = rank[:20]\n",
    "top_20"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\" seamless=\"seamless\" src=\"https://plot.ly/~davified/18.embed\" height=\"587px\" width=\"1116px\"></iframe>"
      ],
      "text/plain": [
       "<plotly.tools.PlotlyDisplay object>"
      ]
     },
     "execution_count": 124,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Let's visualise the data! I'll the plotly cufflinks library (why not? it's interactive! try clicking on the chart!)\n",
    "\n",
    "top_20.iplot(kind ='bar', yTitle = 'Number of Companies', title = 'Top 20 Countries (Source) Listed in Panama Papers')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Breakdown of top countries by destination countries (i.e. jurisdictions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning:\n",
      "\n",
      "\n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Country1</th>\n",
       "      <th>Country2</th>\n",
       "      <th>jurisdiction</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Cyprus</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "      <td>Undetermined</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>British Virgin Islands</td>\n",
       "      <td>Not identified</td>\n",
       "      <td>Undetermined</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>British Virgin Islands</td>\n",
       "      <td>Singapore</td>\n",
       "      <td>Undetermined</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>British Virgin Islands</td>\n",
       "      <td>Cyprus</td>\n",
       "      <td>Undetermined</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Not identified</td>\n",
       "      <td>United States</td>\n",
       "      <td>Undetermined</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Not identified</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Undetermined</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>British Virgin Islands</td>\n",
       "      <td>Russian Federation</td>\n",
       "      <td>Undetermined</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Netherlands</td>\n",
       "      <td>Not identified</td>\n",
       "      <td>Undetermined</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>319142</th>\n",
       "      <td>Andorra</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Hong Kong</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>319143</th>\n",
       "      <td>Isle of Man</td>\n",
       "      <td>NaN</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>319144</th>\n",
       "      <td>Hong Kong</td>\n",
       "      <td>NaN</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>319145</th>\n",
       "      <td>Hong Kong</td>\n",
       "      <td>NaN</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>319146</th>\n",
       "      <td>Hong Kong</td>\n",
       "      <td>NaN</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>319147</th>\n",
       "      <td>Hong Kong</td>\n",
       "      <td>NaN</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>319148</th>\n",
       "      <td>United Arab Emirates</td>\n",
       "      <td>NaN</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>319149</th>\n",
       "      <td>United Arab Emirates</td>\n",
       "      <td>NaN</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>319150 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                      Country1                Country2            jurisdiction\n",
       "0                       Cyprus  British Virgin Islands            Undetermined\n",
       "1       British Virgin Islands          Not identified            Undetermined\n",
       "2       British Virgin Islands               Singapore            Undetermined\n",
       "3       British Virgin Islands                  Cyprus            Undetermined\n",
       "4               Not identified           United States            Undetermined\n",
       "5               Not identified                     NaN            Undetermined\n",
       "6       British Virgin Islands      Russian Federation            Undetermined\n",
       "7                  Netherlands          Not identified            Undetermined\n",
       "...                        ...                     ...                     ...\n",
       "319142                 Andorra                     NaN               Hong Kong\n",
       "319143             Isle of Man                     NaN  British Virgin Islands\n",
       "319144               Hong Kong                     NaN  British Virgin Islands\n",
       "319145               Hong Kong                     NaN  British Virgin Islands\n",
       "319146               Hong Kong                     NaN  British Virgin Islands\n",
       "319147               Hong Kong                     NaN  British Virgin Islands\n",
       "319148    United Arab Emirates                     NaN  British Virgin Islands\n",
       "319149    United Arab Emirates                     NaN  British Virgin Islands\n",
       "\n",
       "[319150 rows x 3 columns]"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#First, let's structure our data into a 2-level dataframe, so that it can be used to create a stacked bar chart.\n",
    "#Caveat - This section is slightly more messy and I wish there were less trial and error in my approach!\n",
    "\n",
    "source_country['jurisdiction'] = entities_subset['jurisdiction_description']\n",
    "source_country"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Country1</th>\n",
       "      <th>jurisdiction</th>\n",
       "      <th>total_companies</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Albania</td>\n",
       "      <td>Bahamas</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>American Samoa</td>\n",
       "      <td>United Kingdom</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Andorra</td>\n",
       "      <td>Bahamas</td>\n",
       "      <td>26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Andorra</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "      <td>39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Andorra</td>\n",
       "      <td>Costa Rica</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Andorra</td>\n",
       "      <td>Hong Kong</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Andorra</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Andorra</td>\n",
       "      <td>Niue</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>947</th>\n",
       "      <td>Virgin Islands, U.S.</td>\n",
       "      <td>Bahamas</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>948</th>\n",
       "      <td>Virgin Islands, U.S.</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>949</th>\n",
       "      <td>Virgin Islands, U.S.</td>\n",
       "      <td>Panama</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>950</th>\n",
       "      <td>Virgin Islands, U.S.</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>951</th>\n",
       "      <td>Yemen</td>\n",
       "      <td>Panama</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>952</th>\n",
       "      <td>Zambia</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>953</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>954</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>Panama</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>955 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                 Country1            jurisdiction  total_companies\n",
       "0                 Albania                 Bahamas                2\n",
       "1          American Samoa          United Kingdom                1\n",
       "2                 Andorra                 Bahamas               26\n",
       "3                 Andorra  British Virgin Islands               39\n",
       "4                 Andorra              Costa Rica                3\n",
       "5                 Andorra               Hong Kong                9\n",
       "6                 Andorra                  Nevada               24\n",
       "7                 Andorra                    Niue               12\n",
       "..                    ...                     ...              ...\n",
       "947  Virgin Islands, U.S.                 Bahamas                2\n",
       "948  Virgin Islands, U.S.  British Virgin Islands                1\n",
       "949  Virgin Islands, U.S.                  Panama                1\n",
       "950  Virgin Islands, U.S.            Undetermined                7\n",
       "951                 Yemen                  Panama                1\n",
       "952                Zambia  British Virgin Islands                2\n",
       "953              Zimbabwe  British Virgin Islands                5\n",
       "954              Zimbabwe                  Panama                3\n",
       "\n",
       "[955 rows x 3 columns]"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "source_country_2_level_counts = source_country.groupby(['Country1', 'jurisdiction']).size().reset_index()\n",
    "source_country_2_level_counts.rename(columns={0:'total_companies'},inplace=True)\n",
    "source_country_2_level_counts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 159,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Country1</th>\n",
       "      <th>jurisdiction</th>\n",
       "      <th>total_companies</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>117</th>\n",
       "      <td>British Virgin Islands</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "      <td>37843</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>358</th>\n",
       "      <td>Hong Kong</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "      <td>27300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>805</th>\n",
       "      <td>Switzerland</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "      <td>19650</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>634</th>\n",
       "      <td>Not identified</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>18193</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>124</th>\n",
       "      <td>British Virgin Islands</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>14817</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>811</th>\n",
       "      <td>Switzerland</td>\n",
       "      <td>Panama</td>\n",
       "      <td>11590</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>448</th>\n",
       "      <td>Jersey</td>\n",
       "      <td>British Virgin Islands</td>\n",
       "      <td>10467</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>735</th>\n",
       "      <td>Samoa</td>\n",
       "      <td>Samoa</td>\n",
       "      <td>7677</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>370</th>\n",
       "      <td>Hong Kong</td>\n",
       "      <td>Undetermined</td>\n",
       "      <td>6110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>651</th>\n",
       "      <td>Panama</td>\n",
       "      <td>Panama</td>\n",
       "      <td>6090</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   Country1            jurisdiction  total_companies\n",
       "117  British Virgin Islands  British Virgin Islands            37843\n",
       "358               Hong Kong  British Virgin Islands            27300\n",
       "805             Switzerland  British Virgin Islands            19650\n",
       "634          Not identified            Undetermined            18193\n",
       "124  British Virgin Islands            Undetermined            14817\n",
       "811             Switzerland                  Panama            11590\n",
       "448                  Jersey  British Virgin Islands            10467\n",
       "735                   Samoa                   Samoa             7677\n",
       "370               Hong Kong            Undetermined             6110\n",
       "651                  Panama                  Panama             6090"
      ]
     },
     "execution_count": 159,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "source_country_2_level_counts_sorted = source_country_2_level_counts.sort_values(by=['total_companies'], ascending = False)\n",
    "source_country_2_level_counts_sorted.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#In the interest of time, I will export the data for visualisation in plotly. \n",
    "source_country_2_level_counts_sorted.to_csv(\"data/country_jurisdiction_sorted.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 156,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\" seamless=\"seamless\" src=\"https://plot.ly/~davified/17.embed\" height=\"400\" width=\"100%\"></iframe>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 156,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#I've visualised the data in plotly (it was a breeze!), and it even allows me to embed an interactive chart in\n",
    "#Jupyter notebook! WOW!)\n",
    "HTML('<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\" seamless=\"seamless\" src=\"https://plot.ly/~davified/17.embed\" height=\"400\" width=\"100%\"></iframe>')\n",
    "\n",
    "#Note: You can zoom in by dragging and selecting the section of interest. Double click to zoom out"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. A heatmap representing the number of companies found in each country, broken down by destination jurisdictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 133,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe src=\"https://docs.google.com/spreadsheets/d/1kqYZrG7GPBPDWsJWBWlBHBV7DCt9kFaWNi83VkVOZac/pubhtml?gid=1600001876&amp;single=true&amp;widget=true&amp;headers=false\" width=1000 height=500></iframe>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 133,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#I exported the source_country_2_level_counts_sorted dataframe to create a heatmap on Google Sheets.\n",
    "\n",
    "HTML('<iframe src=\"https://docs.google.com/spreadsheets/d/1kqYZrG7GPBPDWsJWBWlBHBV7DCt9kFaWNi83VkVOZac/pubhtml?gid=1600001876&amp;single=true&amp;widget=true&amp;headers=false\" width=1000 height=500></iframe>')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Visualising the locations of the Singapore-based companies found in the Panama Papers (with the help of Google Maps)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 139,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe src=\"https://www.google.com/maps/d/embed?mid=1nrJFPuC2TWjS3vVHQbPCnU2jmFQ\" width=\"1000\" height=\"500\"></iframe>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 139,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "HTML('<iframe src=\"https://www.google.com/maps/d/embed?mid=1nrJFPuC2TWjS3vVHQbPCnU2jmFQ\" width=\"1000\" height=\"500\"></iframe>')\n",
    "\n",
    "#Hover over the placemarkers to see details of the company"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "###About Python\n",
    "\n",
    "1. Pandas is really powerful and fast\n",
    "2. Plotly is really intuitive and it syncs well with Python and Jupyter Notebook as well\n",
    "\n",
    "###About the Panama Papers\n",
    "\n",
    "Based on the exploratory data analyses above, I've found out that:\n",
    "1. The top 5 source countries of such shady dealings are: British Virgin Islands, Hong Kong, Switzerland, \"Not Identified\" and Panama. I guess the lawyers and accountants have learnt to hide their trail well. Apart from the clear suspects (BVI, Panama), there is a big chunk which is \"Not Identified\" - and this makes it quite hard to conclusively identify tax havens.\n",
    "2. Even though Singapore is often touted as a possible tax haven, the figures show that the degree to which it houses companies related to the Panama Papers is much lesser as compared to Hong Kong, Switzerland, British Virgin islands. That said, as a source country, it is ranked 15th (5,867 companies), right after the United States (6,254 companies). As a jurisdiction, it is ranked 11th (668 companies), right after Nevada (1,260 companies) and Hong Kong (1,331).\n",
    "\n",
    "Do explore the charts above yourself and let me know what else you can conclude from them :-)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}