{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Prepping data for analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Often, the hardest part of analyzing or visualizing data you've gathered is getting it into a format that your analysis or visualization tool can import. Below are a few general tips for preparing data for analysis, and then specific options you have for pulling data from a database and then converting it into a format that other tools can use to import."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## General tips"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- if you are exporting, then converting, consider not outputting more data than you need for the analysis you are trying to do. If you are planning on doing some exploratory analysis, however, might be better to just bite the bullet and output everything at once.\n",
"- consider not including text fields unless your analysis requires it - CSV and tab-delimited files are much easier to make and parse if there are no text columns in them. Text columns can include the delimter for a file, or they can include new lines and other white space characters. These characters can be worked around, but they might require more tweaking both in the conversion and in the import.\n",
"- test export and import first with a small sample file before you try it on big data. Even if a small sample file works, there is still the risk that there are problem rows further on in the data, but at least you'll be likely to have knocked out common problems in your sample testing."
]
},
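{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a minimal sketch of the \"test with a small sample first\" tip, assuming you already have a pandas data frame named `df` (the file names here are arbitrary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# sketch: export a small sample first, then the full file.\n",
"# assumes a pandas data frame named df already exists.\n",
"sample_df = df.head( 100 )\n",
"\n",
"# write the sample, then try importing it into your target tool.\n",
"sample_df.to_csv( \"sample_test.csv\", encoding = \"utf-8\" )\n",
"\n",
"# once the sample round-trips cleanly, export everything.\n",
"#df.to_csv( \"full_export.csv\", encoding = \"utf-8\" )"
]
},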
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One good way to get data out of a database and prepped for analysis is to use pandas to read either a table or the results of a query from your database into a pandas data frame, then use pandas to translate to other formats.\n",
"\n",
"Before you start, pandas uses sqlalchemy for its database interactions, so you'll need to install python package sqlalchemy.\n",
"\n",
" conda install sqlalchemy (should already be installed).\n",
"\n",
"And, if you are reading from a database, in this Notebook, `cd` to the directory where the database file resides:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/Users/jonathanmorgan/Documents/work/twitter_test/research\n"
]
}
],
"source": [
"%cd /Users/jonathanmorgan/Documents/work/twitter_test/research"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### pandas example: reading from database to data frame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"pandas lets you either pass a table name in to pandas.read_sql_table() (which it reads in its entirety into a data frame, in memory in your computer):\n",
"\n",
" tweets_table_df = pandas.read_sql_table( table_name, sqlalchemy_sqlite3_engine )\n",
" \n",
"OR an SQL query in to:\n",
"\n",
" tweets_query_df = pandas.read_sql_query( tweet_query, sqlalchemy_sqlite3_engine )\n",
" \n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# imports\n",
"import pandas\n",
"import sqlalchemy\n",
"\n",
"# declare variables\n",
"sqlite_database_URL = 'sqlite:///db.sqlite3'\n",
"sqlalchemy_sqlite3_engine = None\n",
"table_name = \"\"\n",
"tweets_table_df = None\n",
"tweet_query = \"\"\n",
"tweets_query_df = None\n",
"\n",
"# first, initialize an sqlalchemy sqlite3 engine, referencing a new \n",
"# sqlite3 database tweet_sample.sqlite in the current directory.\n",
"# - more details: http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html#sqlite\n",
"\n",
"# sqlite://<nohostname>/<path>\n",
"# where <path> is relative:\n",
"sqlalchemy_sqlite3_engine = sqlalchemy.create_engine( sqlite_database_URL )\n",
"\n",
"# first, try reading the entire \"tweetnet_tweet\" table into memory.\n",
"table_name = \"tweetnet_tweet\"\n",
"tweets_table_df = pandas.read_sql_table( table_name, sqlalchemy_sqlite3_engine )\n",
"print( tweets_table_df )\n",
"\n",
"# then try executing a query\n",
"tweet_query = '''\n",
"SELECT id, twitter_tweet_id, tweet_text, tweet_language, twitter_user_twitter_id, tweet_retweet_count, tweet_user_mention_count, tweet_hashtag_mention_count, tweet_url_count\n",
"FROM tweetnet_tweet\n",
"WHERE tweet_language = 'en'\n",
"LIMIT 100;'''\n",
"tweets_query_df = pandas.read_sql_query( tweet_query, sqlalchemy_sqlite3_engine )\n",
"print( tweets_query_df )"
]
},
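{
"cell_type": "markdown",
"metadata": {},
"source": [
"If a table is too big to comfortably fit in memory, `read_sql_query()` also accepts a `chunksize` parameter, which returns an iterator of data frames instead of one big one. A minimal sketch, reusing the engine and query from the cell above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# sketch: read the query results 1000 rows at a time, so the\n",
"# whole result set never has to fit in memory at once.\n",
"for chunk_df in pandas.read_sql_query( tweet_query, sqlalchemy_sqlite3_engine, chunksize = 1000 ):\n",
"\n",
"    # each chunk_df is a pandas data frame - process it, then move on.\n",
"    print( len( chunk_df ) )"
]
},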
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More information:\n",
"- pandas I/O - SQL Queries: [http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#sql-queries](http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#sql-queries)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### pandas example: writing to CSV file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the dataframes loaded above, we'll now try outputting them to different data formats. First up is CSV (Comma-Separated Values).\n",
"\n",
"To write a data frame to a CSV file, you use the pandas function `to_csv()`. `to_csv()` has only one required argument, its first, which is the path or StringIO buffer to which you want to write the CSV file. There are many other options you can pass as parameters as well:\n",
"\n",
"- [http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#writing-to-csv-format](http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#writing-to-csv-format)\n",
"\n",
"When dealing with emoji or languages other than English, one parameter you may need to pass is the `encoding` parameter, to tell it what encoding you want other than the Python 2 default of ASCII, so that the special characters don't blow up the output. A common encoding for text taht includes emoji, etc. is \"utf-8\".\n",
"\n",
"You also will likely have to work with the following parameters to deal with new line characters being in the middle of text:\n",
"\n",
"- __line_terminator__: Character sequence denoting line end (default ‘\\n’)\n",
"- __quoting__: Set quoting rules as in csv module (default csv.QUOTE_MINIMAL)\n",
"\n",
" Options (from [https://docs.python.org/2/library/csv.html](https://docs.python.org/2/library/csv.html)):\n",
" \n",
" - _csv.QUOTE_ALL_ - Instructs writer objects to quote all fields.\n",
" - _csv.QUOTE_MINIMAL_ - Instructs writer objects to only quote those fields which contain special characters such as delimiter, quotechar or any of the characters in lineterminator.\n",
" - _csv.QUOTE_NONNUMERIC_ - Instructs writer objects to quote all non-numeric fields. Instructs the reader to convert all non-quoted fields to type float.\n",
" - _csv.QUOTE_NONE_ - Instructs writer objects to never quote fields. When the current delimiter occurs in output data it is preceded by the current escapechar character. If escapechar is not set, the writer will raise Error if any characters that require escaping are encountered. Instructs reader to perform no special processing of quote characters.\n",
"\n",
"- __quotechar__: Character used to quote fields (default ‘”’)\n",
"- __doublequote__: Control quoting of quotechar in fields (default True)\n",
"- __escapechar__: Character used to escape sep and quotechar when appropriate (default None)\n",
"\n",
"These parameter values could have to change per operating system (Windows has different line ending characters than macs and linux, so you'd have to change it from the default if you wanted to escape out newlines for use on a Windows machine) and per program, depending on how well a given program deals with CSV.\n",
"\n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# imports\n",
"import pandas\n",
"import sqlalchemy\n",
"\n",
"# declare variables\n",
"sqlite_database_URL = 'sqlite:///db.sqlite3'\n",
"sqlalchemy_sqlite3_engine = None\n",
"table_name = \"\"\n",
"tweets_table_df = None\n",
"tweet_query = \"\"\n",
"tweets_query_df = None\n",
"\n",
"# first, initialize an sqlalchemy sqlite3 engine, referencing a new \n",
"# sqlite3 database tweet_sample.sqlite in the current directory.\n",
"# - more details: http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html#sqlite\n",
"\n",
"# sqlite://<nohostname>/<path>\n",
"# where <path> is relative:\n",
"sqlalchemy_sqlite3_engine = sqlalchemy.create_engine( sqlite_database_URL )\n",
"\n",
"# first, try reading the entire \"tweetnet_tweet\" table into memory.\n",
"table_name = \"tweetnet_tweet\"\n",
"tweets_table_df = pandas.read_sql_table( table_name, sqlalchemy_sqlite3_engine )\n",
"#print( tweets_table_df )\n",
"\n",
"# then try outputting to a CSV file\n",
"tweets_table_df.to_csv( \"tweets_table.csv\", encoding = \"utf-8\" )"
]
},
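{
"cell_type": "markdown",
"metadata": {},
"source": [
"And a minimal sketch of passing the quoting-related parameters described above, assuming the `tweets_table_df` data frame from the previous cell (which combination works best depends on the program doing the importing):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# imports\n",
"import csv\n",
"\n",
"# sketch: quote every field so that embedded commas and newlines\n",
"# stay safely inside quotes (assumes tweets_table_df from above).\n",
"tweets_table_df.to_csv( \"tweets_table_quoted.csv\", encoding = \"utf-8\", quoting = csv.QUOTE_ALL )"
]
},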
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More information:\n",
"\n",
"- pandas doc - writing to CSV file: [http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#writing-to-csv-format](http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#writing-to-csv-format)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### pandas example: writing to tab-delimited file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tab-delimited files are sometimes more robust than commad-delimited files (data is less likely in general to contain tabs than commas).\n",
"\n",
"Writing a data frame to a tab-delimited file is very similar to writing to a CSV file. You use the pandas function `to_csv()`, but you have to specify at least one more parameter - __`sep`__, the variable that holds the separator between column values for a row. The rules for `to_csv()` are the same: there is one required argument, its first, which is the path or StringIO buffer to which you want to write the CSV file.\n",
"\n",
"There are many other options you can pass as parameters as well:\n",
"\n",
"- [http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#writing-to-csv-format](http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#writing-to-csv-format)\n",
"\n",
"You will again potentially have to work with the following parameters to deal with problems like incorrect encoding or new line characters being in the middle of text and causing the parser to get thrown off:\n",
"\n",
"- __encoding__: a string representing the encoding to use if the contents are non-ASCII, for python versions prior to 3. A common encoding for text that can handle emoji, non-English alphabets, etc. is \"utf-8\".\n",
"- __line_terminator__: Character sequence denoting line end (default ‘\\n’)\n",
"- __quoting__: Set quoting rules as in csv module (default csv.QUOTE_MINIMAL)\n",
"\n",
" Options (from [https://docs.python.org/2/library/csv.html](https://docs.python.org/2/library/csv.html)):\n",
" \n",
" - _csv.QUOTE_ALL_ - Instructs writer objects to quote all fields.\n",
" - _csv.QUOTE_MINIMAL_ - Instructs writer objects to only quote those fields which contain special characters such as delimiter, quotechar or any of the characters in lineterminator.\n",
" - _csv.QUOTE_NONNUMERIC_ - Instructs writer objects to quote all non-numeric fields. Instructs the reader to convert all non-quoted fields to type float.\n",
" - _csv.QUOTE_NONE_ - Instructs writer objects to never quote fields. When the current delimiter occurs in output data it is preceded by the current escapechar character. If escapechar is not set, the writer will raise Error if any characters that require escaping are encountered. Instructs reader to perform no special processing of quote characters.\n",
"\n",
"- __quotechar__: Character used to quote fields (default ‘”’)\n",
"- __doublequote__: Control quoting of quotechar in fields (default True)\n",
"- __escapechar__: Character used to escape sep and quotechar when appropriate (default None)\n",
"\n",
"These parameter values could change per operating system (Windows has different line ending characters than macs and linux, so you'd have to change it from the default if you wanted to escape out newlines for use on a Windows machine) and per program, depending on how well a given program deals with CSV.\n",
"\n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# imports\n",
"import pandas\n",
"import sqlalchemy\n",
"\n",
"# declare variables\n",
"sqlite_database_URL = 'sqlite:///db.sqlite3'\n",
"sqlalchemy_sqlite3_engine = None\n",
"table_name = \"\"\n",
"tweets_table_df = None\n",
"tweet_query = \"\"\n",
"tweets_query_df = None\n",
"\n",
"# first, initialize an sqlalchemy sqlite3 engine, referencing a new \n",
"# sqlite3 database tweet_sample.sqlite in the current directory.\n",
"# - more details: http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html#sqlite\n",
"\n",
"# sqlite://<nohostname>/<path>\n",
"# where <path> is relative:\n",
"sqlalchemy_sqlite3_engine = sqlalchemy.create_engine( sqlite_database_URL )\n",
"\n",
"# first, try reading the entire \"tweetnet_tweet\" table into memory.\n",
"table_name = \"tweetnet_tweet\"\n",
"tweets_table_df = pandas.read_sql_table( table_name, sqlalchemy_sqlite3_engine )\n",
"#print( tweets_table_df )\n",
"\n",
"# then try outputting to a tab-delimited file\n",
"tweets_table_df.to_csv( \"tweets_table.txt\", sep = \"\\t\", encoding = \"utf-8\" )"
]
},
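{
"cell_type": "markdown",
"metadata": {},
"source": [
"To sanity-check the output, you can read the tab-delimited file back in with `read_csv()` and compare row counts. A minimal sketch, using the file written in the cell above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# sketch: read the tab-delimited file back in and compare row counts -\n",
"# a quick check that embedded tabs or newlines didn't break parsing.\n",
"# index_col = 0 because to_csv() wrote the index as the first column.\n",
"check_df = pandas.read_csv( \"tweets_table.txt\", sep = \"\\t\", encoding = \"utf-8\", index_col = 0 )\n",
"print( len( check_df ) == len( tweets_table_df ) )"
]
},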
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### pandas example: writing to JSON file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"JSON is text-based format for capturing and sharing structured data. For more details, see [http://nbviewer.ipython.org/gist/jonathanmorgan/e98b308aaf3e25b55d03#JSON](http://nbviewer.ipython.org/gist/jonathanmorgan/e98b308aaf3e25b55d03#JSON). Writing data to JSON makes it more readily usable with javascript libraries like D3.js.\n",
"\n",
"When you write data to JSON, the main parameters that you'll need to worry about:\n",
"\n",
"- __`orient`__ - The format of the JSON string. You'll probably want \"records\" if you'll be using your data with something like d3.js. Options:\n",
"\n",
" - split \tdict like {index -> [index], columns -> [columns], data -> [values]}\n",
" - records \tlist like [{column -> value}, ... , {column -> value}]\n",
" - index \tdict like {index -> {column -> value}}\n",
" - columns \tdict like {column -> {index -> value}}\n",
" - values \tjust the values array\n",
"\n",
"- __`force_ascii`__ - force encoded string to be ASCII, default True. For emoji/non-standard characters, probably want false.\n",
"\n",
"There are a few other options you can pass as parameters as well:\n",
"\n",
"- [http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#json](http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#json)\n",
"\n",
"You will again potentially have to work with the parameters to deal with problems like incorrect encoding or new line characters being in the middle of text and causing the parser to get thrown off.\n",
"\n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# imports\n",
"import pandas\n",
"import sqlalchemy\n",
"\n",
"# declare variables\n",
"sqlite_database_URL = 'sqlite:///db.sqlite3'\n",
"sqlalchemy_sqlite3_engine = None\n",
"table_name = \"\"\n",
"tweets_table_df = None\n",
"tweet_query = \"\"\n",
"tweets_query_df = None\n",
"tweets_json = \"\"\n",
"\n",
"# first, initialize an sqlalchemy sqlite3 engine, referencing a new \n",
"# sqlite3 database tweet_sample.sqlite in the current directory.\n",
"# - more details: http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html#sqlite\n",
"\n",
"# sqlite://<nohostname>/<path>\n",
"# where <path> is relative:\n",
"sqlalchemy_sqlite3_engine = sqlalchemy.create_engine( sqlite_database_URL )\n",
"\n",
"# first, try reading the entire \"tweetnet_tweet\" table into memory.\n",
"table_name = \"tweetnet_tweet\"\n",
"tweets_table_df = pandas.read_sql_table( table_name, sqlalchemy_sqlite3_engine )\n",
"#print( tweets_table_df )\n",
"\n",
"# then try turning it into JSON\n",
"tweets_json = tweets_table_df.to_json( orient = \"records\", force_ascii = False )\n",
"\n",
"# print JSON string\n",
"#print( tweets_json )\n",
"\n",
"# could also output to a file\n",
"tweets_table_df.to_json( \"tweets_json.json\", orient = \"records\", force_ascii = False )"
]
},
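{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can quickly confirm what the \"records\" orientation looks like by parsing the string with Python's built-in `json` module. A minimal sketch, using the `tweets_json` string from the cell above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# imports\n",
"import json\n",
"\n",
"# sketch: parse the JSON string back into Python objects - with\n",
"# orient = \"records\" you get a list of dictionaries, one per row.\n",
"tweets_list = json.loads( tweets_json )\n",
"print( len( tweets_list ) )\n",
"print( tweets_list[ 0 ] )"
]
},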
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### pandas example: writing to Excel file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_Quotes from: [http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#excel-files](http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#excel-files)_\n",
"\n",
"\"To write a DataFrame object to a sheet of an Excel file, you can use the `to_excel` instance method. The arguments are largely the same as `to_csv` described above, the first argument being the name of the excel file, and the optional second argument the name of the sheet to which the DataFrame should be written. For example:\n",
"\n",
" df.to_excel( 'path_to_file.xlsx', sheet_name='Sheet1' )\"\n",
"\n",
"Excel documents have a more complicated file format than a simple delimited text document, and so there are different rendering engine you can use if you are not getting the results you seek.\n",
"\n",
"From the pandas doc: \"pandas chooses an Excel writer via two methods:\n",
"\n",
"- the `engine` keyword argument\n",
"- the filename extension (via the default specified in config options)\n",
"\n",
"By default, pandas uses the XlsxWriter for .xlsx and openpyxl for .xlsm files and xlwt for .xls files. If you have multiple engines installed, you can set the default engine through setting the config options io.excel.xlsx.writer and io.excel.xls.writer. pandas will fall back on openpyxl for .xlsx files if Xlsxwriter is not available.\n",
"\n",
"To specify which writer you want to use, you can pass an engine keyword argument to to_excel and to ExcelWriter. The built-in engines are:\n",
"\n",
"- __openpyxl__: This includes stable support for OpenPyxl 1.6.1 up to but not including 2.0.0, and experimental support for OpenPyxl 2.0.0 and later.\n",
"\n",
" - to install: `conda install openpyxl`\n",
"\n",
"- __xlsxwriter__\n",
"\n",
" - to install: `conda install xlsxwriter`\n",
"\n",
"- __xlwt__\n",
"\n",
" - to install: `conda install xlwt`\n",
"\n",
"For a more detailed explanation of options for Excel output:\n",
"- [http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#excel-files](http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#excel-files)\n",
"\n",
"And for a refresher on the parameters you can pass `to_csv()`:\n",
"- [http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#writing-to-csv-format](http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#writing-to-csv-format)\n",
"\n",
"In addition to one or the other of the engines not working quite right, you still might again have to work with the parameters to deal with problems like incorrect encoding or new line characters being in the middle of text and causing the parser to get thrown off.\n",
"\n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# imports\n",
"import pandas\n",
"import sqlalchemy\n",
"\n",
"# declare variables\n",
"sqlite_database_URL = 'sqlite:///db.sqlite3'\n",
"sqlalchemy_sqlite3_engine = None\n",
"table_name = \"\"\n",
"tweets_table_df = None\n",
"tweet_query = \"\"\n",
"tweets_query_df = None\n",
"tweets_json = \"\"\n",
"\n",
"# first, initialize an sqlalchemy sqlite3 engine, referencing a new \n",
"# sqlite3 database tweet_sample.sqlite in the current directory.\n",
"# - more details: http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html#sqlite\n",
"\n",
"# sqlite://<nohostname>/<path>\n",
"# where <path> is relative:\n",
"sqlalchemy_sqlite3_engine = sqlalchemy.create_engine( sqlite_database_URL )\n",
"\n",
"# first, try reading the entire \"tweetnet_tweet\" table into memory.\n",
"table_name = \"tweetnet_tweet\"\n",
"tweets_table_df = pandas.read_sql_table( table_name, sqlalchemy_sqlite3_engine )\n",
"#print( tweets_table_df )\n",
"\n",
"# then try turning it into JSON\n",
"tweets_json = tweets_table_df.to_json( orient = \"records\", force_ascii = False )\n",
"\n",
"# print JSON string\n",
"#print( tweets_json )\n",
"\n",
"# could also output to a file\n",
"tweets_table_df.to_excel( \"tweets.xlsx\", encoding = \"utf-8\", engine = \"openpyxl\" )"
]
},
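{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want more than one data frame in the same workbook, you can hand `to_excel()` a shared `ExcelWriter` instead of a file name. A minimal sketch, assuming the `tweets_table_df` data frame from the cell above (the file and sheet names are arbitrary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# sketch: write two data frames to separate sheets of one workbook\n",
"# by passing a shared ExcelWriter instead of a file name.\n",
"excel_writer = pandas.ExcelWriter( \"tweets_multi.xlsx\", engine = \"openpyxl\" )\n",
"tweets_table_df.to_excel( excel_writer, sheet_name = \"all_tweets\" )\n",
"tweets_table_df.head( 100 ).to_excel( excel_writer, sheet_name = \"sample\" )\n",
"excel_writer.save()"
]
},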
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### More information on writing from a pandas data frame:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- pandas I/O documentation: [pandas.pydata.org/pandas-docs/version/0.16.0/io.html](pandas.pydata.org/pandas-docs/version/0.16.0/io.html)\n",
"- read table or query into pandas: [http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#sql-queries](http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#sql-queries)\n",
"- export to CSV: [http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#writing-to-csv-format](http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#writing-to-csv-format)\n",
"- export to JSON: [http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#json](http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#json)\n",
"- export to Excel: [http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#excel-files](http://pandas.pydata.org/pandas-docs/version/0.16.0/io.html#excel-files)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Direct access to your database"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some applications can also access a database directly, similar to how pandas is able to either load the contents of an entire table or a query's result set. This direct access might be implemented in a program-specific set of extensions to the analysis package's syntax (like R, for example), or it could be via a standard database access API, the most common of which is ODBC. This can be a good answer, but there are trade-offs.\n",
"\n",
"- good:\n",
"\n",
" - if you update or add data, changes will be reflected next time you load using a query.\n",
" - often an easier way to deal with complex data (text with newlines and quotation marks and tabs in it, for example) that contains special characters that could make it hard to do a good job of building a CSV or tab-delimited file.\n",
"\n",
"- bad:\n",
"\n",
" - if you update or add data, changes will be reflected next time you load using a query. If you screw up, or if you didn't intend updates to go to analysis just yet, too bad.\n",
" - software has to support it - not all do.\n",
" - if it is like R, you might need to learn a new programming language.\n",
" - if ODBC, ODBC can be a pain to get configured, and can be pretty brittle. You often are also dealing with an arkane and neglected UI designed to \"help\" you import from a database.\n"
]
},
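{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the ODBC route from Python, the pyodbc package can hand a query's results to pandas. This assumes an ODBC driver is installed and a data source name is already configured - the DSN \"my_dsn\" here is a placeholder, not a real configuration:\n",
"\n",
" # imports\n",
" import pandas\n",
" import pyodbc\n",
" \n",
" # connect through an already-configured ODBC data source name (DSN).\n",
" odbc_connection = pyodbc.connect( \"DSN=my_dsn\" )\n",
" \n",
" # pandas can usually read query results over a DB-API connection like\n",
" # this one (it may warn that it prefers sqlalchemy connectables).\n",
" odbc_df = pandas.read_sql_query( \"SELECT * FROM tweetnet_tweet LIMIT 100;\", odbc_connection )\n",
" print( odbc_df.head() )\n",
" \n",
" # clean up\n",
" odbc_connection.close()"
]
},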
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Application-specific - example: R"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Example of R code to connect to SQLite 3 database and pull a table in as a data frame ( _based on: [http://stackoverflow.com/questions/12845639/accessing-sqlite-database-in-r](http://stackoverflow.com/questions/12845639/accessing-sqlite-database-in-r)_ ):\n",
"\n",
" # first, change your working directory to the directory where the database file resides.\n",
" setwd( \"/Users/jonathanmorgan/Documents/work/twitter_test/research\" )\n",
" \n",
" # load DBI and RSQLite\n",
" library( DBI )\n",
" library( RSQLite )\n",
" \n",
" # connect to the database\n",
" database_file <- \"db.sqlite3\"\n",
" dbConnection <- dbConnect( dbDriver( \"SQLite\" ), dbname = database_file )\n",
" \n",
" # run query to pull back tweetnet_tweet table.\n",
" tweet_select_sql <- \"SELECT * FROM tweetnet_tweet;\"\n",
" tweet_results <- dbSendQuery( dbConnection, tweet_select_sql )\n",
" \n",
" # fetch results into an R data frame\n",
" tweet_df = dbFetch( tweet_results )\n",
" \n",
" # output column names (they are the same as column names in database).\n",
" colnames( tweet_df )\n",
"\n",
" # try describing a few columns.\n",
" summary( tweet_df$tweet_retweet_count )\n",
" summary( tweet_df$tweet_url_count )\n",
" summary( tweet_df$tweet_hashtag_mention_count )\n",
" \n",
" # perhaps a histogram?\n",
" hist( tweet_df$tweet_hashtag_mention_count ) # long tail\n",
" \n",
" # Clear the result\n",
" dbClearResult( tweet_results )\n",
"\n",
" # Disconnect from the database\n",
" dbDisconnect( dbConnection )\n",
" \n",
"More information:\n",
"\n",
"- Github home of RSQLite project: [https://github.com/rstats-db/RSQLite](https://github.com/rstats-db/RSQLite)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualization Tools"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, you might ask, why did we talk so much about data formats and options and less about visualization? Two answers:\n",
"\n",
"- 1) I am a database and server guy. Tailed log files are a pretty good visualization in my book (see how I used some terminology there to even make it seem like I live in a different world...?).\n",
"- 2) There are a gazillion of them. And many are specialized - good at some types of plots; types of data; or philosophies of science pedagogy; and awful for others. And different disciplines or even cliques within disciplines have their favorites. Your advisor might have one he knows and loves. Or your department might have one that they buy, so you use for free. Etc.\n",
"\n",
"I personally pick a tool based on the type of data and analysis I am doing, but it is rarely the same tool. My engineering colleagues always use matlab no matter what (I own a copy just so I could more easily adjust captions close to deadline without having to get them to do it).\n",
"\n",
"I have looked into this since the start of the semester, and the best answer I could come up with was to give you strategies for springing your data from the database and let you come up with visualization tools you want to use.\n",
"\n",
"Some general tools for visualization:\n",
"\n",
"- ggplot, ggplot2 in R\n",
"- matplotlib in Python for statistical distributions\n",
"- gnuplot\n",
"\n",
" - http://people.duke.edu/~hpgavin/gnuplot.html\n",
" - http://www.gnuplot.info/\n",
"\n",
"- d3 in Javascript for interactive web visualizations - http://d3js.org/\n",
"- Gephi for network visualizations\n",
"- tableu\n",
"- heatmaps in R - http://digitheadslabnotebook.blogspot.com/2011/06/drawing-heatmaps-in-r.html?imm_mid=0cd09a&cmp=em-data-na-na-newsltr_20150218"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}