jonathanmorgan/Data_and_databases.ipynb

## Data_and_databases.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data, Databases, and SQL\n",
    "\n",
    "# Table of Contents\n",
    "\n",
    "- [Introduction](#Introduction)\n",
    "    \n",
    "    - [KEEP A NOTEBOOK!](#KEEP-A-NOTEBOOK!)\n",
    "    - [backing up SQLite databases](#backing-up-SQLite-databases)\n",
    "    \n",
    "- [Setup](#Setup)\n",
    "\n",
    "    - [Working folder](#Working-folder)\n",
    "    - [IPython notebook](#IPython-notebook)\n",
    "    - [Databases](#Databases)\n",
    "\n",
    "- [Databases - relational vs. no sql](#Databases---relational-vs.-no-sql)\n",
    "\n",
    "    - [Why relational, and why sqlite?](#Why-relational,-and-why-sqlite?)\n",
    "\n",
    "- [SQL](#SQL)\n",
    "\n",
    "    - [SELECT - Querying the database](#SELECT---Querying-the-database)\n",
    "    - [WHERE clauses - Limiting the results](#WHERE-clauses---Limiting-the-results)\n",
    "    - [JOIN - Connecting multiple tables](#JOIN---Connecting-multiple-tables)\n",
    "    - [GROUP BY and aggregate functions](#GROUP-BY-and-aggregate-functions)\n",
    "    - [ORDER BY - ordering query results](#ORDER-BY---ordering-query-results)\n",
    "    - [Modifying the database](#Modifying-the-database)\n",
    "\n",
    "- [SQL exercise - Moneyball](#SQL-exercise---Moneyball)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "\n",
    "- Back to [Table of Contents](#Table-of-Contents)\n",
    "\n",
    "## KEEP A NOTEBOOK!\n",
    "\n",
    "- Back to [Table of Contents](#Table-of-Contents)\n",
    "\n",
    "When you are working on data projects, you should keep a notebook that includes the SQL code you write.  This can (and probably should) be digital - a markdown document, an Evernote note, etc.  In it record everything you do, with date and time stamps if appropriate.\n",
    "\n",
    "- notes in english\n",
    "- SQL\n",
    "- python steps\n",
    "- etc.\n",
    "\n",
    "An ipython notebook can be good for this (but you have to keep making new cells when you re-run your program), but so can a text file.  How you do it doesn’t matter.  But, do it.  Keep your code!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## backing up SQLite databases\n",
    "\n",
    "- Back to [Table of Contents](#Table-of-Contents)\n",
    "\n",
    "### How do I back up?\n",
    "\n",
    "- backing up sqlite3 database = make copy of file.\n",
    "- can also export, such that you can make exports of specific tables, and then is easier to give data to someone else for inclusion in an existing database.  To do this in SQLite Manager:\n",
    "\n",
    "    - open SQLite Manager.\n",
    "    - open your database.\n",
    "    - in the column on the left, click on the table you want to export.\n",
    "    - select \"Export Table\".\n",
    "    - in the \"Export Wizard\", choose the SQL tab and check \"Include CREATE TABLE statement\".\n",
    "    - Click \"OK\" at the bottom of the window.\n",
    "    - Select a location for the SQL file and click \"Save\".\n",
    "    - The SQL file will be exported and placed where you specified.\n",
    "\n",
    "### When do I back up?\n",
    "\n",
    "- back up before you run any program that will make big changes.\n",
    "- back up before you test out complicated logic, even if it shouldn't alter data.\n",
    "- back up every week or two just to be safe."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Setup\n",
    "\n",
    "- Back to [Table of Contents](#Table-of-Contents)\n",
    "\n",
    "## Working Folder\n",
    "\n",
    "- Back to [Table of Contents](#Table-of-Contents)\n",
    "\n",
    "For what we're going to do today, we need a work folder.  Either choose an existing folder where you'll work or make a new one."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## IPython notebook\n",
    "\n",
    "- Back to [Table of Contents](#Table-of-Contents)\n",
    "\n",
    "You'll want to download a copy of this IPython notebook to your computer.\n",
    "\n",
    "- download this IPython notebook to your working folder.\n",
    "\n",
    "    - click the download link in the upper right, then choose to save it in your working folder.\n",
    "\n",
    "- Then, either:\n",
    "\n",
    "    - open IPython Notebook from Launcher, navigate to your working folder, then run your local copy of this notebook.\n",
    "    - OR open a shell in your working folder, run `ipython notebook`, then once the browser window opens up, run your local copy of this notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Databases\n",
    "\n",
    "- Back to [Table of Contents](#Table-of-Contents)\n",
    "\n",
    "For practice with SQL, we'll be looking at Sean Lahman's database of baseball stats, which contains data from 1871 to 2013.  To get the database:\n",
    "\n",
    "- go to the github.com page for the actual database file:\n",
    "\n",
    "    - [https://github.com/jknecht/baseball-archive-sqlite/blob/master/lahman2013.sqlite](https://github.com/jknecht/baseball-archive-sqlite/blob/master/lahman2013.sqlite).\n",
    "\n",
    "- Click the \"view the full file\" link to download it.\n",
    "- Download the database file to your working folder, the same folder where you downloaded this IPython Notebook.\n",
    "- To start, open it with SQLite Manager.\n",
    "\n",
    "Since I have you for the entire class, we'll also take look at some of the twitter data for the next assignment.\n",
    "\n",
    "- Go to exercise 6 in ctools.\n",
    "- Download the file `tweet_sample_sample.txt` to your work folder."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Databases - relational vs. no-sql\n",
    "\n",
    "- Back to [Table of Contents](#Table-of-Contents)\n",
    "\n",
    "A relational database stores information in data files named **_tables_**.  Each **_table_** contains a pre-defined set of fields, called **_columns_**.  In a table, individual records are called **_rows_**, and in a given row, there will be one value for each **_column_**, sometimes called a **_cell_**.  Each row in a given table should have a **_key_** or ID, and when information in one table is related to information in another, in the most basic relational form, the two are linked by one of the rows storing the ID of the related row in a **_foreign key_** column as part of its data.  When you want to bring related data together in a relational database, you use the **_SQL_** querying language to filter and join together data across tables.\n",
    "\n",
    "A no-sql database treats data differently, and at this point, that is about all you can say that is common to all the different data storage systems that call them selves \"no-sql\".  no-sql databases are usually not inherently relational, though some are.  They used to not support SQL, but some have started to.  They tend to store data in more of a record- or document-centric model, where all data is stored in a homogenous pool of records, and the contents of a record are not formally defined.  In this model, some records are made up of name-value pairs (you are responsible for making sure records of a certain type have the same information).  Some hold \"documents\" that can contain substantially different formats (XML, JSON, plain text, PDF files, etc.).  Graph databases, which model network data, use the name-value pair container model, but have two types of records - nodes and ties.  Most grew up around the performance needs of web-based information systems that need to perform on a massive scale.\n",
    "\n",
    "So, how do you choose?  The key is understanding your data, understanding what your needs are, and then understanding what a given database system gives you.\n",
    "\n",
    "- If you really do have a set of documents and you just want to filter based on the text within, a document-based database might be the way to go (or just storing things on the file system and using command line tools...).\n",
    "- If you just have basic flat data (answers from a single set of survey questions for lots and lots of people), a database of name-value pair records might suffice.\n",
    "- If you have complex data made up of information on different distinct and interrelated entities, and you are interested in exploring the relationships, you probably should use a relational database where you can explicitly model your data and its relations (twitter data, plus a survey, plus facebook data and reddit posts, for example)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Why relational, and why sqlite?\n",
    "\n",
    "- Back to [Table of Contents](#Table-of-Contents)\n",
    "\n",
    "So, why a relational database, and why sqlite?\n",
    "\n",
    "- relational:\n",
    "\n",
    "    - our data is usually relational.  Twitter example:\n",
    "\n",
    "        - one user can have many tweets\n",
    "        - hashtags, urls, @mentions can all appear in multiple tweets.\n",
    "        - tweets can reference each other because of retweeting.\n",
    "\n",
    "- sqlite:\n",
    "\n",
    "    - because it is easy, local, contained in a file, doesn't require a standalone server, and data will likely be modest.\n",
    "    - Other options for when going gets rough:\n",
    "\n",
    "        - PostgreSQL\n",
    "\n",
    "            - Postgres.app for Mac\n",
    "\n",
    "        - MySQL (though doesn't deal as well with really large data sets, or with emoji, as postgresql).\n",
    "        - and many others."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SQL\n",
    "\n",
    "- Back to [Table of Contents](#Table-of-Contents)\n",
    "\n",
    "_Based on [Week 5: SQL lesson](http://webdev.cas.msu.edu/cas992/weeks/week5.html) by Rick Wash at Michigan State University - Thanks Rick!_\n",
    "\n",
    "SQL is a quirky language.  It is different from procedural languages like Python.  It is designed for a very specific purpose: to interact with relational data.  It isn't structured like other languages, and while it can make data access easy, it also can make tasks that would be easy in other languages (though perhaps not exceptionally performant) confoundingly complex.  Let's dive in so you can see it for yourself!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data we're looking at\n",
    "\n",
    "There are a couple of tables that we will be focusing on. Player information is all contained in the **`Master`** table. Team information is contained in the **`Teams`** table. Which team each player played on varies by year, and is contained in the **`Appearances`** table. Batting statistics are in the **`Batting`** table, and Pitching statistics for each pitcher are in the **`Pitching`** table. Finally, player salaries are all in the **`Salaries`** table."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## SELECT - Querying the database\n",
    "\n",
    "- Back to [Table of Contents](#Table-of-Contents)\n",
    "\n",
    "The basic method of querying the database is to use a select statement:\n",
    "\n",
    "    SELECT *\n",
    "    FROM Master;\n",
    "\n",
    "Where:\n",
    "\n",
    "- you put the columns you want returned in the SELECT clause (after the word \"SELECT\" but before the word \"FROM\").  An asterisk ( \"\\*\" ) is a wildcard - it will return all columns for a given table.\n",
    "- you put the name of the table (or names of the tables - more on this in a bit) you want to query after the word \"FROM\", in the FROM clause.\n",
    "- it is considered good style to capitalize words in an SQL query that are SQL words, not variables, table names, or values you are filtering on or searching for - SELECT, FROM, WHERE, etc.\n",
    "- you should learn to end SQL statements with a semi-colon.  It isn't required everywhere, but it is required in some contexts, so better to be aware, get into the habit.\n",
    "\n",
    "Instead of specifying “all” columns ( \"\\*\" ), you can specify which columns you want by name, in a comma-delimited list after \"SELECT\":\n",
    "\n",
    "    SELECT playerID, nameFirst, nameLast\n",
    "    FROM Master;\n",
    "\n",
    "You can specify calculations in the list of columns also:\n",
    "\n",
    "    SELECT playerID, ( AB / G )\n",
    "    FROM Batting;\n",
    "\n",
    "And you can give those new columns names:\n",
    "\n",
    "    SELECT playerID, ( AB / G ) AS avg_bats_per_game\n",
    "    FROM Batting;\n",
    "    \n",
    "You can also use special keywords and functions in the SELECT clause.  For example, the keyword \"DISTINCT\", which only returns any given value in a given column once:\n",
    "\n",
    "    SELECT DISTINCT nameFirst\n",
    "    FROM Master;\n",
    "    \n",
    "And \"COUNT()\", which returns a count of matching rows rather than a list:\n",
    "    \n",
    "    SELECT COUNT( DISTINCT nameFirst )\n",
    "\n",
    "FROM Master;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## WHERE clauses - Limiting the results\n",
    "\n",
    "- Back to [Table of Contents](#Table-of-Contents)\n",
    "\n",
    "In a SELECT query, you can add a WHERE clause to limit the results:\n",
    "\n",
    "    SELECT *\n",
    "    FROM Batting\n",
    "    WHERE yearID = 2010;\n",
    "\n",
    "Where:\n",
    "\n",
    "- you are making conditional tests, just like in a Python \"if\" statement.\n",
    "- EXCEPT here, instead of \"==\" being the equality operator, it is just \"=\".\n",
    "- Comparison operators:\n",
    "\n",
    "    - \"**_`=`_**\" - equal to\n",
    "    - \"**_`!=`_**\" or \"**_`<>`_**\" - not equal to\n",
    "    - \"**_`<`_**\" - less than\n",
    "    - \"**_`<=`_**\" - less-than-or-equal-to\n",
    "    - \"**_`>`_**\" - greater than\n",
    "    - \"**_`>=`_**\" - greater-than-or-equal-to\n",
    "    - \"**_`LIKE`_**\" and \"**_`NOT LIKE`_**\" - wild-card matching operator, where percent matches 0 or more characters ( \"%\" ) and an underscore matches any 1 character ( \"_\" ).\n",
    "    - \"**_`IN( value_list )`_**\" and \"**_`NOT IN( value_list )`_**\" - checks whether the value to the left of the \"IN\", usually a column's value in a given row, is either IN or NOT IN the list on the right of the IN.\n",
    "    \n",
    "An example of using LIKE:\n",
    "\n",
    "    SELECT *\n",
    "    FROM Master\n",
    "    WHERE nameLast LIKE 'Jack%'\n",
    "\n",
    "You can specify multiple conditions for matching in your WHERE clauses, as well, to more precisely filter the results of your query:\n",
    "\n",
    "    SELECT *\n",
    "    FROM Batting\n",
    "    WHERE yearID = 2010 and teamID = 'DET'\n",
    "    \n",
    "Note:\n",
    "\n",
    "- when you are matching a column whose type is numeric, you just put the value in the query, with no quotation marks (just like in Python).\n",
    "- when you are filtering a string column, you have to include the value you are looking for (the value on the right-hand side of the equal sign) in single-quotes.  They must be single-quotes, too.  Unlike in Python, double-quotes have an entirely different meaning that single quotes in SQL, and will cause your query to fail.\n",
    "\n",
    "Like \"None\" in Python, the signifier of an unset value in a column for a row is special - NULL.  To check for NULL, you use \"IS NULL\" or \"IS NOT NULL\", rather than the \"=\" or \"!=\".\n",
    "\n",
    "    /* find the fallen */\n",
    "    SELECT *\n",
    "    FROM Master\n",
    "    WHERE deathYear IS NOT NULL;\n",
    "\n",
    "You can also explicitly cut off the number of results your query returns using the LIMIT keyword.  Just LIMITing to 10 only returns the first 10 results for the query:\n",
    "\n",
    "    SELECT *\n",
    "    FROM Batting\n",
    "    WHERE yearID = 2010 and teamID = 'DET'\n",
    "    LIMIT 10;\n",
    "    \n",
    "You can also use LIMIT to skip to the middle of the results by giving it two numbers, separated by a comma.  The first number is the number of records you want to skip, the second number is how many records you want to include after you skip:\n",
    "\n",
    "    /* skip 10, the output 15 */\n",
    "    SELECT *\n",
    "    FROM Batting\n",
    "    WHERE yearID = 2010 and teamID = 'DET'\n",
    "    LIMIT 10, 15;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## JOIN - Connecting multiple tables\n",
    "\n",
    "- Back to [Table of Contents](#Table-of-Contents)\n",
    "\n",
    "We can specify multiple tables in the FROM clause of a select query. This is called a “join”. However, when we do, we need to remember to specify how to match up rows across the two tables. Usually, there is a column that is the same in both tables that can be used to match them up. For much of the baseball database, that would be a column like playerID or teamID. Also, we frequently give tables temporary short names to make it easy to refer to them.\n",
    "\n",
    "    /* list home runs for all players. */\n",
    "    SELECT m.nameFirst, m.nameLast, b.HR\n",
    "    FROM Master m, Batting b\n",
    "    WHERE m.playerID = b.playerID;\n",
    "\n",
    "We can still use regular WHERE clauses in these queries, too, to further filter:\n",
    "\n",
    "    /* list home runs for all Detroit players from 2013. */\n",
    "    SELECT m.nameFirst, m.nameLast, b.HR\n",
    "    FROM Master m, Batting b \n",
    "    WHERE m.playerID = b.playerID\n",
    "        AND b.yearID = 2013\n",
    "        AND b.teamID = 'DET';\n",
    "\n",
    "Table joins are the most important feature of SQL databases; they are very powerful and allow us to create all kinds of complex queries. You can also join more than two tables if you like:\n",
    "\n",
    "    /* list players who appeared in at least one game in 2013. */\n",
    "    SELECT m.nameFirst, m.nameLast, t.name\n",
    "    FROM Master m, Appearances a, Teams t \n",
    "    WHERE m.playerID = a.playerID\n",
    "        AND a.teamID = t.teamID\n",
    "        AND a.yearID = t.yearID\n",
    "        AND a.yearID = 2013;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GROUP BY and aggregate functions\n",
    "\n",
    "- Back to [Table of Contents](#Table-of-Contents)\n",
    "\n",
    "Often, one thing that you want to do is to aggregate over multiple rows. For example, what is the average number of hits per player for each team? Or what is the total amount of salary spent by each team? To do this, use a GROUP BY clause:\n",
    "\n",
    "    /* sum walks for each team in 2013 */\n",
    "    SELECT teamID, SUM(BB)\n",
    "    FROM Batting\n",
    "    WHERE yearID = 2013\n",
    "    GROUP BY teamID;\n",
    "\n",
    "There are a number of useful aggregate functions:\n",
    "\n",
    "- **_SUM(column)_** : Calculate the sum of column for all the rows in each group\n",
    "- **_AVG(column)_** : Calculate the numeric average for all of the rows in each group\n",
    "- **_COUNT(column)_** : Count the number of rows in each group\n",
    "- **_MIN(column) and MAX(column)_** : Find the minimum or maximum value of column in all the rows in each group\n",
    "\n",
    "Often, it can be very powerful to combine GROUP BY and table joins. To figure out these queries, I recommend first getting the join to return the individual rows correctly, and then adding in the GROUP BY and aggregates."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ORDER BY - ordering query results\n",
    "\n",
    "- Back to [Table of Contents](#Table-of-Contents)\n",
    "\n",
    "Normally, the results are returned in the order they appear in the database. However, it can be very useful to re-order the results using ORDER BY\n",
    "\n",
    "    SELECT m.nameFirst, m.nameLast, b.teamID, b.HR\n",
    "    FROM Master m, Batting b \n",
    "    WHERE m.playerID = b.playerID\n",
    "        AND b.yearID = 2010\n",
    "    ORDER BY b.HR DESC\n",
    "\n",
    "(After you specify which column to order by, you can optionally specify either ASC for ascending order, or DESC for descending order.)\n",
    "\n",
    "Using ORDER BY with custom column names can be really useful when combined with GROUP BY:\n",
    "\n",
    "    SELECT teamID, SUM(HR) as homeruns\n",
    "    FROM Batting\n",
    "    WHERE yearID = 2010\n",
    "    GROUP BY teamID\n",
    "    ORDER BY homeruns DESC"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Modifying the database\n",
    "\n",
    "- Back to [Table of Contents](#Table-of-Contents)\n",
    "\n",
    "In addition to retrieving information from an existing database, you can also insert data into a database, update existing rows, and delete records using SQL.  Here are some example queries:\n",
    "\n",
    "- **INSERT**: Adding a row to a table\n",
    "\n",
    "        INSERT INTO Master\n",
    "        (playerID, nameFirst, nameLast)\n",
    "        VALUES ('mightycasey5000', 'Mighty', 'Casey')\n",
    "\n",
    "- **UPDATE**: Changing data that is already in a table\n",
    "\n",
    "        UPDATE Master\n",
    "        SET nameGiven = 'The Mighty Casey'\n",
    "        WHERE playerID = 'mightycasey5000'\n",
    "\n",
    "- **DELETE**: Removing one or more rows from a table\n",
    "\n",
    "        DELETE FROM Master\n",
    "        WHERE playerID = 'mightycasey5000'\n",
    "\n",
    "- **ALTER TABLE**: Changing the structure of an existing table\n",
    "\n",
    "        ALTER TABLE Master\n",
    "            ADD COLUMN gender VARCHAR(1) DEFAULT 'M'\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SQL exercise - Moneyball\n",
    "\n",
    "- Back to [Table of Contents](#Table-of-Contents)\n",
    "\n",
    "Now that we've experienced SQL, lets try some querying.\n",
    "\n",
    "## Basics\n",
    "\n",
    "- Which player hit the most home runs in 2013?\n",
    "- Which team spent the most money on player salaries in 2013?\n",
    "\n",
    "## Advanced\n",
    "\n",
    "- Which team spent the least money per win on player salaries in 2013?\n",
    "- Which player averaged the fewest at bats between home runs in 2013?\n",
    "- Which player in 2013 had the highest on base percentage?\n",
    "- Which Detroit Tigers pitcher had the most wins in a single season in the 2000s (all seasons starting in 2000 or greater)?\n",
    "- In the 2000s, did the Detroit Tigers draw more or fewer walks (Base-on-Balls or BB) as the 21st century went on?\n",
    "\n",
    "<hr />"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}