@jonathanmorgan
Last active January 12, 2017 11:30
Databases and Python - SQLite 3
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Interacting with databases using Python - SQLite 3\n",
"\n",
"# Table of Contents\n",
"\n",
"- [Introduction](#Introduction)\n",
" \n",
" - [KEEP A NOTEBOOK!](#KEEP-A-NOTEBOOK!)\n",
" - [backing up SQLite databases](#backing-up-SQLite-databases)\n",
" \n",
"- [Setup](#Setup)\n",
"\n",
" - [Working folder](#Working-folder)\n",
" - [IPython notebook](#IPython-notebook)\n",
" - [Databases](#Databases)\n",
"\n",
"- [Databases - relational vs. no sql](#Databases---relational-vs.-no-sql)\n",
"\n",
" - [Why relational, and why sqlite?](#Why-relational,-and-why-sqlite?)\n",
"\n",
"- [Databases and python](#Databases-and-python)\n",
"\n",
" - [Making and using a database connection and a cursor](#Making-and-using-a-database-connection-and-a-cursor)\n",
"\n",
" - [Cleaning up!](#Cleaning-up!)\n",
" - [Special note on INSERTs in Python - getting ID of INSERTed row](#Special-note-on-INSERTs-in-Python---getting-ID-of-INSERTed-row)\n",
" \n",
" - [Managing cursors](#Managing-cursors)\n",
" \n",
"- [Database design](#Database-design)\n",
"\n",
" - [Normalizing data](#Normalizing-data)\n",
" \n",
" - [1st Normal Form](#1st-Normal-Form)\n",
" - [2nd Normal Form](#2nd-Normal-Form)\n",
" - [3rd Normal Form](#3rd-Normal-Form)\n",
" - [In addition](#In-addition)\n",
" \n",
"- [Exercise - Twitter data](#Exercise---Twitter-data)\n",
"\n",
" - [Design your database](#Design-your-database)\n",
" - [Strategies for processing and normalizing flat data files](#Strategies-for-processing-and-normalizing-flat-data-files)\n",
" \n",
" - [Using `csv` package to read and process CSV files](#Using-csv-package-to-read-and-process-CSV-files)\n",
" - [Reading CSV files with `pandas`](#Reading-CSV-files-with-pandas)\n",
" - [And the winner is...](#And-the-winner-is...)\n",
" \n",
"- [Final notes on complexity and cost-benefit of normalization](#Final-notes-on-complexity-and-cost-benefit-of-normalization)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"## KEEP A NOTEBOOK!\n",
"\n",
"- record everything you do, with date and time stamps if appropriate.\n",
"\n",
" - notes in english\n",
" - SQL\n",
" - python steps\n",
" - etc.\n",
"\n",
"- ipython notebook can be good for this (but you have to keep making new cells when you re-run your program), but so can a text file. How you do it doesn’t matter. But, do it. Keep your code!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## backing up SQLite databases\n",
"\n",
"### How do I back up?\n",
"\n",
"- backing up sqlite3 database = make copy of file.\n",
"- can also export, such that you can make exports of specific tables, and then is easier to give data to someone else for inclusion in an existing database. To do this in SQLite Manager:\n",
"\n",
" - open SQLite Manager.\n",
" - open your database.\n",
" - in the column on the left, click on the table you want to export.\n",
" - select \"Export Table\".\n",
" - in the \"Export Wizard\", choose the SQL tab and check \"Include CREATE TABLE statement\".\n",
" - Click \"OK\" at the bottom of the window.\n",
" - Select a location for the SQL file and click \"Save\".\n",
" - The SQL file will be exported and placed where you specified.\n",
"\n",
"### When do I back up?\n",
"\n",
"- back up before you run any program that will make big changes.\n",
"- back up before you test out complicated logic, even if it shouldn't alter data.\n",
"- back up every week or two just to be safe."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setup\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"## Working Folder\n",
"\n",
"For what we're going to do today, we need a work folder. Either choose an existing folder where you'll work or make a new one."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## IPython Notebook"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- download this IPython notebook to your working folder.\n",
"\n",
" - click the download link in the upper right, then choose to save it in your working folder.\n",
"\n",
"- Then, either:\n",
"\n",
" - open IPython Notebook from Launcher, navigate to your working folder, then run your local copy of this notebook.\n",
" - OR open a shell in your working folder, run `ipython notebook`, then once the browser window opens up, run your local copy of this notebook."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Databases"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll take a look at some twitter data in this notebook.\n",
"\n",
"- Go to exercise 6 in ctools.\n",
"- Download the file `tweet_sample_sample.txt` to your work folder."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Databases - relational vs. no-sql\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"A relational database stores information in data files named **_tables_**. Each **_table_** contains a pre-defined set of fields, called **_columns_**. In a table, individual records are called **_rows_**, and in a given row, there will be one value for each **_column_**, sometimes called a **_cell_**. Each row in a given table should have a **_key_** or ID, and when information in one table is related to information in another, in the most basic relational form, the two are linked by one of the rows storing the ID of the related row in a **_foreign key_** column as part of its data. When you want to bring related data together in a relational database, you use the **_SQL_** querying language to filter and join together data across tables.\n",
"\n",
"A no-sql database treats data differently, and at this point, that is about all you can say that is common to all the different data storage systems that call them selves \"no-sql\". no-sql databases are usually not inherently relational, though some are. They used to not support SQL, but some have started to. They tend to store data in more of a record- or document-centric model, where all data is stored in a homogenous pool of records, and the contents of a record are not formally defined. In this model, some records are made up of name-value pairs (you are responsible for making sure records of a certain type have the same information). Some hold \"documents\" that can contain substantially different formats (XML, JSON, plain text, PDF files, etc.). Graph databases, which model network data, use the name-value pair container model, but have two types of records - nodes and ties. Most grew up around the performance needs of web-based information systems that need to perform on a massive scale.\n",
"\n",
"So, how do you choose? The key is understanding your data, understanding what your needs are, and then understanding what a given database system gives you.\n",
"\n",
"- If you really do have a set of documents and you just want to filter based on the text within, a document-based database might be the way to go (or just storing things on the file system and using command line tools...).\n",
"- If you just have basic flat data (answers from a single set of survey questions for lots and lots of people), a database of name-value pair records might suffice.\n",
"- If you have complex data made up of information on different distinct and interrelated entities, and you are interested in exploring the relationships, you probably should use a relational database where you can explicitly model your data and its relations (twitter data, plus a survey, plus facebook data and reddit posts, for example)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Why relational, and why sqlite?\n",
"\n",
"- relational:\n",
"\n",
" - our data is usually relational. Twitter example:\n",
"\n",
" - one user can have many tweets\n",
" - hashtags, urls, @mentions can all appear in multiple tweets.\n",
" - tweets can reference each other because of retweeting.\n",
"\n",
"- sqlite:\n",
"\n",
" - because it is easy, local, contained in a file, doesn't require a standalone server, and data will likely be modest.\n",
" - Other options for when going gets rough:\n",
"\n",
" - PostgreSQL\n",
"\n",
" - Postgres.app for Mac\n",
"\n",
" - MySQL (though doesn't deal as well with really large data sets, or with emoji, as postgresql).\n",
" - and many others."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Databases and python\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Python lets you interact with databases using SQL just like you do in SQLite Manager. Python code can do SELECTs, CREATEs, INSERTs, UPDATEs, and DELETEs, and any other SQL you can run other ways, and the results are returned in a format that lets you interact with them after the SQL statements finish.\n",
"\n",
"To interact with a database using python, first you have to connect to the database. Then, you use the connection to create a cursor, used to actually interact with the database over the connection.\n",
"\n",
"A cursor takes SQL written in python, delivers it to the database, then takes the results, converts them to a format that can be interacted with using python, and returns that transformed, usable response to you to play with.\n",
"\n",
"Cursors also let you send other types of commands to the database, including telling the database when you want to actually commit changes you've made (most relational databases are also transactional, meaning they can queue up sets of individual changes that are part of a higher-level transaction and only actually commit them to the database when told the transaction as a whole completed succesfully).\n",
"\n",
"The code below is specific to SQLite, but it uses functions and objects specified in Python's DB-API 2 API, and so you can use almost identical code to interact with other types of databases."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Making and using a database connection and a cursor\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"To create a database connection, you first must import that database's DB-API implementation, then you call the connect() function, passing it information on where to find the database to which you are trying to connect.\n",
"\n",
"To make a cursor, call the cursor() method on the connection object instance returned by the call to connect.\n",
"\n",
"To execute SQL, just pass the SQL statement to the cursor's execute() method and store the result in a variable.\n",
"\n",
"To access rows from query, make a `for current_row in result_set:` loop, then for each row, you can reference values in that row by using their name as the name reference into a dictionary. Example - get \"nameLast\" column value out of each row:\n",
"\n",
" last_name = result_set[ \"nameLast\" ]\n",
" \n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Detroit Home Run Hitters, 2013!\n",
"==> Alex Avila - 11 home runs!\n",
"==> Miguel Cabrera - 44 home runs!\n",
"==> Andy Dirks - 9 home runs!\n",
"==> Prince Fielder - 25 home runs!\n",
"==> Avisail Garcia - 2 home runs!\n",
"==> Bryan Holaday - 1 home runs!\n",
"==> Torii Hunter - 17 home runs!\n",
"==> Jose Iglesias - 2 home runs!\n",
"==> Omar Infante - 10 home runs!\n",
"==> Austin Jackson - 12 home runs!\n",
"==> Don Kelly - 6 home runs!\n",
"==> Victor Martinez - 14 home runs!\n",
"==> Brayan Pena - 4 home runs!\n",
"==> Jhonny Peralta - 11 home runs!\n",
"==> Ramon Santiago - 1 home runs!\n",
"==> Matt Tuiasosopo - 7 home runs!\n",
"In the 'finally:', cleaning up our mess.\n"
]
}
],
"source": [
"# imports\n",
"import sqlite3\n",
"\n",
"# declare variables\n",
"my_connection = None\n",
"my_cursor = None\n",
"result_set = None\n",
"current_row = None\n",
"\n",
"# use try-->except-->finally to make sure you always close your database\n",
"# connections.\n",
"try:\n",
" \n",
" # make a connection - pass the name of the database file.\n",
" my_connection = sqlite3.connect( 'lahman2013.sqlite' )\n",
" \n",
" # set row_factory that returns values mapped to column names\n",
" # as well as in an ordered list\n",
" my_connection.row_factory = sqlite3.Row\n",
" \n",
" # then, make a cursor.\n",
" my_cursor = my_connection.cursor()\n",
" \n",
" # Actually use cursor here...\n",
" result_set = my_cursor.execute(\n",
" '''\n",
" /* list home runs for all Detroit players from 2013. */\n",
" SELECT m.nameFirst, m.nameLast, b.HR\n",
" FROM Master m, Batting b \n",
" WHERE m.playerID = b.playerID\n",
" AND b.yearID = 2013\n",
" AND b.teamID = 'DET'\n",
" AND b.HR > 0;\n",
" ''' )\n",
" \n",
" # output results\n",
" print( \"Detroit Home Run Hitters, 2013!\" )\n",
" for current_row in result_set:\n",
" \n",
" print( \"==> \" + current_row[ \"nameFirst\" ] + \" \" + current_row[ \"nameLast\" ] + \" - \" + str( current_row[ \"HR\" ] ) + \" home runs!\" )\n",
" pass\n",
"\n",
" #-- END loop over results --#\n",
"\n",
"except Exception as e:\n",
" \n",
" print( \"Exception making connection or creating cursor!\" )\n",
" \n",
"finally:\n",
" \n",
" print( \"In the 'finally:', cleaning up our mess.\" )\n",
" \n",
" # close cursor\n",
" my_cursor.close()\n",
" \n",
" # close connection\n",
" my_connection.close()\n",
" \n",
"#-- END try-->except-->finally around database connection. --#"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cleaning up!\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"You'll notice here that we did a little more than just open the connection and cursor. Whenever you do Input/Output (IO) - reading or writing from a file, a database table, or a network connection, or any other external resource - you should always clean up after yourself. Close whatever you've opened, so you don't accidentally lock others out of resources.\n",
"\n",
"Try-->except-->finally helps make being a good IO citizen easier. We've discussed \"try\" and \"except\". Now to \"finally\" - things in the \"finally\" clause always execute even if there is an exceptio. So, if you, say, open a database file for reading and want to make as certain as you can that it will be closed once you are done, you place a call to my_connection.close() inside the \"finally:\" block, so it gets called no matter what happens in the try and except. Bbe careful with it, but always use it."
]
},
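{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside (a sketch, not something the example above uses): Python's `contextlib.closing` wrapper can express the same clean-up guarantee more compactly than an explicit try-->except-->finally. Note that using an `sqlite3` connection directly as a context manager (`with sqlite3.connect(...) as conn:`) manages transactions, not closing, which is why `closing` is used here:\n",
"\n",
"```python\n",
"import sqlite3\n",
"from contextlib import closing\n",
"\n",
"# closing() guarantees close() is called on exit, even if an exception occurs.\n",
"with closing( sqlite3.connect( ':memory:' ) ) as my_connection:\n",
"\n",
"    my_connection.row_factory = sqlite3.Row\n",
"\n",
"    with closing( my_connection.cursor() ) as my_cursor:\n",
"\n",
"        result_set = my_cursor.execute( 'SELECT 1 AS test_value;' )\n",
"        for current_row in result_set:\n",
"            print( current_row[ 'test_value' ] )\n",
"```"
]
},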
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Special note on INSERTs in Python - getting ID of INSERTed row\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"When you do an INSERT in Python, you'll often want to know the ID of the row you just inserted, so you can reference it in relations from other rows. The method for getting the ID of the record created in the most recently executed INSERT is different for different databases. For SQLite, as long as your primary key is an auto-incremented integer, you can reference the ID of the last row processed by the cursor in the variable `lastrowid`, stored on your database cursor: `my_cursor.lastrowid`\n",
"\n",
"Example (and example of making an in-memory instance of sqlite):"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"New record inserted with ID = 1\n",
"In the 'finally:', cleaning up our mess.\n"
]
}
],
"source": [
"# imports\n",
"import sqlite3\n",
"\n",
"# Based on answer in: http://stackoverflow.com/questions/6242756/how-to-retrieve-inserted-id-after-inserting-row-in-sqlite-using-python\n",
"\n",
"# declare variables\n",
"my_connection = None\n",
"my_cursor = None\n",
"create_sql = \"\"\n",
"insert_sql = \"\"\n",
"new_record_id = None\n",
"\n",
"# use try-->except-->finally to make sure you always close your database\n",
"# connections.\n",
"try:\n",
" \n",
" # make a connection - for in-memory database, pass ':memory:'\n",
" my_connection = sqlite3.connect( ':memory:' )\n",
" \n",
" # set row_factory that returns values mapped to column names\n",
" # as well as in an ordered list\n",
" my_connection.row_factory = sqlite3.Row\n",
" \n",
" # then, make a cursor.\n",
" my_cursor = my_connection.cursor()\n",
" \n",
" # create table with integer auto-increment ID.\n",
" create_sql = '''\n",
" CREATE TABLE foo ( id integer primary key autoincrement,\n",
" username varchar( 50 ),\n",
" password varchar( 50 ) );\n",
" '''\n",
" my_cursor.execute( create_sql )\n",
" \n",
" # Make INSERT SQL string\n",
" # Place question marks (no quotes around them regardless of type of column)\n",
" # where the values you'll place in the record go, to guard against\n",
" # SQL injection - see next section for more details.\n",
" insert_sql = '''\n",
" INSERT INTO foo ( username, password ) VALUES ( ?, ? );\n",
" '''\n",
" \n",
" # pass execute() INSERT SQL, then a tuple list of values for each question mark\n",
" # in your INSERT (username and password, no ID - ID is auto-generated.).\n",
" my_cursor.execute( insert_sql, ( 'test_username', 'test_password' ) )\n",
" \n",
" # get ID of INSERTed record.\n",
" new_record_id = my_cursor.lastrowid\n",
" \n",
" # output result\n",
" print( \"New record inserted with ID = \" + str( new_record_id ) )\n",
"\n",
"except Exception as e:\n",
" \n",
" print( \"Exception making connection or creating cursor!\" )\n",
" print( e )\n",
" \n",
"finally:\n",
" \n",
" print( \"In the 'finally:', cleaning up our mess.\" )\n",
" \n",
" # close cursor\n",
" my_cursor.close()\n",
" \n",
" # close connection\n",
" my_connection.close()\n",
" \n",
"#-- END try-->except-->finally around database connection. --#"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Managing cursors\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Cursors keep track of your interactions with a database. If you do a select, when you loop over the things returned, they keep track of your position within the results. They also keep information about the queries you run, so you can get more details after the fact, in case of errors or wanting a precise status. They are designed to work with both networked and local databases, and so they also handle behind-the-scenes tasks like loading in new rows of a query result set as you request them, so you don't load the entire result set from the database into memory at once.\n",
"\n",
"Because of this work they are doing behind the scenes, cursors can sometimes be quirky about doing more than one thing at once, in particular in simple databases like SQLite. Be aware that you might need to create multiple cursors whenever you are doing multiple interactions with a database at the same time. For example:\n",
"\n",
"- if you do a SELECT inside a loop over the result set of another SELECT, you might need to do each SELECT from a separate cursor, so the progress through each is kept separate.\n",
"\n",
"- if you are INSERTing a record inside a loop over a SELECT statement's results, you might need to use a separate cursor for the INSERTs from the one you used to run the SELECT, since cursors sometimes have trouble when you read, then write at the same time.\n",
"\n",
"You shouldn't need to make separate connections. One should be enough. You might have to use that one connection to make and store multiple cursors in multiple variables, though. A brief example:\n",
"\n",
" # declare variables\n",
" connection = None\n",
" cursor1 = None\n",
" cursor2 = None\n",
" result1 = None\n",
" \n",
" try:\n",
" \n",
" # make connection\n",
" connection = sqlite3.connect( \"tweet_sample.sqlite\" )\n",
" cursor1 = conn.cursor()\n",
" cursor2 = connection.cursor()\n",
" \n",
" # do a select with the first cursor\n",
" result1 = cursor1.execute( \"SELECT * FROM tweet_sample_raw\" )\n",
" \n",
" # loop\n",
" for row in result1:\n",
" \n",
" # do an INSERT with cursor 2, not cursor 1\n",
" cursor2.execute( \"INSERT INTO example ( test1, test2 ) VALUES ( \"test1\", \"test2\" ) ) \n",
" \n",
" #-- END loop over results --#\n",
" \n",
" except:\n",
" \n",
" print( \"Exception!!!\" )\n",
" \n",
" finally:\n",
" \n",
" # close cursors\n",
" cursor1.close()\n",
" cursor2.close()\n",
" \n",
" # close connection\n",
" connection.close()\n",
" \n",
" #-- END try-->except-->finally --#\n",
"\n",
"\n",
"Some databases might not have problems with these kinds of scenarios, but the details of how a given database's implementation handles all this can be complicated and broken out very differently from database to database, so it is a good idea to not assume that any databases will be able to deal well with them. It is probably fine to try doing multiple things with a single cursor, but be on the lookout for problems if you don't make separate cursors when you have multiple database interactions going on at once. And, make sure that you close all the cursors you open in the \"finally\" after your database interactions are complete."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Odds and ends\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"### Preventing SQL Injection\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"From wikipedia: \"SQL injection is a code injection technique, used to attack data-driven applications, in which malicious SQL statements are inserted into an entry field for execution (e.g. to dump the database contents to the attacker).\"\n",
"\n",
"- [https://en.wikipedia.org/wiki/SQL_injection](https://en.wikipedia.org/wiki/SQL_injection)\n",
"\n",
"SQL injection attacks are usually run against web forms. Remember that a semi-colon signifies the end of a statement in SQL? In SQL Injection attacks, a malicious entity finds a way to pass a value that will be part of an SQL query in to your database with a semi-colon and then SQL statements they want to execute in your database. They could dump all your data to a location they control, or depending on the privileges of the database user, they could also delete or alter tables.\n",
"\n",
"Example:\n",
"- When I worked at the Detroit News, some of the application code we had re-written by a third party took values form a search form and chucked them directly into the database. It took 3 or so years for someone to figure out that this exploit was there, but once someone found it, the hackers went in and started creating admin users in our admins and creating and updating users in our forums through form submissions to unrelated parts of the site (the ones that were not secure), and it took a full week to really get everything straightened out.\n",
"\n",
"How to avoid it:\n",
"\n",
"- from [http://bobby-tables.com/python.html](http://bobby-tables.com/python.html):\n",
"- DON'T just throw string values directly into SQL.\n",
"- instead, build out your SQL with question marks ( \"?\" ) where you'll substitute in values.\n",
"\n",
" - No need for quotation marks around string value question marks. The substitution will take care of all of that for you.\n",
"\n",
"- then, when you call execute(), pass in the string that contains your SQL in the first argument, then the list of values to be substituted for each, in order, in the second parameter.\n",
"\n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"update_sql = \"UPDATE people SET name = ? WHERE id = ?;\"\n",
"my_cursor.execute( update_sql, ( \"Mighty Casey\", 2 ) )"
]
},
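{
"cell_type": "markdown",
"metadata": {},
"source": [
"A self-contained sketch of the same idea (the `people` table and the hostile value here are made up for illustration) - with \"?\" substitution, the malicious string is escaped and stored as plain data rather than executed:\n",
"\n",
"```python\n",
"import sqlite3\n",
"\n",
"my_connection = sqlite3.connect( ':memory:' )\n",
"my_cursor = my_connection.cursor()\n",
"my_cursor.execute( 'CREATE TABLE people ( id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT );' )\n",
"\n",
"# a hostile value - naive string concatenation would let it break out of the SQL.\n",
"untrusted_name = \"Robert'); DROP TABLE people; --\"\n",
"\n",
"# parameterized INSERT - the value is substituted safely.\n",
"my_cursor.execute( 'INSERT INTO people ( name ) VALUES ( ? );', ( untrusted_name, ) )\n",
"my_connection.commit()\n",
"\n",
"# the table survives, and the hostile string is stored verbatim as data.\n",
"my_cursor.execute( 'SELECT name FROM people WHERE id = ?;', ( 1, ) )\n",
"print( my_cursor.fetchone()[ 0 ] )\n",
"\n",
"my_cursor.close()\n",
"my_connection.close()\n",
"```"
]
},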
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Troubleshooting\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Some commone errors, and what they likely mean:\n",
"\n",
"- ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.\n",
"\n",
" If you see this error, you are taking UTF-8 encoded text, which could have EMOJI!!! and storing it in a byte string (8-bit/1-byte strings, in which emoji cause problems).\n",
" \n",
" When reading values out of a SELECT query result set, make sure for strings, that you know your encoding and convert them to unicode:\n",
" \n",
" # unicode( \"string_value\", \"<encoding>\" )\n",
" # for UTF-8: unicode( \"string_value\", \"UTF-8\" )\n",
" unicode_string = unicode( result_set[ \"string_column\" ], \"UTF-8\" )\n",
" \n",
"- Conversely, when preparing data for use in an INSERT or UPDATE statement, you need to make sure that all variable values are cast (converted) to appropriate types for the column type in the database (INTEGER = int(), VARCHAR or TEXT BLOB = unicode(), etc.). With VARCHAR or TEXT columns, the default encoding in SQLite3 is \"UTF-8\", so you'll want to either pass Unicode strings or UTF-8 encoded text.the conversion will look like this:\n",
"\n",
"- If you mess up the path you pass to the sqlite3.connect() call, be aware that for SQLite3, if you try to access a database that doesn't exist, if you have write access to the file system, Python will create a new database to match the path you entered. So, if you try to connect to the database and it fails, but then when you go to look at the directory, you'll see a database that is named the right name but that is empty. In this case, just whack that database file and put in the right path and filename next time, or move the database to the location where you thought it was."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Database design\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Considering the two types of databases we discussed earlier, relational and no-sql, each is flexible enough that you can choose to store data in them in many different ways, and you can even replicate the way each works in the other with a little help from the program you use to interact with a given data store.\n",
"\n",
"With big, complicated, inter-related data, however, if you want to manage the complexity and integrity of your data, you have to do more than just dump information into your database willy-nilly. You need to design the structure and relations within your database so that you can easily understand what is where, and can efficiently and reliably make use of your information. Thus, we aim to design our databases, and the most commonly used process for database design is normalization.\n",
"\n",
"Microsoft's explanation of normalization and why we do it is the most clear and concise I've found (from [http://support.microsoft.com/kb/283878](http://support.microsoft.com/kb/283878)):\n",
"\n",
">\"Normalization is the process of organizing data in a database. This includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency.\n",
">\n",
">Redundant data wastes disk space and creates maintenance problems. If data that exists in more than one place must be changed, the data must be changed in exactly the same way in all locations. A customer address change is much easier to implement if that data is stored only in the Customers table and nowhere else in the database.\n",
">\n",
">What is an \"inconsistent dependency\"? While it is intuitive for a user to look in the Customers table for the address of a particular customer, it may not make sense to look there for the salary of the employee who calls on that customer. The employee's salary is related to, or dependent on, the employee and thus should be moved to the Employees table. Inconsistent dependencies can make data difficult to access because the path to find the data may be missing or broken.\"\n",
"\n",
"A few key points:\n",
"\n",
"- **eliminating redundancy** - If a given data item could be applicable to many records in your data set, you should have one copy of it in a separate table, and link records to it using its unique key. This is in the same family as DRY (don't repeat yourself), making functions out of code you repeat (single-source programming), etc. Examples:\n",
"\n",
" - addresses - shared by all members of a family, and also potentially used for billing address, etc.\n",
" - hash tags - could be applied to many items in a given system (twitter), and could be used across systems as well (facebook, instagram, twitter, etc.).\n",
" - demographic data about a participant in a study where you've gathered data on that individual from multiple sources.\n",
" \n",
"- **minimizing \"inconsistent dependency\"** - For complex data, you should break information out by entity, and for each entity you should aim to 1) only include information in an entity's table that is intimately or closely related to it AND 2) specifically remove information from the table if it ISN'T intimately or closely related. For information that is loosely related, place it in its own table, and make links between the two. Example:\n",
"\n",
" - entity - person:\n",
" \n",
" - height?\n",
" - shoe size?\n",
" - mother's first name?\n",
" - name of childhood best friend's current dog?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Normalizing data\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"There are different levels of normalization of data: 1st-normal, 2nd-normal, 3rd-normal, and on. Each subsquent form requires the data to be in the previous form(s). 3rd normal form is considered the \"good enough\" level of normalization, so we'll just cover to there, and if you take to this, you can look into it more at your leisure.\n",
"\n",
"### 1st Normal Form\n",
"\n",
"- **_Each cell in a given row should only contain a single value._**\n",
"\n",
" - No lists of values in a single column. Movie rental - don't have a column named \"movies\" that has a list of movies rented.\n",
"\n",
"- **_There should be only one column in a given table intended to hold a certain type of data._**\n",
"\n",
" - so, if a person rented two movies, they should have two rental records, one per movie, rather than having movie_1 and movie_2 (what if you rent 3 or 4 movies? 3000?).\n",
"\n",
"- **_Each row in a given database table should be unique._**\n",
"\n",
" - no two rows should have exactly the same values in all of their columns.\n",
"\n",
"### 2nd Normal Form\n",
"\n",
"- **_Data must be in 1st Normal Form._**\n",
"- **_Each row must have a single-column primary key (a unique identifier of the row)._**\n",
"\n",
" - primary keys can be any format, but I recommend they be:\n",
"\n",
" - integers\n",
" - auto-incremented (so leave out of INSERT, it just gets automatically set, database handle making sure it is unique)\n",
"\n",
"- **_create separate tables for sets of values that apply to multiple records_**\n",
"\n",
" - this includes things like categories - marital status, for example, or ethnicity - where you have a limited set of values that will be applied to many different people.\n",
"\n",
"- **_in the multiple records, then, swap the values out for the key of the related record in the new separate table_**\n",
"\n",
" - for tables that contain values broken out into a separate table, then, remove the actual value and replace it with the ID of the row in the separate table that contains that value - so, the ID of your ethnicity, or the ID of your marital status. This is referred to as a \"Foreign Key\".\n",
" - for different types of relations, this is done differently:\n",
" \n",
" - one-to-one - one refers to the other, one must have foreign key, can have but don't necessarily need each to refer to the other (current spouses assuming monogamy).\n",
" - one-to-many and many-to-one - foreign key stored in each of the many that refers to the one (child, mother).\n",
" - many-to-many - join table - 3rd separate table with two IDs, two foreign keys, one to each side of the relation (aunts to nieces and nephews).\n",
"\n",
"### 3rd Normal Form\n",
"\n",
"- **_Data must be in 2nd Normal Form_**\n",
"\n",
"- **_Remove unrelated data from tables_**\n",
"\n",
"### In addition\n",
"\n",
"- columns should be the appropriate type for data (but sometimes OK to just chuck things into strings if it is an interim step).\n",
"\n",
"More information:\n",
"\n",
"- [http://support.microsoft.com/kb/283878](http://support.microsoft.com/kb/283878)\n",
"- [http://www.studytonight.com/dbms/database-normalization.php](http://www.studytonight.com/dbms/database-normalization.php)\n",
"- [http://databases.about.com/od/specificproducts/a/normalization.htm](http://databases.about.com/od/specificproducts/a/normalization.htm)\n",
"- [http://www.guru99.com/database-normalization.html](http://www.guru99.com/database-normalization.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise - Twitter data\n",
"\n",
"## A little more setup\n",
"\n",
"Make sure you grabbed the smaller twitter_sample_sample.txt file from the exercise 6 assignment in ctools and placed it in your work directory."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Design your database\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Given the following columns, split up into 2 or 3 person groups and work together to make a 3rd normal form database design for the information in these columns. Your design should include a list of tables, and the the column that should be included in each table.\n",
"\n",
"- **_`tweet_timestamp`_** - time stamp of date and time when tweet was sent, in strptime format %a %b %d %H:%M:%S +0000 %Y (example: Sun Feb 08 20:15:19 +0000 2015). For more information on format, see [https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior](https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior)\n",
"- **_`twitter_tweet_id`_** - Twitter's ID for this tweet.\n",
"- **_`tweet_text`_** - actual text of tweet. Might contain \"\\n\" - newlines.\n",
"- **_`tweet_language`_** - abbreviation of language of tweet, if known.\n",
"- **_`twitter_user_twitter_id`_** - sender - the Twitter ID of the user who sent the tweet.\n",
"- **_`twitter_user_screenname`_** - sender - the screen name of the Twitter user who sent the tweet.\n",
"- **_`user_followers_count`_** - sender - number of followers the sender has.\n",
"- **_`user_favorites_count`_** - sender - number of favorites the sender has.\n",
"- **_`user_created - sender`_** - time stamp of date and time when sender's user was created, in strptime format %a %b %d %H:%M:%S +0000 %Y (example: Sun Feb 08 20:15:19 +0000 2015). For more information on format, see [https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior](https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior)\n",
"- **_`user_location`_** - sender - the location of the user, if set.\n",
"- **_`tweet_retweet_count`_** - number of retweets the tweet has\n",
"- **_`tweet_place`_** - location the tweet was sent, if set.\n",
"- **_`tweet_user_mention_count`_** - number of users mentioned in the tweet.\n",
"- **_`tweet_users_mentioned_screennames`_** - screen names of the users mentioned in tweet, separated by commas. This list is associated by position with the list Twitter IDs in tweet_users_mentioned_ids.\n",
"- **_`tweet_users_mentioned_ids`_** - Twitter IDs of the users mentione in tweet, separated by commas. This list is associated by position with the list of screen names in tweet_users_mentioned_screennames.\n",
"- **_`tweet_hashtag_mention_count`_** - count of hashtags mentioned in the tweet.\n",
"- **_`tweet_hashtags_mentioned`_** - list of hashtags mentioned in the tweet, separated by commas.\n",
"- **_`tweet_url_count`_** - count of URLs included in the tweet.\n",
"- **_`tweet_shortened_urls_mentioned`_** - list of the shortened versions of the URLs in the tweet. Associated by position in the list with the full URLs in tweet_full_urls_mentioned and the display URLs in tweet_display_urls_mentioned.\n",
"- **_`tweet_full_urls_mentioned`_** - list of the full versions of the URLs in the tweet. Associated by position in the list with the shortened URLs in tweet_shortened_urls_mentioned and the display URLs in tweet_display_urls_mentioned.\n",
"- **_`user_description`_** - sender - description from sending user's account.\n",
"- **_`user_friends_count`_** - sender - count of friends sender has.\n",
"- **_`user_statuses_count`_** - sender - count of statuses (tweets) sending user has sent.\n",
"- **_`tweet_display_urls_mentioned`_** - list of the displayed versions of the URLs in the tweet. Associated by position in the list with the full URLs in tweet_full_urls_mentioned and the shortened URLs in tweet_shortened_urls_mentioned."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Strategies for processing and normalizing flat data files\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"For relatively simple flat data files, there are at least a couple of options for reading in the file and converting it to a more normalized structure:\n",
"\n",
"- You can read it in line by line and do all the manipulations you need to in memory as you go, doing all the translating and converting needed for each row of flat data to migrate it into your normalized relational data structure, then storing it once you are done.\n",
"- As the complexity of your data increases, and as you start to want to derive more inforomation from your data in addition to simply storing it in a clean and accurate normalized store, you'll likely want to break the process of importing and cleaning your data and then deriving data from it up into smaller incremental steps, such that you can run and verify each step before moving on.\n",
"\n",
"For this exercise in class, we are going to work through the second approach - first get your data into a flat file in the database, then move it into better data structures from there using SQL and Python. Below are two methods for processing data in a CSV and dumping it into a table. The first uses the \"csv\" package to read in the twitter sample data file line by line and INSERT each line's contents as a row in the database. The second uses the \"pandas\" package to do the same thing."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using `csv` package to read and process CSV files\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Before running this program, you'll need to:\n",
"\n",
"- download the file \"create_table-tweet_sample_raw.sql\" from exercise 6 in ctools.\n",
"- open it up in an editor and copy the contents (a CREATE TABLE SQL statement) to your clipboard.\n",
"- open SQLite Manager and create a new database.\n",
"- click on the \"Execute SQL\" tab.\n",
"- paste the CREATE TABLE SQL you copied out of the file into the window just below the words \"Enter SQL\" and just above the \"Run SQL\" button.\n",
"- click the \"Run SQL\" button to create the table."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from __future__ import unicode_literals\n",
"\n",
"# imports\n",
"import sqlite3\n",
"import csv\n",
"import sys\n",
"import traceback\n",
"\n",
"\n",
"# declare variables\n",
"conn = None\n",
"cursor = None\n",
"tab_delimited_file = None\n",
"tweet_reader = None\n",
"tweet_counter = -1\n",
"field_count = -1\n",
"current_tweet_row = None\n",
"\n",
"# variables to hold fields\n",
"tweet_timestamp = \"\"\n",
"twitter_tweet_id = \"\"\n",
"tweet_text = \"\"\n",
"tweet_language = \"\"\n",
"twitter_user_twitter_id = \"\"\n",
"twitter_user_screenname = \"\"\n",
"user_followers_count = \"\"\n",
"user_favorites_count = \"\"\n",
"user_created = \"\"\n",
"user_location = \"\"\n",
"tweet_retweet_count = \"\"\n",
"tweet_place = \"\"\n",
"tweet_user_mention_count = \"\"\n",
"tweet_users_mentioned_screennames = \"\"\n",
"tweet_users_mentioned_ids = \"\"\n",
"tweet_hashtag_mention_count = \"\"\n",
"tweet_hashtags_mentioned = \"\"\n",
"tweet_url_count = \"\"\n",
"tweet_shortened_urls_mentioned = \"\"\n",
"tweet_full_urls_mentioned = \"\"\n",
"user_description = \"\"\n",
"user_friends_count = \"\"\n",
"user_statuses_count = \"\"\n",
"tweet_display_urls_mentioned = \"\"\n",
"\n",
"# for SQL INSERT string\n",
"sql_insert_string = \"\"\n",
"\n",
"# variables for exception handling\n",
"exception_type = \"\"\n",
"exception_value = \"\"\n",
"exception_traceback = None\n",
"\n",
"# build INSERT statement here, once rather than each time through the loop.\n",
"sql_insert_string = '''\n",
" INSERT INTO tweet_sample_raw\n",
" (\n",
" tweet_timestamp,\n",
" twitter_tweet_id,\n",
" tweet_text,\n",
" tweet_language,\n",
" twitter_user_twitter_id,\n",
" twitter_user_screenname,\n",
" user_followers_count,\n",
" user_favorites_count,\n",
" user_created,\n",
" user_location,\n",
" tweet_retweet_count,\n",
" tweet_place,\n",
" tweet_user_mention_count,\n",
" tweet_users_mentioned_screennames,\n",
" tweet_users_mentioned_ids,\n",
" tweet_hashtag_mention_count,\n",
" tweet_hashtags_mentioned,\n",
" tweet_url_count,\n",
" tweet_shortened_urls_mentioned,\n",
" tweet_full_urls_mentioned,\n",
" user_description,\n",
" user_friends_count,\n",
" user_statuses_count,\n",
" tweet_display_urls_mentioned\n",
" )\n",
" VALUES\n",
" (\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?,\n",
" ?\n",
" )\n",
"'''\n",
"\n",
"# wrap database access in try-->except-->finally, so we make sure to close the connection and cursor.\n",
"try:\n",
"\n",
" # connect to database and make a cursor\n",
" # assume that we've already created a database and a table to match the\n",
" # structure of our CSV file.\n",
" conn = sqlite3.connect( \"tweet_sample.sqlite\" )\n",
" cursor = conn.cursor()\n",
"\n",
" # open the data file for reading.\n",
" with open( 'tweet_sample_sample.txt', 'rb' ) as tab_delimited_file:\n",
"\n",
" # feed the file to csv.reader to parse.\n",
" \n",
" # standardized tab-delimited based on how Excel reads/writes tab-delimited files:\n",
" tweet_reader = csv.reader( tab_delimited_file, dialect = \"excel-tab\" )\n",
"\n",
" # standardized CSV based on how Excel reads/writes CSV files:\n",
" #tweet_reader = csv.reader( tab_delimited_file, dialect = \"excel\" )\n",
"\n",
" # example of when you use manual options:\n",
" #tweet_reader = csv.reader( tab_delimited_file, delimiter = '\\t', quotechar = '\\\"', strict = True, lineterminator = \"\\n\" )\n",
"\n",
" # loop over logical rows in the file.\n",
" tweet_counter = 0\n",
" for current_tweet_row in tweet_reader:\n",
"\n",
" tweet_counter = tweet_counter + 1\n",
" field_count = len( current_tweet_row )\n",
"\n",
" # print some info\n",
" # print( \"====> line \" + str( tweet_counter ) + \" - \" + str( field_count ) + \" fields - text: \" + '|||'.join( current_tweet_row ) )\n",
"\n",
" # only do stuff after first row\n",
" if ( tweet_counter > 1 ):\n",
"\n",
" # get fields\n",
" tweet_timestamp = unicode( current_tweet_row[ 0 ], \"UTF-8\" )\n",
" twitter_tweet_id = unicode( current_tweet_row[ 1 ], \"UTF-8\" )\n",
" tweet_text = unicode( current_tweet_row[ 2 ], \"UTF-8\" )\n",
" tweet_language = unicode( current_tweet_row[ 3 ], \"UTF-8\" )\n",
" twitter_user_twitter_id = int( current_tweet_row[ 4 ] )\n",
" twitter_user_screenname = unicode( current_tweet_row[ 5 ], \"UTF-8\" )\n",
" user_followers_count = int( current_tweet_row[ 6 ] )\n",
" user_favorites_count = int ( current_tweet_row[ 7 ] )\n",
" user_created = unicode( current_tweet_row[ 8 ], \"UTF-8\" )\n",
" user_location = unicode( current_tweet_row[ 9 ], \"UTF-8\" )\n",
" tweet_retweet_count = int( current_tweet_row[ 10 ] )\n",
" tweet_place = unicode( current_tweet_row[ 11 ], \"UTF-8\" )\n",
" tweet_user_mention_count = unicode( current_tweet_row[ 12 ], \"UTF-8\" )\n",
" tweet_users_mentioned_screennames = unicode( current_tweet_row[ 13 ], \"UTF-8\" )\n",
" tweet_users_mentioned_ids = unicode( current_tweet_row[ 14 ], \"UTF-8\" )\n",
" tweet_hashtag_mention_count = unicode( current_tweet_row[ 15 ], \"UTF-8\" )\n",
" tweet_hashtags_mentioned = unicode( current_tweet_row[ 16 ], \"UTF-8\" )\n",
" tweet_url_count = unicode( current_tweet_row[ 17 ], \"UTF-8\" )\n",
" tweet_shortened_urls_mentioned = unicode( current_tweet_row[ 18 ], \"UTF-8\" )\n",
" tweet_full_urls_mentioned = unicode( current_tweet_row[ 19 ], \"UTF-8\" )\n",
" user_description = unicode( current_tweet_row[ 20 ], \"UTF-8\" )\n",
" user_friends_count = int( current_tweet_row[ 21 ] )\n",
" user_statuses_count = int( current_tweet_row[ 22 ] )\n",
" tweet_display_urls_mentioned = unicode( current_tweet_row[ 23 ], \"UTF-8\" )\n",
"\n",
" # print tweet ID\n",
" print ( \"====> line \" + str( tweet_counter ) + \" - \" + str( field_count ) + \" Twitter Tweet ID = \" + twitter_tweet_id )\n",
"\n",
" # execute the INSERT SQL command.\n",
" cursor.execute( sql_insert_string, ( tweet_timestamp, twitter_tweet_id, tweet_text, tweet_language, twitter_user_twitter_id, twitter_user_screenname, user_followers_count, user_favorites_count, user_created, user_location, tweet_retweet_count, tweet_place, tweet_user_mention_count, tweet_users_mentioned_screennames, tweet_users_mentioned_ids, tweet_hashtag_mention_count, tweet_hashtags_mentioned, tweet_url_count, tweet_shortened_urls_mentioned, tweet_full_urls_mentioned, user_description, user_friends_count, user_statuses_count, tweet_display_urls_mentioned ) )\n",
"\n",
" # commit.\n",
" conn.commit()\n",
"\n",
" #-- END check to make sure we aren't the first row. --#\n",
"\n",
" #-- END loop over rows in file --#\n",
"\n",
" #-- END use of tab_delimited_file --#\n",
" \n",
"except Exception as e:\n",
" \n",
" # get exception details\n",
" exception_type, exception_value, exception_traceback = sys.exc_info()\n",
" \n",
" # Output exception messages.\n",
" print( \"Exception caught: \" + str( e ) )\n",
" print( \"- args = \" + str( e.args ) )\n",
" print( \"- type = \" + str( exception_type ) )\n",
" print( \"- value = \" + str( exception_value ) )\n",
" print( \"- traceback = \" + str( traceback.format_exc() ) )\n",
"\n",
"finally:\n",
"\n",
" # close cursor\n",
" cursor.close()\n",
"\n",
" # close connection\n",
" conn.close()\n",
"\n",
"#-- END try-->except-->finally around database access. --#"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reading CSV files with `pandas`\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CSV file \"tweet_sample_sample.txt\" opened.\n",
"Data output to database \"sqlite:///tweet_sample.sqlite\".\n"
]
}
],
"source": [
"# imports\n",
"import pandas\n",
"import sqlalchemy\n",
"\n",
"# declare variables\n",
"csv_file_path = \"tweet_sample_sample.txt\"\n",
"csv_tweets_data_frame = None\n",
"sqlite_database_URL = 'sqlite:///tweet_sample.sqlite'\n",
"sqlalchemy_sqlite3_engine = None\n",
"\n",
"# read the CSV file into a pandas data frame.\n",
"# - more details: http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table\n",
"csv_tweets_data_frame = pandas.read_csv( csv_file_path, dialect = \"excel-tab\", encoding = \"utf-8\" )\n",
"\n",
"print( \"CSV file \\\"\" + csv_file_path + \"\\\" opened.\" )\n",
"\n",
"# write it to an sqlite database.\n",
"\n",
"# first, initialize an sqlalchemy sqlite3 engine, referencing a new \n",
"# sqlite3 database tweet_sample.sqlite in the current directory.\n",
"# - more details: http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html#sqlite\n",
"\n",
"# sqlite://<nohostname>/<path>\n",
"# where <path> is relative:\n",
"sqlalchemy_sqlite3_engine = sqlalchemy.create_engine( sqlite_database_URL )\n",
"\n",
"# write the data to table \"tweet_sample_raw\" in the database.\n",
"# - more details: http://pandas.pydata.org/pandas-docs/stable/io.html#sql-queries\n",
"csv_tweets_data_frame.to_sql( \"tweet_sample_raw_pandas\", sqlalchemy_sqlite3_engine )\n",
"\n",
"print( \"Data output to database \\\"\" + sqlite_database_URL + \"\\\".\" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### And the winner is...\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"So, bearing in mind that the little\n",
"\n",
" dialect = \"excel-tab\"\n",
" \n",
"there in `pandas.read_csv()` took a little while to figure out (you can see the csv configuration options I was trying in the first example before I found that dialect on stackoverflow that actually dealt correctly with the tweets being able to have newlines inside them), the pandas code is a lot simpler and more straightforward than that other code.\n",
"\n",
"I'd use pandas if I were you.\n",
"\n",
"But it is good to know that you have other options. pandas is pretty great. It looks like it will save you a lot of work. It looks like it could save me a lot of work. It also loads the entire data file into memory when it reads the CSV file, and some of its choices for column types in sqlite are a little weird - there were no decimal places in any of the numbers, but some of them are type float in the database it generated for me, for example.\n",
"\n",
"If you have a REALLY big data file, so big that it wouldn't all fit into memory, you might need to do something more like the first example, even if you were just moving it all to a database with no other transformations, so you could load it in piece by piece. The first set of code isn't that hard or complicated, but it also isn't as easy as the pandas code.\n",
"\n",
"The first example is also a reasonable example of casting values into their proper types when doing an insert, and of structuring an INSERT statement so that it escapes values to prevent an SQL Injection attack."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Final notes on complexity and cost-benefit of normalization\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Growth in complexity as it relates to adding code/data/transformations/analysis steps:\n",
"\n",
"- Linear - the dream, complexity never grows in a linear relationship with amount of code with a slope near 1.\n",
"- Multiplicative - linear with slope of 2, 3, 4, etc. this is what adding code looks like at first\n",
"- Exponential - as a program gets bigger, this is how complexity relates if you don't design, plan, etc. it starts out being simpler to just fold new things in, but after a while, very hard to manage.\n",
"- Logarithmic - this is what real programming looks like - if you abstract as you encounter things that should be abstracted, there is some additional complexity early, but over time, complexity is more manageable.\n",
"\n",
"Some of the normalization might seem like a little much if you have relatively simple data from one source. As soon as you start deriving time series data, or deriving social network data, then start to tie data sets together to people or places, your data starts to get complicated. Carefully designing and normalizing your data is one way to manage this complexity.\n",
"\n",
"Also, don't feel you need to become an expert on every library and every analysis software package and every computer language ever. Pick a few tools that you enjoy using and that make sense to you, learn those well, and use your core tools whenever you can. For example, I am proficient at SQL, but I don't enjoy how it works. I tend to use SQL to move data around, then use Python or another language to actually work with the data. pandas is a great tool for this way of thinking - it helps to get data from one format to another, so you can use it in a system you are comfortable with."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment