Skip to content

Instantly share code, notes, and snippets.

@anshulkgupta93
Last active January 7, 2016 18:22
Show Gist options
  • Save anshulkgupta93/4190f510911678c2025e to your computer and use it in GitHub Desktop.
Save anshulkgupta93/4190f510911678c2025e to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Performing data analysis on Diamonds dataset with Sqlalchemy. \n",
"## Load the BigDiamonds.csv dataset in mysql-server. \n",
"\n",
"\n",
"Following instructions are need to be execture in my-sql server first-\n",
"\n",
">> CREATE DATABASE diamonds;\n",
"\n",
">> USE diamonds;\n",
"\n",
">> CREATE TABLE BigDiamonds;\n",
"\n",
">> create table BigDiamonds(\n",
"Unnamed int,\n",
"carat float4, \n",
"cut varchar(60),\n",
"color character,\n",
"clarity varchar(60),\n",
"tabl float4,\n",
"depth float4,\n",
"cert varchar(60),\n",
"measurements varchar(60),\n",
"price float4,\n",
"x float4,\n",
"y float4,\n",
"z float4,\n",
"PRIMARY KEY (Unnamed)\n",
");\n",
"\n",
"We have now defined the schema for the table. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_After this quit mysql-server and execute following instruction on terminal to read large BigDiamonds csv file in mysql:-_\n",
"\n",
">> mysqlimport --fields-optionally-enclosed-by='\"' --fields-terminated-by=, --lines-terminated-by=\"\\n\" -u root -p diamonds --local --ignore-lines=1 /home/anshul/data-sets/BigDiamonds.csv\n",
"\n",
"\n",
"_Ignore-lines_ skips the first line “Some Large Database.” The _–fields-terminated by_ tell the utility that the commas separate the columns. The _Database_ is the name of the database in which your table is stored. You must put the absolute path of the csv file for it to register with the utility. The “BigDiamonds.csv” has to match the name of the table in your mysql database._\n",
"\n",
"Now we'll do analysis on this table through python using sqlalchemy."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import sqlalchemy\n",
"engine = sqlalchemy.create_engine('mysql://root:Jarvis@127.0.0.1:3306/diamonds')\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[u'Unnamed', u'carat', u'cut', u'color', u'clarity', u'tabl', u'depth', u'cert', u'measurements', u'price', u'x', u'y', u'z']\n",
"[(1L, 0.25, 'V.Good', 'K', 'I1', 59.0, 63.7, 'GIA', '3.96 x 3.95 x 2.52', 0.0, 3.96, 3.95, 2.52), (2L, 0.23, 'Good', 'G', 'I1', 61.0, 58.1, 'GIA', '4.00 x 4.05 x 2.30', 0.0, 4.0, 4.05, 2.3), (3L, 0.34, 'Good', 'J', 'I2', 58.0, 58.7, 'GIA', '4.56 x 4.53 x 2.67', 0.0, 4.56, 4.53, 2.67), (4L, 0.21, 'V.Good', 'D', 'I1', 60.0, 60.6, 'GIA', '3.80 x 3.82 x 2.31', 0.0, 3.8, 3.82, 2.31), (5L, 0.31, 'V.Good', 'K', 'I1', 59.0, 62.2, 'EGL', '4.35 x 4.26 x 2.68', 0.0, 4.35, 4.26, 2.68), (6L, 0.2, 'Good', 'G', 'SI2', 60.0, 64.4, 'GIA', '3.74 x 3.67 x 2.38', 0.0, 3.74, 3.67, 2.38), (7L, 0.2, 'Good', 'G', 'SI2', 63.0, 62.6, 'GIA', '3.72 x 3.65 x 2.31', 0.0, 3.72, 3.65, 2.31), (8L, 0.22, 'V.Good', 'D', 'I1', 61.0, 59.2, 'GIA', '3.95 x 3.97 x 2.34', 0.0, 3.95, 3.97, 2.34), (9L, 0.23, 'V.Good', 'K', 'SI2', 57.5, 63.6, 'IGI', '3.87 x 3.90 x 2.47', 0.0, 3.87, 3.9, 2.47), (10L, 0.2, 'Good', 'F', 'SI1', 65.0, 54.9, 'GIA', '3.83 x 4.00 x 2.14', 0.0, 3.83, 4.0, 2.14)]\n"
]
}
],
"source": [
"from sqlalchemy import text\n",
"with engine.connect() as con:\n",
"\n",
"\n",
" rs = con.execute(text('SELECT * FROM BigDiamonds limit 10'))\n",
"\n",
" print rs.keys()\n",
" print rs.fetchall()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Unnamed INTEGER(11)\n",
"carat FLOAT\n",
"cut VARCHAR(60)\n",
"color CHAR(1)\n",
"clarity VARCHAR(60)\n",
"tabl FLOAT\n",
"depth FLOAT\n",
"cert VARCHAR(60)\n",
"measurements VARCHAR(60)\n",
"price FLOAT\n",
"x FLOAT\n",
"y FLOAT\n",
"z FLOAT\n"
]
}
],
"source": [
"from sqlalchemy import inspect\n",
"insp=inspect(engine)\n",
"for row in insp.get_columns(\"BigDiamonds\"):\n",
" print row[\"name\"], row[\"type\"]\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(1L, 0.25, 'V.Good', 'K', 'I1', 59.0, 63.7, 'GIA', '3.96 x 3.95 x 2.52', 0.0, 3.96, 3.95, 2.52), (2L, 0.23, 'Good', 'G', 'I1', 61.0, 58.1, 'GIA', '4.00 x 4.05 x 2.30', 0.0, 4.0, 4.05, 2.3), (3L, 0.34, 'Good', 'J', 'I2', 58.0, 58.7, 'GIA', '4.56 x 4.53 x 2.67', 0.0, 4.56, 4.53, 2.67), (4L, 0.21, 'V.Good', 'D', 'I1', 60.0, 60.6, 'GIA', '3.80 x 3.82 x 2.31', 0.0, 3.8, 3.82, 2.31), (5L, 0.31, 'V.Good', 'K', 'I1', 59.0, 62.2, 'EGL', '4.35 x 4.26 x 2.68', 0.0, 4.35, 4.26, 2.68), (6L, 0.2, 'Good', 'G', 'SI2', 60.0, 64.4, 'GIA', '3.74 x 3.67 x 2.38', 0.0, 3.74, 3.67, 2.38), (7L, 0.2, 'Good', 'G', 'SI2', 63.0, 62.6, 'GIA', '3.72 x 3.65 x 2.31', 0.0, 3.72, 3.65, 2.31), (8L, 0.22, 'V.Good', 'D', 'I1', 61.0, 59.2, 'GIA', '3.95 x 3.97 x 2.34', 0.0, 3.95, 3.97, 2.34), (9L, 0.23, 'V.Good', 'K', 'SI2', 57.5, 63.6, 'IGI', '3.87 x 3.90 x 2.47', 0.0, 3.87, 3.9, 2.47), (10L, 0.2, 'Good', 'F', 'SI1', 65.0, 54.9, 'GIA', '3.83 x 4.00 x 2.14', 0.0, 3.83, 4.0, 2.14)]\n"
]
}
],
"source": [
"from sqlalchemy import (MetaData, Table,select)\n",
"with engine.connect() as con:\n",
"\n",
" meta = MetaData(engine)\n",
" diamonds = Table('BigDiamonds', meta, autoload=True) \n",
"\n",
" stm = select([diamonds]).limit(10)\n",
" rs = con.execute(stm) \n",
"\n",
" print rs.fetchall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Executing Raw SQL queries with sqlalchemy"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(8742.582102056105,)\n"
]
}
],
"source": [
"rs = engine.execute(text('SELECT avg(price) FROM BigDiamonds'))\n",
"print rs.first()\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1.0712967522829258,)\n"
]
}
],
"source": [
"rs = engine.execute(text('select avg(carat) from BigDiamonds'))\n",
"print rs.fetchone()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using SQLAlchemy Expression Language to execute queries\n",
"\n",
"_The SQLAlchemy Expression Language represents relational database structures and expressions using Python constructs. The expression language improves the maintainability of the code by hiding the SQL language and thus allowing not to mix Python code and SQL code._"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" avg_1 color\n",
"0 0.826618225403 D\n",
"1 0.831882373286 E\n",
"2 0.941053186937 F\n",
"3 1.06384079506 G\n",
"4 1.20994065749 H\n",
"5 1.27128226011 I\n",
"6 1.34753987722 J\n",
"7 1.49506455673 K\n",
"8 1.3632705044 L\n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sqlalchemy import func\n",
"with engine.connect() as con:\n",
"\n",
" meta = MetaData(engine)\n",
" diamonds = Table('BigDiamonds', meta, autoload=True) \n",
"\n",
" stm = select([func.avg(diamonds.c.carat),diamonds.c.color]).group_by(diamonds.c.color)\n",
" rs = con.execute(stm) \n",
" \n",
" \n",
" print pd.DataFrame(np.array(rs.fetchall()),columns=rs.keys())"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" avg(carat) cut\n",
"0 0.900303114521 Good\n",
"1 1.12019355723 Ideal\n",
"2 1.0247596733 V.Good\n"
]
}
],
"source": [
"rs=engine.execute(text('select avg(carat),cut from BigDiamonds group by cut'))\n",
"print pd.DataFrame(np.array(rs.fetchall()),columns=rs.keys())"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"rs = engine.execute(text('SELECT * FROM BigDiamonds'))\n",
"diamonds=pd.DataFrame(np.array(rs.fetchall()),columns=rs.keys())"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed</th>\n",
" <th>carat</th>\n",
" <th>cut</th>\n",
" <th>color</th>\n",
" <th>clarity</th>\n",
" <th>tabl</th>\n",
" <th>depth</th>\n",
" <th>cert</th>\n",
" <th>measurements</th>\n",
" <th>price</th>\n",
" <th>x</th>\n",
" <th>y</th>\n",
" <th>z</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>598024</td>\n",
" <td>598024</td>\n",
" <td>598024</td>\n",
" <td>598024</td>\n",
" <td>598024</td>\n",
" <td>598024</td>\n",
" <td>598024</td>\n",
" <td>598024</td>\n",
" <td>598024</td>\n",
" <td>598024</td>\n",
" <td>598024</td>\n",
" <td>598024</td>\n",
" <td>598024</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>598024</td>\n",
" <td>628</td>\n",
" <td>3</td>\n",
" <td>9</td>\n",
" <td>9</td>\n",
" <td>187</td>\n",
" <td>285</td>\n",
" <td>9</td>\n",
" <td>241453</td>\n",
" <td>40313</td>\n",
" <td>1031</td>\n",
" <td>912</td>\n",
" <td>962</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>527446</td>\n",
" <td>0.3</td>\n",
" <td>Ideal</td>\n",
" <td>G</td>\n",
" <td>SI1</td>\n",
" <td>57.0</td>\n",
" <td>62.4</td>\n",
" <td>GIA</td>\n",
" <td>0.00 x 0.00 x 0.00</td>\n",
" <td>740.0</td>\n",
" <td>4.3</td>\n",
" <td>4.32</td>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>1</td>\n",
" <td>32388</td>\n",
" <td>369448</td>\n",
" <td>96204</td>\n",
" <td>116631</td>\n",
" <td>122574</td>\n",
" <td>24303</td>\n",
" <td>463555</td>\n",
" <td>425</td>\n",
" <td>1702</td>\n",
" <td>3415</td>\n",
" <td>3218</td>\n",
" <td>5415</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Unnamed carat cut color clarity tabl depth cert \\\n",
"count 598024 598024 598024 598024 598024 598024 598024 598024 \n",
"unique 598024 628 3 9 9 187 285 9 \n",
"top 527446 0.3 Ideal G SI1 57.0 62.4 GIA \n",
"freq 1 32388 369448 96204 116631 122574 24303 463555 \n",
"\n",
" measurements price x y z \n",
"count 598024 598024 598024 598024 598024 \n",
"unique 241453 40313 1031 912 962 \n",
"top 0.00 x 0.00 x 0.00 740.0 4.3 4.32 4.0 \n",
"freq 425 1702 3415 3218 5415 "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"diamonds.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment