Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save gdsaxton/b5fea065e6dd746050e7 to your computer and use it in GitHub Desktop.
Save gdsaxton/b5fea065e6dd746050e7 to your computer and use it in GitHub Desktop.
iPython notebook showing how to run and output a descriptive statistics (summary statistics) table for social scientific publication
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"\n",
"In this notebook I will show you how to run descriptive statistics for your dataset and save the output. The desired end product is a CSV table of key summary statistics -- count, mean, std. dev., min. and max -- for the variables in your dataset."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Import packages"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"import pandas as pd\n",
"from pandas import DataFrame\n",
"from pandas import Series\n",
"import statsmodels #FOR NEXT STEP -- RUNNING REGRESSIONS \n",
"import statsmodels.api as sm\n",
"import statsmodels.formula.api as smf #FOR USING 'R'-STYLE FORMULAS FOR REGRESSIONS"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"\n",
"PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Set PANDAS to show all columns in DataFrame\n",
"pd.set_option('display.max_columns', None)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>I like suppressing scientific notation in my numbers. So, if you'd rather see \"0.48\" than \"4.800000e-01\", then run the following line. Note that this does not change the actual values. For outputting to CSV we'll have to run some additional code later on."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pd.set_option('display.float_format', lambda x: '%.2f' % x)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"\n",
"I'm running PANDAS 0.13 here."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print pd.__version__\n",
"print statsmodels.__version__"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0.13.1\n",
"0.6.1\n"
]
}
],
"prompt_number": 4
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Read in dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I'm using data on a sample of 1,500 Facebook statuses of a sample of US-based health organizations"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df = pd.read_excel('fb.xls', 'Sheet1', header=0)\n",
"print len(df)\n",
"df.head(2)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1500\n"
]
},
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>No.</th>\n",
" <th>id</th>\n",
" <th>feed_id</th>\n",
" <th>org_name</th>\n",
" <th>FB_org_id</th>\n",
" <th>location</th>\n",
" <th>link</th>\n",
" <th>message_id</th>\n",
" <th>org_id</th>\n",
" <th>status_id</th>\n",
" <th>status_link</th>\n",
" <th>content</th>\n",
" <th>published_date</th>\n",
" <th>date_inserted</th>\n",
" <th>last_comment</th>\n",
" <th>type</th>\n",
" <th>status_type</th>\n",
" <th>video_source</th>\n",
" <th>picture_link</th>\n",
" <th>link_name</th>\n",
" <th>link_caption</th>\n",
" <th>link_description</th>\n",
" <th>num_mentions</th>\n",
" <th>mentions</th>\n",
" <th>num_likes</th>\n",
" <th>like_count</th>\n",
" <th>comment_count</th>\n",
" <th>share_count</th>\n",
" <th>hashtag_count</th>\n",
" <th>hashtags</th>\n",
" <th>mentions_count</th>\n",
" <th>actions</th>\n",
" <th>application</th>\n",
" <th>properties</th>\n",
" <th>message_output</th>\n",
" <th>time_since_post_days</th>\n",
" <th>like_count_7days</th>\n",
" <th>comment_count_7days</th>\n",
" <th>share_count_7days</th>\n",
" <th>time_since_post_14days</th>\n",
" <th>like_count_14days</th>\n",
" <th>comment_count_14days</th>\n",
" <th>share_count_14days</th>\n",
" <th>feed_organization</th>\n",
" <th>images</th>\n",
" <th>urls_count</th>\n",
" <th>urls_count_true</th>\n",
" <th>mentions_scrape</th>\n",
" <th>source</th>\n",
" <th>mission/non-mission</th>\n",
" <th>mission_focus</th>\n",
" <th>I-C-A</th>\n",
" <th>subcode</th>\n",
" <th>source_External</th>\n",
" <th>source_Internal</th>\n",
" <th>mission_N</th>\n",
" <th>mission_Y</th>\n",
" <th>mission_focus_Building capacity</th>\n",
" <th>mission_focus_Patient advocacy</th>\n",
" <th>mission_focus_Prevention</th>\n",
" <th>ICA_A</th>\n",
" <th>ICA_C</th>\n",
" <th>ICA_I</th>\n",
" <th>ICA_subcode_AIDS-related day</th>\n",
" <th>ICA_subcode_Awareness</th>\n",
" <th>ICA_subcode_Complementary support</th>\n",
" <th>ICA_subcode_Cover/profile photo</th>\n",
" <th>ICA_subcode_Dialogue</th>\n",
" <th>ICA_subcode_Donation</th>\n",
" <th>ICA_subcode_Event info</th>\n",
" <th>ICA_subcode_Event promotion</th>\n",
" <th>ICA_subcode_Event update</th>\n",
" <th>ICA_subcode_Get tested</th>\n",
" <th>ICA_subcode_HIV/AIDS info/news</th>\n",
" <th>ICA_subcode_Idea promotion</th>\n",
" <th>ICA_subcode_LGBT/Transgender information</th>\n",
" <th>ICA_subcode_Lobbying</th>\n",
" <th>ICA_subcode_Media action</th>\n",
" <th>ICA_subcode_Medication</th>\n",
" <th>ICA_subcode_National holiday/Holiday</th>\n",
" <th>ICA_subcode_Organizational news/announcement</th>\n",
" <th>ICA_subcode_Other</th>\n",
" <th>ICA_subcode_Patient stories</th>\n",
" <th>ICA_subcode_Recognition</th>\n",
" <th>ICA_subcode_Research/survey</th>\n",
" <th>ICA_subcode_Viewing action</th>\n",
" <th>ICA_subcode_Volunteer</th>\n",
" <th>type_event</th>\n",
" <th>type_link</th>\n",
" <th>type_music</th>\n",
" <th>type_photo</th>\n",
" <th>type_status</th>\n",
" <th>type_video</th>\n",
" <th>status_type_added_photos</th>\n",
" <th>status_type_added_video</th>\n",
" <th>status_type_created_event</th>\n",
" <th>status_type_created_note</th>\n",
" <th>status_type_mobile_status_update</th>\n",
" <th>status_type_published_story</th>\n",
" <th>status_type_shared_story</th>\n",
" <th>video_dummy</th>\n",
" <th>picture_dummy</th>\n",
" <th>Org_ID</th>\n",
" <th>Org_Name</th>\n",
" <th>Total_Revenue</th>\n",
" <th>Assets</th>\n",
" <th>followers_count</th>\n",
" <th>talking_about_count</th>\n",
" <th>were_here_count</th>\n",
" <th>Log_of_Assets</th>\n",
" <th>Log_of_Total_Revenue</th>\n",
" <th>Log_of_Followers</th>\n",
" <th>likes_binary</th>\n",
" <th>comment_binary</th>\n",
" <th>share_binary</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> 1118</td>\n",
" <td> 74</td>\n",
" <td> 1</td>\n",
" <td> AIDS Healthcare Foundation</td>\n",
" <td> 34661411150</td>\n",
" <td> {'location': {'city': 'Los Angeles', 'zip': '9...</td>\n",
" <td> https://www.facebook.com/AIDShealth/photos/a.1...</td>\n",
" <td> 34661411150_10152940344646151</td>\n",
" <td> 34661411150</td>\n",
" <td> 10152940344646100</td>\n",
" <td> https://www.facebook.com/34661411150/posts/101...</td>\n",
" <td> 'TIS THE SEASON: We know you've heard us say, ...</td>\n",
" <td> 2014-12-18T05:19:30+0000</td>\n",
" <td>2015-03-09 00:00:39</td>\n",
" <td> 2014-12-20T13:33:45+0000</td>\n",
" <td> photo</td>\n",
" <td> added_photos</td>\n",
" <td> NaN</td>\n",
" <td> https://scontent.xx.fbcdn.net/hphotos-xpf1/v/t...</td>\n",
" <td> NaN</td>\n",
" <td> NaN</td>\n",
" <td> NaN</td>\n",
" <td>1.00</td>\n",
" <td> Art Hearts Fashion</td>\n",
" <td>nan</td>\n",
" <td>nan</td>\n",
" <td>nan</td>\n",
" <td>nan</td>\n",
" <td> 1</td>\n",
" <td> UseACondom</td>\n",
" <td>nan</td>\n",
" <td> [{'link': 'https://www.facebook.com/3466141115...</td>\n",
" <td> NaN</td>\n",
" <td> NaN</td>\n",
" <td> {'picture': 'https://scontent.xx.fbcdn.net/hph...</td>\n",
" <td>nan</td>\n",
" <td>nan</td>\n",
" <td>nan</td>\n",
" <td>nan</td>\n",
" <td>81.00</td>\n",
" <td> 101</td>\n",
" <td> 3</td>\n",
" <td> 6</td>\n",
" <td> AIDS HEALTHCARE FOUNDATION</td>\n",
" <td>nan</td>\n",
" <td>nan</td>\n",
" <td>nan</td>\n",
" <td>nan</td>\n",
" <td> Internal</td>\n",
" <td> Y</td>\n",
" <td> Building capacity</td>\n",
" <td> A</td>\n",
" <td> Event promotion</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> AIDS HEALTHCARE FOUNDATION</td>\n",
" <td> 209986539</td>\n",
" <td> 227735541</td>\n",
" <td> 721419</td>\n",
" <td> 896</td>\n",
" <td> 3891</td>\n",
" <td>19.24</td>\n",
" <td>19.16</td>\n",
" <td>13.49</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 1262</td>\n",
" <td> 79</td>\n",
" <td> 1</td>\n",
" <td> AIDS Healthcare Foundation</td>\n",
" <td> 34661411150</td>\n",
" <td> NaN</td>\n",
" <td> http://www.frontiersmedia.com/frontiers-blog/2...</td>\n",
" <td> 34661411150_10152926215086151</td>\n",
" <td> 34661411150</td>\n",
" <td> 10152926215086100</td>\n",
" <td> https://www.facebook.com/34661411150/posts/101...</td>\n",
" <td> AHF mourns the tragic loss of our friend and c...</td>\n",
" <td> 2014-12-11T18:02:10+0000</td>\n",
" <td>2015-03-09 00:00:40</td>\n",
" <td> 2014-12-14T16:41:29+0000</td>\n",
" <td> link</td>\n",
" <td> shared_story</td>\n",
" <td> NaN</td>\n",
" <td> https://fbexternal-a.akamaihd.net/safe_image.p...</td>\n",
" <td> Dana Miller\u00e2\u20ac\u201dProducer, AIDS Advocate, Frontie...</td>\n",
" <td> frontiersmedia.com</td>\n",
" <td> \u00e2\u20ac\u0153Dana was legendary in gay Los Angeles and H...</td>\n",
" <td>0.00</td>\n",
" <td> NaN</td>\n",
" <td>nan</td>\n",
" <td>nan</td>\n",
" <td>nan</td>\n",
" <td>nan</td>\n",
" <td> 1</td>\n",
" <td> DanaMiller</td>\n",
" <td>nan</td>\n",
" <td> [{'link': 'https://www.facebook.com/3466141115...</td>\n",
" <td> NaN</td>\n",
" <td> NaN</td>\n",
" <td> {'picture': 'https://fbexternal-a.akamaihd.net...</td>\n",
" <td>nan</td>\n",
" <td>nan</td>\n",
" <td>nan</td>\n",
" <td>nan</td>\n",
" <td>88.00</td>\n",
" <td> 78</td>\n",
" <td> 9</td>\n",
" <td> 4</td>\n",
" <td> AIDS HEALTHCARE FOUNDATION</td>\n",
" <td>nan</td>\n",
" <td>nan</td>\n",
" <td>nan</td>\n",
" <td>nan</td>\n",
" <td> External</td>\n",
" <td> Y</td>\n",
" <td> Building capacity</td>\n",
" <td> C</td>\n",
" <td> Recognition</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> AIDS HEALTHCARE FOUNDATION</td>\n",
" <td> 209986539</td>\n",
" <td> 227735541</td>\n",
" <td> 721419</td>\n",
" <td> 896</td>\n",
" <td> 3891</td>\n",
" <td>19.24</td>\n",
" <td>19.16</td>\n",
" <td>13.49</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2 rows \u00d7 115 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 81,
"text": [
" No. id feed_id org_name FB_org_id \\\n",
"0 1118 74 1 AIDS Healthcare Foundation 34661411150 \n",
"1 1262 79 1 AIDS Healthcare Foundation 34661411150 \n",
"\n",
" location \\\n",
"0 {'location': {'city': 'Los Angeles', 'zip': '9... \n",
"1 NaN \n",
"\n",
" link \\\n",
"0 https://www.facebook.com/AIDShealth/photos/a.1... \n",
"1 http://www.frontiersmedia.com/frontiers-blog/2... \n",
"\n",
" message_id org_id status_id \\\n",
"0 34661411150_10152940344646151 34661411150 10152940344646100 \n",
"1 34661411150_10152926215086151 34661411150 10152926215086100 \n",
"\n",
" status_link \\\n",
"0 https://www.facebook.com/34661411150/posts/101... \n",
"1 https://www.facebook.com/34661411150/posts/101... \n",
"\n",
" content \\\n",
"0 'TIS THE SEASON: We know you've heard us say, ... \n",
"1 AHF mourns the tragic loss of our friend and c... \n",
"\n",
" published_date date_inserted last_comment \\\n",
"0 2014-12-18T05:19:30+0000 2015-03-09 00:00:39 2014-12-20T13:33:45+0000 \n",
"1 2014-12-11T18:02:10+0000 2015-03-09 00:00:40 2014-12-14T16:41:29+0000 \n",
"\n",
" type status_type video_source \\\n",
"0 photo added_photos NaN \n",
"1 link shared_story NaN \n",
"\n",
" picture_link \\\n",
"0 https://scontent.xx.fbcdn.net/hphotos-xpf1/v/t... \n",
"1 https://fbexternal-a.akamaihd.net/safe_image.p... \n",
"\n",
" link_name link_caption \\\n",
"0 NaN NaN \n",
"1 Dana Miller\u00e2\u20ac\u201dProducer, AIDS Advocate, Frontie... frontiersmedia.com \n",
"\n",
" link_description num_mentions \\\n",
"0 NaN 1.00 \n",
"1 \u00e2\u20ac\u0153Dana was legendary in gay Los Angeles and H... 0.00 \n",
"\n",
" mentions num_likes like_count comment_count share_count \\\n",
"0 Art Hearts Fashion nan nan nan nan \n",
"1 NaN nan nan nan nan \n",
"\n",
" hashtag_count hashtags mentions_count \\\n",
"0 1 UseACondom nan \n",
"1 1 DanaMiller nan \n",
"\n",
" actions application properties \\\n",
"0 [{'link': 'https://www.facebook.com/3466141115... NaN NaN \n",
"1 [{'link': 'https://www.facebook.com/3466141115... NaN NaN \n",
"\n",
" message_output time_since_post_days \\\n",
"0 {'picture': 'https://scontent.xx.fbcdn.net/hph... nan \n",
"1 {'picture': 'https://fbexternal-a.akamaihd.net... nan \n",
"\n",
" like_count_7days comment_count_7days share_count_7days \\\n",
"0 nan nan nan \n",
"1 nan nan nan \n",
"\n",
" time_since_post_14days like_count_14days comment_count_14days \\\n",
"0 81.00 101 3 \n",
"1 88.00 78 9 \n",
"\n",
" share_count_14days feed_organization images urls_count \\\n",
"0 6 AIDS HEALTHCARE FOUNDATION nan nan \n",
"1 4 AIDS HEALTHCARE FOUNDATION nan nan \n",
"\n",
" urls_count_true mentions_scrape source mission/non-mission \\\n",
"0 nan nan Internal Y \n",
"1 nan nan External Y \n",
"\n",
" mission_focus I-C-A subcode source_External source_Internal \\\n",
"0 Building capacity A Event promotion 0 1 \n",
"1 Building capacity C Recognition 1 0 \n",
"\n",
" mission_N mission_Y mission_focus_Building capacity \\\n",
"0 0 1 1 \n",
"1 0 1 1 \n",
"\n",
" mission_focus_Patient advocacy mission_focus_Prevention ICA_A ICA_C \\\n",
"0 0 0 1 0 \n",
"1 0 0 0 1 \n",
"\n",
" ICA_I ICA_subcode_AIDS-related day ICA_subcode_Awareness \\\n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"\n",
" ICA_subcode_Complementary support ICA_subcode_Cover/profile photo \\\n",
"0 0 0 \n",
"1 0 0 \n",
"\n",
" ICA_subcode_Dialogue ICA_subcode_Donation ICA_subcode_Event info \\\n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"\n",
" ICA_subcode_Event promotion ICA_subcode_Event update \\\n",
"0 1 0 \n",
"1 0 0 \n",
"\n",
" ICA_subcode_Get tested ICA_subcode_HIV/AIDS info/news \\\n",
"0 0 0 \n",
"1 0 0 \n",
"\n",
" ICA_subcode_Idea promotion ICA_subcode_LGBT/Transgender information \\\n",
"0 0 0 \n",
"1 0 0 \n",
"\n",
" ICA_subcode_Lobbying ICA_subcode_Media action ICA_subcode_Medication \\\n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"\n",
" ICA_subcode_National holiday/Holiday \\\n",
"0 0 \n",
"1 0 \n",
"\n",
" ICA_subcode_Organizational news/announcement ICA_subcode_Other \\\n",
"0 0 0 \n",
"1 0 0 \n",
"\n",
" ICA_subcode_Patient stories ICA_subcode_Recognition \\\n",
"0 0 0 \n",
"1 0 1 \n",
"\n",
" ICA_subcode_Research/survey ICA_subcode_Viewing action \\\n",
"0 0 0 \n",
"1 0 0 \n",
"\n",
" ICA_subcode_Volunteer type_event type_link type_music type_photo \\\n",
"0 0 0 0 0 1 \n",
"1 0 0 1 0 0 \n",
"\n",
" type_status type_video status_type_added_photos status_type_added_video \\\n",
"0 0 0 1 0 \n",
"1 0 0 0 0 \n",
"\n",
" status_type_created_event status_type_created_note \\\n",
"0 0 0 \n",
"1 0 0 \n",
"\n",
" status_type_mobile_status_update status_type_published_story \\\n",
"0 0 0 \n",
"1 0 0 \n",
"\n",
" status_type_shared_story video_dummy picture_dummy Org_ID \\\n",
"0 0 0 1 1 \n",
"1 1 0 1 1 \n",
"\n",
" Org_Name Total_Revenue Assets followers_count \\\n",
"0 AIDS HEALTHCARE FOUNDATION 209986539 227735541 721419 \n",
"1 AIDS HEALTHCARE FOUNDATION 209986539 227735541 721419 \n",
"\n",
" talking_about_count were_here_count Log_of_Assets Log_of_Total_Revenue \\\n",
"0 896 3891 19.24 19.16 \n",
"1 896 3891 19.24 19.16 \n",
"\n",
" Log_of_Followers likes_binary comment_binary share_binary \n",
"0 13.49 1 1 1 \n",
"1 13.49 1 1 1 \n",
"\n",
"[2 rows x 115 columns]"
]
}
],
"prompt_number": 81
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"\n",
"List all the columns in the DataFrame"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.columns"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 82,
"text": [
"Index([u'No.', u'id', u'feed_id', u'org_name', u'FB_org_id', u'location', u'link', u'message_id', u'org_id', u'status_id', u'status_link', u'content', u'published_date', u'date_inserted', u'last_comment', u'type', u'status_type', u'video_source', u'picture_link', u'link_name', u'link_caption', u'link_description', u'num_mentions', u'mentions', u'num_likes', u'like_count', u'comment_count', u'share_count', u'hashtag_count', u'hashtags', u'mentions_count', u'actions', u'application', u'properties', u'message_output', u'time_since_post_days', u'like_count_7days', u'comment_count_7days', u'share_count_7days', u'time_since_post_14days', u'like_count_14days', u'comment_count_14days', u'share_count_14days', u'feed_organization', u'images', u'urls_count', u'urls_count_true', u'mentions_scrape', u'source', u'mission/non-mission', u'mission_focus', u'I-C-A', u'subcode', u'source_External', u'source_Internal', u'mission_N', u'mission_Y', u'mission_focus_Building capacity', u'mission_focus_Patient advocacy', u'mission_focus_Prevention', u'ICA_A', u'ICA_C', u'ICA_I', u'ICA_subcode_AIDS-related day', u'ICA_subcode_Awareness', u'ICA_subcode_Complementary support', u'ICA_subcode_Cover/profile photo', u'ICA_subcode_Dialogue', u'ICA_subcode_Donation', u'ICA_subcode_Event info', u'ICA_subcode_Event promotion', u'ICA_subcode_Event update', u'ICA_subcode_Get tested', u'ICA_subcode_HIV/AIDS info/news', u'ICA_subcode_Idea promotion', u'ICA_subcode_LGBT/Transgender information', u'ICA_subcode_Lobbying', u'ICA_subcode_Media action', u'ICA_subcode_Medication', u'ICA_subcode_National holiday/Holiday', u'ICA_subcode_Organizational news/announcement', u'ICA_subcode_Other', u'ICA_subcode_Patient stories', u'ICA_subcode_Recognition', u'ICA_subcode_Research/survey', u'ICA_subcode_Viewing action', u'ICA_subcode_Volunteer', u'type_event', u'type_link', u'type_music', u'type_photo', u'type_status', u'type_video', u'status_type_added_photos', u'status_type_added_video', u'status_type_created_event', u'status_type_created_note', u'status_type_mobile_status_update', u'status_type_published_story', u'status_type_shared_story', ...], dtype='object')"
]
}
],
"prompt_number": 82
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br> \n",
"\n",
"You might not want to include all of your variables in the summary statistics table. When you're dealing with a dataset with a lot of columns, I find the easiest way is to output the column names to a list, copy and paste the output into a text editor, do a replace all command to delete the all the u's before each column name, and then paste back into iPython and create your desired sub-set."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.columns.tolist()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 56,
"text": [
"[u'No.',\n",
" u'id',\n",
" u'feed_id',\n",
" u'org_name',\n",
" u'FB_org_id',\n",
" u'location',\n",
" u'link',\n",
" u'message_id',\n",
" u'org_id',\n",
" u'status_id',\n",
" u'status_link',\n",
" u'content',\n",
" u'published_date',\n",
" u'date_inserted',\n",
" u'last_comment',\n",
" u'type',\n",
" u'status_type',\n",
" u'video_source',\n",
" u'picture_link',\n",
" u'link_name',\n",
" u'link_caption',\n",
" u'link_description',\n",
" u'num_mentions',\n",
" u'mentions',\n",
" u'num_likes',\n",
" u'like_count',\n",
" u'comment_count',\n",
" u'share_count',\n",
" u'hashtag_count',\n",
" u'hashtags',\n",
" u'mentions_count',\n",
" u'actions',\n",
" u'application',\n",
" u'properties',\n",
" u'message_output',\n",
" u'time_since_post_days',\n",
" u'like_count_7days',\n",
" u'comment_count_7days',\n",
" u'share_count_7days',\n",
" u'time_since_post_14days',\n",
" u'like_count_14days',\n",
" u'comment_count_14days',\n",
" u'share_count_14days',\n",
" u'feed_organization',\n",
" u'images',\n",
" u'urls_count',\n",
" u'urls_count_true',\n",
" u'mentions_scrape',\n",
" u'source',\n",
" u'mission/non-mission',\n",
" u'mission_focus',\n",
" u'I-C-A',\n",
" u'subcode',\n",
" u'source_External',\n",
" u'source_Internal',\n",
" u'mission_N',\n",
" u'mission_Y',\n",
" u'mission_focus_Building capacity',\n",
" u'mission_focus_Patient advocacy',\n",
" u'mission_focus_Prevention',\n",
" u'ICA_A',\n",
" u'ICA_C',\n",
" u'ICA_I',\n",
" u'ICA_subcode_AIDS-related day',\n",
" u'ICA_subcode_Awareness',\n",
" u'ICA_subcode_Complementary support',\n",
" u'ICA_subcode_Cover/profile photo',\n",
" u'ICA_subcode_Dialogue',\n",
" u'ICA_subcode_Donation',\n",
" u'ICA_subcode_Event info',\n",
" u'ICA_subcode_Event promotion',\n",
" u'ICA_subcode_Event update',\n",
" u'ICA_subcode_Get tested',\n",
" u'ICA_subcode_HIV/AIDS info/news',\n",
" u'ICA_subcode_Idea promotion',\n",
" u'ICA_subcode_LGBT/Transgender information',\n",
" u'ICA_subcode_Lobbying',\n",
" u'ICA_subcode_Media action',\n",
" u'ICA_subcode_Medication',\n",
" u'ICA_subcode_National holiday/Holiday',\n",
" u'ICA_subcode_Organizational news/announcement',\n",
" u'ICA_subcode_Other',\n",
" u'ICA_subcode_Patient stories',\n",
" u'ICA_subcode_Recognition',\n",
" u'ICA_subcode_Research/survey',\n",
" u'ICA_subcode_Viewing action',\n",
" u'ICA_subcode_Volunteer',\n",
" u'type_event',\n",
" u'type_link',\n",
" u'type_music',\n",
" u'type_photo',\n",
" u'type_status',\n",
" u'type_video',\n",
" u'status_type_added_photos',\n",
" u'status_type_added_video',\n",
" u'status_type_created_event',\n",
" u'status_type_created_note',\n",
" u'status_type_mobile_status_update',\n",
" u'status_type_published_story',\n",
" u'status_type_shared_story',\n",
" u'video_dummy',\n",
" u'picture_dummy',\n",
" u'Org_ID',\n",
" u'Org_Name',\n",
" u'Total_Revenue',\n",
" u'Assets',\n",
" u'followers_count',\n",
" u'talking_about_count',\n",
" u'were_here_count',\n",
" u'Log_of_Assets',\n",
" u'Log_of_Total_Revenue',\n",
" u'Log_of_Followers',\n",
" u'likes_binary',\n",
" u'comment_binary',\n",
" u'share_binary']"
]
}
],
"prompt_number": 56
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"\n",
"I've copy and pasted the above output into TextWrangler, omitted all the 'u's, and selected the columns I want. I will now limit the dataframe to just those columns I want. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df = df[['hashtag_count', 'like_count_14days', 'comment_count_14days', 'share_count_14days',\n",
" 'source', 'source_External', 'video_dummy', 'picture_dummy',\n",
" 'Total_Revenue', 'Log_of_Total_Revenue', 'followers_count', 'Log_of_Followers']]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 84
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br> \n",
"\n",
"As you can see, you now have a dataframe with only 12 columns (variables)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print len(df) \n",
"print len(df.columns)\n",
"df.head(2)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1500\n",
"12\n"
]
},
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>hashtag_count</th>\n",
" <th>like_count_14days</th>\n",
" <th>comment_count_14days</th>\n",
" <th>share_count_14days</th>\n",
" <th>source</th>\n",
" <th>source_External</th>\n",
" <th>video_dummy</th>\n",
" <th>picture_dummy</th>\n",
" <th>Total_Revenue</th>\n",
" <th>Log_of_Total_Revenue</th>\n",
" <th>followers_count</th>\n",
" <th>Log_of_Followers</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> 1</td>\n",
" <td> 101</td>\n",
" <td> 3</td>\n",
" <td> 6</td>\n",
" <td> Internal</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 209986539</td>\n",
" <td>19.16</td>\n",
" <td> 721419</td>\n",
" <td>13.49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 1</td>\n",
" <td> 78</td>\n",
" <td> 9</td>\n",
" <td> 4</td>\n",
" <td> External</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 209986539</td>\n",
" <td>19.16</td>\n",
" <td> 721419</td>\n",
" <td>13.49</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2 rows \u00d7 12 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 85,
"text": [
" hashtag_count like_count_14days comment_count_14days share_count_14days \\\n",
"0 1 101 3 6 \n",
"1 1 78 9 4 \n",
"\n",
" source source_External video_dummy picture_dummy Total_Revenue \\\n",
"0 Internal 0 0 1 209986539 \n",
"1 External 1 0 1 209986539 \n",
"\n",
" Log_of_Total_Revenue followers_count Log_of_Followers \n",
"0 19.16 721419 13.49 \n",
"1 19.16 721419 13.49 \n",
"\n",
"[2 rows x 12 columns]"
]
}
],
"prompt_number": 85
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Generate Summary Statistics "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the basic way to produce summary statistics for all variables in your dataframe"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.describe()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>hashtag_count</th>\n",
" <th>like_count_14days</th>\n",
" <th>comment_count_14days</th>\n",
" <th>share_count_14days</th>\n",
" <th>source_External</th>\n",
" <th>video_dummy</th>\n",
" <th>picture_dummy</th>\n",
" <th>Total_Revenue</th>\n",
" <th>Log_of_Total_Revenue</th>\n",
" <th>followers_count</th>\n",
" <th>Log_of_Followers</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>1500.00</td>\n",
" <td> 1500.00</td>\n",
" <td>1500.00</td>\n",
" <td>1500.00</td>\n",
" <td>1500.00</td>\n",
" <td>1500.00</td>\n",
" <td>1500.00</td>\n",
" <td> 1500.00</td>\n",
" <td>1500.00</td>\n",
" <td> 1500.00</td>\n",
" <td>1500.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td> 0.48</td>\n",
" <td> 30.59</td>\n",
" <td> 0.59</td>\n",
" <td> 3.11</td>\n",
" <td> 0.42</td>\n",
" <td> 0.03</td>\n",
" <td> 0.88</td>\n",
" <td> 11486117.19</td>\n",
" <td> 15.36</td>\n",
" <td> 20150.46</td>\n",
" <td> 7.79</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td> 1.12</td>\n",
" <td> 355.34</td>\n",
" <td> 2.62</td>\n",
" <td> 14.95</td>\n",
" <td> 0.49</td>\n",
" <td> 0.18</td>\n",
" <td> 0.33</td>\n",
" <td> 31061197.86</td>\n",
" <td> 1.12</td>\n",
" <td>107022.51</td>\n",
" <td> 1.34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 1015497.00</td>\n",
" <td> 13.83</td>\n",
" <td> 45.00</td>\n",
" <td> 3.81</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td> 0.00</td>\n",
" <td> 2.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" <td> 1954081.00</td>\n",
" <td> 14.49</td>\n",
" <td> 1089.50</td>\n",
" <td> 6.99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td> 0.00</td>\n",
" <td> 5.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" <td> 4179577.00</td>\n",
" <td> 15.25</td>\n",
" <td> 2134.00</td>\n",
" <td> 7.67</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td> 0.00</td>\n",
" <td> 12.00</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" <td> 1.00</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" <td> 10633968.00</td>\n",
" <td> 16.18</td>\n",
" <td> 3481.00</td>\n",
" <td> 8.16</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td> 10.00</td>\n",
" <td>11004.00</td>\n",
" <td> 59.00</td>\n",
" <td> 219.00</td>\n",
" <td> 1.00</td>\n",
" <td> 1.00</td>\n",
" <td> 1.00</td>\n",
" <td>209986539.00</td>\n",
" <td> 19.16</td>\n",
" <td>721419.00</td>\n",
" <td> 13.49</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8 rows \u00d7 11 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 86,
"text": [
" hashtag_count like_count_14days comment_count_14days \\\n",
"count 1500.00 1500.00 1500.00 \n",
"mean 0.48 30.59 0.59 \n",
"std 1.12 355.34 2.62 \n",
"min 0.00 0.00 0.00 \n",
"25% 0.00 2.00 0.00 \n",
"50% 0.00 5.00 0.00 \n",
"75% 0.00 12.00 0.00 \n",
"max 10.00 11004.00 59.00 \n",
"\n",
" share_count_14days source_External video_dummy picture_dummy \\\n",
"count 1500.00 1500.00 1500.00 1500.00 \n",
"mean 3.11 0.42 0.03 0.88 \n",
"std 14.95 0.49 0.18 0.33 \n",
"min 0.00 0.00 0.00 0.00 \n",
"25% 0.00 0.00 0.00 1.00 \n",
"50% 0.00 0.00 0.00 1.00 \n",
"75% 1.00 1.00 0.00 1.00 \n",
"max 219.00 1.00 1.00 1.00 \n",
"\n",
" Total_Revenue Log_of_Total_Revenue followers_count Log_of_Followers \n",
"count 1500.00 1500.00 1500.00 1500.00 \n",
"mean 11486117.19 15.36 20150.46 7.79 \n",
"std 31061197.86 1.12 107022.51 1.34 \n",
"min 1015497.00 13.83 45.00 3.81 \n",
"25% 1954081.00 14.49 1089.50 6.99 \n",
"50% 4179577.00 15.25 2134.00 7.67 \n",
"75% 10633968.00 16.18 3481.00 8.16 \n",
"max 209986539.00 19.16 721419.00 13.49 \n",
"\n",
"[8 rows x 11 columns]"
]
}
],
"prompt_number": 86
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br> \n",
"\n",
"If you'd like to see the help for the describe function"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DataFrame.describe?"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"\n",
"Use the <i>dir</i> function to get an alphabetical listing of valid names (attributes) in an object."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"dir(df.describe())"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 93
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"\n",
"CHANGE TO TWO DECIMALS (n.b. - This step is not necessary if you have run the display.float_format command earlier)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"np.round(df.describe(), 2)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>hashtag_count</th>\n",
" <th>like_count_14days</th>\n",
" <th>comment_count_14days</th>\n",
" <th>share_count_14days</th>\n",
" <th>source_External</th>\n",
" <th>video_dummy</th>\n",
" <th>picture_dummy</th>\n",
" <th>Total_Revenue</th>\n",
" <th>Log_of_Total_Revenue</th>\n",
" <th>followers_count</th>\n",
" <th>Log_of_Followers</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>1500.00</td>\n",
" <td> 1500.00</td>\n",
" <td>1500.00</td>\n",
" <td>1500.00</td>\n",
" <td>1500.00</td>\n",
" <td>1500.00</td>\n",
" <td>1500.00</td>\n",
" <td> 1500.00</td>\n",
" <td>1500.00</td>\n",
" <td> 1500.00</td>\n",
" <td>1500.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td> 0.48</td>\n",
" <td> 30.59</td>\n",
" <td> 0.59</td>\n",
" <td> 3.11</td>\n",
" <td> 0.42</td>\n",
" <td> 0.03</td>\n",
" <td> 0.88</td>\n",
" <td> 11486117.19</td>\n",
" <td> 15.36</td>\n",
" <td> 20150.46</td>\n",
" <td> 7.79</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td> 1.12</td>\n",
" <td> 355.34</td>\n",
" <td> 2.62</td>\n",
" <td> 14.95</td>\n",
" <td> 0.49</td>\n",
" <td> 0.18</td>\n",
" <td> 0.33</td>\n",
" <td> 31061197.86</td>\n",
" <td> 1.12</td>\n",
" <td>107022.51</td>\n",
" <td> 1.34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 1015497.00</td>\n",
" <td> 13.83</td>\n",
" <td> 45.00</td>\n",
" <td> 3.81</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td> 0.00</td>\n",
" <td> 2.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" <td> 1954081.00</td>\n",
" <td> 14.49</td>\n",
" <td> 1089.50</td>\n",
" <td> 6.99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td> 0.00</td>\n",
" <td> 5.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" <td> 4179577.00</td>\n",
" <td> 15.25</td>\n",
" <td> 2134.00</td>\n",
" <td> 7.67</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td> 0.00</td>\n",
" <td> 12.00</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" <td> 1.00</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" <td> 10633968.00</td>\n",
" <td> 16.18</td>\n",
" <td> 3481.00</td>\n",
" <td> 8.16</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td> 10.00</td>\n",
" <td>11004.00</td>\n",
" <td> 59.00</td>\n",
" <td> 219.00</td>\n",
" <td> 1.00</td>\n",
" <td> 1.00</td>\n",
" <td> 1.00</td>\n",
" <td>209986539.00</td>\n",
" <td> 19.16</td>\n",
" <td>721419.00</td>\n",
" <td> 13.49</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8 rows \u00d7 11 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 88,
"text": [
" hashtag_count like_count_14days comment_count_14days \\\n",
"count 1500.00 1500.00 1500.00 \n",
"mean 0.48 30.59 0.59 \n",
"std 1.12 355.34 2.62 \n",
"min 0.00 0.00 0.00 \n",
"25% 0.00 2.00 0.00 \n",
"50% 0.00 5.00 0.00 \n",
"75% 0.00 12.00 0.00 \n",
"max 10.00 11004.00 59.00 \n",
"\n",
" share_count_14days source_External video_dummy picture_dummy \\\n",
"count 1500.00 1500.00 1500.00 1500.00 \n",
"mean 3.11 0.42 0.03 0.88 \n",
"std 14.95 0.49 0.18 0.33 \n",
"min 0.00 0.00 0.00 0.00 \n",
"25% 0.00 0.00 0.00 1.00 \n",
"50% 0.00 0.00 0.00 1.00 \n",
"75% 1.00 1.00 0.00 1.00 \n",
"max 219.00 1.00 1.00 1.00 \n",
"\n",
" Total_Revenue Log_of_Total_Revenue followers_count Log_of_Followers \n",
"count 1500.00 1500.00 1500.00 1500.00 \n",
"mean 11486117.19 15.36 20150.46 7.79 \n",
"std 31061197.86 1.12 107022.51 1.34 \n",
"min 1015497.00 13.83 45.00 3.81 \n",
"25% 1954081.00 14.49 1089.50 6.99 \n",
"50% 4179577.00 15.25 2134.00 7.67 \n",
"75% 10633968.00 16.18 3481.00 8.16 \n",
"max 209986539.00 19.16 721419.00 13.49 \n",
"\n",
"[8 rows x 11 columns]"
]
}
],
"prompt_number": 88
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"\n",
"NOW LET'S TRANSPOSE THE OUTPUT -- necessary for a more typical social scientific presentation of the data"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"np.round(df.describe(), 2).T"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count</th>\n",
" <th>mean</th>\n",
" <th>std</th>\n",
" <th>min</th>\n",
" <th>25%</th>\n",
" <th>50%</th>\n",
" <th>75%</th>\n",
" <th>max</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>hashtag_count</th>\n",
" <td>1500.00</td>\n",
" <td> 0.48</td>\n",
" <td> 1.12</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 10.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>like_count_14days</th>\n",
" <td>1500.00</td>\n",
" <td> 30.59</td>\n",
" <td> 355.34</td>\n",
" <td> 0.00</td>\n",
" <td> 2.00</td>\n",
" <td> 5.00</td>\n",
" <td> 12.00</td>\n",
" <td> 11004.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>comment_count_14days</th>\n",
" <td>1500.00</td>\n",
" <td> 0.59</td>\n",
" <td> 2.62</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 59.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>share_count_14days</th>\n",
" <td>1500.00</td>\n",
" <td> 3.11</td>\n",
" <td> 14.95</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" <td> 219.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>source_External</th>\n",
" <td>1500.00</td>\n",
" <td> 0.42</td>\n",
" <td> 0.49</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" <td> 1.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>video_dummy</th>\n",
" <td>1500.00</td>\n",
" <td> 0.03</td>\n",
" <td> 0.18</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>picture_dummy</th>\n",
" <td>1500.00</td>\n",
" <td> 0.88</td>\n",
" <td> 0.33</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" <td> 1.00</td>\n",
" <td> 1.00</td>\n",
" <td> 1.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Total_Revenue</th>\n",
" <td>1500.00</td>\n",
" <td>11486117.19</td>\n",
" <td>31061197.86</td>\n",
" <td>1015497.00</td>\n",
" <td>1954081.00</td>\n",
" <td>4179577.00</td>\n",
" <td>10633968.00</td>\n",
" <td>209986539.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Log_of_Total_Revenue</th>\n",
" <td>1500.00</td>\n",
" <td> 15.36</td>\n",
" <td> 1.12</td>\n",
" <td> 13.83</td>\n",
" <td> 14.49</td>\n",
" <td> 15.25</td>\n",
" <td> 16.18</td>\n",
" <td> 19.16</td>\n",
" </tr>\n",
" <tr>\n",
" <th>followers_count</th>\n",
" <td>1500.00</td>\n",
" <td> 20150.46</td>\n",
" <td> 107022.51</td>\n",
" <td> 45.00</td>\n",
" <td> 1089.50</td>\n",
" <td> 2134.00</td>\n",
" <td> 3481.00</td>\n",
" <td> 721419.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Log_of_Followers</th>\n",
" <td>1500.00</td>\n",
" <td> 7.79</td>\n",
" <td> 1.34</td>\n",
" <td> 3.81</td>\n",
" <td> 6.99</td>\n",
" <td> 7.67</td>\n",
" <td> 8.16</td>\n",
" <td> 13.49</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>11 rows \u00d7 8 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 89,
"text": [
" count mean std min 25% \\\n",
"hashtag_count 1500.00 0.48 1.12 0.00 0.00 \n",
"like_count_14days 1500.00 30.59 355.34 0.00 2.00 \n",
"comment_count_14days 1500.00 0.59 2.62 0.00 0.00 \n",
"share_count_14days 1500.00 3.11 14.95 0.00 0.00 \n",
"source_External 1500.00 0.42 0.49 0.00 0.00 \n",
"video_dummy 1500.00 0.03 0.18 0.00 0.00 \n",
"picture_dummy 1500.00 0.88 0.33 0.00 1.00 \n",
"Total_Revenue 1500.00 11486117.19 31061197.86 1015497.00 1954081.00 \n",
"Log_of_Total_Revenue 1500.00 15.36 1.12 13.83 14.49 \n",
"followers_count 1500.00 20150.46 107022.51 45.00 1089.50 \n",
"Log_of_Followers 1500.00 7.79 1.34 3.81 6.99 \n",
"\n",
" 50% 75% max \n",
"hashtag_count 0.00 0.00 10.00 \n",
"like_count_14days 5.00 12.00 11004.00 \n",
"comment_count_14days 0.00 0.00 59.00 \n",
"share_count_14days 0.00 1.00 219.00 \n",
"source_External 0.00 1.00 1.00 \n",
"video_dummy 0.00 0.00 1.00 \n",
"picture_dummy 1.00 1.00 1.00 \n",
"Total_Revenue 4179577.00 10633968.00 209986539.00 \n",
"Log_of_Total_Revenue 15.25 16.18 19.16 \n",
"followers_count 2134.00 3481.00 721419.00 \n",
"Log_of_Followers 7.67 8.16 13.49 \n",
"\n",
"[11 rows x 8 columns]"
]
}
],
"prompt_number": 89
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"\n",
"We won't typically want the percentile columns in a social scientific publication. In version 0.16 of PANDAS, you can use 'percentiles=None' with the describe command to omit the percentiles. In PANDAS 0.13 we can instead select only those columns we want, then output to CSV"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"np.round(df.describe(), 2).T[['count','mean', 'std', 'min', 'max']]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count</th>\n",
" <th>mean</th>\n",
" <th>std</th>\n",
" <th>min</th>\n",
" <th>max</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>hashtag_count</th>\n",
" <td>1500.00</td>\n",
" <td> 0.48</td>\n",
" <td> 1.12</td>\n",
" <td> 0.00</td>\n",
" <td> 10.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>like_count_14days</th>\n",
" <td>1500.00</td>\n",
" <td> 30.59</td>\n",
" <td> 355.34</td>\n",
" <td> 0.00</td>\n",
" <td> 11004.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>comment_count_14days</th>\n",
" <td>1500.00</td>\n",
" <td> 0.59</td>\n",
" <td> 2.62</td>\n",
" <td> 0.00</td>\n",
" <td> 59.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>share_count_14days</th>\n",
" <td>1500.00</td>\n",
" <td> 3.11</td>\n",
" <td> 14.95</td>\n",
" <td> 0.00</td>\n",
" <td> 219.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>source_External</th>\n",
" <td>1500.00</td>\n",
" <td> 0.42</td>\n",
" <td> 0.49</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>video_dummy</th>\n",
" <td>1500.00</td>\n",
" <td> 0.03</td>\n",
" <td> 0.18</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>picture_dummy</th>\n",
" <td>1500.00</td>\n",
" <td> 0.88</td>\n",
" <td> 0.33</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Total_Revenue</th>\n",
" <td>1500.00</td>\n",
" <td>11486117.19</td>\n",
" <td>31061197.86</td>\n",
" <td>1015497.00</td>\n",
" <td>209986539.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Log_of_Total_Revenue</th>\n",
" <td>1500.00</td>\n",
" <td> 15.36</td>\n",
" <td> 1.12</td>\n",
" <td> 13.83</td>\n",
" <td> 19.16</td>\n",
" </tr>\n",
" <tr>\n",
" <th>followers_count</th>\n",
" <td>1500.00</td>\n",
" <td> 20150.46</td>\n",
" <td> 107022.51</td>\n",
" <td> 45.00</td>\n",
" <td> 721419.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Log_of_Followers</th>\n",
" <td>1500.00</td>\n",
" <td> 7.79</td>\n",
" <td> 1.34</td>\n",
" <td> 3.81</td>\n",
" <td> 13.49</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>11 rows \u00d7 5 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 90,
"text": [
" count mean std min max\n",
"hashtag_count 1500.00 0.48 1.12 0.00 10.00\n",
"like_count_14days 1500.00 30.59 355.34 0.00 11004.00\n",
"comment_count_14days 1500.00 0.59 2.62 0.00 59.00\n",
"share_count_14days 1500.00 3.11 14.95 0.00 219.00\n",
"source_External 1500.00 0.42 0.49 0.00 1.00\n",
"video_dummy 1500.00 0.03 0.18 0.00 1.00\n",
"picture_dummy 1500.00 0.88 0.33 0.00 1.00\n",
"Total_Revenue 1500.00 11486117.19 31061197.86 1015497.00 209986539.00\n",
"Log_of_Total_Revenue 1500.00 15.36 1.12 13.83 19.16\n",
"followers_count 1500.00 20150.46 107022.51 45.00 721419.00\n",
"Log_of_Followers 1500.00 7.79 1.34 3.81 13.49\n",
"\n",
"[11 rows x 5 columns]"
]
}
],
"prompt_number": 90
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#ALTERNATIVE WAY OF WRITING\n",
"np.round(df.describe(), 2).transpose()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count</th>\n",
" <th>mean</th>\n",
" <th>std</th>\n",
" <th>min</th>\n",
" <th>25%</th>\n",
" <th>50%</th>\n",
" <th>75%</th>\n",
" <th>max</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>hashtag_count</th>\n",
" <td>1500.00</td>\n",
" <td> 0.48</td>\n",
" <td> 1.12</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 10.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>like_count_14days</th>\n",
" <td>1500.00</td>\n",
" <td> 30.59</td>\n",
" <td> 355.34</td>\n",
" <td> 0.00</td>\n",
" <td> 2.00</td>\n",
" <td> 5.00</td>\n",
" <td> 12.00</td>\n",
" <td> 11004.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>comment_count_14days</th>\n",
" <td>1500.00</td>\n",
" <td> 0.59</td>\n",
" <td> 2.62</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 59.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>share_count_14days</th>\n",
" <td>1500.00</td>\n",
" <td> 3.11</td>\n",
" <td> 14.95</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" <td> 219.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>source_External</th>\n",
" <td>1500.00</td>\n",
" <td> 0.42</td>\n",
" <td> 0.49</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" <td> 1.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>video_dummy</th>\n",
" <td>1500.00</td>\n",
" <td> 0.03</td>\n",
" <td> 0.18</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>picture_dummy</th>\n",
" <td>1500.00</td>\n",
" <td> 0.88</td>\n",
" <td> 0.33</td>\n",
" <td> 0.00</td>\n",
" <td> 1.00</td>\n",
" <td> 1.00</td>\n",
" <td> 1.00</td>\n",
" <td> 1.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Total_Revenue</th>\n",
" <td>1500.00</td>\n",
" <td>11486117.19</td>\n",
" <td>31061197.86</td>\n",
" <td>1015497.00</td>\n",
" <td>1954081.00</td>\n",
" <td>4179577.00</td>\n",
" <td>10633968.00</td>\n",
" <td>209986539.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Log_of_Total_Revenue</th>\n",
" <td>1500.00</td>\n",
" <td> 15.36</td>\n",
" <td> 1.12</td>\n",
" <td> 13.83</td>\n",
" <td> 14.49</td>\n",
" <td> 15.25</td>\n",
" <td> 16.18</td>\n",
" <td> 19.16</td>\n",
" </tr>\n",
" <tr>\n",
" <th>followers_count</th>\n",
" <td>1500.00</td>\n",
" <td> 20150.46</td>\n",
" <td> 107022.51</td>\n",
" <td> 45.00</td>\n",
" <td> 1089.50</td>\n",
" <td> 2134.00</td>\n",
" <td> 3481.00</td>\n",
" <td> 721419.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Log_of_Followers</th>\n",
" <td>1500.00</td>\n",
" <td> 7.79</td>\n",
" <td> 1.34</td>\n",
" <td> 3.81</td>\n",
" <td> 6.99</td>\n",
" <td> 7.67</td>\n",
" <td> 8.16</td>\n",
" <td> 13.49</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>11 rows \u00d7 8 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 92,
"text": [
" count mean std min 25% \\\n",
"hashtag_count 1500.00 0.48 1.12 0.00 0.00 \n",
"like_count_14days 1500.00 30.59 355.34 0.00 2.00 \n",
"comment_count_14days 1500.00 0.59 2.62 0.00 0.00 \n",
"share_count_14days 1500.00 3.11 14.95 0.00 0.00 \n",
"source_External 1500.00 0.42 0.49 0.00 0.00 \n",
"video_dummy 1500.00 0.03 0.18 0.00 0.00 \n",
"picture_dummy 1500.00 0.88 0.33 0.00 1.00 \n",
"Total_Revenue 1500.00 11486117.19 31061197.86 1015497.00 1954081.00 \n",
"Log_of_Total_Revenue 1500.00 15.36 1.12 13.83 14.49 \n",
"followers_count 1500.00 20150.46 107022.51 45.00 1089.50 \n",
"Log_of_Followers 1500.00 7.79 1.34 3.81 6.99 \n",
"\n",
" 50% 75% max \n",
"hashtag_count 0.00 0.00 10.00 \n",
"like_count_14days 5.00 12.00 11004.00 \n",
"comment_count_14days 0.00 0.00 59.00 \n",
"share_count_14days 0.00 1.00 219.00 \n",
"source_External 0.00 1.00 1.00 \n",
"video_dummy 0.00 0.00 1.00 \n",
"picture_dummy 1.00 1.00 1.00 \n",
"Total_Revenue 4179577.00 10633968.00 209986539.00 \n",
"Log_of_Total_Revenue 15.25 16.18 19.16 \n",
"followers_count 2134.00 3481.00 721419.00 \n",
"Log_of_Followers 7.67 8.16 13.49 \n",
"\n",
"[11 rows x 8 columns]"
]
}
],
"prompt_number": 92
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"SAVE THE OUTPUT OF THE TABLE AS A CSV FILE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"\n",
"We can use the above commands and output to CSV"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#WITH FOUR DECIMAL PLACES (DEFAULT)\n",
"df.describe().transpose().to_csv('summary stats.csv', sep=',')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 41
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"\n",
"For a typical social scientific publication, we would not need the percentile columns. In version 0.16 of PANDAS, you can use 'percentiles=None' with the describe command to omit the percentiles. In PANDAS 0.13 we can instead select only those columns we want, then output to CSV"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.describe().transpose()[['count','mean', 'std', 'min', 'max']].to_csv('summary stats.csv', sep=',')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 42
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"\n",
"The problem with the above output is that more than 2 decimal places are showing. If you want only two, then run the following version."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#WITH TWO DECIMAL PLACES\n",
"np.round(df.describe(), 2).T[['count','mean', 'std', 'min', 'max']].to_csv('summary stats.csv', sep=',')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 43
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"\n",
"Now you have a CSV file containing the columns you'll need for a typical Summary Statistics or Descriptive Statistics table for a submission to a social science journal. You likely won't want all of the columns in the final table, so I would probably open up the CSV file in Excel, delete unwanted variables, then copy and paste into Word. At that point you just need some formatting for aesthetics. If you do want to select which specific variables to include, you can specify the columns like this."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cols = ['hashtag_count','like_count_14days']\n",
"np.round(df[cols].describe(), 2).T[['count','mean', 'std', 'min', 'max']].to_csv('summary stats (partial).csv', sep=',')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 55
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"\n",
"For more Notebooks as well as additional Python and Big Data tutorials, please visit http://social-metrics.org or follow me on Twitter <a href='https://twitter.com/gregorysaxton'>@gregorysaxton</a>"
]
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment