ekzhang/Workshop.ipynb

## Workshop.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# HCS Workshop 1: Data Science\n",
    "\n",
    "Welcome to the data science workshop! Your instructors today are [**Will Cooper '24**](http://www.will-cooper.net/), leading the discussion, and [**Eric Zhang '23**](https://www.ekzhang.com/), who designed and wrote this iteration of the workshop."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1: Imports\n",
    "\n",
    "We'll be using `pandas` for data storage and cleaning, along with `altair` and `seaborn` for exploratory visualization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import altair as alt\n",
    "from altair import datum, expr\n",
    "\n",
    "alt.data_transformers.disable_max_rows()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 2: Loading Data\n",
    "\n",
    "The cell below will load your Facebook messages data into a table called `df`. Each row of this table corresponds to a single message, and it has four columns:\n",
    "\n",
    "- `chat`: The ID of the Messenger chat that this message was sent in.\n",
    "- `sender`: The name of the user who sent the message.\n",
    "- `time`: A timestamp specifying exactly when the message was sent.\n",
    "- `content`: The text of the message.\n",
    "\n",
    "The `build_dataframe()` function loads all of this data from the files you downloaded from Facebook. It takes one argument, `path`, which is the folder where the unzipped data resides. If you're running this notebook in the same directory as `messages`, there is no need to change anything; otherwise, pass the path to your `messages` folder as an argument to the function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
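    "def build_dataframe(path='messages'):\n",
    "    \"\"\"Load every text message under `path` into a DataFrame, one row per message.\"\"\"\n",
    "    rows_list = []\n",
    "    # Each chat has its own folder under inbox/, split across message_*.json files.\n",
    "    for filename in Path(path).glob('inbox/*/message_*.json'):\n",
    "        chat = filename.parent.name\n",
    "        # Read as UTF-8 explicitly so the result does not depend on the platform default.\n",
    "        with open(filename, 'r', encoding='utf-8') as f:\n",
    "            obj = json.load(f)\n",
    "        for entry in obj['messages']:\n",
    "            # Keep 'Generic' entries that have text; skip other entry types and messages without text.\n",
    "            if entry['type'] == 'Generic' and entry.get('content') is not None:\n",
    "                rows_list.append({\n",
    "                    'chat': chat,\n",
    "                    'sender': entry['sender_name'],\n",
    "                    'time': pd.to_datetime(entry['timestamp_ms'], unit='ms'),\n",
    "                    'content': entry['content'],\n",
    "                })\n",
    "    return pd.DataFrame(rows_list)\n",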
    "\n",
    "df = build_dataframe()"
   ]
  },
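  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If `df` comes back empty, the most likely cause is that `path` does not point at the unzipped `messages` folder. As a minimal sketch of a sanity check (the variable names `data_dir` and `n_files` are made up for this example), the cell below counts the files that match the same glob pattern used by `build_dataframe()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "# Hypothetical location; change this to wherever your `messages` folder lives.\n",
    "data_dir = Path('messages')\n",
    "\n",
    "# Same pattern as build_dataframe(): one folder per chat, one or more JSON files each.\n",
    "n_files = len(list(data_dir.glob('inbox/*/message_*.json')))\n",
    "print(f'Found {n_files} message files under {data_dir.resolve()}')"
   ]
  },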
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's quickly check to make sure that our data was loaded properly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.chat.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 3: Data Visualization\n",
    "\n",
    "Now we can start exploring! First, how many messages have we sent over time?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "alt.Chart(df).mark_line().encode(\n",
    "    x='yearmonth(time):T',\n",
    "    y='count()',\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's color-code this graph by conversation. We can add tooltips and a title along the way."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "alt.Chart(df).mark_bar().encode(\n",
    "    x='yearmonth(time):T',\n",
    "    y='count()',\n",
    "    color='chat:N',\n",
    "    tooltip=['chat', 'count()'],\n",
    ").properties(\n",
    "    title='Number of Facebook Messages',\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How about the number of messages sent by each participant in a particular chat?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replace this with one of your own chat IDs, e.g. from df.chat.value_counts() above.\n",
    "chat_id = 'naptime_gotdtb0uea'\n",
    "\n",
    "alt.Chart(df.query(f'chat == \"{chat_id}\"')).mark_bar().encode(\n",
    "    color='sender:N',\n",
    "    x='yearmonth(time):T',\n",
    "    y='count()',\n",
    "    tooltip=['sender', 'count()'],\n",
    ").properties(\n",
    "    title='Facebook Messages in Group Chat',\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Which participants tend to send the most messages in each of your conversations, if any?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "alt.Chart(df).mark_bar().encode(\n",
    "    alt.X('count()', stack='normalize', title='frequency'),\n",
    "    alt.Y('chat'),\n",
    "    alt.Color('sender'),\n",
    "    tooltip=['sender', alt.Tooltip('count()', title='messages')],\n",
    ").properties(\n",
    "    title='Who Dominates the Conversation?',\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let's look at the number of messages (and words) you sent on each calendar day, with one row per month and one column per day of the month."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replace this with your own name, exactly as it appears in the `sender` column.\n",
    "sender = 'Eric Zhang'\n",
    "\n",
    "alt.Chart(df).mark_rect().encode(\n",
    "    alt.X('date(time):O', title='day'),\n",
    "    alt.Y('yearmonth(time):O', title='month'),\n",
    "    alt.Color('count()', scale=alt.Scale(type='linear')),\n",
    "    tooltip=[\n",
    "        alt.Tooltip('count()', title='Messages'),\n",
    "        alt.Tooltip('sum(words):Q', title='Words'),\n",
    "    ],\n",
    ").transform_filter(\n",
    "    datum.sender == sender,\n",
    ").transform_calculate(\n",
    "    words=expr.length(expr.split(datum.content, ' ')),\n",
    ").properties(\n",
    "    title='Number of Messages Sent by Day',\n",
    ")"
   ]
  },
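  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `words` field in the chart above is evaluated by Vega when the chart renders: `expr.split(datum.content, ' ')` splits each message on single spaces, and `expr.length` counts the pieces. A rough pandas equivalent, shown here on a few made-up strings rather than your real data, might look like the cell below. Splitting on a single space counts the empty string between two consecutive spaces as a piece, so treat the totals as approximate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Made-up messages, for illustration only.\n",
    "sample = pd.Series(['hello there', 'good  morning', 'ok'])\n",
    "\n",
    "# Split on a single space and count the pieces, mirroring the Vega expression.\n",
    "sample.str.split(' ').str.len()"
   ]
  },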
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 4: Exercises\n",
    "\n",
    "Create two more data visualizations, different from the ones above. What are you curious to learn about? Some suggestions:\n",
    "\n",
    "- Change some of the chart types to line graphs or scatter plots, or experiment with [scales](https://vega.github.io/vega/docs/scales/) and [color schemes](https://vega.github.io/vega/docs/schemes/).\n",
    "- On which hours and days of the week are you most active?\n",
    "- Has the average length of your messages increased, decreased, or stayed the same over time?\n",
    "- Filter your messages by a particular word. Who do you talk to most about, e.g., sports?\n",
    "- Who sends you the most messages with [positive sentiment](https://www.nltk.org/api/nltk.sentiment.html)? Has this changed over time?\n",
    "\n",
    "If you're looking for more inspiration, check out the [Altair example gallery](https://altair-viz.github.io/gallery/index.html)!"
   ]
  },
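  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a possible starting point for the hours-and-weekdays question, here is a minimal sketch on a tiny made-up table with the same columns as `df`. The name `demo`, its rows, and the `America/New_York` time zone are assumptions for illustration. `build_dataframe()` parses `timestamp_ms` without a time zone, so if those values are Unix epoch milliseconds, the resulting times are in UTC and may need converting before you group by hour."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Made-up rows with the same columns as `df`, for illustration only.\n",
    "demo = pd.DataFrame({\n",
    "    'chat': ['a', 'a', 'b'],\n",
    "    'sender': ['Alice', 'Bob', 'Alice'],\n",
    "    'time': pd.to_datetime(['2021-01-04 09:15', '2021-01-04 21:30', '2021-01-09 09:45']),\n",
    "    'content': ['hi', 'hello there', 'good morning everyone'],\n",
    "})\n",
    "\n",
    "# Treat the naive timestamps as UTC, then convert to a local zone (assumed here).\n",
    "local_time = demo['time'].dt.tz_localize('UTC').dt.tz_convert('America/New_York')\n",
    "demo['hour'] = local_time.dt.hour\n",
    "demo['weekday'] = local_time.dt.day_name()\n",
    "\n",
    "# Count messages for each (weekday, hour) pair.\n",
    "demo.groupby(['weekday', 'hour']).size()"
   ]
  },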
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Viz 1: Your code here!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Viz 2: Your code here!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 5: Submission"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you're comping HCS and would like to receive credit for completing this workshop, here are the instructions to do so.\n",
    "\n",
    "1. Click the Kernel >> \"Restart Kernel and Run All Cells\" button at the top-left of the screen. This will run your entire notebook from top to bottom, ensuring that your code is reproducible.\n",
    "2. Click on the ellipsis icon at the top-right of your two personal visualizations, and save them as SVG graphics files.\n",
    "3. Click the Kernel >> \"Restart Kernel and Clear All Outputs\" button at the top-left of the screen. This will remove all outputs from your notebook and leave only the code, which will greatly reduce file size.\n",
    "4. Save the notebook, then drag-and-drop your `.ipynb` file to a [GitHub Gist](https://gist.github.com/).\n",
    "5. Submit at the [Google Form](https://forms.gle/D1S3AY1d7SuKobCp8).\n",
    "\n",
    "Congratulations on finishing!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}