Skip to content

Instantly share code, notes, and snippets.

@rmoff
Created October 14, 2022 18:00
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save rmoff/e5cd6e04fc54e592fc4c232863252607 to your computer and use it in GitHub Desktop.
Save rmoff/e5cd6e04fc54e592fc4c232863252607 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"id": "08ce7313",
"metadata": {},
"source": [
"# Current 22 - Session Analysis\n",
"\n",
"_Mainly to show off using DuckDB - in a notebook! See [docs](https://duckdb.org/docs/guides/python/jupyter)_"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec6bcd4c",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import sys\n",
"!{sys.executable} -m pip install duckdb notebook pandas ipython-sql SQLAlchemy duckdb-engine"
]
},
{
"cell_type": "markdown",
"id": "c760cd74",
"metadata": {},
"source": [
"Import the relevant libraries."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "efb68f9d",
"metadata": {},
"outputs": [],
"source": [
"import duckdb\n",
"import pandas as pd\n",
"import sqlalchemy\n",
"\n",
"# Import ipython-sql Jupyter extension to create SQL cells\n",
"%load_ext sql"
]
},
{
"cell_type": "markdown",
"id": "f11e811d",
"metadata": {},
"source": [
"Set configrations on ipython-sql to directly output data to Pandas and to simplify the output that is printed to the notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4cbff209",
"metadata": {},
"outputs": [],
"source": [
"%config SqlMagic.autopandas = True\n",
"%config SqlMagic.feedback = False\n",
"%config SqlMagic.displaycon = False"
]
},
{
"cell_type": "markdown",
"id": "f9632ccf",
"metadata": {},
"source": [
"Connect ipython-sql to DuckDB using a SQLAlchemy-style connection string. You may either connect to an in memory DuckDB, or a file backed db."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59998a83",
"metadata": {},
"outputs": [],
"source": [
"%sql duckdb:////work/current_data.duckdb"
]
},
{
"cell_type": "markdown",
"id": "b9abe4f3",
"metadata": {},
"source": [
"## Load the raw data"
]
},
{
"cell_type": "markdown",
"id": "cd7274c3",
"metadata": {},
"source": [
"The data was provided in an Excel sheet which I've exported to a CSV file."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "32795e6c",
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"\n",
"create table raw_data as \n",
" select * from read_csv_auto('/work/Current 2022_Session Rating Detail.xlsx - Sheet 1.csv');"
]
},
{
"cell_type": "markdown",
"id": "cd42a83d-bdfa-42de-9809-4625ba24fbd7",
"metadata": {},
"source": [
"The row count here 👆️ matches the row count of the source file ✅\n",
"\n",
"Now let's see what the schema looks like. It's automagically derived from the CSV column headings and field values. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "515c7c4f",
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"\n",
"describe raw_data;"
]
},
{
"cell_type": "markdown",
"id": "cb43030c-6e38-4ccf-a7d2-a0ee044df936",
"metadata": {},
"source": [
"Let's wrangle the data a bit. \n",
"\n",
"It'd be nice to get the `Start Time` as a timestamp (currently a `VARCHAR`). Everything else looks OK from a data type point of view. \n",
"\n",
"We'll also drop some fields that we don't need or want. For example, we don't want sensitive information like the names of the people who left the ratings."
]
},
{
"cell_type": "markdown",
"id": "58844c06-3169-4387-81b4-135b41b28284",
"metadata": {},
"source": [
"### Converting the `VARCHAR` timestamp field to a real `TIMESTAMP`"
]
},
{
"cell_type": "markdown",
"id": "a0b01e89-391b-4ddc-bbfb-704e90122d75",
"metadata": {},
"source": [
"Let's first check some of the values and check out the nifty `SAMPLE` function"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "463a8c03-3367-488a-8122-68645a59a05b",
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"\n",
"SELECT \"Start Time\" FROM raw_data USING SAMPLE 5;"
]
},
{
"cell_type": "markdown",
"id": "599ed425-737c-48dd-ac3e-cfd13e05f018",
"metadata": {},
"source": [
"So the format is a mixture of US date (month / day / year) and 24hr time. \n",
"\n",
"Before going ahead with the transform let's just do a dry-run to make sure we've got our [format string](https://duckdb.org/docs/sql/functions/dateformat#format-specifiers) correct. (_I just noticed on that page too that I could have specified this at the point at which [the CSV file was read](https://duckdb.org/docs/sql/functions/dateformat#csv-parsing)_)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0e1a4d90-038c-49dd-9532-67b00aef0941",
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"\n",
"SELECT \"Start Time\", \n",
" strptime(\"Start Time\",'%-m/%-d/%-y %-H:%M') as start_ts,\n",
" strftime(strptime(\"Start Time\",'%-m/%-d/%-y %-H:%M'), '%-H:%M') as start_time\n",
" FROM raw_data \n",
"USING SAMPLE 5;"
]
},
{
"cell_type": "markdown",
"id": "6f9934f5-7b1b-4868-b1ba-30734b5fc3ac",
"metadata": {},
"source": [
"### Ratings are left by different types of attendee\n",
"\n",
"The system categorised attendees not only as in-person or virtual, but also based on whether they were sponsors, etc. \n",
"\n",
"I'm just interested in \"was the person there\" or \"was the person watching it on their computer elsewhere\"?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad0e3af1-f4e0-4341-b920-40c173bec72c",
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"\n",
"WITH A AS (\n",
"SELECT CASE WHEN \"Attendee type\"='Virtual' \n",
" THEN 'Virtual'\n",
" ELSE 'In-person' END AS attendee_type\n",
" FROM raw_data) \n",
"\n",
"SELECT attendee_type,count(*) as ratings_ct FROM A group by attendee_type;"
]
},
{
"cell_type": "markdown",
"id": "88b7ae67-d1d2-4bbf-b718-381f530f8def",
"metadata": {},
"source": [
"### Go Go Gadget\n",
"\n",
"Let's transform the data!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc625741-7cff-46a6-8267-9708a6dde365",
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"\n",
"DROP TABLE IF EXISTS current_ratings; \n",
"\n",
"CREATE TABLE current_ratings AS \n",
" SELECT SessionID as ID,\n",
" title AS session, \n",
" strptime(\"Start Time\",'%-m/%-d/%-y %-H:%M') as start_ts,\n",
" CASE WHEN \"Attendee type\"='Virtual' \n",
" THEN 'Virtual'\n",
" ELSE 'In-person' END AS attendee_type,\n",
" \"Rating Type\" as rating_type,\n",
" rating,\n",
" comment\n",
" FROM raw_data;"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7b19e89-d16a-438e-af1a-b6068517254c",
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"\n",
"DESCRIBE current_ratings;"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "989d8ebc-60ae-4647-a1fb-953cdd37a03a",
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"\n",
"SELECT id, start_ts, attendee_type, rating_type, rating FROM current_ratings USING SAMPLE 5;"
]
},
{
"cell_type": "markdown",
"id": "c36abbdd-852d-4993-afeb-f4bd7aed2d45",
"metadata": {},
"source": [
"## Here comes the analysis"
]
},
{
"cell_type": "markdown",
"id": "4c26c88e-a874-4c12-85a5-145839d2b1a9",
"metadata": {},
"source": [
"### 1. Was there a noticable difference between ratings given by virtual attendees vs in-person?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4f666bf4-5dfd-4260-b205-acdf6e970c21",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%sql\n",
"\n",
"SELECT attendee_type, rating_type, MEDIAN(rating)\n",
" FROM current_ratings\n",
"GROUP BY attendee_type, rating_type;"
]
},
{
"cell_type": "markdown",
"id": "ff71cc48-5299-4afb-9dea-0e6801da91fa",
"metadata": {},
"source": [
"A median score of 5 out of 5 tells us that the ratings were high across the board - but hides the nuances of the data. We could apply further percentile functions, but this is where visualisation comes into its own. "
]
},
{
"cell_type": "markdown",
"id": "a4591ccb-6516-4a3e-9153-dc99ebd47b69",
"metadata": {},
"source": [
"#### Using `bokeh` and `altair` for visualisation"
]
},
{
"cell_type": "markdown",
"id": "341736c9-c4e1-4c48-989c-468c2eb3dd13",
"metadata": {},
"source": [
"Import the [bokeh](https://docs.bokeh.org/en/latest/index.html) libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed6e6d26-7dda-43fc-9068-1167f27f9e29",
"metadata": {},
"outputs": [],
"source": [
"from bokeh.plotting import figure, show\n",
"from bokeh.transform import jitter\n",
"from bokeh.transform import factor_cmap, factor_mark\n",
"from bokeh.models import HoverTool\n",
"\n",
"output_notebook(hide_banner=True)"
]
},
{
"cell_type": "markdown",
"id": "592ec28a-6691-4dbe-af23-dd6497188122",
"metadata": {},
"source": [
"Load the ratings data into a dataframe for visualising"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "28065154-4114-4cac-863b-09ee61cfd250",
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"\n",
"attendee_df << SELECT *, strftime(start_ts, '%H:%M') as start_time FROM current_ratings ;"
]
},
{
"cell_type": "markdown",
"id": "d47e61e4-3d5c-4472-b43c-205d9eadfb14",
"metadata": {},
"source": [
"Use bokeh to plot the ratings, by rating type and attendee type"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "649101b2-6a1a-4c96-905e-aeef20e4337d",
"metadata": {},
"outputs": [],
"source": [
"p = figure(y_range=sorted(attendee_df.rating_type.unique()),\n",
" sizing_mode=\"fixed\",\n",
" toolbar_location=None,\n",
" title='Current 22: Ratings Distribution, broken down by attendee and rating type')\n",
"\n",
"\n",
"ix=factor_cmap('attendee_type',\n",
" palette=['green', 'orange'], \n",
" factors=sorted(attendee_df.attendee_type.unique()))\n",
" \n",
"p.scatter(x=jitter('rating',0.4),\n",
" y=jitter('rating_type',0.2,range=p.y_range),\n",
" color=ix,\n",
" source=attendee_df,\n",
" size=9, \n",
" alpha=0.4,\n",
" legend_group='attendee_type')\n",
"show(p)"
]
},
{
"cell_type": "markdown",
"id": "eb3a95e8-ec04-4e1b-9b7a-c6a2b5b814f2",
"metadata": {},
"source": [
"_This shows us that the concentration of the ratings were certainly favourable. Let's take another angle on it and use a histogram instead of scatterplot. For this I switched to `altair` because it supports the aggregation of data (using `count()`) which I'd need to figure out how to do otherwise with the dataframe - and that's one step of yak-shaving too far for today…_"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "578a5404-7c5d-4c16-997a-76f095933595",
"metadata": {},
"outputs": [],
"source": [
"import altair as alt\n",
"\n",
"chart = alt.Chart(attendee_df).mark_bar().encode(\n",
" column=alt.Column(\n",
" 'rating', \n",
" header=alt.Header(orient='bottom')\n",
" ),\n",
" x=alt.X('rating_type', axis=alt.Axis(ticks=False, labels=True, title='')),\n",
" y=alt.Y('count()', axis=alt.Axis(grid=False)),\n",
" color='attendee_type'\n",
").configure_view(\n",
" stroke=None,\n",
")\n",
"\n",
"chart.display()"
]
},
{
"cell_type": "markdown",
"id": "ec447b38-d1ce-4af4-9bc1-38d87c41c714",
"metadata": {},
"source": [
"This is OK, but it's the absolute values, of which there are disproportionally more for in-person than virtual. Can we plot the values as a percentage instead? \n",
"I hit the extent of my altair understanding here (I think [`transform_joinaggregate`](https://altair-viz.github.io/user_guide/transform/joinaggregate.html) might have helped but I got impatient), so resorted to SQL to help instead. \n",
"\n",
"DuckDB supports [window functions](https://duckdb.org/docs/sql/window_functions) (and has a rather good docs page to explain some of the concepts). I needed the total number of ratings by `attendee_type` so that I could work out the percentage of each of the counts, and this is what `sum(count(*)) over (partition by attendee_type)` gave me. First, I'll check the numbers that I'm expected; "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "195334df-dac8-4e3a-b704-58d5f38aa086",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"%%sql\n",
"\n",
"SELECT attendee_type, count(*)\n",
"FROM current_ratings \n",
"GROUP BY attendee_type;"
]
},
{
"cell_type": "markdown",
"id": "4cee7840-cd24-42df-a4f4-ad51f82a6566",
"metadata": {},
"source": [
"and then use the window function with the above numbers to check that it's what I'm expecting. \n",
"\n",
"Note the use of `CAST…AS FLOAT` — without this the numbers stay as integers and resolve to a big long list of zeroes…"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe7d974f-f73c-410d-9828-18951a4460e8",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"%%sql \n",
"\n",
"SELECT attendee_type, rating_type, rating, count(*) as rating_ct, \n",
" sum(count(*)) over (partition by attendee_type) as total_attendee_type_ratings,\n",
" (cast(count(*) as float) / cast(sum(count(*)) over (partition by attendee_type) as float))*100 as rating_attendee_type_pct\n",
"FROM current_ratings \n",
"GROUP BY attendee_type, rating_type, rating \n",
"ORDER BY rating_type,rating_attendee_type_pct DESC\n",
"FETCH FIRST 10 ROWS ONLY;"
]
},
{
"cell_type": "markdown",
"id": "7b6e8582-026e-47ce-a643-3f599f01c862",
"metadata": {},
"source": [
"Now we can load it into a dataframe and plot it like we did above"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9044fbed-cc5e-405d-9c5b-71ff9f8b6eb0",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"%%sql \n",
"\n",
"ratings_df << SELECT attendee_type, rating_type, rating, count(*) as rating_ct, \n",
" sum(count(*)) over (partition by attendee_type) as total_attendee_type_ratings,\n",
" (cast(count(*) as float) / cast(sum(count(*)) over (partition by attendee_type) as float))*100 as rating_attendee_type_pct\n",
"FROM current_ratings \n",
"GROUP BY attendee_type, rating_type, rating \n",
"ORDER BY attendee_type,rating_attendee_type_pct DESC;"
]
},
{
"cell_type": "markdown",
"id": "03c99736-214c-48cc-bb8f-ab399a765a96",
"metadata": {},
"source": [
"Let's plot the data with `altair` and facet the charts for ease of readability: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "07913b78-f552-4d07-a155-82da2488ecfb",
"metadata": {},
"outputs": [],
"source": [
"import altair as alt\n",
"from altair.expr import datum\n",
"\n",
"base = alt.Chart(ratings_df).mark_bar().encode(\n",
" column='rating',\n",
" x=alt.X('attendee_type', axis=alt.Axis(ticks=False, labels=True, title='')),\n",
" y=alt.Y('rating_attendee_type_pct', axis=alt.Axis(grid=False)),\n",
" color='attendee_type',\n",
" tooltip=['rating', 'rating_attendee_type_pct', 'attendee_type', 'rating_type']\n",
"\n",
")\n",
"\n",
"chart = alt.hconcat()\n",
"for r in sorted(ratings_df.rating_type.unique()):\n",
" chart |= base.transform_filter(datum.rating_type == r).properties(title=r)\n",
"chart.display()"
]
},
{
"cell_type": "markdown",
"id": "971718b4-776e-47da-8074-6521dda57f59",
"metadata": {},
"source": [
"#### Analysis conclusion: \n",
"\n",
"Across all three ratings categories, there is a marginal but present difference between how in-person attendees and virtual attendees rated sessions. Those in-person were more likely to rate a session `5` whilst those attending virtually did predominantly rate sessions `5` but if not `5` then more often `4` than those in person. "
]
},
{
"cell_type": "markdown",
"id": "0bf843ed-721c-4090-a1ad-8831c8b28e54",
"metadata": {},
"source": [
"### Analysis: Did Ratings vary of the course of the Day?\n",
"\n",
"Probably not a very scientific study, but with 7+ tracks of content and two days' worth of samples, here's how it looks if we take an average of the rating per timeslot: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a19837c-1e63-46ad-bcc9-da5dabc23a01",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"%%sql \n",
"\n",
"df << SELECT strftime(start_ts, '%H:%M') as start_time,attendee_type, avg(rating) as avg_rating\n",
"FROM current_ratings \n",
"WHERE rating_type='Overall Experience'\n",
"GROUP BY start_time, attendee_type \n",
"ORDER BY start_time, attendee_type ;\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e9cb04fd-7a0b-4591-bc9a-2a8338ab7fea",
"metadata": {},
"outputs": [],
"source": [
"import altair as alt\n",
"from altair.expr import datum\n",
"\n",
"chart = alt.Chart(df).mark_line(opacity=0.8,width=10).encode(\n",
" x='start_time',\n",
" y=alt.Y(\"avg_rating:Q\", stack=None),\n",
" color='attendee_type'\n",
")\n",
"\n",
"chart"
]
},
{
"cell_type": "markdown",
"id": "33345b21-af7a-4209-ac65-b1cbf86793c8",
"metadata": {},
"source": [
"It's not a great plot because the data is lumpy; some sessions were 45 minutes and other 10 minutes. Let's see if the [range framing](https://duckdb.org/docs/sql/window_functions#range-framing) in DuckDB will help us here to smooth it out a bit with a rolling hourly average: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "193e9369-9a00-4014-af23-9666508a6b07",
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"\n",
"df << SELECT attendee_type, start_ts, strftime(start_ts, '%H:%M') as start_time, strftime(start_ts, '%d %b %y') as date,\n",
" AVG(rating) OVER (\n",
" PARTITION BY attendee_type\n",
" ORDER BY start_ts ASC\n",
" RANGE BETWEEN INTERVAL 30 MINUTES PRECEDING\n",
" AND INTERVAL 30 MINUTES FOLLOWING)\n",
" AS avg_rating\n",
"FROM current_ratings\n",
"ORDER BY 1, 2;"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f36b5ec5-d6cd-4947-9b2a-c16422e50deb",
"metadata": {},
"outputs": [],
"source": [
"import altair as alt\n",
"from altair.expr import datum\n",
"\n",
"chart = alt.Chart(df).mark_line(opacity=0.8,width=10).encode(\n",
" x='start_time',\n",
" y=alt.Y(\"avg_rating:Q\", stack=None),\n",
" color='attendee_type', column='date'\n",
")\n",
"\n",
"\n",
"chart"
]
},
{
"cell_type": "markdown",
"id": "dbf4b8f6-a3aa-4730-bfc7-8c3e6ec59ba7",
"metadata": {},
"source": [
"That's much better. It also shows the scale isn't so useful, so let's adjust that"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d978118b-a266-4a4c-bcef-1a33e0307199",
"metadata": {},
"outputs": [],
"source": [
"import altair as alt\n",
"from altair.expr import datum\n",
"\n",
"chart = alt.Chart(df).mark_line(opacity=0.8,width=10).encode(\n",
" x='start_time',\n",
" y=alt.Y(\"avg_rating:Q\", stack=None,scale=alt.Scale(domain=[2.5, 5])),\n",
" color='attendee_type', column='date'\n",
")\n",
"\n",
"\n",
"chart"
]
},
{
"cell_type": "markdown",
"id": "fc6cecc1-e808-4301-8189-7ec8ace573e1",
"metadata": {},
"source": [
"Observations: \n",
"\n",
"1. Virtual attendees got happier as day 1 (4th Oct) went on.\n",
"2. Something happened around 13:00 on day 2 (5th Oct) that upset the Virtual attendees but not those in-person—perhaps a problem with the livestream? Let's remember that the scale here has been magnified, so this \"drop\" is only relative. \n",
"\n",
"Let's have a look at the second point here and see if we can figure out what's going on."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f9a76ec-600d-4429-a4ea-69403a87c54b",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"%%sql\n",
"\n",
"select start_ts, rating, comment \n",
"from current_ratings \n",
"where start_ts between '2022-10-05 12:00:00' and '2022-10-05 13:00:00' \n",
" and attendee_type='Virtual' "
]
},
{
"cell_type": "markdown",
"id": "c54eec19-51db-4929-8e70-1386dcac836a",
"metadata": {},
"source": [
"Here's the fun thing about conference feedback: it's patchy, it's completely subjective - and it's often contradictory. These are comments for **_the same session_** - one person loved it, the other thought it shouldn't have even been on the agenda. Whaddya gonna do? `¯\\_(ツ)_/¯`"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment