Last active
August 18, 2017 18:06
-
-
Save bsmedberg/0f073efaa27e8e63b890df9b884e47f7 to your computer and use it in GitHub Desktop.
tab-usage-release-users
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# coding: utf-8 | |
# # Tab/Pageload Behavior of Firefox Release Users | |
# | |
# Author: bsmedberg@mozilla.com | |
# | |
# Date: 8-May-2017 | |
# | |
# Status: Partially-reviewed draft. | |
# In[1]: | |
import plotly.plotly as py | |
from plotly.graph_objs import * | |
import pandas as pd | |
import matplotlib.pyplot as plt | |
import pyspark.sql.functions | |
import numpy as np | |
# In[2]: | |
main_summary = spark.read.option("mergeSchema", "true") .parquet('s3://telemetry-parquet/main_summary/v3/') .createOrReplaceTempView("main_summary") | |
# In[37]: | |
sample_unfiltered = spark.sql(""" | |
SELECT | |
client_id, | |
SUM(CASE WHEN app_version < '52' THEN 1 ELSE 0 END) AS bad_version, | |
count(1) as subsession_count, | |
max(max_concurrent_tab_count) as max_concurrent_tab_count, | |
sum(tab_open_event_count) as tab_open_event_count, | |
SUM(CASE WHEN subsession_length < 0 OR subsession_length > 172800 THEN 1 ELSE 0 END) AS bad_length, | |
sum(subsession_length) as subsession_length_sum, | |
sum(total_uri_count) as total_uri_count, | |
SUM(CASE WHEN reason = 'aborted-session' THEN 1 ELSE 0 END) AS aborted_sessions, | |
SUM(CASE WHEN reason = 'shutdown' THEN 1 ELSE 0 END) AS shutdown_session, | |
SUM(CASE WHEN reason = 'environment-change' THEN 1 ELSE 0 END) AS environment_session, | |
SUM(CASE WHEN reason = 'daily' THEN 1 ELSE 0 END) AS daily_session | |
FROM main_summary | |
WHERE | |
submission_date_s3 >= '20170415' AND submission_date_s3 <= '20170429' | |
AND subsession_start_date >= '2017-04-16' AND subsession_start_date < '2017-04-23' | |
GROUP BY 1 | |
HAVING bad_version = 0 AND bad_length = 0 | |
""" | |
).cache() | |
# In[38]: | |
sample_unfiltered.count() | |
# # Discarding severe outliers | |
# | |
# When defining heavy users in terms of total pageloads, we can suffer from bots and other unreal situations that inflate pageloads. Discard anyone who loads more than a page every 10 seconds or more than 30000/week (because WTF). | |
# In[39]: | |
total_uri_count_unfiltered = sample_unfiltered.select(pyspark.sql.functions.sum(sample_unfiltered.total_uri_count)).first()[0] | |
total_uri_count_unfiltered | |
# In[40]: | |
outliers = sample_unfiltered.where((sample_unfiltered.total_uri_count > 30000) | (sample_unfiltered.subsession_length_sum / sample_unfiltered.total_uri_count < 10)) | |
outliers_uri_count = outliers.select(pyspark.sql.functions.sum(outliers.total_uri_count)).first()[0] | |
print "total outliers: {}".format(outliers.count()) | |
print "total URIs loaded by outliers: {} ({:.0f}%)".format(outliers.count(), 100.0 * outliers_uri_count / total_uri_count_unfiltered) | |
print "example outliers:" | |
outliers.select("subsession_count", "max_concurrent_tab_count", "tab_open_event_count", "subsession_length_sum", "total_uri_count").sample(False, 0.001).toPandas() | |
# In[61]: | |
# how many users just have >30000 uri loads? | |
sample_unfiltered.where(sample_unfiltered.total_uri_count > 30000).count() | |
# In[41]: | |
sample = sample_unfiltered.where((sample_unfiltered.total_uri_count <= 30000) & (sample_unfiltered.subsession_length_sum / sample_unfiltered.total_uri_count >= 10)) | |
sample.createOrReplaceTempView("sample") | |
# ## Heavy users | |
# | |
# There are multiple definitions of a heavy user being proposed. We can't calculate heavy users by activeTicks as proposed at https://docs.google.com/a/mozilla.com/document/d/1TJUo1dbTA5mcWg5ZznKqYn3mpSqQeaDlqCRzUsi0_Js/edit?usp=sharing because this dataset doesn't yet have activeTicks. | |
# | |
# Instead, this calculates it by pageloads: at what threshold of pageloads (total_uri_count) do we find the users who are responsible for 80% of all pageloads? | |
# In[42]: | |
total_users = sample.count() | |
total_uri_count = sample.select(pyspark.sql.functions.sum(sample.total_uri_count)).first()[0] | |
# In[43]: | |
ranked_by_uri = spark.sql(""" | |
SELECT | |
total_uri_count, | |
RANK() OVER (ORDER BY total_uri_count) AS rank | |
FROM sample | |
ORDER BY total_uri_count | |
""").cache() | |
# In[44]: | |
def top_npct_uris(n): | |
topn = ranked_by_uri.where(ranked_by_uri.rank > total_users * (100.0 - n) / 100) | |
cutoff = topn.first().total_uri_count | |
uri_count = topn.select(pyspark.sql.functions.sum(sample.total_uri_count)).first()[0] | |
print "The top {}% of users accounted for {:.0f}% of total URI loads. (cutoff={})".format(n, 100.0 * uri_count / total_uri_count, cutoff) | |
return cutoff | |
top_npct_uris(1) | |
top_npct_uris(5) | |
heavy_user_cutoff = top_npct_uris(10) | |
_ = top_npct_uris(30) | |
# For the purposes of this analysis and as suggested by bcolloran, we're going to look at the top 10% of users as "heavy users". | |
# In[45]: | |
print "Heavy users are users who loaded more than {} pages this week.".format(heavy_user_cutoff) | |
# ## Do page loads distribute differently than usage hours | |
# | |
# bcolloran did a previous analysis of the distribution of session hours and active ticks. We don't have active ticks yet, but let's calculate the distribution of usage hours to double-check our data and repeat the previous analysis | |
# In[46]: | |
total_session_length = sample.select(pyspark.sql.functions.sum(sample.subsession_length_sum)).first()[0] | |
# In[47]: | |
ranked_by_session_length = spark.sql(""" | |
SELECT | |
subsession_length_sum, | |
RANK() OVER (ORDER BY subsession_length_sum) AS rank | |
FROM sample | |
ORDER BY subsession_length_sum | |
""").cache() | |
# In[48]: | |
def top_npct_length(n): | |
topn = ranked_by_session_length.where(ranked_by_session_length.rank > total_users * (100.0 - n) / 100) | |
cutoff = topn.first().subsession_length_sum | |
length_sum = topn.select(pyspark.sql.functions.sum(sample.subsession_length_sum)).first()[0] | |
print "The top {}% of users accounted for {:.0f}% of total session hours. (cutoff={:.1f} hours)".format(n, 100.0 * length_sum / total_session_length, cutoff / 60.0 / 60.0) | |
top_npct_length(1) | |
top_npct_length(5) | |
top_npct_length(10) | |
top_npct_length(30) | |
# ## Comparing light and heavy user behavior | |
# | |
# The following charts compare the behavior of heavy users and normal users. | |
# In[49]: | |
def iter_with_next(l): | |
"Generate (first, next) pairs for a list (skipping the last item)." | |
i = iter(l) | |
prev = i.next() | |
for v in i: | |
yield (prev, v) | |
prev = v | |
# In[50]: | |
def show_histogram(colname, minv, maxv, num_buckets, label): | |
buckets = range(minv, maxv, (maxv - minv) / num_buckets) + [maxv] | |
selected = sample.select(pyspark.sql.functions.expr(colname).alias("data"), "total_uri_count") | |
label_map = { | |
-1: "<{}".format(minv), | |
maxv: ">={}".format(maxv) | |
} | |
caseExpr = pyspark.sql.functions.when(selected['data'] < minv, -1) | |
for start, end in iter_with_next(buckets): | |
caseExpr = caseExpr.when(selected['data'] < end, start) | |
label_map[start] = start == end - 1 and str(start) or "{}-{}".format(start, end - 1) | |
caseExpr = caseExpr.otherwise(maxv) | |
def print_chart(data, subgroup): | |
bucketed = data.select(caseExpr.alias("bucket")) | |
counts = bucketed.groupBy("bucket").count().toPandas().set_index("bucket").sort_index() | |
max_count = counts.max()['count'] | |
total_count = counts.sum()['count'] | |
counts.plot(kind="bar", legend=None, figsize=((num_buckets + 2) / 3, float(max_count) / total_count * 5)) | |
xticks = [label_map[v] for v in counts.index.values] | |
plt.xticks(range(0, num_buckets + 1), xticks) | |
plt.xlabel(label) | |
ytick_ratios = np.arange(0.05, float(max_count) / total_count, 0.05) | |
yticks = ["{:.0f}%".format(r * 100) for r in ytick_ratios] | |
yticks_pos = [r * total_count for r in ytick_ratios] | |
plt.yticks(yticks_pos, yticks) | |
plt.ylabel("% of {} users".format(subgroup)) | |
plt.show() | |
print_chart(selected, "all") | |
print_chart(selected.where(selected['total_uri_count'] > heavy_user_cutoff), "heavy") | |
# In[51]: | |
show_histogram("subsession_length_sum / 60 / 60", 0, 160, 20, "usage hours") | |
# Possible conclusions: | |
# | |
# 1. heavy users tend to use their browser more (let's state the obvious!) | |
# 2. heavy users are more likely to be the kind of user who keeps their browser open overnight. | |
# In[52]: | |
show_histogram("subsession_count", 0, 100, 10, "# of subsessions") | |
# No easy conclusions here, although there appear to be a bunch of heavy users who still open and close their browser regularly? Should validate this with the ping reason data. | |
# In[53]: | |
show_histogram("max_concurrent_tab_count", 1, 25, 12, "max # of tabs open any time during the week") | |
# Conclusions: heavy users tend toward a few more tabs in general, and there is a more significant proportion of tab-hoarders who regularly keep >25 tabs open. | |
# In[54]: | |
show_histogram("tab_open_event_count", 0, 2000, 25, "# of tabs opened") | |
# In[67]: | |
# how many users have a null tab_open_event_count? This could mean they didn't open any tabs... need to check how this behaves more precisely. | |
null_tabopen_count = sample.where(pyspark.sql.functions.isnull(sample.tab_open_event_count)).count() | |
print "users with no tab open count: {:,} ({:.0f}%)".format(null_tabopen_count, 100.0 * null_tabopen_count / total_users) | |
# In[55]: | |
show_histogram("total_uri_count", 0, 10000, 25, "Total toplevel URIs loaded") | |
# In[56]: | |
show_histogram("total_uri_count / subsession_length_sum * 60 * 60", 0, 1000, 25, "pageloads per session-hour") | |
# In[57]: | |
show_histogram("total_uri_count / tab_open_event_count", 0, 500, 25, "page loads per tab-open-event") | |
# ## Other checking | |
# In[58]: | |
show_histogram("daily_session", 0, 8, 7, "'daily' pings (evidence of leaving browser on overnight)") | |
# Conclusion: **confirmed** that heavy users are much more likely to leave their browser open overnight. | |
# In[59]: | |
show_histogram("aborted_sessions", 0, 20, 10, "'aborted-session' ping count (evidence of crash or abrupt logoff)") | |
# An environment-change ping can be caused by changing some settings, installing/enabling/disabling addons. It probably doesn't mean much. | |
# In[60]: | |
show_histogram("environment_session", 0, 20, 10, "'environment-change' ping count") | |
To close the loop on this, there was an error in the first version of this sheet which @bcolloran detected and has since been fixed. This has been rereviewed and this statement is perhaps the key takeaway: "The top 10% of users accounted for 48% of total URI loads. (cutoff=1490)"
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
(copying from email thread per request from rweiss)
"""
Hi Benjamin, this looks great. I did notice one thing that I think bears looking into "Users with more than 3741 page loads per week account for 80% of all page loads. This is 1.4% of all users." -- In terms of session hours, the top 10% of users accounted for only about 58% of of all session hours recorded; the top 1% for only about 10% of session hours (see https://docs.google.com/document/d/1cctqvm7fbaPemb1C900KGQJsO6GsBsgUIGJLVJV9jKo/edit# ; unfortunately I don't have similar breakdowns for activeTicks or URI loads). The relationship may well be different for URI loads, but the top 1.4% of users accounting for 80% of URI loads is so sharply different that it raises a red flag and should be investigated further. I wonder if there might be an outlier in your sample that is singlehandedly accounting for some crazily large proportion of URI loads or something like that? FWIW, I think you would be justified in throwing away or truncating stratospherically high page load counts -- it seems fine to impose some king of physically realistic limit on what a human can do in a day (something like: 16hrs/day * 3600secs/hr * .1 clicks/sec = 5760 clicks/day -- this is totally made up, but I think it's acceptable to make stuff like this up if you call out your assumptions).
FWIW, using URI loads rather than activeTicks is fine, that is the direction we want to head eventually. We were waiting for the URI measure to reach a 'big enough' (whatever that means) share of the population, but since you are focusing on clients with sufficiently new versions, URI loads and activeTicks are strongly enough correlated that either will work fine.
For most of the heavy user stuff, we've been aiming for the top 10% of the population. So you should be able to skip the cumulative sum of total_uri_count step -- just sort users by total URI loads and mark the top 10% as the heavy users. Does that make sense?
Interesting stuff, please keep me posted!
"""