@benmiroglio /Bug 1364243.ipynb Secret
Last active May 17, 2017

Bug 1364243
# coding: utf-8
# The following analysis is in response to:
#
# *...It would be really nice to have someone who understands telemetry help us figure out what is actually going on here. For example, it would be good to know if there was a dropoff in telemetry ping submissions from nightly after the 128ms BHR patch landed on May 2.*
#
# via [Bug 1364243](https://bugzilla.mozilla.org/show_bug.cgi?id=1364243#c5)
#
# and to requests about ping size in Nightly
#
#
# **TLDR**:
# * I don't see any dropoff in ping submissions after May 2nd in Nightly
# * 90th percentile ping size (in bytes): Apr 11th = **382420.8**, May 11th = **408748.0**
# In[59]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from operator import add
import seaborn as sns
import sys
from moztelemetry import Dataset
sns.set(style='whitegrid')
get_ipython().magic(u'matplotlib inline')
plt.rcParams["figure.figsize"] = (15, 10)
# ## Main Ping Submissions On Nightly
# In[22]:
# load main_summary for relevant dates and channel
# nightly is small enough to look at 100% of data :)
ms = (
    sqlContext.read.option("mergeSchema", True)
    .parquet("s3://telemetry-parquet/main_summary/v4")
    .filter("submission_date_s3 >= '20170410'")
    .filter("submission_date_s3 <= '20170517'")
    .filter("app_name = 'Firefox'")
    .filter("normalized_channel = 'nightly'")
)
# In[23]:
# count pings by day and ping reason in case a certain
# reason is especially affected
daily_time_series_by_reason = ms.groupBy(["submission_date_s3", "reason"]).count().toPandas()
# In[24]:
# convert date for plotting
daily_time_series_by_reason.submission_date_s3 = pd.to_datetime(daily_time_series_by_reason.submission_date_s3, format='%Y%m%d')
# In[25]:
plt.rcParams["figure.figsize"] = (15, 10)
# iterate through reasons to plot multi-line time series
fig, ax = plt.subplots(1,1)
for reason, data in daily_time_series_by_reason.groupby('reason'):
    data.plot(x='submission_date_s3', y='count', ax=ax, label=str(reason))
# add marker for may 2nd and some formatting
marker = pd.to_datetime('20170502', format='%Y%m%d')
plt.axvline(x=marker,color='grey', linestyle='dashed')
plt.text(marker, 55000, ' May 2nd')
plt.legend(bbox_to_anchor=(1.2, 1))
plt.title("Total Ping Submissions in Nightly by Day and Reason")
plt.ylabel('Count')
# I've allowed for a long window to reveal the day-of-week seasonality apparent in the chart. Although the line for `shutdown` main pings drops after May 2nd, the same drop recurs in previous weeks, so it is expected behavior. Nothing after May 2nd looks out of the ordinary.
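# One way to make the "same drop recurs in previous weeks" check explicit is to compare each day's count with the count seven days earlier, which cancels out the day-of-week pattern. The sketch below is only an illustration: `week_over_week_change` and `toy_counts` are hypothetical names, and the counts are made up rather than taken from main_summary.

```python
from datetime import datetime, timedelta


def week_over_week_change(daily_counts):
    """Relative change of each day's count vs. the same weekday one week earlier.

    daily_counts: dict mapping 'YYYYMMDD' strings to ping counts (assumed shape).
    Days with no usable count seven days earlier are skipped.
    """
    changes = {}
    for day, count in daily_counts.items():
        previous_day = datetime.strptime(day, "%Y%m%d") - timedelta(days=7)
        previous_key = previous_day.strftime("%Y%m%d")
        previous_count = daily_counts.get(previous_key, 0)
        if previous_count > 0:
            changes[day] = (count - previous_count) / float(previous_count)
    return changes


# Made-up counts, for illustration only.
toy_counts = {"20170425": 1000, "20170426": 1200, "20170502": 950, "20170503": 1260}
changes = week_over_week_change(toy_counts)
```

# A change close to zero for the days after May 2nd would support reading the drop as ordinary weekly seasonality.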
# ## Ping Size
#
# Now let's look at how many pings fall into extreme ping size buckets from April 11th to present. The most efficient way to see any difference is to first anchor a threshold, say the 90th-percentile ping size on April 11th, and then count how many pings exceed that threshold on each day going forward. The raw counts aren't very informative since we are using a 50% sample; focus on the general trend instead.
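# The anchoring idea can be sketched on toy data without Spark. This is a minimal illustration, not the analysis itself: `percentile`, `count_above_threshold`, and `toy_records` are hypothetical names, the `(submission_date, size)` record shape is an assumption, and the sizes are invented. The `percentile` helper is a stand-in that uses the same linear interpolation as the default of `np.percentile`.

```python
from collections import Counter


def percentile(values, q):
    """q-th percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("cannot take a percentile of empty data")
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def count_above_threshold(records, anchor_date, q=90):
    """Anchor a size threshold on one day, then count exceedances per day.

    records: iterable of (submission_date, size) pairs (assumed shape).
    Returns (threshold, {submission_date: count of sizes >= threshold}).
    """
    records = list(records)
    anchor_sizes = [size for date, size in records if date == anchor_date]
    threshold = percentile(anchor_sizes, q)
    counts = Counter(date for date, size in records if size >= threshold)
    return threshold, dict(counts)


# Invented sizes, for illustration only.
toy_records = (
    [("20170411", size) for size in range(100, 1100, 100)]
    + [("20170412", size) for size in (200, 950, 1200, 1500)]
)
threshold, daily_counts = count_above_threshold(toy_records, "20170411")
```

# By construction roughly 10% of the anchor day's pings sit at or above the threshold, so a later day with a clearly larger share points to growth in the upper tail of ping sizes.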
# In[97]:
pings = (
    Dataset.from_source("telemetry")
    .where(docType='main')
    .where(submissionDate=lambda x: x > '20170410' and x < '20170516')
    .where(appUpdateChannel='nightly')
    .where(appName='Firefox')
    .records(sc, sample=.5)
)
# In[98]:
threshold_apr11 = np.percentile(
    pings.filter(lambda x: x.get('meta', {}).get('submissionDate') == '20170411')
    .map(lambda x: sys.getsizeof(str(x)))
    .collect(),
    90
)
# In[99]:
daily_count = (
    pings.filter(lambda x: sys.getsizeof(str(x)) >= threshold_apr11)
    .map(lambda x: (x.get('meta', {}).get('submissionDate'), 1))
    .reduceByKey(add)
    .collect()
)
# In[100]:
df = pd.DataFrame(daily_count)
df.columns = ['submission_date', 'count']
# In[101]:
df['submission_date'] = pd.to_datetime(df['submission_date'], format='%Y%m%d')
# In[106]:
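title = "Number of pings above April 11th's 90th percentile ping size"
# reduceByKey returns rows in no particular order, so sort by date before drawing the line
df.sort_values('submission_date').plot(x='submission_date', title=title, legend=None)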
plt.ylabel('Count')
# The above plot shows the number of pings whose size exceeds the 90th-percentile size on April 11th. There appears to be a steady increase, but no clear change around May 2nd. The increase could be attributed to ongoing additions to the main ping, but this is just speculation.
# In[103]:
threshold_may11 = np.percentile(
    pings.filter(lambda x: x.get('meta', {}).get('submissionDate') == '20170511')
    .map(lambda x: sys.getsizeof(str(x)))
    .collect(),
    90
)
# In[104]:
# the 90th percentile for ping size on apr11 and may11 in bytes
threshold_apr11, threshold_may11
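# A caveat on units: `sys.getsizeof(str(x))` reports the in-memory size of the Python string object, which includes interpreter overhead, so the thresholds above approximate the ping size rather than measure the bytes actually submitted. The sketch below contrasts that with the length of a UTF-8 JSON serialization. It is only an illustration: `approx_size_getsizeof`, `serialized_size_bytes`, and `toy_ping` are hypothetical names, and JSON via `json.dumps` is an assumed stand-in for however pings are really serialized.

```python
import json
import sys


def approx_size_getsizeof(ping):
    """Size measure used in the cells above: memory footprint of str(ping)."""
    return sys.getsizeof(str(ping))


def serialized_size_bytes(ping):
    """Alternative measure (assumption): byte length of the ping as UTF-8 JSON."""
    return len(json.dumps(ping).encode("utf-8"))


# Invented ping-like dict, for illustration only.
toy_ping = {
    "meta": {"submissionDate": "20170411"},
    "payload": {"histograms": [0] * 50},
}
in_memory = approx_size_getsizeof(toy_ping)
on_the_wire = serialized_size_bytes(toy_ping)
```

# Because the same measure is applied to both April 11th and May 11th, the comparison between the two thresholds still holds as a relative trend.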