Bug 1364243
# coding: utf-8
# The following analysis is in response to:
#
# *...It would be really nice to have someone who understands telemetry help us figure out what is actually going on here. For example, it would be good to know if there was a dropoff in telemetry ping submissions from nightly after the 128ms BHR patch landed on May 2.*
#
# via [Bug 1364243](https://bugzilla.mozilla.org/show_bug.cgi?id=1364243#c5)
#
# and to requests about ping size in Nightly.
#
#
# **TLDR**:
# * I don't see any dropoff in ping submissions after May 2nd in Nightly
# * 90th percentile ping size (in bytes, as measured by `sys.getsizeof(str(ping))`): Apr 11th = **382420.8**, May 11th = **408748.0**
# In[59]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from operator import add
import seaborn as sns
import sys
from moztelemetry import Dataset
sns.set(style='whitegrid')
get_ipython().magic(u'matplotlib inline')
plt.rcParams["figure.figsize"] = (15, 10)
# ## Main Ping Submissions On Nightly
# In[22]:
# load main_summary for relevant dates and channel
# nightly is small enough to look at 100% of data :)
ms = (sqlContext.read.option("mergeSchema", True)
      .parquet("s3://telemetry-parquet/main_summary/v4")
      .filter("submission_date_s3 >= '20170410'")
      .filter("submission_date_s3 <= '20170517'")
      .filter("app_name = 'Firefox'")
      .filter("normalized_channel = 'nightly'"))
# In[23]:
# count pings by day and ping reason in case a certain
# reason is especially affected
daily_time_series_by_reason = ms.groupBy(["submission_date_s3", "reason"]).count().toPandas()
# In[24]:
# convert date for plotting
daily_time_series_by_reason.submission_date_s3 = pd.to_datetime(
    daily_time_series_by_reason.submission_date_s3, format='%Y%m%d')
# In[25]:
plt.rcParams["figure.figsize"] = (15, 10)
# iterate through reasons to plot multi-line time series
fig, ax = plt.subplots(1, 1)
for reason, data in daily_time_series_by_reason.groupby('reason'):
    data.plot(x='submission_date_s3', y='count', ax=ax, label=str(reason))
# add marker for May 2nd and some formatting
marker = pd.to_datetime('20170502', format='%Y%m%d')
plt.axvline(x=marker, color='grey', linestyle='dashed')
plt.text(marker, 55000, ' May 2nd')
plt.legend(bbox_to_anchor=(1.2, 1))
plt.title("Total Ping Submissions in Nightly by Day and Reason")
plt.ylabel('Count')
# I've used a long window to show the day-of-week seasonality apparent in the chart. Although the line for `shutdown` main pings drops after May 2nd, the same drop appears in previous weeks, so this is expected behavior. Nothing after May 2nd looks out of the ordinary.
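# One way to check the day-of-week argument numerically is to compare each day with the same weekday one week earlier. The sketch below is illustrative only: `week_over_week_change` is a hypothetical helper and the counts are made-up numbers, not values from `main_summary`.

```python
from datetime import datetime, timedelta


def week_over_week_change(daily_counts):
    """Relative change of each day's count versus the same weekday a week earlier.

    daily_counts maps 'YYYYMMDD' strings to ping counts (hypothetical input shape).
    Days without a usable count seven days earlier are skipped.
    """
    changes = {}
    for day, count in daily_counts.items():
        previous = datetime.strptime(day, '%Y%m%d') - timedelta(days=7)
        previous_key = previous.strftime('%Y%m%d')
        previous_count = daily_counts.get(previous_key)
        if previous_count:
            changes[day] = (count - previous_count) / float(previous_count)
    return changes


# Made-up counts for illustration only.
example_counts = {
    '20170425': 50000,
    '20170426': 52000,
    '20170502': 49000,
    '20170503': 52000,
}
changes = week_over_week_change(example_counts)
```

# Values close to zero around May 2nd would support the reading that the drop is ordinary weekly seasonality rather than a change in submissions.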
# # Ping Size
#
# Now let's look at how many pings fall into extreme ping size buckets from April 11th to the present. The most efficient way to see any difference is to anchor a threshold, here the 90th percentile ping size on April 11th, and count how many pings exceed that threshold on each following day. Because we are using a 50% sample, the raw counts are not very informative; focus on the general trend.
# In[97]:
pings = (Dataset.from_source("telemetry")
         .where(docType='main')
         .where(submissionDate=lambda x: x > '20170410' and x < '20170516')
         .where(appUpdateChannel='nightly')
         .where(appName='Firefox')
         .records(sc, sample=.5))
# In[98]:
# ping size is measured as sys.getsizeof(str(ping)) throughout
threshold_apr11 = np.percentile(
    pings.filter(lambda x: x.get('meta', {}).get('submissionDate') == '20170411')
         .map(lambda x: sys.getsizeof(str(x)))
         .collect(),
    90)
# In[99]:
# count pings per day whose size is at or above the April 11th threshold
daily_count = (pings.filter(lambda x: sys.getsizeof(str(x)) >= threshold_apr11)
               .map(lambda x: (x.get('meta', {}).get('submissionDate'), 1))
               .reduceByKey(add)
               .collect())
# In[100]:
df = pd.DataFrame(daily_count)
df.columns = ['submission_date', 'count']
# In[101]:
df['submission_date'] = pd.to_datetime(df['submission_date'], format='%Y%m%d')
# In[106]:
title = "Number of pings above April 11th's 90th percentile ping size"
df.plot(x='submission_date', title=title, legend=None)
plt.ylabel('Count')
# The above plot shows the number of pings per day whose size exceeds the 90th percentile ping size on April 11th. There appears to be a steady increase, but no clear relationship to May 2nd. The increase could be attributed to ongoing additions to the main ping, but this is just speculation.
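# Raw counts above a fixed threshold also move with daily ping volume, so a complementary view is the share of each day's pings at or above the baseline threshold. The following is a minimal sketch on synthetic sizes, not the analysis itself: `percentile` and `share_above` are hypothetical helpers, and `percentile` assumes the linear-interpolation definition that `np.percentile` uses by default.

```python
def percentile(values, p):
    """p-th percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    rank = (p / 100.0) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def share_above(sizes_by_day, baseline_day, p=90):
    """Fraction of each day's sizes at or above the baseline day's p-th percentile."""
    threshold = percentile(sizes_by_day[baseline_day], p)
    return {
        day: sum(1 for size in sizes if size >= threshold) / float(len(sizes))
        for day, sizes in sizes_by_day.items()
    }


# Synthetic sizes for illustration only.
example_sizes = {
    '20170411': list(range(100, 1100, 100)),  # 100, 200, ..., 1000
    '20170412': list(range(200, 1200, 100)),  # 200, 300, ..., 1100
}
shares = share_above(example_sizes, '20170411')
```

# A rising share would point to larger pings; a rising count with a flat share would point to more pings being submitted.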
# In[103]:
threshold_may11 = np.percentile(
    pings.filter(lambda x: x.get('meta', {}).get('submissionDate') == '20170511')
         .map(lambda x: sys.getsizeof(str(x)))
         .collect(),
    90)
# In[104]:
# the 90th percentile for ping size on apr11 and may11 in bytes
threshold_apr11, threshold_may11
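# A note on the size measure: `sys.getsizeof(str(x))` reports the in-memory size of the Python string built from the record, which includes interpreter overhead and is not the same as the serialized payload length. The sketch below contrasts the two on a made-up, ping-like dict. `serialized_size` and `repr_size` are hypothetical helper names, and treating the UTF-8 JSON length as the quantity of interest is an assumption.

```python
import json
import sys


def serialized_size(ping):
    """Number of bytes in the UTF-8 JSON encoding of a ping-like dict."""
    return len(json.dumps(ping).encode('utf-8'))


def repr_size(ping):
    """The proxy used in this analysis: in-memory size of str(ping)."""
    return sys.getsizeof(str(ping))


# Made-up record for illustration only.
example_ping = {
    'meta': {'submissionDate': '20170411'},
    'payload': {'histograms': {}},
}
json_bytes = serialized_size(example_ping)
proxy_bytes = repr_size(example_ping)
```

# Because the same proxy is applied to every day, the comparison between April 11th and May 11th remains like for like, even if the absolute byte values differ from the serialized sizes.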