Build Reversion

# coding: utf-8

---
title: "Longitudinal Dataset Tutorial"
authors:
- vitillo
tags:
- tutorial
- examples
- dataset
- longitudinal
created_at: 2016-03-10
updated_at: 2016-06-24
tldr: Tutorial on how to use the Longitudinal Dataset
---

# ### Longitudinal Dataset Tutorial

# The longitudinal dataset is logically organized as a table where rows represent profiles and columns the various metrics (e.g. startup time). Each field of the table contains a list of values, one per Telemetry submission received for that profile.
#
# The dataset is going to be regenerated from scratch every week; this allows us to apply non-backward-compatible changes to the schema without worrying about merging procedures.
#
# The current version of the longitudinal dataset has been built from all main pings received from 1% of profiles across all channels after mid-November, which is shortly after Unified Telemetry landed. Future versions will store up to 180 days of data.
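
# As a rough illustration (toy data only, not the real schema), a single row pairs one profile with parallel per-submission arrays:

# In[ ]:

from pyspark.sql import Row

# Hypothetical two-submission profile; real rows carry many more columns.
toy = Row(client_id=u"...", build=[Row(version=u"45.0"), Row(version=u"44.0")])
print toy.build[0].version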

# In[1]:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import plotly.plotly as py

get_ipython().magic(u'pylab inline')

# In[2]:

sc.defaultParallelism

# The longitudinal dataset can be accessed as a Spark [DataFrame](https://spark.apache.org/docs/1.5.2/api/python/pyspark.sql.html#pyspark.sql.DataFrame), which is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python.

# In[4]:

frame = sqlContext.sql("SELECT client_id, profile_subsession_counter, build, settings FROM longitudinal")
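
# The schema can be inspected to see which of the selected columns are arrays of per-submission values (printSchema is a standard DataFrame method):

# In[ ]:

frame.printSchema()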

# Number of profiles:

# In[5]:

frame.count()

# The dataset contains all histograms but it doesn't yet include all metrics stored in the various sections of the pings. See the [code](https://github.com/vitillo/telemetry-batch-view/blob/longitudinal/src/main/scala/streams/Longitudinal.scala#L68) that generates the dataset for a complete list of supported metrics. More metrics are going to be included in future versions of the dataset; inclusion of specific metrics can be prioritized by filing a bug.

# ### Scalar metrics

# A Spark bug is slowing down the *first* and *take* methods on a dataframe. A way around that for now is to first convert the dataframe to an RDD and then invoke *first* or *take*, e.g.:

# In[6]:

first = frame.rdd.first()
# Leftover from an earlier version of this cell: an example of restricting the
# frame to the release channel and selecting specific columns.
#filter("normalized_channel = 'release'")\
#    .select("build",
#            "system",
#            "gc_ms",
#            "fxa_configured",
#            "browser_set_default_always_check",
#            "browser_set_default_dialog_prompt_rawcount")

# As mentioned earlier, each field of the dataframe is an array containing one value per submission per client. The submissions are chronologically sorted, newest first.

# In[9]:

first.build
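
# For example, the per-submission application versions of this profile can be read straight off the array:

# In[ ]:

print [b.version for b in first.build]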

# In[12]:

def get_version(v):
    # Parse the major (integer) component of a version string; return None
    # if it can't be parsed.
    try:
        return int(v.split(".")[0])
    except ValueError:
        return None

# In[13]:

print get_version("")
print get_version("12.3")
print get_version("12")
print get_version("abcd")

# In[26]:

count_total = sc.accumulator(0)
count_backwardsversion = sc.accumulator(0)
count_channelswitch = sc.accumulator(0)
count_backwardsversion_releaseonly = sc.accumulator(0)

def mapper(row):
    channel_switch = False
    release_only = True
    backwardsversion = False
    # sessions are sorted by subsessionStartDate and then
    # profileSubsessionCounter, newest-first.
    # 99 is a sentinel larger than any current major version, so the first
    # (newest) entry never registers as a reversion.
    last_version = 99
    last_channel = None
    for settings in row.settings:
        channel = settings.update.channel
        if channel != "release":
            release_only = False
        if last_channel is None:
            last_channel = channel
        elif last_channel != channel:
            channel_switch = True
    for build in row.build:
        version = get_version(build.version)
        if version is not None:
            # Walking from newest to oldest, an older submission with a
            # higher version means the profile reverted to an older build.
            if version > last_version:
                backwardsversion = True
            last_version = version
    count_total.add(1)
    if backwardsversion:
        count_backwardsversion.add(1)
        if release_only:
            count_backwardsversion_releaseonly.add(1)
    if channel_switch:
        count_channelswitch.add(1)

frame.rdd.foreach(mapper)

total = float(count_total.value)
print "users that switched channels at all: {:.2f}%".format(count_channelswitch.value / total * 100)
print "users that reverted to an older version: {:.2f}%".format(count_backwardsversion.value / total * 100)
print "users that reverted to an older version, staying on the release channel: {:.2f}%".format(count_backwardsversion_releaseonly.value / total * 100)