@georgf
Last active August 12, 2016 13:00
mobile repeated profile date
# coding: utf-8
# ### Bug 1291265 - Check for repeated client counts in new_records in Fennec dashboard data
# In[1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import plotly.plotly as py
get_ipython().magic(u'pylab inline')
# In[2]:
sc.defaultParallelism
# Load the mobile clients parquet file for performant analysis.
# In[3]:
dataset = sqlContext.read.load("s3n://net-mozaws-prod-us-west-2-pipeline-analysis/mobile/mobile_clients", "parquet")
dataset.count()
# In[4]:
dataset.rdd.first()
# ### Select pings sent on d0 (submission date equals profile date)
# In[6]:
d0 = (dataset.filter("channel = 'release'")
             .filter("os = 'Android'")
             .filter("submissiondate = profiledate"))
# In[10]:
d0.count()
# In[11]:
round(float(d0.count()) / dataset.count(), 3)
# In[25]:
d0.rdd.first()
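# The d0 selection and its share of all pings can be sketched on a tiny pandas frame.
# This is only an illustration under assumptions: the rows, client ids and date strings
# below are hypothetical toy values, not taken from the real mobile_clients dataset;
# only the column names mirror the filters above.

```python
import pandas as pd

# Hypothetical toy rows; the date format is an assumption.
toy = pd.DataFrame({
    "clientid":       ["a", "a", "b", "c"],
    "channel":        ["release", "release", "beta", "release"],
    "os":             ["Android", "Android", "Android", "Android"],
    "submissiondate": ["20160801", "20160802", "20160801", "20160803"],
    "profiledate":    ["20160801", "20160801", "20160801", "20160803"],
})

# Same three conditions as the Spark filters: release channel, Android,
# and a ping submitted on the day the profile was created (d0).
toy_d0 = toy[
    (toy["channel"] == "release")
    & (toy["os"] == "Android")
    & (toy["submissiondate"] == toy["profiledate"])
]

# Share of d0 pings among all pings, rounded as in the cell above.
toy_d0_share = round(float(len(toy_d0)) / len(toy), 3)
```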
# ### Check for repeated d0 per client
# First, count on how many different days each client submitted d0 pings.
# In[20]:
d0counts = (d0.groupBy(['clientid', 'submissiondate'])
              .count()
              .groupBy('clientid')
              .count())
# In[21]:
d0counts.rdd.take(3)
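# The two-step grouping can be easier to follow on a small pandas example.
# A minimal sketch with hypothetical toy data: the first grouping collapses several
# pings on the same day into one row, so the second grouping counts distinct days
# per client rather than pings per client.

```python
import pandas as pd

# Hypothetical d0 pings: client "a" pings twice on one day and once on another.
toy_d0 = pd.DataFrame({
    "clientid":       ["a", "a", "a", "b"],
    "submissiondate": ["20160801", "20160801", "20160802", "20160801"],
})

# Step 1: one row per (clientid, submissiondate) pair, with its ping count.
per_day = (toy_d0.groupby(["clientid", "submissiondate"])
                 .size()
                 .reset_index(name="count"))

# Step 2: rows per client in per_day = number of distinct days with d0 pings.
days_per_client = per_day.groupby("clientid").size()

# Clients that sent d0 pings on more than one day.
repeated = days_per_client[days_per_client > 1]
```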
# Now, how many of these submitted d0 pings on more than one day?
# In[23]:
d0counts.filter("count > 1").count()
# ### Check for repeated profile dates
# OK, that is really low, so repeated d0 pings do not seem to be a problem.
# Following up from here, how many clients actually submit more than one profiledate?
# Group each client's profiledate submissions together.
# In[30]:
repeatCounts = (dataset.filter("channel = 'release'")
                       .filter("os = 'Android'")
                       .groupBy(['clientid', 'profiledate'])
                       .count())
# In[31]:
repeatCounts.rdd.take(3)
# Now check how many of them submitted more than one profiledate value.
# In[32]:
(repeatCounts.groupBy('clientid')
             .count()
             .filter('count > 1')
             .count())
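# For comparison, the same question can be phrased as "distinct profiledate values per
# client". A hedged pandas sketch on hypothetical toy data, using nunique() in place of
# the nested groupBy/count; it is not the approach used in this notebook, only an
# equivalent formulation.

```python
import pandas as pd

# Hypothetical pings: only client "b" reports two different profile dates.
toy = pd.DataFrame({
    "clientid":    ["a", "a", "b", "b", "c"],
    "profiledate": ["20160801", "20160801", "20160801", "20160805", "20160803"],
})

# Number of distinct profiledate values seen for each client.
dates_per_client = toy.groupby("clientid")["profiledate"].nunique()

# Number of clients that submitted more than one profiledate value.
n_repeat = int((dates_per_client > 1).sum())
```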
# In[ ]: